METHOD FOR AUTOMATICALLY GENERATING REGULAR EXPRESSIONS FOR RELAXED MATCHING OF TEXT PATTERNS
First Claim
1. A computer-implemented method of automatically generating regular expressions for relaxed matching of text patterns, comprising:
loading, by a computing system, a predefined set of rules from a rule file in a repository coupled to said computing system, wherein each rule of said predefined set of rules is expressed in an Extensible Markup Language (XML) format;
receiving, by a computing system, an input phrase expressed in a natural language;
determining, by said computing system, that said input phrase is a plain text pattern, wherein said determining that said input phrase is said plain text pattern includes determining that said input phrase is not a regular expression;
automatically tokenizing, by said computing system, said plain text pattern, wherein said automatically tokenizing includes automatically generating a first token list;
automatically applying, by said computing system, one or more rules to said first token list, wherein said automatically applying includes applying said one or more rules in an order specified by said predefined set of rules, automatically modifying said first token list and automatically generating a modified token list in response to said automatically modifying said first token list, wherein said one or more rules are included in said predefined set of rules, wherein said automatically modifying said first token list includes applying a predefined modification operator to said first token list, wherein said predefined modification operator is included in a rule of said one or more rules, wherein said predefined modification operator is an operator selected from the group consisting of a replace word operator, a split-at-character operator, and a whitespace operator, wherein said automatically modifying said first token list further includes;
replacing a sequence of one or more tokens in said first token list with a replacement regular expression specified by said rule if said predefined modification operator is said replace word operator,
detecting a character specified by said rule and splitting a token of said first token list into two tokens in response to said detecting said character if said predefined modification operator is said split-at-character operator, and
replacing whitespace in said first token list with a replacement regular expression specified by said rule if said predefined modification operator is said whitespace operator; and
automatically converting, by said computing system, said modified token list into a regular expression, wherein said regular expression matches said plain text pattern and one or more variations of said plain text pattern.
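The tokenize, modify, and convert steps of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the `Token` class, the function names, and the example rules are all hypothetical. The replace word operator is simplified to single tokens, and the split-at-character operator is assumed to keep the detected character on the first of the two resulting tokens.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    text: str
    is_regex: bool = False  # True once a rule has replaced the token with a regex fragment


def tokenize(phrase):
    # First token list: alternating runs of non-whitespace and whitespace.
    return [Token(t) for t in re.findall(r"\s+|\S+", phrase)]


def split_at_character(tokens, char):
    # Split a literal token into two tokens at the first occurrence of `char`.
    # Assumption: the character stays attached to the first half.
    out = []
    for t in tokens:
        idx = -1 if t.is_regex else t.text.find(char)
        if 0 <= idx < len(t.text) - 1:
            out += [Token(t.text[: idx + 1]), Token(t.text[idx + 1 :])]
        else:
            out.append(t)
    return out


def replace_word(tokens, word, replacement):
    # Replace a matching literal token with a regex fragment (single-token simplification).
    return [
        Token(replacement, True) if not t.is_regex and t.text == word else t
        for t in tokens
    ]


def replace_whitespace(tokens, replacement):
    # Replace each whitespace token with a regex fragment.
    return [
        Token(replacement, True) if not t.is_regex and t.text.isspace() else t
        for t in tokens
    ]


def to_regex(tokens):
    # Literal tokens are escaped; regex fragments are emitted unchanged.
    return "".join(t.text if t.is_regex else re.escape(t.text) for t in tokens)


def generate_regex(phrase, rules):
    tokens = tokenize(phrase)
    for rule in rules:  # rules are applied in the order given
        tokens = rule(tokens)
    return to_regex(tokens)


# Hypothetical rule set, applied in list order.
rules = [
    lambda ts: split_at_character(ts, "-"),
    lambda ts: replace_word(ts, "mail", "(?:mail|message)"),
    lambda ts: replace_whitespace(ts, r"\s+"),
]
pattern = generate_regex("e-mail address", rules)
```

In this sketch the resulting pattern matches the original phrase as well as variations such as `e-message   address`, which is the "relaxed matching" the claim describes. Rule order matters here: the `mail` replacement only fires because the split rule ran first.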
Abstract
A method for automatically generating regular expressions for relaxed matching of text patterns. A received input phrase expressed in a natural language is determined to be a plain text pattern. The plain text pattern is automatically tokenized, thereby generating a first token list. Rules loaded from a predefined rule set are automatically applied to the first token list, in an order specified by the predefined rule set, to modify the first token list by applying a replace word, split-at-character, or whitespace operator. The modified token list is automatically converted into a regular expression that matches the plain text pattern and one or more variations of the plain text pattern. Using the regular expression for information extraction facilitates the recall and precision of the extraction.
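The abstract's first two stages, loading an ordered rule set and deciding that an input phrase is plain text rather than a regular expression, might look like the following Python sketch. The XML schema (element and attribute names) is invented for illustration, since the rule file format is not given here beyond being XML, and the plain-text check is a crude hypothetical heuristic, not the test the method actually uses.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical rule file content; the real schema is not specified in this text.
RULES_XML = r"""
<rules>
  <rule op="split-at-character" char="-"/>
  <rule op="replace-word" word="mail" replacement="(?:mail|message)"/>
  <rule op="whitespace" replacement="\s+"/>
</rules>
"""

# Characters that, in this sketch, are taken as a hint that the input
# is already a regular expression (an assumption, and easy to fool).
_REGEX_HINTS = re.compile(r"[\\\[\](){}|*+?^$]")


def load_rules(xml_text):
    # Return the rules as dicts, in document order, so that the order
    # of application is the order specified by the rule set.
    root = ET.fromstring(xml_text)
    return [dict(rule.attrib) for rule in root.iter("rule")]


def is_plain_text(phrase):
    # Treat the phrase as plain text when no regex hint character occurs.
    return _REGEX_HINTS.search(phrase) is None


rules = load_rules(RULES_XML)
```

Keeping the rules in document order is one simple way to honor "an order specified by the predefined rule set"; an explicit priority attribute would be another.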
2 Claims
2-20. (canceled)
Specification