System and method for normalization of a string of words
First Claim
1. In a computer system, a method for use in a predetermined categorization scheme, comprising:
normalizing a string of words utilizing a computer configured to perform the steps of:
receiving an input string of text;
tagging the string of text by annotating a string of words with labels marking the start and end of relevant portions of text;
comparing said tagged strings of text to a literal index, the literal index including a plurality of predetermined text sequences;
determining if the string of text matches at least one of the plurality of predetermined text sequences within the literal index;
if the string of words does not match at least one of the plurality of predetermined text sequences;
determining a baseform transform of the input string, said baseform transform derived by removing of noise words and stemming the remaining words using de-derivation and uninflection, said baseform transform including at least one baseform associated with the input string;
preparing a sorted version of the baseform transform;
comparing the at least one baseform to a baseform index, the baseform index including a plurality of predetermined baseform sequences;
determining a score for each of the plurality of predetermined baseform sequences that substantially match the at least one baseform and outputting feedback for any baseforms that exceed a predetermined threshold score;
if no baseforms exceed the predetermined threshold score;
computing a feature transformation of the input string, the feature transform including at least one feature associated with the input string;
comparing the at least one feature to a feature index, the feature index including a plurality of predetermined feature sequences;
determining a score for each of the plurality of predetermined feature sequences that substantially match the at least one feature; and
outputting a hit list of candidate sequence matches based on the input string, and if no feature sequences are found based on the input string, outputting an indication that no predetermined text sequences were found within the predetermined categorization scheme wherein the method is performed by a computer executing stored instructions.
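By way of a non-limiting illustration, the baseform transform recited above (noise-word removal, stemming by uninflection and de-derivation, and preparation of a sorted version) may be sketched as follows. The noise-word list, the suffix rules, the minimum stem length, and all function names are assumptions chosen for the example; the claim does not specify them, and a practical stemmer would be considerably more careful.

```python
# Toy sketch of a baseform transform. NOISE_WORDS, the suffix rules and
# MIN_STEM are hypothetical; the claim does not define them.
NOISE_WORDS = {"a", "an", "the", "of", "for", "and", "to", "in", "with"}

# Uninflection: strip inflectional endings (plural, tense, participle).
UNINFLECTION_RULES = [("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")]
# De-derivation: map a derived form back toward its root (very crude).
DE_DERIVATION_RULES = [("ation", "e")]
MIN_STEM = 3  # never shorten a word below this many characters


def _apply_first(word, rules):
    """Apply the first matching suffix rule, or return the word unchanged."""
    for suffix, replacement in rules:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
            return word[: len(word) - len(suffix)] + replacement
    return word


def stem(word):
    """Uninflect first, then de-derive."""
    return _apply_first(_apply_first(word, UNINFLECTION_RULES), DE_DERIVATION_RULES)


def baseform_transform(text):
    """Drop noise words and stem what remains, preserving word order."""
    words = [w for w in text.lower().split() if w not in NOISE_WORDS]
    return [stem(w) for w in words]


def sorted_baseform(text):
    """Sorted version of the transform, so word order no longer matters."""
    return tuple(sorted(baseform_transform(text)))
```

Under these assumed rules, `sorted_baseform("Repairing of the engines")` and `sorted_baseform("engine repairs")` both yield `("engine", "repair")`, which is what allows two differently worded strings to meet at the same baseform index entry.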
9 Assignments
0 Petitions
Abstract
The present invention relates generally to a system and method for categorization of strings of words. More specifically, the present invention relates to a system and method for normalizing a string of words for use in a system for categorization of words in a predetermined categorization scheme. A method for adaptive categorization of words in a predetermined categorization scheme may include receiving a string of text, tagging the string of text, and normalizing the string of text. Normalization may be performed with a three-stage algorithm including a literal match processing stage, an approximation match processing stage, and a nearest neighbor match processing stage. The normalized string of text can be compared to a number of sequences of text in the predetermined categorization scheme.
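The three-stage flow summarized in the abstract (literal match, then approximation match, then nearest neighbor match) may be pictured with the following minimal sketch. It is an illustration under stated assumptions, not the patented implementation: the crude `baseforms` helper, the use of character trigrams as "features", Jaccard overlap as the score, and both threshold values are hypothetical choices that the abstract does not fix.

```python
# Hypothetical three-stage cascade: literal -> baseform -> feature.
NOISE_WORDS = {"a", "an", "the", "of", "for"}  # assumed noise-word list


def baseforms(text):
    """Crude stand-in for a baseform transform: drop noise words, strip a
    trailing 's'. A set is used, so word order is ignored (this plays the
    role of the 'sorted version')."""
    words = [w for w in text.lower().split() if w not in NOISE_WORDS]
    return frozenset(w[:-1] if w.endswith("s") and len(w) > 3 else w for w in words)


def trigrams(text):
    """Assumed feature transform: padded character trigrams."""
    padded = f"  {text.lower()} "
    return frozenset(padded[i:i + 3] for i in range(len(padded) - 2))


def jaccard(a, b):
    """Overlap score in [0, 1]; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def _hits(query, sequences, transform, threshold):
    """Score every sequence and keep those strictly above the threshold."""
    scored = [(s, jaccard(query, transform(s))) for s in sequences]
    return sorted((p for p in scored if p[1] > threshold), key=lambda p: -p[1])


def normalize(text, sequences, baseform_threshold=0.8, feature_threshold=0.3):
    """Return (stage, hit_list); stage is 'none' when nothing is found."""
    # Stage 1: literal match against the predetermined sequences.
    literal_index = {s.lower(): s for s in sequences}
    hit = literal_index.get(text.lower().strip())
    if hit is not None:
        return "literal", [(hit, 1.0)]
    # Stage 2: approximation match on baseforms.
    hits = _hits(baseforms(text), sequences, baseforms, baseform_threshold)
    if hits:
        return "baseform", hits
    # Stage 3: nearest neighbor match on features.
    hits = _hits(trigrams(text), sequences, trigrams, feature_threshold)
    if hits:
        return "feature", hits
    return "none", []
```

With the sequences `["Engine repair", "Brake inspection"]`, the input `"repairs of the engines"` is resolved at the baseform stage, the misspelled `"brake inspectoin"` falls through to the feature stage, and `"zzzz"` produces the `"none"` indication. Each later stage is only reached when the earlier, cheaper stage finds nothing.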
115 Citations
2 Claims
1. In a computer system, a method for use in a predetermined categorization scheme, comprising:
normalizing a string of words utilizing a computer configured to perform the steps of:
receiving an input string of text;
tagging the string of text by annotating a string of words with labels marking the start and end of relevant portions of text;
comparing said tagged strings of text to a literal index, the literal index including a plurality of predetermined text sequences;
determining if the string of text matches at least one of the plurality of predetermined text sequences within the literal index;
if the string of words does not match at least one of the plurality of predetermined text sequences;
determining a baseform transform of the input string, said baseform transform derived by removing of noise words and stemming the remaining words using de-derivation and uninflection, said baseform transform including at least one baseform associated with the input string;
preparing a sorted version of the baseform transform;
comparing the at least one baseform to a baseform index, the baseform index including a plurality of predetermined baseform sequences;
determining a score for each of the plurality of predetermined baseform sequences that substantially match the at least one baseform and outputting feedback for any baseforms that exceed a predetermined threshold score;
if no baseforms exceed the predetermined threshold score;
computing a feature transformation of the input string, the feature transform including at least one feature associated with the input string;
comparing the at least one feature to a feature index, the feature index including a plurality of predetermined feature sequences;
determining a score for each of the plurality of predetermined feature sequences that substantially match the at least one feature; and
outputting a hit list of candidate sequence matches based on the input string, and if no feature sequences are found based on the input string, outputting an indication that no predetermined text sequences were found within the predetermined categorization scheme wherein the method is performed by a computer executing stored instructions.
2. An apparatus for normalizing a string of words for use in a predetermined categorization scheme, comprising:
a processor and a memory encoded with instructions, for execution by the processor, to receive an input string of text, to tag relevant portions of the input string by marking the beginning and the end of said relevant portions of the input string and by marking said relevant portions of the input string with semantic labels based on the predetermined categorization scheme, to compare the tagged portions of said input string to a literal index, where the literal index includes a plurality of predetermined text sequences, and to determine if the string of text matches at least one of the plurality of predetermined text sequences within the literal index;
wherein if the string of words does not match at least one of the plurality of predetermined text sequences;
determining a baseform transform of the input string, said baseform transform derived by removing of noise words and stemming the remaining words using de-derivation and uninflection, said baseform transform including at least one baseform associated with the input string;
preparing a sorted version of the baseform transform;
comparing the at least one baseform to a baseform index, the baseform index including a plurality of predetermined baseform sequences;
determining a score for each of the plurality of predetermined baseform sequences that substantially match the at least one baseform and outputting feedback for any baseforms that exceed a predetermined threshold score;
if no baseforms exceed the predetermined threshold score;
computing a feature transformation of the input string, the feature transform including at least one feature associated with the input string;
comparing the at least one feature to a feature index, the feature index including a plurality of predetermined feature sequences;
determining a score for each of the plurality of predetermined feature sequences that substantially match the at least one feature; and
outputting a hit list of candidate sequence matches based on the input string, and if no feature sequences are found based on the input string, outputting an indication that no predetermined text sequences were found within the predetermined categorization scheme.
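As one possible illustration of the tagging recited in claim 2, in which relevant portions are marked at their beginning and end and given semantic labels, consider the sketch below. The labels `PART` and `ACTION`, the regular-expression patterns, and the inline markup produced by `render` are all hypothetical; in practice the labels would be drawn from the predetermined categorization scheme, which the claim does not enumerate. The sketch also assumes that tagged portions do not overlap.

```python
import re
from typing import NamedTuple


class Tag(NamedTuple):
    start: int   # offset of the first character of the tagged portion
    end: int     # offset one past the last character
    label: str   # semantic label (hypothetical names in this sketch)
    text: str    # the tagged portion itself


# Hypothetical label patterns standing in for a categorization scheme.
LABEL_PATTERNS = {
    "PART": r"\b(?:engine|brake|gearbox)s?\b",
    "ACTION": r"\b(?:repair|inspect)(?:s|ed|ing)?\b",
}


def tag(text):
    """Return tags for every pattern match, ordered by start offset."""
    tags = []
    for label, pattern in LABEL_PATTERNS.items():
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            tags.append(Tag(m.start(), m.end(), label, m.group(0)))
    return sorted(tags)


def render(text, tags):
    """Rewrite the text with explicit start and end markers around each tag.
    Assumes the tags are sorted and non-overlapping."""
    out, pos = [], 0
    for t in tags:
        out.append(text[pos:t.start])
        out.append(f"<{t.label}>{t.text}</{t.label}>")
        pos = t.end
    out.append(text[pos:])
    return "".join(out)
```

For the input `"Please repair the engine"`, `tag` returns an `ACTION` tag spanning offsets 7 to 13 and a `PART` tag spanning offsets 18 to 24, and `render` produces `Please <ACTION>repair</ACTION> the <PART>engine</PART>`. Only the tagged portions would then be compared against the literal index.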
Specification