System and method for tokenization of text using classifier models
Abstract
The present invention pertains to a system and method for the tokenization of text. A tokenizer may include a featurizer configured to receive input text and convert the input text into tokens. According to one aspect of the invention, each token may include only one type of character, the type selected from the group consisting of letters, numbers, and punctuation. The tokenizer may also include a classifier configured to receive the tokens from the featurizer and to use a preclassifier to determine whether each token may be input into a predetermined classification model. If a token passes the preclassifier, the token is classified using the predetermined classification model. According to one aspect of the invention, the tokenizer may also include a finalizer configured to receive the tokens and to produce a final output.
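As an illustration of the single-character-type tokens mentioned in the abstract, the following is a minimal sketch only, not the patented implementation. The function name `split_by_character_type`, the restriction to ASCII letters and digits, and the choice to emit each punctuation mark as its own token are all assumptions made for this example.

```python
import re

# Assumption: ASCII letters and digits only; every other non-space
# character is treated as punctuation and emitted one mark at a time.
TOKEN_RE = re.compile(
    r"(?P<letters>[A-Za-z]+)|(?P<numbers>[0-9]+)|(?P<punctuation>[^A-Za-z0-9\s])"
)


def split_by_character_type(text):
    """Split text into tokens that each contain one type of character."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        # lastgroup is the name of the alternative that matched.
        tokens.append({"token": match.group(), "type": match.lastgroup})
    return tokens


if __name__ == "__main__":
    for item in split_by_character_type("Call 555-1234 now!"):
        print(item["type"], repr(item["token"]))
```

Under these assumptions, a string such as "555-1234" yields three tokens (numbers, punctuation, numbers) rather than one mixed token.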
14 Claims
1. A non-transitory computer readable storage medium encoded with processor-readable code that, when executed, implements:
a featurizer that transforms input text into a plurality of token structures each comprising a token and attributes corresponding to said token, the token being a smallest meaningful unit of text;
a comparator that receives said token structures from said featurizer and performs a comparison of each token structure of said plurality of token structures to at least one language model lexicon to determine if said at least one language model lexicon contains said token structure and, if so, adds said token structure to a candidate list associated with the token structure;
a classifier that receives said plurality of token structures from said comparator, said classifier operatively configured to determine, for each token structure in the plurality of token structures, if a classifier model from a list of available classifier models is suitable for the token structure and, if so, to apply said token structure to said suitable classifier model to determine classification information and to store the determined classification information in the token structure; and
a finalizer that receives said plurality of token structures from said classifier and candidate lists associated with said plurality of token structures, said finalizer being configured to, for each token structure in the plurality of token structures;
determine if said token structure includes classification information;
convert the text associated with the respective token of the token structure according to the classification information if the token structure includes classification information and output a converted token structure with the converted text to an output token list; and
select a top candidate token structure from the candidate list associated with the token structure if the token structure does not include classification information and output said top candidate token structure to the output token list.
Dependent claims: 2, 3, 4, 5
6. A method for tokenizing text, the method comprising:
receiving input text by a featurizer that transforms said input text into a plurality of token structures each comprising a token and attributes corresponding to said token, the token being a smallest meaningful unit of text;
providing said plurality of token structures to a comparator that performs a comparison of each token structure of said plurality of token structures to at least one language model lexicon to determine if said at least one language model lexicon contains said token structure and, if so, adds said token structure to a candidate list associated with the token structure;
providing the plurality of token structures to a classifier that determines, for each token structure of the plurality of token structures, if a classifier model from a list of available classifier models is suitable for the token structure and, if so, applying said token structure to said suitable classifier model to determine classification information and storing the determined classification information in the token structure; and
providing the plurality of token structures to a finalizer that, for each token structure in the plurality of token structures;
determines if said token structure includes classification information;
converts the text associated with the respective token of the token structure according to the classification information if the token structure includes classification information and outputs a converted token structure with the converted text to an output token list; and
selects a top candidate token structure from the candidate list associated with the token structure if the token structure does not include classification information and outputs said top candidate token structure to the output token list.
Dependent claims: 7, 8, 9, 10, 11
12. A non-transitory computer readable storage medium encoded with computer executable instructions coded to:
receive input text;
transform the input text into a plurality of token structures each comprising a token and attributes corresponding to said token, the token being a smallest meaningful unit of text;
perform a comparison of each token structure of said plurality of token structures to at least one language model lexicon to determine if said at least one language model lexicon contains said token structure and, if so, add said token structure to a candidate list associated with the token structure;
determine, for each token structure of the plurality of token structures, if a classifier model from a list of classifier models is suitable for the token structure and, if so, apply said token structure to said suitable classifier model to determine classification information and store the determined classification information in the token structure;
provide the plurality of token structures to a finalizer that, for each token structure in the plurality of token structures;
determines if said token structure comprises classification information;
converts the text associated with the respective token of the token structure according to the classification information if the token structure includes classification information and outputs a converted token structure with the converted text to an output token list; and
selects a top candidate token structure from the candidate list associated with the token structure if the token structure does not include classification information and outputs said top candidate token structure to the output token list.
Dependent claims: 13, 14
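The featurizer, comparator, classifier, and finalizer recited in the independent claims can be read as a four-stage pipeline. The sketch below is one hypothetical reading, not the implementation disclosed in the specification. All names (`TokenStructure`, `ClassifierModel`, `compare`, `classify`, `finalize`, `tokenize`) are invented for illustration. The lexicon is assumed to map tokens to scores so that a "top" candidate can be chosen, the whitespace featurizer is a placeholder, and passing a token through unchanged when it has neither classification information nor candidates is an assumption the claims do not address.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class TokenStructure:
    token: str
    attributes: dict = field(default_factory=dict)
    classification: Optional[str] = None  # filled in by the classifier


@dataclass
class ClassifierModel:
    is_suitable: Callable[[TokenStructure], bool]  # suitability check
    classify: Callable[[TokenStructure], str]      # returns classification info


def featurize(text):
    # Placeholder featurizer: whitespace split plus one attribute.
    return [TokenStructure(t, {"is_digits": t.isascii() and t.isdigit()})
            for t in text.split()]


def compare(structures, lexicons):
    # One candidate list per token structure, kept parallel to the input.
    candidate_lists = []
    for ts in structures:
        candidates = [(lex[ts.token], ts) for lex in lexicons if ts.token in lex]
        candidate_lists.append(candidates)
    return candidate_lists


def classify(structures, models):
    for ts in structures:
        for model in models:
            if model.is_suitable(ts):
                ts.classification = model.classify(ts)
                break  # assumption: first suitable model wins
    return structures


def finalize(structures, candidate_lists, converters):
    output = []
    for ts, candidates in zip(structures, candidate_lists):
        if ts.classification is not None:
            converted = converters[ts.classification](ts.token)
            output.append(TokenStructure(converted, ts.attributes, ts.classification))
        elif candidates:
            _, top = max(candidates, key=lambda pair: pair[0])
            output.append(top)
        else:
            output.append(ts)  # assumption: not covered by the claims
    return output


def tokenize(text, lexicons, models, converters):
    structures = featurize(text)
    candidate_lists = compare(structures, lexicons)
    classify(structures, models)
    return finalize(structures, candidate_lists, converters)


DIGIT_WORDS = dict(zip("0123456789", ["zero", "one", "two", "three", "four",
                                      "five", "six", "seven", "eight", "nine"]))
LEXICON = {"call": 0.9, "me": 0.8, "at": 0.7}  # hypothetical scores
MODELS = [ClassifierModel(lambda ts: ts.attributes["is_digits"], lambda ts: "digits")]
CONVERTERS = {"digits": lambda s: " ".join(DIGIT_WORDS[c] for c in s)}

if __name__ == "__main__":
    result = tokenize("call me at 42", [LEXICON], MODELS, CONVERTERS)
    print([ts.token for ts in result])
```

In this sketch the candidate lists are kept separate from the token structures, following the claim language that the finalizer receives both the token structures and the candidate lists associated with them.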
Specification