
System and method for tokenization of text using classifier models

  • US 7,937,263 B2
  • Filed: 12/01/2004
  • Issued: 05/03/2011
  • Est. Priority Date: 12/01/2004
  • Status: Expired due to Fees
First Claim

1. A non-transitory computer readable storage medium encoded with processor-readable code that, when executed, implements:

  • a featurizer that transforms input text into a plurality of token structures each comprising a token and attributes corresponding to said token, the token being a smallest meaningful unit of text;

    a comparator that receives said token structures from said featurizer and performs a comparison of each token structure of said plurality of token structures to at least one language model lexicon to determine if said at least one language model lexicon contains said token structure and, if so, adds said token structure to a candidate list associated with the token structure;

    a classifier that receives said plurality of token structures from said comparator, said classifier operatively configured to determine, for each token structure in the plurality of token structures, if a classifier model from a list of available classifier models is suitable for the token structure and, if so, to apply said token structure to said suitable classifier model to determine classification information and to store the determined classification information in the token structure; and

    a finalizer that receives said plurality of token structures from said classifier and candidate lists associated with said plurality of token structures, said finalizer being configured to, for each token structure in the plurality of token structures:

    determine if said token structure includes classification information;

    convert the text associated with the respective token of the token structure according to the classification information if the token structure includes classification information and output a converted token structure with the converted text to an output token list; and

    select a top candidate token structure from the candidate list associated with the token structure if the token structure does not include classification information and output said top candidate token structure to the output token list.
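
The four components recited in the claim (featurizer, comparator, classifier, finalizer) can be illustrated with a minimal Python sketch. This is not the patented implementation. Every name in it (TokenStructure, LEXICONS, DigitModel, the score attribute, the regular expression used to split text) is a hypothetical stand-in chosen for illustration, and the fallback for a token with neither classification information nor candidates is an assumption, since the claim does not address that case.

```python
import re
from dataclasses import dataclass, field, replace
from typing import Optional


@dataclass
class TokenStructure:
    token: str
    attributes: dict = field(default_factory=dict)
    classification: Optional[str] = None


# Hypothetical language model lexicons: word -> score (assumed for illustration).
LEXICONS = {
    "general": {"take": 0.9, "tablets": 0.8},
    "medical": {"tablets": 0.95},
}
DIGIT_WORDS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
               "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}


class DigitModel:
    """Hypothetical classifier model that handles single ASCII digits only."""
    def suitable(self, ts):
        return ts.token in DIGIT_WORDS

    def classify(self, ts):
        return "digit"


MODELS = [DigitModel()]
# How to convert a token's text for each kind of classification information.
CONVERTERS = {"digit": lambda tok: DIGIT_WORDS[tok]}


def featurize(text):
    """Featurizer: split text into tokens and attach simple attributes."""
    return [TokenStructure(tok, {"is_alpha": tok.isalpha()})
            for tok in re.findall(r"\w+|[^\w\s]", text)]


def compare(structures):
    """Comparator: build one candidate list per token structure."""
    candidate_lists = []
    for ts in structures:
        found = []
        for name, lexicon in LEXICONS.items():
            key = ts.token.lower()
            if key in lexicon:
                attrs = {**ts.attributes, "lexicon": name, "score": lexicon[key]}
                found.append(replace(ts, attributes=attrs))
        candidate_lists.append(found)
    return candidate_lists


def classify(structures):
    """Classifier: apply the first suitable model and store its result."""
    for ts in structures:
        for model in MODELS:
            if model.suitable(ts):
                ts.classification = model.classify(ts)
                break
    return structures


def finalize(structures, candidate_lists):
    """Finalizer: convert classified tokens, else pick the top candidate."""
    output = []
    for ts, candidates in zip(structures, candidate_lists):
        if ts.classification is not None:
            converted = CONVERTERS[ts.classification](ts.token)
            output.append(replace(ts, token=converted))
        elif candidates:
            output.append(max(candidates, key=lambda c: c.attributes["score"]))
        else:
            output.append(ts)  # assumption: pass the token through unchanged
    return output


def run_pipeline(text):
    structures = featurize(text)
    candidate_lists = compare(structures)
    return finalize(classify(structures), candidate_lists)


result = run_pipeline("take 2 tablets")
print([ts.token for ts in result])
```

In this sketch the token "2" has no lexicon entry but is picked up by the digit model, so the finalizer converts its text. The token "tablets" carries no classification information, so the finalizer selects its highest-scoring candidate, which here comes from the hypothetical "medical" lexicon.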
