System and method for tokenization of text using classifier models

US 7,937,263 B2
Filed: 12/01/2004
Issued: 05/03/2011
Est. Priority Date: 12/01/2004
Status: Expired due to Fees

Abstract

The present invention pertains to a system and method for the tokenization of text. The featurizer may be configured to receive input text and convert the input text into tokens. According to one aspect of the invention, the tokens may include only one type of character, the characters selected from the group consisting of letters, numbers, and punctuation. The tokenizer may also include a classifier. The classifier may be configured to receive the tokens from the featurizer. Furthermore, the classifier may be configured to analyze the tokens received from the featurizer to determine if the tokens may be input into a predetermined classification model using a preclassifier. If one of the tokens passes the preclassifier, then the token is classified using the predetermined classification model. Additionally, according to a first aspect of the invention, the tokenizer may also include a finalizer. The finalizer may be configured to receive the tokens and may be configured to produce a final output.
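
The featurizer step described in the abstract can be illustrated with a minimal sketch. It assumes a regular-expression split in which each token holds a run of letters, a run of digits, or a single punctuation mark. The names `TokenStructure`, `featurize`, and `char_type`, and the attribute keys `type` and `start`, are hypothetical and do not come from the patent.

```python
import re
from dataclasses import dataclass, field

# Assumed pattern: a run of letters, a run of digits, or one punctuation mark.
_TOKEN_RE = re.compile(r"[^\W\d_]+|\d+|[^\w\s]|_")


@dataclass
class TokenStructure:
    """A token plus attributes describing it (hypothetical layout)."""
    token: str
    attributes: dict = field(default_factory=dict)


def char_type(token: str) -> str:
    """Label a token by the single character type it contains."""
    if token.isalpha():
        return "letters"
    if token.isdigit():
        return "numbers"
    return "punctuation"


def featurize(text: str) -> list[TokenStructure]:
    """Split text into single-character-type tokens and attach attributes."""
    return [
        TokenStructure(m.group(), {"type": char_type(m.group()), "start": m.start()})
        for m in _TOKEN_RE.finditer(text)
    ]
```

Under these assumptions, `featurize("5mg b.i.d.")` yields the tokens `5`, `mg`, `b`, `.`, `i`, `.`, `d`, `.`, each tagged with its character type.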

14 Claims

1. A non-transitory computer readable storage medium encoded with processor-readable code that, when executed, implements:
- a featurizer that transforms input text into a plurality of token structures each comprising a token and attributes corresponding to said token, the token being a smallest meaningful unit of text;
  
  a comparator that receives said token structures from said featurizer and performs a comparison of each token structure of said plurality of token structures to at least one language model lexicon to determine if said at least one language model lexicon contains said token structure and, if so, adds said token structure to a candidate list associated with the token structure;
  
  a classifier that receives said plurality of token structures from said comparator, said classifier operatively configured to determine, for each token structure in the plurality of token structures, if a classifier model from a list of available classifier models is suitable for the token structure and, if so, to apply said token structure to said suitable classifier model to determine classification information and to store the determined classification information in the token structure; and
  
  a finalizer that receives said plurality of token structures from said classifier and candidate lists associated with said plurality of token structures, said finalizer being configured to, for each token structure in the plurality of token structures;
  
  determine if said token structure includes classification information;
  
  convert the text associated with the respective token of the token structure according to the classification information if the token structure includes classification information and output a converted token structure with the converted text to an output token list; and
  
  select a top candidate token structure from the candidate list associated with the token structure if the token structure does not include classification information and output said top candidate token structure to the output token list.
- Dependent claims (2, 3, 4, 5):
  - 2. The computer readable storage medium according to claim 1, wherein at least one token comprises only one type of character selected from the group consisting of letters, numbers, and punctuation.
  - 3. The computer readable storage medium according to claim 1, wherein said classifier determines said suitable classifier model by sequentially applying said token structure to a series of prefilters, each said prefilter corresponding to one of said classifier models available from said list of classifier models.
  - 4. The computer readable medium of claim 1, wherein the comparator further determines a plurality of token variants of at least one token structure of the plurality of token structures, the plurality of token variants describing a respective plurality of variations of formats of text associated with the respective at least one token structure.
  - 5. The computer readable medium according to claim 4, wherein the comparator further performs a comparison of each of the plurality of token variants of the at least one token structure to the at least one language model lexicon to determine if the at least one language model lexicon contains said token variants of said token structure, and when a token variant of the plurality of token variants is found in the language model lexicon, the comparator further adds said token variant to the candidate list associated with the token structure.
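
The comparator behavior recited in claims 1, 4, and 5 can be sketched roughly as follows. The sketch assumes the lexicon is a plain set of strings and that the "variations of formats" are simple case variants. Both are illustrative assumptions, as are the names `variants` and `compare`. For brevity, candidates are stored as strings rather than as full token structures.

```python
from dataclasses import dataclass, field


@dataclass
class TokenStructure:
    """A token, its attributes, and its candidate list (hypothetical layout)."""
    token: str
    attributes: dict = field(default_factory=dict)
    candidates: list = field(default_factory=list)


def variants(token: str) -> list[str]:
    """Assumed format variants: as written, lowercase, uppercase, capitalized."""
    seen, out = set(), []
    for v in (token, token.lower(), token.upper(), token.capitalize()):
        if v not in seen:
            seen.add(v)
            out.append(v)
    return out


def compare(structures: list[TokenStructure], lexicon: set[str]) -> list[TokenStructure]:
    """Add every variant found in the lexicon to the token's candidate list."""
    for ts in structures:
        for v in variants(ts.token):
            if v in lexicon:
                ts.candidates.append(v)
    return structures
```

With a lexicon of `{"aspirin", "MG"}`, the token `Aspirin` receives the candidate `aspirin`, `mg` receives `MG`, and a token with no matching variant keeps an empty candidate list.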

6. A method for tokenizing text, the method comprising:
- receiving input text by a featurizer that transforms said input text into a plurality of token structures each comprising a token and attributes corresponding to said token, the token being a smallest meaningful unit of text;
  
  providing said plurality of token structures to a comparator that performs a comparison of each token structure of said plurality of token structures to at least one language model lexicon to determine if said at least one language model lexicon contains said token structure and, if so, adds said token structure to a candidate list associated with the token structure;
  
  providing the plurality of token structures to a classifier that determines, for each token structure of the plurality of token structures, if a classifier model from a list of available classifier models is suitable for the token structure and, if so, applying said token structure to said suitable classifier model to determine classification information and storing the determined classification information in the token structure; and
  
  providing the plurality of token structures to a finalizer that, for each token structure in the plurality of token structures;
  
  determines if said token structure includes classification information;
  
  converts the text associated with the respective token of the token structure according to the classification information if the token structure includes classification information and outputs a converted token structure with the converted text to an output token list; and
  
  selects a top candidate token structure from the candidate list associated with the token structure if the token structure does not include classification information and outputs said top candidate token structure to the output token list.
- Dependent claims (7, 8, 9, 10, 11):
  - 7. The method according to claim 6, where the tokens include only one type of character selected from the group consisting of letters, numbers, and punctuation.
  - 8. The method according to claim 6, wherein said classifier determines said suitable classifier model by sequentially applying said token structure to a series of prefilters, each said prefilter corresponding to one of said classifier models available from said list of classifier models.
  - 9. The method according to claim 6, where the token comprises a set of characters.
  - 10. The method of claim 6, wherein the method further comprises, by the comparator, determining a plurality of token variants of at least one token structure of the plurality of token structures, the plurality of token variants describing a respective plurality of variations of formats of text associated with the respective at least one token structure.
  - 11. The method according to claim 10, wherein the method further comprises, by the comparator, performing a comparison of each of the plurality of token variants of the at least one token structure to the at least one language model lexicon to determine if the at least one language model lexicon contains said token variants of said token structure, and when a token variant of the plurality of token variants is found in the language model lexicon, adding said token variant to the candidate list associated with the token structure.
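
The classifier and finalizer steps of claims 6 and 8 can be sketched as below. This is a toy illustration, not the patented implementation: the single "number-word" model, its prefilter, the `cardinal` label, and the converter table are all invented for the example. The fallback for a token with neither classification nor candidates is also an assumption, since the claims do not address that case.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class TokenStructure:
    token: str
    candidates: list = field(default_factory=list)
    classification: Optional[str] = None


@dataclass
class ClassifierModel:
    name: str
    prefilter: Callable[[TokenStructure], bool]  # is this model suitable?
    classify: Callable[[TokenStructure], str]    # returns classification info


# Hypothetical model: recognizes a few spelled-out numbers.
NUMBER_WORDS = {"one": "1", "two": "2", "three": "3"}
MODELS = [
    ClassifierModel(
        name="number-word",
        prefilter=lambda ts: ts.token.lower() in NUMBER_WORDS,
        classify=lambda ts: "cardinal",
    ),
]
# Hypothetical conversions keyed by classification label.
CONVERTERS = {"cardinal": lambda text: NUMBER_WORDS[text.lower()]}


def classify(structures, models=MODELS):
    """Try each model's prefilter in order; store the first match's result."""
    for ts in structures:
        for model in models:
            if model.prefilter(ts):
                ts.classification = model.classify(ts)
                break
    return structures


def finalize(structures, converters=CONVERTERS):
    """Convert classified tokens; otherwise take the top candidate."""
    output = []
    for ts in structures:
        if ts.classification is not None:
            output.append(converters[ts.classification](ts.token))
        elif ts.candidates:
            output.append(ts.candidates[0])
        else:
            output.append(ts.token)  # assumed fallback, not stated in the claims
    return output
```

For example, the structures for `take` (candidate `take`), `Two` (no candidates), and `Tablets` (candidate `tablets`) finalize to `take`, `2`, `tablets`.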

12. A non-transitory computer readable storage medium encoded with computer executable instructions coded to:
- receive input text;
  
  transform the input text into a plurality of token structures each comprising a token and attributes corresponding to said token, the token being a smallest meaningful unit of text;
  
  perform a comparison of each token structure of said plurality of token structures to at least one language model lexicon to determine if said at least one language model lexicon contains said token structure and, if so, add said token structure to a candidate list associated with the token structure;
  
  determine, for each token structure of the plurality of token structures, if a classifier model from a list of classifier models is suitable for the token structure and, if so, apply said token structure to said suitable classifier model to determine classification information and store the determined classification information in the token structure;
  
  provide the plurality of token structures to a finalizer that, for each token structure in the plurality of token structures;
  
  determines if said token structure comprises classification information;
  
  converts the text associated with the respective token of the token structure according to the classification information if the token structure includes classification information and outputs a converted token structure with the converted text to an output token list; and
  
  selects a top candidate token structure from the candidate list associated with the token structure if the token structure does not include classification information and outputs said top candidate token structure to the output token list.
- Dependent claims (13, 14):
  - 13. The computer readable storage medium of claim 12, wherein the computer readable medium is further encoded with computer executable instructions coded to determine a plurality of token variants of at least one token structure of the plurality of token structures, the plurality of token variants describing a respective plurality of variations of formats of text associated with the respective at least one token structure.
  - 14. The computer readable medium according to claim 13, wherein the computer readable medium is further encoded with computer executable instructions coded to perform a comparison of token variants of the at least one token structure to the at least one language model lexicon to determine if the at least one language model lexicon contains said token variants of said token structure, and when a token variant of the plurality of token variants is found in the language model lexicon, add said token variant to the candidate list associated with the token structure.

Current Assignee: Nuance Communications, Inc. (Microsoft Corporation)
Original Assignee: Dictaphone Corporation (Microsoft Corporation)
Inventors: Santisteban, Ana; Dowd, John; Del La Femina, Kathryn; Uhrbach, Amy J.; Lapshina, Larissa; Rechea, Bernardo; Frankel, Alan; Han, Wensheng (Vincent); Cote, William F.; Carus, Alwin B.; Carrier, Jill
Primary Examiner(s): Dorvil, Richemond
Assistant Examiner(s): Godbold, Douglas
Application Number: US11/001,654
Publication Number: US 20060116862A1
Time in Patent Office: 2,344 Days
Field of Search: 704/1, 704/9, 704/260, 704/258
US Class Current: 704/9
CPC Class Codes: G06F 40/284 Lexical analysis, e.g. toke...
