Method and apparatus for improved tokenization of natural language text

US 5,890,103 A
Filed: 07/19/1996
Issued: 03/30/1999
Est. Priority Date: 07/19/1995
Status: Expired due to Fees

7 Assignments

0 Petitions

Abstract

This invention improves information retrieval by providing a tokenizing apparatus and method that parses natural language text in a manner that increases the throughput of an information retrieval or natural language analysis system. The tokenizer includes a parser that extracts characters from the stream of text, an identifying element for identifying a token formed of characters in the stream of text that include lexical matter, and a filter for assigning tags to those tokens requiring further linguistic analysis. The tokenizer, in a single pass through the stream of text, determines the further linguistic processing suitable to each particular token contained in the stream of text.
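
The single-pass idea in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the `Token` class, the `is_lexical` rule, and the tag names (`NUMBER`, `INTERNAL_PUNCTUATION`, `POSSIBLE_SENTENCE_END`) are assumptions made only for this example. The loop extracts characters, closes a token whenever a non-lexical character is met, and assigns tags in that same pass.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    text: str
    start: int
    tags: set = field(default_factory=set)


def is_lexical(ch: str) -> bool:
    # Assumption for this sketch: letters, digits, apostrophes and hyphens
    # count as lexical characters; everything else is non-lexical.
    return ch.isalnum() or ch in "'-"


def tokenize(stream: str) -> list:
    """Parse, identify and tag tokens in one left-to-right scan."""
    tokens = []
    start = None
    # A sentinel space closes a token that runs to the end of the stream.
    for i, ch in enumerate(stream + " "):
        if is_lexical(ch):
            if start is None:
                start = i
            continue
        if start is not None:
            tok = Token(stream[start:i], start)
            # Filter step: decide which further processing the token needs.
            if any(c.isdigit() for c in tok.text):
                tok.tags.add("NUMBER")
            if "'" in tok.text or "-" in tok.text:
                tok.tags.add("INTERNAL_PUNCTUATION")
            if ch == ".":
                tok.tags.add("POSSIBLE_SENTENCE_END")
            tokens.append(tok)
            start = None
    return tokens


def candidates(tokens: list) -> list:
    # Only tagged tokens are handed on for further linguistic analysis.
    return [t for t in tokens if t.tags]
```

Untagged tokens skip the later analysis stages entirely, which is one plausible reading of how tagging during tokenization could raise throughput.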

224 Citations

50 Claims

1. A computerized tokenizer for identifying a token formed of a string of lexical characters found in a stream of digitized natural language text, the computerized tokenizer comprising:
- parsing means for extracting lexical and non-lexical characters from the stream of digitized text,
- identifying means coupled with said parsing means for identifying a set of tokens, each token being formed of a string of parsed lexical characters bounded by non-lexical characters, and
- filtering means coupled with said identifying means for selecting a candidate token from said set of tokens, said candidate token being suitable for linguistic processing beyond the identification of tokens.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24)
  - 2. A tokenizer according to claim 1, wherein said filtering means further comprises an associative processing element for associating with said candidate token a tag identifying additional linguistic processing for said candidate token.
  - 3. A tokenizer according to claim 2, wherein said associative processing element further includes a group processing element for associating with a plurality of tokens, as a function of said candidate token, a plurality of tags identifying additional linguistic processing for said plurality of tokens.
  - 4. A tokenizer according to claim 2, further comprising a modifying processor for modifying said candidate token as a function of said tag associated with said candidate token.
  - 5. A tokenizer according to claim 4, wherein said modifying processor includes splitting means for splitting said candidate token into multiple tokens.
  - 6. A tokenizer according to claim 4, wherein said modifying processor includes stripping means for stripping a character from said candidate token.
  - 7. A tokenizer according to claim 4, wherein said modifying processor includes ignoring means for ignoring a non-lexical character surrounding said candidate token.
  - 8. A tokenizer according to claim 4, wherein said modifying processor includes merging means for merging said candidate token with another token in the stream of text.
  - 9. A tokenizer according to claim 1, wherein said filtering means selects said candidate token from said set of tokens during a single scan of the parsed stream of text.
  - 10. A tokenizer according to claim 1, wherein said filtering means further comprises a character analyzer for selecting said candidate token from said set of tokens, said character analyzer including
    - comparing means for comparing a selected character in the parsed stream of text with entries in a character table, and
    - associating means for associating a first tag with a first token located proximal to said selected character, when said selected character has an equivalent entry in the character table.
  - 11. A tokenizer according to claim 10, comprising lexical processing means for comparing a selected lexical character with entries in the character table and for associating said first tag with a token including said selected lexical character, when said selected lexical character has an equivalent entry in the character table.
  - 12. A tokenizer according to claim 10, comprising non-lexical processing means for comparing a selected non-lexical character with entries in the character table and for associating said first tag with a token preceding said selected non-lexical character, when said selected non-lexical character has an equivalent entry in the character table.
  - 13. A tokenizer according to claim 10, wherein said character table includes entries representative of a plurality of languages such that said tokenizer operates in the plurality of languages.
  - 14. A tokenizer according to claim 13, wherein the plurality of languages is selected from the group consisting of English, French, Catalan, Spanish, Italian, Portuguese, German, Danish, Norwegian, Swedish, Dutch, Finnish, Russian, and Czech.
  - 15. A tokenizer according to claim 1, wherein said filtering means further comprises a contextual processor for selecting said candidate token from said set of tokens as a function of a contextual analysis of the lexical and non-lexical characters surrounding a selected character in the parsed stream of text.
  - 16. A tokenizer according to claim 15, wherein said contextual processor includes a set of rules applicable in a plurality of languages such that said tokenizer operates in the plurality of languages.
  - 17. A tokenizer according to claim 16, wherein the plurality of languages is selected from the group consisting of English, French, Catalan, Spanish, Italian, Portuguese, German, Danish, Norwegian, Swedish, Dutch, Finnish, Russian, and Czech.
  - 18. A tokenizer according to claim 1, further comprising a memory element for storing and retrieving the digitized stream of natural language text and for storing and retrieving a data structure that includes parameters for each token.
  - 19. A tokenizer according to claim 18, wherein said parameters include an input stream flag identifying the location of a digitized stream of natural language text in said memory element.
  - 20. A tokenizer according to claim 18, wherein said parameters include a flag identifying the number of lexical characters and non-lexical characters forming a token.
  - 21. A tokenizer according to claim 18, wherein said parameters include an output flag identifying the location of an output signal generated by said tokenizer.
  - 22. A tokenizer according to claim 18, wherein said parameters include a flag identifying the number of lexical characters forming a token.
  - 23. A tokenizer according to claim 18, wherein said parameters include the lexical and non-lexical attributes of a token.
  - 24. A tokenizer according to claim 23, wherein said non-lexical attributes are selected from the group consisting of:
    - contains white space, single new line, and multiple new line.
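
The character-table lookup of claims 10 to 12 can be sketched as follows. This is an illustration under stated assumptions, not the patent's table: the entries in `CHARACTER_TABLE`, the tag names, and the `is_lexical` rule are all hypothetical. A lexical character found in the table tags the token that contains it; a non-lexical character found in the table tags the most recent preceding token.

```python
# Hypothetical table: maps a character to the tag it triggers.
CHARACTER_TABLE = {
    "'": "APOSTROPHE",  # lexical in this sketch
    "-": "HYPHEN",      # lexical in this sketch
    ".": "PERIOD",      # non-lexical
    "\n": "NEWLINE",    # non-lexical
}


def is_lexical(ch: str) -> bool:
    # Assumption: letters, digits, apostrophes and hyphens are lexical.
    return ch.isalnum() or ch in "'-"


def tag_tokens(stream: str) -> list:
    """Return a list of {"text": str, "tags": set} records."""
    tokens = []
    current = None  # token currently being built, if any
    for ch in stream:
        tag = CHARACTER_TABLE.get(ch)
        if is_lexical(ch):
            if current is None:
                current = {"text": "", "tags": set()}
                tokens.append(current)
            current["text"] += ch
            if tag:
                # Lexical character: tag the token that includes it.
                current["tags"].add(tag)
        else:
            if tag and tokens:
                # Non-lexical character: tag the preceding token.
                tokens[-1]["tags"].add(tag)
            current = None
    return tokens
```

Because the table is plain data, supporting another language in this sketch would mean adding entries rather than changing the scanning loop, which is in the spirit of claim 13.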

25. A computerized data processing method for identifying a token formed of a string of lexical characters found in a stream of digitized natural language text, said method comprising the steps of:
- extracting lexical and non-lexical characters from the stream of text,
- identifying a set of tokens, each token being formed of a string of extracted lexical characters bounded by extracted non-lexical characters, and
- using a filter to select a candidate token from said set of tokens, said candidate token being suitable for linguistic processing beyond the identification of tokens.
- Dependent claims (26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44)
  - 26. A computerized data processing method according to claim 25, wherein said extracting step further comprises the step of ignoring stoplist characters in the stream of text.
  - 27. A computerized data processing method according to claim 25, wherein said identifying step further comprises the step of identifying a beginning of a first token as a function of the attributes associated with the first token.
  - 28. A computerized data processing method according to claim 25, wherein said identifying step further comprises the step of identifying an end of a first token as a function of a pattern formed by a plurality of characters in the stream of digitized text.
  - 29. A computerized data processing method according to claim 25, wherein said candidate token is selected from said set of tokens during a single scan of the parsed stream of text.
  - 30. A computerized data processing method according to claim 25, wherein said selecting step further comprises the steps of
    - comparing a selected character in the parsed stream of text with entries in a character table, and
    - associating a first tag with a first token located proximal to said selected character, when said selected character has an equivalent entry in the character table.
  - 31. A computerized data processing method according to claim 30, further comprising the steps of
    - comparing a selected lexical character with entries in the character table, and
    - associating said first tag with a token including said selected lexical character, when said selected lexical character has an equivalent entry in the character table.
  - 32. A computerized data processing method according to claim 30, further comprising the steps of
    - comparing a selected non-lexical character with entries in the character table, and
    - associating said first tag with a token preceding said selected non-lexical character, when said selected non-lexical character has an equivalent entry in the character table.
  - 33. A computerized data processing method according to claim 30, further comprising the step of forming a character table having entries representative of a plurality of languages.
  - 34. A computerized data processing method according to claim 25, further comprising the steps of selecting said candidate token from said set of tokens as a function of a contextual analysis of the lexical and non-lexical characters surrounding a selected character in the parsed stream of text.
  - 35. A computerized data processing method according to claim 25, further comprising the step of associating with said candidate token a tag identifying additional linguistic processing for said candidate token.
  - 36. A computerized data processing method according to claim 35, further comprising the step of associating with a plurality of tokens, as a function of said candidate token, a plurality of tags identifying additional linguistic processing for said plurality of tokens.
  - 37. A computerized data processing method according to claim 35, further comprising the step of modifying said candidate token as a function of said tag associated with said candidate token.
  - 38. A computerized data processing method according to claim 37, further comprising the step of splitting said candidate token into multiple tokens.
  - 39. A computerized data processing method according to claim 37, further comprising the step of stripping a character from said candidate token.
  - 40. A computerized data processing method according to claim 37, further comprising the step of ignoring a non-lexical character surrounding said candidate token.
  - 41. A computerized data processing method according to claim 37, further comprising the step of merging said candidate token with another token in the stream of text.
  - 42. A computerized data processing method according to claim 35, further comprising the steps of
    - storing in a first location of a memory element attributes of said candidate token, said attributes identifying the additional linguistic processing suitable for said candidate token, and
    - causing the tag to point to the first location.
  - 43. A computerized data processing method according to claim 42, further comprising the step of storing in the first location attributes selected from the group consisting of lexical attributes and non-lexical attributes.
  - 44. A computerized data processing method according to claim 43, further comprising the step of selecting the non-lexical attributes from the group consisting of:
    - contains white space, single new line, and multiple new line.
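
Claims 37 to 41 describe modifying a candidate token as a function of its tag. A minimal sketch of three of those operations (splitting, stripping, merging) is given below. The tag names `SPLIT_ON_SLASH`, `STRIP_TRAILING_PERIOD` and `MERGE_WITH_PREVIOUS` are invented for illustration and do not come from the patent.

```python
def modify(tagged_tokens: list) -> list:
    """Apply tag-driven modifications to (text, tag) pairs.

    Returns the resulting list of token strings.
    """
    out = []
    for text, tag in tagged_tokens:
        if tag == "SPLIT_ON_SLASH":
            # Splitting: one candidate token becomes several tokens.
            out.extend(part for part in text.split("/") if part)
        elif tag == "STRIP_TRAILING_PERIOD":
            # Stripping: remove a single character from the token.
            out.append(text[:-1] if text.endswith(".") else text)
        elif tag == "MERGE_WITH_PREVIOUS" and out:
            # Merging: join the candidate with the preceding token.
            out[-1] = out[-1] + text
        else:
            # Untagged tokens pass through unchanged.
            out.append(text)
    return out
```

For example, `modify([("and/or", "SPLIT_ON_SLASH"), ("etc.", "STRIP_TRAILING_PERIOD")])` yields `["and", "or", "etc"]`.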

45. A computerized tokenizer for identifying a token formed of a string of lexical characters found in a stream of digitized natural language text, the computerized tokenizer comprising:
- parsing means for extracting lexical and non-lexical characters from the stream of digitized text,
- identifying means coupled with said parsing means for identifying a set of tokens, each token being formed of a string of parsed lexical characters bounded by non-lexical characters,
- filtering means coupled with said identifying means for selecting a candidate token from said set of tokens, said candidate token being suitable for additional linguistic processing, and
- a memory element for storing and retrieving the digitized stream of natural language text and for storing and retrieving a data structure that includes parameters for each token,
- wherein said parameters include the lexical and non-lexical attributes of a token,
- wherein said lexical attributes are selected from the group consisting of internal character attributes, special processing attributes, end of sentence attributes, and noun phrase attributes.
- Dependent claims (46, 47, 48, 49)
  - 46. A tokenizer according to claim 45, wherein said internal character attributes are selected from the group consisting of leading apostrophe, internal apostrophe, trailing apostrophe, leading hyphen, internal hyphen, trailing hyphen, internal slash, and internal parentheses.
  - 47. A tokenizer according to claim 45, wherein said special processing attributes are selected from the group consisting of number flags, possible pre-clitic, possible post-clitic, and unicode error.
  - 48. A tokenizer according to claim 45, wherein said end of sentence attributes are selected from the group consisting of probable sentence termination, attached end of word period, stripped end of word period, capcode high, capcode low, and definite non sentence termination.
  - 49. A tokenizer according to claim 45, wherein said noun phrase attributes are selected from the group consisting of probable sentence termination, attached end of word period, stripped end of word period, capcode high, capcode low, definite non sentence termination, pre noun phrase break, and post noun phrase break.
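
The internal character attributes listed in claim 46 lend themselves to a bit-flag representation. The sketch below is one possible encoding, assuming Python's `enum.Flag`; the class name `InternalChar` and the detection rules in `internal_attributes` are assumptions, since the claims name the attributes but do not say how they are computed.

```python
from enum import Flag, auto


class InternalChar(Flag):
    # One flag per attribute named in claim 46.
    LEADING_APOSTROPHE = auto()
    INTERNAL_APOSTROPHE = auto()
    TRAILING_APOSTROPHE = auto()
    LEADING_HYPHEN = auto()
    INTERNAL_HYPHEN = auto()
    TRAILING_HYPHEN = auto()
    INTERNAL_SLASH = auto()
    INTERNAL_PARENTHESES = auto()


def internal_attributes(token: str) -> InternalChar:
    """Compute the flags for one token (empty flag set if none apply)."""
    attrs = InternalChar(0)
    if not token:
        return attrs
    inner = token[1:-1]  # characters strictly between first and last
    checks = (
        ("'", InternalChar.LEADING_APOSTROPHE,
         InternalChar.INTERNAL_APOSTROPHE, InternalChar.TRAILING_APOSTROPHE),
        ("-", InternalChar.LEADING_HYPHEN,
         InternalChar.INTERNAL_HYPHEN, InternalChar.TRAILING_HYPHEN),
    )
    for ch, leading, internal, trailing in checks:
        if token[0] == ch:
            attrs |= leading
        if len(token) > 1 and token[-1] == ch:
            attrs |= trailing
        if ch in inner:
            attrs |= internal
    if "/" in inner:
        attrs |= InternalChar.INTERNAL_SLASH
    if "(" in inner or ")" in inner:
        attrs |= InternalChar.INTERNAL_PARENTHESES
    return attrs
```

Packing the attributes into one integer-like value keeps the per-token record small and lets later stages test for an attribute with a single bitwise operation.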

50. A computerized data processing method for identifying a token formed of a string of lexical characters found in a stream of digitized natural language text, said method comprising the steps of:
- extracting lexical and non-lexical characters from the stream of text,
- identifying a set of tokens, each token being formed of a string of extracted lexical characters bounded by extracted non-lexical characters,
- selecting a candidate token from said set of tokens, said candidate token being suitable for additional linguistic processing,
- associating with said candidate token a tag identifying additional linguistic processing for said candidate token,
- storing in a first location of a memory element attributes of said candidate token, said attributes identifying the additional linguistic processing suitable for said candidate token,
- causing the tag to point to the first location, and
- selecting the lexical attributes from the group consisting of internal character attributes, special processing attributes, end of sentence attributes, and noun phrase attributes.
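
The storing and pointing steps of claim 50 can be pictured with a small sketch in which the tag is simply an index into an attribute store. `AttributeStore`, `tag_candidates`, and the two attribute rules used here are hypothetical; the claim does not prescribe a particular memory layout.

```python
class AttributeStore:
    """Stands in for the 'memory element'; each slot holds one attribute set."""

    def __init__(self):
        self._slots = []

    def store(self, attrs) -> int:
        # Store the attributes and return the location the tag will point to.
        self._slots.append(frozenset(attrs))
        return len(self._slots) - 1

    def lookup(self, tag: int) -> frozenset:
        return self._slots[tag]


def tag_candidates(words: list, store: AttributeStore) -> list:
    """Return (word, tag) pairs; tag is None for tokens needing no extra work."""
    tagged = []
    for word in words:
        attrs = set()
        if "-" in word[1:-1]:
            attrs.add("internal hyphen")
        if any(c.isdigit() for c in word):
            attrs.add("number flag")
        tag = store.store(attrs) if attrs else None
        tagged.append((word, tag))
    return tagged
```

A later processing stage would follow the tag to its slot, for example `store.lookup(tag)`, to learn which additional processing the token needs.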

Current Assignee
Vantage Technology Holdings LLC
Original Assignee
Lernout & Hauspie Speech Products NV (Intel Corporation)
Inventors
Carus, Alwin B.
Primary Examiner(s)
Thomas, Joseph

Application Number

US08/684,002
Time in Patent Office

984 Days
Field of Search

704/8, 704/9, 704/1, 704/10, 707/531, 707/532, 707/536, 707/1, 707/2, 707/4, 707/5
US Class Current

704/9
CPC Class Codes

G06F 16/313   Selection or weighting of t...

G06F 40/253   Grammatical analysis; Style...

G06F 40/268   Morphological analysis

G06F 40/284   Lexical analysis, e.g. toke...

G06Q 10/02   Reservations, e.g. for tick...
