Method and apparatus for improved tokenization of natural language text
First Claim
1. A computerized tokenizer for identifying a token formed of a string of lexical characters found in a stream of digitized natural language text, the computerized tokenizer comprising:
- parsing means for extracting lexical and non-lexical characters from the stream of digitized text,
- identifying means coupled with said parsing means for identifying a set of tokens, each token being formed of a string of parsed lexical characters bounded by non-lexical characters, and
- filtering means coupled with said identifying means for selecting a candidate token from said set of tokens, said candidate token being suitable for linguistic processing beyond the identification of tokens.
Abstract
This invention improves information retrieval by providing a tokenizing apparatus and method that parses natural language text in a manner that increases the throughput of an information retrieval or natural language analysis system. The tokenizer includes a parser that extracts characters from the stream of text, an identifying element for identifying a token formed of characters in the stream of text that include lexical matter, and a filter for assigning tags to those tokens requiring further linguistic analysis. The tokenizer, in a single pass through the stream of text, determines the further linguistic processing suitable to each particular token contained in the stream of text.
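The single-pass flow described in the abstract (parse characters, identify tokens, filter the ones needing further analysis) might be sketched as follows. This is an illustrative sketch only, not the patented implementation: the names `Token`, `is_lexical`, `needs_further_processing`, and `tokenize` are hypothetical, the rule that letters and digits count as lexical characters is an assumption, and the filter rule (a token is a candidate if it contains at least one letter) is invented for the example.

```python
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    text: str        # string of lexical characters
    start: int       # offset of the token's first character in the stream
    candidate: bool  # True if the filter selected it for further processing


def is_lexical(ch: str) -> bool:
    # Assumption: letters and digits are lexical; everything else is non-lexical.
    return ch.isalnum()


def needs_further_processing(text: str) -> bool:
    # Hypothetical filter rule: only tokens containing a letter are candidates.
    return any(ch.isalpha() for ch in text)


def tokenize(stream: str) -> Iterator[Token]:
    """Single pass over the stream: parse, identify tokens, filter candidates."""
    start = None  # offset where the current run of lexical characters began
    for i, ch in enumerate(stream):
        if is_lexical(ch):
            if start is None:
                start = i
        elif start is not None:
            # A non-lexical character closes the current token.
            text = stream[start:i]
            yield Token(text, start, needs_further_processing(text))
            start = None
    if start is not None:
        # The stream ended while a token was still open.
        text = stream[start:]
        yield Token(text, start, needs_further_processing(text))
```

Under these assumptions, `list(tokenize("Call 555 now."))` yields three tokens, of which `Call` and `now` are marked as candidates and `555` is not.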
224 Citations
50 Claims
1. A computerized tokenizer for identifying a token formed of a string of lexical characters found in a stream of digitized natural language text, the computerized tokenizer comprising:
- parsing means for extracting lexical and non-lexical characters from the stream of digitized text,
- identifying means coupled with said parsing means for identifying a set of tokens, each token being formed of a string of parsed lexical characters bounded by non-lexical characters, and
- filtering means coupled with said identifying means for selecting a candidate token from said set of tokens, said candidate token being suitable for linguistic processing beyond the identification of tokens.
Dependent claims: 2 to 24.
25. A computerized data processing method for identifying a token formed of a string of lexical characters found in a stream of digitized natural language text, said method comprising the steps of:
- extracting lexical and non-lexical characters from the stream of text,
- identifying a set of tokens, each token being formed of a string of extracted lexical characters bounded by extracted non-lexical characters, and
- using a filter to select a candidate token from said set of tokens, said candidate token being suitable for linguistic processing beyond the identification of tokens.
45. A computerized tokenizer for identifying a token formed of a string of lexical characters found in a stream of digitized natural language text, the computerized tokenizer comprising:
- parsing means for extracting lexical and non-lexical characters from the stream of digitized text,
- identifying means coupled with said parsing means for identifying a set of tokens, each token being formed of a string of parsed lexical characters bounded by non-lexical characters,
- filtering means coupled with said identifying means for selecting a candidate token from said set of tokens, said candidate token being suitable for additional linguistic processing, and
- a memory element for storing and retrieving the digitized stream of natural language text and for storing and retrieving a data structure that includes parameters for each token, wherein said parameters include the lexical and non-lexical attributes of a token, wherein said lexical attributes are selected from the group consisting of internal character attributes, special processing attributes, end of sentence attributes, and noun phrase attributes.
Dependent claims: 46 to 49.
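The per-token data structure recited in claim 45 could be modeled, for illustration, as a record holding a set of lexical attribute flags plus a non-lexical attribute. The sketch below is an assumption-laden example, not the structure defined in the specification: `LexicalAttr`, `TokenRecord`, and `build_record` are hypothetical names, the `trailing` field is an invented stand-in for a non-lexical attribute, and the rules that set each flag (a digit inside the token, sentence-final punctuation after it) are made up for the example.

```python
from dataclasses import dataclass
from enum import Flag, auto


class LexicalAttr(Flag):
    # The four attribute groups named in claim 45.
    NONE = 0
    INTERNAL_CHARACTER = auto()
    SPECIAL_PROCESSING = auto()
    END_OF_SENTENCE = auto()
    NOUN_PHRASE = auto()


@dataclass
class TokenRecord:
    text: str                              # the token's lexical characters
    offset: int                            # position in the stored text stream
    lexical: LexicalAttr = LexicalAttr.NONE
    trailing: str = ""                     # assumed non-lexical attribute: characters after the token


def build_record(text: str, offset: int, trailing: str) -> TokenRecord:
    attrs = LexicalAttr.NONE
    # Hypothetical rule: a digit inside the token sets the internal character flag.
    if any(ch.isdigit() for ch in text):
        attrs |= LexicalAttr.INTERNAL_CHARACTER
    # Hypothetical rule: sentence-final punctuation right after the token.
    if trailing[:1] in {".", "!", "?"}:
        attrs |= LexicalAttr.END_OF_SENTENCE
    return TokenRecord(text, offset, attrs, trailing)
```

A flag set keeps every attribute of a token in one field, so a later processing stage can test for a single attribute with `LexicalAttr.END_OF_SENTENCE in record.lexical`.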
50. A computerized data processing method for identifying a token formed of a string of lexical characters found in a stream of digitized natural language text, said method comprising the steps of:
- extracting lexical and non-lexical characters from the stream of text,
- identifying a set of tokens, each token being formed of a string of extracted lexical characters bounded by extracted non-lexical characters,
- selecting a candidate token from said set of tokens, said candidate token being suitable for additional linguistic processing,
- associating with said candidate token a tag identifying additional linguistic processing for said candidate token,
- storing in a first location of a memory element attributes of said candidate token, said attributes identifying the additional linguistic processing suitable for said candidate token,
- causing the tag to point to the first location, and
- selecting the lexical attributes from the group consisting of internal character attributes, special processing attributes, end of sentence attributes, and noun phrase attributes.
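The tagging steps of claim 50 (associate a tag with a candidate token, store its attributes at a memory location, make the tag point to that location) might look like the following sketch. It is illustrative only: `TaggedToken` and `tag_tokens` are hypothetical names, a Python list stands in for the memory element with the list index playing the role of the location, and both the candidate rule and the way the attribute values are computed are assumptions.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class TaggedToken(NamedTuple):
    text: str
    tag: Optional[int]  # index into the attribute store; None if not a candidate


def tag_tokens(tokens: List[str]) -> Tuple[List[TaggedToken], List[Dict[str, bool]]]:
    store: List[Dict[str, bool]] = []  # stands in for the memory element
    tagged: List[TaggedToken] = []
    for text in tokens:
        # Hypothetical candidate rule: the token contains at least one letter.
        if any(ch.isalpha() for ch in text):
            # Store the candidate's attributes at the next free location.
            store.append({
                "internal_character": any(ch.isdigit() for ch in text),
                "special_processing": False,
                "end_of_sentence": False,
                "noun_phrase": False,
            })
            # The tag points to the location just written.
            tagged.append(TaggedToken(text, len(store) - 1))
        else:
            tagged.append(TaggedToken(text, None))
    return tagged, store
```

For the input `["B2B", "555", "sales"]`, this sketch tags `B2B` with location 0 and `sales` with location 1, and leaves `555` untagged because the assumed rule does not select it as a candidate.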
Specification