Method for language-independent text tokenization using a character categorization
First Claim
1. A computer method for isolating words from an input stream of characters forming natural language text, the method comprising the steps of:
storing in a data processor a character classification table which describes each character in a character code as being either a delimiter character, an alpha/numeric character or a conditional delimiter character which assumes a function of a delimiter character when it occurs in predefined character contexts;
inputting to said data processor an input stream of characters which are members of said character code, the input stream forming natural language text;
building in said data processor a string of alpha/numeric characters to form a word from said input stream of characters produced by said inputting step;
isolating in said data processor three consecutive characters from said inputting step of said input stream, as a previous character, a current character and a next character;
accessing in said data processor said character classification table in response to said isolating step, to determine if said current character is a delimiter character, an alpha/numeric character or a conditional delimiter character;
appending in said data processor said current character to said string when said character classification table identifies it as an alpha/numeric character in said accessing step;
signaling an output signal from said data processor that said string is a complete word when said character classification table identifies said current character as a delimiter character in said accessing step;
analyzing in said data processor said previous character, said current character and said next character to determine if said current character assumes the function of a delimiter character when said character classification table identifies said current character as a conditional delimiter character in said accessing step.
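The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the ASCII-only table, the set of conditional delimiters (`.`, `,`, `'`, `-`), and the context rule in `acts_as_delimiter` (a conditional delimiter stays inside the word only when both neighbors are alpha/numeric) are all assumptions chosen for illustration, and the names `CharClass`, `TABLE`, `classify`, `acts_as_delimiter` and `tokenize` are hypothetical. The claim itself says only that the contexts are "predefined".

```python
from enum import Enum


class CharClass(Enum):
    DELIMITER = 1
    ALNUM = 2
    CONDITIONAL = 3


# Assumed set of conditional delimiters; the claim does not enumerate them.
CONDITIONAL_CHARS = set(".,'-")


def _build_table():
    """Build a classification table covering the ASCII character code."""
    table = {}
    for code in range(128):
        ch = chr(code)
        if ch in CONDITIONAL_CHARS:
            table[ch] = CharClass.CONDITIONAL
        elif ch.isalnum():
            table[ch] = CharClass.ALNUM
        else:
            table[ch] = CharClass.DELIMITER
    return table


TABLE = _build_table()


def classify(ch):
    # Characters outside the table are treated as delimiters (assumption).
    return TABLE.get(ch, CharClass.DELIMITER)


def acts_as_delimiter(prev, nxt):
    """Assumed context rule: delimit unless flanked by alpha/numerics."""
    flanked = (classify(prev) is CharClass.ALNUM
               and classify(nxt) is CharClass.ALNUM)
    return not flanked


def tokenize(text):
    words, current_word = [], []
    for i, cur in enumerate(text):
        # Isolate previous, current and next characters; pad the ends
        # of the stream with a space, which the table marks as a delimiter.
        prev = text[i - 1] if i > 0 else " "
        nxt = text[i + 1] if i + 1 < len(text) else " "
        cls = classify(cur)
        if cls is CharClass.CONDITIONAL:
            cls = (CharClass.DELIMITER if acts_as_delimiter(prev, nxt)
                   else CharClass.ALNUM)
        if cls is CharClass.ALNUM:
            current_word.append(cur)             # append to the string
        elif current_word:
            words.append("".join(current_word))  # signal a complete word
            current_word = []
    if current_word:
        words.append("".join(current_word))
    return words


print(tokenize("Don't pay $3.50, e.g. now."))
```

Under these assumed rules, the apostrophe in "Don't" and the period in "3.50" stay inside their words, while the comma after "3.50" and the sentence-final period act as delimiters. A different set of predefined contexts would only require changing `acts_as_delimiter`.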
Abstract
A computer method is disclosed to isolate linguistically salient strings ("words") from a natural language text stream. The process is applicable to a variety of computer hardware, to any character encoding scheme, and to the idiosyncrasies of most natural languages.
15 Claims
Dependent claims: 2 through 15.
Specification