Word storage table for natural language determination
Abstract
The language in which a document is written is identified through the use of sets of the most frequently used words in each of a plurality of candidate languages. Each set of most frequently used words is stored in a respective set of word tables for a respective candidate language according to letter pairs in each set of most frequently used words. In the preferred embodiment, each word table is an N×N bit table, where each bit represents a given letter pair at a particular place in one of the most frequently used words in one of the candidate languages. Words from the document are compared to the most frequently used words stored in the word tables. A count of the number of matches between the words from the document and the words stored in each respective set of word tables is kept for each respective language. The language of the document is identified as the respective candidate language having the greatest number of matches.
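The flow described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the names `WordTables` and `identify_language`, the 26-letter lowercase alphabet, the choice of adjacent letter pairs, and the tiny word lists are all assumptions made for illustration. Each N×N bit table is held as N row bitmasks, with one table per pair position.

```python
import string

ALPHABET = string.ascii_lowercase          # assumed alphabet, so N = 26
N = len(ALPHABET)
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


class WordTables:
    """Hypothetical set of N x N bit tables, one per letter-pair position."""

    def __init__(self, words):
        self.tables = {}                   # position -> N row bitmasks
        for word in words:
            self.store(word)

    def _pairs(self, word):
        # Assumption: pairs are adjacent letters (positions 0-1, 1-2, ...).
        for pos in range(len(word) - 1):
            yield pos, INDEX[word[pos]], INDEX[word[pos + 1]]

    def store(self, word):
        # Set the bit (row = first letter, column = second letter).
        for pos, row, col in self._pairs(word):
            rows = self.tables.setdefault(pos, [0] * N)
            rows[row] |= 1 << col

    def matches(self, word):
        # A word matches when every one of its pair bits is set.
        if len(word) < 2 or any(ch not in INDEX for ch in word):
            return False
        return all(
            pos in self.tables and (self.tables[pos][row] >> col) & 1
            for pos, row, col in self._pairs(word)
        )


def identify_language(text, tables_by_language):
    """Count matches per language and return the language with the most."""
    counts = {lang: 0 for lang in tables_by_language}
    for token in text.lower().split():
        word = token.strip(string.punctuation)
        for lang, tables in tables_by_language.items():
            if tables.matches(word):
                counts[lang] += 1
    return max(counts, key=counts.get), counts


# Illustrative word lists only; real sets would be far larger.
tables = {
    "english": WordTables(["the", "and", "that", "with", "for"]),
    "german": WordTables(["der", "die", "und", "das", "nicht"]),
}
language, counts = identify_language("The cat and the dog.", tables)
print(language, counts)  # english {'english': 3, 'german': 0}
```

Because this sketch does not separate words by length, a fragment such as "th" also matches the English tables. Grouping tables by word length is one way to avoid that.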
17 Claims
1. A method for identifying a plurality of character strings, comprising the steps of:
selecting a set of character strings;
storing the set of character strings in a word table as a set of ordered character pairs, wherein each word table is an N×N bit table, wherein each bit represents a given character pair at a particular place in one of the character strings;
comparing new character strings to the character strings stored in the word tables;
counting a number of matches between the new character strings and the character strings in the word table;
identifying the new character strings as related to the character strings in the word table if there are a sufficient number of matches.
3. A method for identifying a language in which a document is written, comprising the steps of:
selecting a set of most frequently used words in each of a plurality of candidate languages;
storing each set of most frequently used words in a respective set of word tables for a respective candidate language according to letter pairs in each set of most frequently used words, wherein each word table is an N×N bit table, wherein each bit represents a given letter pair at a particular place in one of the most frequently used words in one of the candidate languages;
comparing words from the document to the most frequently used words stored in the word tables;
counting a number of matches between the words from the document and the words stored in each respective set of word tables;
identifying the language of the document as the respective candidate language having the greatest number of matches.
8. A system including a processor and memory for identifying a language in which a document is written, comprising:
a plurality of word tables, each for a respective candidate language in which a set of most frequently used words are stored according to letter pairs in each of the most frequently used words, wherein each word table is an N×N bit table, wherein each bit represents a given letter pair at a particular place in one of the most frequently used words in one of the candidate languages;
a comparator for comparing words from the document to the most frequently used words stored in the word tables;
an accumulator for counting a number of matches between the words from the document and the words stored in each respective set of word tables;
means for identifying the language of the document as the respective candidate language having the greatest number of matches.
14. A computer program product on a computer readable medium for identifying a language in which a document is written, comprising:
means for providing a plurality of word tables, each arranged in tableaus for storing words of a respective length and in a respective candidate language according to letter pairs in each of the stored words, wherein each word table is an N×N bit table, wherein each bit represents a given letter pair at a particular place in one of the most frequently used words in one of the candidate languages;
means for comparing words from the document to the words stored in the word tables;
means for counting a number of matches between the words from the document and the words stored in each respective set of word tables;
means for identifying the language of the document as the respective candidate language having the greatest number of matches.
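The arrangement in claim 14, where tables are grouped into tableaus by word length, can be sketched as follows. This is one possible reading under stated assumptions, not the patent's own layout: the class name `LengthTableaus`, the helper `pair_bit`, the lowercase 26-letter alphabet, and the use of adjacent letter pairs are hypothetical. Each N×N bit table is stored here as a single Python integer of N·N bits.

```python
import string
from collections import defaultdict

ALPHABET = string.ascii_lowercase          # assumed alphabet, so N = 26
N = len(ALPHABET)


def pair_bit(first, second):
    """Mask for cell (first, second) of a flat N x N bit table."""
    return 1 << (ALPHABET.index(first) * N + ALPHABET.index(second))


class LengthTableaus:
    """Hypothetical tableaus: tableaus[length][position] is one bit table."""

    def __init__(self):
        self.tableaus = defaultdict(lambda: defaultdict(int))

    def store(self, word):
        # Words of length L only ever touch the tableau for length L.
        tableau = self.tableaus[len(word)]
        for pos in range(len(word) - 1):
            tableau[pos] |= pair_bit(word[pos], word[pos + 1])

    def matches(self, word):
        if len(word) < 2 or any(ch not in ALPHABET for ch in word):
            return False
        if len(word) not in self.tableaus:
            return False                   # no stored word has this length
        tableau = self.tableaus[len(word)]
        return all(
            tableau[pos] & pair_bit(word[pos], word[pos + 1])
            for pos in range(len(word) - 1)
        )


tableaus = LengthTableaus()
for stored in ("that", "with"):
    tableaus.store(stored)

print(tableaus.matches("that"))  # True
print(tableaus.matches("tha"))   # False: no tableau for three-letter words
```

Keeping one tableau per word length means a three-letter fragment is never tested against the pair tables built from four-letter words.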
Specification