Natural language determination using partial words
Abstract
The language in which a document is written is identified by comparing the short and truncated words of the document to word tables of the most frequently used words in each respective candidate language. First, a plurality of words from a document is read into a computer memory. Then, words within the plurality of words which exceed a predetermined length are truncated to produce a set of short and truncated words. The set of short and truncated words is compared to words in a plurality of word tables. Each word table is associated with and contains a selection of most frequently used words in a respective candidate language. Although the most frequently used words in most languages tend to be short, those which exceed the predetermined length may be truncated in the word tables. A respective count is accumulated for each candidate language each time one of the set of short and truncated words from the document matches a word in a word table associated with the candidate language. In some embodiments, the count may be weighted by factors related to the frequency of occurrence of the words in the respective candidate languages. The language of the document is identified as the language associated with the count having the highest value.
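The steps in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the predetermined length of 4, the tiny word lists, the punctuation stripping, and the names `MAX_LEN`, `RAW_TABLES`, `truncate`, and `identify_language` are all assumptions chosen for illustration.

```python
# Hypothetical "predetermined length"; the source does not fix a value.
MAX_LEN = 4

# Hypothetical, very small selections of frequent words per candidate language.
RAW_TABLES = {
    "english": ["the", "of", "and", "to", "in", "that", "which", "would"],
    "german": ["der", "die", "und", "in", "den", "von", "nicht", "werden"],
    "french": ["de", "la", "le", "et", "les", "des", "dans", "pour"],
}


def truncate(word, max_len=MAX_LEN):
    """Keep only the first max_len characters of a word."""
    return word[:max_len]


# Table entries longer than MAX_LEN are stored in truncated form,
# so document words and table words are compared on equal terms.
WORD_TABLES = {
    lang: {truncate(w) for w in words} for lang, words in RAW_TABLES.items()
}


def identify_language(text, tables=WORD_TABLES):
    """Return (best_language, counts) using simple match counts."""
    counts = {lang: 0 for lang in tables}
    for word in text.lower().split():
        key = truncate(word.strip(".,;:!?\"'()"))
        for lang, table in tables.items():
            if key in table:
                counts[lang] += 1
    best = max(counts, key=counts.get)  # ties resolve to the first language listed
    return best, counts
```

With these toy tables, a call such as `identify_language("Der Hund und die Katze werden nicht in den Garten gehen")` selects `"german"`, because truncated forms such as `"werd"` and `"nich"` match the truncated table entries.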
18 Claims
1. A method for identifying the language in which a document is written, comprising the steps of:
reading a plurality of words from a document into a computer memory;
truncating words within the plurality of words which exceed a predetermined length to produce a set of short and truncated words;
comparing the set of short and truncated words to words in a plurality of word tables, each word table associated with and containing a selection of most frequently used words in a respective candidate language, wherein the most frequently used words which exceed the predetermined length are truncated in the word tables;
accumulating a respective count for each candidate language each time one of the set of short and truncated words from the document matches a word in a word table associated with the candidate language; and
identifying the language of the document as the language associated with the count having the highest value.
Dependent claims: 2, 3, 4, 5, 6, 7
2. The method as recited in claim 1 further comprising the steps of:
storing a frequency of occurrence of each word in the respective candidate language in each word table;
using the frequency of occurrences to weight the respective counts for each candidate language.
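A weighted count of this kind might be sketched as follows. The frequency values are illustrative placeholders, not measured statistics, and `FREQ_TABLES` and `weighted_scores` are hypothetical names; the sketch also assumes a predetermined length of 4.

```python
MAX_LEN = 4  # hypothetical predetermined length

# Hypothetical tables mapping each (already truncated) word to an
# illustrative frequency of occurrence in its language, in percent.
FREQ_TABLES = {
    "english": {"the": 6.9, "of": 3.6, "and": 2.8, "in": 2.1},
    "german": {"der": 3.5, "die": 3.4, "und": 2.9, "in": 1.8},
}


def weighted_scores(words, tables=FREQ_TABLES, max_len=MAX_LEN):
    """Add the stored frequency, rather than 1, for every match."""
    scores = {lang: 0.0 for lang in tables}
    for word in words:
        key = word.lower()[:max_len]
        for lang, table in tables.items():
            scores[lang] += table.get(key, 0.0)
    return scores
```

A word shared by several languages, such as "in" here, then contributes to each language in proportion to how common it is in that language rather than equally.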
3. The method as recited in claim 1 further comprising the steps of:
accumulating a respective weighted count for each candidate language equivalent to an aggregate frequency of occurrence for each time one of the set of short and truncated words from the document matches a word in a word table associated with the candidate language;
determining a number of the short and truncated words compared to the word tables; and
using the respective weighted count to identify the language of the document until the number of short and truncated words exceeds a predetermined number.
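One possible reading of claim 3 is sketched below: both a simple and a weighted count are kept, and the weighted count decides the result only while the number of compared words has not exceeded a threshold. Falling back to the simple count afterwards is an assumption, since the claim does not say what is used beyond the threshold. The names `identify`, `threshold`, and the table layout are hypothetical.

```python
def identify(words, tables, threshold=50, max_len=4):
    """Return (language, number_of_words_compared).

    tables maps language -> {truncated_word: frequency}.
    threshold and max_len are hypothetical parameters.
    """
    simple = {lang: 0 for lang in tables}
    weighted = {lang: 0.0 for lang in tables}
    compared = 0
    for word in words:
        key = word.lower()[:max_len]
        compared += 1  # every short or truncated word compared is counted
        for lang, table in tables.items():
            if key in table:
                simple[lang] += 1
                weighted[lang] += table[key]
    # Assumption: weighted counts for short inputs, simple counts otherwise.
    scores = weighted if compared <= threshold else simple
    return max(scores, key=scores.get), compared
```

The two counts can disagree on short inputs, for example when one language matches a single very frequent word and another matches two rare ones, which is where the choice of count matters most.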
4. The method as recited in claim 1 further comprising the step of selecting most frequently used words for the word tables to reduce strong aliasing between candidate languages.
5. The method as recited in claim 1 wherein a first truncated word is a weak alias of a first short word in a word table so that the first truncated word matches the first short word in the comparing step.
6. The method as recited in claim 1 wherein a plurality of tableaus of word tables is associated with each candidate language, each tableau of word tables for comparing words of a respective length and the method further comprises the steps of:
determining the length of each word in the plurality of words;
comparing words of a respective length in the set of short and truncated words with the tableaus for each respective candidate language for words of the respective length;
wherein the truncated words are compared to the tableaus for words of a longest respective length.
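The per-length organization in claim 6 could be modeled, as a rough sketch, with one set of words per word length for each language. Truncated words then land in the tableau for the longest length. The word lists, the length of 4, and the names `build_tableaus`, `TABLEAUS`, and `count_matches` are assumptions for illustration only.

```python
MAX_LEN = 4  # hypothetical predetermined (longest) length


def build_tableaus(words, max_len=MAX_LEN):
    """Group table words by length; long words are truncated to max_len."""
    tableaus = {n: set() for n in range(1, max_len + 1)}
    for w in words:
        key = w.lower()[:max_len]
        tableaus[len(key)].add(key)
    return tableaus


# Hypothetical word selections for two candidate languages.
TABLEAUS = {
    "english": build_tableaus(["a", "of", "the", "that", "which"]),
    "german": build_tableaus(["in", "der", "nicht", "werden"]),
}


def count_matches(words, tableaus=TABLEAUS, max_len=MAX_LEN):
    """Compare each word only against the tableau for its own length."""
    counts = {lang: 0 for lang in tableaus}
    for word in words:
        if not word:
            continue
        key = word.lower()[:max_len]
        n = len(key)  # truncated words always have length max_len
        for lang, by_len in tableaus.items():
            if key in by_len[n]:
                counts[lang] += 1
    return counts
```

Determining the length first means each word is checked against only one tableau per language instead of the whole table.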
7. The method as recited in claim 1 wherein the words are stored in the tables as bits set for a presence of an ordered letter pair in one of the most frequently used words in the respective candidate language.
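A bit-table storage of this kind might look like the sketch below. The claim only states that bits are set for the presence of ordered letter pairs; which pairs are recorded is not specified here, so the sketch assumes adjacent letters at each position, lowercase a to z only, a single table with no separate check on word length, and a Python integer as the bit store. The names `pair_index`, `build_bit_table`, and `matches` are hypothetical.

```python
import string

ALPHA = string.ascii_lowercase
MAX_LEN = 4  # hypothetical predetermined length


def pair_index(pos, a, b):
    """Bit position for letter a at index pos followed by letter b."""
    return (pos * 26 + ALPHA.index(a)) * 26 + ALPHA.index(b)


def build_bit_table(words, max_len=MAX_LEN):
    """Set one bit per adjacent letter pair of each (truncated) table word.

    Table words are assumed to contain only the letters a to z.
    """
    bits = 0
    for w in words:
        key = w.lower()[:max_len]
        for pos in range(len(key) - 1):
            bits |= 1 << pair_index(pos, key[pos], key[pos + 1])
    return bits


def matches(word, bits, max_len=MAX_LEN):
    """A word matches when every one of its letter-pair bits is set."""
    key = word.lower()[:max_len]
    if len(key) < 2 or any(c not in ALPHA for c in key):
        return False  # assumption: words with no usable pair never match
    return all(
        (bits >> pair_index(pos, key[pos], key[pos + 1])) & 1
        for pos in range(len(key) - 1)
    )
```

Because only pairs are stored, a word that was never put in the table can still match if all of its pairs happen to be set. With a table built from "then" and "what", for instance, "that" also matches, which illustrates the kind of aliasing that the selection of table words has to account for.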
8. A method for identifying the language in which a document is written, comprising the steps of:
reading a plurality of words from a document;
truncating words within the plurality of words which exceed a predetermined length to produce a set of short and truncated words;
comparing the set of short and truncated words to words in a plurality of word tables, each word table associated with and containing a selection of most frequently used words in a respective candidate language, wherein the most frequently used words which exceed the predetermined length are truncated in the word tables;
accumulating a respective weighted count for each candidate language each time one of the set of short and truncated words from the document matches a word in a word table associated with the candidate language; and
identifying the language of the document as the language associated with the weighted count having the highest value.
Dependent claims: 9
10. A system including processor and memory for identifying the language in which a document is written comprising:
means for reading a plurality of words from a document into the memory;
means for determining a length of each word in the plurality of words;
means for truncating words within the plurality of words which exceed a predetermined length to produce a set of short and truncated words;
means for comparing the set of short and truncated words to words in a plurality of word tables, each word table associated with and containing a selection of most frequently used words in a respective candidate language;
means for accumulating a respective count for each candidate language each time one of the set of short and truncated words from the document matches a word in a word table associated with the candidate language; and
means for identifying the language of the document as the language associated with the count having the highest value.
Dependent claims: 11, 12, 13, 14, 15
a plurality of tableaus of word tables each associated with each candidate language, each tableau of word tables for comparing words of a respective length;
wherein the comparing means compares words of a first respective length in the set of short and truncated words with the tableaus for each respective candidate language for words of the first respective length.
16. A computer program product on a computer readable medium for identifying the language in which a document is written comprising:
means for determining a length of each word in a plurality of words from the document;
means for truncating words within the plurality of words which exceed a predetermined length to produce a set of short and truncated words;
means for comparing the set of short and truncated words to words in a plurality of word tables, each word table associated with and containing a selection of most frequently used words in a respective candidate language;
means for accumulating a respective count for each candidate language each time one of the set of short and truncated words from the document matches a word in a word table associated with the candidate language; and
means for identifying the language of the document as the language associated with the count having the highest value.
Dependent claims: 17, 18
Specification