Word counting natural language determination
Abstract
A technique for identifying a language in which a computer document is written. Words from the document are compared to words in a plurality of word tables. Each of the word tables is associated with a respective candidate language and contains a selection of the most frequently used words in the language. The words in each word table are selected based on the frequency of occurrence in a candidate language so that each word table covers an equivalent percentage of the associated candidate language. A count is accumulated for each candidate language each time one of the plurality of words from the document is present in the associated word table. In the simple counting embodiment of the invention, the count is incremented by one. The language of the document is identified as the language associated with the count having the highest value.
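The simple counting embodiment described in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `identify_language`, the `WORD_TABLES` dictionary, and the tiny word lists are hypothetical and are not taken from the patent; real tables would be built from large corpora.

```python
import re

# Hypothetical, deliberately tiny word tables (assumption for illustration only).
WORD_TABLES = {
    "english": {"the", "of", "and", "to", "a", "in", "is", "that"},
    "german": {"der", "die", "und", "in", "den", "von", "zu", "das"},
    "french": {"de", "la", "le", "et", "les", "des", "en", "un"},
}

def identify_language(text, word_tables=WORD_TABLES):
    """Return (best_language, counts) using the simple counting approach."""
    counts = {lang: 0 for lang in word_tables}
    # Split the document into lowercase alphabetic words.
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        # Increment the count of every language whose table contains the word.
        for lang, table in word_tables.items():
            if word in table:
                counts[lang] += 1
    # The language with the highest count wins; ties go to the first table listed.
    best = max(counts, key=counts.get)
    return best, counts
```

A word such as "in" appears in more than one table here, so it increments several counts at once; the decision rests on which total ends up highest.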
20 Claims
1. A method for identifying a language in which a computer document is written, comprising the steps of:
comparing a plurality of words from the document to words in a plurality of word tables, each word table associated with and containing a selection of most frequently used words in a respective candidate language;
accumulating a respective count for each candidate language each time one of the plurality of words from the document is present in the associated word table; and
identifying the language of the document as the language associated with the count having the highest value.
Dependent claims: 2, 3, 4, 5, 6, 7
associating each word table with a respective set of accumulators, each accumulator in the set of accumulators for counting the occurrences of a respective word in the word table; and
summing the counts in each set of accumulators once the plurality of words have been compared to the word tables.
-
-
5. The method as recited in claim 4, further comprising the steps of:
multiplying a total count of each word in a respective accumulator by the stored frequency of occurrence for the word in the word table to produce a set of weighted counts;
summing the set of weighted counts to produce an aggregate weighted count once the plurality of words have been compared to the word tables; and
identifying the language of the document as the language associated with the aggregate weighted count having the highest value.
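The weighted variant of claim 5 (per-word accumulators, each multiplied by a stored frequency and then summed) might look like the following sketch. The name `weighted_scores`, the `FREQ_TABLES` structure, and the frequency values are assumptions made for illustration; the values are not measured data.

```python
from collections import Counter

# Hypothetical tables mapping word -> stored frequency of occurrence.
# The numbers are illustrative placeholders, not real corpus statistics.
FREQ_TABLES = {
    "english": {"the": 0.07, "of": 0.036, "and": 0.029, "in": 0.021},
    "german": {"der": 0.035, "die": 0.034, "und": 0.029, "in": 0.018},
}

def weighted_scores(words, freq_tables=FREQ_TABLES):
    """Return an aggregate weighted count per candidate language."""
    # One accumulator per table word, grouped by language.
    accumulators = {lang: Counter() for lang in freq_tables}
    for word in words:
        for lang, table in freq_tables.items():
            if word in table:
                accumulators[lang][word] += 1
    # After all words are compared: weight each word count, then sum per language.
    return {
        lang: sum(n * freq_tables[lang][w] for w, n in acc.items())
        for lang, acc in accumulators.items()
    }
```

The language is then chosen with `max(scores, key=scores.get)`, exactly as in the unweighted case.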
-
-
6. The method as recited in claim 1 wherein the plurality of words represent a subset of the total number of words in the document and the method further comprises the steps of:
counting the plurality of words as each of the plurality is compared to the words in the word tables; and
responsive to the count of the plurality of words reaching a predetermined number, stopping the comparing and accumulating steps.
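The early-stop behavior of claim 6 can be sketched as a counter on the number of compared words. The function name `count_with_limit` and the parameter `max_words` (standing in for the "predetermined number") are hypothetical.

```python
def count_with_limit(words, word_tables, max_words):
    """Compare at most max_words words; return (counts, words_compared)."""
    counts = {lang: 0 for lang in word_tables}
    compared = 0
    for word in words:
        # Stop comparing and accumulating once the limit is reached.
        if compared >= max_words:
            break
        compared += 1
        for lang, table in word_tables.items():
            if word in table:
                counts[lang] += 1
    return counts, compared
```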
-
-
7. The method as recited in claim 5, further comprising the steps of:
counting the plurality of words as each of the plurality is compared to the words in the word tables;
using the identified language according to the aggregate weighted count as the identified language of the document as long as the count of the plurality of words is less than a predetermined number; and
using the identified language according to the count as the identified language once the count of the plurality of words reaches the predetermined number.
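Claim 7 switches between the two scoring schemes depending on how many words have been compared. A minimal sketch, assuming hypothetical names `identify_hybrid` and `threshold` and a table layout of word -> stored frequency:

```python
def identify_hybrid(words, freq_tables, threshold):
    """Use weighted scores below the threshold, simple counts at or above it."""
    simple = {lang: 0 for lang in freq_tables}
    weighted = {lang: 0.0 for lang in freq_tables}
    compared = 0
    for word in words:
        compared += 1
        for lang, table in freq_tables.items():
            if word in table:
                simple[lang] += 1
                # Adding the frequency per hit equals count * frequency summed later.
                weighted[lang] += table[word]
    # Short samples rely on the weighted count; longer ones on the plain count.
    scores = weighted if compared < threshold else simple
    return max(scores, key=scores.get), compared
```

The sketch decides once, after all supplied words are processed; a streaming version could apply the same test after every word.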
-
-
8. A system including a processor for identifying a language in which a target computer document is written, comprising:
a memory for storing the target document and a plurality of word tables, each word table associated with and containing a selection of most frequently used words in a respective candidate language;
a comparator for comparing a plurality of words from the document to words in the word tables;
a set of accumulators for accumulating a respective count for each candidate language by one each time one of the plurality of words from the document is present in a word table, each accumulator associated with a respective word table; and
a language identifier for identifying the language of the target document as the language associated with the count having the highest value.
Dependent claims: 9, 10, 11, 12, 13, 14
means for scanning a plurality of documents in each candidate language;
means for counting each of a plurality of words in the documents to establish a frequency of occurrence value for each word in each candidate language;
means for storing candidate words having a frequency of occurrence value exceeding a threshold value in each candidate language; and
means for selecting among the candidate words and storing the selected words to form word tables for each of the candidate languages so that each word table covers a substantially equivalent percentage of the associated candidate language.
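The table-building means recited above (scan documents, count words, keep words above a frequency threshold, then select words until a target share of the language is covered) could be approximated as follows. The name `build_word_table` and the parameters `coverage` and `min_freq` are assumptions; the patent text here does not specify how the selection is performed.

```python
from collections import Counter

def build_word_table(corpus_words, coverage, min_freq=0.0):
    """Pick the most frequent words until their combined share reaches coverage.

    Returns (table, covered) where table maps word -> relative frequency.
    """
    counts = Counter(corpus_words)
    total = sum(counts.values())
    table, covered = {}, 0.0
    # most_common() yields words in descending order of count.
    for word, n in counts.most_common():
        freq = n / total
        # Stop at words below the threshold or once the target share is met.
        if freq <= min_freq or covered >= coverage:
            break
        table[word] = freq
        covered += freq
    return table, covered
```

Calling this with the same `coverage` value for each candidate language yields tables of different sizes that cover roughly the same share of running text, which is the property the claim asks for.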
-
-
10. The system as recited in claim 9 further comprising means to associate the frequency of occurrence value with the selected words in each of the word tables.
-
11. The system as recited in claim 8 wherein special words which occur in only one candidate language are included in a respective word table and wherein when the comparator detects a special word in the target document greater weight is given in the accumulated count for the respective candidate language.
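One way to give special words greater weight, as claim 11 describes, is to add a larger increment when such a word is found. In this sketch the name `count_with_special_words`, the `special_words` mapping, and the default weight of 5 are all assumptions; the claim does not state how much extra weight is given.

```python
def count_with_special_words(words, word_tables, special_words, special_weight=5):
    """Count table hits, adding special_weight for language-unique words."""
    counts = {lang: 0 for lang in word_tables}
    for word in words:
        for lang, table in word_tables.items():
            if word in table:
                # Words unique to one language count more than ordinary hits.
                is_special = word in special_words.get(lang, set())
                counts[lang] += special_weight if is_special else 1
    return counts
```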
-
12. The system as recited in claim 8 which stops once a predetermined number of words from the target document are compared to the words in the word tables.
-
13. The system as recited in claim 8 which stops once a predetermined amount of divergence is detected in the set of accumulators.
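Claim 13 stops processing once the accumulators diverge by a predetermined amount. The claim does not define divergence, so the sketch below assumes one plausible reading: the lead of the highest count over the runner-up. The names `count_until_divergence` and `margin` are hypothetical.

```python
def count_until_divergence(words, word_tables, margin):
    """Stop as soon as the top count leads the runner-up by at least margin."""
    counts = {lang: 0 for lang in word_tables}
    compared = 0
    for word in words:
        compared += 1
        for lang, table in word_tables.items():
            if word in table:
                counts[lang] += 1
        # Assumed divergence measure: gap between the two highest counts.
        ranked = sorted(counts.values(), reverse=True)
        if len(ranked) > 1 and ranked[0] - ranked[1] >= margin:
            break
    return max(counts, key=counts.get), compared
```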
-
14. The system as recited in claim 8 wherein a predetermined minimum number of words from the target document must be compared before identifying the language of the target document.
-
15. A system comprising a memory and a processor for identifying a language in which a computer document is written, wherein a plurality of words from the document are compared to words in a plurality of word tables, each word table associated with and containing a selection of most frequently used words in a respective candidate language, a respective weighted count is accumulated for each candidate language each time one of the plurality of words from the document is present in the associated word table, and identifying the language of the document as the language associated with the count having the highest value, the improvement comprising:
the words in each word table are selected based on frequency of occurrence in a candidate language so that each word table covers an equivalent percentage of the associated candidate language.
-
16. A computer program product on a computer readable medium for identifying a language in which a computer document is written, comprising:
a plurality of word tables, each word table associated with and containing a selection of most frequently used words in a respective candidate language;
means for comparing a plurality of words from the document to the words in the word tables;
means for accumulating a respective count for each candidate language each time one of the plurality of words from the document is present in the associated word table; and
means for identifying the language of the document as the language associated with the count having the highest value.
Dependent claims: 17, 18, 19, 20
means for scanning a plurality of documents in each candidate language;
means for counting each of a plurality of words in the documents to establish a frequency of occurrence value for each word in each candidate language;
means for storing candidate words having a frequency of occurrence value exceeding a threshold value in each candidate language; and
means for selecting among the candidate words and storing the selected words to form word tables for each of the candidate languages so that each word table covers a substantially equivalent percentage of the associated candidate language.
-
-
18. The product as recited in claim 16, wherein the frequency of occurrence of each word in each word table is stored in the word table and further comprises:
means for individually counting occurrences of each respective word in the word tables in the document;
means for counting a total number of words in the plurality of words compared to words in the word tables;
means responsive to the total number of compared words being less than a predetermined number for multiplying the individual count of each word by the stored frequency of occurrence for the word in the word table to produce a set of weighted counts and for summing the set of weighted counts to produce an aggregate weighted count once the plurality of words have been compared to the word tables; and
means responsive to the total number of compared words being at least the predetermined number for summing the counts in each set of accumulators.
-
-
19. The product as recited in claim 16 which further comprises a set of word tables which represent genres within a candidate language.
-
20. The product as recited in claim 16, further comprising:
means for counting the number of words from the document compared to the words in the word tables; and
means for stopping the comparing and accumulating means once a predetermined number of words from the target document are compared to the words in the word tables.
-
Specification