Natural language determination using correlation between common words
Abstract
The language in which a computer document is written is identified. A plurality of words from the document is compared to words in a word list associated with a candidate language. The words in the word list are a selection of the most frequently used words in the candidate language. A count of matches between words in the document and words in the word list is accumulated for each word in the word list to produce a sample count. The sample count is correlated to a reference count for the candidate language to produce a correlation score for the candidate language. The language of the document is identified based on the correlation score. Generally, there is a plurality of candidate languages. Thus, the comparing, accumulating, correlating and identifying processes are practiced for each language. The language of the document is identified as the candidate language having a reference count which generates the highest correlation score.
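The process summarized in the abstract can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the patented implementation: the word lists and reference counts in `REFERENCE_COUNTS` are made-up values, the function names are hypothetical, and the Pearson correlation coefficient is assumed here as one possible "statistical measure" of association, since this section does not name a specific one.

```python
from collections import Counter
from math import sqrt
import re

# Hypothetical word lists with illustrative reference counts. These numbers
# are invented for the sketch and are not taken from the patent.
REFERENCE_COUNTS = {
    "english": {"the": 70, "of": 36, "and": 29, "to": 26, "in": 21, "is": 10},
    "german": {"der": 36, "die": 35, "und": 29, "in": 19, "das": 12, "ist": 10},
    "french": {"de": 50, "la": 30, "le": 25, "et": 24, "les": 20, "est": 10},
}


def sample_counts(text, word_list):
    """Count how often each word of the word list occurs in the text."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())  # runs of letters only
    seen = Counter(tokens)
    return [seen[word] for word in word_list]


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either sequence is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: treat an undefined correlation as no association
    return cov / sqrt(var_x * var_y)


def identify_language(text, reference_counts=REFERENCE_COUNTS):
    """Score every candidate language and return (best_language, scores)."""
    scores = {}
    for language, reference in reference_counts.items():
        words = list(reference)
        samples = sample_counts(text, words)           # comparing + accumulating
        references = [reference[w] for w in words]
        scores[language] = pearson(samples, references)  # correlating
    best = max(scores, key=scores.get)                 # identifying
    return best, scores


if __name__ == "__main__":
    text = ("The cat is in the garden and the dog is in the house. "
            "The owner of the house went to the market to buy the food of the day.")
    language, scores = identify_language(text)
    print(language, scores)
```

With these invented counts, the short English sample above scores highest for `"english"`. Real word lists and reference counts would have to be derived from representative text in each candidate language, and very short documents can produce unreliable scores because few list words occur in them.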
25 Claims
1. A method for identifying the language of a document in which a computer document is written, comprising the steps of:
- comparing a plurality of words from the document to a word list associated with a candidate language, wherein words in the word list are a selection of a small number of the most frequently used words in the candidate language;
- accumulating a count of matches between words in the document and words in the word list for each word in the word list to produce a sample count for each word in the word list;
- correlating the sample count to a reference count for each word in the word list for the candidate language to produce a correlation score for the candidate language, wherein the correlation score is a statistical measure of a collective strength of association between the sample counts and reference counts; and
- identifying the language of the document based on the correlation score.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
12. A system including processor and memory for identifying the language of a document in which a computer document is written, comprising:
- means for comparing a plurality of words from the document to a word list associated with a candidate language, wherein words in the word list are a selection of a small number of the most frequently used words in the candidate language;
- means for accumulating a count of matches between words in the document and words in the word list for each word in the word list to produce a sample count for each word in the word list;
- means for correlating the sample count to a reference count for each word in the word list for the candidate language to produce a correlation score for the candidate language, wherein the correlation score is a statistical measure of a collective strength of association between the sample counts and reference counts; and
- means for identifying the language of the document based on the correlation score.
Dependent claims: 13, 14, 15, 16, 17, 18, 19
20. A computer program product in a computer readable medium for identifying the language of a document in which a computer document is written, comprising:
- means for comparing a plurality of words from the document to a word list associated with a candidate language, wherein words in the word list are a selection of a small number of the most frequently used words in the candidate language;
- means for accumulating a count of matches between words in the document and words in the word list for each word in the word list to produce a sample count for each word in the word list;
- means for correlating the sample count to a reference count for each word in the word list for the candidate language to produce a correlation score for the candidate language, wherein the correlation score is a statistical measure of a collective strength of association between the sample counts and reference counts; and
- means for identifying the language of the document based on the correlation score.
Dependent claims: 21, 22, 23, 24, 25
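The independent claims describe the correlation score only as "a statistical measure of a collective strength of association between the sample counts and reference counts". One familiar measure that fits this description is the Pearson correlation coefficient; it is given here purely as an illustrative assumption, and the symbols are introduced for this note rather than taken from the claims. Let $s_i$ be the sample count and $c_{L,i}$ the reference count of the $i$-th word in the $n$-word list for candidate language $L$, with means $\bar{s}$ and $\bar{c}_L$:

```latex
r_L = \frac{\sum_{i=1}^{n} (s_i - \bar{s})\,(c_{L,i} - \bar{c}_L)}
           {\sqrt{\sum_{i=1}^{n} (s_i - \bar{s})^2 \;\sum_{i=1}^{n} (c_{L,i} - \bar{c}_L)^2}},
\qquad
\hat{L} = \arg\max_{L} r_L
```

Under this assumption, $r_L$ lies between $-1$ and $1$, and the document is assigned to the candidate language $\hat{L}$ with the highest score, consistent with the selection rule stated in the abstract.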
Specification