Natural language determination using correlation between common words

US 6,023,670 A
Filed: 12/20/1996
Issued: 02/08/2000
Est. Priority Date: 08/19/1996
Status: Expired due to Fees

First Claim

1. A method for identifying the language of a document in which a computer document is written, comprising the steps of:

comparing a plurality of words from the document to a word list associated with a candidate language, wherein words in the word list are a selection of a small number of the most frequently used words in the candidate language;

accumulating a count of matches between words in the document and words in the word list for each word in the word list to produce a sample count for each word in the word list;

correlating the sample count to a reference count for each word in the word list for the candidate language to produce a correlation score for the candidate language, wherein the correlation score is a statistical measure of a collective strength of association between the sample counts and reference counts; and

identifying the language of the document based on the correlation score.

Abstract

The language in which a computer document is written is identified. A plurality of words from the document are compared to words in a word list associated with a candidate language. The words in the word list are a selection of the most frequently used words in the candidate language. A count of matches between words in the document and words in the word list is accumulated for each word in the word list to produce a sample count. The sample count is correlated to a reference count for the candidate language to produce a correlation score for the candidate language. The language of the document is identified based on the correlation score. Generally, there is a plurality of candidate languages, so the comparing, accumulating, correlating and identifying processes are practiced for each language. The language of the document is identified as the candidate language whose reference count generates the highest correlation score.
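The flow described in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patented implementation: the text says the score is "a statistical measure of a collective strength of association" without naming a formula here, so the Pearson correlation coefficient is assumed; the word lists and reference counts in `REFERENCE` are made-up toy values; and the names `pearson`, `sample_counts` and `identify_language` are hypothetical.

```python
import math
import re

# Hypothetical toy reference data: a handful of very common words per language
# with invented reference counts (occurrences per 1,000 words). Not from the patent.
REFERENCE = {
    "english": {"the": 62, "of": 36, "and": 29, "to": 26, "in": 21},
    "german": {"der": 35, "die": 33, "und": 29, "in": 18, "den": 12},
    "french": {"de": 52, "la": 31, "le": 27, "et": 25, "les": 21},
}


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (assumed measure)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # no variation, so no measurable association
    return cov / math.sqrt(var_x * var_y)


def sample_counts(text, word_list):
    """Count how often each word of the word list occurs in the text."""
    counts = dict.fromkeys(word_list, 0)
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        if word in counts:
            counts[word] += 1
    return counts


def identify_language(text, reference=REFERENCE):
    """Return the best-scoring candidate language and all correlation scores."""
    scores = {}
    for language, ref in reference.items():
        sample = sample_counts(text, ref)
        words = list(ref)
        scores[language] = pearson(
            [sample[w] for w in words], [ref[w] for w in words]
        )
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    best, scores = identify_language(
        "The cat sat in the garden and the dog went to the end of the road."
    )
    print(best, scores)
```

Note that a shared word such as "in" contributes to more than one candidate, which is why the score compares the whole pattern of counts against each reference rather than tallying hits.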

25 Claims

1. A method for identifying the language of a document in which a computer document is written, comprising the steps of:
- comparing a plurality of words from the document to a word list associated with a candidate language, wherein words in the word list are a selection of a small number of the most frequently used words in the candidate language;
  
  accumulating a count of matches between words in the document and words in the word list for each word in the word list to produce a sample count for each word in the word list;
  
  correlating the sample count to a reference count for each word in the word list for the candidate language to produce a correlation score for the candidate language, wherein the correlation score is a statistical measure of a collective strength of association between the sample counts and reference counts; and
  
  identifying the language of the document based on the correlation score.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11):
  - 2. The method as recited in claim 1 wherein the comparing, accumulating, correlating and identifying steps are practiced for a plurality of candidate languages each with a respective word list and a respective reference count for each word in the word list and the language of the document is identified as the candidate language having a reference count which generates a highest correlation score.
  - 3. The method as recited in claim 2, wherein sample counts are produced for each respective candidate language and the sample counts and reference counts consist of counts for individual words in the word list for the respective candidate language.
  - 4. The method as recited in claim 2, wherein one sample count is produced for each matching word in the document and the sample counts and reference counts comprise counts for individual words in a plurality of candidate languages.
  - 5. The method as recited in claim 4, wherein a count for a word is dropped from the correlating step if the count for the word in the reference count and the sample count are both zero.
  - 6. The method as recited in claim 2 wherein the words in each word list have a substantially equivalent aggregate frequency of occurrence within the respective candidate language as the words in the other word lists.
  - 7. The method as recited in claim 2 wherein the process stops when a highest correlation score for a first respective candidate language exceeds a next highest correlation score for a second candidate language by a predetermined amount.
  - 8. The method as recited in claim 1, wherein a single candidate language is compared to the document and the language of the document is identified as the candidate language if the correlation score exceeds a predetermined score.
  - 9. The method as recited in claim 1 wherein the comparing, accumulating, correlating and identifying steps are practiced on all the words in the document.
  - 10. The method as recited in claim 1 wherein the process stops when the correlation score exceeds a predetermined score.
  - 11. The method as recited in claim 1 wherein words from the document greater than a predetermined length are truncated before the comparing step.
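Several of the dependent claims above describe refinements that are easy to picture in code: pooled count vectors across languages with double-zero entries dropped (claims 4 and 5), stopping once the leader is far enough ahead (claim 7), and truncating long words (claim 11). The following Python sketch is one possible reading, not the patent's implementation. Pearson correlation is again an assumed choice of measure, and the function names, the default `max_len=6` and the default `margin=0.2` are all hypothetical values picked for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (assumed measure)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def truncate(word, max_len=6):
    """Claim 11: cut words longer than a predetermined length before comparing.

    Assumption: word-list entries would be truncated the same way so that
    truncated document words can still match them.
    """
    return word[:max_len]


def drop_double_zeros(sample, reference):
    """Claim 5: drop a word when both its sample and reference counts are zero."""
    kept = [(s, r) for s, r in zip(sample, reference) if s != 0 or r != 0]
    return [s for s, _ in kept], [r for _, r in kept]


def pooled_scores(sample_by_word, reference):
    """Claim 4: one sample count per word, vectors spanning all candidate languages."""
    vocabulary = sorted({w for ref in reference.values() for w in ref})
    scores = {}
    for language, ref in reference.items():
        sample = [sample_by_word.get(w, 0) for w in vocabulary]
        expected = [ref.get(w, 0) for w in vocabulary]
        sample, expected = drop_double_zeros(sample, expected)
        scores[language] = pearson(sample, expected)
    return scores


def clear_winner(scores, margin=0.2):
    """Claim 7: report the leader only if it beats the runner-up by `margin`."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if len(ranked) < 2:
        return None
    (best, top), (_, second) = ranked[0], ranked[1]
    return best if top - second > margin else None
```

In a streaming setup, `clear_winner` would be called periodically as counts accumulate, and processing would stop the first time it returns a language instead of `None`.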

12. A system including processor and memory for identifying the language of a document in which a computer document is written, comprising:
- means for comparing a plurality of words from the document to a word list associated with a candidate language, wherein words in the word list are a selection of a small number of the most frequently used words in the candidate language;
  
  means for accumulating a count of matches between words in the document and words in the word list for each word in the word list to produce a sample count for each word in the word list;
  
  means for correlating the sample count to a reference count for each word in the word list for the candidate language to produce a correlation score for the candidate language, wherein the correlation score is a statistical measure of a collective strength of association between the sample counts and reference counts; and
  
  means for identifying the language of the document based on the correlation score.
- Dependent claims (13, 14, 15, 16, 17, 18, 19):
  - 13. The system as recited in claim 12 wherein the comparing, accumulating, correlating and identifying means for a plurality of candidate languages each with a respective word list and a respective reference count and the language of the document is identified as the candidate language having a reference count which generates a highest correlation score.
  - 14. The system as recited in claim 13, wherein sample counts are produced for each respective candidate language and the sample counts and reference counts consist of counts for individual words in the word list for the respective candidate language.
  - 15. The system as recited in claim 13 wherein the words in each word list have a substantially equivalent aggregate frequency of occurrence within the respective candidate language as the words in the other word lists.
  - 16. The system as recited in claim 13 wherein the system stops when a highest correlation score for a first respective candidate language exceeds a next highest correlation score for a second candidate language by a predetermined amount.
  - 17. The system as recited in claim 13 wherein words from the document greater than a predetermined length are truncated before the comparing step.
  - 18. The system as recited in claim 12, wherein a single candidate language is compared to the document and the language of the document is identified as the candidate language if the correlation score exceeds a predetermined score.
  - 19. The system as recited in claim 12 wherein the system stops when the correlation score exceeds a predetermined score.
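Claims 8 and 10 of the method, and claims 18 and 19 of the system, cover a single-candidate variant: the document is tested against one language and processing stops once the correlation score exceeds a predetermined score. A rough Python sketch of that idea follows. The claims do not say when the score is evaluated, so the periodic check (`check_every`) and the minimum number of matches (`min_matches`) are assumptions added here to avoid judging on too little evidence; the threshold value, the function names and the use of Pearson correlation are likewise illustrative assumptions.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (assumed measure)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def matches_language(words, reference, threshold=0.9, check_every=50, min_matches=10):
    """Test one candidate language, stopping early once the score passes `threshold`.

    Returns (is_match, words_examined).
    """
    vocabulary = list(reference)
    expected = [reference[w] for w in vocabulary]
    counts = dict.fromkeys(vocabulary, 0)
    matched = 0
    examined = 0
    for word in words:
        examined += 1
        if word in counts:
            counts[word] += 1
            matched += 1
        # Periodically re-score; stop as soon as the threshold is exceeded.
        if examined % check_every == 0 and matched >= min_matches:
            if pearson([counts[w] for w in vocabulary], expected) > threshold:
                return True, examined
    # End of document: make the final decision on everything seen.
    score = pearson([counts[w] for w in vocabulary], expected)
    return score > threshold, examined
```

Returning the number of words examined makes the saving from early termination visible to the caller.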

20. A computer program product in a computer readable medium for identifying the language of a document in which a computer document is written, comprising:
- means for comparing a plurality of words from the document to a word list associated with a candidate language, wherein words in the word list are a selection of a small number of the most frequently used words in the candidate language;
  
  means for accumulating a count of matches between words in the document and words in the word list for each word in the word list to produce a sample count for each word in the word list;
  
  means for correlating the sample count to a reference count for each word in the word list for the candidate language to produce a correlation score for the candidate language, wherein the correlation score is a statistical measure of a collective strength of association between the sample counts and reference counts; and
  
  means for identifying the language of the document based on the correlation score.
- Dependent claims (21, 22, 23, 24, 25):
  - 21. The product as recited in claim 20 wherein the comparing, accumulating, correlating and identifying means use a plurality of candidate languages each with a respective word list and a respective reference count and the language of the document is identified as the candidate language having a reference count which generates a highest correlation score.
  - 22. The product as recited in claim 21, wherein sample counts are produced for each respective candidate language and the sample counts and reference counts consist of counts for individual words in the word list for the respective candidate language.
  - 23. The product as recited in claim 21, wherein one sample count is produced for each matching word in the document and the sample counts and reference counts comprise counts for individual words in a plurality of candidate languages.
  - 24. The product as recited in claim 21, wherein a single candidate language is compared to the document and the language of the document is identified as the candidate language if the correlation score exceeds a predetermined score.
  - 25. The product as recited in claim 20 wherein words from the document greater than a predetermined length are truncated before the comparing step.

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Paulsen, Robert Charles Jr.; Martino, Michael John
Primary Examiner(s): Thomas, Joseph
Application Number: US08/769,842
Time in Patent Office: 1,145 Days
Field of Search: 704/1, 704/2, 704/3, 704/4, 704/8, 704/9, 704/7, 707/531, 707/533, 707/535, 707/536
US Class Current: 704/8
CPC Class Codes: G06F 40/216 (using statistical methods); G06F 40/263 (Language identification)
