Word counting natural language determination

US 6,704,698 B1
Filed: 08/19/1996
Issued: 03/09/2004
Est. Priority Date: 03/14/1994
Status: Expired due to Fees

Abstract

A technique for identifying a language in which a computer document is written. Words from the document are compared to words in a plurality of word tables. Each of the word tables is associated with a respective candidate language and contains a selection of the most frequently used words in the language. The words in each word table are selected based on the frequency of occurrence in a candidate language so that each word table covers an equivalent percentage of the associated candidate language. A count is accumulated for each candidate language each time one of the plurality of words from the document is present in the associated word table. In the simple counting embodiment of the invention, the count is incremented by one. The language of the document is identified as the language associated with the count having the highest value.
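
The simple counting embodiment described above can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the `WORD_TABLES` contents, the tokenizing regular expression, and the function name `identify_language` are all assumptions made for the example, and real word tables would hold far more high-frequency words.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny word tables (assumed for illustration only).
WORD_TABLES = {
    "english": {"the", "of", "and", "to", "a", "in", "is", "that"},
    "french": {"le", "la", "de", "et", "les", "des", "un", "est"},
    "german": {"der", "die", "und", "das", "in", "den", "ist", "zu"},
}


def identify_language(text, word_tables=WORD_TABLES):
    """Return (best_language, counts) using simple counting."""
    counts = Counter({lang: 0 for lang in word_tables})
    # Assumed tokenizer: lowercase runs of Unicode letters.
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        for lang, table in word_tables.items():
            if word in table:
                counts[lang] += 1  # increment by one per table hit
    # Highest count wins; ties fall to the first language listed.
    best = max(counts, key=counts.get)
    return best, dict(counts)


language, counts = identify_language(
    "Le chat est dans la maison et le chien est dans le jardin"
)
print(language, counts)
```

A word such as "in" appears in more than one table here, so it increments every matching language; the decision rests on the overall totals rather than on any single word.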

20 Claims

1. A method for identifying a language in which a computer document is written, comprising the steps of:
- comparing a plurality of words from the document to words in a plurality of word tables, each word table associated with and containing a selection of most frequently used words in a respective candidate language;
  
  accumulating a respective count for each candidate language each time one of the plurality of words from the document is present in the associated word table; and
  
  identifying the language of the document as the language associated with the count having the highest value.
2. The method as recited in claim 1, further comprising the step of selecting the words in each word table based on frequency of occurrence in a candidate language so that each word table covers an equivalent percentage of the associated candidate language.
3. The method as recited in claim 2, further comprising the step of storing the frequency of occurrence of each word in a respective candidate language in the word table for the respective candidate language.
4. The method as recited in claim 3, wherein the accumulating step comprises the steps of:
5. The method as recited in claim 4, further comprising the steps of:
- multiplying a total count of each word in a respective accumulator by the stored frequency of occurrence for the word in the word table to produce a set of weighted counts;
  
  summing the set of weighted counts to produce an aggregate weighted count once the plurality of words have been compared to the word tables; and
  
  identifying the language of the document as the language associated with the aggregate weighted count having the highest value.
6. The method as recited in claim 1 wherein the plurality of words represent a subset of the total number of words in the document and the method further comprises the steps of:
- counting the plurality of words as each of the plurality is compared to the words in the word tables; and
  
  responsive to the count of the plurality of words reaching a predetermined number, stopping the comparing and accumulating steps.
7. The method as recited in claim 5, further comprising the steps of:
- counting the plurality of words as each of the plurality is compared to the words in the word tables;
  
  using the identified language according to the aggregate weighted count as the identified language of the document as long as the count of the plurality of words is less than a predetermined number; and
  
  using the identified language according to the count as the identified language once the count of the plurality of words reaches the predetermined number.
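
One way to read claims 3 through 7 together is sketched below: per-word counts are multiplied by stored frequencies and summed into an aggregate weighted count, the weighted result is used while fewer than a predetermined number of words have been compared, and the simple count is used once that number is reached. The steps of claim 4 are not given in this text, so the per-word accumulators are an assumption, as are the `FREQ_TABLES` values (made up), the name `identify`, and the `threshold` parameter.

```python
from collections import Counter

# Hypothetical tables mapping word -> stored frequency of occurrence.
# The numbers are invented for illustration.
FREQ_TABLES = {
    "english": {"the": 0.062, "of": 0.036, "and": 0.029, "in": 0.021},
    "german": {"der": 0.035, "die": 0.034, "und": 0.029, "in": 0.018},
}


def identify(words, freq_tables=FREQ_TABLES, threshold=100):
    """Return (best_language, simple_counts, weighted_counts)."""
    per_word = {lang: Counter() for lang in freq_tables}
    compared = 0
    for word in words:
        if compared >= threshold:
            break  # stop once the predetermined number of words is reached
        compared += 1
        for lang, table in freq_tables.items():
            if word in table:
                per_word[lang][word] += 1
    # Simple count: total table hits per language.
    simple = {lang: sum(c.values()) for lang, c in per_word.items()}
    # Aggregate weighted count: per-word count times stored frequency, summed.
    weighted = {
        lang: sum(n * freq_tables[lang][w] for w, n in c.items())
        for lang, c in per_word.items()
    }
    # Weighted result for short samples, simple count at or past the threshold.
    scores = weighted if compared < threshold else simple
    return max(scores, key=scores.get), simple, weighted


best, simple, weighted = identify("die katze und der hund in der stadt".split())
print(best, simple, weighted)
```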

8. A system including a processor for identifying a language in which a target computer document is written, comprising:
- a memory for storing the target document and a plurality of word tables, each word table associated with and containing a selection of most frequently used words in a respective candidate language;
  
  a comparator for comparing a plurality of words from the document to words in the word tables;
  
  a set of accumulators for accumulating a respective count for each candidate language by one each time one of the plurality of words from the document is present in a word table, each accumulator associated with a respective word table; and
  
  a language identifier for identifying the language of the target document as the language associated with the count having the highest value.
9. The system as recited in claim 8 further comprising:
10. The system as recited in claim 9 further comprising means to associate the frequency of occurrence value with the selected words in each of the word tables.
11. The system as recited in claim 8 wherein special words which occur in only one candidate language are included in a respective word table and wherein when the comparator detects a special word in the target document greater weight is given in the accumulated count for the respective candidate language.
12. The system as recited in claim 8 which stops once a predetermined number of words from the target document are compared to the words in the word tables.
13. The system as recited in claim 8 which stops once a predetermined amount of divergence is detected in the set of accumulators.
14. The system as recited in claim 8 wherein a predetermined minimum number of words from the target document must be compared before identifying the language of the target document.
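
The stopping and weighting behaviors in claims 11 through 14 might be combined along the following lines. The claims do not define how divergence is measured or how much extra weight a special word receives, so the gap between the two highest accumulators and the `special_weight` value are assumptions, as are the table contents and every name in the sketch.

```python
# Hypothetical word tables and "special" words assumed to occur in only one
# candidate language.
TABLES = {
    "english": {"the", "of", "and", "in"},
    "dutch": {"de", "van", "en", "in", "het"},
}
SPECIAL = {"english": {"the"}, "dutch": {"het"}}


def identify_with_early_stop(words, tables=TABLES, special=SPECIAL,
                             special_weight=3, min_words=5,
                             divergence=4, max_words=1000):
    """Return (language_or_None, counts, words_compared)."""
    counts = {lang: 0 for lang in tables}
    compared = 0
    for word in words:
        compared += 1
        for lang, table in tables.items():
            if word in table:
                # Special words carry greater weight (claim 11).
                is_special = word in special.get(lang, ())
                counts[lang] += special_weight if is_special else 1
        if compared >= max_words:
            break  # predetermined number of words reached (claim 12)
        if compared >= min_words and len(counts) > 1:
            ranked = sorted(counts.values(), reverse=True)
            # Assumed divergence measure: lead of the top accumulator
            # over the runner-up (claim 13).
            if ranked[0] - ranked[1] >= divergence:
                break
    if compared < min_words:
        return None, counts, compared  # too few words to decide (claim 14)
    return max(counts, key=counts.get), counts, compared


result = identify_with_early_stop(
    "het huis van de man en het kind in de tuin".split()
)
print(result)
```

With these invented settings the loop stops after five words, because the Dutch accumulator already leads by more than the assumed divergence of 4.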

15. A system comprising a memory and a processor for identifying a language in which a computer document is written, wherein a plurality of words from the document are compared to words in a plurality of word tables, each word table associated with and containing a selection of most frequently used words in a respective candidate language, a respective weighted count is accumulated for each candidate language each time one of the plurality of words from the document is present in the associated word table, and identifying the language of the document as the language associated with the count having the highest value, the improvement comprising:
- the words in each word table are selected based on frequency of occurrence in a candidate language so that each word table covers an equivalent percentage of the associated candidate language.
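
A table that covers a chosen percentage of a language could be built roughly as shown here: take the most frequent words from a sample corpus until their combined share reaches the target. The corpus, the `coverage` value, and the function name `build_word_table` are assumptions for illustration; the claim does not say how the percentage is chosen or measured.

```python
from collections import Counter


def build_word_table(corpus_words, coverage=0.4):
    """Map the most frequent words to their relative frequency, adding words
    until they account for at least `coverage` of the corpus."""
    counts = Counter(corpus_words)
    total = sum(counts.values())
    table, covered = {}, 0
    for word, n in counts.most_common():
        if covered / total >= coverage:
            break
        table[word] = n / total  # stored frequency of occurrence
        covered += n
    return table


# Toy corpus standing in for a large language sample.
corpus = "the cat and the dog and the bird saw the fox".split()
table = build_word_table(corpus, coverage=0.5)
print(table)
```

Applying the same `coverage` target to each candidate language's corpus yields tables of different sizes that each represent a comparable share of running text, which keeps the per-language counts comparable.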

16. A computer program product on a computer readable medium for identifying a language in which a computer document is written, comprising:
- a plurality of word tables, each word table associated with and containing a selection of most frequently used words in a respective candidate language;
  
  means for comparing a plurality of words from the document to the words in the word tables;
  
  means for accumulating a respective count for each candidate language each time one of the plurality of words from the document is present in the associated word table; and
  
  means for identifying the language of the document as the language associated with the count having the highest value.
17. The product as recited in claim 16 further comprising:
18. The product as recited in claim 16, wherein the frequency of occurrence of each word in each word table is stored in the word table and further comprises:
- means for individually counting occurrences of each respective word in the word tables in the document;
  
  means for counting a total number of words in the plurality of words compared to words in the word tables;
  
  means responsive to the total number of compared words being less than a predetermined number for multiplying the individual count of each word by the stored frequency of occurrence for the word in the word table to produce a set of weighted counts and for summing the set of weighted counts to produce an aggregate weighted count once the plurality of words have been compared to the word tables; and
  
  means responsive to the total number of compared words being at least the predetermined number for summing the counts in each set of accumulators.
19. The product as recited in claim 16 which further comprises a set of word tables which represent genres within a candidate language.
20. The product as recited in claim 16, further comprising:
- means for counting the number of words from the document compared to the words in the word tables; and
  
  means for stopping the comparing and accumulating means once a predetermined number of words from the target document are compared to the words in the word tables.
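
Claim 19 mentions word tables that represent genres within a candidate language but does not say how they are used. One possible arrangement, offered purely as an assumption, keys each table by a (language, genre) pair and reports the pair whose table collects the most hits. The table contents and the name `identify_language_and_genre` are hypothetical.

```python
# Hypothetical genre-specific tables keyed by (language, genre).
GENRE_TABLES = {
    ("english", "legal"): {"the", "of", "shall", "herein", "party"},
    ("english", "technical"): {"the", "of", "system", "data", "method"},
    ("french", "general"): {"le", "la", "de", "et", "les"},
}


def identify_language_and_genre(words, genre_tables=GENRE_TABLES):
    """Return (language, genre, table_counts) for the best-matching table."""
    table_counts = {key: 0 for key in genre_tables}
    for word in words:
        for key, table in genre_tables.items():
            if word in table:
                table_counts[key] += 1
    # Assumed rule: the table with the most hits gives language and genre.
    language, genre = max(table_counts, key=table_counts.get)
    return language, genre, table_counts


lang, genre, table_counts = identify_language_and_genre(
    "the party shall pay the fee of the other party herein".split()
)
print(lang, genre, table_counts)
```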

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Paulsen, Robert Charles Jr.; Martino, Michael John
Primary Examiner(s): EDOUARD, PATRICK NESTOR
Application Number: US08/699,412
Time in Patent Office: 2,759 Days
Field of Search: 707/531, 707/533, 707/535, 707/536, 704/1, 704/8, 704/9
US Class Current: 704/1
CPC Class Codes: G06F 40/216 (using statistical methods); G06F 40/263 (Language identification)