Natural language determination using partial words

US 6,216,102 B1
Filed: 09/30/1996
Issued: 04/10/2001
Est. Priority Date: 08/19/1996
Status: Expired due to Fees

Abstract

The language in which a document is written is identified by comparing the document's short and truncated words to word tables of the most frequently used words in each respective candidate language. First, a plurality of words from a document is read into a computer memory. Then, words within the plurality of words which exceed a predetermined length are truncated to produce a set of short and truncated words. The set of short and truncated words is compared to words in a plurality of word tables. Each word table is associated with and contains a selection of most frequently used words in a respective candidate language. Although the most frequently used words in most languages tend to be short, those which exceed the predetermined length may be truncated in the word tables. A respective count for each candidate language is accumulated each time one of the set of short and truncated words from the document matches a word in a word table associated with the candidate language. In some embodiments, the count may be weighted by factors related to the frequency of occurrence of the words in the respective candidate languages. The language of the document is identified as the language associated with the count having the highest value.
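The steps in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the truncation length of 4, the toy word lists, and the names `MAX_LEN`, `COMMON_WORDS`, `truncate`, and `identify_language` are all assumptions made for illustration, and real tables would be built from corpus frequency statistics.

```python
from collections import Counter

MAX_LEN = 4  # assumed "predetermined length"; the source does not give a value

# Hypothetical toy word lists; real tables would hold each language's
# most frequently used words, selected from corpus statistics.
COMMON_WORDS = {
    "english": ["the", "of", "and", "to", "in", "that", "which", "with"],
    "german": ["der", "die", "und", "in", "den", "von", "nicht", "werden"],
    "french": ["le", "de", "et", "les", "des", "que", "dans", "pour"],
}


def truncate(word, max_len=MAX_LEN):
    """Keep short words whole; cut longer ones to their first max_len letters."""
    return word[:max_len]


# Table entries are truncated the same way as the document's words.
WORD_TABLES = {
    lang: {truncate(w) for w in words} for lang, words in COMMON_WORDS.items()
}


def identify_language(text):
    """Return (best_language, counts) using unweighted match counts."""
    counts = Counter({lang: 0 for lang in WORD_TABLES})
    for raw in text.lower().split():
        word = truncate(raw.strip(".,;:!?\"'()"))
        if not word:
            continue
        # A word may match tables of several languages; each match counts.
        for lang, table in WORD_TABLES.items():
            if word in table:
                counts[lang] += 1
    # Ties resolve to the first language in table order.
    best = max(counts, key=counts.get)
    return best, dict(counts)
```

Calling `identify_language("Der Hund und die Katze werden nicht in den Garten gehen")` with these toy tables selects `"german"`: "werden" and "nicht" match through their truncated forms "werd" and "nich", while "in" also adds one count for English.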

18 Claims

1. A method for identifying the language in which a document is written, comprising the steps of:
- reading a plurality of words from a document into a computer memory;
  
  truncating words within the plurality of words which exceed a predetermined length to produce a set of short and truncated words;
  
  comparing the set of short and truncated words to words in a plurality of word tables, each word table associated with and containing a selection of most frequently used words in a respective candidate language, wherein the most frequently used words which exceed the predetermined length are truncated in the word tables;
  
  accumulating a respective count for each candidate language each time one of the set of short and truncated words from the document matches a word in a word table associated with the candidate language; and
  
  identifying the language of the document as the language associated with the count having the highest value.
2. The method as recited in claim 1 further comprising the steps of:
3. The method as recited in claim 1 further comprising the steps of:
- accumulating a respective weighted count for each candidate language equivalent to an aggregate frequency of occurrence for each time one of the set of short and truncated words from the document matches a word in a word table associated with the candidate language;
  
  determining a number of the short and truncated words compared to the word tables; and
  
  using the respective weighted count to identify the language of the document until the number of short and truncated words exceeds a predetermined number.
4. The method as recited in claim 1 further comprising the step of selecting most frequently used words for the word tables to reduce strong aliasing between candidate languages.
5. The method as recited in claim 1 wherein a first truncated word is a weak alias of a first short word in a word table so that the first truncated word matches the first short word in the comparing step.
6. The method as recited in claim 1 wherein a plurality of tableaus of word tables is associated with each candidate language each tableau of word tables for comparing words of a respective length and the method further comprises the steps of:
- determining the length of each word in the plurality of words;
  
  comparing words of a respective length in the set of short and truncated words with the tableaus for each respective candidate language for words of the respective length;
  
  wherein the truncated words are compared to the tableaus for words of a longest respective length.
7. The method as recited in claim 1 wherein the words are stored in the tables as bits set for a presence of an ordered letter pair in one of the most frequently used words in the respective candidate language.
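One possible reading of the storage described in claims 6 and 7 is sketched below in Python. The claims do not specify the table layout, so everything here is an assumption: one tableau per word length, one 26 x 26 bitmap per pair of letter positions, lowercase ASCII letters only, a truncation length of 4, and the hypothetical names `LanguageTableaus`, `position_pairs`, and `bit_index`.

```python
from itertools import combinations

MAX_LEN = 4  # assumed "predetermined length"; not specified in the source


def truncate(word, max_len=MAX_LEN):
    return word[:max_len]


def is_ascii_lower(word):
    return bool(word) and all("a" <= c <= "z" for c in word)


def position_pairs(length):
    """Position pairs (i, j) with i < j; a 1-letter word pairs with itself."""
    return list(combinations(range(length), 2)) or [(0, 0)]


def bit_index(a, b):
    """Map an ordered letter pair to one of 26 * 26 bit positions."""
    return (ord(a) - ord("a")) * 26 + (ord(b) - ord("a"))


class LanguageTableaus:
    """One tableau per word length; one bitmap per position pair (assumed layout)."""

    def __init__(self, common_words):
        # tableaus[length][(i, j)] is a Python int used as a 676-bit bitmap.
        self.tableaus = {}
        for word in common_words:
            w = truncate(word.lower())
            if not is_ascii_lower(w):
                continue  # this sketch only handles the letters a-z
            tableau = self.tableaus.setdefault(len(w), {})
            for i, j in position_pairs(len(w)):
                bit = 1 << bit_index(w[i], w[j])
                tableau[(i, j)] = tableau.get((i, j), 0) | bit

    def matches(self, word):
        """True if every ordered letter pair of the truncated word has its bit set."""
        w = truncate(word.lower())
        if not is_ascii_lower(w):
            return False
        tableau = self.tableaus.get(len(w))
        if tableau is None:
            return False
        return all(
            (tableau.get((i, j), 0) >> bit_index(w[i], w[j])) & 1
            for i, j in position_pairs(len(w))
        )
```

Under this layout, truncated words are looked up in the tableau for the longest length, so "whichever" is tested as "whic". Because each bitmap records letter pairs rather than whole words, a word outside the list can still match when all of its pairs were set by other entries: with the toy list `["bat", "bed", "cad"]`, `matches("bad")` returns True.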

8. A method for identifying the language in which a document is written, comprising the steps of:
- reading a plurality of words from a document;
  
  truncating words within the plurality of words which exceed a predetermined length to produce a set of short and truncated words;
  
  comparing the set of short and truncated words to words in a plurality of word tables, each word table associated with and containing a selection of most frequently used words in a respective candidate language, wherein the most frequently used words which exceed the predetermined length are truncated in the word tables;
  
  accumulating a respective weighted count for each candidate language each time one of the set of short and truncated words from the document matches a word in a word table associated with the candidate language; and
  
  identifying the language of the document as the language associated with the weighted count having the highest value.
9. The method as recited in claim 8 wherein a weight added to the respective weighted count is proportionate to a frequency of occurrence for the matched word in the respective candidate language.
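The weighted counting of claims 8 and 9 might look like the following Python sketch. The weights are made-up integers standing in for occurrences per thousand words, not measured frequencies, and `FREQ_PER_THOUSAND`, `build_table`, and `weighted_scores` are hypothetical names. Summing the weights of entries that collide after truncation is also an assumption; the source does not say how such collisions are handled.

```python
MAX_LEN = 4  # assumed "predetermined length"; not specified in the source
PUNCT = ".,;:!?\"'()"

# Made-up illustrative weights (occurrences per thousand words), NOT real data.
FREQ_PER_THOUSAND = {
    "english": {"the": 60, "of": 30, "and": 28, "in": 20},
    "german": {"der": 30, "die": 30, "und": 25, "in": 18},
}


def build_table(freqs, max_len=MAX_LEN):
    """Truncate table words; entries that collide after truncation add up."""
    table = {}
    for word, weight in freqs.items():
        key = word[:max_len]
        table[key] = table.get(key, 0) + weight
    return table


TABLES = {lang: build_table(f) for lang, f in FREQ_PER_THOUSAND.items()}


def weighted_scores(text, tables=TABLES, max_len=MAX_LEN):
    """Add each matched word's weight to the score of its language."""
    scores = {lang: 0 for lang in tables}
    for raw in text.lower().split():
        key = raw.strip(PUNCT)[:max_len]
        for lang, table in tables.items():
            scores[lang] += table.get(key, 0)
    return scores


def identify_language_weighted(text):
    """Pick the language with the highest weighted count."""
    scores = weighted_scores(text)
    return max(scores, key=scores.get), scores
```

With these toy weights, `identify_language_weighted("in der Stadt und in die Schule")` returns `"german"` with scores of 121 for German and 40 for English. A word shared by both tables, such as "in", contributes to each language in proportion to its assumed frequency there.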

10. A system including processor and memory for identifying the language in which a document is written comprising:
- means for reading a plurality of words from a document into the memory;
  
  means for determining a length of each word in the plurality of words;
  
  means for truncating words within the plurality of words which exceed a predetermined length to produce a set of short and truncated words;
  
  means for comparing the set of short and truncated words to words in a plurality of word tables, each word table associated with and containing a selection of most frequently used words in a respective candidate language;
  
  means for accumulating a respective count for each candidate language each time one of the set of short and truncated words from the document matches a word in a word table associated with the candidate language; and
  
  means for identifying the language of the document as the language associated with the count having the highest value.
11. The system as recited in claim 10, wherein the most frequently used words which exceed the predetermined length are truncated in the word tables.
12. The system as recited in claim 10 further comprising the step of selecting most frequently used words for the word tables to reduce strong aliasing between candidate languages.
13. The system as recited in claim 10 wherein a first truncated word is a weak alias of a first short word in a word table so that the first truncated word matches the first short word in the comparing step.
14. The system as recited in claim 11 wherein a first truncated word is a weak alias of a second truncated word in a word table so that the first truncated word matches the second truncated word in the comparing step.
15. The system as recited in claim 10 further comprising:

16. A computer program product on a computer readable medium for identifying the language in which a document is written comprising:
- means for determining a length of each word in a plurality of words from the document;
  
  means for truncating words within the plurality of words which exceed a predetermined length to produce a set of short and truncated words;
  
  means for comparing the set of short and truncated words to words in a plurality of word tables, each word table associated with and containing a selection of most frequently used words in a respective candidate language;
  
  means for accumulating a respective count for each candidate language each time one of the set of short and truncated words from the document matches a word in a word table associated with the candidate language; and
  
  means for identifying the language of the document as the language associated with the count having the highest value.
17. The product as recited in claim 16, wherein the most frequently used words which exceed the predetermined length are truncated in the word tables.
18. The product as recited in claim 16 further comprising means for weighting the respective counts for each candidate language according to frequencies of occurrence of the most frequently used words in each respective candidate language to produce weighted counts wherein the weighted counts can be used by the identifying means.

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Paulsen, Robert Charles Jr.; Martino, Michael John
Primary Examiner(s): Isen, Forester W.
Assistant Examiner(s): EDOUARD, PATRICK NESTOR
Application Number: US08/723,815
Time in Patent Office: 1,653 Days
Field of Search: 704/1-2, 704/5, 704/8-9, 704/10, 707/530-532, 707/536
US Class Current: 704/9
CPC Class Codes:
G06F 40/216 using statistical methods
G06F 40/263 Language identification
