Word storage table for natural language determination

US 6,009,382 A
Filed: 09/30/1996
Issued: 12/28/1999
Est. Priority Date: 08/19/1996
Status: Expired due to Fees

Abstract

The language in which a document is written is identified through the use of sets of most frequently used words in each of a plurality of candidate languages. Each set of most frequently used words is stored in a respective set of word tables for a respective candidate language according to letter pairs in each set of most frequently used words. In the preferred embodiment, each word table is an N×N bit table, where each bit represents a given letter pair at a particular place in one of the most frequently used words in one of the candidate languages. Words from the document are compared to the most frequently used words stored in the word tables. A count of the number of matches between the words from the document and the words stored in each respective set of word tables is kept for each respective language. The language of the document is identified as the respective candidate language having the greatest number of matches.
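The storage scheme in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the abstract does not say which letter pairs are used, so the code assumes adjacent pairs, N = 27 (26 lowercase letters plus a blank), and a hypothetical `WordTable` class that packs each N×N bit table into a Python integer.

```python
from string import ascii_lowercase

# Assumption: N = 27 symbols (26 lowercase letters plus a blank).
ALPHABET = ascii_lowercase + " "
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}
N = len(ALPHABET)


class WordTable:
    """Hypothetical container: one N x N bit table per adjacent letter-pair position."""

    def __init__(self, word_length):
        self.word_length = word_length
        # tables[k] covers the pair at positions (k, k + 1); its N*N bits
        # are packed into a single Python int.
        self.tables = [0] * (word_length - 1)

    @staticmethod
    def _bit(first, second):
        # Row = first letter of the pair, column = second letter.
        return 1 << (INDEX[first] * N + INDEX[second])

    def store(self, word):
        """Set one bit per pair position for a word of the table's length."""
        assert len(word) == self.word_length
        for k in range(self.word_length - 1):
            self.tables[k] |= self._bit(word[k], word[k + 1])

    def matches(self, word):
        """True if every pair of `word` has its bit set at the same position."""
        if len(word) != self.word_length or any(ch not in INDEX for ch in word):
            return False
        return all(
            self.tables[k] & self._bit(word[k], word[k + 1])
            for k in range(self.word_length - 1)
        )


table = WordTable(4)
table.store("than")
table.store("when")
print(table.matches("than"), table.matches("then"), table.matches("that"))
```

Under these assumptions a lookup can succeed for a word that was never stored: after storing "than" and "when", the word "then" also matches because each of its pairs was contributed by one of the stored words. This is the kind of aliasing that claim 12 refers to when it speaks of choosing words to avoid strong aliasing.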

144 Citations

17 Claims

1. A method for identifying a plurality of character strings, comprising the steps of:
- selecting a set of character strings;
  
  storing the set of character strings in a word table as a set of ordered character pairs, wherein each word table is an N×N bit table, wherein each bit represents a given character pair at a particular place in one of the character strings;
  
  comparing new character strings to the character strings stored in the word tables;
  
  counting a number of matches between the new character strings and the character strings in the word table;
  
  identifying the new character strings as related to the character strings in the word table if there are a sufficient number of matches.
- Dependent Claims (2, 4, 5, 6, 7)
- - 2. The method as recited in claim 1 wherein the character strings are words, the set of character strings in the word table are words from a natural language, the new character strings are words from a document and if there are a sufficient number of matches the new character strings are identified as being in the natural language of the word table.
  - 4. The method as recited in claim 1 wherein special characters are also represented in the word tables.
  - 5. The method as recited in claim 4 wherein one of the special characters is a blank character so that most frequently used words of different lengths can be stored in the same set of word tables.
  - 6. The method as recited in claim 1 wherein each respective set of tables contains tableaus of tables for words of a respective length so that all the words in a given tableau of tables are a given length.
  - 7. The method as recited in claim 6 wherein some of the most frequently used words in at least one candidate language are truncated.
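Claim 5's blank character can be illustrated as follows. This is a sketch under stated assumptions, not the patent's method: adjacent letter pairs, a hypothetical fixed width `WIDTH` of 5, padding on the right with blanks, and truncation of longer words (loosely echoing claim 7). The helper names `pad`, `pair_bits`, `build_tables`, and `lookup` are invented for the example.

```python
# Assumption: the last symbol of the alphabet is the blank used for padding.
ALPHABET = "abcdefghijklmnopqrstuvwxyz "
N = len(ALPHABET)
WIDTH = 5  # hypothetical fixed width shared by all stored words


def pad(word):
    """Truncate to WIDTH, then right-pad with blanks to exactly WIDTH symbols."""
    return word[:WIDTH].ljust(WIDTH)


def pair_bits(word):
    """Yield (pair position, bit index) for each adjacent pair of the padded word."""
    padded = pad(word)
    for k in range(WIDTH - 1):
        yield k, ALPHABET.index(padded[k]) * N + ALPHABET.index(padded[k + 1])


def build_tables(words):
    """Store words of differing lengths in one set of WIDTH - 1 bit tables."""
    tables = [0] * (WIDTH - 1)
    for word in words:
        for k, bit in pair_bits(word):
            tables[k] |= 1 << bit
    return tables


def lookup(tables, word):
    """True if every padded pair of `word` is set in the matching table."""
    if not word or any(ch not in ALPHABET for ch in word):
        return False
    return all((tables[k] >> bit) & 1 for k, bit in pair_bits(word))


tables = build_tables(["the", "of", "which"])
print(lookup(tables, "of"), lookup(tables, "which"), lookup(tables, "off"))
```

Because the blank takes part in pairs such as "f" followed by blank, a short word like "of" is distinguished from a longer word that merely starts with the same letters.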

3. A method for identifying a language in which a document is written, comprising the steps of:
- selecting a set of most frequently used words in each of a plurality of candidate languages;
  
  storing each set of most frequently used words in a respective set of word tables for a respective candidate language according to letter pairs in each set of most frequently used words, wherein each word table is an N×N bit table, wherein each bit represents a given letter pair at a particular place in one of the most frequently used words in one of the candidate languages;
  
  comparing words from the document to the most frequently used words stored in the word tables;
  
  counting a number of matches between the words from the document and the words stored in each respective set of word tables;
  
  identifying the language of the document as the respective candidate language having the greatest number of matches.

8. A system including processor and memory for identifying a language in which a document is written, comprising:
- a plurality of word tables, each for a respective candidate language in which a set of most frequently used words are stored according to letter pairs in each of the most frequently used words, wherein each word table is an N×N bit table, wherein each bit represents a given letter pair at a particular place in one of the most frequently used words in one of the candidate languages;
  
  a comparator for comparing words from the document to the most frequently used words stored in the word tables;
  
  an accumulator for counting a number of matches between the words from the document and the words stored in each respective set of word tables;
  
  means for identifying the language of the document as the respective candidate language having the greatest number of matches.
- Dependent Claims (9, 10, 11, 12, 13)
- - 9. The system as recited in claim 8 wherein the plurality of word tables further comprises a plurality of tableaus of word tables for each respective candidate language, each tableau of word tables for storing words of a respective length.
  - 10. The system as recited in claim 9 further comprising:
    - length determining means for determining lengths of words in the document;
      
      word routing means for sending each word in the document to the tableaus according to the determined length of the word.
  - 11. The system as recited in claim 10 further comprising:
    - means for truncating words from the document which exceed the length of the words stored in the tableaus storing words of the longest respective length;
      
      wherein the word routing means routes the truncated words to the tableaus storing words of the longest respective length.
  - 12. The system as recited in claim 8 wherein the words in the word tables were chosen to avoid strong aliasing between respective candidate languages.
  - 13. The system as recited in claim 8 wherein the words in the word tables for each respective candidate language cover a substantially equivalent portion of each candidate language.
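The system of claims 8 through 11 (tableaus per word length, routing by length, truncation of overlong words, an accumulator, and selection of the highest count) might be sketched as below. Everything specific here is an assumption made for illustration: adjacent letter pairs, a hypothetical `MAX_LEN` of 4, skipping one-letter words, and the tiny English and German word lists, which are not taken from the patent.

```python
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
N = len(ALPHABET)
MAX_LEN = 4  # hypothetical longest tableau; longer words are truncated


def pair_bits(word):
    """Yield (pair position, bit index) for each adjacent letter pair."""
    for k in range(len(word) - 1):
        yield k, ALPHABET.index(word[k]) * N + ALPHABET.index(word[k + 1])


def build_tableaus(words):
    """Map word length -> list of N x N bit tables, one per pair position."""
    tableaus = {}
    for word in words:
        word = word[:MAX_LEN]
        if len(word) < 2:  # assumption: one-letter words carry no pair
            continue
        tables = tableaus.setdefault(len(word), [0] * (len(word) - 1))
        for k, bit in pair_bits(word):
            tables[k] |= 1 << bit
    return tableaus


def matches(tableaus, word):
    """Truncate, route the word to the tableau for its length, test every pair."""
    word = word.lower()[:MAX_LEN]
    tables = tableaus.get(len(word))
    if tables is None or any(ch not in ALPHABET for ch in word):
        return False
    return all((tables[k] >> bit) & 1 for k, bit in pair_bits(word))


def identify_language(text, tableaus_by_language):
    """Count matches per candidate language and return the highest scorer."""
    counts = Counter({lang: 0 for lang in tableaus_by_language})
    for word in text.split():
        for lang, tableaus in tableaus_by_language.items():
            if matches(tableaus, word):
                counts[lang] += 1
    return max(counts, key=counts.get), counts


# Illustrative word lists only; not the lists used in the patent.
candidates = {
    "english": build_tableaus(["the", "of", "and", "to", "in", "that", "is"]),
    "german": build_tableaus(["der", "die", "und", "in", "den", "von", "zu", "das"]),
}
language, counts = identify_language(
    "the cat is in the garden and that is that", candidates
)
print(language, dict(counts))
```

Keying the tableaus by length means the router is a dictionary lookup, and a word whose length has no tableau for a given language is rejected before any bit is tested.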

14. A computer program product on a computer readable medium for identifying a language in which a document is written, comprising:
- means for providing a plurality of word tables, each arranged in tableaus for storing words of a respective length and in a respective candidate language according to letter pairs in each of the stored words, wherein each word table is an N×N bit table, wherein each bit represents a given letter pair at a particular place in one of the most frequently used words in one of the candidate languages;
  
  means for comparing words from the document to the words stored in the word tables;
  
  means for counting a number of matches between the words from the document and the words stored in each respective set of word tables;
  
  means for identifying the language of the document as the respective candidate language having the greatest number of matches.
- Dependent Claims (15, 16, 17)
- - 15. The program as recited in claim 14 wherein the words stored in the word tables are a set of most frequently used words in a respective candidate language.
  - 16. The product as recited in claim 14 further comprising means for stopping the comparing and counting means when the number of matches between respective candidate languages reaches a predetermined degree of divergence.
  - 17. The product as recited in claim 14 further comprising means for transmitting the product over a network to a computer system.
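Claim 16's stopping condition can be sketched as follows. The claim does not define the "predetermined degree of divergence", so this example assumes it means the leading language's count exceeds the runner-up's by at least a hypothetical `min_lead`. Plain Python sets stand in for the bit-table lookup to keep the focus on the stopping rule; the function name and word lists are likewise illustrative.

```python
def identify_with_early_stop(words, word_sets, min_lead=3):
    """Count matches per language, stopping once the leader is far enough ahead.

    word_sets maps language -> set of words (a stand-in for word-table lookup).
    Returns (best language, counts, number of document words examined).
    """
    counts = {lang: 0 for lang in word_sets}
    examined = 0
    for word in words:
        examined += 1
        for lang, vocabulary in word_sets.items():
            if word in vocabulary:
                counts[lang] += 1
        # Assumed divergence test: lead of the top count over the runner-up.
        ranked = sorted(counts.values(), reverse=True)
        if len(ranked) > 1 and ranked[0] - ranked[1] >= min_lead:
            break
    best = max(counts, key=counts.get)
    return best, counts, examined


# Illustrative word lists only.
word_sets = {
    "english": {"the", "of", "and", "is"},
    "german": {"der", "die", "und", "ist"},
}
document = "the house is old and the garden is big".split()
print(identify_with_early_stop(document, word_sets))
```

With a stopping rule of this kind, a long document in a clearly distinguishable language need not be scanned to the end.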

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Paulsen, Robert Charles Jr.; Martino, Michael John
Primary Examiner(s): Thomas, Joseph
Application Number: US08/723,813
Time in Patent Office: 1,184 Days
Field of Search: 704/1, 704/8, 704/9, 704/10, 707/531, 707/532, 707/535, 707/536, 707/101
US Class Current: 704/1
CPC Class Codes: G06F 40/216 (using statistical methods); G06F 40/263 (Language identification)
