Method for identifying the language of individual words

US 6,292,772 B1
Filed: 12/01/1998
Issued: 09/18/2001
Est. Priority Date: 12/01/1998
Status: Expired due to Fees

Abstract

The method of recognizing the language of a single word has applications in spelling and grammar correction (e.g., identifying the appropriate language resources on a document, paragraph, sentence or even individual word basis), the automatic invocation of transliteration software based on the language of the words (e.g., automatic ASCII to Kanji substitution without requiring the user to explicitly switch into a Kanji mode), the automatic invocation of appropriate machine translation tools when the document's language is different from the user's native tongue(s), the use of document language identification to eliminate from database or web search results any documents which are not written in the user's native language, and the automatic identification of user-appropriate languages for the user interface.

12 Claims

1. A computer implemented method of determining if a word is from a target language comprising the steps of:
- decomposing the word into a plurality of non-overlapping n-grams covering the entire word without gaps and without crossing word boundaries and including a first n-gram, one or more following n-grams, if present, and a last n-gram,
- determining if the first n-gram, one or more of the following n-grams, if present, and the last n-gram match non-overlapping n-gram patterns characteristic of words in the target language, and
- identifying the word as from the target language if the plurality of non-overlapping n-grams match the non-overlapping n-gram patterns characteristic of words in the target language.
2. The method of claim 1 wherein the plurality of n-grams is determined to match non-overlapping n-gram patterns using regular expressions or finite state automata.
3. The method of claim 2 wherein the word is decomposed by treating the word as a sequence of non-overlapping n-grams without gaps.
4. The method of claim 1 wherein the word is decomposed by treating the word as a sequence of n-grams without gaps.
5. The method of claim 1 wherein the word is decomposed by treating the word as a sequence of n-grams with position restrictions.
6. The method of claim 1, further including the step of determining a most probable language of the word where more than one language is suggested by processing neighboring words such that if both neighboring words are of a first language and the word is recognized as being of both the first and a second language, then the word is deemed of the first language.
7. The method of claim 1, further including the step of determining a language of a sequence of words where if more than a given ratio of words in the sequence of words is found characteristic of the target language, deeming the sequence of words to be in the target language.
8. The method of claim 1, further including the step of determining a language of the word in a sequence of words where if a given word in the sequence of words is not found to be of a given language of a substantial number of remaining words in the sequence of words and is not set off by quotation marks, italicized, or otherwise marked as unusual, then considering the word to be a misspelled variant of a word in the given language.
9. The method of claim 1, further including the steps of:
10. The method of claim 1, further including the steps of:
- repeating the steps for each word in a sequence of words of not more than about five words, and
- selecting a language of a computer user interface if at least one word in the sequence of words is identified as being of the target language.
11. The method of claim 1, further including the steps of:
- repeating the steps for each word in a sequence of words of not more than about five words, and
- selecting a source language of a computer translation program if at least one word in the sequence of words is identified as being of the target language.
12. The method of claim 1, further including the steps of:
- repeating the steps for each word in a sequence of words of not more than about five words in a document query, and
- selecting a language of documents to be retrieved in an information retrieval system if at least one word from the document query is identified as being of the target language.
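
The steps recited in claims 1, 2 and 7 can be illustrated with a minimal Python sketch. It is not the patented implementation: the fixed n-gram size of 2, the consonant-plus-vowel patterns (`FIRST`, `FOLLOWING`, `LAST`), the function names and the 0.5 ratio are all hypothetical choices made only for illustration. The sketch also checks every following n-gram, which is stricter than the "one or more of the following n-grams" wording of claim 1.

```python
import re

# Toy position-specific patterns for a made-up target language whose words
# are built from consonant+vowel pairs. These are illustrative assumptions,
# not patterns taken from the patent.
_CV = r"[bdfghjkmnprstwyz][aeiou]"
FIRST = re.compile(_CV)            # pattern for the first n-gram
FOLLOWING = re.compile(_CV)        # pattern for the n-grams in between
LAST = re.compile(_CV + r"|n")     # the last n-gram may also be a bare "n"


def decompose(word: str, n: int = 2) -> list[str]:
    """Split a word into consecutive, non-overlapping n-grams with no gaps.

    The final n-gram is shorter than n when len(word) is not a multiple of n.
    """
    return [word[i:i + n] for i in range(0, len(word), n)]


def is_target_word(word: str, n: int = 2) -> bool:
    """Return True if first, following and last n-grams all match (cf. claim 1)."""
    grams = decompose(word.lower(), n)
    if not grams or not FIRST.fullmatch(grams[0]):
        return False
    if len(grams) == 1:
        # A single n-gram is the whole word; it only has to pass FIRST here.
        return True
    middle, last = grams[1:-1], grams[-1]
    return (all(FOLLOWING.fullmatch(g) for g in middle)
            and bool(LAST.fullmatch(last)))


def sequence_is_target(words: list[str], ratio: float = 0.5) -> bool:
    """Deem a word sequence to be in the target language if more than
    `ratio` of its words match (cf. claim 7)."""
    if not words:
        return False
    hits = sum(is_target_word(w) for w in words)
    return hits / len(words) > ratio


if __name__ == "__main__":
    for w in ["kimono", "nihon", "window"]:
        print(w, decompose(w), is_target_word(w))
    print(sequence_is_target(["kimono", "karate", "window"]))
```

Keeping the patterns as compiled regular expressions follows the option named in claim 2; a finite state automaton would serve equally well. Separate patterns per position are one way to read the "position restrictions" of claim 5.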

Current Assignee: Justsystems Corporation
Original Assignee: Justsystems Corporation
Inventors: Kantrowitz, Mark
Primary Examiner(s): Edouard, Patrick N.
Application Number: US09/204,623
Time in Patent Office: 1,022 Days
Field of Search: 704/1, 704/8-10, 704/2, 707/530, 707/537, 707/532, 707/536, 707/101, 382/229, 382/230
US Class Current: 704/9
CPC Class Codes:
G06F 40/263   Language identification
G06F 40/279   Recognition of textual enti...
G10L 15/005   Language recognition
G10L 15/197   Probabilistic grammars, e.g...