Text language identification

US 20040138869A1
Filed: 12/11/2003
Published: 07/15/2004
Est. Priority Date: 12/17/2002
Status: Active Grant

Abstract

After prestoring first character strings that occur frequently in words of languages and second character strings that are atypical therein, a device for automatically identifying the language of a text from a plurality of languages extracts words from the text and constructs all of the character strings contained in each extracted word. Each string in an extracted word is compared to the first and second strings of a particular language. If the word contains a first string, a score of the language is increased by a coefficient depending in particular on the position of the first string in the word. If the word contains a second string, the score is decreased by a coefficient associated with the second string. The highest of the scores corresponding to the predetermined languages identifies the language of the text.
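The procedure summarized in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the tables `FREQUENT` and `ATYPICAL`, their numeric values, and the `position_weight` rule (strings touching a word boundary count double) are hypothetical stand-ins for the prestored strings and coefficients.

```python
import re

# Hypothetical toy tables; real tables would be derived from corpora.
# FREQUENT maps each language to its "first" (frequent) strings and a bonus.
FREQUENT = {
    "en": {"th": 1.0, "ing": 2.0},
    "fr": {"eau": 2.0, "ou": 1.0},
}
# ATYPICAL maps each language to its "second" (atypical) strings and a penalty.
ATYPICAL = {
    "en": {"eau": 1.5},
    "fr": {"th": 1.0, "ing": 1.0},
}


def substrings(word):
    """Yield (start, string) for every contiguous substring of word,
    from one character up to the full word length."""
    for start in range(len(word)):
        for end in range(start + 1, len(word) + 1):
            yield start, word[start:end]


def position_weight(start, length, word_length):
    """Assumed weighting: a string at the start or end of the word counts double."""
    at_edge = start == 0 or start + length == word_length
    return 2.0 if at_edge else 1.0


def identify_language(text):
    """Return (best_language, scores) for the given text."""
    words = re.findall(r"[^\W\d_]+", text.lower())  # runs of letters only
    scores = {lang: 0.0 for lang in FREQUENT}
    for word in words:
        for start, s in substrings(word):
            for lang in scores:
                if s in FREQUENT[lang]:
                    weight = position_weight(start, len(s), len(word))
                    scores[lang] += weight * FREQUENT[lang][s]
                if s in ATYPICAL[lang]:
                    scores[lang] -= ATYPICAL[lang][s]
    # The highest score identifies the language of the text.
    return max(scores, key=scores.get), scores
```

With these toy tables, `identify_language("the thing")` selects `"en"` and `identify_language("beau bateau")` selects `"fr"`; the scores themselves only reflect the made-up coefficients.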

275 Citations

7 Claims

1. A device for automatically identifying the language of a digital text, comprising:
- means for prestoring first character strings that occur frequently anywhere respectively in words of a plurality of predetermined languages and characterize said predetermined languages,
- means for prestoring second character strings that are atypical anywhere respectively in words of said predetermined languages,
- means for analyzing words extracted from said digital text thereby constructing for each extracted word all character strings contained in said extracted word and having lengths lying between one character and the number of characters in said extracted word,
- means for comparing character strings contained in extracted words to prestored character strings in order to determine scores associated with said predetermined languages,
- means for comparing each of all character strings contained in each said extracted word individually to said first and second prestored character strings of a determined language so that whenever a first character string is found in said extracted word a score associated with said determined language is increased by a first coefficient depending on the position of said first character string found in said extracted word and whenever a second character string is found in said extracted word said score is decreased by a respective second coefficient that is associated with said found second character string and that increases as the probability of said found second character string in said determined language decreases, and
- means for comparing said scores for said text associated with said predetermined languages in order to determine the highest of said scores, which identifies the language of said text.
- Dependent claims (2 to 7):
  - 2. The device claimed in claim 1, wherein a first character string in an extracted word consists of one of the following character strings: a prefix, a pseudo-prefix, a suffix, a pseudo-suffix, an infix, a pseudo-infix.
  - 3. The device claimed in claim 1, wherein said first coefficient of a first character string in said extracted word depends on the frequency of said character string in said determined language.
  - 4. The device claimed in claim 1, wherein said first coefficient of a first character string in said extracted word depends on the length of said character string.
  - 5. The device claimed in claim 1, wherein said first coefficient of a first character string in said extracted word is equal to PO (FR+LON), where PO is a coefficient depending on the position of said first character string in said extracted word, FR is a coefficient depending on the frequency of said first character string in a determined language, and LON is a coefficient depending on the length of said first character string.
  - 6. The device claimed in claim 1, comprising comparator means for comparing each of said extracted words from said text with frequent words in said determined language and initially listed in storage means so that whenever a frequent word is found in said text said score for said determined language is increased only by a coefficient depending on the frequency of said extracted word in said determined language.
  - 7. The device claimed in claim 1, comprising comparator means for comparing each of said extracted words from said text with frequent words in said determined language and initially listed in storage means so that whenever a frequent word is found in said text said score for said determined language is increased only by a coefficient depending on the length of said frequent word.
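The first coefficient of claim 5 can be illustrated with a short Python sketch. It reads the juxtaposition PO (FR+LON) as a product, which is an interpretation rather than something the claim text spells out. The values in `POSITION_WEIGHT`, the constant `LENGTH_WEIGHT`, and the choice to use the relative frequency directly as FR are all hypothetical; the claims only state what each coefficient depends on, not its functional form.

```python
# Assumed position coefficients (PO); the claims give no numeric values.
POSITION_WEIGHT = {"prefix": 2.0, "suffix": 2.0, "infix": 1.0}

# Assumed contribution per character for the length coefficient (LON).
LENGTH_WEIGHT = 0.25


def first_coefficient(position, frequency, length):
    """Compute PO * (FR + LON) for a frequent string found in a word.

    position  -- "prefix", "suffix" or "infix" (hypothetical labels)
    frequency -- relative frequency of the string in the language, used as FR
    length    -- number of characters in the string, scaled to give LON
    """
    po = POSITION_WEIGHT[position]
    fr = frequency
    lon = LENGTH_WEIGHT * length
    return po * (fr + lon)
```

Under these assumptions, a four-character prefix with relative frequency 0.5 gives `first_coefficient("prefix", 0.5, 4) == 3.0`, while a two-character infix with the same frequency gives 1.0, so longer strings at word boundaries weigh more heavily in the score.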


Current Assignee: Orange S.A.
Original Assignee: Orange S.A.
Inventors: Heinecke, Johannes
Granted Patent: US 7,689,409 B2
US Class Current: 704/1
CPC Class Codes:
- G06F 40/216: using statistical methods
- G06F 40/263: Language identification
- G06F 40/268: Morphological analysis
