Text language identification

  • US 20040138869A1
  • Filed: 12/11/2003
  • Published: 07/15/2004
  • Est. Priority Date: 12/17/2002
  • Status: Active Grant
First Claim

1. A device for automatically identifying the language of a digital text, comprising:

  • means for prestoring first character strings that occur frequently anywhere respectively in words of a plurality of predetermined languages and characterize said predetermined languages,
  • means for prestoring second character strings that are atypical anywhere respectively in words of said predetermined languages,
  • means for analyzing words extracted from said digital text thereby constructing for each extracted word all character strings contained in said extracted word and having lengths lying between one character and the number of characters in said extracted word,
  • means for comparing character strings contained in extracted words to prestored character strings in order to determine scores associated with said predetermined languages,
  • means for comparing each of all character strings contained in each said extracted word individually to said first and second prestored character strings of a determined language so that whenever a first character string is found in said extracted word a score associated with said determined language is increased by a first coefficient depending on the position of said first character string found in said extracted word and whenever a second character string is found in said extracted word said score is decreased by a respective second coefficient that is associated with said found second character string and that increases as the probability of said found second character string in said determined language decreases, and
  • means for comparing said scores for said text associated with said predetermined languages in order to determine the highest of said scores, which identifies the language of said text.
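
The scoring scheme recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. The names `PROFILES`, `substrings`, `position_coefficient` and `identify_language` are hypothetical, the two toy language profiles are invented for illustration, and the position weighting (double weight at the start or end of a word) is an assumption, since the claim only says that the first coefficient depends on position.

```python
import re

# Hypothetical toy profiles, invented for illustration only:
# language -> (frequent strings, {atypical string: penalty}).
# A larger penalty stands for a string that is less probable in that language.
PROFILES = {
    "english": ({"th", "ing", "the"}, {"sch": 3.0, "eau": 2.0}),
    "french": ({"eau", "ou", "ez"}, {"th": 2.0, "ing": 3.0}),
}


def substrings(word):
    """Yield (start, substring) for every substring of length 1..len(word)."""
    n = len(word)
    for start in range(n):
        for end in range(start + 1, n + 1):
            yield start, word[start:end]


def position_coefficient(start, length, word_length):
    """Assumed weighting: a string at the start or end of a word counts double."""
    if start == 0 or start + length == word_length:
        return 2.0
    return 1.0


def identify_language(text, profiles):
    """Return (best_language, scores) for the given text."""
    scores = {lang: 0.0 for lang in profiles}
    # Extract words as runs of letters (no digits, no underscores).
    words = re.findall(r"[^\W\d_]+", text.lower())
    for word in words:
        for start, s in substrings(word):
            for lang, (frequent, atypical) in profiles.items():
                if s in frequent:
                    # Frequent string found: raise the score by a position-dependent amount.
                    scores[lang] += position_coefficient(start, len(s), len(word))
                if s in atypical:
                    # Atypical string found: lower the score by its own penalty.
                    scores[lang] -= atypical[s]
    # The highest score identifies the language.
    return max(scores, key=scores.get), scores
```

Calling `identify_language("the thing", PROFILES)` with these toy profiles selects `"english"`: "th", "the" and "ing" each raise the English score, while "th" and "ing" lower the French one. A realistic profile would hold many more strings per language, with penalties derived from measured string probabilities.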

  • 1 Assignment