Text language identification

  • US 7,689,409 B2
  • Filed: 12/11/2003
  • Issued: 03/30/2010
  • Est. Priority Date: 12/17/2002
  • Status: Active Grant
First Claim

1. A device for automatically identifying the language of a digital text, comprising:

  • means for prestoring first character strings, including prefixes, suffixes and infixes, of different lengths from words of a plurality of predetermined languages, that occur frequently anywhere respectively in said words of said plurality of predetermined languages,
  • means for prestoring second character strings of different lengths, that are atypical anywhere respectively in said words of said predetermined languages,
  • means for analyzing words extracted from said digital text, thereby constructing for each extracted word all the character strings contained in said extracted word, including all the prefixes, suffixes and infixes in said extracted word, with overlap and different lengths lying between one character and the number of characters in said extracted word,
  • means for comparing each of said character strings contained in each said extracted word to said first prestored character strings and second prestored character strings of said predetermined languages,
  • means for calculating scores respectively associated with said predetermined languages, a score associated with one determined language being calculated by adding to said score a first coefficient whenever a prestored first character string of said one determined language is found in said extracted word, said first coefficient depending on the position of said found prestored first character string of said one determined language in said extracted word, and, by subtracting from said score a second coefficient whenever a prestored second character string of said one determined language is found in said extracted word, said second coefficient increasing as the probability of said found prestored second character string in said one determined language decreases, and
  • means for comparing said scores for said text associated with said predetermined languages in order to determine the highest of said scores, which identifies the language of said text.
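
The scoring procedure recited in claim 1 can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the patented implementation: the toy tables FREQUENT and ATYPICAL, the function names, the position weighting (prefix or suffix counts double) and the penalty formula (negative log of the probability) are all hypothetical choices. The claim only requires that the first coefficient depend on position and that the second coefficient grow as the probability falls.

```python
import math

# Hypothetical toy tables; a real system would prestore far larger ones.
# Frequent ("first") character strings per language.
FREQUENT = {
    "en": {"th", "the", "ing"},
    "fr": {"les", "eau", "que"},
}
# Atypical ("second") character strings per language, mapped to an assumed
# probability of occurrence in that language (must be in (0, 1]).
ATYPICAL = {
    "en": {"eau": 0.01},
    "fr": {"th": 0.01, "ing": 0.05},
}


def substrings(word):
    """Return (string, start) for every contiguous character string in word,
    with overlap, for all lengths from 1 to len(word)."""
    n = len(word)
    return [(word[i:j], i) for i in range(n) for j in range(i + 1, n + 1)]


def position_weight(start, length, word_len):
    """Assumed first coefficient: 2.0 for a prefix or suffix, 1.0 for an infix."""
    if start == 0 or start + length == word_len:
        return 2.0
    return 1.0


def identify_language(text, frequent=FREQUENT, atypical=ATYPICAL):
    """Score every language over the words of text and return (best, scores)."""
    scores = {lang: 0.0 for lang in frequent}
    for word in text.lower().split():
        for s, start in substrings(word):
            for lang in scores:
                # Add the position-dependent coefficient for a frequent string.
                if s in frequent[lang]:
                    scores[lang] += position_weight(start, len(s), len(word))
                # Subtract a penalty that increases as the probability decreases.
                p = atypical.get(lang, {}).get(s)
                if p is not None:
                    scores[lang] -= -math.log(p)
    # The highest score identifies the language of the text.
    best = max(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    print(identify_language("the king is singing"))
    print(identify_language("les bateaux"))
```

Splitting on whitespace is a simplification of word extraction; punctuation handling and tie-breaking between equal scores are left out of this sketch.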
