Trigram-based method of language identification

  • US 5,062,143 A
  • Filed: 02/23/1990
  • Issued: 10/29/1991
  • Est. Priority Date: 02/23/1990
  • Status: Expired due to Term
First Claim

1. A method of determining in what language a body of text is written comprising the steps of:

    (a) parsing said body of text into a plurality of trigrams so that at least some of the trigrams overlap adjacent words, each trigram comprising the contents of three successive character/space positions of said body of text;

    (b) comparing each of the trigrams that has been parsed from said body of text in step (a) with a plurality of trigram key sets, each respective trigram key set being associated with a respectively different language and containing those trigrams that have been predetermined to occur at a frequency that is at least equal to a prescribed frequency of occurrence of trigrams for that respective language; and

    (c) in response to the ratio of the number of trigrams of said body of text compared in step (b), that correspond to trigrams of a respective key set, to the total number of trigrams of said body of text being at least equal to a prescribed value and greater than such ratios for alternative languages, identifying the body of text as being written in the language associated with said respective key set.
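
Steps (a) through (c) can be illustrated with a minimal Python sketch. It is not the patented implementation, only one reading of the claim under stated assumptions: text is lowercased and runs of whitespace are collapsed to a single space, the "prescribed frequency" is modeled as a minimum raw count in a sample text, and the function names (`parse_trigrams`, `build_key_set`, `identify_language`) and parameters (`min_count`, `min_ratio`) are hypothetical.

```python
from collections import Counter


def parse_trigrams(text):
    """Step (a): split text into overlapping three-character windows.

    Spaces count as characters, so windows such as "e c" span two
    adjacent words. Lowercasing and whitespace collapsing are
    assumptions of this sketch, not requirements of the claim.
    """
    normalized = " ".join(text.lower().split())
    return [normalized[i:i + 3] for i in range(len(normalized) - 2)]


def build_key_set(sample_text, min_count):
    """Collect the trigrams that occur at least min_count times in a sample.

    A raw count stands in for the claim's "prescribed frequency of
    occurrence"; a relative frequency would work equally well.
    """
    counts = Counter(parse_trigrams(sample_text))
    return {trigram for trigram, count in counts.items() if count >= min_count}


def identify_language(text, key_sets, min_ratio):
    """Steps (b) and (c): score each language and apply both conditions.

    key_sets maps a language label to its set of key trigrams.
    Returns the label whose hit ratio is at least min_ratio and strictly
    greater than every other language's ratio, or None otherwise.
    """
    trigrams = parse_trigrams(text)
    if not trigrams:
        return None

    total = len(trigrams)
    # Step (b): compare every parsed trigram against every key set.
    ratios = {
        language: sum(trigram in keys for trigram in trigrams) / total
        for language, keys in key_sets.items()
    }
    if not ratios:
        return None

    # Step (c): the winner must meet the threshold and beat all alternatives.
    best = max(ratios, key=ratios.get)
    best_ratio = ratios[best]
    if best_ratio < min_ratio:
        return None
    if any(r >= best_ratio for lang, r in ratios.items() if lang != best):
        return None
    return best


if __name__ == "__main__":
    # Toy key sets for illustration only; real key sets would be built
    # from large samples of each language.
    key_sets = {
        "en": {"the", "he ", "e c", " ca", "cat"},
        "de": {"der", "er ", "r h", " hu", "hun", "und"},
    }
    print(identify_language("the cat", key_sets, min_ratio=0.5))
```

In this sketch a tie between two languages yields no identification, which follows the claim's requirement that the winning ratio be greater than the ratios for the alternative languages.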
