Method of identifying the language of a textual passage using short word and/or n-gram comparisons

  • US 7,359,851 B2
  • Filed: 01/14/2004
  • Issued: 04/15/2008
  • Est. Priority Date: 01/14/2004
  • Status: Expired due to Fees
First Claim

1. A method of determining the language of a textual passage, the method comprising the steps of:

  • (a) parsing said textual passage into plural n-grams;

  • (b) comparing each of said plural n-grams with a plurality of databases, wherein each of said databases comprises multiple n-grams associated with a specific language;

  • (c) determining an initial weight for each of said plural n-grams, per language, by calculating the frequency with which each of said plural n-grams appears in each of said databases and dividing said frequency by the total number of the multiple n-grams in said respective database;

  • (d) determining the number of said databases within which each of said plural n-grams appear;

  • (e) altering said initial weight for each of said plural n-grams by multiplying said initial weight with the inverse of said number of databases within which each of said plural n-grams appear to produce an altered weight for each of said plural n-grams;

  • (f) multiplying, for each of the plural n-grams, a number of times that n-gram appears in the textual passage by the altered weight of that n-gram from step (e) to produce a product for each n-gram, per language, and summing those products to produce a language passage weight for each language for the textual passage;

  • (g) determining, based upon a comparison of the language passage weights from step (f) the most likely language or languages for the textual passage.
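
Steps (a) through (g) can be read as a small scoring routine. The Python sketch below is one possible reading of the claim, not the patentee's implementation. It rests on several assumptions that the claim does not state: n-grams are overlapping character sequences of a fixed length, each "database" is a plain n-gram count table built from sample text, and "the total number of the multiple n-grams" means the total count of n-gram occurrences in that table. All function names are hypothetical.

```python
from collections import Counter


def parse_ngrams(text, n=3):
    """Step (a): split a passage into overlapping character n-grams (assumed form)."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_database(sample_text, n=3):
    """Hypothetical per-language database: n-gram -> occurrence count in sample text."""
    return Counter(parse_ngrams(sample_text, n))


def language_passage_weights(passage, databases, n=3):
    """Steps (b)-(f): return one passage weight per language."""
    passage_counts = Counter(parse_ngrams(passage, n))
    # Assumed denominator for step (c): total n-gram occurrences per database.
    totals = {lang: sum(db.values()) for lang, db in databases.items()}
    weights = {lang: 0.0 for lang in databases}

    for gram, times_in_passage in passage_counts.items():
        # Step (d): number of databases that contain this n-gram.
        num_dbs = sum(1 for db in databases.values() if gram in db)
        if num_dbs == 0:
            continue  # n-gram unknown to every language; contributes nothing
        for lang, db in databases.items():
            if totals[lang] == 0:
                continue  # empty database; avoid dividing by zero
            initial = db[gram] / totals[lang]   # step (c): relative frequency
            altered = initial * (1 / num_dbs)   # step (e): scale by inverse database count
            weights[lang] += times_in_passage * altered  # step (f): sum of products
    return weights


def most_likely_languages(weights):
    """Step (g): languages sharing the highest passage weight."""
    best = max(weights.values())
    return [lang for lang, w in weights.items() if w == best]
```

For example, calling `language_passage_weights("abc", {"a": Counter({"ab": 2, "bc": 2}), "b": Counter({"ab": 1, "cd": 3})}, n=2)` scores the 2-gram "ab" at half strength because both tables contain it, while "bc" keeps its full weight for language "a". The effect of step (e) is that n-grams shared by many languages count for less than n-grams specific to one.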
