
Method of identifying the language of a textual passage using short word and/or n-gram comparisons

  • US 20050154578A1
  • Filed: 01/14/2004
  • Published: 07/14/2005
  • Est. Priority Date: 01/14/2004
  • Status: Active Grant
First Claim

1. A method of determining the language of a textual passage, the method comprising the steps of:

    (a) parsing said textual passage into a plurality of n-grams;

    (b) comparing each of said n-grams with a plurality of databases, wherein each of said databases comprises a list of n-grams associated with a specific language;

    (c) determining an initial weight for each of said n-grams, per language, by calculating the frequency with which each of said n-grams appears in each of said databases and dividing said frequency by the total number of n-grams in said respective database;

    (d) determining the number of said databases within which each of said n-grams appear;

    (e) altering said initial weight for each of said n-grams by multiplying said initial weight with the inverse of said number of databases within which each of said n-grams appear;

    (f) producing the weight of each language over the text passage by calculating, per language, the sum over each n-gram in the text passage of the products of the number of times that that n-gram appears in the text passage and the language-specific altered weight calculated in step (e) for that n-gram;

    (g) sorting the list of per language passage weights from step (f) in decreasing order, returning the most likely language for the text passage as the first element (highest weight) in the list.
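
Steps (a) through (g) can be illustrated with a minimal Python sketch. It is not the patentee's implementation, and it rests on several assumptions the claim does not state: n-grams are taken to be overlapping character trigrams, each "database" is modeled as a table of n-gram counts built from a reference sample, and "the total number of n-grams in said respective database" is read as the sum of those counts. The names `char_ngrams`, `build_database`, and `rank_languages` are hypothetical.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Step (a): split a passage into overlapping character n-grams.

    Character trigrams are an assumption; the claim only says "n-grams".
    """
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_database(sample, n=3):
    """Model one language database as n-gram -> count in a reference sample."""
    return Counter(char_ngrams(sample, n))


def rank_languages(passage, databases, n=3):
    """Return (language, weight) pairs sorted by decreasing weight."""
    passage_counts = Counter(char_ngrams(passage, n))
    # Total n-gram occurrences per database (assumed reading of step (c)).
    totals = {lang: sum(db.values()) for lang, db in databases.items()}
    scores = {lang: 0.0 for lang in databases}

    for gram, count in passage_counts.items():
        # Steps (b) and (d): find the databases that contain this n-gram.
        present = [lang for lang, db in databases.items() if gram in db]
        if not present:
            continue  # n-gram unknown to every language: contributes nothing
        for lang in present:
            # Step (c): relative frequency of the n-gram in this database.
            initial = databases[lang][gram] / totals[lang]
            # Step (e): multiply by the inverse of the database count.
            altered = initial / len(present)
            # Step (f): accumulate passage count times altered weight.
            scores[lang] += count * altered

    # Step (g): highest-weighted language first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Calling `rank_languages(passage, databases)` with a dictionary that maps language names to the output of `build_database` returns the ranked list, and its first element is the most likely language. The division in step (e) reduces the influence of n-grams shared by many languages, so n-grams that are distinctive to one language dominate the sum.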

  • 2 Assignments