System and method for identifying the language of written text having a plurality of different length n-gram profiles

  • US 6,272,456 B1
  • Filed: 03/19/1998
  • Issued: 08/07/2001
  • Est. Priority Date: 03/19/1998
  • Status: Expired due to Term
First Claim

1. A computer-implemented method for using a plurality of n-gram profiles each associated with one of a plurality of languages so as to identify one of the plurality of languages in a sample input, comprising:

    (1) placing a window of X length over X letters of the sample input to produce a set of windowed letters;

    (2) determining if there is at least one match based upon at least one comparison between any subset of letters ranging from length X to length Y of the windowed letters and a plurality of reference letter sequences ranging from length X to length Y in each of the n-gram profiles for each of the languages;

    (3) determining which of the at least one matches is associated with a longest letter sequence in each of the n-gram profiles for each of the languages;

    (4) obtaining a score associated with the longest letter sequence in each of the n-gram profiles for each of the languages determined in sub-process (3);

    (5) adding each score obtained in sub-process (4) to an associated cumulative score associated with each one of the plurality of languages;

    (6) shifting the X-length window within the sample input and repeating sub-processes (2) through (5) until the window has been shifted through the sample input; and

    (7) identifying the one of the languages based upon which of the languages has a highest cumulative score.
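
The seven sub-processes of claim 1 can be sketched in Python as follows. This is an illustrative reading only, not the patented implementation: the names `identify_language` and `longest_match`, the dictionary layout of a profile (letter sequence mapped to a score), and the toy scores are all assumptions made for the example. The sketch also assumes that a "subset of letters" means a contiguous substring of the window, that Y is the shortest sequence length and X the longest, and that ties between equal-length matches go to the leftmost one; the claim itself does not specify these points.

```python
from typing import Dict, Optional, Tuple

# Hypothetical profile layout: language -> {letter sequence -> score}.
Profiles = Dict[str, Dict[str, float]]


def longest_match(window: str, profile: Dict[str, float], min_len: int) -> Optional[str]:
    """Sub-processes (2)-(3): return the longest substring of the window
    that appears in the profile, or None when nothing matches."""
    for length in range(len(window), min_len - 1, -1):  # longest first
        for i in range(len(window) - length + 1):
            candidate = window[i:i + length]
            if candidate in profile:
                return candidate  # assumption: leftmost match wins ties
    return None


def identify_language(sample: str, profiles: Profiles,
                      x: int = 3, y: int = 2) -> Tuple[str, Dict[str, float]]:
    """Slide an x-letter window over the sample and accumulate, per language,
    the score of the longest matching sequence (length y..x) in each window."""
    totals = {lang: 0.0 for lang in profiles}
    last_start = max(len(sample) - x, 0)
    for start in range(last_start + 1):            # sub-process (6): shift window
        window = sample[start:start + x]           # sub-process (1)
        for lang, profile in profiles.items():
            best = longest_match(window, profile, y)
            if best is not None:
                totals[lang] += profile[best]      # sub-processes (4)-(5)
    winner = max(totals, key=totals.get)           # sub-process (7)
    return winner, totals


if __name__ == "__main__":
    # Toy profiles with made-up scores, for illustration only.
    toy_profiles = {
        "en": {"the": 3.0, "th": 1.0, "he": 1.0},
        "de": {"der": 3.0, "er": 1.0, "de": 1.0},
    }
    print(identify_language("the theme", toy_profiles))
```

Scoring only the longest match per window, rather than every matching sequence, keeps a long, distinctive sequence from being counted again through its shorter sub-sequences.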
