System and method for identifying the language of written text having a plurality of different length n-gram profiles
Abstract
A window of letters is identified within a text sample input. If the window contains matches to reference letter sequences (RLS) contained in multiple sets of n-gram language profiles (profiles), then the longest match is kept and scored for each language. Scoring each language is based on frequency parameters of the matched RLS in profiles for each language. The window is incrementally shifted through the sample and the matching and scoring is done on the letters within the window. At the end of the sample input, the language having the highest cumulative score is identified as the sample's language. Scoring may be improved by restricting the RLS within longer profiles to be full words, using two passes where the second pass disregards languages that are not scored near the highest scoring language during the first pass, favoring matched RLS within profiles of complete words during scoring, favoring longer matched RLS within profiles during scoring, and increasing a score of a match that does not frequently appear in many languages. The profiles may be enhanced by removing some of the RLS if the frequency of the RLS does not meet a predefined threshold and a variable threshold.
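The windowed matching and scoring summarized in the abstract can be illustrated with a short sketch. It is not the patented implementation: the function names, the dictionary layout of a profile (reference letter sequence mapped to a frequency-based score), the default lengths, the choice to match only sequences anchored at the start of the window, and the one-letter shift are all assumptions made for illustration.

```python
from typing import Dict

# Hypothetical profile layout: reference letter sequence -> frequency-based score.
Profile = Dict[str, float]


def longest_match_score(window: str, profile: Profile, min_len: int) -> float:
    """Score of the longest sequence anchored at the window start that the profile contains.

    Returns 0.0 when no sequence of at least `min_len` letters matches.
    """
    # Try the longest candidate first so the first hit is the longest match.
    for length in range(len(window), min_len - 1, -1):
        score = profile.get(window[:length])
        if score is not None:
            return score
    return 0.0


def identify_language(
    text: str,
    profiles: Dict[str, Profile],
    min_len: int = 2,
    max_len: int = 4,
) -> str:
    """Slide a window through `text` and return the language with the highest cumulative score."""
    totals = {lang: 0.0 for lang in profiles}
    for start in range(len(text)):  # assumed shift of one letter per step
        window = text[start:start + max_len]
        for lang, profile in profiles.items():
            totals[lang] += longest_match_score(window, profile, min_len)
    return max(totals, key=totals.get)


# Toy profiles with made-up scores, for demonstration only.
toy_profiles = {
    "en": {"th": 1.0, "the": 3.0, "he": 1.0},
    "de": {"de": 1.0, "der": 3.0, "er": 1.0, "ch": 1.0},
}
print(identify_language("the weather", toy_profiles))
print(identify_language("der bach", toy_profiles))
```

Searching from the longest candidate downward means a window such as "the " contributes the score of "the" rather than the score of "th", which mirrors the abstract's statement that the longest match is kept and scored for each language.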
24 Claims
1. A computer-implemented method for using a plurality of n-gram profiles each associated with one of a plurality of languages so as to identify one of the plurality of languages in a sample input, comprising:
(1) placing a window of X length over X letters of the sample input to produce a set of windowed letters;
(2) determining if there is at least one match based upon at least one comparison between any subset of letters ranging from length X to length Y of the windowed letters and a plurality of reference letter sequences ranging from length X to length Y in each of the n-gram profiles for each of the languages;
(3) determining which of the at least one matches is associated with a longest letter sequence in each of the n-gram profiles for each of the languages;
(4) obtaining a score associated with the longest letter sequence in each of the n-gram profiles for each of the languages determined in sub-process (3);
(5) adding each score obtained in sub-process (4) to an associated cumulative score associated with each one of the plurality of languages;
(6) shifting the X-length window within the sample input and repeating sub-processes (2) through (5) until the window has been shifted through the sample input; and
(7) identifying the one of the languages based upon which of the languages has a highest cumulative score.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
identifying as a close language each language having its associated cumulative score within a predefined percentage of the language having the highest score;
repeating sub-processes (1) through (6) while considering only the reference letter sequences in the n-gram profiles associated with each close language; and
wherein sub-process (7) further comprises identifying the one of the languages based upon which close language has the highest score.
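The two-pass refinement recited above can be sketched as follows. This is a minimal illustration under stated assumptions: `close_languages`, `two_pass_identify`, the `within` fraction, and the demonstration scorer `normalized_bigram_scores` are hypothetical names and choices, not taken from the patent. The demonstration scorer divides each match by the sum over the languages under consideration, so that dropping distant languages can change the second-pass scores.

```python
from typing import Callable, Dict, List

Profile = Dict[str, float]
Scores = Dict[str, float]


def close_languages(totals: Scores, within: float = 0.2) -> List[str]:
    """Languages whose cumulative score is within `within` (a fraction) of the best score."""
    best = max(totals.values())
    cutoff = best * (1.0 - within)
    return [lang for lang, score in totals.items() if score >= cutoff]


def two_pass_identify(
    text: str,
    profiles: Dict[str, Profile],
    score_all: Callable[[str, Dict[str, Profile]], Scores],
    within: float = 0.2,
) -> str:
    """Score all languages, then rescore using only the profiles of the close languages."""
    first_pass = score_all(text, profiles)
    keep = close_languages(first_pass, within)
    second_pass = score_all(text, {lang: profiles[lang] for lang in keep})
    return max(second_pass, key=second_pass.get)


def normalized_bigram_scores(text: str, profiles: Dict[str, Profile]) -> Scores:
    """Demonstration scorer (an assumption): each bigram's score is shared across languages."""
    totals = {lang: 0.0 for lang in profiles}
    for i in range(len(text) - 1):
        gram = text[i:i + 2]
        freqs = {lang: p.get(gram, 0.0) for lang, p in profiles.items()}
        denom = sum(freqs.values())
        if denom > 0:
            for lang, freq in freqs.items():
                totals[lang] += freq / denom
    return totals


# Toy profiles with made-up frequencies, for demonstration only.
demo_profiles = {
    "en": {"th": 2.0, "he": 1.0},
    "nl": {"he": 2.0, "et": 2.0},
    "fi": {"ka": 1.0},
}
print(two_pass_identify("the het", demo_profiles, normalized_bigram_scores, within=0.3))
```

A second pass only changes the outcome when the scoring depends on which languages are being considered, as with cross-language normalization or a rarity adjustment; with fully independent per-language scores the second pass would reproduce the first.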
10. The method of claim 1 further comprising the sub-process of increasing the cumulative score for one of the languages if a frequency parameter indicates that the longest of the matches is found in less than a predetermined number of the languages.
11. The method of claim 10 further comprising the sub-process of decreasing the cumulative score for the one of the languages if the frequency parameter indicates that the longest of the matches is found in more than a predefined number of the languages.
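The adjustments in claims 10 and 11 can be illustrated with a small helper. This is a sketch under assumptions: the claims say only that the score is increased or decreased depending on how many languages contain the match, so the function name, the thresholds, the multiplicative factors, and the choice to adjust each match's contribution (rather than the running total directly) are all hypothetical.

```python
from typing import Dict

Profile = Dict[str, float]


def adjust_for_rarity(
    score: float,
    sequence: str,
    profiles: Dict[str, Profile],
    rare_below: int = 2,      # assumed "predetermined number" for the increase
    common_above: int = 3,    # assumed "predefined number" for the decrease
    bonus: float = 1.5,       # assumed increase factor
    penalty: float = 0.5,     # assumed decrease factor
) -> float:
    """Raise the score of a match seen in few languages and lower it if seen in many."""
    languages_with_match = sum(1 for profile in profiles.values() if sequence in profile)
    if languages_with_match < rare_below:
        return score * bonus
    if languages_with_match > common_above:
        return score * penalty
    return score


# Toy profiles with made-up scores, for demonstration only.
rarity_profiles = {
    "en": {"th": 1.0, "er": 1.0, "an": 1.0},
    "de": {"er": 1.0, "an": 1.0},
    "nl": {"er": 1.0, "an": 1.0},
    "sv": {"er": 1.0},
    "da": {"er": 1.0},
}
print(adjust_for_rarity(2.0, "th", rarity_profiles))
print(adjust_for_rarity(2.0, "er", rarity_profiles))
print(adjust_for_rarity(2.0, "an", rarity_profiles))
```

The intent is that a sequence shared by most languages carries little evidence about which language the sample is in, while a sequence found in only one profile is a strong signal.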
12. A computer-readable medium on which is stored a computer program for using a plurality of n-gram profiles to identify one of a plurality of languages for a sample input, the computer program comprising instructions, which when executed by a computer, perform the steps of:
(1) placing a window of X length over X letters of the sample input to produce a set of windowed letters;
(2) determining if there is at least one match based upon at least one comparison between any subset of letters ranging from length X to length Y of the windowed letters and a plurality of reference letter sequences ranging from length X to length Y in each of the n-gram profiles for each of the languages;
(3) determining which of the at least one matches is associated with a longest letter sequence in each of the n-gram profiles for each of the languages;
(4) obtaining a score associated with the longest letter sequence in each of the n-gram profiles for each of the languages determined in sub-process (3);
(5) adding each score obtained in sub-process (4) to an associated cumulative score associated with each one of the plurality of languages;
(6) shifting the X-length window within the sample input and repeating sub-processes (2) through (5) until the window has been shifted through the sample input; and
(7) identifying the one of the languages based upon which of the languages has a highest cumulative score.
Dependent claims: 13, 14, 15, 16, 17, 18
identifying as close languages each of the languages having its associated cumulative score within a predefined percentage of the language having the highest one of the cumulative scores;
repeating sub-processes (1) through (7) while considering only the reference letter sequences in the n-gram profiles associated with each close language; and
wherein sub-process (7) comprises identifying the one of the languages based upon which close language has the highest one of the cumulative scores.
19. A computer-implemented method for using a plurality of different length n-gram profiles to identify one of a plurality of languages for a sample input, comprising:
(1) placing a window of X length over X letters of the sample input to produce a set of windowed letters;
(2) determining if there is at least one match based upon at least one comparison between any subset of letters ranging from length X to length Y of the windowed letters and a plurality of reference letter sequences ranging from length X to length Y in each of the n-gram profiles for each of the languages;
(3) determining which of the at least one matches is associated with a longest letter sequence in each of the n-gram profiles for each of the languages;
(4) obtaining a score associated with the longest letter sequence in each of the n-gram profiles for each of the languages determined in sub-process (3);
(5) increasing the score obtained if a frequency parameter indicates that the match is found in less than a predetermined number of the languages;
(6) shifting the window over the sample input and repeating steps (2) through (5) until the sample input has been completely shifted through the window; and
(7) identifying the one of the languages based upon which of the languages has the highest of the scores.
Dependent claims: 20, 21, 22
23. A computer-implemented method for using a plurality of different length n-gram profiles to identify one of a plurality of languages for a sample input, comprising:
(1) identifying a window of letters within the sample input;
(2) if the window of letters contains a match of the letters within the window when compared to a plurality of reference letter sequences maintained in the n-gram profile for each of the languages, scoring the match as a score for each of the languages based upon a frequency parameter maintained in the n-gram profile for each of the languages, the frequency parameter being related to the match;
(3) normalizing the score across the languages;
(4) increasing the normalized score based upon the value of a logarithm of the score;
(5) adding the normalized score to a cumulative score for each of the languages;
(6) shifting the window over the sample input and repeating sub-processes (2) and (5) until the sample input has been completely shifted through the window; and
(7) identifying the one of the languages based upon which of the languages has the highest of the scores.
Dependent claims: 24
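Sub-processes (3) through (5) of claim 23 can be sketched for a single window position as follows. The claim does not give formulas, so everything specific here is an assumption: normalization is taken to mean dividing each language's score by the sum over all languages, and the logarithmic increase is taken to be a multiplication by `1 + log(1 + score)`. The function names are hypothetical.

```python
import math
from typing import Dict

Scores = Dict[str, float]


def window_contribution(raw_scores: Scores) -> Scores:
    """Normalize one window's raw scores across languages, then apply a log-based increase."""
    total = sum(raw_scores.values())
    if total <= 0:
        # No language matched this window, so nothing is contributed.
        return {lang: 0.0 for lang in raw_scores}
    contribution = {}
    for lang, raw in raw_scores.items():
        normalized = raw / total                # assumed form of normalization
        boost = 1.0 + math.log1p(raw)           # assumed form of the logarithmic increase
        contribution[lang] = normalized * boost
    return contribution


def accumulate(cumulative: Scores, raw_scores: Scores) -> None:
    """Add one window's contribution to the running per-language totals, in place."""
    for lang, value in window_contribution(raw_scores).items():
        cumulative[lang] = cumulative.get(lang, 0.0) + value


# Made-up raw scores for two window positions, for demonstration only.
running_totals: Scores = {}
accumulate(running_totals, {"en": 3.0, "de": 1.0})
accumulate(running_totals, {"en": 0.0, "de": 0.0})
print(running_totals)
```

Under these assumptions, normalization keeps any single window from dominating the total, while the logarithmic term still gives some extra weight to matches with a high raw score.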
Specification