System and method for identifying the language of written text having a plurality of different length n-gram profiles

US 6,272,456 B1
Filed: 03/19/1998
Issued: 08/07/2001
Est. Priority Date: 03/19/1998
Status: Expired due to Term

Abstract

A window of letters is identified within a text sample input. If the window contains matches to reference letter sequences (RLS) contained in multiple sets of n-gram language profiles (profiles), then the longest match is kept and scored for each language. Scoring each language is based on frequency parameters of the matched RLS in profiles for each language. The window is incrementally shifted through the sample and the matching and scoring is done on the letters within the window. At the end of the sample input, the language having the highest cumulative score is identified as the sample's language. Scoring may be improved by restricting the RLS within longer profiles to be full words, using two passes where the second pass disregards languages that are not scored near the highest scoring language during the first pass, favoring matched RLS within profiles of complete words during scoring, favoring longer matched RLS within profiles during scoring, and increasing a score of a match that does not frequently appear in many languages. The profiles may be enhanced by removing some of the RLS if the frequency of the RLS does not meet a predefined threshold and a variable threshold.
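
The windowed longest-match scoring described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the function name `identify_language`, the `profiles` layout (language mapped to a dictionary of reference letter sequences and their scores), and the `min_len`/`max_len` parameters are all assumptions made for illustration. The sketch also assumes that every match starts at the first letter of the window and that the window advances one letter at a time.

```python
def identify_language(sample, profiles, min_len=1, max_len=4):
    """Return (best_language, cumulative_scores) for `sample`.

    `profiles` is a hypothetical layout: {language: {letter_sequence: score}}.
    """
    totals = {lang: 0.0 for lang in profiles}
    for start in range(len(sample)):
        # Window of up to max_len letters beginning at the current position.
        window = sample[start:start + max_len]
        for lang, profile in profiles.items():
            # Try the longest candidate first so only the longest match scores.
            for length in range(len(window), min_len - 1, -1):
                score = profile.get(window[:length])
                if score is not None:
                    totals[lang] += score
                    break
    # The language with the highest cumulative score wins.
    best = max(totals, key=totals.get)
    return best, totals


# Toy profiles, invented for this example only.
TOY_PROFILES = {
    "en": {"th": 2.0, "the": 5.0, "e": 1.0},
    "de": {"de": 2.0, "der": 5.0, "e": 1.0},
}
best, totals = identify_language("the", TOY_PROFILES)
```

With the toy profiles above, "the" matches the English profile as a three-letter sequence at the first window position, and both languages match "e" at the last position, so English accumulates the higher total.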

343 Citations

24 Claims

1. A computer-implemented method for using a plurality of n-gram profiles each associated with one of a plurality of languages so as to identify one of the plurality of languages in a sample input, comprising:
- (1) placing a window of X length over X letters of the sample input to produce a set of windowed letters;
  
  (2) determining if there is at least one match based upon at least one comparison between any subset of letters ranging from length X to length Y of the windowed letters and a plurality of reference letter sequences ranging from length X to length Y in each of the n-gram profiles for each of the languages;
  
  (3) determining which of the at least one matches is associated with a longest letter sequence in each of the n-gram profiles for each of the languages;
  
  (4) obtaining a score associated with the longest letter sequence in each of the n-gram profiles for each of the languages determined in sub-process (3);
  
  (5) adding each score obtained in sub-process (4) to an associated cumulative score associated with each one of the plurality of languages;
  
  (6) shifting the X-length window within the sample input and repeating sub-processes (2) through (5) until the window has been shifted through the sample input; and
  
  (7) identifying the one of the languages based upon which of the languages has a highest cumulative score.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
  - 2. The method of claim 1, wherein each of the matches includes a first of the letters within the windowed letters.
  - 3. The method of claim 1, wherein sub-process (4) comprises scoring the longest of the matches only if the longest of the matches is a complete word.
  - 4. The method of claim 1, wherein sub-process (4) comprises scoring the longest of the matches if the longest of the matches is longer than a predetermined threshold and if the longest of the matches is one of a predefined set of words.
  - 5. The method of claim 4, wherein the predefined set of words comprises complete words having at least a predefined frequency of use for one of the languages.
  - 6. The method of claim 1, wherein sub-process (4) comprises scoring the longest of the matches by increasing the score if the longest of the matches is a complete word.
  - 7. The method of claim 1, wherein sub-process (4) comprises scoring the longest of the matches by increasing the score if the longest of the matches has a length greater than a predetermined threshold.
  - 8. The method of claim 7, wherein sub-process (4) comprises scoring the longest of the matches by increasing the score relative to the length of the longest of the matches.
  - 9. The method of claim 1, further comprising, after sub-process (6),
  - 10. The method of claim 1 further comprising the sub-process of increasing the cumulative score for one of the languages if a frequency parameter indicates that the longest of the matches is found in less than a predetermined number of the languages.
  - 11. The method of claim 10 further comprising the sub-process of decreasing the cumulative score for the one of the languages if the frequency parameter indicates that the longest of the matches is found in more than a predefined number of the languages.
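
The score adjustments recited in claims 6 to 8, 10, and 11 can be sketched as a single helper. This is an illustrative reading only: the function `adjust_score`, every keyword parameter, and all default values are hypothetical, and the claims do not specify any multipliers or thresholds. As a simplification, the rarity adjustment of claims 10 and 11 is applied here to the per-match score before it is accumulated, rather than to the cumulative score itself.

```python
def adjust_score(base, match, is_word, langs_with_match, *,
                 word_bonus=1.5, length_threshold=3, per_letter_bonus=0.25,
                 rare_below=2, common_above=4, rarity_factor=2.0):
    """Adjust a base match score; all weights are illustrative assumptions."""
    score = base
    # Claim 6: favor matches that are complete words.
    if is_word:
        score *= word_bonus
    # Claims 7 and 8: favor long matches, growing with the extra length.
    extra = len(match) - length_threshold
    if extra > 0:
        score *= 1.0 + per_letter_bonus * extra
    # Claims 10 and 11: boost matches seen in few languages,
    # damp matches seen in many languages.
    if langs_with_match < rare_below:
        score *= rarity_factor
    elif langs_with_match > common_above:
        score /= rarity_factor
    return score


word_and_rare = adjust_score(2.0, "the", True, 1)
long_match = adjust_score(1.0, "abcde", False, 3)
widespread = adjust_score(1.0, "ab", False, 5)
```

The design choice of multiplicative factors keeps each adjustment independent of the others; an additive scheme would be an equally valid reading of the claims.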

12. A computer-readable medium on which is stored a computer program for using a plurality of n-gram profiles to identify one of a plurality of languages for a sample input, the computer program comprising instructions, which when executed by a computer, perform the steps of:
- (1) placing a window of X length over X letters of the sample input to produce a set of windowed letters;
  
  (2) determining if there is at least one match based upon at least one comparison between any subset of letters ranging from length X to length Y of the windowed letters and a plurality of reference letter sequences ranging from length X to length Y in each of the n-gram profiles for each of the languages;
  
  (3) determining which of the at least one matches is associated with a longest letter sequence in each of the n-gram profiles for each of the languages;
  
  (4) obtaining a score associated with the longest letter sequence in each of the n-gram profiles for each of the languages determined in sub-process (3);
  
  (5) adding each score obtained in sub-process (4) to an associated cumulative score associated with each one of the plurality of languages;
  
  (6) shifting the X-length window within the sample input and repeating sub-processes (2) through (5) until the window has been shifted through the sample input; and
  
  (7) identifying the one of the languages based upon which of the languages has a highest cumulative score.
- Dependent claims (13, 14, 15, 16, 17, 18)
  - 13. The computer-readable medium of claim 12, wherein sub-process (4) comprises scoring the longest match if the longest match is longer than a predetermined threshold and if the longest match is one of a predefined set of words.
  - 14. The computer-readable medium of claim 13, wherein the predefined set of words comprises complete words having at least a predefined frequency of use for one of the languages.
  - 15. The computer-readable medium of claim 12, wherein sub-process (4) comprises scoring the longest match by increasing the score for one of the languages if the longest match is a complete word in the one of the languages.
  - 16. The computer-readable medium of claim 12, wherein sub-process (4) comprises scoring the longest match by decreasing the score for each of the languages if the longest match is smaller in length than a predetermined threshold.
  - 17. The computer-readable medium of claim 16, wherein sub-process (4) comprises scoring the longest match by increasing the score for each of the languages relative to the length of the longest match.
  - 18. The computer-readable medium of claim 12 further comprising instructions, which when executed by the computer, after sub-process (7) perform:

19. A computer-implemented method for using a plurality of different length n-gram profiles to identify one of a plurality of languages for a sample input, comprising:
- (1) placing a window of X length over X letters of the sample input to produce a set of windowed letters;
  
  (2) determining if there is at least one match based upon at least one comparison between any subset of letters ranging from length X to length Y of the windowed letters and a plurality of reference letter sequences ranging from length X to length Y in each of the n-gram profiles for each of the languages;
  
  (3) determining which of the at least one matches is associated with a longest letter sequence in each of the n-gram profiles for each of the languages;
  
  (4) obtaining a score associated with the longest letter sequence in each of the n-gram profiles for each of the languages determined in sub-process (3);
  
  (5) increasing the score obtained if a frequency parameter indicates that the match is found in less than a predetermined number of the languages;
  
  (6) shifting the window over the sample input and repeating steps (2) through (5) until the sample input has been completely shifted through the window; and
  
  (7) identifying the one of the languages based upon which of the languages has the highest of the scores.
- Dependent claims (20, 21, 22)
  - 20. The method of claim 19, further comprising the sub-process of decreasing the score if the frequency parameter indicates that the match is found in greater than a predefined number of the languages.
  - 21. The method of claim 19, wherein the score has a logarithmic relationship to the frequency parameter.
  - 22. The method of claim 19, wherein sub-process (5) comprises normalizing the score relative to the frequency parameter maintained in the n-gram profile for each of the languages when compared to a highest score for the match if the frequency parameter indicates that the match is found in less than the predetermined number of the languages.
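
One possible reading of claims 19 to 21 is a score that grows logarithmically with a sequence's frequency and is then raised or lowered according to how many language profiles contain that sequence. The sketch below assumes a hypothetical `profiles` layout (language mapped to a dictionary of sequences and raw frequency counts); the names `languages_containing` and `score_match`, and the `few`, `many`, and `boost` values, are invented for illustration and do not come from the patent.

```python
import math


def languages_containing(match, profiles):
    """Count how many language profiles contain the matched sequence."""
    return sum(1 for profile in profiles.values() if match in profile)


def score_match(match, lang, profiles, *, few=2, many=4, boost=2.0):
    """Log-scaled score for `match` in `lang`, adjusted for rarity."""
    freq = profiles[lang].get(match, 0)
    if freq == 0:
        return 0.0  # no match in this language's profile
    # Claim 21: the score has a logarithmic relationship to the frequency.
    score = math.log1p(freq)
    n = languages_containing(match, profiles)
    if n < few:        # claim 19, sub-process (5): rare across languages
        score *= boost
    elif n > many:     # claim 20: common across languages
        score /= boost
    return score


# Toy frequency counts, invented for this example only.
TOY_COUNTS = {
    "en": {"the": 100, "e": 50},
    "de": {"der": 100, "e": 50},
}
rare = score_match("the", "en", TOY_COUNTS)
shared = score_match("e", "en", TOY_COUNTS)
```

Here "the" appears in only one toy profile and receives the boost, while "e" appears in both and keeps its plain logarithmic score.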

23. A computer-implemented method for using a plurality of different length n-gram profiles to identify one of a plurality of languages for a sample input, comprising:
- (1) identifying a window of letters within the sample input;
  
  (2) if the window of letters contains a match of the letters within the window when compared to a plurality of reference letter sequences maintained in the n-gram profile for each of the languages, scoring the match as a score for each of the languages based upon a frequency parameter maintained in the n-gram profile for each of the languages, the frequency parameter being related to the match;
  
  (3) normalizing the score across the languages;
  
  (4) increasing the normalized score based upon the value of a logarithm of the score;
  
  (5) adding the normalized score to a cumulative score for each of the languages;
  
  (6) shifting the window over the sample input and repeating sub-processes (2) and (5) until the sample input has been completely shifted through the window; and
  
  (7) identifying the one of the languages based upon which of the languages has the highest of the scores.
- Dependent claims (24)
  - 24. The method of claim 23, wherein step (3) comprises normalizing the score based upon the frequency of the match in each of the n-gram profiles for each of the languages.
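
Sub-processes (2) through (5) of claim 23 can be sketched for a single matched sequence as follows. The claim does not give formulas, so this is one assumed interpretation: the raw score is the frequency parameter itself, normalization divides by the sum of raw scores across languages, and the logarithmic increase multiplies the normalized score by `1 + log1p(raw)`. The function names and the `profiles` layout are hypothetical.

```python
import math


def normalized_scores(match, profiles):
    """Per-language scores for one match: normalize, then apply a log boost."""
    # Sub-process (2): raw score from each language's frequency parameter.
    raw = {lang: profile.get(match, 0.0) for lang, profile in profiles.items()}
    total = sum(raw.values())
    if total == 0:
        return {lang: 0.0 for lang in profiles}
    scores = {}
    for lang, value in raw.items():
        norm = value / total                             # sub-process (3)
        scores[lang] = norm * (1.0 + math.log1p(value))  # sub-process (4), assumed form
    return scores


def accumulate(totals, scores):
    """Sub-process (5): add the adjusted scores to the cumulative totals."""
    for lang, value in scores.items():
        totals[lang] = totals.get(lang, 0.0) + value
    return totals


# Toy frequency parameters, invented for this example only.
TOY_FREQS = {"en": {"the": 3.0}, "de": {"the": 1.0}}
scores = normalized_scores("the", TOY_FREQS)
totals = accumulate({}, scores)
```

Normalizing before the logarithmic boost means that a sequence shared evenly by all languages contributes little to separating them, while a sequence dominated by one language contributes most of its weight there.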

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: de Campos, Miguel Cardoso
Primary Examiner(s): Thomas, Joseph
Application Number: US09/044,752
Time in Patent Office: 1,237 Days
Field of Search: 704/1, 704/8, 704/9, 704/10, 707/530, 707/531, 707/532, 707/533, 707/535, 707/536
US Class Current: 704/8
CPC Class Codes: G06F 40/263 Language identification
