N-Gram-based language prediction

US 9,535,895 B2
Filed: 03/17/2011
Issued: 01/03/2017
Est. Priority Date: 03/17/2011
Status: Active Grant

Abstract

Techniques are described for predicting the language of a text excerpt. The language prediction is accomplished by comparing n-grams of the text excerpt with n-grams of different language references. A probability is calculated for each n-gram of the text excerpt with respect to each of the language references. The calculated probabilities corresponding to a single language are then averaged to yield an overall probability corresponding to that language, and the resulting overall probabilities are compared to find the most likely language of the sample text.
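
The flow described in the abstract can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function names (`ngrams`, `build_profile`, `predict_language`), the lowercasing step, the equal-prior normalization, and the choice to score unseen n-grams as zero are all assumptions made for the example.

```python
from collections import Counter


def ngrams(text, n=3):
    """Return every contiguous character n-gram of the text (lowercased, an assumption)."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_profile(reference_text, n=3):
    """Map each n-gram of a language reference to its relative frequency."""
    counts = Counter(ngrams(reference_text, n))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


def predict_language(sample, profiles, n=3):
    """Average a per-n-gram probability for each language and return the best one."""
    grams = ngrams(sample, n)
    averages = {}
    for lang, profile in profiles.items():
        probabilities = []
        for gram in grams:
            # Assumed scoring: the n-gram's relative frequency in this language
            # divided by its summed relative frequency across all references
            # (equal priors). N-grams seen in no reference score zero.
            evidence = sum(p.get(gram, 0.0) for p in profiles.values())
            probabilities.append(profile.get(gram, 0.0) / evidence if evidence else 0.0)
        averages[lang] = sum(probabilities) / len(probabilities) if probabilities else 0.0
    return max(averages, key=averages.get), averages


# Tiny illustrative references; real language references would be far larger.
profiles = {
    "en": build_profile("the quick brown fox jumps over the lazy dog and the cat"),
    "es": build_profile("el rapido zorro marron salta sobre el perro perezoso y el gato"),
}
language, averages = predict_language("the dog and the fox", profiles)
```

With references this small the averages are only indicative; the point is the shape of the computation: one probability per sample n-gram per language, one average per language, then a comparison.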

29 Citations

23 Claims

1. A computer-implemented method, comprising:
- under control of a device comprising one or more processors configured with executable instructions, receiving at a graphical user interface of the device, user selection of a sample electronic text;
  
  identifying multiple sample n-grams of the sample electronic text;
  
  for a first language;
  
  identifying a first set of n-grams that occur in a first language reference corresponding to the first language;
  
  calculating a first set of Bayesian probabilities, including calculating a first Bayesian probability based at least in part on a frequency of occurrence, in the first set of n-grams, of a first sample n-gram of the multiple sample n-grams; and
  
  calculating a first average of the first set of Bayesian probabilities;
  
  for a second language;
  
  identifying a second set of n-grams that occur in the second language reference corresponding to the second language;
  
  calculating a second set of Bayesian probabilities, including calculating a second Bayesian probability based at least in part on a frequency of occurrence, in the second set of n-grams, of a second sample n-gram of the multiple sample n-grams; and
  
  calculating a second average of the second set of Bayesian probabilities;
  
  comparing at least the first average and the second average;
  
  determining a language of the sample electronic text based at least in part on the comparing at least the first average and the second average;
  
  determining a meaning of a word of the sample electronic text in a dictionary of the language; and
  
  presenting the meaning of the word on a display of the device.
- Dependent Claims (2, 3, 4, 5, 6, 7, 8, 21, 22)
- - 2. The computer-implemented method of claim 1, wherein calculating the first average of the first set of Bayesian probabilities comprises calculating a product of (a) the first Bayesian probability, and (b) a number of times the first sample n-gram occurs in the sample electronic text.
  - 3. The computer-implemented method of claim 1, wherein each sample n-gram of the multiple sample n-grams is an ordered string of n characters.
  - 4. The computer-implemented method of claim 1, wherein each sample n-gram of the multiple sample n-grams is a contiguous string of n characters, and n equals three.
  - 5. The computer-implemented method of claim 1, wherein calculating the first Bayesian probability is based at least in part on:
    - a first relative frequency with which the first sample n-gram occurs in the first set of n-grams; and
      
      a second relative frequency with which the first sample n-gram occurs in a combined collection of language references that include the first language reference and the second language reference.
  - 6. The computer-implemented method of claim 1, wherein calculating the first Bayesian probability is based at least in part on:
    - a first relative frequency with which the first sample n-gram occurs in the first set of n-grams;
      
      a second relative frequency with which the first sample n-gram occurs in a combined collection of language references that include the first language reference and the second language reference; and
      
      a number of short words of the first language that occur in the sample electronic text.
  - 7. The computer-implemented method of claim 1, wherein calculating the first Bayesian probability comprises calculating the Bayesian probability P(A|B) of the first sample n-gram corresponding to the first language based at least in part on:
  - 8. The computer-implemented method of claim 1, wherein calculating the first Bayesian probability comprises calculating the Bayesian probability P(A|B) of the first sample n-gram corresponding to the first language based at least in part on:
  - 21. The computer-implemented method of claim 1, further comprising storing n-gram frequency data in an n-gram frequency table.
  - 22. The computer-implemented method of claim 1, wherein the frequency of occurrence of the first sample n-gram is a percentage of a total number of all n-grams in the first language reference that consists of the first sample n-gram.
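
Claims 5, 21 and 22 can be illustrated together with a short Python sketch: a trigram frequency table that stores percentages, and a Bayesian probability built from the two relative frequencies named in claim 5. The formulas of claims 7 and 8 are not reproduced on this page, so the calculation below is plain Bayes' theorem under stated assumptions, not the claimed formula. The names `trigram_table` and `bayes_probability`, and the prior (each language's share of the combined collection), are hypothetical choices for the example.

```python
from collections import Counter


def trigram_table(reference_text):
    """Build an n-gram frequency table for one language reference.

    Returns (table, total), where table maps each trigram to the percentage
    of all trigrams in the reference that it accounts for.
    """
    grams = [reference_text[i:i + 3] for i in range(len(reference_text) - 2)]
    counts = Counter(grams)
    total = len(grams)
    return {gram: 100.0 * count / total for gram, count in counts.items()}, total


def bayes_probability(gram, lang, tables, totals):
    """Estimate P(lang | gram) with Bayes' theorem.

    likelihood: relative frequency of the trigram in the language's reference
    evidence:   relative frequency of the trigram in the combined collection
    prior:      the language's share of the combined collection (an assumption)
    """
    grand_total = sum(totals.values())
    likelihood = tables[lang].get(gram, 0.0) / 100.0
    prior = totals[lang] / grand_total
    combined_count = sum(
        tables[name].get(gram, 0.0) / 100.0 * totals[name] for name in tables
    )
    evidence = combined_count / grand_total
    return likelihood * prior / evidence if evidence else 0.0


# Two toy references standing in for real language references.
table_a, total_a = trigram_table("abcabc")
table_b, total_b = trigram_table("abcxyz")
tables = {"A": table_a, "B": table_b}
totals = {"A": total_a, "B": total_b}
p_a = bayes_probability("abc", "A", tables, totals)
```

Under this choice of prior the result reduces to the trigram's count in one reference divided by its count in the combined collection, so the probabilities for a given trigram sum to one across languages.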

9. A computer-implemented method, comprising:
- under control of a device comprising one or more processors configured with executable instructions, receiving at a graphical user interface of the device user selection of a sample electronic text;
  
  identifying multiple sample n-grams of the sample electronic text;
  
  for a first language;
  
  calculating a first probability based at least in part on a frequency of occurrence, in the first language, of a first sample n-gram of the multiple n-grams;
  
  calculating a second probability based at least in part on a frequency of occurrence, in the first language, of a second sample n-gram of the multiple n-grams;
  
  generating a first average based at least in part on the first probability and the second probability;
  
  for a second language;
  
  calculating a third probability based at least in part on a frequency of occurrence, in the second language, of the first sample n-gram of the multiple sample n-grams;
  
  calculating a fourth probability based at least in part on a frequency of occurrence, in the second language, of the second sample n-gram of the multiple n-grams;
  
  generating a second average based at least in part on the third probability and the fourth probability;
  
  determining a language of the sample electronic text based at least in part on comparing at least the first average and the second average;
  
  displaying, via the graphical user interface, an indication of the language;
  
  performing, via the device, a language-dependent operation based at least in part on the language of the sample electronic text; and
  
  displaying, via the graphical user interface, information associated with the language-dependent operation.
- Dependent Claims (10, 11, 12, 13, 14, 15)
- - 10. The computer-implemented method of claim 9, wherein generating the first average comprises:
    - calculating a first product of the first probability and a number of times the first sample n-gram occurs in the sample electronic text;
      
      calculating a second product of the second probability and a number of times the second sample n-gram occurs in the sample electronic text; and
      
      summing a set of calculated products that include the first product and the second product.
  - 11. The computer-implemented method of claim 9, wherein a given n-gram of the multiple sample n-grams is an ordered string of n characters, and n equals three.
  - 12. The computer-implemented method of claim 9, wherein calculating the first probability is based at least in part on relative occurrence frequencies of the first sample n-gram within reference texts of different languages.
  - 13. The computer-implemented method of claim 9, wherein calculating the first probability comprises calculating a Bayesian probability of the first sample n-gram occurring in the first language.
  - 14. The computer-implemented method of claim 9, wherein calculating the first probability comprises calculating a Bayesian probability P(A|B) that the first sample n-gram corresponds to the first language based at least in part on:
  - 15. The computer-implemented method of claim 9, wherein calculating the first probability comprises calculating a Bayesian probability P(A|B) that the first sample n-gram corresponds to the language based at least in part on:
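
The averaging step of claim 10 can be sketched as follows, assuming the per-n-gram probabilities for a language have already been calculated. Claim 10 recites only the products and their sum; dividing by the total number of sample n-grams to obtain the average is an assumption of this sketch, as are the name `weighted_average` and the probability values used in the example.

```python
from collections import Counter


def weighted_average(sample_grams, gram_probability):
    """Average per-n-gram probabilities, weighting each by its count in the sample."""
    counts = Counter(sample_grams)
    # One product per distinct n-gram: its probability times the number of
    # times it occurs in the sample text.
    products = [gram_probability.get(gram, 0.0) * count for gram, count in counts.items()]
    total = sum(counts.values())
    # Assumed normalization: divide the summed products by the n-gram count.
    return sum(products) / total if total else 0.0


# Hypothetical probabilities for two languages over the same sample n-grams.
sample_grams = ["the", "he ", "the"]
first_average = weighted_average(sample_grams, {"the": 0.5, "he ": 0.25})
second_average = weighted_average(sample_grams, {"the": 0.25, "he ": 0.25})
predicted = "first" if first_average > second_average else "second"
```

Weighting by occurrence count gives the same result as averaging over every n-gram position in the sample, while requiring only one probability calculation per distinct n-gram.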

16. An electronic book reader, comprising:
- a display upon which to display electronic content of different languages;
  
  one or more processors;
  
  memory containing instructions that are executable by the one or more processors to perform actions comprising;
  
  displaying electronic content on the display, the electronic content including text;
  
  identifying multiple n-grams of at least a portion of the electronic content;
  
  for a first language;
  
  calculating a first probability based at least in part on a frequency of occurrence, in the first language, of a first sample n-gram of the multiple n-grams;
  
  calculating a second probability based at least in part on a frequency of occurrence, in the first language, of a second sample n-gram of the multiple n-grams;
  
  generating a first average based at least in part on the first probability and the second probability;
  
  for a second language;
  
  calculating a third probability based at least in part on a frequency of occurrence, in the second language, of the first sample n-gram of the multiple sample n-grams;
  
  calculating a fourth probability based at least in part on a frequency of occurrence, in the second language, of the second sample n-gram of the multiple n-grams;
  
  generating a second average based at least in part on the third probability and the fourth probability;
  
  determining a language of the sample electronic text based at least in part on comparing at least the first average and the second average;
  
  receiving designation of a first word within the electronic content;
  
  looking up a meaning of the designated first word in a dictionary of the determined language; and
  
  presenting the meaning of the designated word to the user.
- Dependent Claims (17, 18, 19, 20, 23)
- - 17. The electronic book reader of claim 16, wherein the at least a portion of the electronic content comprises text surrounding the designated first word.
  - 18. The electronic book reader of claim 16, wherein the at least a portion of the electronic content comprises text adjacent to the designated first word.
  - 19. The electronic book reader of claim 16, wherein the at least a portion of the electronic content comprises at least a paragraph that contains the designated first word.
  - 20. The electronic book reader of claim 16, wherein the at least a portion of the electronic content comprises the text of the electronic content.
  - 23. The electronic book reader of claim 16, wherein generating the first average comprises:
    - calculating a first product of the first probability and a number of times the first sample n-gram occurs in the electronic content;
      
      calculating a second product of the second probability and a number of times the second sample n-gram occurs in the electronic content; and
      
      summing a set of calculated products that include the first product and the second product.
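
Claims 16 to 20 describe selecting a portion of the content around a designated word, determining its language, and looking the word up in a dictionary of that language. A minimal Python sketch of that flow is shown below. The helper names, the blank-line paragraph convention, the `radius` parameter, and the stub language detector and dictionary used in the example are all hypothetical; any language predictor could be passed in as `detect_language`.

```python
def paragraph_containing(content, offset):
    """Return the paragraph containing the character offset.

    Paragraphs are assumed to be separated by blank lines.
    """
    start = content.rfind("\n\n", 0, offset)
    start = 0 if start == -1 else start + 2
    end = content.find("\n\n", offset)
    end = len(content) if end == -1 else end
    return content[start:end]


def surrounding_text(content, offset, radius=200):
    """Return up to `radius` characters on each side of the offset."""
    return content[max(0, offset - radius):offset + radius]


def look_up(content, offset, word, detect_language, dictionaries):
    """Detect the language of the word's paragraph, then look the word up."""
    context = paragraph_containing(content, offset)
    language = detect_language(context)
    # Returns None as the meaning when no dictionary entry exists.
    return language, dictionaries.get(language, {}).get(word)


# Stub detector and dictionary, for illustration only.
content = "The cat sat.\n\nEl gato duerme.\n\nEnd."
offset = content.index("gato")
stub_detector = lambda text: "es" if "El" in text else "en"
dictionaries = {"es": {"gato": "cat"}}
result = look_up(content, offset, "gato", stub_detector, dictionaries)
```

Using the surrounding paragraph rather than the whole book lets a multilingual text resolve each designated word against the language of its own passage.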

Current Assignee: Amazon Technologies, Inc. (Amazon.com, Inc.)
Original Assignee: Amazon Technologies, Inc. (Amazon.com, Inc.)
Inventors: Gershnik, Eugene
Primary Examiner(s): Desir, Pierre-Louis
Assistant Examiner(s): Sharma, Neeraj
Application Number: US13/050,726
Publication Number: US 20120239379A1
Time in Patent Office: 2,119 Days
Field of Search: 704/235, 704/8, 704/9, 704/3, 704/240, 704/2, 704/260, 704/1, 704/12, 704/255, 704/10, 706/12, 706/52, 435/6.11, 382/230, 382/229
US Class Current: 1/1
CPC Class Codes: G06F 40/263 Language identification
