N-Gram-based language prediction
First Claim
1. A computer-implemented method, comprising:
under control of a device comprising one or more processors configured with executable instructions, receiving at a graphical user interface of the device, user selection of a sample electronic text;
identifying multiple sample n-grams of the sample electronic text;
for a first language:
identifying a first set of n-grams that occur in a first language reference corresponding to the first language;
calculating a first set of Bayesian probabilities, including calculating a first Bayesian probability based at least in part on a frequency of occurrence, in the first set of n-grams, of a first sample n-gram of the multiple sample n-grams; and
calculating a first average of the first set of Bayesian probabilities;
for a second language:
identifying a second set of n-grams that occur in the second language reference corresponding to the second language;
calculating a second set of Bayesian probabilities, including calculating a second Bayesian probability based at least in part on a frequency of occurrence, in the second set of n-grams, of a second sample n-gram of the multiple sample n-grams; and
calculating a second average of the second set of Bayesian probabilities;
comparing at least the first average and the second average;
determining a language of the sample electronic text based at least in part on the comparing at least the first average and the second average;
determining a meaning of a word of the sample electronic text in a dictionary of the language; and
presenting the meaning of the word on a display of the device.
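The claim refers to a Bayesian probability derived from how often a sample n-gram occurs in a language reference, but it does not give a formula. One plausible reading, offered here only as a sketch, is the posterior P(language | n-gram) computed by Bayes' rule over the candidate languages. The function names, the character bigrams, the uniform priors, the fallback to the prior for unseen n-grams, and the tiny reference strings are all assumptions made for illustration and do not come from the patent.

```python
from collections import Counter

def ngram_counts(text, n=2):
    """Count overlapping character n-grams in text (lowercased)."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def posterior(gram, language, counts_by_language, priors):
    """P(language | gram) by Bayes' rule over the candidate languages."""
    def likelihood(lang):
        # Relative frequency of the n-gram in that language's reference.
        counts = counts_by_language[lang]
        total = sum(counts.values())
        return counts[gram] / total if total else 0.0

    evidence = sum(likelihood(lang) * priors[lang] for lang in counts_by_language)
    if evidence == 0.0:
        # Assumption: an n-gram unseen in every reference carries no
        # information, so fall back to the prior.
        return priors[language]
    return likelihood(language) * priors[language] / evidence

def average_posterior(sample, language, counts_by_language, priors, n=2):
    """Mean of P(language | gram) over the n-grams of the sample."""
    sample = sample.lower()
    grams = [sample[i:i + n] for i in range(len(sample) - n + 1)]
    if not grams:
        return 0.0
    total = sum(posterior(g, language, counts_by_language, priors) for g in grams)
    return total / len(grams)

# Hypothetical, deliberately tiny language references.
COUNTS = {
    "english": ngram_counts("the then there"),
    "german": ngram_counts("der die das"),
}
PRIORS = {"english": 0.5, "german": 0.5}
```

Calling `average_posterior("the", "english", COUNTS, PRIORS)` and the same for `"german"` yields the two averages that the comparing step would then rank. A real system would use much larger references and would likely smooth the counts.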
Abstract
Techniques are described for predicting the language of a text excerpt. The language prediction is accomplished by comparing n-grams of the text excerpt with n-grams of different language references. A probability is calculated for each n-gram of the text excerpt with respect to each of the language references. The calculated probabilities corresponding to a single language are then averaged to yield an overall probability corresponding to that language, and the resulting overall probabilities are compared to find the most likely language of the sample text.
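The flow in the abstract (extract n-grams, score each against every language reference, average per language, compare) can be sketched roughly as follows. This is an illustrative sketch only: the abstract does not say whether n-grams are taken over characters or words, how the per-n-gram probability is computed, or how unseen n-grams are handled. The sketch assumes character trigrams, plain relative frequency as the probability, a probability of zero for unseen n-grams, and two toy reference strings. All function and variable names are hypothetical.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of text, lowercased."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def ngram_probabilities(reference_text, n=3):
    """Relative frequency of each n-gram within one language reference."""
    counts = Counter(char_ngrams(reference_text, n))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}

def predict_language(sample, references, n=3):
    """Average the per-n-gram probabilities per language; return the best."""
    grams = char_ngrams(sample, n)
    averages = {}
    for language, reference_text in references.items():
        probs = ngram_probabilities(reference_text, n)
        # Assumption: n-grams absent from the reference score 0.
        scores = [probs.get(gram, 0.0) for gram in grams]
        averages[language] = sum(scores) / len(scores) if scores else 0.0
    return max(averages, key=averages.get), averages

# Hypothetical toy references; a real reference would be a large corpus.
REFERENCES = {
    "english": "the quick brown fox jumps over the lazy dog and the cat",
    "spanish": "el rapido zorro marron salta sobre el perro perezoso y el gato",
}

language, averages = predict_language("the dog and the fox", REFERENCES)
```

With these toy references, none of the sample's trigrams occur in the Spanish string, so the English average is the larger of the two. Averaging, rather than multiplying, keeps a single unseen n-gram from forcing the overall score to zero.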
29 Citations
23 Claims
1. A computer-implemented method, comprising:
under control of a device comprising one or more processors configured with executable instructions, receiving at a graphical user interface of the device, user selection of a sample electronic text;
identifying multiple sample n-grams of the sample electronic text;
for a first language:
identifying a first set of n-grams that occur in a first language reference corresponding to the first language;
calculating a first set of Bayesian probabilities, including calculating a first Bayesian probability based at least in part on a frequency of occurrence, in the first set of n-grams, of a first sample n-gram of the multiple sample n-grams; and
calculating a first average of the first set of Bayesian probabilities;
for a second language:
identifying a second set of n-grams that occur in the second language reference corresponding to the second language;
calculating a second set of Bayesian probabilities, including calculating a second Bayesian probability based at least in part on a frequency of occurrence, in the second set of n-grams, of a second sample n-gram of the multiple sample n-grams; and
calculating a second average of the second set of Bayesian probabilities;
comparing at least the first average and the second average;
determining a language of the sample electronic text based at least in part on the comparing at least the first average and the second average;
determining a meaning of a word of the sample electronic text in a dictionary of the language; and
presenting the meaning of the word on a display of the device.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 21, 22
9. A computer-implemented method, comprising:
under control of a device comprising one or more processors configured with executable instructions, receiving at a graphical user interface of the device user selection of a sample electronic text;
identifying multiple sample n-grams of the sample electronic text;
for a first language:
calculating a first probability based at least in part on a frequency of occurrence, in the first language, of a first sample n-gram of the multiple n-grams;
calculating a second probability based at least in part on a frequency of occurrence, in the first language, of a second sample n-gram of the multiple n-grams;
generating a first average based at least in part on the first probability and the second probability;
for a second language:
calculating a third probability based at least in part on a frequency of occurrence, in the second language, of the first sample n-gram of the multiple sample n-grams;
calculating a fourth probability based at least in part on a frequency of occurrence, in the second language, of the second sample n-gram of the multiple n-grams;
generating a second average based at least in part on the third probability and the fourth probability;
determining a language of the sample electronic text based at least in part on comparing at least the first average and the second average;
displaying, via the graphical user interface, an indication of the language;
performing, via the device, a language-dependent operation based at least in part on the language of the sample electronic text; and
displaying, via the graphical user interface, information associated with the language-dependent operation.
Dependent claims: 10, 11, 12, 13, 14, 15
16. An electronic book reader, comprising:
a display upon which to display electronic content of different languages;
one or more processors;
memory containing instructions that are executable by the one or more processors to perform actions comprising:
displaying electronic content on the display, the electronic content including text;
identifying multiple n-grams of at least a portion of the electronic content;
for a first language:
calculating a first probability based at least in part on a frequency of occurrence, in the first language, of a first sample n-gram of the multiple n-grams;
calculating a second probability based at least in part on a frequency of occurrence, in the first language, of a second sample n-gram of the multiple n-grams;
generating a first average based at least in part on the first probability and the second probability;
for a second language:
calculating a third probability based at least in part on a frequency of occurrence, in the second language, of the first sample n-gram of the multiple sample n-grams;
calculating a fourth probability based at least in part on a frequency of occurrence, in the second language, of the second sample n-gram of the multiple n-grams;
generating a second average based at least in part on the third probability and the fourth probability;
determining a language of the sample electronic text based at least in part on comparing at least the first average and the second average;
receiving designation of a first word within the electronic content;
looking up a meaning of the designated first word in a dictionary of the determined language; and
presenting the meaning of the designated word to the user.
Dependent claims: 17, 18, 19, 20, 23
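The final steps of the reader claim, looking up a designated word in a dictionary of the determined language, are an example of a language-dependent operation. The sketch below assumes the language has already been determined and is passed in as a string. The dictionary contents, the `DICTIONARIES` structure, and the `look_up_meaning` function are hypothetical and serve only to show why the lookup depends on the detected language.

```python
# Hypothetical per-language dictionaries, keyed by language name.
DICTIONARIES = {
    "english": {"pie": "a baked dish with a pastry crust"},
    "spanish": {"pie": "foot"},
}

def look_up_meaning(word, language, dictionaries=DICTIONARIES):
    """Return the meaning of word in the given language's dictionary.

    Returns None when the language or the word is not available.
    """
    entries = dictionaries.get(language, {})
    return entries.get(word.lower())

# The same designated word resolves differently per detected language.
english_meaning = look_up_meaning("pie", "english")
spanish_meaning = look_up_meaning("Pie", "spanish")
```

Here the designated word "pie" maps to different entries depending on whether the surrounding text was identified as English or Spanish, which is the situation the language determination step is meant to resolve.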
Specification