Spell correction with hidden markov models on online social networks
First Claim
1. A method comprising, by one or more computing devices:
- receiving, from a client system of a user of an online social network, a search query comprising one or more n-grams, wherein the n-grams comprise one or more misspelled n-grams;
identifying, for each misspelled n-gram, one or more variant-tokens;
calculating, for each identified variant-token of a misspelled n-gram, a feature value indicating a likelihood of the variant-token being a correctly spelled n-gram of the misspelled n-gram, wherein the feature value is based at least on the identified variant-token, the misspelled n-gram, and one or more variant-tokens corresponding to one or more n-grams preceding the misspelled n-gram;
generating a plurality of unique combinations of the n-grams and variant-tokens, wherein each unique combination comprises a variant-token corresponding to each misspelled n-gram;
calculating a sequence-score for each unique combination based at least in part on the calculated feature values of the variant-tokens of the unique combination, wherein the sequence-score for each unique combination indicates a suitability of the unique combination for correcting the search query; and
generating one or more corrected queries, each corrected query comprising a unique combination having a sequence-score greater than a threshold sequence-score; and
sending, to the client system of the user for display in response to receiving the search query, one or more of the corrected queries.
2 Assignments
0 Petitions
Accused Products
Abstract
In one embodiment, a method includes receiving a search query including one or more n-grams, where the n-grams include one or more misspelled n-grams, identifying one or more variant-tokens for each misspelled n-gram, calculating a feature value for each identified variant-token based at least on the identified variant-token, the misspelled n-gram, and one or more variant-tokens corresponding to one or more n-grams preceding the misspelled n-gram, generating one or more unique combinations of the n-grams and variant-tokens, calculating a sequence-score for each unique combination based at least in part on the calculated feature values of the variant-tokens of the unique combination, generating one or more corrected queries, where each corrected query includes a unique combination having a sequence-score greater than a threshold sequence-score, and sending one or more of the corrected queries to a user for display.
-
Citations
19 Claims
-
1. A method comprising, by one or more computing devices:
-
receiving, from a client system of a user of an online social network, a search query comprising one or more n-grams, wherein the n-grams comprise one or more misspelled n-grams; identifying, for each misspelled n-gram, one or more variant-tokens; calculating, for each identified variant-token of a misspelled n-gram, a feature value indicating a likelihood of the variant-token being a correctly spelled n-gram of the misspelled n-gram, wherein the feature value is based at least on the identified variant-token, the misspelled n-gram, and one or more variant-tokens corresponding to one or more n-grams preceding the misspelled n-gram; generating a plurality of unique combinations of the n-grams and variant-tokens, wherein each unique combination comprises a variant-token corresponding to each misspelled n-gram; calculating a sequence-score for each unique combination based at least in part on the calculated feature values of the variant-tokens of the unique combination, wherein the sequence-score for each unique combination indicates a suitability of the unique combination for correcting the search query; and generating one or more corrected queries, each corrected query comprising a unique combination having a sequence-score greater than a threshold sequence-score; and sending, to the client system of the user for display in response to receiving the search query, one or more of the corrected queries.
-
-
2. The method of claim 1, further comprising:
-
receiving from the user a selection of one of the corrected queries; identifying one or more objects matching the selected query; and sending, to the client system of the user, a search-result page responsive to the selected query, the search-results page comprising one or more references to one or more of the identified objects, respectively.
-
-
3. The method of claim 1, wherein the calculated feature value for each identified variant-token of a misspelled n-gram comprises a transitional logarithmic probability (p) that the identified variant-token corresponds to the corrected misspelled n-gram, wherein:
p=P(x|y, z), wherein; x=the identified variant-token corresponding to the corrected misspelled n-gram; y=the misspelled n-gram entered by the user; and z=a variant-token corresponding to the n-gram immediately preceding the misspelled n-gram.
-
4. The method of claim 3, wherein the transitional logarithm probability is determined based at least on a Perceptron training algorithm.
-
5. The method of claim 3, wherein the transitional logarithm probability is calculated based at least on a contextual speller model.
-
6. The method of claim 1, wherein the calculated feature value for each identified variant-token of a misspelled n-gram is based at least on one or more language models.
-
7. The method of claim 1, wherein if there are no n-grams preceding the misspelled n-gram, then the variant-token corresponding to an n-gram preceding the misspelled n-gram is a pre-determined placeholder.
-
8. The method of claim 1, wherein the calculated sequence-score for each unique combination is based at least in part on a calculated sum of the calculated feature values of the variant-tokens of the unique combination.
-
9. The method of claim 1, wherein calculating the sequence-score for each unique combination based at least in part on the calculated feature values of the variant-tokens of the unique combination comprises:
-
weighing each of the calculated feature values of the variant-tokens of the unique combination by a pre-determined scalar value; and calculating the sequence-score of the unique combination based at least on the weighted calculated feature values of the variant-tokens of the unique combination.
-
-
10. The method of claim 9, wherein the pre-determined scalar value is determined based at least on one or more learning algorithms.
-
11. The method of claim 10, wherein one of the learning algorithms is a Perceptron training algorithm.
-
12. The method of claim 9, wherein calculating the sequence-score for each unique combination based at least in part on the calculated feature values of the variant-tokens of the unique combination comprises:
-
determining a feature vector of the unique combination based at least on the calculated feature values of the variant-tokens of the unique combination; determining a weighted vector based at least on the pre-determined scalar values of the calculated feature values of the variant-tokens of the unique combination; and calculating the sequence-score for the unique combination based at least on a dot product of the feature vector and the weighted vector of the unique combination.
-
-
13. The method of claim 1, wherein one or more of the identified variant-tokens for each misspelled n-gram are uni-grams.
-
14. The method of claim 1, wherein an identified variant-token for a misspelled n-gram comprises at least two uni-grams.
-
15. The method of claim 14, wherein calculating a feature value for the identified variant-token for the misspelled n-gram comprising at least two uni-grams comprises:
-
calculating, for each uni-gram of the identified variant-token, a feature value based at least on the uni-gram, the misspelled n-gram, and a variant-token corresponding to an n-gram preceding the misspelled n-gram; and determining the feature value for the identified variant-token based at least on an average of the calculated feature values of the at least two uni-grams-of the variant-token.
-
-
16. The method of claim 1, wherein generating the unique combinations of the n-grams and variant-tokens comprises:
utilizing Viterbi algorithm to generate the unique combinations of the n-grams and variant-tokens.
-
17. The method of claim 16, wherein utilizing Viterbi algorithm reduces a time for generating the unique combinations of the n-grams and variant-tokens.
-
18. One or more computer-readable non-transitory storage media embodying software that is operable when executed to:
-
receive, from a client system of a user of an online social network, a search query comprising one or more n-grams, wherein the n-grams comprise one or more misspelled n-grams; identify, for each misspelled n-gram, one or more variant-tokens; calculate, for each identified variant-token of a misspelled n-gram, a feature value indicating a likelihood of the variant-token being a correctly spelled n-gram of the misspelled n-gram, wherein the feature value is based at least on the identified variant-token, the misspelled n-gram, and one or more variant-tokens corresponding to one or more n-grams preceding the misspelled n-gram; generate a plurality of unique combinations of the n-grams and variant-tokens, wherein each unique combination comprises a variant-token corresponding to each misspelled n-gram; calculate a sequence-score for each unique combination based at least in part on the calculated feature values of the variant-tokens of the unique combination, wherein the sequence-score for each unique combination indicates a suitability of the unique combination for correcting the search query; and generate one or more corrected queries, each corrected query comprising a unique combination having a sequence-score greater than a threshold sequence-score; and send, to the client system of the user for display in response to receiving the search query, one or more of the corrected queries.
-
-
19. A system comprising:
- one or more processors; and
a memory coupled to the processors comprising instructions executable by the processors, the processors operable when executing the instructions to;receive, from a client system of a user of an online social network, a search query comprising one or more n-grams, wherein the n-grams comprise one or more misspelled n-grams; identify, for each misspelled n-gram, one or more variant-tokens; calculate, for each identified variant-token of a misspelled n-gram, a feature value indicating a likelihood of the variant-token being a correctly spelled n-gram of the misspelled n-gram, wherein the feature value is based at least on the identified variant-token, the misspelled n-gram, and one or more variant-tokens corresponding to one or more n-grams preceding the misspelled n-gram; generate a plurality of unique combinations of the n-grams and variant-tokens, wherein each unique combination comprises a variant-token corresponding to each misspelled n-gram; calculate a sequence-score for each unique combination based at least in part on the calculated feature values of the variant-tokens of the unique combination; and generate one or more corrected queries, each corrected query comprising a unique combination having a sequence-score greater than a threshold sequence-score, wherein the sequence-score for each unique combination indicates a suitability of the unique combination for correcting the search query; and send, to the client system of the user for display in response to receiving the search query, one or more of the corrected queries.
- one or more processors; and
Specification