Training a probabilistic spelling checker from structured data
Abstract
A spelling system derives a language model for a particular domain of structured data, the language model enabling determinations of alternative spellings of queries or other strings of text from that domain. More specifically, the spelling system calculates (a) probabilities that the various query entity types—such as STREET, CITY, or STATE for queries in the geographical domain—are arranged in each of the various possible orders, and (b) probabilities that an arbitrary query references given particular ones of the entities, such as the street “El Camino Real.” Based on the calculated probabilities, the spelling system generates a language model that has associated scores (e.g., probabilities) for each of a set of probable entity name orderings, where the total number of entity name orderings is substantially less than the number of all possible orderings. The language model can be applied to determine probabilities of arbitrary queries, and thus to suggest alternative queries more likely to represent what a user intended.
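The scoring idea in the abstract can be illustrated with a minimal sketch. It assumes, purely for illustration, that a combination's score is the template probability multiplied by the probability of each name given its entity type, treating names as independent; the source does not state this exact formula. The function name `build_language_model`, the `top_k` cutoff, and all probability values below are hypothetical.

```python
from itertools import product
from math import prod


def build_language_model(template_probs, name_probs, top_k=4):
    """Score name combinations and keep only the top_k most probable ones.

    Assumed scoring (not taken from the source):
        score = P(template) * product of P(name | entity type)
    """
    scored = {}
    for template, p_template in template_probs.items():
        # One collection of (name, probability) choices per slot in the template.
        slots = [name_probs[entity_type].items() for entity_type in template]
        for combo in product(*slots):
            names = tuple(name for name, _ in combo)
            score = p_template * prod(p for _, p in combo)
            scored[names] = scored.get(names, 0.0) + score
    # Pruning keeps the model far smaller than the set of all possible orderings.
    best = sorted(scored.items(), key=lambda item: item[1], reverse=True)[:top_k]
    return dict(best)


# Hypothetical inputs: (a) template probabilities, (b) per-type name probabilities.
template_probs = {("STREET", "CITY"): 0.6, ("CITY", "STATE"): 0.4}
name_probs = {
    "STREET": {"el camino real": 0.7, "main street": 0.3},
    "CITY": {"palo alto": 0.5, "mountain view": 0.5},
    "STATE": {"california": 1.0},
}

model = build_language_model(template_probs, name_probs)
for names, score in model.items():
    print(" ".join(names), round(score, 3))
```

With these made-up numbers, the combinations built on "main street" score lowest and are pruned, which mirrors the abstract's point that only probable orderings are retained.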
17 Claims
1. A computer-implemented method for generating a language model for computing probabilities of occurrence of queries, comprising:
- accessing a database comprising a plurality of entities, each entity having one or more names and an entity type;
- accessing a query log comprising queries previously entered by users, a plurality of the queries including names of ones of the entities in the database;
- generating, from the query log, a template distribution quantifying probabilities that entity types of the entities named in the queries correspond to ones of a plurality of query templates, each query template comprising an ordered set of the entity types appearing in the database;
- generating the language model from the template distribution, the language model comprising a set of combinations of names of the entities and associated scores, the scores based on probabilities of occurrence of the combinations in a query; and
- storing the language model on a computer readable storage device;
wherein generating, from the query log, a template distribution comprises identifying, for each of the plurality of queries, a query template matching each query based on the names of the entities associated with the queries and an ordering of the names of the entities associated with the queries;
and wherein generating the template distribution further comprises determining, for each distinct query template, a count of the plurality of the queries that correspond to the template.
Dependent claims: 2, 3, 4, 5, 6, 7, 8.
9. A non-transitory computer-readable storage medium having executable computer program instructions embodied therein for generating a language model for analyzing probabilities of occurrence of text strings in a given domain, actions of the computer program instructions comprising:
- accessing a database comprising a plurality of entities, each entity having one or more names and an entity type;
- accessing a query log comprising queries previously entered by users, a plurality of the queries including names of ones of the entities in the database;
- generating, from the query log, a template distribution quantifying probabilities that entity types of the entities named in the queries correspond to ones of a plurality of query templates, each query template comprising an ordered set of the entity types appearing in the database; and
- generating the language model from the template distribution, the language model comprising a set of combinations of names of the entities and associated scores, the scores based on probabilities of occurrence of the combinations in a query;
wherein generating, from the query log, a template distribution comprises identifying, for each of the plurality of queries, a query template matching each query based on the names of the entities associated with the queries and an ordering of the names of the entities associated with the queries;
and wherein generating the template distribution further comprises determining, for each distinct query template, a count of the plurality of the queries that correspond to the template.
Dependent claims: 10, 11, 12, 13, 14.
15. A computer-implemented method for generating a language model for analyzing probabilities of occurrence of queries, the method comprising:
- accessing a database comprising a plurality of entities, each entity corresponding having one or more names and an entity type;
- accessing a query log comprising a plurality of queries previously entered by users, the plurality of the queries including names of ones of the entities in the database;
- generating a template distribution quantifying probabilities of an arbitrary user query corresponding to ones of a plurality of query templates, each query template comprising an ordered set of the entity types, the generating comprising;
  - identifying, for each query of the plurality of queries, a query template matching each query based on the names of the entities associated with the queries and an ordering of the names of the entities associated with the queries, and
  - determining, for each distinct query template, a probability that a query corresponds to the matching query template; and
- generating the language model from the template distribution, the language model comprising a set of combinations of names of the entities and associated scores, the scores based on probabilities of occurrence of the combinations in a query.
Dependent claims: 16, 17.
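The template-matching and counting steps recited in the independent claims can be sketched roughly as follows. This is only one possible reading: the greedy longest-match segmentation, the `max_words` limit, the one-type-per-name lookup table `name_to_type`, and the sample queries are all assumptions made for illustration, not details taken from the claims.

```python
from collections import Counter


def match_template(query, name_to_type, max_words=4):
    """Return the ordered tuple of entity types named in the query, or None."""
    words = query.lower().split()
    template = []
    i = 0
    while i < len(words):
        # Greedy longest match (an assumed strategy): try the longest name first.
        for length in range(min(max_words, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + length])
            if candidate in name_to_type:
                template.append(name_to_type[candidate])
                i += length
                break
        else:
            return None  # a word belongs to no known entity name
    return tuple(template)


def template_distribution(query_log, name_to_type):
    """Count queries per distinct template, then normalize counts to probabilities."""
    counts = Counter()
    for query in query_log:
        template = match_template(query, name_to_type)
        if template is not None:
            counts[template] += 1
    total = sum(counts.values())
    return {template: n / total for template, n in counts.items()}


# Hypothetical database (name -> entity type) and query log.
name_to_type = {
    "el camino real": "STREET",
    "main street": "STREET",
    "palo alto": "CITY",
    "california": "STATE",
}
query_log = [
    "El Camino Real Palo Alto",
    "main street palo alto",
    "Palo Alto California",
    "pizza near me",  # names no entity in the database, so it is skipped
]

distribution = template_distribution(query_log, name_to_type)
print(distribution)
```

A production system would also need to handle punctuation and names that map to more than one entity type; the sketch ignores both for brevity.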
Specification