Training a probabilistic spelling checker from structured data
First Claim
1. A computer-implemented method for generating a geographic language model for computing probabilities of occurrence of geographic queries, comprising:
accessing a geographic database comprising:
a plurality of geographic entities, each geographic entity corresponding to a geographic region and having one or more names and an entity type, and
a plurality of links between pairs of the geographic entities;
accessing a query log comprising geographic queries previously entered by users, a plurality of the geographic queries including names of ones of the geographic entities in the geographic database;
generating, from the query log, a template distribution quantifying probabilities that entity types of the geographic entities named in the geographic queries correspond to ones of a plurality of query templates, each query template comprising an ordered set of the entity types appearing in the geographic database;
generating a geographic distribution from the query log quantifying probabilities of queries in the query log referencing ones of the geographic entities in the geographic database;
generating the geographic language model from the template distribution and the geographic distribution, the geographic language model comprising a set of combinations of names of the geographic entities and associated scores, the scores based on probabilities of occurrence of the combinations in a geographic query; and
storing the geographic language model on a computer readable storage device.
Abstract
A spelling system derives a language model for a particular domain of structured data, the language model enabling determinations of alternative spellings of queries or other strings of text from that domain. More specifically, the spelling system calculates (a) probabilities that the various query entity types—such as STREET, CITY, or STATE for queries in the geographical domain—are arranged in each of the various possible orders, and (b) probabilities that an arbitrary query references given particular ones of the entities, such as the street “El Camino Real.” Based on the calculated probabilities, the spelling system generates a language model that has associated scores (e.g., probabilities) for each of a set of probable entity name orderings, where the total number of entity name orderings is substantially less than the number of all possible orderings. The language model can be applied to determine probabilities of arbitrary queries, and thus to suggest alternative queries more likely to represent what a user intended.
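The pipeline summarized in the abstract can be illustrated with a minimal Python sketch. Everything named here is hypothetical: the toy `ENTITY_TYPES` table, the pre-segmented `QUERY_LOG`, and the function names are illustrative only. The scoring rule, which multiplies the template probability by per-type entity probabilities, is an assumption made for the example; the abstract does not state how the two distributions are combined. The `min_score` cutoff is likewise an assumed way of keeping the set of stored orderings small.

```python
from collections import Counter, defaultdict
from itertools import product
from math import prod

# Hypothetical toy data: entity name -> entity type.
ENTITY_TYPES = {
    "el camino real": "STREET",
    "main street": "STREET",
    "palo alto": "CITY",
    "ca": "STATE",
}

# Hypothetical query log, assumed to be already segmented into entity names.
QUERY_LOG = [
    ("el camino real", "palo alto"),
    ("el camino real", "palo alto", "ca"),
    ("main street", "palo alto"),
    ("palo alto", "ca"),
]


def template_distribution(log, types):
    """(a) Probability of each ordered tuple of entity types (a query template)."""
    counts = Counter(tuple(types[name] for name in query) for query in log)
    total = sum(counts.values())
    return {template: c / total for template, c in counts.items()}


def entity_distribution(log):
    """(b) Fraction of logged queries that reference each entity name."""
    counts = Counter(name for query in log for name in set(query))
    return {name: c / len(log) for name, c in counts.items()}


def build_language_model(log, types, min_score=0.0):
    """Score name orderings; the product rule below is an assumption."""
    templates = template_distribution(log, types)
    refs = entity_distribution(log)

    # Group entity probabilities by type so they can be normalized per type.
    by_type = defaultdict(dict)
    for name, p in refs.items():
        by_type[types[name]][name] = p

    model = {}
    for template, p_template in templates.items():
        pools = []
        for entity_type in template:
            total = sum(by_type[entity_type].values())
            pools.append([(n, p / total) for n, p in by_type[entity_type].items()])
        for combo in product(*pools):
            score = p_template * prod(p for _, p in combo)
            if score >= min_score:  # assumed pruning of improbable orderings
                model[tuple(n for n, _ in combo)] = score
    return model


model = build_language_model(QUERY_LOG, ENTITY_TYPES)
for names, score in sorted(model.items(), key=lambda kv: -kv[1]):
    print(f"{score:.3f}  {' '.join(names)}")
```

Because only orderings that match an observed template are enumerated, the model stays far smaller than the set of all possible name orderings. The sketch ignores the links between entities; a fuller version would presumably also use them to restrict combinations, for example to streets that actually lie in the named city.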
20 Claims
1. A computer-implemented method for generating a geographic language model for computing probabilities of occurrence of geographic queries, comprising:
accessing a geographic database comprising:
a plurality of geographic entities, each geographic entity corresponding to a geographic region and having one or more names and an entity type, and
a plurality of links between pairs of the geographic entities;
accessing a query log comprising geographic queries previously entered by users, a plurality of the geographic queries including names of ones of the geographic entities in the geographic database;
generating, from the query log, a template distribution quantifying probabilities that entity types of the geographic entities named in the geographic queries correspond to ones of a plurality of query templates, each query template comprising an ordered set of the entity types appearing in the geographic database;
generating a geographic distribution from the query log quantifying probabilities of queries in the query log referencing ones of the geographic entities in the geographic database;
generating the geographic language model from the template distribution and the geographic distribution, the geographic language model comprising a set of combinations of names of the geographic entities and associated scores, the scores based on probabilities of occurrence of the combinations in a geographic query; and
storing the geographic language model on a computer readable storage device.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
14. A non-transitory computer-readable storage medium having executable computer program instructions embodied therein for generating a language model for analyzing probabilities of occurrence of text strings in a given domain, actions of the computer program instructions comprising:
accessing a geographic database comprising:
a plurality of geographic entities, each geographic entity corresponding to a geographic region and having one or more names, and
a plurality of links between pairs of the geographic entities;
accessing a query log comprising geographic queries previously entered by users, a plurality of the geographic queries including names of ones of the geographic entities in the geographic database;
generating, from the query log, a template distribution quantifying probabilities that entity types of the geographic entities named in the geographic queries correspond to ones of a plurality of query templates, each query template comprising an ordered set of the entity types appearing in the geographic database;
generating a geographic distribution from the query log quantifying probabilities of queries in the query log referencing ones of the geographic entities in the geographic database;
generating the geographic language model from the template distribution and the geographic distribution, the geographic language model comprising a set of combinations of names of the geographic entities and associated scores, the scores based on probabilities of occurrence of the combinations in a geographic query; and
storing the geographic language model on a computer readable storage device.
Dependent claims: 15, 16, 17
18. A computer-implemented method for generating a geographic language model for analyzing probabilities of occurrence of geographic queries, the method comprising:
accessing a geographic database comprising:
a plurality of geographic entities, each geographic entity corresponding to a geographic region and having one or more names, and
a plurality of links between pairs of the geographic entities;
accessing a query log comprising a plurality of geographic queries previously entered by users, a plurality of the geographic queries including names of ones of the geographic entities in the geographic database;
generating, from the query log, a template distribution quantifying probabilities that entity types of the geographic entities named in the geographic queries correspond to ones of a plurality of query templates, the generating comprising:
identifying, for each of a plurality of the geographic queries in the query log, a query template comprising an ordered set of the entity types appearing in the geographic database, and
determining, for each distinct query template, a probability that the arbitrary user query corresponds to the template;
generating a geographic distribution from the query log quantifying probabilities of queries in the query log referencing ones of the geographic entities in the geographic database, comprising:
identifying queries comprising one of the names associated with one of the entities, and which have results falling within the geographic region corresponding to the entity; and
generating the geographic language model using the template distribution and the geographic distribution, the geographic language model comprising a set of orderings of names of the geographic entities and associated scores, the scores based on probabilities of occurrence of the orderings in a geographic query.
Dependent claims: 19, 20
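The geographic-distribution step of claim 18 counts a query toward an entity only when the query contains one of the entity's names and its result falls inside the entity's region. The sketch below illustrates that test in Python under several assumptions that the claim does not specify: regions are modeled as latitude/longitude bounding boxes, each logged query carries a single result coordinate, name matching is a naive substring check, and the log is non-empty. The `Entity` and `LoggedQuery` classes, the identifiers, and the coordinates are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    entity_id: str
    names: tuple   # one or more lowercase names for the entity
    bbox: tuple    # assumed region shape: (min_lat, min_lon, max_lat, max_lon)


@dataclass(frozen=True)
class LoggedQuery:
    text: str
    result: tuple  # assumed: (lat, lon) of the result returned for the query


def in_region(point, bbox):
    """True if the (lat, lon) point lies inside the bounding box."""
    lat, lon = point
    min_lat, min_lon, max_lat, max_lon = bbox
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon


def geographic_distribution(entities, log):
    """Fraction of logged queries that name an entity AND resolve inside it."""
    counts = {e.entity_id: 0 for e in entities}
    for query in log:
        text = query.text.lower()
        for e in entities:
            names_entity = any(name in text for name in e.names)  # naive match
            if names_entity and in_region(query.result, e.bbox):
                counts[e.entity_id] += 1
    return {eid: c / len(log) for eid, c in counts.items()}


# Two hypothetical entities sharing a name; the result location disambiguates.
ENTITIES = [
    Entity("springfield_il", ("springfield",), (39.6, -89.8, 39.9, -89.5)),
    Entity("springfield_ma", ("springfield",), (42.0, -72.7, 42.2, -72.4)),
]
LOG = [
    LoggedQuery("main st springfield", (39.80, -89.64)),
    LoggedQuery("springfield museums", (42.10, -72.58)),
    LoggedQuery("springfield", (39.78, -89.65)),
    LoggedQuery("pizza boston", (42.36, -71.06)),
]

distribution = geographic_distribution(ENTITIES, LOG)
print(distribution)
```

Requiring the result to fall within the region is what lets identically named entities receive separate probabilities, which a count based on names alone could not provide.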
Specification