Training a probabilistic spelling checker from structured data

US 9,558,179 B1
Filed: 12/05/2013
Issued: 01/31/2017
Est. Priority Date: 01/04/2011
Status: Active Grant

Abstract

A spelling system derives a language model for a particular domain of structured data, the language model enabling determinations of alternative spellings of queries or other strings of text from that domain. More specifically, the spelling system calculates (a) probabilities that the various query entity types—such as STREET, CITY, or STATE for queries in the geographical domain—are arranged in each of the various possible orders, and (b) probabilities that an arbitrary query references given particular ones of the entities, such as the street “El Camino Real.” Based on the calculated probabilities, the spelling system generates a language model that has associated scores (e.g., probabilities) for each of a set of probable entity name orderings, where the total number of entity name orderings is substantially less than the number of all possible orderings. The language model can be applied to determine probabilities of arbitrary queries, and thus to suggest alternative queries more likely to represent what a user intended.
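As a rough illustration of step (a), the template distribution can be sketched as tagging each logged query with the ordered entity types of the names it contains, counting queries per template, and normalising the counts. The toy gazetteer `NAME_TO_TYPE`, the greedy longest-match tagger, and the sample query log are assumptions made for this example; the patent text quoted here does not prescribe them.

```python
from collections import Counter

# Hypothetical toy database: entity name -> entity type.
NAME_TO_TYPE = {
    "el camino real": "STREET",
    "main street": "STREET",
    "palo alto": "CITY",
    "springfield": "CITY",
    "california": "STATE",
}

def match_template(query, name_to_type=NAME_TO_TYPE, max_words=3):
    """Return the ordered tuple of entity types named in the query, or None."""
    words = query.lower().split()
    types, i = [], 0
    while i < len(words):
        # Greedy longest match of a known entity name starting at word i.
        for n in range(min(max_words, len(words) - i), 0, -1):
            name = " ".join(words[i:i + n])
            if name in name_to_type:
                types.append(name_to_type[name])
                i += n
                break
        else:
            return None  # word i does not start any known entity name
    return tuple(types) if types else None

def template_distribution(query_log):
    """Count queries per template and normalise the counts to probabilities."""
    counts = Counter()
    for q in query_log:
        template = match_template(q)
        if template is not None:
            counts[template] += 1
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

query_log = [
    "el camino real palo alto",
    "main street springfield",
    "palo alto california",
    "california palo alto",
    "pizza near me",  # names no entity, so it matches no template
]
dist = template_distribution(query_log)
```

Here `dist` maps each observed template, such as `("STREET", "CITY")`, to its share of the queries that matched some template.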

17 Claims

1. A computer-implemented method for generating a language model for computing probabilities of occurrence of queries, comprising:
- accessing a database comprising a plurality of entities, each entity having one or more names and an entity type;
  
  accessing a query log comprising queries previously entered by users, a plurality of the queries including names of ones of the entities in the database;
  
  generating, from the query log, a template distribution quantifying probabilities that entity types of the entities named in the queries correspond to ones of a plurality of query templates, each query template comprising an ordered set of the entity types appearing in the database;
  
  generating the language model from the template distribution, the language model comprising a set of combinations of names of the entities and associated scores, the scores based on probabilities of occurrence of the combinations in a query; and
  
  storing the language model on a computer readable storage device;
  
  wherein generating, from the query log, a template distribution comprises identifying, for each of the plurality of queries, a query template matching each query based on the names of the entities associated with the queries and an ordering of the names of the entities associated with the queries;
  
  and wherein generating the template distribution further comprises determining, for each distinct query template, a count of the plurality of the queries that correspond to the template.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8)
- - 2. The computer-implemented method of claim 1, further comprising computing conditional probabilities of occurrence of words in queries based at least in part on the language model.
  - 3. The computer-implemented method of claim 2, further comprising:
    - receiving a query from a user;
      
      generating an alternative query that is a variant of the received query;
      
      computing a probability of occurrence of the received query and a probability of occurrence of the alternative query based at least in part on the computed conditional probabilities; and
      
      responsive to the probability of occurrence of the alternative query being higher than the probability of occurrence of the received query, providing the alternative query to the user.
  - 4. The computer-implemented method of claim 1, wherein generating the language model further comprises:
    - for a pair of names comprising a name of a first entity and a name of a second entity and corresponding to a query template, computing a score for the pair using a probability of the corresponding query template defined by the template distribution.
  - 5. The computer-implemented method of claim 1, wherein generating the language model comprises, for each entity of a plurality of the entities:
    - identifying a type of the entity;
      
      identifying a set of related types, the related types comprising entity types that appear in one of the queries directly after the entity;
      
      identifying a set of neighboring entities, each neighboring entity neighboring the entity in an entity graph and having a type that is included in the set of related types; and
      
      generating a set of pairs of the names of the entity and the names of the entities in the set of neighboring entities.
  - 6. The computer-implemented method of claim 5, wherein generating the language model further comprises computing a score for each of the generated pairs.
  - 7. The computer-implemented method of claim 6, wherein the score for a generated pair is based on a probability of a template of the template distribution corresponding to the pair.
  - 8. The computer-implemented method of claim 6, wherein the score for a generated pair is based on a rank of the type of the entity.
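Claims 5 to 7 can be pictured with a small sketch: for each entity, find the types that may directly follow its type, collect the graph neighbours having one of those types, pair up the names, and score each pair with the probability of the matching template. The `Entity` record, the `ENTITIES` graph, and the `TEMPLATES` probabilities are hypothetical. Deriving the related types from two-type templates, rather than directly from the queries, is a simplifying assumption.

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    names: list   # one or more names for the entity
    etype: str    # entity type, e.g. "STREET"
    neighbors: list = field(default_factory=list)  # ids of adjacent entities

# Hypothetical entity graph keyed by entity id.
ENTITIES = {
    "s1": Entity(["el camino real", "ecr"], "STREET", ["c1"]),
    "c1": Entity(["palo alto"], "CITY", ["s1", "st1"]),
    "st1": Entity(["california", "ca"], "STATE", ["c1"]),
}

# Hypothetical template distribution over two-type templates.
TEMPLATES = {
    ("STREET", "CITY"): 0.5,
    ("CITY", "STATE"): 0.3,
    ("STATE", "CITY"): 0.2,
}

def related_types(etype, templates):
    """Types that appear directly after `etype` in at least one template."""
    return {t[i + 1] for t in templates
            for i in range(len(t) - 1) if t[i] == etype}

def scored_pairs(entities, templates):
    """Map (first name, second name) -> probability of the matching template."""
    scores = {}
    for ent in entities.values():
        allowed = related_types(ent.etype, templates)
        for neighbor_id in ent.neighbors:
            nb = entities[neighbor_id]
            if nb.etype not in allowed:
                continue  # this type ordering never occurs, so skip the pair
            p = templates.get((ent.etype, nb.etype), 0.0)
            for first in ent.names:
                for second in nb.names:
                    scores[(first, second)] = p
    return scores

pairs = scored_pairs(ENTITIES, TEMPLATES)
```

Only neighbouring entities in plausible type orders are paired. This keeps the number of stored name orderings well below the number of all possible orderings, as the abstract describes.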

9. A non-transitory computer-readable storage medium having executable computer program instructions embodied therein for generating a language model for analyzing probabilities of occurrence of text strings in a given domain, actions of the computer program instructions comprising:
- accessing a database comprising a plurality of entities, each entity having one or more names and an entity type;
  
  accessing a query log comprising queries previously entered by users, a plurality of the queries including names of ones of the entities in the database;
  
  generating, from the query log, a template distribution quantifying probabilities that entity types of the entities named in the queries correspond to ones of a plurality of query templates, each query template comprising an ordered set of the entity types appearing in the database; and
  
  generating the language model from the template distribution, the language model comprising a set of combinations of names of the entities and associated scores, the scores based on probabilities of occurrence of the combinations in a query;
  
  wherein generating, from the query log, a template distribution comprises identifying, for each of the plurality of queries, a query template matching each query based on the names of the entities associated with the queries and an ordering of the names of the entities associated with the queries;
  
  and wherein generating the template distribution further comprises determining, for each distinct query template, a count of the plurality of the queries that correspond to the template.
- View Dependent Claims (10, 11, 12, 13, 14)
- - 10. The non-transitory computer-readable storage medium of claim 9, the actions further comprising:
    - computing conditional probabilities of occurrence of words in queries based at least in part on the language model.
  - 11. The non-transitory computer-readable storage medium of claim 10, the actions further comprising:
    - receiving a query from a user;
      
      generating an alternative query that is a variant of the received query;
      
      computing a probability of occurrence of the received query and a probability of occurrence of the alternative query based at least in part on the computed conditional probabilities; and
      
      responsive to the probability of occurrence of the alternative query being higher than the probability of occurrence of the received query, providing the alternative query to the user.
  - 12. The non-transitory computer-readable medium of claim 9, wherein generating the language model further comprises:
    - for a pair of names comprising a name of a first entity and a name of a second entity and corresponding to a query template, computing a score for the pair using a probability of the corresponding query template defined by the template distribution.
  - 13. The non-transitory computer-readable storage medium of claim 9, wherein generating the language model comprises, for each entity of a plurality of the entities:
    - identifying a type of the entity;
      
      identifying a set of related types, the related types comprising entity types that appear in one of the queries directly after the entity;
      
      identifying a set of neighboring entities, each neighboring entity neighboring the entity in an entity graph and having a type that is included in the set of related types; and
      
      generating a set of pairs of the names of the entity and the names of the entities in the set of neighboring entities.
  - 14. The non-transitory computer-readable storage medium of claim 13, wherein generating the language model further comprises computing a score for each of the generated pairs based on at least one of:
    - a probability of a template of the template distribution corresponding to the pair; and
      
      a rank of the type of the entity.
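The claims quoted here do not say how the conditional word probabilities of claims 2 and 10 are derived from the language model. One conventional option, assumed here purely for illustration, is a score-weighted bigram estimate over the words of each stored name combination. The `lm` dictionary and the `<s>` / `</s>` boundary markers are hypothetical.

```python
from collections import defaultdict

def conditional_word_probs(language_model):
    """Estimate P(next word | previous word) from scored name combinations."""
    bigram = defaultdict(float)    # weighted count of (previous, next)
    context = defaultdict(float)   # weighted count of previous word
    for names, score in language_model.items():
        words = ["<s>"] + " ".join(names).split() + ["</s>"]
        for prev, nxt in zip(words, words[1:]):
            bigram[(prev, nxt)] += score
            context[prev] += score
    return {(p, n): w / context[p] for (p, n), w in bigram.items()}

# Hypothetical language model: name combination -> score.
lm = {
    ("main street", "springfield"): 0.6,
    ("main street", "salem"): 0.2,
    ("salem", "oregon"): 0.2,
}
cond = conditional_word_probs(lm)
```

With these numbers, "springfield" follows "street" three times as often as "salem" does, so `cond[("street", "springfield")]` comes out near 0.75.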

15. A computer-implemented method for generating a language model for analyzing probabilities of occurrence of queries, the method comprising:
- accessing a database comprising a plurality of entities, each entity having one or more names and an entity type;
  
  accessing a query log comprising a plurality of queries previously entered by users, the plurality of the queries including names of ones of the entities in the database;
  
  generating a template distribution quantifying probabilities of an arbitrary user query corresponding to ones of a plurality of query templates, each query template comprising an ordered set of the entity types, the generating comprising;
  
  identifying, for each query of the plurality of queries, a query template matching each query based on the names of the entities associated with the queries and an ordering of the names of the entities associated with the queries, and
  
  determining, for each distinct query template, a probability that a query corresponds to the matching query template; and
  
  generating the language model from the template distribution, the language model comprising a set of combinations of names of the entities and associated scores, the scores based on probabilities of occurrence of the combinations in a query.
- View Dependent Claims (16, 17)
- - 16. The computer-implemented method of claim 15, wherein generating the language model comprises, for each entity of a plurality of the entities:
    - identifying a type of the entity;
      
      identifying a set of related types, the related types comprising entity types that appear in one of the queries directly after the entity;
      
      identifying a set of neighboring entities, each neighboring entity neighboring the entity in an entity graph and having a type that is included in the set of related types;
      
      generating a set of pairs of the names of the entity and the names of the entities in the set of neighboring entities; and
      
      for each of the generated pairs, computing a score for the pair based at least in part on;
      
      a probability of a template of the template distribution corresponding to the pair; and
      
      a rank of the type of the entity.
  - 17. The computer-implemented method of claim 15, further comprising:
    - computing conditional probabilities of occurrence of words in queries based at least in part on the language model;
      
      receiving a query from a user;
      
      generating an alternative query that is a variant of the received query;
      
      computing a probability of occurrence of the received query and a probability of occurrence of the alternative query based at least in part on the computed conditional probabilities; and
      
      responsive to the probability of occurrence of the alternative query being higher than the probability of occurrence of the received query, providing the alternative query to the user.
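The suggestion step of claims 3, 11, and 17 can be sketched as scoring the received query and its variants with a chain of conditional probabilities, then offering a variant only if it scores higher. The `COND` table, the `UNSEEN` floor for unknown word pairs, and the use of `difflib.get_close_matches` to generate one-word variants are all assumptions for this example, not details from the patent.

```python
import difflib
import math

# Hypothetical conditional probabilities P(next | previous), with boundary markers.
COND = {
    ("<s>", "main"): 1.0,
    ("main", "street"): 1.0,
    ("street", "springfield"): 0.75,
    ("street", "salem"): 0.25,
    ("springfield", "</s>"): 1.0,
    ("salem", "</s>"): 1.0,
}
VOCAB = sorted({w for pair in COND for w in pair if w not in ("<s>", "</s>")})
UNSEEN = 1e-6  # assumed floor probability for word pairs absent from the model

def query_log_prob(query, cond=COND):
    """Log-probability of a query as a product of conditional word probabilities."""
    words = ["<s>"] + query.lower().split() + ["</s>"]
    return sum(math.log(cond.get(pair, UNSEEN)) for pair in zip(words, words[1:]))

def variants(query, vocab=VOCAB):
    """Yield queries with one word replaced by a similar vocabulary word."""
    words = query.lower().split()
    for i, w in enumerate(words):
        for cand in difflib.get_close_matches(w, vocab, n=3, cutoff=0.7):
            if cand != w:
                yield " ".join(words[:i] + [cand] + words[i + 1:])

def suggest(query):
    """Return the most probable variant if it beats the received query, else None."""
    best = max(variants(query), key=query_log_prob, default=None)
    if best is not None and query_log_prob(best) > query_log_prob(query):
        return best
    return None

suggestion = suggest("main street springfeild")
```

Working in log space avoids underflow on long queries. A query that already scores at least as well as every variant yields `None`, so no alternative is offered.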

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Jurca, Radu; Fleury, Pascal; Murphy, Bruce Winston
Primary Examiner(s): Chaki, Kakali
Assistant Examiner(s): Wu, Fuming
Application Number: US14/098,394
Time in Patent Office: 1,153 Days
Field of Search: None
US Class Current: 1/1
CPC Class Codes:
G06F 16/29   Geographical information da...
G06F 40/232   Orthographic correction, e....
G06F 40/253   Grammatical analysis; Style...
G06N 20/00   Machine learning
