Language modeling of complete language sequences
Abstract
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for language modeling of complete language sequences. Training data indicating language sequences is accessed, and counts for a number of times each language sequence occurs in the training data are determined. A proper subset of the language sequences is selected, and a first component of a language model is trained. The first component includes first probability data for assigning scores to the selected language sequences. A second component of the language model is trained based on the training data, where the second component includes second probability data for assigning scores to language sequences that are not included in the selected language sequences. Adjustment data that normalizes the second probability data with respect to the first probability data is generated, and the first component, the second component, and the adjustment data are stored.
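The abstract describes a pipeline: count occurrences of each language sequence, select a proper subset by count, train a first component holding relative frequencies for the selected sequences, train a second component to score everything else, and compute adjustment data that normalizes the two. The patent gives no implementation, but the flow can be sketched as follows; the unigram back-off model, the top-k selection rule, and all names are illustrative assumptions, not the claimed method itself.

```python
from collections import Counter

def train_two_component_model(queries, top_k=2):
    """Sketch of the two-component language model described in the abstract.

    - First component: relative frequencies for a proper subset of
      high-count queries.
    - Second component: a stand-in unigram word model that scores
      queries outside that subset.
    - Adjustment data: a weight rescaling second-component scores so the
      leftover probability mass is shared with unseen queries.
    """
    counts = Counter(queries)
    total = sum(counts.values())

    # Select a proper subset of the queries based on the counts
    # (here, simply the top_k most frequent -- an assumption).
    selected = dict(counts.most_common(top_k))

    # First component: relative frequencies of the selected queries.
    first = {q: c / total for q, c in selected.items()}
    first_mass = sum(first.values())

    # Second component: add-one-smoothed unigram model over query words
    # (a stand-in for whatever model the patent's second component uses).
    word_counts = Counter(w for q in queries for w in q.split())
    word_total = sum(word_counts.values())
    vocab = len(word_counts)

    def second(query):
        p = 1.0
        for w in query.split():
            p *= (word_counts.get(w, 0) + 1) / (word_total + vocab)
        return p

    # Adjustment data: weight the second component by the probability
    # mass the first component leaves uncovered.
    adjustment = 1.0 - first_mass

    def score(query):
        if query in first:
            return first[query]
        return adjustment * second(query)

    return score

# Usage: frequent queries get first-component scores; the rest fall
# through to the weighted second component.
score = train_two_component_model(
    ["weather today", "weather today", "weather today",
     "news", "news", "sports scores"]
)
```

With these toy counts, "weather today" and "news" form the selected subset and are scored by relative frequency, while "sports scores" falls through to the down-weighted unigram model.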
20 Claims
1. A method comprising:
accessing, by a data processing apparatus, training data indicating queries submitted by one or more users;
determining, by the data processing apparatus and for at least some of the queries, a count of a number of times the training data indicates the query was submitted;
selecting, by the data processing apparatus, a proper subset of the queries based on the counts;
training, by the data processing apparatus, a first component of a language model based on the counts, the first component including first probability data indicating relative frequencies of the selected queries among the training data;
training, by the data processing apparatus, a second component of the language model based on the training data, the second component including second probability data for assigning scores to queries that are not included in the selected queries;
determining, by the data processing apparatus, adjustment data that includes one or more weighting values for normalizing the second probability data with respect to the first probability data; and
storing, by the data processing apparatus, the first component, the second component, and the adjustment data.
View Dependent Claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18
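The "adjustment data" step above calls for weighting values that normalize the second component's probabilities against the first. The claim does not say how the weights are computed; one standard possibility (an assumption, in the style of Katz back-off normalization) is to choose a weight so that the mass the second component assigns outside the selected subset exactly fills the probability left over by the first component:

```python
def backoff_weight(first_probs_selected, second_probs_selected):
    """Hypothetical computation of one weighting value.

    first_probs_selected: first-component probabilities of the selected
        queries (their total is the mass the first component covers).
    second_probs_selected: the probabilities the second component would
        assign to those same selected queries.

    The returned weight scales second-component scores for non-selected
    queries so the combined model sums (approximately) to one.
    """
    first_mass = sum(first_probs_selected.values())
    second_mass_on_selected = sum(second_probs_selected.values())
    return (1.0 - first_mass) / (1.0 - second_mass_on_selected)
```

For example, if the first component covers 0.8 of the mass and the second component puts 0.2 of its own mass on the selected queries, the weight is 0.2 / 0.8 = 0.25.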
19. A system comprising:
one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising:
accessing, by the one or more computers, training data indicating queries submitted by one or more users;
determining, by the one or more computers and for at least some of the queries, a count of a number of times the training data indicates the query was submitted;
selecting, by the one or more computers, a proper subset of the queries based on the counts;
training, by the one or more computers, a first component of a language model based on the counts, the first component including first probability data indicating relative frequencies of the selected queries among the training data;
training, by the one or more computers, a second component of the language model based on the training data, the second component including second probability data for assigning scores to queries that are not included in the selected queries;
determining, by the one or more computers, adjustment data that includes one or more weighting values for normalizing the second probability data with respect to the first probability data; and
storing, by the one or more computers, the first component, the second component, and the adjustment data.
20. A non-transitory computer storage medium storing a computer program, the program comprising instructions that when executed by one or more computers cause the one or more computers to perform operations comprising:
accessing, by the one or more computers, training data indicating queries submitted by one or more users;
determining, by the one or more computers and for at least some of the queries, a count of a number of times the training data indicates the query was submitted;
selecting, by the one or more computers, a proper subset of the queries based on the counts;
training, by the one or more computers, a first component of a language model based on the counts, the first component including first probability data indicating relative frequencies of the selected queries among the training data;
training, by the one or more computers, a second component of the language model based on the training data, the second component including second probability data for assigning scores to queries that are not included in the selected queries;
determining, by the one or more computers, adjustment data that includes one or more weighting values for normalizing the second probability data with respect to the first probability data; and
storing, by the one or more computers, the first component, the second component, and the adjustment data.
Specification