Clustering classes in language modeling

US 9,529,898 B2
Filed: 03/12/2015
Issued: 12/27/2016
Est. Priority Date: 08/26/2014
Status: Active Grant

Abstract

This document describes, among other things, a computer-implemented method. The method can include obtaining a plurality of text samples that each include one or more terms belonging to a first class of terms. The plurality of text samples can be classified into a plurality of groups of text samples. Each group of text samples can correspond to a different sub-class of terms. For each of the groups of text samples, a sub-class context model can be generated based on the text samples in the respective group of text samples. Particular ones of the sub-class context models that are determined to be similar can be merged to generate a hierarchical set of context models. Further, the method can include selecting particular ones of the context models and generating a class-based language model based on the selected context models.
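The grouping and per-group model generation summarized above can be illustrated with a minimal Python sketch. It is not the patented implementation. The `SUBCLASS_OF` table, the `$DATE_...` identifiers, the function names, and the choice of bigram counts as a stand-in for a context model are all assumptions made for illustration. The redaction of the class term and the truncation to n neighboring terms follow the variants recited in dependent claims 13 and 14.

```python
from collections import Counter, defaultdict

# Hypothetical mapping from class terms (here, dates) to sub-class identifiers.
SUBCLASS_OF = {
    "tomorrow": "$DATE_TOMORROW",
    "today": "$DATE_TODAY",
    "july fourth": "$DATE_JULY_4",
}


def find_class_term(tokens):
    """Return (start, length, subclass) for the first class term found, else None."""
    for term, subclass in SUBCLASS_OF.items():
        term_tokens = term.split()
        k = len(term_tokens)
        for i in range(len(tokens) - k + 1):
            if tokens[i:i + k] == term_tokens:
                return i, k, subclass
    return None


def build_subclass_models(samples, n=2):
    """Group samples by sub-class and count bigrams around the class term."""
    models = defaultdict(Counter)
    for text in samples:
        tokens = text.lower().split()
        match = find_class_term(tokens)
        if match is None:
            continue  # sample has no known class term
        i, k, subclass = match
        # Keep only n terms on each side and replace the class term
        # with its sub-class identifier (redaction).
        window = tokens[max(0, i - n):i] + [subclass] + tokens[i + k:i + k + n]
        # Bigram counts over the window act as a toy context model.
        models[subclass].update(zip(window, window[1:]))
    return models


samples = [
    "what is the weather tomorrow morning",
    "set an alarm for tomorrow",
    "fireworks on july fourth near me",
]
models = build_subclass_models(samples)
```

Each entry of `models` holds raw counts for one sub-class; a real system would normalize and smooth these counts into probabilities of language sequences.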

19 Claims

1. A computer-implemented method, comprising:
- obtaining, by a computing system, a plurality of text samples that each includes a respective class term that belongs to a same pre-defined class of topically related terms;
  
  for each respective text sample among the plurality of text samples:
  
  (i) identifying that the respective class term of the respective text sample belongs to a particular sub-class, among a plurality of sub-classes, of the pre-defined class of topically related terms; and
  
  (ii) assigning the respective text sample to a particular group of text samples that corresponds to the particular sub-class to which the respective class term of the respective text sample belongs, such that the plurality of text samples are assigned among a plurality of groups of text samples and each respective group of text samples among the plurality of groups of text samples corresponds to a different one of a plurality of sub-classes of the pre-defined class of topically related terms;
  
  generating, by the computing system and for each respective sub-class among the plurality of sub-classes, a respective sub-class context model that represents probabilities of language sequences determined based on the text samples assigned to the corresponding group of text samples for the respective sub-class;
  
  merging, by the computing system, particular ones of the sub-class context models that are determined to be similar to generate a hierarchical set of context models;
  
  selecting, by the computing system, particular ones of the context models from among the hierarchical set of context models;
  
  generating, by the computing system, a class-based language model that includes, for each of the selected context models, a respective class that corresponds to the respective context model; and
  
  providing the class-based language model in a speech recognition system and transcribing speech characterized in an audio signal to text using the class-based language model in the speech recognition system.
  - 2. The computer-implemented method of claim 1, wherein at least a portion of the text samples in the plurality of text samples are obtained from logs of natural-language search queries or from logs of speech recognition results.
  - 3. The computer-implemented method of claim 1, wherein the pre-defined class of topically related terms comprises dates, and wherein the respective sub-classes of terms each comprises a different set of one or more dates.
  - 4. The computer-implemented method of claim 3, wherein the respective class term of a first text sample among the plurality of text samples indicates a date without naming any of a calendar year, a calendar month, or a calendar day.
  - 5. The computer-implemented method of claim 1, wherein the pre-defined class of topically related terms comprises one of percentages, song names, currency names or values, times, city names, state names, address numbers, chapter numbers, operands, times, and geographic locations.
  - 6. The computer-implemented method of claim 1, further comprising determining that the particular ones of the sub-class context models are similar using a Kullback-Leibler divergence technique.
  - 7. The computer-implemented method of claim 1, wherein merging the particular ones of the sub-class context models to generate the hierarchical set of context models comprises:
    - adding the sub-class context models to the hierarchical set of context models;
      
      repeatedly identifying and merging two or more context models that are determined to be similar to generate an intermediate context model, wherein the two or more context models are selected from a group consisting of the sub-class context models and other intermediate context models that have previously been generated by merging context models; and
      
      adding each instance of the intermediate context model that is generated to the hierarchical set of context models.
  - 8. The computer-implemented method of claim 1, wherein selecting the particular ones of the context models from among the hierarchical set of context models comprises selecting multiple context models that are collectively trained based on substantially all of the text samples for which the sub-class context models were generated, wherein at least one of the multiple context models selected is an intermediate context model that was generated by merging two or more context models that were determined to be similar.
  - 9. The computer-implemented method of claim 1, wherein the particular ones of the context models are selected so as to optimize one or more metrics.
  - 10. The computer-implemented method of claim 9, wherein the particular ones of the context models are selected so as to optimize two or more metrics including:
    - (i) minimizing a total number of context models selected to generate the class-based language model, and (ii) maximizing variation among the particular ones of the context models selected.
  - 11. The computer-implemented method of claim 1, further comprising evaluating one or more performance metrics of the class-based language model using a first set of input data.
  - 12. The computer-implemented method of claim 11, further comprising:
    - selecting second ones of the context models from among the hierarchical set of context models;
      
      generating a second class-based language model based on the selected second ones of the context models, each of the second ones of the context models corresponding to a class in the second class-based language model;
      
      evaluating the one or more performance metrics of the second class-based language model using the first set of input data; and
      
      determining whether the class-based language model or the second class-based language model is better based on the one or more performance metrics.
  - 13. The computer-implemented method of claim 1, wherein the sub-class context models are generated by training the sub-class context models using text samples in which the respective class terms of the text samples are redacted and replaced with respective sub-class identifiers that correspond to the sub-classes of the respective groups to which the respective text samples are assigned.
  - 14. The computer-implemented method of claim 1, comprising generating the sub-class context models based on training data that comprises truncated text samples that include only the n terms that immediately precede the respective class terms in the text samples through the n terms that immediately follow the respective class terms in the text samples.
  - 15. The computer-implemented method of claim 14, wherein n equals one, two, or three.
  - 16. The computer-implemented method of claim 1, wherein, for each respective group of text samples among the plurality of groups of text samples, the respective class terms of the text samples assigned to the respective group of text samples are substantially identical.
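
The similarity test of claim 6 and the repeated merging of claim 7 can be sketched as follows. This is a hedged illustration, not the method as actually implemented. It assumes that each context model is a `Counter` of context events shared across sub-classes, that similarity is a symmetrized Kullback-Leibler divergence with add-alpha smoothing, that exactly two models are merged per step, and that merging continues until a single root remains. The claims fix none of these choices, and all function and model names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def kl_divergence(p_counts, q_counts, vocab, alpha=1.0):
    """KL(P || Q) over vocab, with add-alpha smoothing to avoid zero probabilities."""
    p_total = sum(p_counts.values()) + alpha * len(vocab)
    q_total = sum(q_counts.values()) + alpha * len(vocab)
    kl = 0.0
    for event in vocab:
        p = (p_counts[event] + alpha) / p_total
        q = (q_counts[event] + alpha) / q_total
        kl += p * math.log(p / q)
    return kl


def symmetric_kl(a, b, vocab):
    """Symmetrized divergence, so the order of the two models does not matter."""
    return kl_divergence(a, b, vocab) + kl_divergence(b, a, vocab)


def build_hierarchy(leaf_models):
    """Greedily merge the two most similar models until one root remains.

    Returns every model created along the way: the leaves plus each
    intermediate model produced by a merge.
    """
    vocab = set().union(*leaf_models.values())
    hierarchy = dict(leaf_models)   # all models ever created
    active = dict(leaf_models)      # models still eligible for merging
    while len(active) > 1:
        a, b = min(
            combinations(sorted(active), 2),
            key=lambda pair: symmetric_kl(active[pair[0]], active[pair[1]], vocab),
        )
        name = f"({a}+{b})"
        merged = active.pop(a) + active.pop(b)  # pool the counts
        active[name] = merged
        hierarchy[name] = merged
    return hierarchy


# Toy context counts for three date sub-classes (made-up data).
leaves = {
    "tomorrow": Counter({"weather": 5, "alarm": 3}),
    "today": Counter({"weather": 5, "alarm": 2}),
    "july_4": Counter({"fireworks": 6, "parade": 2}),
}
hierarchy = build_hierarchy(leaves)
```

With this toy data, the "today" and "tomorrow" models are merged first because their context distributions are closest. The resulting hierarchy holds the three leaves, that intermediate model, and the root, from which a later step could select the models that become classes.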

17. One or more non-transitory computer-readable media including instructions that, when executed by one or more processors, cause performance of operations, comprising:
- obtaining, by a computing system, a plurality of text samples that each includes a respective class term that belongs to a same pre-defined class of topically related terms;
  
  for each respective text sample among the plurality of text samples:
  
  (i) identifying that the respective class term of the respective text sample belongs to a particular sub-class, among a plurality of sub-classes, of the pre-defined class of topically related terms; and
  
  (ii) assigning the respective text sample to a particular group of text samples that corresponds to the particular sub-class to which the respective class term of the respective text sample belongs, such that the plurality of text samples are assigned among a plurality of groups of text samples and each respective group of text samples among the plurality of groups of text samples corresponds to a different one of a plurality of sub-classes of the pre-defined class of topically related terms;
  
  generating, by the computing system and for each respective sub-class among the plurality of sub-classes, a respective sub-class context model that represents probabilities of language sequences determined based on the text samples assigned to the corresponding group of text samples for the respective sub-class;
  
  merging, by the computing system, particular ones of the sub-class context models that are determined to be similar to generate a hierarchical set of context models;
  
  selecting, by the computing system, particular ones of the context models from among the hierarchical set of context models;
  
  generating, by the computing system, a class-based language model that includes, for each of the selected context models, a respective class that corresponds to the respective context model; and
  
  providing the class-based language model in a speech recognition system and transcribing speech characterized in an audio signal to text using the class-based language model in the speech recognition system.
  - 18. The one or more non-transitory computer-readable media of claim 17, wherein, for each respective group of text samples among the plurality of groups of text samples, the respective class terms of the text samples assigned to the respective group of text samples are substantially identical.

19. A system comprising:
- one or more processors; and
  
  one or more computer-readable media including instructions that, when executed by the one or more processors, cause performance of operations comprising:
  
  obtaining, by the system, a plurality of text samples that each includes a respective class term that belongs to a same pre-defined class of topically related terms;
  
  for each respective text sample among the plurality of text samples:
  
  (i) identifying that the respective class term of the respective text sample belongs to a particular sub-class, among a plurality of sub-classes, of the pre-defined class of topically related terms; and
  
  (ii) assigning the respective text sample to a particular group of text samples that corresponds to the particular sub-class to which the respective class term of the respective text sample belongs, such that the plurality of text samples are assigned among a plurality of groups of text samples and each respective group of text samples among the plurality of groups of text samples corresponds to a different one of a plurality of sub-classes of the pre-defined class of topically related terms;
  
  generating, by the system and for each respective sub-class among the plurality of sub-classes, a respective sub-class context model that represents probabilities of language sequences determined based on the text samples assigned to the corresponding group of text samples for the respective sub-class;
  
  merging, by the system, particular ones of the sub-class context models that are determined to be similar to generate a hierarchical set of context models;
  
  selecting, by the system, particular ones of the context models from among the hierarchical set of context models;
  
  generating, by the system, a class-based language model that includes, for each of the selected context models, a respective class that corresponds to the respective context model; and
  
  providing the class-based language model in a speech recognition system and transcribing speech characterized in an audio signal to text using the class-based language model in the speech recognition system.

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Epstein, Mark Edward; Schogol, Vladislav
Primary Examiner(s): REINIER, BARBARA DIANE
Application Number: US14/656,027
Publication Number: US 20160062985A1
Time in Patent Office: 656 Days
Field of Search: None
US Class Current: 1/1
CPC Class Codes:
G06F 16/353   into predefined classes
G06F 40/216   using statistical methods
G06F 40/289   Phrasal analysis, e.g. fini...
