Training acoustic models using distributed computing techniques

US 8,959,014 B2
Filed: 06/29/2012
Issued: 02/17/2015
Est. Priority Date: 06/30/2011
Status: Active Grant

Abstract

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for training acoustic models. Speech data and data identifying a transcription for the speech data are received. A phonetic representation for the transcription is accessed. Training sequences are identified for a particular phone in the phonetic representation. Each of the training sequences includes a different set of contextual phones surrounding the particular phone. A partitioning key is identified based on a sequence of phones that occurs in each of the training sequences. A processing module to which the identified partitioning key is assigned is selected. Data identifying the training sequences and a portion of the speech data are transmitted to the selected processing module.
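
The steps in the abstract can be illustrated with a minimal sketch. It assumes the partitioning key is the central triphone (one contextual phone on each side of the particular phone, as in claims 8 and 10) and that training sequences use one, two, and three contextual phones per side (as in claim 11). The function names, the CRC32-based module selection, and the example phone string are hypothetical and are not taken from the patent.

```python
import zlib


def extract_training_sequences(phones, index, max_context=3):
    """Return sequences centered on phones[index] with 1..max_context
    contextual phones on each side. Contexts that would run past either
    end of the utterance are skipped (an assumed simplification)."""
    sequences = []
    for k in range(1, max_context + 1):
        lo, hi = index - k, index + k + 1
        if lo < 0 or hi > len(phones):
            break
        sequences.append(tuple(phones[lo:hi]))
    return sequences


def partitioning_key(sequence):
    """Central triphone: the central phone plus one phone on each side."""
    mid = len(sequence) // 2
    return sequence[mid - 1:mid + 2]


def select_module(key, num_modules):
    """Map a key to a module index with a stable hash (assumed scheme)."""
    return zlib.crc32(" ".join(key).encode("utf-8")) % num_modules


# Hypothetical phonetic representation of one utterance.
phones = ["sil", "ae", "k", "sh", "ih", "n", "sil"]
sequences = extract_training_sequences(phones, index=3)
keys = {partitioning_key(s) for s in sequences}
module = select_module(next(iter(keys)), num_modules=4)
```

Because every sequence extracted for the phone "sh" shares the central triphone ("k", "sh", "ih"), all of them yield the same key and are routed to the same processing module.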

21 Claims

1. A system comprising:
- one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising:
  
  receiving speech data and data identifying a transcription for the speech data;
  
  accessing a phonetic representation for the transcription;
  
  extracting training sequences from the phonetic representation for a particular phone in the phonetic representation, the training sequences comprising two or more training sequences that include (i) a particular sequence of multiple phones and (ii) a different number of contextual phones surrounding the particular phone;
  
  identifying a partitioning key for the training sequences based on the particular sequence of multiple phones that occurs in the two or more training sequences;
  
  selecting, from among a plurality of processing modules, a processing module to which the identified partitioning key is assigned, the processing module being designated to train a portion of an acoustic model that corresponds to the identified partitioning key; and
  
  transmitting, to the selected processing module, (i) data identifying the training sequences and (ii) a portion of the speech data that corresponds to the training sequence that includes the most contextual phones.
  - 2. The system of claim 1, wherein accessing the phonetic representation for the transcription comprises accessing a phonetic representation comprising context-independent phones.
  - 3. The system of claim 1, wherein receiving speech data comprises receiving feature vectors that indicate speech characteristics.
  - 4. The system of claim 1, wherein transmitting at least a portion of the speech data comprises transmitting a speech data instance for fewer than all of the training sequences in the set of training sequences.
  - 5. The system of claim 1, wherein transmitting the at least a portion of the speech data comprises transmitting the speech data corresponding to the training sequence that includes the most contextual phones, without transmitting additional speech data for the other training sequences to the selected processing module.
  - 6. The system of claim 1, wherein the operations further comprise:
    - receiving, at the selected processing module, the data identifying the training sequences and the portion of the speech data that corresponds to the training sequence that includes the most contextual phones; and
      
      accessing, at the selected processing module, a different subset of the received speech data for each of the training sequences.
  - 7. The system of claim 1, wherein identifying the partitioning key based on the particular sequence of phones that occurs in the two or more training sequences comprises selecting the partitioning key from among a plurality of partitioning keys assigned to different processing modules based on a sequence of two or more consecutive phones that occurs in each of the two or more training sequences.
  - 8. The system of claim 1, wherein identifying the partitioning key based on the particular sequence of phones that occurs in the two or more training sequences comprises identifying the partitioning key based on a sequence in each of the two or more training sequences that includes one contextual phone before the particular phone and one contextual phone after the particular phone.
  - 9. The system of claim 1, wherein identifying the partitioning key based on the particular sequence of phones that occurs in the two or more training sequences comprises identifying a partitioning key for each of the training sequences, wherein the same partitioning key is identified for each of the training sequences extracted for the particular phone.
  - 10. The system of claim 1, wherein the training sequences are first training sequences that each comprise a same central triphone;
    - wherein identifying the partitioning key based on the particular sequence of phones that occurs in the two or more training sequences comprises identifying the partitioning key based on the same central triphone included in the first training sequences; and
      
      wherein the operations further comprise transmitting, to the processing module and not to any of the other processing modules in the plurality of processing modules, data identifying second training sequences comprising the same central triphone included in the first training sequences, the second training sequences being extracted from a phonetic representation for a transcription for second speech data.
  - 11. The system of claim 1, wherein extracting the training sequences for the particular phone in the phonetic representation comprises identifying at least a first sequence that includes one contextual phone before the particular phone and one contextual phone after the particular phone, a second sequence that includes two contextual phones before the particular phone and two contextual phones after the particular phone, and a third sequence that includes three contextual phones before the particular phone and three contextual phones after the particular phone.
  - 12. The system of claim 1, wherein extracting the training sequences for the particular phone in the phonetic representation comprises extracting sequences of consecutive phones in the phonetic representation.
  - 13. The system of claim 1, wherein the operations further comprise:
    - receiving, at the selected processing module, the data identifying the training sequences; and
      
      aggregating, at the selected processing module, the portion of the speech data with speech data for other instances of the training sequences.
  - 14. The system of claim 13, wherein the operations further comprise:
    - generating, at the selected processing module, a model for a first training sequence of the training sequences based on the aggregated speech data for the first training sequence; and
      
      storing the generated model in a distributed associative array, the generated model being stored in a partition of the distributed associative array being associated with the identified partitioning key.
  - 15. The system of claim 14, wherein generating the model for the first training sequence comprises generating a context-dependent Gaussian mixture model dependent on the sequence of contextual phones included in the first training sequence, the Gaussian mixture model representing the output distribution of a Hidden Markov Model state of a central phone of the first training sequence.
  - 16. The system of claim 14, wherein storing the generated model in the distributed associative array comprises storing the generated model in the distributed associative array such that the generated model is associated with a key that uniquely corresponds to the first training sequence.
  - 17. The system of claim 13, wherein the operations further comprise:
    - determining, at the selected processing module, that the aggregated speech data includes data for fewer than a threshold number of instances of a second training sequence of the training sequences; and
      
      in response to determining that the aggregated speech data includes data for fewer than the threshold number of instances of the second training sequence, not generating a model for the second training sequence.
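
The processing-module side described in claims 13, 14, 16, and 17 might be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: each instance is reduced to a single feature vector, the "model" is only a per-dimension mean standing in for the Gaussian mixture model of claim 15, a plain dict stands in for one partition of the distributed associative array, and the threshold value and function names are hypothetical.

```python
from collections import defaultdict

MIN_INSTANCES = 2  # hypothetical threshold, not specified in the claims


def aggregate(records):
    """Group speech data (here: one feature vector per instance)
    by the training sequence it belongs to."""
    grouped = defaultdict(list)
    for sequence, features in records:
        grouped[sequence].append(features)
    return grouped


def train_partition(records, min_instances=MIN_INSTANCES):
    """Build a placeholder model for each training sequence that has
    enough instances and store it under a key unique to that sequence."""
    store = {}  # stands in for one partition of the associative array
    for sequence, instances in aggregate(records).items():
        if len(instances) < min_instances:
            continue  # too few instances: no model is generated
        dim = len(instances[0])
        mean = [sum(v[d] for v in instances) / len(instances)
                for d in range(dim)]
        store["-".join(sequence)] = {"mean": mean, "count": len(instances)}
    return store


# Hypothetical records received by one processing module.
records = [
    (("k", "sh", "ih"), [1.0, 2.0]),
    (("k", "sh", "ih"), [3.0, 4.0]),
    (("ae", "k", "sh", "ih", "n"), [5.0, 6.0]),
]
store = train_partition(records)
```

In this example only the triphone sequence reaches the threshold, so the five-phone sequence gets no model.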

18. A computer-implemented method, comprising:
- receiving speech data and data identifying a transcription for the speech data;
  
  accessing a phonetic representation for the transcription;
  
  extracting training sequences from the phonetic representation for a particular phone in the phonetic representation, the training sequences comprising two or more training sequences that include (i) a particular sequence of multiple phones and (ii) a different number of contextual phones surrounding the particular phone;
  
  identifying a partitioning key for the training sequences based on the particular sequence of multiple phones that occurs in the two or more training sequences;
  
  selecting, from among a plurality of processing modules, a processing module to which the identified partitioning key is assigned, the processing module being designated to train a portion of an acoustic model that corresponds to the identified partitioning key; and
  
  transmitting, to the selected processing module, (i) data identifying the training sequences and (ii) a portion of the speech data that corresponds to the training sequence that includes the most contextual phones.
  - 19. The computer-implemented method of claim 18, wherein identifying the partitioning key based on the particular sequence of phones that occurs in the two or more training sequences comprises selecting the partitioning key from among a plurality of partitioning keys assigned to different processing modules based on a sequence of two or more consecutive phones that occurs in each of the two or more training sequences.

20. A non-transitory computer storage medium encoded with a computer program, the program comprising instructions that when executed by one or more computers cause the one or more computers to perform operations comprising:
- receiving speech data and data identifying a transcription for the speech data;
  
  accessing a phonetic representation for the transcription;
  
  extracting training sequences from the phonetic representation for a particular phone in the phonetic representation, the training sequences comprising two or more training sequences that include (i) a particular sequence of multiple phones and (ii) a different number of contextual phones surrounding the particular phone;
  
  identifying a partitioning key for the training sequences based on the particular sequence of multiple phones that occurs in the two or more training sequences;
  
  selecting, from among a plurality of processing modules, a processing module to which the identified partitioning key is assigned, the processing module being designated to train a portion of an acoustic model that corresponds to the identified partitioning key; and
  
  transmitting, to the selected processing module, (i) data identifying the training sequences and (ii) a portion of the speech data that corresponds to the training sequence that includes the most contextual phones.

21. A system comprising:
- one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising:
  
  assigning a plurality of partitioning keys to a plurality of processing modules, each partitioning key being assigned to only one of the plurality of processing modules, the partitioning keys corresponding to non-overlapping sets of phonetic sequences;
  
  receiving speech data for an utterance and data identifying a transcription for the utterance;
  
  accessing a phonetic representation for the transcription;
  
  extracting, for a particular phone in the phonetic representation, multiple training sequences from the phonetic representation, each of the multiple training sequences including (i) a particular sequence of multiple phones and (ii) a different number of contextual phones surrounding the particular phone, wherein the particular phone corresponds to a central position in each of the multiple training sequences;
  
  selecting, from among the plurality of assigned partitioning keys, a partitioning key that corresponds to each of the multiple training sequences based on a sequence of multiple phones that occurs in each of the multiple training sequences;
  
  selecting a processing module from among the plurality of processing modules based on the identified partitioning key, the selected processing module being designated to train a portion of an acoustic model corresponding to the identified partitioning key;
  
  identifying a portion of the speech data that corresponds to a training sequence of the multiple training sequences that includes the most contextual phones; and
  
  transmitting, to the selected processing module, (i) data identifying the training sequences and (ii) data indicating the portion of the speech data that corresponds to the training sequence that includes the most contextual phones.
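
Two elements specific to claim 21 can be sketched briefly: assigning each partitioning key to exactly one processing module, and attaching speech data only for the training sequence with the most contextual phones. The round-robin assignment, the message layout, and all names below are assumptions made for illustration; the claim does not prescribe them.

```python
def assign_keys(keys, num_modules):
    """Assign each distinct partitioning key to exactly one module.
    Round-robin over the sorted keys is an assumed policy."""
    return {key: i % num_modules
            for i, key in enumerate(sorted(set(keys)))}


def build_message(sequences, speech_by_sequence):
    """Identify every training sequence, but include speech data only
    for the sequence that has the most contextual phones."""
    longest = max(sequences, key=len)
    return {
        "sequences": list(sequences),
        "speech_for": longest,
        "speech": speech_by_sequence[longest],
    }


# Hypothetical keys and training sequences.
assignment = assign_keys(
    [("k", "sh", "ih"), ("ae", "k", "sh"), ("k", "sh", "ih")],
    num_modules=2,
)
sequences = [("k", "sh", "ih"), ("ae", "k", "sh", "ih", "n")]
speech = {sequences[1]: [[0.1, 0.2], [0.3, 0.4]]}  # placeholder frames
message = build_message(sequences, speech)
target = assignment[("k", "sh", "ih")]
```

Sending speech data once, for the largest context, avoids transmitting overlapping audio for each shorter sequence; the receiving module can take the subset it needs for the smaller contexts.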

Current Assignee
Google LLC (Alphabet Inc.)
Original Assignee
Google Inc. (Alphabet Inc.)
Inventors
Xu, Peng; Pereira, Fernando; Chelba, Ciprian I.
Primary Examiner(s)
Desir, Pierre-Louis
Assistant Examiner(s)
Kovacek, David M

Application Number

US13/539,225
Publication Number

US 20130006612A1
Time in Patent Office

963 Days
Field of Search

704/1-10, 704/236-240, 704/249, 704/255-257, 704/E17.001-E17.016, 704/E15.001-E15.05, 704/E11.001-E11.007
US Class Current

704/10
CPC Class Codes

G10L 15/063   Training

G10L 15/14   using statistical models, e...

G10L 15/187   Phonemic context, e.g. pron...

G10L 15/34   Adaptation of a single reco...

G10L 2015/0631   Creating reference template...
