Training acoustic models using connectionist temporal classification
First Claim
1. A method comprising:
accessing, by data processing hardware, acoustic model training data comprising training audio data and word-level transcriptions for the training audio data;
flat start training, by data processing hardware, a first connectionist temporal classification (CTC) acoustic model on the acoustic model training data to generate phonetic sequences corresponding to the word-level transcriptions, the first CTC acoustic model trained without using any previously determined fixed alignment targets between the training audio data and the word-level transcriptions;
generating, by the data processing hardware using the trained first CTC acoustic model, a context-dependent state inventory from approximate phonetic alignments between the training audio data and the phonetic sequences corresponding to the word-level transcriptions; and
training, by the data processing hardware, a second CTC acoustic model using the context-dependent state inventory to generate outputs corresponding to one or more context-dependent states.
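The flat start step in the claim above works because the CTC objective needs no fixed frame-to-label alignment targets: it marginalizes over every monotonic alignment between the audio frames and the label sequence. A minimal sketch of that alignment-free objective, the CTC forward (alpha) recursion in pure Python (the function name and toy inputs are illustrative; the patent does not prescribe this implementation):

```python
import math

def ctc_log_loss(log_probs, target, blank=0):
    """Negative log-likelihood of `target` under CTC, summing over all
    monotonic alignments -- no fixed per-frame alignment targets.

    log_probs: list of T lists; log_probs[t][k] = log P(symbol k at frame t)
    target: non-empty list of label indices (no blanks); assumes T >= 1
    """
    # Extended label sequence with blanks interleaved: [a, b] -> [_, a, _, b, _]
    ext = [blank]
    for s in target:
        ext.extend([s, blank])
    S, T = len(ext), len(log_probs)
    NEG_INF = float("-inf")

    def logsumexp(*xs):
        m = max(xs)
        if m == NEG_INF:
            return NEG_INF
        return m + math.log(sum(math.exp(x - m) for x in xs))

    # alpha[s] = log prob of all alignment prefixes ending in state s at frame t
    alpha = [NEG_INF] * S
    alpha[0] = log_probs[0][ext[0]]  # start with blank
    alpha[1] = log_probs[0][ext[1]]  # or with the first label
    for t in range(1, T):
        new = [NEG_INF] * S
        for s in range(S):
            terms = [alpha[s]]                 # stay in the same state
            if s > 0:
                terms.append(alpha[s - 1])     # advance one state
            # skip over a blank, allowed between distinct non-blank labels
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                terms.append(alpha[s - 2])
            new[s] = logsumexp(*terms) + log_probs[t][ext[s]]
        alpha = new
    # Valid alignments end on the last label or the trailing blank
    return -logsumexp(alpha[S - 1], alpha[S - 2])
```

With two frames, a vocabulary of {blank, a} at uniform probability 0.5, and target [a], the three valid alignments (a·blank, blank·a, a·a) each have probability 0.25, so the loss is -log(0.75) ≈ 0.2877.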
Abstract
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for training acoustic models and using the trained acoustic models. A connectionist temporal classification (CTC) acoustic model is accessed, the CTC acoustic model having been trained using a context-dependent state inventory generated from approximate phonetic alignments determined by another CTC acoustic model trained without fixed alignment targets. Audio data for a portion of an utterance is received. Input data corresponding to the received audio data is provided to the accessed CTC acoustic model. Data indicating a transcription for the utterance is generated based on output that the accessed CTC acoustic model produced in response to the input data. The data indicating the transcription is provided as output of an automated speech recognition service.
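The abstract's "approximate phonetic alignments determined by another CTC acoustic model" can be read as frame-level alignments recovered from a trained CTC model; one common way to obtain them is best-path decoding, taking the argmax label at each frame and then collapsing repeats and blanks. A hedged sketch of that step (the greedy decoding choice and all names are assumptions for illustration; the patent does not prescribe this exact procedure):

```python
def best_path_alignment(log_probs, blank=0):
    """Frame-level best-path alignment from per-frame CTC posteriors.

    log_probs: list of T lists of per-symbol scores (log-probs or probs;
    only the per-frame argmax matters here).
    Returns (frame_labels, collapsed): the per-frame argmax labels, and the
    phone sequence after removing repeats and blanks. The frame labels give
    approximate phone boundaries from which context-dependent state
    statistics could be accumulated.
    """
    frame_labels = [max(range(len(p)), key=p.__getitem__) for p in log_probs]
    collapsed, prev = [], None
    for k in frame_labels:
        if k != prev and k != blank:
            collapsed.append(k)
        prev = k
    return frame_labels, collapsed
```

For example, per-frame posteriors peaking on label 1, 1, blank, 1 yield frame labels [1, 1, 0, 1] and the collapsed sequence [1, 1], since the intervening blank separates two occurrences of the same label.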
130 Citations
26 Claims
1. A method comprising:
accessing, by data processing hardware, acoustic model training data comprising training audio data and word-level transcriptions for the training audio data;
flat start training, by data processing hardware, a first connectionist temporal classification (CTC) acoustic model on the acoustic model training data to generate phonetic sequences corresponding to the word-level transcriptions, the first CTC acoustic model trained without using any previously determined fixed alignment targets between the training audio data and the word-level transcriptions;
generating, by the data processing hardware using the trained first CTC acoustic model, a context-dependent state inventory from approximate phonetic alignments between the training audio data and the phonetic sequences corresponding to the word-level transcriptions; and
training, by the data processing hardware, a second CTC acoustic model using the context-dependent state inventory to generate outputs corresponding to one or more context-dependent states.
View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13)
14. A system comprising:
data processing hardware;
memory hardware in communication with the data processing hardware and storing instructions that when executed by the data processing hardware cause the data processing hardware to perform operations comprising:
accessing acoustic model training data comprising training audio data and word-level transcriptions for the training audio data;
flat start training a first connectionist temporal classification (CTC) acoustic model on the acoustic model training data to generate phonetic sequences corresponding to the word-level transcriptions, the first CTC acoustic model trained without using any previously determined fixed alignment targets between the training audio data and the word-level transcriptions;
generating, using the trained first CTC acoustic model, a context-dependent state inventory from approximate phonetic alignments between the training audio data and the phonetic sequences corresponding to the word-level transcriptions; and
training a second CTC acoustic model using the context-dependent state inventory to generate outputs corresponding to one or more context-dependent states.
View Dependent Claims (15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26)
Specification