Training acoustic models using connectionist temporal classification
Abstract
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for training acoustic models and using the trained acoustic models. A connectionist temporal classification (CTC) acoustic model is accessed, the CTC acoustic model having been trained using a context-dependent state inventory generated from approximate phonetic alignments determined by another CTC acoustic model trained without fixed alignment targets. Audio data for a portion of an utterance is received. Input data corresponding to the received audio data is provided to the accessed CTC acoustic model. Data indicating a transcription for the utterance is generated based on output that the accessed CTC acoustic model produced in response to the input data. The data indicating the transcription is provided as output of an automated speech recognition service.
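The abstract describes a two-stage pipeline: a first CTC model, trained without fixed alignment targets, produces approximate phonetic alignments, and those alignments yield context-dependent state labels for training a second CTC model. The plain-Python sketch below illustrates one plausible reading of the alignment-to-labels step; it is not code from the patent. The function names, the `sil` padding symbol, greedy best-path decoding, and the toy frame posteriors are all illustrative assumptions, and the model itself is stubbed out entirely.

```python
# Hypothetical sketch (not from the patent): turning frame-level CTC
# posteriors into approximate phone alignments, then into triphone-style
# context-dependent states. The acoustic model is stubbed; only the data
# flow from alignments to context-dependent labels is shown.

BLANK = "<b>"  # CTC blank symbol (name is an assumption)

def best_path_alignment(frame_posteriors):
    """Greedy (best-path) CTC decode: keep the top-scoring label per frame."""
    return [max(frame, key=frame.get) for frame in frame_posteriors]

def collapse(alignment):
    """CTC collapse: drop blanks and merge immediate repeats
    (a blank between two identical labels separates true repeats)."""
    out, prev = [], None
    for sym in alignment:
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return out

def context_dependent_states(phones):
    """Map each context-independent phone to a triphone-style state
    (left context, phone, right context), padding utterance edges
    with a silence symbol."""
    padded = ["sil"] + phones + ["sil"]
    return [(padded[i - 1], padded[i], padded[i + 1])
            for i in range(1, len(padded) - 1)]

# Toy posteriors for an utterance of the phones k, ae, t:
frames = [{"k": 0.9, BLANK: 0.1}, {BLANK: 0.8, "k": 0.2},
          {"ae": 0.7, BLANK: 0.3}, {"ae": 0.6, BLANK: 0.4},
          {"t": 0.8, BLANK: 0.2}]
phones = collapse(best_path_alignment(frames))   # ["k", "ae", "t"]
states = context_dependent_states(phones)
# [("sil", "k", "ae"), ("k", "ae", "t"), ("ae", "t", "sil")]
```

The resulting context-dependent state sequence is the kind of "second training data" the abstract attributes to the first model's approximate alignments; a real system would derive these labels over an entire corpus rather than one utterance.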
20 Claims
1. A method performed by one or more computers of a speech recognition system, the method comprising:

training, by the one or more computers of the speech recognition system, a first connectionist temporal classification (CTC) acoustic model on first training data to generate, as unmodified outputs, second training data of context-dependent state inventory from approximate phonetic alignments, the first training data comprising context-independent phones generated without using any previously determined phonetic alignments;

training, by the one or more computers of the speech recognition system, a second CTC acoustic model on the second training data to generate outputs corresponding to one or more context-dependent states;

accessing, by the one or more computers of the speech recognition system, the second CTC acoustic model;

receiving, by the one or more computers of the speech recognition system, audio data for a portion of an utterance;

providing, by the one or more computers of the speech recognition system, input data corresponding to the received audio data as input to the accessed second CTC acoustic model that has been trained on the second training data;

generating, by the one or more computers of the speech recognition system, data indicating a transcription for the utterance based on output that the accessed second CTC acoustic model produced in response to the input data corresponding to the received audio data; and

providing, by the one or more computers of the speech recognition system, the data indicating the transcription as output of the automated speech recognition system.

Dependent claims: 2-11.
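Claim 1's first training step has the first model's approximate alignments define a context-dependent state inventory for the second model. The sketch below illustrates one way such an inventory could be built from aligned phone sequences; it is a hypothetical reading, not the patent's method. In particular, the minimum-count backoff from rare triphones to their context-independent center phone is an assumption of this sketch (the patent does not specify this heuristic), as are the function name and the `sil` padding symbol.

```python
# Hypothetical sketch (not from the patent): building a context-dependent
# state inventory from aligned phone sequences. Triphone contexts seen at
# least `min_count` times get their own state; rare contexts back off to
# the context-independent center phone (backoff rule is an assumption).
from collections import Counter

def build_inventory(utterance_phone_seqs, min_count=2):
    counts = Counter()
    for phones in utterance_phone_seqs:
        padded = ["sil"] + phones + ["sil"]  # pad utterance edges with silence
        for i in range(1, len(padded) - 1):
            counts[(padded[i - 1], padded[i], padded[i + 1])] += 1
    inventory = {}
    for tri, n in counts.items():
        label = tri if n >= min_count else tri[1]  # back off rare triphones
        if label not in inventory:
            inventory[label] = len(inventory)     # assign a state index
    return inventory

corpus = [["k", "ae", "t"], ["k", "ae", "t"], ["b", "ae", "t"]]
inventory = build_inventory(corpus)
# Frequent triphones like ("k", "ae", "t") get dedicated state indices;
# the rare ("sil", "b", "ae") and ("b", "ae", "t") back off to "b" and "ae".
```

A production system would replace the count threshold with decision-tree state tying or a similar clustering step, but the overall flow, alignments in, state inventory out, is the same.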
12. A speech recognition system comprising:

one or more computers of the speech recognition system; and

a non-transitory computer-readable medium coupled to the one or more computers having instructions stored thereon which, when executed by the one or more computers, cause the one or more computers to perform operations comprising:

training a first connectionist temporal classification (CTC) acoustic model on first training data to generate, as unmodified outputs, second training data of context-dependent state inventory from approximate phonetic alignments, the first training data comprising context-independent phones generated without using any previously determined phonetic alignments;

training a second CTC acoustic model on the second training data to generate outputs corresponding to one or more context-dependent states;

accessing the second CTC acoustic model;

receiving audio data for a portion of an utterance;

providing input data corresponding to the received audio data as input to the accessed second CTC acoustic model that has been trained on the second training data;

generating data indicating a transcription for the utterance based on output that the accessed second CTC acoustic model produced in response to the input data corresponding to the received audio data; and

providing the data indicating the transcription as output of the automated speech recognition system.

Dependent claims: 13-19.
20. A non-transitory computer-readable storage medium storing a computer program, the program comprising instructions that when executed by one or more computers cause the one or more computers to perform operations comprising:

training a first connectionist temporal classification (CTC) acoustic model on first training data to generate, as unmodified outputs, second training data of context-dependent state inventory from approximate phonetic alignments, the first training data comprising context-independent phones generated without using any previously determined phonetic alignments;

training a second CTC acoustic model on the second training data to generate outputs corresponding to one or more context-dependent states;

accessing the second CTC acoustic model;

receiving audio data for a portion of an utterance;

providing input data corresponding to the received audio data as input to the accessed second CTC acoustic model that has been trained on the second training data;

generating data indicating a transcription for the utterance based on output that the accessed second CTC acoustic model produced in response to the input data corresponding to the received audio data; and

providing the data indicating the transcription as output of an automated speech recognition service.
Specification