Training acoustic models using connectionist temporal classification
Abstract
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for training acoustic models and using the trained acoustic models. A connectionist temporal classification (CTC) acoustic model is accessed, the CTC acoustic model having been trained using a context-dependent state inventory generated from approximate phonetic alignments determined by another CTC acoustic model trained without fixed alignment targets. Audio data for a portion of an utterance is received. Input data corresponding to the received audio data is provided to the accessed CTC acoustic model. Data indicating a transcription for the utterance is generated based on output that the accessed CTC acoustic model produced in response to the input data. The data indicating the transcription is provided as output of an automated speech recognition service.
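The abstract describes a two-stage pipeline: a first CTC model, trained without fixed alignment targets, produces approximate phonetic alignments, and those alignments yield context-dependent state labels for training a second CTC model. The plain-Python sketch below illustrates one plausible reading of the alignment-to-labels step; it is not code from the patent. The function names, the `sil` padding symbol, greedy best-path decoding, and the toy frame posteriors are all illustrative assumptions, and the model itself is stubbed out entirely.

```python
# Hypothetical sketch (not from the patent): turning frame-level CTC
# posteriors into approximate phone alignments, then into triphone-style
# context-dependent states. The acoustic model is stubbed; only the data
# flow from alignments to context-dependent labels is shown.

BLANK = "<b>"  # CTC blank symbol (name is an assumption)

def best_path_alignment(frame_posteriors):
    """Greedy (best-path) CTC decode: keep the top-scoring label per frame."""
    return [max(frame, key=frame.get) for frame in frame_posteriors]

def collapse(alignment):
    """CTC collapse: drop blanks and merge immediate repeats
    (a blank between two identical labels separates true repeats)."""
    out, prev = [], None
    for sym in alignment:
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return out

def context_dependent_states(phones):
    """Map each context-independent phone to a triphone-style state
    (left context, phone, right context), padding utterance edges
    with a silence symbol."""
    padded = ["sil"] + phones + ["sil"]
    return [(padded[i - 1], padded[i], padded[i + 1])
            for i in range(1, len(padded) - 1)]

# Toy posteriors for an utterance of the phones k, ae, t:
frames = [{"k": 0.9, BLANK: 0.1}, {BLANK: 0.8, "k": 0.2},
          {"ae": 0.7, BLANK: 0.3}, {"ae": 0.6, BLANK: 0.4},
          {"t": 0.8, BLANK: 0.2}]
phones = collapse(best_path_alignment(frames))   # ["k", "ae", "t"]
states = context_dependent_states(phones)
# [("sil", "k", "ae"), ("k", "ae", "t"), ("ae", "t", "sil")]
```

The resulting context-dependent state sequence is the kind of "second training data" the abstract attributes to the first model's approximate alignments; a real system would derive these labels over an entire corpus rather than one utterance.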
20 Claims
1. A method performed by one or more computers of a speech recognition system, the method comprising:

training, by the one or more computers of the speech recognition system, a first connectionist temporal classification (CTC) acoustic model on first training data to generate, as unmodified outputs, second training data of context-dependent state inventory from approximate phonetic alignments, the first training data comprising context-independent phones generated without using any previously determined phonetic alignments;

training, by the one or more computers of the speech recognition system, a second CTC acoustic model on the second training data to generate outputs corresponding to one or more context-dependent states;

accessing, by the one or more computers of the speech recognition system, the second CTC acoustic model;

receiving, by the one or more computers of the speech recognition system, audio data for a portion of an utterance;

providing, by the one or more computers of the speech recognition system, input data corresponding to the received audio data as input to the accessed second CTC acoustic model that has been trained on the second training data;

generating, by the one or more computers of the speech recognition system, data indicating a transcription for the utterance based on output that the accessed second CTC acoustic model produced in response to the input data corresponding to the received audio data; and

providing, by the one or more computers of the speech recognition system, the data indicating the transcription as output of the automated speech recognition system.

Dependent claims: 2-11.
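Claim 1's first training step has the first model's approximate alignments define a context-dependent state inventory for the second model. The sketch below illustrates one way such an inventory could be built from aligned phone sequences; it is a hypothetical reading, not the patent's method. In particular, the minimum-count backoff from rare triphones to their context-independent center phone is an assumption of this sketch (the patent does not specify this heuristic), as are the function name and the `sil` padding symbol.

```python
# Hypothetical sketch (not from the patent): building a context-dependent
# state inventory from aligned phone sequences. Triphone contexts seen at
# least `min_count` times get their own state; rare contexts back off to
# the context-independent center phone (backoff rule is an assumption).
from collections import Counter

def build_inventory(utterance_phone_seqs, min_count=2):
    counts = Counter()
    for phones in utterance_phone_seqs:
        padded = ["sil"] + phones + ["sil"]  # pad utterance edges with silence
        for i in range(1, len(padded) - 1):
            counts[(padded[i - 1], padded[i], padded[i + 1])] += 1
    inventory = {}
    for tri, n in counts.items():
        label = tri if n >= min_count else tri[1]  # back off rare triphones
        if label not in inventory:
            inventory[label] = len(inventory)     # assign a state index
    return inventory

corpus = [["k", "ae", "t"], ["k", "ae", "t"], ["b", "ae", "t"]]
inventory = build_inventory(corpus)
# Frequent triphones like ("k", "ae", "t") get dedicated state indices;
# the rare ("sil", "b", "ae") and ("b", "ae", "t") back off to "b" and "ae".
```

A production system would replace the count threshold with decision-tree state tying or a similar clustering step, but the overall flow, alignments in, state inventory out, is the same.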
12. A speech recognition system comprising:

one or more computers of the speech recognition system; and

a non-transitory computer-readable medium coupled to the one or more computers having instructions stored thereon which, when executed by the one or more computers, cause the one or more computers to perform operations comprising:

training a first connectionist temporal classification (CTC) acoustic model on first training data to generate, as unmodified outputs, second training data of context-dependent state inventory from approximate phonetic alignments, the first training data comprising context-independent phones generated without using any previously determined phonetic alignments;

training a second CTC acoustic model on the second training data to generate outputs corresponding to one or more context-dependent states;

accessing the second CTC acoustic model;

receiving audio data for a portion of an utterance;

providing input data corresponding to the received audio data as input to the accessed second CTC acoustic model that has been trained on the second training data;

generating data indicating a transcription for the utterance based on output that the accessed second CTC acoustic model produced in response to the input data corresponding to the received audio data; and

providing the data indicating the transcription as output of the automated speech recognition system.

Dependent claims: 13-19.
20. A non-transitory computer-readable storage medium storing a computer program, the program comprising instructions that when executed by one or more computers cause the one or more computers to perform operations comprising:

training a first connectionist temporal classification (CTC) acoustic model on first training data to generate, as unmodified outputs, second training data of context-dependent state inventory from approximate phonetic alignments, the first training data comprising context-independent phones generated without using any previously determined phonetic alignments;

training a second CTC acoustic model on the second training data to generate outputs corresponding to one or more context-dependent states;

accessing the second CTC acoustic model;

receiving audio data for a portion of an utterance;

providing input data corresponding to the received audio data as input to the accessed second CTC acoustic model that has been trained on the second training data;

generating data indicating a transcription for the utterance based on output that the accessed second CTC acoustic model produced in response to the input data corresponding to the received audio data; and

providing the data indicating the transcription as output of an automated speech recognition service.
Specification