Training acoustic models using connectionist temporal classification
First Claim
1. A method comprising:
accessing, by data processing hardware, acoustic model training data comprising training audio data and word-level transcriptions for the training audio data;
flat start training, by data processing hardware, a first connectionist temporal classification (CTC) acoustic model on the acoustic model training data to generate phonetic sequences corresponding to the word-level transcriptions, the first CTC acoustic model trained without using any previously determined fixed alignment targets between the training audio data and the word-level transcriptions;
generating, by the data processing hardware using the trained first CTC acoustic model, a context-dependent state inventory from approximate phonetic alignments between the training audio data and the phonetic sequences corresponding to the word-level transcriptions; and
training, by the data processing hardware, a second CTC acoustic model using the context-dependent state inventory to generate outputs corresponding to one or more context-dependent states.
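The flat start step in the claim above works because the CTC objective needs no fixed frame-to-label alignment targets: it marginalizes over every monotonic alignment between the audio frames and the label sequence. A minimal sketch of that alignment-free objective, the CTC forward (alpha) recursion in pure Python (the function name and toy inputs are illustrative; the patent does not prescribe this implementation):

```python
import math

def ctc_log_loss(log_probs, target, blank=0):
    """Negative log-likelihood of `target` under CTC, summing over all
    monotonic alignments -- no fixed per-frame alignment targets.

    log_probs: list of T lists; log_probs[t][k] = log P(symbol k at frame t)
    target: non-empty list of label indices (no blanks); assumes T >= 1
    """
    # Extended label sequence with blanks interleaved: [a, b] -> [_, a, _, b, _]
    ext = [blank]
    for s in target:
        ext.extend([s, blank])
    S, T = len(ext), len(log_probs)
    NEG_INF = float("-inf")

    def logsumexp(*xs):
        m = max(xs)
        if m == NEG_INF:
            return NEG_INF
        return m + math.log(sum(math.exp(x - m) for x in xs))

    # alpha[s] = log prob of all alignment prefixes ending in state s at frame t
    alpha = [NEG_INF] * S
    alpha[0] = log_probs[0][ext[0]]  # start with blank
    alpha[1] = log_probs[0][ext[1]]  # or with the first label
    for t in range(1, T):
        new = [NEG_INF] * S
        for s in range(S):
            terms = [alpha[s]]                 # stay in the same state
            if s > 0:
                terms.append(alpha[s - 1])     # advance one state
            # skip over a blank, allowed between distinct non-blank labels
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                terms.append(alpha[s - 2])
            new[s] = logsumexp(*terms) + log_probs[t][ext[s]]
        alpha = new
    # Valid alignments end on the last label or the trailing blank
    return -logsumexp(alpha[S - 1], alpha[S - 2])
```

With two frames, a vocabulary of {blank, a} at uniform probability 0.5, and target [a], the three valid alignments (a·blank, blank·a, a·a) each have probability 0.25, so the loss is -log(0.75) ≈ 0.2877.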
Abstract
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for training acoustic models and using the trained acoustic models. A connectionist temporal classification (CTC) acoustic model is accessed, the CTC acoustic model having been trained using a context-dependent state inventory generated from approximate phonetic alignments determined by another CTC acoustic model trained without fixed alignment targets. Audio data for a portion of an utterance is received. Input data corresponding to the received audio data is provided to the accessed CTC acoustic model. Data indicating a transcription for the utterance is generated based on output that the accessed CTC acoustic model produced in response to the input data. The data indicating the transcription is provided as output of an automated speech recognition service.
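The abstract's "approximate phonetic alignments determined by another CTC acoustic model" can be read as frame-level alignments recovered from a trained CTC model; one common way to obtain them is best-path decoding, taking the argmax label at each frame and then collapsing repeats and blanks. A hedged sketch of that step (the greedy decoding choice and all names are assumptions for illustration; the patent does not prescribe this exact procedure):

```python
def best_path_alignment(log_probs, blank=0):
    """Frame-level best-path alignment from per-frame CTC posteriors.

    log_probs: list of T lists of per-symbol scores (log-probs or probs;
    only the per-frame argmax matters here).
    Returns (frame_labels, collapsed): the per-frame argmax labels, and the
    phone sequence after removing repeats and blanks. The frame labels give
    approximate phone boundaries from which context-dependent state
    statistics could be accumulated.
    """
    frame_labels = [max(range(len(p)), key=p.__getitem__) for p in log_probs]
    collapsed, prev = [], None
    for k in frame_labels:
        if k != prev and k != blank:
            collapsed.append(k)
        prev = k
    return frame_labels, collapsed
```

For example, per-frame posteriors peaking on label 1, 1, blank, 1 yield frame labels [1, 1, 0, 1] and the collapsed sequence [1, 1], since the intervening blank separates two occurrences of the same label.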
130 Citations
26 Claims
1. A method comprising:
accessing, by data processing hardware, acoustic model training data comprising training audio data and word-level transcriptions for the training audio data;
flat start training, by data processing hardware, a first connectionist temporal classification (CTC) acoustic model on the acoustic model training data to generate phonetic sequences corresponding to the word-level transcriptions, the first CTC acoustic model trained without using any previously determined fixed alignment targets between the training audio data and the word-level transcriptions;
generating, by the data processing hardware using the trained first CTC acoustic model, a context-dependent state inventory from approximate phonetic alignments between the training audio data and the phonetic sequences corresponding to the word-level transcriptions; and
training, by the data processing hardware, a second CTC acoustic model using the context-dependent state inventory to generate outputs corresponding to one or more context-dependent states.
View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13)
14. A system comprising:
data processing hardware;
memory hardware in communication with the data processing hardware and storing instructions that when executed by the data processing hardware cause the data processing hardware to perform operations comprising:
accessing acoustic model training data comprising training audio data and word-level transcriptions for the training audio data;
flat start training a first connectionist temporal classification (CTC) acoustic model on the acoustic model training data to generate phonetic sequences corresponding to the word-level transcriptions, the first CTC acoustic model trained without using any previously determined fixed alignment targets between the training audio data and the word-level transcriptions;
generating, using the trained first CTC acoustic model, a context-dependent state inventory from approximate phonetic alignments between the training audio data and the phonetic sequences corresponding to the word-level transcriptions; and
training a second CTC acoustic model using the context-dependent state inventory to generate outputs corresponding to one or more context-dependent states.
View Dependent Claims (15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26)
Specification