MODEL TRAINING FOR AUTOMATIC SPEECH RECOGNITION FROM IMPERFECT TRANSCRIPTION DATA
First Claim
1. A computer-implemented method, comprising:
a. aligning an utterance from a set of training data with a corresponding original transcription from the set of training data to produce a time-aligned transcription with time alignment information for each word in the utterance, wherein the set of training data includes transcription errors;
b. decoding the same utterance with an incremental acoustic model and an incremental language model to produce a decoded transcription with time alignment information for each word;
c. aligning the time-aligned and decoded transcriptions according to time alignment information;
d. selecting all segments from the utterance having at least Q contiguous matching aligned words, where Q is a positive integer; and
e. training the incremental acoustic model with the selected segments.
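Steps c and d above can be illustrated with a minimal sketch. The claim does not say how two words are judged to "match" in time, so the rule used here (same text, with the decoded word covering at least half of the reference word's duration) is an assumption, as are the names `TimedWord`, `match_flags`, `select_segments`, and `min_overlap`. Contiguity is checked on the reference side only, which is a simplification.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class TimedWord:
    text: str
    start: float  # seconds
    end: float    # seconds


def overlap(a: TimedWord, b: TimedWord) -> float:
    """Length in seconds of the time span shared by two words."""
    return max(0.0, min(a.end, b.end) - max(a.start, b.start))


def match_flags(reference: Sequence[TimedWord],
                decoded: Sequence[TimedWord],
                min_overlap: float = 0.5) -> List[bool]:
    """Flag each reference word that has a same-text decoded word
    overlapping at least `min_overlap` of its duration (assumed rule)."""
    flags = []
    for ref in reference:
        duration = max(ref.end - ref.start, 1e-9)  # guard zero-length words
        flags.append(any(
            d.text == ref.text and overlap(ref, d) / duration >= min_overlap
            for d in decoded
        ))
    return flags


def select_segments(reference: Sequence[TimedWord],
                    decoded: Sequence[TimedWord],
                    q: int,
                    min_overlap: float = 0.5) -> List[List[TimedWord]]:
    """Return every run of at least q contiguous matching reference words."""
    if q < 1:
        raise ValueError("q must be a positive integer")
    flags = match_flags(reference, decoded, min_overlap)
    segments: List[List[TimedWord]] = []
    run_start = None
    # A trailing False acts as a sentinel that closes the final run.
    for i, ok in enumerate(flags + [False]):
        if ok and run_start is None:
            run_start = i
        elif not ok and run_start is not None:
            if i - run_start >= q:
                segments.append(list(reference[run_start:i]))
            run_start = None
    return segments
```

Each returned segment carries the start time of its first word and the end time of its last word, which is what a trainer would need to cut the matching audio span out of the utterance.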
Abstract
Techniques and systems for training an acoustic model are described. In an embodiment, a technique for training an acoustic model includes dividing a corpus of training data that includes transcription errors into N parts, and on each part, decoding an utterance with an incremental acoustic model and an incremental language model to produce a decoded transcription. The technique may further include inserting silence between a pair of words into the decoded transcription and aligning an original transcription corresponding to the utterance with the decoded transcription according to time for each part. The technique may further include selecting a segment from the utterance having at least Q contiguous matching aligned words, and training the incremental acoustic model with the selected segment. The trained incremental acoustic model may then be used on a subsequent part of the training data. Other embodiments are described and claimed.
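The part-by-part flow described in the abstract can be sketched as a simple loop. This is only an illustration under stated assumptions: `split_into_parts` and `train_incrementally` are hypothetical names, the corpus is split into nearly equal contiguous parts (the abstract does not say how the N parts are formed), and the decode, align, and select steps are folded into a caller-supplied `select` function while training is a caller-supplied `update` function.

```python
from typing import Callable, List, Sequence, TypeVar

U = TypeVar("U")  # utterance paired with its original transcription
S = TypeVar("S")  # selected training segment
M = TypeVar("M")  # incremental acoustic model


def split_into_parts(corpus: Sequence[U], n: int) -> List[List[U]]:
    """Split the corpus into n contiguous parts of nearly equal size."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    size, extra = divmod(len(corpus), n)
    parts, start = [], 0
    for i in range(n):
        # The first `extra` parts take one additional item each.
        end = start + size + (1 if i < extra else 0)
        parts.append(list(corpus[start:end]))
        start = end
    return parts


def train_incrementally(corpus: Sequence[U],
                        n: int,
                        model: M,
                        select: Callable[[M, U], List[S]],
                        update: Callable[[M, List[S]], M]) -> M:
    """Process the parts in order; the model trained on one part is the
    model used to decode and select segments on the next part."""
    for part in split_into_parts(corpus, n):
        selected = [seg for utt in part for seg in select(model, utt)]
        model = update(model, selected)
    return model
```

Passing the current model into `select` is the point of the loop: each later part is decoded with a model that has already been trained on segments selected from the earlier parts.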
20 Claims
1. A computer-implemented method, comprising:
a. aligning an utterance from a set of training data with a corresponding original transcription from the set of training data to produce a time-aligned transcription with time alignment information for each word in the utterance, wherein the set of training data includes transcription errors;
b. decoding the same utterance with an incremental acoustic model and an incremental language model to produce a decoded transcription with time alignment information for each word;
c. aligning the time-aligned and decoded transcriptions according to time alignment information;
d. selecting all segments from the utterance having at least Q contiguous matching aligned words, where Q is a positive integer; and
e. training the incremental acoustic model with the selected segments.
Dependent claims: 2, 3, 4, 5
6. A computer-readable storage medium storing computer-executable program instructions that when executed cause a computing system to:
compute a frame posterior for each word in an utterance from a corpus comprising audio data and a corresponding transcription that contains transcription errors;
train an acoustic model with confidence-based maximum likelihood estimation (MLE) training using the frame posterior;
estimate acoustic model parameters with confidence-based discriminative training using the frame posterior; and
generate a finalized acoustic model.
Dependent claims: 7, 8, 9, 10, 11, 12, 13, 14, 15
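One way to read "confidence-based MLE training using the frame posterior" is that each frame's contribution to the estimate is scaled by its posterior. The claim text above does not specify how the posterior is computed or applied, so the following is a sketch under that assumption, reduced to a single one-dimensional Gaussian; `weighted_gaussian_mle` is a hypothetical name.

```python
from typing import Sequence, Tuple


def weighted_gaussian_mle(frames: Sequence[float],
                          posteriors: Sequence[float]) -> Tuple[float, float]:
    """Posterior-weighted maximum likelihood estimate of a 1-D Gaussian.

    Each frame counts in proportion to its posterior (assumed to lie in
    [0, 1]), so frames under likely-mistranscribed words contribute little.
    Returns (mean, variance).
    """
    if len(frames) != len(posteriors):
        raise ValueError("frames and posteriors must have the same length")
    total = sum(posteriors)
    if total <= 0:
        raise ValueError("posteriors must sum to a positive value")
    mean = sum(w * x for w, x in zip(posteriors, frames)) / total
    variance = sum(w * (x - mean) ** 2 for w, x in zip(posteriors, frames)) / total
    return mean, variance
```

With all posteriors equal to 1 this reduces to the ordinary MLE mean and variance; a frame with posterior 0 is ignored entirely.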
16. A system, comprising:
an alignment component operative to align an utterance from a corpus of training data including transcription errors with a corresponding original transcription from the corpus of training data to produce a time-aligned transcription with time alignment information for each word in the utterance;
a decoding component operative to decode the utterance from the corpus of training data using an incremental acoustic model and an incremental language model to produce a decoded transcription;
wherein the alignment component is operative to align the time-aligned transcription with the decoded transcription;
a segment selecting component operative to select a segment from the utterance having at least Q contiguous matching aligned words, where Q is a positive integer; and
a training component to train the incremental acoustic model with the selected segment and to generate a final acoustic model.
Dependent claims: 17, 18, 19, 20
Specification