Methods and apparatus for performing speech recognition using acoustic models which are improved through an iterative process
First Claim
1. A method of processing audio data and text data comprising:
performing a speech recognition operation on the audio data using an initial acoustic model to produce a set of recognized text including words;
identifying, as a function of words included in said text data, correctly recognized words in the set of recognized text data;
selecting a subset of said audio data, corresponding to the correctly recognized words, for re-training purposes, the subset of said audio data including less than all of said audio data; and
re-training the initial acoustic model, using the selected subset of audio data corresponding to correctly recognized words as training data, to produce an updated acoustic model.
Abstract
Automated methods and apparatus for synchronizing audio and text data, e.g., in the form of electronic files, representing audio and text expressions of the same work or information are described. Also described are automated methods of detecting errors and other discrepancies between the audio and text versions of the same work. A speech recognition operation is initially performed on the audio data using a speaker independent acoustic model. The speech recognition operation produces recognized text together with audio time stamps. The recognized text is compared to the text in the text data to identify correctly recognized words. The acoustic model is then retrained using the correctly recognized words and the corresponding audio segments from the audio data, transforming the initial acoustic model into a speaker trained acoustic model. The retrained acoustic model is then used to perform an additional speech recognition operation on the audio data, and the audio and text data are synchronized using the results of that operation. In addition, one or more error reports based on the final recognition results are generated showing discrepancies between the recognized words and the words included in the text. By retraining the acoustic model in the above described manner, improved recognition accuracy is achieved.
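The processing flow summarized in the abstract can be sketched in a few lines of Python. This is a minimal illustration only: the recognize and retrain callables, and the (word, start, end) tuple format, are hypothetical stand-ins for a real speech recognizer and acoustic model trainer, not the patented implementation. The sketch shows how correctly recognized words select the audio segments used for retraining, and how the second recognition pass yields synchronization points and a list of unmatched reference words.

    # Minimal sketch of the flow in the abstract; recognize() and retrain()
    # are hypothetical callables standing in for a real recognizer/trainer.
    from difflib import SequenceMatcher

    def align_correct(recognized, reference_words):
        """Pair each correctly recognized (word, start, end) entry with the
        index of the matching word in the reference text."""
        rec_words = [w for w, _, _ in recognized]
        matcher = SequenceMatcher(a=rec_words, b=reference_words, autojunk=False)
        pairs = []
        for block in matcher.get_matching_blocks():
            for i in range(block.size):
                pairs.append((recognized[block.a + i], block.b + i))
        return pairs

    def synchronize(audio, reference_words, model, recognize, retrain):
        # 1. First pass with the speaker independent acoustic model.
        first_pass = recognize(audio, model)            # [(word, start, end), ...]
        correct = align_correct(first_pass, reference_words)
        # 2. Retrain only on the audio segments of correctly recognized words.
        segments = [(start, end) for (word, start, end), _ in correct]
        words = [word for (word, _, _), _ in correct]
        speaker_model = retrain(model, audio, segments, words)
        # 3. Second pass with the speaker trained model, then synchronize.
        second_pass = recognize(audio, speaker_model)
        anchors = align_correct(second_pass, reference_words)
        sync_points = {ref_idx: start for (word, start, _), ref_idx in anchors}
        # 4. Discrepancy report: reference words never anchored to the audio.
        missed = [w for i, w in enumerate(reference_words) if i not in sync_points]
        return speaker_model, sync_points, missed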
28 Claims
1. A method of processing audio data and text data comprising:
performing a speech recognition operation on the audio data using an initial acoustic model to produce a set of recognized text including words;
identifying, as a function of words included in said text data, correctly recognized words in the set of recognized text data;
selecting a subset of said audio data, corresponding to the correctly recognized words, for re-training purposes, the subset of said audio data including less than all of said audio data; and
re-training the initial acoustic model, using the selected subset of audio data corresponding to correctly recognized words as training data, to produce an updated acoustic model. - View Dependent Claims (2, 4, 5, 6, 7, 8, 9, 12, 15, 16)

2. The method of claim 1, further comprising:
performing an additional speech recognition operation on the audio data using the updated acoustic model; and
synchronizing the audio and text data by using the results of the additional speech recognition operation to identify points in the audio and text data which correspond to one another.
4. The method of claim 1, wherein re-training the initial acoustic model includes the acts of:
operating a phonetic recognizer to generate a phonetic representation of each correctly recognized word from the audio data corresponding to the correctly recognized word;
operating a phonetic transcriber to generate from a text representation of at least some of the correctly identified words, a plurality of possible phonetic representations of each word; and
selecting to be used for acoustic model retraining purposes, from the plurality of possible phonetic representations of each word, the transcribed phonetic representation which most closely matches the phonetic representation generated for the corresponding word by the phonetic recognizer.
5. The method of claim 4, wherein re-training the initial acoustic model further includes the acts of:
performing an acoustic model training operation using the selected transcribed phonetic representations, the initial acoustic model, and the audio data corresponding to the words for which transcribed phonetic representations have been selected.
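Claims 4 and 5 describe choosing, for each correctly recognized word, the dictionary pronunciation that lies closest to what a phonetic recognizer actually heard in the audio. A small sketch of that selection using a generic string-similarity measure; the phone symbols and example pronunciations are invented for illustration and do not come from the patent.

    # Pick the transcriber pronunciation closest to the recognizer's phones.
    from difflib import SequenceMatcher

    def closest_pronunciation(heard_phones, candidate_pronunciations):
        """Return the candidate phone sequence most similar to the phone
        sequence the phonetic recognizer produced for the word's audio."""
        def similarity(candidate):
            return SequenceMatcher(a=heard_phones, b=candidate).ratio()
        return max(candidate_pronunciations, key=similarity)

    # Example: the transcriber offers two pronunciations of "either"; the
    # recognizer heard the "AY" variant, so that variant is kept for retraining.
    heard = ["AY", "DH", "ER"]
    candidates = [["IY", "DH", "ER"], ["AY", "DH", "ER"]]
    print(closest_pronunciation(heard, candidates))    # ['AY', 'DH', 'ER']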
6. The method of claim 5, wherein re-training the initial acoustic model further includes the acts of:
using the updated acoustic model as the initial acoustic model;
and repeating, until a pre-selected acoustic model updating stop criterion is satisfied, the steps of:
performing a speech recognition operation on the audio data using the initial acoustic model to produce a set of recognized text including words;
using the text data to identify correctly recognized words in the set of recognized text data; and
re-training the initial acoustic model using the audio data corresponding to correctly recognized words as training data to produce an updated acoustic model.
7. The method of claim 6, wherein the pre-selected acoustic model updating stop criterion is a failure of the updated acoustic model to produce improved speech recognition results when used to perform a speech recognition operation on said audio data.
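Claims 6 and 7 describe repeating the recognize, identify, retrain cycle until a stop criterion is met, claim 7 naming that criterion as the updated model failing to improve recognition of the audio. A rough sketch of such a loop, again with hypothetical recognize and retrain callables; the word-accuracy measure used here is an assumption chosen for brevity.

    # Repeat recognize -> identify correct words -> retrain until the updated
    # model stops improving recognition of the audio (claim 7's criterion).
    from difflib import SequenceMatcher

    def recognition_accuracy(recognized_words, reference_words):
        matcher = SequenceMatcher(a=recognized_words, b=reference_words,
                                  autojunk=False)
        matched = sum(block.size for block in matcher.get_matching_blocks())
        return matched / max(len(reference_words), 1)

    def retrain_until_converged(audio, reference_words, model, recognize, retrain):
        best_accuracy, best_model = -1.0, model
        while True:
            recognized = recognize(audio, model)        # [(word, start, end), ...]
            words = [w for w, _, _ in recognized]
            accuracy = recognition_accuracy(words, reference_words)
            if accuracy <= best_accuracy:
                return best_model                       # no improvement: stop
            best_accuracy, best_model = accuracy, model
            matcher = SequenceMatcher(a=words, b=reference_words, autojunk=False)
            correct = [recognized[block.a + i]
                       for block in matcher.get_matching_blocks()
                       for i in range(block.size)]
            model = retrain(model, audio,
                            [(start, end) for _, start, end in correct],
                            [word for word, _, _ in correct])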
8. The method of claim 2, wherein re-training the initial acoustic model includes the acts of:
operating a phonetic recognizer to generate a phonetic representation of each correctly recognized word from the audio data corresponding to the correctly recognized word;
operating a phonetic transcriber to generate from a text representation of at least some of the correctly identified words, a plurality of possible phonetic representations of each word; and
for each word for which a plurality of possible phonetic representations are generated by the phonetic transcriber:
determining if at least one of the plurality of phonetic representations of the word matches the phonetic representation of the word generated by the phonetic recognizer; and
selecting the phonetic representation generated by the phonetic recognizer to be used in training the updated acoustic model when it is determined that none of the plurality of phonetic representations of the word generated by the phonetic transcriber matches the output of the phonetic recognizer for the corresponding word.
9. The method of claim 8, wherein for each word for which a plurality of possible phonetic representations are generated by the phonetic transcriber, re-training the initial acoustic model includes the further acts of:
selecting the phonetic representation generated by the phonetic transcriber which most closely matches the output of the phonetic recognizer for the word, when it is determined that at least one of the plurality of phonetic representations of the word generated by the phonetic transcriber matches the output of the phonetic recognizer for the corresponding word.
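Claims 8 and 9 refine the selection: when at least one of the transcriber's pronunciations matches what the phonetic recognizer heard, the closest one is used; when none matches, the recognizer's own phone string is used instead. A brief sketch of that decision; the similarity threshold standing in for a "match" is an assumption, since the claims do not define one.

    # Prefer the closest transcriber pronunciation if any of them "matches"
    # the recognizer output; otherwise fall back to the recognizer's phones.
    from difflib import SequenceMatcher

    def pick_training_pronunciation(heard_phones, candidates, match_threshold=0.8):
        def similarity(candidate):
            return SequenceMatcher(a=heard_phones, b=candidate).ratio()
        best = max(candidates, key=similarity)
        if similarity(best) >= match_threshold:
            return best             # claim 9: closest matching transcriber output
        return heard_phones         # claim 8: no match, use the recognizer output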
12. The method of claim 2, wherein synchronizing the audio and text data includes the acts of:
globally aligning recognized text produced by performing the speech recognition operation using the updated acoustic model with words included in the text data;
identifying a first location in the recognized text where silence was recognized and where at least one correctly recognized word adjoins the recognized silence; and
inserting into the text data, at the location aligned with said first identified location, a pointer to the audio data corresponding to the recognized silence.
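Claim 12 (and, in device form, claim 22) anchors synchronization on recognized silences that sit next to correctly recognized words. The sketch below aligns the recognized token sequence with the reference text and records each qualifying silence's audio time at the aligned text position; the "<sil>" token and the returned structure are illustrative assumptions rather than the patent's data format.

    # Insert audio time pointers into the text at silences that adjoin
    # correctly recognized words.
    from difflib import SequenceMatcher

    SIL = "<sil>"

    def insert_audio_pointers(recognized, reference_words):
        """recognized: [(token, start, end)], where token is a word or SIL.
        Returns the reference words with ("POINTER", time) entries inserted
        wherever a recognized silence adjoins a correctly recognized word."""
        rec_tokens = [t for t, _, _ in recognized]
        matcher = SequenceMatcher(a=rec_tokens, b=reference_words, autojunk=False)
        rec_to_ref = {}                      # recognized index -> reference index
        for block in matcher.get_matching_blocks():
            for i in range(block.size):
                rec_to_ref[block.a + i] = block.b + i
        pointers = {}                        # reference index -> silence start time
        for i, (token, start, _) in enumerate(recognized):
            if token != SIL:
                continue
            if i + 1 in rec_to_ref:          # silence just before a correct word
                pointers[rec_to_ref[i + 1]] = start
            elif i - 1 in rec_to_ref:        # silence just after a correct word
                pointers[rec_to_ref[i - 1] + 1] = start
        synced = []
        for j, word in enumerate(reference_words):
            if j in pointers:
                synced.append(("POINTER", pointers[j]))
            synced.append(word)
        if len(reference_words) in pointers:         # silence after the last word
            synced.append(("POINTER", pointers[len(reference_words)]))
        return synced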
15. The method of claim 2,
wherein the audio data and text data are unsynchronized audio and text versions, respectively, of the same literary work;
wherein the initial acoustic model is a speaker independent acoustic model; and
wherein the updated acoustic model is a speaker trained acoustic model;
the method further comprising:
generating a report, using the speech recognition results obtained by performing the additional speech recognition on the audio data, identifying discrepancies between the audio data and the text data.
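Claim 15 adds a report of discrepancies between the audio and the text, generated from the recognition results of the additional pass. A short sketch of such a report using a standard sequence comparison; the report fields are an assumed format, not one specified by the claim.

    # List differences between the reference text and the words recognized
    # in the audio, using a standard sequence comparison.
    from difflib import SequenceMatcher

    def discrepancy_report(reference_words, recognized_words):
        matcher = SequenceMatcher(a=reference_words, b=recognized_words,
                                  autojunk=False)
        report = []
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                continue
            report.append({
                "type": tag,                             # replace / delete / insert
                "text_words": reference_words[i1:i2],    # what the text says
                "audio_words": recognized_words[j1:j2],  # what was heard
                "text_position": i1,
            })
        return report

    print(discrepancy_report("the cat sat on the mat".split(),
                             "the cat sat on a mat".split()))
    # [{'type': 'replace', 'text_words': ['the'], 'audio_words': ['a'], 'text_position': 4}]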
16. The method of claim 1, wherein retraining the initial acoustic model includes the acts of:
supplying the identified correctly recognized words to an acoustic model retraining device; and
operating the acoustic model retraining device to retrain the initial acoustic model using the correctly recognized words as training data.
3. A method of processing audio data and text data comprising:
performing a speech recognition operation on the audio data using an initial acoustic model to produce a set of recognized text including words;
identifying, as a function of words included in said text data, correctly recognized words in the set of recognized text data;
re-training the initial acoustic model, using the audio data corresponding to correctly recognized words as training data, to produce an updated acoustic model;
performing an additional speech recognition operation on the audio data using the updated acoustic model; and
synchronizing the audio and text data by using the results of the additional speech recognition operation to identify points in the audio and text data which correspond to one another, wherein synchronizing the audio and text data includes the act of:
aligning a set of recognized words separated by silence produced by performing said additional speech recognition operation with words, separated by punctuation corresponding to silence, included in the text data.
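Claim 3 aligns recognized words separated by silence with text words separated by punctuation, treating pause-implying punctuation as the textual counterpart of recognized silence. A sketch of that idea; the tokenization and the particular punctuation set are illustrative assumptions.

    # Treat pause-implying punctuation in the text as the counterpart of
    # recognized silence so both sequences can be aligned pause-for-pause.
    import re
    from difflib import SequenceMatcher

    SIL = "<sil>"
    PAUSE_PUNCT = ".,;:!?"

    def text_to_pause_tokens(text):
        """Lower-cased words from the text, with SIL inserted wherever a
        pause-implying punctuation mark occurs."""
        tokens = []
        for piece in re.findall(r"[\w']+|[" + re.escape(PAUSE_PUNCT) + r"]", text):
            tokens.append(SIL if piece in PAUSE_PUNCT else piece.lower())
        return tokens

    def align_with_pauses(recognized_tokens, text):
        """Align recognized tokens (words and SIL) with the pause-marked text."""
        reference = text_to_pause_tokens(text)
        matcher = SequenceMatcher(a=recognized_tokens, b=reference, autojunk=False)
        return matcher.get_matching_blocks()

    print(text_to_pause_tokens("Call me Ishmael. Some years ago"))
    # ['call', 'me', 'ishmael', '<sil>', 'some', 'years', 'ago']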
10. A method of processing audio data and text data comprising:
performing a speech recognition operation on the audio data using an initial acoustic model to produce a set of recognized text including words;
identifying, as a function of words included in said text data, correctly recognized words in the set of recognized text data; and
re-training the initial acoustic model, using the audio data corresponding to correctly recognized words as training data, to produce an updated acoustic model;
performing an additional speech recognition operation on the audio data using the updated acoustic model; and
synchronizing the audio and text data by using the results of the additional speech recognition operation to identify points in the audio and text data which correspond to one another, synchronizing the audio and text data including:
i. globally aligning recognized text produced by performing the speech recognition operation using the updated acoustic model with words included in the text data;
ii. identifying a first location in the recognized text where silence was recognized and where at least one correctly recognized word adjoins the recognized silence; and
iii. inserting into the text data, at the location aligned with said first identified location, a pointer to the audio data corresponding to the recognized silence. - View Dependent Claims (11, 13, 14)

11. The method of claim 10,
where the first identified location is a location in the recognized text where silence is expected to occur based on information included in the aligned text data; and
wherein synchronizing the audio and text data further comprises the acts of:
identifying an additional location in the recognized text where silence was recognized and additional silence was expected to occur based on information included in the aligned text data and where at least one correctly recognized word adjoins the additional recognized silence; and
inserting into the text data, at the location aligned with said additional identified location, a pointer to the audio data corresponding to the additional recognized silence.
13. The method of claim 10,
where the first identified location is a location in the recognized text where silence is expected to occur based on information included in the aligned text data; and
wherein synchronizing the audio and text data further comprises the acts of:
identifying an additional location in the recognized text where silence was recognized and additional silence was expected to occur based on information included in the aligned text data and where at least one correctly recognized word adjoins the additional recognized silence; and
inserting into the text data, at the location aligned with said additional identified location, a pointer to the audio data corresponding to the additional recognized silence.
14. The method of claim 13,
wherein the audio data and text data are unsynchronized audio and text versions, respectively, of the same literary work;
wherein the initial acoustic model is a speaker independent acoustic model; and
wherein the updated acoustic model is a speaker trained acoustic model.
17. A computer readable medium, comprising:
computer executable instructions for controlling a computer device to perform a speech recognition operation on the audio data using an initial acoustic model to produce a set of recognized text including words;
identify, as a function of words included in said text data, correctly recognized words in the set of recognized text data;
select a subset of said audio data, corresponding to the correctly recognized words, for re-training purposes, the subset of said audio data including less than all of said audio data; and
re-train the initial acoustic model, using the selected subset of audio data corresponding to correctly recognized words as training data, to produce an updated acoustic model.
18. A device for processing electronic text data and electronic audio data, comprising:
memory for storing an initial speaker independent acoustic model;
a speech recognizer for performing a first speech recognition operation on the audio data, as a function of the initial speaker independent acoustic model, to produce a set of recognized text;
means for identifying, as a function of words included in said text data, words which are correctly recognized by the speech recognizer;
an audio data selection module for selecting a subset of said audio data, corresponding to words which were correctly recognized by the speech recognizer, for re-training purposes, the subset of said audio data including less than all of said audio data; and
an acoustic model trainer for re-training the initial acoustic model, using the subset of said audio data corresponding to correctly recognized words as training data, to thereby produce an updated acoustic model. - View Dependent Claims (19, 20, 21, 24)

19. The device of claim 18, further comprising:
means for using the updated acoustic model to perform a second speech recognition operation on said audio data; and
means for synchronizing the audio data and the text data as a function of recognized text produced using the updated acoustic model.
20. The device of claim 19, wherein the acoustic model trainer includes:
a phonetic recognizer for generating phonetic representations of recognized words from audio data corresponding to said words;
a phonetic transcriber for generating a plurality of phonetic transcriptions for at least some words for which the phonetic recognizer generates a phonetic representation; and
a phonetic representation selector coupled to the phonetic recognizer and phonetic transcriber for selecting one of the phonetic representations output by the phonetic transcriber for each word used by the acoustic model trainer for re-training the initial acoustic model.
21. The device of claim 19, wherein the acoustic model trainer includes:
a phonetic recognizer for generating phonetic representations of recognized words from audio data corresponding to said words;
a phonetic transcriber for generating a plurality of phonetic transcriptions for at least some words for which the phonetic recognizer generates a phonetic representation; and
a phonetic representation selector for selecting, as a function of the output of the phonetic recognizer, one of the phonetic representations output by the phonetic transcriber or the phonetic recognizer for each word used by the acoustic model trainer for re-training the initial acoustic model.
24. The device of claim 19, further comprising:
means for generating a report of discrepancies between the content of the audio data and text data based on speech recognition results obtained by using the updated acoustic model to perform a speech recognition operation on the audio data.
22. A device for processing electronic text data and electronic audio data, comprising:
memory for storing an initial speaker independent acoustic model;
a speech recognizer for performing a first speech recognition operation on the audio data, as a function of the initial speaker independent acoustic model, to produce a set of recognized text;
means for identifying, as a function of words included in said text data, words which are correctly recognized by the speech recognizer;
an acoustic model trainer for re-training the initial acoustic model, using audio data corresponding to correctly recognized words as training data, to thereby produce an updated acoustic model;
means for using the updated acoustic model to perform a second speech recognition operation on said audio data; and
means for synchronizing the audio data and the text data as a function of recognized text produced using the updated acoustic model, said means for synchronizing the audio data and the text data including:
means for globally aligning text recognized by performing a speech recognition operation on the audio data using the updated acoustic model with words included in the text data;
means for identifying a first location in the recognized text where silence was recognized and where at least one correctly recognized word adjoins the recognized silence; and
means for inserting into the text data, at the location aligned with said first identified location, a pointer to the audio data corresponding to the recognized silence. - View Dependent Claims (23)

23. The device of claim 22, further comprising:
a statistical language model generation module coupled to the speech recognizer for generating from the text data a statistical language model used by the speech recognizer when performing a speech recognition operation on the audio data.
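Claim 23 adds a statistical language model generation module that builds, from the text data itself, the language model the recognizer uses. As a rough illustration, a tiny bigram model with add-one smoothing estimated from the text; the model form is an assumption chosen for brevity, not something the claim fixes.

    # Estimate a small bigram language model from the reference text; a
    # recognizer could use it to bias recognition toward that text's wording.
    from collections import Counter

    def train_bigram_lm(words):
        unigrams = Counter(words)
        bigrams = Counter(zip(words, words[1:]))
        vocab_size = len(unigrams)

        def probability(previous, word):
            # Add-one smoothed conditional probability P(word | previous).
            return (bigrams[(previous, word)] + 1) / (unigrams[previous] + vocab_size)

        return probability

    words = "the cat sat on the mat the cat slept".split()
    p = train_bigram_lm(words)
    print(p("the", "cat"))    # relatively high: "the cat" occurs twice
    print(p("the", "slept"))  # lower: "slept" never follows "the" in the text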
25. A device for processing audio data and text data corresponding to the same work, comprising:
a speech recognizer for performing a speech recognition operation on the audio data to produce a set of recognized text;
means for aligning the set of recognized text with the text data;
means for comparing the recognized text to the text data to identify inconsistencies between the recognized text and text data;
means for generating a report including information about identified inconsistencies between the audio and text data corresponding to the same work;
a language model for use by the speech recognizer when performing a speech recognition operation;
an initial speaker independent acoustic model for use by the speech recognizer with said language model when performing an initial speech recognition operation on the audio data; and
means for re-training the initial acoustic model to produce an updated acoustic model through an iterative acoustic model training process which uses a subset of said audio data, the subset of audio data corresponding to words which are determined by said comparing means to have been correctly recognized and including less than all of said audio data. - View Dependent Claims (26, 27, 28)

26. The device of claim 25, further comprising:
a phonetic recognizer for generating phonetic representations of recognized words from audio data corresponding to said words;
a phonetic transcriber for generating a plurality of phonetic transcriptions for at least some words for which the phonetic recognizer generates a phonetic representation; and
a phonetic representation selector for selecting, as a function of the output of the phonetic recognizer, one of the phonetic representations output by the phonetic transcriber or the phonetic recognizer for each word used by the acoustic model trainer for re-training the initial acoustic model.
27. The device of claim 26, wherein the audio data and text data represent unsynchronized audio and text versions, respectively, of the same literary work.
28. The device of claim 26, wherein the audio data represents an audio version of a program and wherein the text data represents a transcript of the audio version of said program.