System and method for rescoring N-best hypotheses of an automatic speech recognition system
Abstract
A system and method for rescoring the N-best hypotheses from an automatic speech recognition system by comparing an original speech waveform to synthetic speech waveforms that are generated for each text sequence of the N-best hypotheses. A distance is calculated from the original speech waveform to each of the synthesized waveforms, and the text associated with the synthesized waveform that is determined to be closest to the original waveform is selected as the final hypothesis. The original waveform and each synthesized waveform are aligned to a corresponding text sequence on a phoneme level. The mean of the feature vectors which align to each phoneme is computed for the original waveform as well as for each of the synthesized hypotheses. The distance of a synthesized hypothesis to the original speech signal is then computed as the sum over all phonemes in the hypothesis of the Euclidean distance between the means of the feature vectors of the frames aligning to that phoneme for the original and the synthesized signals. The text of the hypothesis which is closest under the above metric to the original waveform is chosen as the final system output.
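The distance described in the abstract can be illustrated with a minimal sketch. It assumes the feature vectors and the phoneme-level alignments have already been produced by some front end and aligner; the function names, the `(phoneme, start, end)` alignment layout, and the use of NumPy are assumptions made for illustration and are not taken from the patent.

```python
import numpy as np


def phoneme_means(features, alignment):
    """Return the mean feature vector of the frames aligned to each phoneme.

    features:  array of shape (num_frames, dim).
    alignment: list of (phoneme, start_frame, end_frame), end exclusive
               (hypothetical layout chosen for this sketch).
    """
    return [features[start:end].mean(axis=0) for _, start, end in alignment]


def hypothesis_distance(orig_feats, orig_align, synth_feats, synth_align):
    """Sum over phonemes of the Euclidean distance between per-phoneme means.

    Both alignments must cover the same phoneme sequence, because the
    original waveform is aligned to the text of the hypothesis under test.
    """
    if [p for p, _, _ in orig_align] != [p for p, _, _ in synth_align]:
        raise ValueError("alignments must share one phoneme sequence")
    orig_means = phoneme_means(orig_feats, orig_align)
    synth_means = phoneme_means(synth_feats, synth_align)
    return float(sum(np.linalg.norm(o - s)
                     for o, s in zip(orig_means, synth_means)))


def closest_hypothesis(scored):
    """Pick the text whose synthetic waveform has the smallest distance.

    scored: list of (text, distance) pairs, one per N-best hypothesis.
    """
    return min(scored, key=lambda pair: pair[1])[0]
```

Averaging per phoneme before comparing means the two signals do not need the same number of frames, which matters because synthesized and natural speech generally differ in duration.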
11 Claims
1. A computer readable medium storing a computer program to perform method steps for execution by a processor, the method steps comprising:
generating a synthetic waveform for each of N textual transcriptions of an original waveform, wherein N is greater than 1 and the N textual transcriptions are generated by a speech recognition system and represent N-best textual transcription hypotheses of the original waveform;
for each synthetic waveform, time-aligning feature vectors of the synthetic waveform with feature vectors of the original waveform at a phoneme level;
computing a mean of the feature vectors which align to each phoneme for the original waveform and the synthetic waveform;
computing a distance measure between each phoneme mean of the original waveform and the synthetic waveform;
summing the distance measures to generate an overall distance measure representing a distance between the original waveform and the synthetic waveform;
comparing scores based on the overall distance measure between the synthetic waveform and the original waveform, an acoustic model score of a corresponding textual transcription of the synthetic waveform, and a language model score of the corresponding textual transcription to determine a corresponding one of the N-best textual transcriptions; and
selecting for output the determined N-best textual transcription.
3. A method for recognizing speech, the method comprising the steps of:
generating a synthetic waveform for each of N textual transcriptions of an original waveform, wherein N is greater than 1 and the N textual transcriptions are generated by a speech recognition system and represent N-best textual transcription hypotheses of the original waveform;
for each synthetic waveform, computing a distance measure between the synthetic waveform and the original waveform;
summing the distance measures to generate an overall distance measure representing a distance between the original waveform and the synthetic waveform;
generating a score S from the overall distance measure D, an acoustic model score A of the corresponding textual transcription for the synthetic wave, and a language model score L of the corresponding textual transcription, wherein the score S = −D + (a*A) + (b*L), and 'a' and 'b' are constants;
selecting for output one of the textual transcriptions corresponding to the synthetic waveform having the score that indicates the synthetic wave is closest to the original waveform.
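As an illustration of the score in claim 3, the sketch below combines D, A, and L and picks one hypothesis. It assumes that A and L are log-likelihood-style scores for which larger is better and that the constants a and b are non-negative, so the largest S marks the hypothesis treated as closest; the claim itself does not state these conventions. The `Hypothesis` class, the function names, and the numeric values in the usage note are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    text: str
    distance: float        # D: overall distance to the original waveform
    acoustic_score: float  # A: acoustic model score (assumed higher is better)
    language_score: float  # L: language model score (assumed higher is better)


def combined_score(hyp, a, b):
    """S = -D + (a * A) + (b * L), with a and b constants."""
    return -hyp.distance + a * hyp.acoustic_score + b * hyp.language_score


def select_hypothesis(hypotheses, a, b):
    """Return the hypothesis with the largest S among the N-best list."""
    if len(hypotheses) < 2:
        raise ValueError("N must be greater than 1")
    return max(hypotheses, key=lambda h: combined_score(h, a, b))
```

For example, with a = 0.5 and b = 1.0, a hypothesis with D = 3, A = -10, L = -2 scores -10, while one with D = 5, A = -9, L = -4 scores -13.5, so the first would be selected. The weights a and b would in practice need tuning on held-out data, which is outside what the claim specifies.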
7. An automatic speech recognition system, comprising:
a decoder for decoding an original waveform of acoustic utterances to produce N textual transcriptions, the N textual transcriptions representing N-best textual transcription hypotheses of the decoded original waveform;
a text to speech system generating a synthetic waveform for each of the N textual transcriptions;
a means to perform a speaker normalization on the original waveform to match vocal-tract characteristics of a speaker from whose data the TTS is derived; and
a comparator for comparing scores based on an overall distance measure between each synthetic waveform and the normalized original waveform, an acoustic model score of a corresponding textual transcription of the synthetic waveform, and a language model score of the corresponding textual transcription to determine a corresponding one of the N-best textual transcriptions to output,
wherein the overall distance measures are computed by a processor; computing a distance measure between the synthetic waveform and the normalized original waveform; and summing the distance measures to generate an overall distance measure representing a distance between the normalized original waveform and the synthetic waveform, and
wherein N is greater than 1.
Specification