System and method for rescoring N-best hypotheses of an automatic speech recognition system

US 7,761,296 B1
Filed: 04/02/1999
Issued: 07/20/2010
Est. Priority Date: 04/02/1999
Status: Expired due to Fees

Abstract

A system and method for rescoring the N-best hypotheses from an automatic speech recognition system by comparing an original speech waveform to synthetic speech waveforms that are generated for each text sequence of the N-best hypotheses. A distance is calculated from the original speech waveform to each of the synthesized waveforms, and the text associated with the synthesized waveform that is determined to be closest to the original waveform is selected as the final hypothesis. The original waveform and each synthesized waveform are aligned to a corresponding text sequence on a phoneme level. The mean of the feature vectors which align to each phoneme is computed for the original waveform as well as for each of the synthesized hypotheses. The distance of a synthesized hypothesis to the original speech signal is then computed as the sum over all phonemes in the hypothesis of the Euclidean distance between the means of the feature vectors of the frames aligning to that phoneme for the original and the synthesized signals. The text of the hypothesis which is closest under the above metric to the original waveform is chosen as the final system output.
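The distance described in the abstract can be illustrated with a short Python sketch. It assumes the alignment has already been done, so each waveform arrives as a list of `(phoneme_label, frames)` pairs; that input layout and the function names `mean_vector`, `phoneme_distance`, and `closest_hypothesis` are illustrative choices, not taken from the patent.

```python
import math


def mean_vector(frames):
    """Component-wise mean of a non-empty list of equal-length feature vectors."""
    dim = len(frames[0])
    return [sum(f[i] for f in frames) / len(frames) for i in range(dim)]


def phoneme_distance(original, synthetic):
    """Sum over phonemes of the Euclidean distance between per-phoneme means.

    Both arguments are lists of (phoneme_label, frames) pairs in the same
    phoneme order, where frames holds the feature vectors aligned to that
    phoneme. This layout is an assumption made for the sketch.
    """
    if [p for p, _ in original] != [p for p, _ in synthetic]:
        raise ValueError("phoneme sequences must match")
    total = 0.0
    for (_, orig_frames), (_, syn_frames) in zip(original, synthetic):
        total += math.dist(mean_vector(orig_frames), mean_vector(syn_frames))
    return total


def closest_hypothesis(hypotheses):
    """Return the text whose synthetic waveform is nearest to the original.

    hypotheses: list of (text, original_alignment, synthetic_alignment).
    The original waveform is re-aligned to each hypothesis text, so every
    hypothesis carries its own alignment of the original.
    """
    return min(hypotheses, key=lambda h: phoneme_distance(h[1], h[2]))[0]
```

Note that the original waveform is aligned separately against each hypothesized text, which is why each hypothesis in the sketch carries its own alignment of the original frames.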

286 Citations

11 Claims

1. A computer readable medium storing a computer program to perform method steps for execution by a processor, the method steps comprising:
- generating a synthetic waveform for each of N textual transcriptions of an original waveform, wherein N is greater than 1 and the N textual transcriptions are generated by a speech recognition system and represent N-best textual transcription hypotheses of the original waveform;
  
  for each synthetic waveform, time-aligning feature vectors of the synthetic waveform with feature vectors of the original waveform at a phoneme level;
  
  computing a mean of the feature vectors which align to each phoneme for the original waveform and the synthetic waveform;
  
  computing a distance measure between each phoneme mean of the original waveform and the synthetic waveform;
  
  summing the distance measures to generate an overall distance measure representing a distance between the original waveform and the synthetic waveform;
  
  comparing scores based on the overall distance measure between the synthetic waveform and the original waveform, an acoustic model score of a corresponding textual transcription of the synthetic waveform, and a language model score of the corresponding textual transcription to determine a corresponding one of the N-best textual transcriptions; and
  
  selecting for output the determined N-best textual transcription.
- Dependent claims (2)
  - 2. The computer readable medium of claim 1, wherein the alignment is performed using a Viterbi alignment process.
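Claim 2 names a Viterbi alignment process without giving details. As a rough illustration only, the following Python sketch performs a Viterbi-style dynamic-programming alignment of frames to a phoneme sequence. It substitutes a single reference vector per phoneme for real acoustic models, which is a simplifying assumption; the function name `viterbi_align` and the `templates` argument are hypothetical.

```python
import math


def viterbi_align(frames, templates):
    """Assign each frame to one phoneme index, monotonically, at minimum cost.

    frames: list of feature vectors in time order.
    templates: one reference vector per phoneme in transcription order
    (a stand-in for real acoustic models; an assumption of this sketch).
    Returns a list holding the phoneme index chosen for every frame.
    """
    T, J = len(frames), len(templates)
    if J == 0 or T < J:
        raise ValueError("need at least one frame per phoneme")
    INF = float("inf")
    cost = [[INF] * J for _ in range(T)]
    back = [[0] * J for _ in range(T)]
    cost[0][0] = math.dist(frames[0], templates[0])
    for t in range(1, T):
        for j in range(min(t + 1, J)):
            stay = cost[t - 1][j]                       # remain in phoneme j
            advance = cost[t - 1][j - 1] if j > 0 else INF  # enter phoneme j
            if advance < stay:
                best, back[t][j] = advance, j - 1
            else:
                best, back[t][j] = stay, j
            cost[t][j] = best + math.dist(frames[t], templates[j])
    # Trace back from the last phoneme at the last frame.
    path = [J - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path
```

The returned path can be grouped by index to collect the frames that align to each phoneme, which is the input needed for the per-phoneme means in claim 1.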

3. A method for recognizing speech, the method comprising the steps of:
- generating a synthetic waveform for each of N textual transcriptions of an original waveform, wherein N is greater than 1 and the N textual transcriptions are generated by a speech recognition system and represent N-best textual transcription hypotheses of the original waveform;
  
  for each synthetic waveform, computing a distance measure between the synthetic waveform and the original waveform;
  
  summing the distance measures to generate an overall distance measure representing a distance between the original waveform and the synthetic waveform;
  
  generating a score S from the overall distance measure D, an acoustic model score A of the corresponding textual transcription for the synthetic wave, and a language model score L of the corresponding textual transcription, wherein the score S = −D + (a*A) + (b*L), and ‘a’ and ‘b’ are constants;
  
  selecting for output one of the textual transcriptions corresponding to the synthetic waveform having the score that indicates the synthetic wave is closest to the original waveform.
- Dependent claims (4, 5, 6)
  - 4. The method of claim 3, further comprising:
    - aligning frames of the original waveform and frames of each synthetic waveform to a corresponding one of the N textual transcriptions; and
      
      calculating the distance measure between the original waveform and each of the synthetic waveforms based on the corresponding alignments.
  - 5. The method of claim 4, further comprising:
    - retrieving feature vectors corresponding to the original waveform; and
      
      generating feature vectors for each synthetic waveform such that the feature vectors for the synthetic waveforms are similar in structure to the feature vectors of the original waveform, wherein the alignment is performed by time-aligning the feature vectors of the original waveform and the feature vectors of each synthetic waveform with the corresponding one of the N textual transcriptions.
  - 6. The method of claim 3, further comprising:
    - computing a mean feature vector of all feature vectors comprising each aligned frame for both the original and Nth synthetic waveform, wherein the distance measure for each aligned frame is calculated by determining a distance between each means of the corresponding aligned frames.
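The combined score recited in claim 3 is simple enough to show as a small Python sketch. It assumes that A and L are log-likelihood-style scores where larger is better, so the hypothesis with the highest S is the one selected; the claim itself only says the score "indicates" closeness. The dictionary keys and the default weights are hypothetical.

```python
def combined_score(distance, acoustic, language, a=1.0, b=1.0):
    """Score S = -D + a*A + b*L from claim 3; a and b are tuning constants."""
    return -distance + a * acoustic + b * language


def rescore(hypotheses, a=1.0, b=1.0):
    """Pick the text of the hypothesis with the highest combined score.

    hypotheses: list of dicts with keys "text", "D", "A", and "L"
    (an illustrative layout, not one given in the patent).
    """
    best = max(
        hypotheses,
        key=lambda h: combined_score(h["D"], h["A"], h["L"], a, b),
    )
    return best["text"]
```

Because D enters with a negative sign, a synthetic waveform that is farther from the original lowers the score, while the constants a and b control how much the recognizer's own acoustic and language model scores count against that distance.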

7. An automatic speech recognition system, comprising:
- a decoder for decoding an original waveform of acoustic utterances to produce N textual transcriptions, the N textual transcriptions representing N-best textual transcription hypotheses of the decoded original waveform;
  
  a text to speech system generating a synthetic waveform for each of the N textual transcriptions;
  
  a means to perform a speaker normalization on the original waveform to match vocal-tract characteristics of a speaker from whose data the TTS is derived; and
  
  a comparator for comparing scores based on an overall distance measure between each synthetic waveform and the normalized original waveform, an acoustic model score of a corresponding textual transcription of the synthetic waveform, and a language model score of the corresponding textual transcription to determine a corresponding one of the N-best textual transcriptions to output, wherein the overall distance measures are computed by a processor;
  
  computing a distance measure between the synthetic waveform and the normalized original waveform; and
  
  summing the distance measures to generate an overall distance measure representing a distance between the normalized original waveform and the synthetic waveform, and wherein N is greater than 1.
- Dependent claims (8, 9, 10, 11)
  - 8. The system of claim 7, further comprising a feature analysis processor adapted to generate a set of feature vectors for the normalized original waveform and generate a set of feature vectors for each of the N synthetic waveforms.
  - 9. The system of claim 7, further comprises:
    - means for aligning frames of the normalized original waveform and frames of each synthetic waveform to a corresponding one of the N textual transcriptions; and
      
      means for calculating the distance measure between the normalized original waveform and each of the synthetic waveforms based on the corresponding alignments.
  - 10. The system of claim 9, wherein the frames are aligned on a phoneme level.
  - 11. The system of claim 9, wherein the means for calculating the distance measures comprises:
    - means for calculating an individual distance between each aligned frame of the original normalized waveform and each of the N synthetic waveforms; and
      
      means for summing the individual distances of the aligned frames of the original normalized waveform and each synthetic waveform to compute the overall distance measures between the original normalized waveform and each synthetic waveform.
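The components of the system in claim 7 (decoder, text-to-speech system, speaker normalization, comparator) could be wired together roughly as in the Python sketch below. Every component is passed in as a plain callable, and all names (`Rescorer`, `Hypothesis`, `decode`, `synthesize`, `normalize`, `distance`) are hypothetical. Claim 7 does not state how the three scores are combined, so the sketch borrows the linear form from claim 3 as an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Waveform = Sequence[float]


@dataclass
class Hypothesis:
    text: str
    acoustic_score: float
    language_score: float


@dataclass
class Rescorer:
    decode: Callable[[Waveform], List[Hypothesis]]        # N-best decoder
    synthesize: Callable[[str], Waveform]                 # text-to-speech
    normalize: Callable[[Waveform], Waveform]             # speaker normalization
    distance: Callable[[Waveform, Waveform, str], float]  # overall distance
    a: float = 1.0  # weight on the acoustic model score (assumed)
    b: float = 1.0  # weight on the language model score (assumed)

    def recognize(self, original: Waveform) -> str:
        """Return the N-best text with the highest combined score."""
        hypotheses = self.decode(original)
        normalized = self.normalize(original)

        def score(h: Hypothesis) -> float:
            d = self.distance(normalized, self.synthesize(h.text), h.text)
            # Linear combination borrowed from claim 3 (an assumption here).
            return -d + self.a * h.acoustic_score + self.b * h.language_score

        return max(hypotheses, key=score).text
```

Keeping each stage behind a callable makes it possible to exercise the comparator logic with toy stand-ins before plugging in a real decoder or synthesizer.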

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Eide, Ellen M.; Bakis, Raimo
Primary Examiner(s): Armstrong; Angela A
Application Number: US09/286,099
Time in Patent Office: 4,127 Days
Field of Search: 704/238, 704/247, 704/260
US Class Current: 704/247
CPC Class Codes:
G10L 13/02   Methods for producing synth...
G10L 15/08   Speech classification or se...
G10L 15/10   using distance or distortio...
