Minimum error rate training of combined string models

US 5,606,644 A
Filed: 04/26/1996
Issued: 02/25/1997
Est. Priority Date: 07/22/1993
Status: Expired due to Term

Abstract

A method of making a speech recognition model database is disclosed. The database is formed based on a training string utterance signal and a plurality of sets of current speech recognition models. The sets of current speech recognition models may include acoustic models, language models, and other knowledge sources. In accordance with an illustrative embodiment of the invention, a set of confusable string models is generated, each confusable string model comprising speech recognition models from two or more sets of speech recognition models (such as acoustic and language models). A first scoring signal is generated based on the training string utterance signal and a string model for that utterance, wherein the string model for the utterance comprises speech recognition models from two or more sets of speech recognition models. One or more second scoring signals are also generated, wherein a second scoring signal is based on the training string utterance signal and a confusable string model. A misrecognition signal is generated based on the first scoring signal and the one or more second scoring signals. Current speech recognition models are modified, based on the misrecognition signal, to increase the probability that a correct string model will have a rank order higher than other confusable string models.
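
For orientation, the misrecognition signal described in the abstract can be written in a form that is common in the minimum classification error literature. This is a sketch under assumptions, not a quotation from the specification: the symbols g, S_lex, S_k, Λ, η, γ and ε are introduced here for illustration and do not appear in the text above.

```latex
% Assumed notation (illustrative only):
%   X        training utterance signal
%   \Lambda  parameters of all model sets (acoustic, language, ...)
%   g(X, S; \Lambda)  combined log-score of string model S for X
%   S_{lex}  string model for the known training utterance
%   S_1, ..., S_N  confusable string models
\begin{align}
  d(X;\Lambda) &= -\,g(X, S_{\mathrm{lex}};\Lambda)
      + \frac{1}{\eta}\,\log\!\left[\frac{1}{N}\sum_{k=1}^{N}
        e^{\eta\, g(X, S_k;\Lambda)}\right] \\
  \ell(d) &= \frac{1}{1 + e^{-\gamma d}} \\
  \Lambda_{t+1} &= \Lambda_t - \epsilon\,\nabla_{\Lambda}\,
      \ell\bigl(d(X;\Lambda_t)\bigr)
\end{align}
```

Under this reading, the first term of d plays the role of the first scoring signal, the bracketed term combines the second scoring signals, and the update lowers d so that the correct string model tends to outrank the confusable ones.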

21 Claims

1. A method of making a speech recognition model database based on a training speech utterance signal and a plurality of sets of current speech recognition models, the method comprising the steps of:
- a. generating a set of one or more confusable string models, a confusable string model comprising a plurality of said current speech recognition models from two or more of said sets of said current speech recognition models, wherein each of at least two of said sets of models corresponds to different characteristics of speech and wherein the confusable string model is a model which, if selected to represent the training speech utterance, would result in a misrecognition of the training speech utterance;
  
  b. generating a first scoring signal based on the training speech utterance signal and a string model for that utterance, the string model for the utterance comprising a plurality of said current speech recognition models from said two or more of said sets of said current speech recognition models;
  
  c. generating one or more second scoring signals, a second scoring signal based on the training speech utterance signal and a confusable string model;
  
  d. generating a signal reflecting a comparison of a likelihood of correctly recognizing the training speech utterance and a likelihood of incorrectly recognizing the training speech utterance, said likelihood of correctly recognizing the training speech utterance based on the first scoring signal and said likelihood of incorrectly recognizing the training speech utterance based on the one or more second scoring signals; and
  
  e. based on the signal reflecting the comparison of said likelihoods, modifying one or more of said current speech recognition models to increase the probability that a string model for the utterance will have a rank order higher than the confusable string models.
  - 2. The method of claim 1 wherein the step of generating the set of one or more confusable string models comprises generating N-best word string models.
  - 3. The method of claim 1 wherein the first scoring signal reflects a measure of similarity between the training speech utterance signal and the string model for that utterance.
  - 4. The method of claim 3 wherein the measure of similarity comprises a log-likelihood recognizer score.
  - 5. The method of claim 1 wherein the second scoring signal reflects a measure of similarity between the training speech utterance signal and one of said confusable string models.
  - 6. The method of claim 5 wherein the measure of similarity comprises a log-likelihood recognizer score.
  - 7. The method of claim 1 wherein the step of generating a signal reflecting a comparison comprises forming a difference of the first scoring signal and a combination of the one or more second scoring signals.
  - 8. The method of claim 1 wherein the step of modifying the one or more of said current speech recognition models comprises the steps of:
    - 1. generating a recognition model modification signal reflecting a gradient of a function, the function reflecting a recognizer score of a training speech utterance based on a string model for that utterance and one or more recognizer scores of the training speech utterance based on the one or more confusable string models; and
      
      2. adapting the one or more of said current speech recognition models based on the modification signal.
  - 9. The method of claim 8 wherein the function reflects a difference of a recognizer score of a training speech utterance based on a string model for that utterance and a weighted sum of the one or more recognizer scores of the training speech utterance based on one or more confusable string models.
  - 10. The method of claim 1 wherein one of said sets of current speech recognition models comprises acoustic models.
  - 11. The method of claim 10 wherein the acoustic models comprise hidden Markov models.
  - 12. The method of claim 1 wherein one of said sets of current speech recognition models comprises language models.
  - 13. The method of claim 1 wherein one of said sets of current speech recognition models comprises pitch models.
  - 14. The method of claim 1 wherein one of said sets of current speech recognition models comprises energy models.
  - 15. The method of claim 1 wherein one of said sets of current speech recognition models comprises speaking rate models.
  - 16. The method of claim 1 wherein one of said sets of current speech recognition models comprises duration models.
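
The selection and comparison steps in claims 2 through 7 can be sketched in a few lines. This is a minimal illustration under assumptions: the function names, the smoothing constants `eta` and `gamma`, and the toy scores are hypothetical, and the log-mean-exp combination is one common choice rather than something the claims prescribe.

```python
import math

def select_confusable(nbest, correct, n):
    """Keep the n best-scoring hypotheses that differ from the correct string.

    nbest is a list of (string, log_score) pairs (hypothetical format).
    """
    ranked = sorted(nbest, key=lambda h: h[1], reverse=True)
    return [h for h in ranked if h[0] != correct][:n]

def misrecognition_measure(correct_score, confusable_scores, eta=1.0):
    """Difference between a smooth combination of the confusable scores
    and the score of the correct string. Larger values mean the
    confusable strings score well relative to the correct one."""
    m = max(confusable_scores)
    # log of the mean of exp(eta * score), shifted by m for numerical stability
    log_mean = eta * m + math.log(
        sum(math.exp(eta * (s - m)) for s in confusable_scores)
        / len(confusable_scores)
    )
    return -correct_score + log_mean / eta

def smoothed_loss(d, gamma=1.0):
    """Sigmoid of the misrecognition measure: a differentiable error count."""
    return 1.0 / (1.0 + math.exp(-gamma * d))

# Toy N-best list with made-up log-likelihood recognizer scores.
nbest = [("call home", -10.0), ("call hone", -11.0),
         ("fall home", -12.5), ("call comb", -14.0)]
confusable = select_confusable(nbest, "call home", n=2)
d = misrecognition_measure(-10.0, [score for _, score in confusable])
print(confusable, d, smoothed_loss(d))
```

Because the loss is a smooth function of the scores, it can be differentiated with respect to model parameters, which is what makes a gradient-based update possible.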

17. A speech recognizer trainer for making a speech recognition model database based on a training speech utterance signal and a plurality of sets of current speech recognition models, the trainer comprising:
- a. means for generating a set of one or more confusable string models, a confusable string model comprising a plurality of said current speech recognition models from two or more of said sets of said current speech recognition models, wherein each of at least two of said sets of models corresponds to different characteristics of speech and wherein the confusable string model is a model which, if selected to represent the training speech utterance, would result in a misrecognition of the training speech utterance;
- b. means for generating a first scoring signal based on the speech utterance signal and a string model for that utterance, the string model for the utterance comprising a plurality of said current speech recognition models from said two or more of said sets of said current speech recognition models;
  
  c. means for generating one or more second scoring signals, a second scoring signal based on the training speech utterance signal and a confusable string model;
  
  d. means for generating a signal reflecting a comparison of a likelihood of correctly recognizing the training speech utterance and a likelihood of incorrectly recognizing the training speech utterance, said likelihood of correctly recognizing the training speech utterance based on the first scoring signal and said likelihood of incorrectly recognizing the training speech utterance based on the one or more second scoring signals; and
  
  e. means, responsive to the signal reflecting the comparison of said likelihoods, for modifying one or more of said current speech recognition models to increase the probability that a string model for the utterance will have a rank order higher than the confusable string models.
  - 18. The trainer of claim 17 wherein the means for generating a signal reflecting a comparison comprises means for forming a difference between the first scoring signal and a combination of the one or more second scoring signals.
  - 19. The trainer of claim 17 wherein the means for modifying the one or more of said current speech recognition models comprises:
    - 1. means for generating a recognition model modification signal reflecting a gradient of a function, the function reflecting a recognizer score of a training speech utterance based on a string model for that utterance and one or more recognizer scores of the training speech utterance based on the one or more confusable string models; and
      
      2. means for adapting the one or more of said current speech recognition models based on the modification signal.
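
The gradient-based adaptation recited in claims 8 and 19 can be illustrated with a deliberately small model. In this sketch the only trainable parameters are one weight per model set (for example acoustic and language), which is an assumption made to keep the example short; the claims cover modifying the recognition models themselves. All names, the learning rate `lr`, and the toy scores are hypothetical.

```python
import math

def combined_score(weights, feats):
    """Weighted sum of per-model-set log scores for one string model."""
    return sum(w * f for w, f in zip(weights, feats))

def mce_step(weights, correct_feats, confusable_feats,
             lr=0.1, eta=1.0, gamma=1.0):
    """One gradient-descent step on a sigmoid of the misrecognition measure.

    Returns (new_weights, loss_before_update).
    """
    g_correct = combined_score(weights, correct_feats)
    g_conf = [combined_score(weights, f) for f in confusable_feats]
    m = max(g_conf)
    exps = [math.exp(eta * (g - m)) for g in g_conf]
    total = sum(exps)
    # Soft assignment over confusable strings; the best-scoring one dominates.
    post = [e / total for e in exps]
    # d = -g_correct + (1/eta) * log(mean(exp(eta * g_k)))
    d = -g_correct + m + math.log(total / len(g_conf)) / eta
    loss = 1.0 / (1.0 + math.exp(-gamma * d))
    dloss_dd = gamma * loss * (1.0 - loss)
    new_weights = []
    for j, w in enumerate(weights):
        # dd/dw_j = -f_j(correct) + sum_k post_k * f_j(confusable k)
        dd_dw = -correct_feats[j] + sum(
            p * f[j] for p, f in zip(post, confusable_feats))
        new_weights.append(w - lr * dloss_dd * dd_dw)
    return new_weights, loss

# Made-up [acoustic, language] log scores for each string model.
correct = [-10.0, -3.0]
confusable = [[-9.0, -3.5], [-11.0, -4.0]]
w0 = [1.0, 1.0]
w1, loss0 = mce_step(w0, correct, confusable)
_, loss1 = mce_step(w1, correct, confusable)
print(w1, loss0, loss1)
```

Here the weight on the acoustic scores shrinks and the weight on the language scores grows, because the correct string is weaker acoustically but stronger under the language model than its closest competitor.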

20. A speech recognition system comprising:
- a. a feature extractor for receiving an unknown speech signal and generating feature signals characterizing the unknown speech signal;
  
  b. a memory for storing a plurality of sets of speech recognition models, one or more of the speech recognition models generated in accordance with a process of modifying parameters of predetermined speech recognition models to enhance the probability that a correct string model will have a rank order higher than one or more confusable string models, wherein each of at least two of said sets of speech recognition models corresponds to different characteristics of speech and wherein said confusable string models are models which, if selected to represent a training speech utterance, would result in a misrecognition of the training speech utterance, and wherein the modification of said parameters is based on a comparison of a likelihood of correctly recognizing a training speech utterance and a likelihood of incorrectly recognizing the training speech utterance, the likelihood of correctly recognizing the training speech utterance based on a first scoring signal and the likelihood of incorrectly recognizing the training speech utterance based on one or more second scoring signals, the first scoring signal generated based on the training speech utterance and a string model for that utterance, the string model for the utterance comprising a plurality of said speech recognition models from said at least two of said sets of said speech recognition models, and each of the second scoring signals generated based on the training speech utterance signal and one of said confusable string models; and
  
  c. a scoring processor, coupled to the feature extractor and the memory, for comparing a string model with features of the unknown speech signal, the string model comprising one or more of said speech recognition models from each of the plurality of said sets of speech recognition models, and for recognizing the unknown speech signal based on one of a plurality of string models which most closely compares with the features of the unknown speech signal.
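
The role of the scoring processor in claim 20 can be sketched as picking the candidate string whose combined score across all model sets is highest. This is an illustrative sketch only: `recognize`, the lookup-table scorers, and the scores are hypothetical stand-ins, and a real recognizer would search over string models rather than enumerate a fixed candidate list.

```python
def recognize(candidates, model_sets, features):
    """Return the candidate string with the highest combined log score.

    model_sets maps a model-set name to a function
    score(features, string) -> log score (hypothetical interface).
    """
    def total(string):
        return sum(score(features, string) for score in model_sets.values())
    return max(candidates, key=total)

# Toy scorers backed by lookup tables; they ignore the features argument.
acoustic_table = {"call home": -10.0, "call hone": -9.5}
language_table = {"call home": -2.0, "call hone": -6.0}
model_sets = {
    "acoustic": lambda feats, s: acoustic_table[s],
    "language": lambda feats, s: language_table[s],
}
candidates = ["call home", "call hone"]
features = None  # placeholder for the feature extractor output

best = recognize(candidates, model_sets, features)
acoustic_only = recognize(candidates, {"acoustic": model_sets["acoustic"]}, features)
print(best, acoustic_only)
```

In this toy case the acoustic scores alone favor the wrong string, while the combined score recovers the intended one, which is the motivation for training the model sets jointly at the string level.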

21. A method of making a speech recognition model database based on a training speech utterance signal, two or more sets of current speech recognition models, each of which sets representing a different characteristic of speech, and at least one confusable string model, said confusable string model comprising a plurality of said current speech recognition models from each of said sets of models and being a model which, if selected to represent the training speech utterance signal, would result in a misrecognition of said signal, the method comprising:
- a. generating a first scoring signal based on the training speech utterance signal and a string model for that utterance, the string model for the utterance comprising a plurality of said current speech recognition models from said two or more of said sets of current speech recognition models;
  
  b. generating one or more second scoring signals, a second scoring signal based on the training speech utterance signal and a confusable string model;
  
  c. generating a signal reflecting a comparison of a likelihood of correctly recognizing the training speech utterance and a likelihood of incorrectly recognizing the training speech utterance, said likelihood of correctly recognizing the training speech utterance based on the first scoring signal and said likelihood of incorrectly recognizing the training speech utterance based on the one or more second scoring signals; and
  
  d. based on the signal reflecting the comparison of said likelihoods, modifying one or more of said current speech recognition models to increase the probability that a string model for that utterance will have a rank order higher than the confusable string models.

Current Assignee: Lucent Technologies, Inc. (Nokia Corporation)
Original Assignee: Lucent Technologies, Inc. (Nokia Corporation)
Inventors: Chou, Wu; Juang, Biing-Hwang; Lee, Chin-Hui
Primary Examiner(s): MacDonald, Allen R.
Assistant Examiner(s): CHOWDHURY, INDRINAL
Application Number: US08/638,408
Time in Patent Office: 305 Days
Field of Search: 395/2.52, 395/2.53, 395/2.54, 395/2.55, 395/2.64, 395/2.65, 395/2.66
US Class Current: 704/243
CPC Class Codes: G10L 15/063 Training
