Context-dependent speech recognizer using estimated next word context

US 5,233,681 A
Filed: 04/24/1992
Issued: 08/03/1993
Est. Priority Date: 04/24/1992
Status: Expired due to Fees

Abstract

A speech recognition apparatus and method estimates the next word context for each current candidate word in a speech hypothesis. An initial model of each speech hypothesis comprises a model of a partial hypothesis of zero or more words followed by a model of a candidate word. An initial hypothesis score for each speech hypothesis comprises an estimate of the closeness of a match between the initial model of the speech hypothesis and a sequence of coded representations of the utterance. The speech hypotheses having the best initial hypothesis scores form an initial subset. For each speech hypothesis in the initial subset, the word which is most likely to follow the speech hypothesis is estimated. A revised model of each speech hypothesis in the initial subset comprises a model of the partial hypothesis followed by a revised model of the candidate word. The revised candidate word model is dependent at least on the word which is estimated to be most likely to follow the speech hypothesis. A revised hypothesis score for each speech hypothesis in the initial subset comprises an estimate of the closeness of a match between the revised model of the speech hypothesis and the sequence of coded representations of the utterance. The speech hypotheses from the initial subset which have the best revised match scores are stored as a reduced subset. At least one word of one or more of the speech hypotheses in the reduced subset is output as a speech recognition result.
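
The two-stage pruning described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `two_pass_prune`, the callback parameters, and the convention that a higher score means a closer match (for example, a log-likelihood) are all assumptions made for this example.

```python
from typing import Callable, List, Sequence, Tuple

# A hypothesis is a tuple of words: the partial hypothesis followed by
# the candidate word in the last position (representation assumed here).
Hypothesis = Tuple[str, ...]


def two_pass_prune(
    hypotheses: Sequence[Hypothesis],
    initial_score: Callable[[Hypothesis], float],
    estimate_next_word: Callable[[Hypothesis], str],
    revised_score: Callable[[Hypothesis, str], float],
    initial_keep: int,
    reduced_keep: int,
) -> List[Hypothesis]:
    """Return the reduced subset of hypotheses after both scoring passes."""
    # Pass 1: score every hypothesis with its initial model and keep
    # the best `initial_keep` of them (the "initial subset").
    ranked = sorted(hypotheses, key=initial_score, reverse=True)
    initial_subset = ranked[:initial_keep]

    # Pass 2: for each survivor, estimate the word likely to follow it,
    # then rescore with a model that depends on that estimated word.
    rescored = [
        (revised_score(h, estimate_next_word(h)), h) for h in initial_subset
    ]
    rescored.sort(key=lambda pair: pair[0], reverse=True)

    # Keep the best `reduced_keep` by revised score (the "reduced subset").
    return [h for _, h in rescored[:reduced_keep]]


# Toy data (hypothetical scores, chosen only to show a rank change).
initial = {("the",): -1.0, ("thee",): -1.2, ("a",): -5.0}
revised = {(("the",), "end"): -1.5, (("thee",), "end"): -1.1}

best = two_pass_prune(
    list(initial),
    initial_score=initial.get,
    estimate_next_word=lambda h: "end",
    revised_score=lambda h, w: revised[(h, w)],
    initial_keep=2,
    reduced_keep=1,
)
```

In this toy run the hypothesis ranked second by the initial score overtakes the first after rescoring, which is the situation the revised, context-dependent score is meant to catch.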

31 Claims

1. A speech recognition apparatus comprising:
- means for generating a set of two or more speech hypotheses, each speech hypothesis comprising a partial hypothesis of zero or more words followed by a candidate word selected from a vocabulary of candidate words;
  
  means for storing a set of word models, each word model representing one or more possible coded representations of an utterance of a word;
  
  means for generating an initial model of each speech hypothesis, each initial model comprising a model of the partial hypothesis followed by a model of the candidate word;
  
  an acoustic processor for generating a sequence of coded representations of an utterance to be recognized;
  
  means for generating an initial hypothesis score for each speech hypothesis, each initial hypothesis score comprising an estimate of the closeness of a match between the initial model of the speech hypothesis and the sequence of coded representations of the utterance;
  
  means for storing an initial subset of one or more speech hypotheses, from the set of speech hypotheses, having the best initial hypothesis scores;
  
  next context estimating means for estimating, for each speech hypothesis in the initial subset, a likely word, from the vocabulary of words, which is likely to follow the speech hypothesis;
  
  means for generating a revised model of each speech hypothesis in the initial subset, each revised model comprising a model of the partial hypothesis followed by a revised model of the candidate word, the revised candidate word model being dependent at least on the word which is estimated to be likely to follow the speech hypothesis;
  
  means for generating a revised hypothesis score for each speech hypothesis in the initial subset, each revised hypothesis score comprising an estimate of the closeness of a match between the revised model of the speech hypothesis and the sequence of coded representations of the utterance;
  
  means for storing a reduced subset of one or more speech hypotheses, from the initial subset of speech hypotheses, having the best revised match scores; and
  
  means for outputting at least one word of one or more of the speech hypotheses in the reduced subset.
- - 2. A speech recognition apparatus as claimed in claim 1, characterized in that the revised model of each speech hypothesis in the initial subset does not include a model of the word which is estimated to be likely to follow the speech hypothesis.
  - 3. A speech recognition apparatus as claimed in claim 2, characterized in that the acoustic processor comprises:
    - means for measuring the value of at least one feature of an utterance over each of a series of successive time intervals to produce a series of feature vector signals representing the feature values;
      
      means for storing a plurality of prototype vector signals, each prototype vector signal having at least one parameter value and having a unique identification value;
      
      means for comparing the closeness of the feature value of a first feature vector signal to the parameter values of the prototype vector signals to obtain prototype match scores for the first feature vector signal and each prototype vector signal;
      
      ranking means for associating a first-rank score with the prototype vector signal having the best prototype match score, and for associating a second-rank score with the prototype vector signal having the second best prototype match score; and
      
      means for outputting at least the identification value and the rank score of the first-ranked prototype vector signal, and the identification value and the rank score of the second-ranked prototype vector signal, as a coded utterance representation signal of the first feature vector signal.
  - 4. A speech recognition apparatus as claimed in claim 3, characterized in that the partial hypothesis comprises a series of words, and the partial hypothesis model comprises a series of word models, each word model representing a corresponding word in the partial hypothesis.
  - 5. A speech recognition apparatus as claimed in claim 4, characterized in that each hypothesis score comprises an estimate of the probability of occurrence of each word in the hypothesis.
  - 6. A speech recognition apparatus as claimed in claim 5, characterized in that the next context estimating means further comprises means for generating a next context score for each next context candidate word in the vocabulary of candidate words, each next context score comprising an estimate of the closeness of a match between a model of the next context candidate word and a portion of the sequence of coded representations of the utterance.
  - 7. A speech recognition apparatus as claimed in claim 5, characterized in that the next context estimating means further comprises:
    - means for identifying, for each speech hypothesis, a first portion of the sequence of coded representations of the utterance which is most likely to correspond to the speech hypothesis, and a second portion of the sequence of coded representations of the utterance which follows the first portion; and
      
      means for generating a next context score for each next context candidate word in the vocabulary of candidate words, each next context score comprising an estimate of the closeness of a match between a model of the next context candidate word and the second portion of the sequence of coded representations of the utterance.
  - 8. A speech recognition apparatus as claimed in claim 5, characterized in that the next context estimating means estimates the probability of occurrence of the next context candidate word.
  - 9. A speech recognition apparatus as claimed in claim 8, characterized in that the next context estimating means estimates the conditional probability of occurrence of the next context candidate word given the occurrence of at least one word in the speech hypothesis.
  - 10. A speech recognition apparatus as claimed in claim 8, characterized in that the next context estimating means estimates the probability of occurrence of the next context candidate word independent of the speech hypothesis.
  - 11. A speech recognition apparatus as claimed in claim 5, characterized in that the next context estimating means estimates, for each speech hypothesis in the initial subset, the most likely word, from the vocabulary of words, which is most likely to follow the speech hypothesis.
  - 12. A speech recognition apparatus as claimed in claim 5, characterized in that the means for storing hypotheses, and the means for storing word models comprise electronic read/write memory.
  - 13. A speech recognition apparatus as claimed in claim 5, characterized in that the measuring means comprises a microphone.
  - 14. A speech recognition apparatus as claimed in claim 5, characterized in that the word output means comprises a video display.
  - 15. A speech recognition apparatus as claimed in claim 14, characterized in that the video display comprises a cathode ray tube.
  - 16. A speech recognition apparatus as claimed in claim 14, characterized in that the video display comprises a liquid crystal display.
  - 17. A speech recognition apparatus as claimed in claim 14, characterized in that the video display comprises a printer.
  - 18. A speech recognition apparatus as claimed in claim 5, characterized in that the word output means comprises an audio generator.
  - 19. A speech recognition apparatus as claimed in claim 18, characterized in that the audio generator comprises a loudspeaker.
  - 20. A speech recognition apparatus as claimed in claim 18, characterized in that the audio generator comprises a headphone.
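
The acoustic processor of claim 3 labels each feature vector with the identifiers and rank scores of its two best-matching prototype vectors. A minimal sketch of that labeling step follows. The claim does not fix a closeness measure or the form of the rank scores, so the squared Euclidean distance, the rank scores 1 and 2, and the names `code_feature_vector` and `code_utterance` are assumptions made for this example.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def code_feature_vector(
    feature: Vector, prototypes: Dict[str, Vector]
) -> List[Tuple[str, int]]:
    """Return (prototype id, rank score) for the two closest prototypes."""

    def squared_distance(a: Vector, b: Vector) -> float:
        # Assumed match score: smaller distance means a better match.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Rank all prototype ids from best match to worst match.
    ranked = sorted(
        prototypes, key=lambda pid: squared_distance(feature, prototypes[pid])
    )
    # Rank score 1 goes to the best match, 2 to the second best.
    return [(pid, rank) for rank, pid in enumerate(ranked[:2], start=1)]


def code_utterance(
    features: Sequence[Vector], prototypes: Dict[str, Vector]
) -> List[List[Tuple[str, int]]]:
    """Code each feature vector of an utterance, one per time interval."""
    return [code_feature_vector(f, prototypes) for f in features]


# Hypothetical two-dimensional prototypes and two feature vectors.
prototypes = {"P1": (0.0, 0.0), "P2": (1.0, 1.0), "P3": (5.0, 5.0)}
coded = code_utterance([(0.9, 1.2), (4.0, 4.5)], prototypes)
```

Emitting the second-ranked prototype alongside the first keeps some information about near misses that a single-label code would discard.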

21. A speech recognition method comprising:
- generating a set of two or more speech hypotheses, each speech hypothesis comprising a partial hypothesis of zero or more words followed by a candidate word selected from a vocabulary of candidate words;
  
  storing a set of word models, each word model representing one or more possible coded representations of an utterance of a word;
  
  generating an initial model of each speech hypothesis, each initial model comprising a model of the partial hypothesis followed by a model of the candidate word;
  
  generating a sequence of coded representations of an utterance to be recognized;
  
  generating an initial hypothesis score for each speech hypothesis, each initial hypothesis score comprising an estimate of the closeness of a match between the initial model of the speech hypothesis and the sequence of coded representations of the utterance;
  
  storing an initial subset of one or more speech hypotheses, from the set of speech hypotheses, having the best initial hypothesis scores;
  
  estimating, for each speech hypothesis in the initial subset, a likely word, from the vocabulary of words, which is likely to follow the speech hypothesis;
  
  generating a revised model of each speech hypothesis in the initial subset, each revised model comprising a model of the partial hypothesis followed by a revised model of the candidate word, the revised candidate word model being dependent at least on the word which is estimated to be likely to follow the speech hypothesis;
  
  generating a revised hypothesis score for each speech hypothesis in the initial subset, each revised hypothesis score comprising an estimate of the closeness of a match between the revised model of the speech hypothesis and the sequence of coded representations of the utterance;
  
  storing a reduced subset of one or more speech hypotheses, from the initial subset of speech hypotheses, having the best revised match scores; and
  
  outputting at least one word of one or more of the speech hypotheses in the reduced subset.
- - 22. A speech recognition method as claimed in claim 21, characterized in that the revised model of each speech hypothesis in the initial subset does not include a model of the word which is estimated to be likely to follow the speech hypothesis.
  - 23. A speech recognition method as claimed in claim 22, characterized in that the step of generating a sequence of coded representations of an utterance comprises:
    - measuring the value of at least one feature of an utterance over each of a series of successive time intervals to produce a series of feature vector signals representing the feature values;
      
      storing a plurality of prototype vector signals, each prototype vector signal having at least one parameter value and having a unique identification value;
      
      comparing the closeness of the feature value of a first feature vector signal to the parameter values of the prototype vector signals to obtain prototype match scores for the first feature vector signal and each prototype vector signal;
      
      associating a first-rank score with the prototype vector signal having the best prototype match score, and for associating a second-rank score with the prototype vector signal having the second best prototype match score; and
      
      outputting at least the identification value and the rank score of the first-ranked prototype vector signal, and the identification value and the rank score of the second-ranked prototype vector signal, as a coded utterance representation signal of the first feature vector signal.
  - 24. A speech recognition method as claimed in claim 23, characterized in that the partial hypothesis comprises a series of words, and the partial hypothesis model comprises a series of word models, each word model representing a corresponding word in the partial hypothesis.
  - 25. A speech recognition method as claimed in claim 24, characterized in that each hypothesis score comprises an estimate of the probability of occurrence of each word in the hypothesis.
  - 26. A speech recognition method as claimed in claim 25, characterized in that the step of estimating the word which is likely to follow the speech hypothesis comprises generating a next context score for each next context candidate word in the vocabulary of candidate words, each next context score comprising an estimate of the closeness of a match between a model of the next context candidate word and a portion of the sequence of coded representations of the utterance.
  - 27. A speech recognition method as claimed in claim 25, characterized in that the step of estimating the word which is likely to follow the speech hypothesis comprises:
    - identifying, for each speech hypothesis, a first portion of the sequence of coded representations of the utterance which is most likely to correspond to the speech hypothesis, and a second portion of the sequence of coded representations of the utterance which follows the first portion; and
      
      generating a next context score for each next context candidate word in the vocabulary of candidate words, each next context score comprising an estimate of the closeness of a match between a model of the next context candidate word and the second portion of the sequence of coded representations of the utterance.
  - 28. A speech recognition method as claimed in claim 25, characterized in that the step of estimating the word which is likely to follow the speech hypothesis comprises estimating the probability of occurrence of the next context candidate word.
  - 29. A speech recognition apparatus as claimed in claim 28, characterized in that the step of estimating the word which is likely to follow the speech hypothesis comprises estimating the conditional probability of occurrence of the next context candidate word given the occurrence of at least one word in the speech hypothesis.
  - 30. A speech recognition apparatus as claimed in claim 28, characterized in that the step of estimating the word which is likely to follow the speech hypothesis comprises estimating the probability of occurrence of the next context candidate word independent of the speech hypothesis.
  - 31. A speech recognition apparatus as claimed in claim 25, characterized in that the step of estimating the word which is likely to follow the speech hypothesis comprises estimating the most likely word, from the vocabulary of words, which is most likely to follow the speech hypothesis.
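
Claims 26 to 31 describe several ways to estimate the following word: an acoustic match against the portion of the utterance after the hypothesis, a word probability (conditional on the hypothesis or independent of it), and selection of the most likely word. The sketch below adds an acoustic log score to a log bigram probability purely for illustration; the claims list these as separate options and do not state that they are summed. The name `estimate_next_word`, both callback parameters, and the probability floor are assumptions made for this example.

```python
import math
from typing import Callable, Iterable


def estimate_next_word(
    last_word: str,
    vocabulary: Iterable[str],
    acoustic_log_score: Callable[[str], float],
    bigram_prob: Callable[[str, str], float],
    floor: float = 1e-12,
) -> str:
    """Pick the vocabulary word with the best combined next-context score."""

    def combined(word: str) -> float:
        # Acoustic term: match between the word's model and the portion of
        # the coded utterance that follows the hypothesis (assumed given).
        # Language term: P(word | last_word), floored to avoid log(0).
        p = max(bigram_prob(last_word, word), floor)
        return acoustic_log_score(word) + math.log(p)

    return max(vocabulary, key=combined)


# Hypothetical scores for three next-context candidate words.
acoustic = {"cat": -2.0, "cap": -1.8, "dog": -9.0}
bigram = {("the", "cat"): 0.5, ("the", "cap"): 0.1, ("the", "dog"): 0.4}

next_word = estimate_next_word(
    "the",
    acoustic,
    acoustic_log_score=acoustic.get,
    bigram_prob=lambda prev, w: bigram.get((prev, w), 0.0),
)
```

With these toy numbers the acoustic score alone would favor "cap", while the combined score selects "cat", showing how a word probability can change the estimated next-word context.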

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Picheny, Michael A.; Gopalakrishnan, Ponani S.; Bahl, Lalit R.; De Souza, Peter V.
Primary Examiner(s): Fleming, Michael R.
Assistant Examiner(s): Hafiz, Tariq R.
Application Number: US07/874,271
Time in Patent Office: 466 Days
Field of Search: 381/41, 381/43, 381/51, 395/2
US Class Current: 704/251
CPC Class Codes: G10L 15/19 Grammatical context, e.g. d...; G10L 15/193 Formal grammars, e.g. finit...
