Method for speech recognition

US 4,805,219 A
Filed: 04/03/1987
Issued: 02/14/1989
Est. Priority Date: 04/03/1987
Status: Expired due to Fees

Abstract

A method determines if a portion of speech corresponds to a speech pattern by time aligning both the speech and a plurality of speech pattern models against a common time-aligning model. This compensates for speech variation between the speech and the pattern models. The method then compares the resulting time-aligned speech model against the resulting time-aligned pattern models to determine which of the patterns most probably corresponds to the speech. Preferably there are a plurality of time-aligning models, each representing a group of somewhat similar sound sequences which occur in different words. Each of these time-aligning models is scored for similarity against a portion of speech, and the time-aligned speech model and time-aligned pattern models produced by time alignment with the best scoring time-aligning model are compared to determine the likelihood that each speech pattern corresponds to the portion of speech. This is performed for each successive portion of speech. When a portion of speech appears to correspond to a given speech pattern model, a range of likely start times is calculated for the vocabulary word associated with that model, and a word score is calculated to indicate the likelihood of that word starting in that range. The method uses a more computationally intensive comparison between the speech and selected vocabulary words, so as to more accurately determine which words correspond with which portions of the speech. When this more intensive comparison indicates the ending of a word at a given point in the speech, the method selects the best scoring vocabulary words whose range of start times overlaps that ending time, and performs the computationally intensive comparison on those selected words starting at that point in the speech.
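The two-stage idea in the abstract (a cheap comparison of time-aligned models, followed by an intensive comparison on the survivors) can be sketched as below. This is a minimal illustration, not the patented implementation: the names `time_align`, `aligned_model`, and `prefilter` are hypothetical, each sub-model is assumed to be a plain mean vector, and squared Euclidean distance stands in for whatever scoring the specification actually uses.

```python
from typing import Dict, List, Sequence

Frame = Sequence[float]  # one acoustic description (M parameter values)

def sq_dist(a: Frame, b: Frame) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))

def time_align(frames: List[Frame], submodels: List[Frame]) -> List[int]:
    """Dynamic programming: split frames into len(submodels) contiguous,
    non-empty segments. Returns the exclusive end index of each segment."""
    T, N = len(frames), len(submodels)
    if T < N:
        raise ValueError("need at least one frame per sub-model")
    INF = float("inf")
    cost = [[INF] * T for _ in range(N)]   # cost[n][t]: frame t assigned to sub-model n
    stay = [[True] * T for _ in range(N)]  # True if frame t-1 was in the same sub-model
    cost[0][0] = sq_dist(frames[0], submodels[0])
    for t in range(1, T):
        for n in range(min(N, t + 1)):
            same = cost[n][t - 1]
            prev = cost[n - 1][t - 1] if n > 0 else INF
            stay[n][t] = same <= prev
            cost[n][t] = min(same, prev) + sq_dist(frames[t], submodels[n])
    # Trace back from the last frame in the last sub-model.
    ends = [0] * N
    n = N - 1
    ends[n] = T
    for t in range(T - 1, 0, -1):
        if not stay[n][t]:
            n -= 1
            ends[n] = t
    return ends

def aligned_model(frames: List[Frame], ends: List[int]) -> List[List[float]]:
    """Derive one sub-model (here: a mean vector) per aligned segment."""
    model, start = [], 0
    for end in ends:
        seg = frames[start:end]
        model.append([sum(col) / len(seg) for col in zip(*seg)])
        start = end
    return model

def prefilter(speech_model: List[List[float]],
              pattern_models: Dict[str, List[List[float]]],
              keep: int) -> List[str]:
    """Score every pattern sub-model by sub-model; return the `keep` closest
    patterns, which would then go on to the more intensive comparison."""
    scores = {
        name: sum(sq_dist(s, p) for s, p in zip(speech_model, pm))
        for name, pm in pattern_models.items()
    }
    return sorted(scores, key=scores.get)[:keep]
```

Because the speech and every pattern are reduced to the same number of sub-models by the same time-aligning model, the prefilter is a fixed-length comparison and needs no further alignment.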

20 Claims

1. A method of determining the probability that a given portion of speech to be recognized corresponds to a speech pattern, representing a common sound sequence occurring in one or more words, the method comprising:
- time aligning a series of acoustic descriptions representing the speech to be recognized against a time-aligning model comprised of a series of acoustic sub-models;
  
  deriving a time-aligned speech model having a series of acoustic sub-models, each of which is derived from the acoustic speech descriptions time aligned against a corresponding sub-model of the time-aligning model;
  
  providing time-aligned acoustic models of each of a first class of speech patterns, each of which time-aligned pattern models is derived by:
  
  time aligning a series of acoustic descriptions from one or more utterances of that speech pattern against the time-aligning model, and deriving, for that pattern model, a series of acoustic sub-models, each of which is derived from the acoustic descriptions from those one or more utterances time aligned against a corresponding sub-model of the time-aligning model;
  
  comparing the time-aligned speech model against each of a plurality of the time-aligned pattern models so as to produce a score for each such comparison as a function of how closely each sub-model of the speech model compares to its corresponding sub-model of a given pattern model;
  
  selecting which speech patterns warrant a more computationally intensive comparison against the speech to be recognized in response to the scores produced for the comparisons between the speech model and the pattern models; and
  
  performing that more computationally intensive comparison for the selected speech patterns in order to determine which of the selected speech patterns most probably corresponds to said speech to be recognized.
- - 2. A method as described in claim 1, which further includes:
    - selecting which speech patterns warrant a computationally intensive comparison against the speech in response to the scores produced for the comparisons between the speech model and the pattern models; and
      
      performing that more computationally intensive comparison for the selected speech patterns.
  - 3. A method as described in claim 2, wherein:
    - the first class of speech patterns, those for which time-aligned pattern models are provided, represent sound sequences contained in individual vocabulary words which are shorter in length than most of the vocabulary words of which they are part; and
      
      the selected speech patterns for which the more computationally intensive comparison is performed represent entire vocabulary words.
  - 4. A method as described in claim 1, wherein there are a plurality of time-aligning models, each with a series of sub-models;
    - the time aligning of the acoustic speech descriptions is performed against each of a plurality of the time-aligning models; and
      
      the providing of time-aligned pattern models includes providing, for a given speech pattern, a plurality of time-aligned pattern models which are derived by time aligning acoustic descriptions of one or more utterances of that speech pattern against each of a plurality of time-aligning models.
  - 5. A method as described in claim 4, further including providing time-aligning models which have been prepared by the following process:
    - using dynamic programming to divide the series of acoustic descriptions of one or more utterances of each of a plurality of speech sequences into series of corresponding segments;
      
      deriving a model of each such speech sequence, which speech sequence pattern model has a series of sub-models, each of which is derived from the acoustic descriptions included in a group of corresponding segments produced by the dynamic programming from the one or more utterances of that speech sequence;
      
      dividing the speech sequence models into clusters of similar models;
      
      deriving an acoustic cluster model for each such cluster, which model has a series of sub-models corresponding to the series of sub-models of the speech sequence models placed within its corresponding cluster; and
      
      using the resulting cluster models as time-aligning models.
  - 6. A method as described in claim 5, wherein:
    - the using of dynamic programming in the preparation of time-aligning models includes using dynamic programming to segment the acoustic descriptions of one or more utterances of each of a plurality of vocabulary words into series of corresponding segments;
      
      each speech sequence model is derived from N sequential groups of corresponding segments produced by the dynamic programming of a given vocabulary word, and each such speech sequence model has N sub-models, each derived from the acoustic descriptions associated with one of those N groups of sequential segments;
      
      the cluster models each have N sub-models, each of which is derived from a corresponding one of the N sub-models of each of the speech sequence models grouped into the cluster model's associated cluster; and
      
      the time-aligned pattern models are produced by time aligning portions of individual vocabulary words against individual time-aligning models to produce time-aligned pattern models each having N sub-models.
  - 7. A method as described in claim 6, wherein some vocabulary words have different time-aligned pattern models associated with their successive parts.
  - 8. A method as described in claim 6, wherein:
    - each of the acoustic descriptions in the series of acoustic descriptions of the speech to be recognized and of the utterance of the speech patterns contains M parameter values;
      
      each of the N sub-models of each time-aligning model has an M dimensional probability distribution, each dimension of which corresponds to one of the parameter values of the acoustic descriptions;
      
      each of the time-aligned pattern models comprises an N x M dimensional probability distribution, each dimension of which is derived from the values for a given one of the M parameters of the acoustic descriptions from one or more utterances of the model's speech pattern which have been time aligned against a given one of the N sub-models of a time-aligning model;
      
      each of the time-aligned speech models comprises a corresponding N x M dimensional vector of parameter values, each dimension of which represents the values of a given one of the M parameters over that portion of the speech descriptions time aligned against a given one of the N sub-models of a time-aligning model; and
      
      the score produced for the comparison of a time-aligned speech model and a time-aligned pattern model which have been time aligned against the same time-aligning model is a likelihood score representing the probability of the parameter vectors of the speech model being generated by the probability distribution of the pattern model.
  - 9. A method as described in claim 4, wherein:
    - the time aligning of the speech descriptions against each of a plurality of time-aligning models includes:
      
      deriving a score indicating the closeness of that speech to each such time-aligning model; and
      
      selecting the time-aligning model which has the best, or closest, score against the acoustic description of speech; and
      
      the comparing of the time-aligned speech model against a plurality of time-aligned pattern models is performed for the speech model and the pattern models produced by time alignment against the same best scoring time-aligning model.
  - 10. A method as described in claim 9, wherein:
    - the time aligning of the speech descriptions against each of a plurality of time-aligning models includes using dynamic programming to:
      
      compare each of those time-aligning models against the entire length of a series of acoustic descriptions representing a segment of speech longer than that associated with individual speech patterns of the first class;
      
      determine when a given time-aligning model matches a given portion of this series of speech descriptions with at least a given degree of closeness;
      
      produce a time-aligned speech model for the matching portion of speech against the matching time-aligning model; and
      
      produce a score indicating the closeness of the match between the matching portion of speech and the matching time-aligning model;
      
      the selecting of the time-aligning model which has the best score is done repeatedly so as to produce a series of such best scoring time-aligning models each associated with successive portions of the series of speech descriptions; and
      
      the comparing of the time-aligned speech model against the time-aligned pattern models is performed separately in association with each of the series of best scoring time-aligning models.
  - 11. A method as described in claim 10, wherein:
    - individual speech patterns of the first class represent speech sounds associated with individual vocabulary words;
      
      the method further includes calculating a word score for a given vocabulary word at a given location in the series of speech descriptions, which calculation includes:
      
      basing the calculation on one or more of the series of best scoring time-aligning models;
      
      for each such best scoring model, taking the speech model produced by time alignment with it and a pattern model associated with the vocabulary word produced by time alignment with it, and calculating a partial score based on the closeness of the match between that speech model and that pattern model; and
      
      combining the resulting partial scores calculated for each such best scoring model to produce the word score.
  - 12. A method as described in claim 11, wherein the number of such best scoring time-aligning models for which partial scores are derived and used to calculate the word score for a given word is determined by the length of speech, in previous utterances of the word, between the beginning of the given word and sounds corresponding to the last of the best scoring time-aligning models used to calculate the word score.
  - 13. A method as described in claim 11, wherein the word scores produced for various locations along the series of speech descriptions are used to select which words warrant a more computationally intensive comparison against those locations.
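The likelihood score of claim 8 and the word score of claim 11 can be sketched as follows. The claims only require "a probability distribution" and a way of "combining" partial scores; this sketch assumes independent Gaussian dimensions (one mean and one variance per N x M entry) and assumes that partial scores are log-likelihoods combined by summation. The function names are hypothetical.

```python
import math
from typing import List, Sequence

def log_likelihood(speech: Sequence[float],
                   means: Sequence[float],
                   variances: Sequence[float]) -> float:
    """Log-probability of an N*M-dimensional speech vector under a pattern
    model with independent Gaussian dimensions (an assumption of this sketch)."""
    total = 0.0
    for x, mu, var in zip(speech, means, variances):
        total += -0.5 * (math.log(2.0 * math.pi * var) + (x - mu) ** 2 / var)
    return total

def word_score(partial_scores: List[float]) -> float:
    """Combine the partial scores from successive best scoring time-aligning
    models. Summing log-likelihoods is an assumed combination rule."""
    return sum(partial_scores)
```

Higher (less negative) values indicate a closer match, so a threshold or a top-k cut on `word_score` could decide which words warrant the more intensive comparison of claim 13.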

14. A method of recognizing continuous speech comprising:
- providing a plurality of acoustic cluster models, each of which includes a series of acoustic sub-models, said cluster models being derived by using dynamic programming to divide one or more utterances of each of a plurality of vocabulary words into series of corresponding segments;
  
  deriving a series of sub-models for each such word, with each sub-model representing a group of corresponding segments from the one or more utterances of that word;
  
  dividing the series of sub-models from the different words into clusters of relatively similar series; and
  
  calculating a model for each such cluster which reflects the series of sub-models which have been grouped into that cluster;
  
  performing a comparison between a portion of the speech to be recognized and each of a plurality of the cluster models, selecting the one or more cluster models against which the portion of speech compares most closely; and
  
  performing further comparison between that portion of speech and the words whose series of sub-models have been associated with the one or more selected cluster models against which that portion of speech compares most closely.
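The clustering and lookup steps of claim 14 can be sketched as below. The claim does not name a clustering algorithm, so this example assumes a simple greedy scheme with a hypothetical distance `threshold`, and it assumes the portion of speech has already been reduced to the same N sub-models as the word models. All names are illustrative.

```python
from typing import Dict, List

Model = List[List[float]]  # N sub-models, each a list of M parameter means

def model_dist(a: Model, b: Model) -> float:
    """Sum of squared differences over corresponding sub-models."""
    return sum((x - y) ** 2 for sa, sb in zip(a, b) for x, y in zip(sa, sb))

def average(models: List[Model]) -> Model:
    """Cluster model: element-wise mean of the member models."""
    n = len(models)
    return [[sum(vals) / n for vals in zip(*subs)] for subs in zip(*models)]

def build_clusters(word_models: Dict[str, Model], threshold: float) -> List[dict]:
    """Greedy clustering: join the nearest cluster within `threshold`,
    otherwise start a new cluster."""
    clusters: List[dict] = []
    for word, model in word_models.items():
        near = [c for c in clusters if model_dist(model, c["model"]) <= threshold]
        if near:
            c = min(near, key=lambda c: model_dist(model, c["model"]))
            c["words"].append(word)
            c["members"].append(model)
            c["model"] = average(c["members"])
        else:
            clusters.append({"words": [word], "members": [model], "model": model})
    return clusters

def candidate_words(speech: Model, clusters: List[dict], keep: int = 1) -> List[str]:
    """Words associated with the `keep` closest cluster models; these would
    receive the further comparison."""
    ranked = sorted(clusters, key=lambda c: model_dist(speech, c["model"]))
    return [w for c in ranked[:keep] for w in c["words"]]
```

Comparing the speech against a small set of cluster models first means the further comparison runs only on the words grouped under the best matching clusters rather than on the whole vocabulary.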

15. A method of recognizing continuous speech comprising:
- scanning a temporal acoustic representation of the speech for the occurrence of acoustic patterns, each of which occurs in one or more individual vocabulary words;
  
  when such patterns are detected in the speech representation, calculating a range of start times in the representation at which it is most likely each of the one or more vocabulary words associated with that pattern started;
  
  performing a computationally intensive comparison between a portion of the speech representation and each of a plurality of vocabulary words, including determining when there is a probability better than a given threshold that a match of a given vocabulary word against the speech representation has terminated at a given ending time in that representation, wherein said computationally intensive comparison between a given vocabulary word and a given portion of speech is more computationally intense than the scanning of that portion of speech for the occurrence of a given acoustic pattern; and
  
  using those vocabulary words which have a range of start times which overlaps the given ending time as words against which to perform said intensive comparisons, starting approximately at the given ending time.
- - 16. A method as described in claim 15, wherein:
    - the performing of said computationally intensive comparisons includes performing dynamic programming between the speech representation and each of the plurality of vocabulary words; and
      
      the using of those vocabulary words which have an overlapping range of start times includes seeding the dynamic programming of all those vocabulary words approximately at the given ending time.
  - 17. A method as described in claim 16, wherein:
    - the dynamic programming includes generating endscores at successive times in the speech representation, each of which indicates the probability that the match of a given vocabulary word against the speech representation ends at that endscore's time; and
      
      the determining of when there is a probability better than a given threshold that the intensive comparison between the speech representation and a vocabulary word has ended is performed by determining when the endscore for the given word is better than a given threshold.
  - 18. A method as in claim 17, wherein:
    - the initial score of the dynamic programming of the vocabulary words having an overlapping range of start times is calculated as a function of a combination of the endscores at the given ending time.
  - 19. A method as in claim 15, wherein:
    - the calculating of a range of start times includes calculating back from the time in the speech representation at which the pattern is detected by an amount derived by observations of the length of time, in previous utterances of the word, between the occurrence of that pattern and the beginning of the word.
  - 20. A method as described in claim 19, wherein:
    - the calculating of a range of start times further includes calculating the width of that range as a function of the amount of variation observed, in previous utterances of the word, between the occurrence of that pattern and the beginning of the word.
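The start-time range of claims 19 and 20, and the overlap test of claim 15, can be sketched as follows. The claims say only that the offset and width are derived from previous utterances; this sketch assumes the mean offset sets the center of the range and a multiple of the standard deviation sets its half-width. `width_factor` and both function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict, List, Sequence, Tuple

def start_range(detect_time: float,
                offsets: Sequence[float],
                width_factor: float = 2.0) -> Tuple[float, float]:
    """Estimate when a word started, given the time its pattern was detected.

    `offsets` are the observed delays, in previous utterances of the word,
    between the word's beginning and the occurrence of the pattern."""
    center = detect_time - mean(offsets)         # calculate back from the detection
    half = width_factor * pstdev(offsets)        # more variation gives a wider range
    return (center - half, center + half)

def words_to_seed(ranges: Dict[str, Tuple[float, float]],
                  end_time: float) -> List[str]:
    """Words whose start-time range overlaps a detected word ending; the
    intensive comparison would be seeded for these at about `end_time`."""
    return [w for w, (lo, hi) in ranges.items() if lo <= end_time <= hi]
```

For example, a pattern detected at frame 100 whose historical offsets are 10, 12, and 14 frames yields a range centered on frame 88.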

Current Assignee: Dragon Systems, Inc. (Microsoft Corporation)
Original Assignee: Dragon Systems, Inc. (Microsoft Corporation)
Inventors: Gillick, Laurence; Baker, James K.
Primary Examiner(s): Roskoski, Bernard
Application Number: US07/035,628
Time in Patent Office: 683 Days
Field of Search: 381/41, 381/43, 381/45, 364/513.5
US Class Current: 704/241
CPC Class Codes: G10L 15/00 Speech recognition G10L17/0...