Method for representing word models for use in speech recognition

US 4,903,305 A
Filed: 03/23/1989
Issued: 02/20/1990
Est. Priority Date: 05/12/1986
Status: Expired due to Fees

Abstract

A method is provided for deriving acoustic word representations for use in speech recognition. Initial word models are created, each formed of a sequence of acoustic sub-models. The acoustic sub-models from a plurality of word models are clustered, so as to group acoustically similar sub-models from different words, using, for example, the Kullback-Leibler information as a metric of similarity. Each word is then represented by a cluster spelling, which identifies the clusters into which its acoustic sub-models were placed by the clustering. Speech recognition is performed by comparing sequences of frames from speech to be recognized against sequences of acoustic models associated with the clusters of the cluster spelling of individual word models. The invention also provides a method for deriving a word representation which involves receiving a first set of frame sequences for a word, using dynamic programming to derive a corresponding initial sequence of probabilistic acoustic sub-models for the word independently of any previously derived acoustic model particular to the word, using dynamic programming to time align each of a second set of frame sequences for the word into a succession of new sub-sequences corresponding to the initial sequence of models, and using these new sub-sequences to calculate new probabilistic sub-models.
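As a rough illustration of the similarity metric mentioned in the abstract, the sketch below computes the Kullback-Leibler divergence between two sub-models. It assumes each sub-model is a diagonal Gaussian described by a per-dimension mean and standard deviation; that distribution family and the function name `kl_diag_gaussian` are assumptions made for this example, not details taken from the patent text.

```python
import math

def kl_diag_gaussian(mean_p, std_p, mean_q, std_q):
    """KL(p || q) for two diagonal Gaussians (an assumed model family).

    Each argument is a list with one entry per parameter dimension.
    The per-dimension divergences add because the dimensions are independent.
    """
    total = 0.0
    for mp, sp, mq, sq in zip(mean_p, std_p, mean_q, std_q):
        total += math.log(sq / sp) + (sp ** 2 + (mp - mq) ** 2) / (2 * sq ** 2) - 0.5
    return total

# Hypothetical one-dimensional sub-model and cluster model.
node = ([0.0], [1.0])
cluster = ([1.0], [1.0])
divergence = kl_diag_gaussian(node[0], node[1], cluster[0], cluster[1])
```

The divergence is zero for identical distributions and is not symmetric in its arguments, so the order (sub-model first, cluster second) matters when it is used as a clustering distance.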

364 Citations

34 Claims

1. A method of deriving an acoustic word representation for use in speech recognition systems, said method comprising:
- creating a word model for each of a plurality of words, each word model having a temporal sequence of acoustic models derived from one or more utterances of its associated word;
  
  clustering the individual acoustic models from each of the plurality of word models, so as to place individual models into clusters of relatively similar models;
  
  providing a cluster ID for each such cluster; and
  
  creating a cluster spelling for a given word, said cluster spelling including a collection of cluster IDs indicating the clusters into which the sequence of acoustic models of said given word's word model have been placed by said clustering.
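The last step of claim 1 can be pictured with a small sketch. It assumes the clusters already exist and are summarized by a center vector, and that each acoustic model in a word is a plain mean vector compared by squared Euclidean distance. The cluster IDs `C0` and `C1`, the example words, and the distance choice are all hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cluster_spelling(word_model, cluster_centers):
    """Return the sequence of cluster IDs for one word model.

    word_model: temporal sequence of acoustic models (here, mean vectors).
    cluster_centers: dict mapping cluster ID -> center vector.
    """
    spelling = []
    for sub_model in word_model:
        nearest = min(
            cluster_centers,
            key=lambda cid: squared_distance(sub_model, cluster_centers[cid]),
        )
        spelling.append(nearest)
    return spelling

# Hypothetical clusters and word models.
cluster_centers = {"C0": [0.0, 0.0], "C1": [5.0, 5.0]}
word_models = {
    "go": [[0.2, -0.1], [4.8, 5.1]],
    "no": [[5.1, 4.9], [0.1, 0.0], [0.0, 0.3]],
}
spellings = {w: cluster_spelling(m, cluster_centers) for w, m in word_models.items()}
```

In this toy setup both words are spelled from the same two cluster IDs, which is the point of the representation: similar sounds in different words share one stored model.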

2. A method of deriving an acoustic word representation for use in speech recognition systems, said method comprising:
- receiving one or more sequences of acoustic frames for each of a plurality of words, each of said frames having a corresponding set of n parameter values;
  
  using dynamic programming to derive from said one or more frame sequences associated with each such word, a corresponding sequence of dynamic programming elements (hereinafter referred to as dp elements in this and depending claims), said dynamic programming including;
  
  creating a sequence of dp elements for each word, each having an n-dimensional probability distribution;
  
  using one or more iterations of dynamic programming to seek a relatively optimal match between the successive probability distributions of the sequence of dp elements for a given word and the successive parameter values of the one or more frame sequences associated with that word, so as to divide each of the one or more frame sequences associated with a given word into a plurality of sub-sequences each associated with one of said dp elements, each of said iterations involving calculating a new n-dimensional probability distribution for individual dp elements, each dimension of a given dp element's distribution being calculated as a function of corresponding parameter values from frames matched with the given dp element by said iteration;
  
  clustering the dp elements produced by said dynamic programming for each of said plurality of words into a plurality of clusters, said clustering including placing individual dp elements into the cluster of such elements which has a probability distribution closest to that element's own probability distribution, as determined by a certain statistical metric, and calculating an n-dimensional probability distribution for each cluster which is derived from the corresponding n-dimensional probability distribution of the dp elements placed within it; and
  
  creating a sequence of such clusters to represent a given word, with successive clusters of the sequence corresponding to successive dp elements in the sequence of such elements derived for the word by said dynamic programming, and with each such cluster being the cluster into which its corresponding dp element is placed by said clustering.
- Dependent claims (3-19):
  - 3. A method of deriving an acoustic word representation as described in claim 2, wherein said receiving of a sequence of acoustic frames includes receiving a plurality of sequences of acoustic frames for a plurality of said words, with each of said frame sequences received for a given word corresponding to a different utterance of said word.
  - 4. A method of deriving an acoustic word representation as described in claim 3, wherein said receiving of a plurality of frame sequences includes receiving, for each word, frame sequences derived from each of m speakers, where m is an integer larger than one.
  - 5. A method of deriving an acoustic word representation as described in claim 2, wherein said clustering of dp elements includes clustering together dp elements derived from utterances of different words.
  - 6. A method of deriving an acoustic word representation as described in claim 2, wherein:
    - said calculating of a new n-dimensional probability distribution for dp elements includes calculating a measure of central tendency and a measure of spread for each of said n-dimensions of a given dp element's probability distribution; and
      
      said calculating of an n-dimensional probability distribution for each cluster includes calculating a measure of central tendency and a measure of spread for each of the n-dimensions of said cluster probability distribution.
  - 7. A method of deriving an acoustic word representation as described in claim 6, wherein:
    - said certain statistical metric used in determining the closeness of a given dp element to a cluster, is derived from a formula of the following general form;
      
      E_n [g{s_n (x), s_c (x)}]
      where E_n is the expected value of the expression in square brackets which follows it over the distribution of frames x in the probability distribution of the node f_n;
      
      where s_n (x) is a score derived from the likelihood that a given frame x would be generated by the probability distribution f_n of the node;
      
      wherein s_c (x) is a score derived from the likelihood that a given frame x would be generated by the probability distribution f_c of the cluster model;
      
      and where g{a,b} is a function of the disparity, or more loosely speaking, the difference, between s_n (x) and s_c (x).
  - 8. A method of deriving an acoustic word representation as described in claim 7, wherein:
    - said certain statistical metric used in determining the closeness of a given dp element to a cluster, is derived from the following formula;
      
      K(f_n, f_e) = E_n [h(log{f_n (x)/f_e (x)})]
      where E_n is the expected value of the expression which follows it with respect to the given dp element distribution of values of x;
      
      where f_n is the probability distribution of the given dp element;
      
      where f_e is the probability distribution of the cluster; and
      
      where h(x)=x^z, where z is a positive integer.
  - 9. A method of deriving an acoustic word representation as described in claim 7, wherein:
    - said certain statistical metric used in determining the closeness of a given dp element to a cluster, is derived from the following formula;
      
      K(f_n, f_e) = E_n [h(log{f_n (x)/f_e (x)})]
      where E_n is the expected value of the expression which follows it with respect to the given dp element distribution of values of x;
      
      where f_n is the probability distribution of the given dp element;
      
      where f_e is the probability distribution of the cluster; and
      
      where h(x)=|x|^z, where z is any positive number.
  - 10. A method of deriving an acoustic word representation as described in claim 2 further comprising:
    - additional dynamic programming to re-divide each frame sequence into a new sequence of dp elements, said additional dynamic programming including;
      
      using dynamic programming to divide each of said one or more frame sequences associated with a given word into a new plurality of sub-sequences by time aligning said frame sequence against the sequence of cluster probability distributions associated with that word as a result of the method described in claim 2; and
      
      calculating new n-dimensional cluster probability distributions based on the division of said frame sequences into said new sub-sequences.
  - 11. A method of deriving an acoustic word representation as described in claim 10 wherein said calculating new n-dimensional cluster probabilities includes calculating an n-dimensional probability distribution for each cluster from the values of each of the n parameters of the frames associated with the new sub-sequences which are time aligned against that cluster by said additional dynamic programming.
  - 12. A method of deriving an acoustic word representation as described in claim 11, wherein the combined process of said additional dynamic programming and said calculating of new n-dimensional cluster probability distributions is repeated more than once.
  - 13. A method of deriving an acoustic word representation as described in claim 2, wherein said placing of each dp element into a cluster includes performing multiple clustering passes:
    - the first pass of which includes comparing each dp element to each cluster formed so far in said first pass, and placing it in the cluster to which it is closest according to said metric, unless, according to said metric, it is further than a specified threshold distance from any such clusters, in which case it is made into a separate cluster of which it is initially the only member; and
      
      the subsequent passes of which are substantially identical to said first pass, except that if said clustering places a given dp element into a cluster other than the one it is already in, the probability distribution of the cluster from which the dp element has been withdrawn must be recalculated to reflect the withdrawal of that dp element.
  - 14. A method of deriving an acoustic word representation as described in claim 13, further including comparing a singleton cluster, that is, a cluster having only one dp element associated with it, if there is such a singleton cluster, with each other cluster, and combining it with the other cluster to which it is closest according to said metric, unless its distance from said closest other cluster, according to said metric, is more than a specified threshold.
  - 15. A method of deriving an acoustic word representation as described in claim 13, wherein said comparing of each dp element to a cluster includes, when the dp element is not included in the cluster, temporarily altering the cluster for the purpose of the comparison to have the probability distribution which it would have if the dp element were in it.
  - 16. A method of deriving an acoustic word representation as described in claim 2, wherein said clustering of dp elements includes:
    - clustering said dp elements into a relatively small number of first level clusters;
      
      clustering the dp elements belonging to each of said first level clusters into a number of sub-clusters by;
      
      placing individual dp elements belonging to a given first level cluster into the sub-cluster of dp elements which has a probability distribution closest to that element's own probability distribution, as determined by a certain statistical metric; and
      
      calculating an n-dimensional probability distribution for each sub-cluster which is derived from the corresponding n-dimensional probability distributions of the dp elements placed within it; and
      
      wherein said creating of a sequence of clusters to represent a given word includes creating a sequence of sub-clusters to represent the word, with successive sub-clusters of the sequence corresponding to successive dp elements in the sequence of such elements derived for the word by said dynamic programming, and with each such sub-cluster being the sub-cluster into which its corresponding dp element is placed by said clustering algorithm.
  - 17. A method of deriving an acoustic word representation as described in claim 16, wherein said clustering of said dp elements into said first level clusters includes placing substantially all the dp elements associated with a given phoneme in one first level cluster, so that said given first level cluster corresponds to a phoneme and so that all sub-clusters associated with that first level cluster correspond to sounds associated with that phoneme.
  - 18. A method of deriving an acoustic word representation as described in claim 17, wherein said clustering of said dp elements into said first level clusters is done with human intervention to assure that substantially all the dp elements associated with said given phoneme are placed in said one first level cluster.
  - 19. A method of deriving an acoustic word representation as described in claim 16, wherein said clustering of said dp elements into sub-clusters is performed automatically without human intervention.
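The multi-pass clustering of claim 13 can be sketched roughly as follows. This is a simplified illustration under stated assumptions: each dp element is reduced to a mean vector, a cluster's model is the average of its members, and Euclidean distance stands in for the statistical metric of the claims. An element is withdrawn from its cluster before being compared, which is a simplification rather than the claimed handling. The function names and the `max_passes` parameter are hypothetical.

```python
import math

def distance(a, b):
    """Euclidean distance, standing in for the patent's statistical metric."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(vectors):
    """Per-dimension average of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def threshold_cluster(elements, threshold, max_passes=10):
    """Assign each element a cluster index using repeated threshold passes."""
    clusters = {}                       # cluster id -> set of element indices
    assignment = [None] * len(elements)
    next_id = 0
    for _ in range(max_passes):
        changed = False
        for i, elem in enumerate(elements):
            old = assignment[i]
            if old is not None:
                # Withdraw the element so its old cluster is recalculated without it.
                clusters[old].discard(i)
                if not clusters[old]:
                    del clusters[old]
            best_id, best_dist = None, float("inf")
            for cid, members in clusters.items():
                d = distance(elem, centroid([elements[j] for j in members]))
                if d < best_dist:
                    best_id, best_dist = cid, d
            if best_id is None or best_dist > threshold:
                # Too far from every cluster: the element becomes its own cluster.
                if old is not None and old not in clusters:
                    best_id = old       # it was already a singleton; keep its id
                else:
                    best_id = next_id
                    next_id += 1
                clusters[best_id] = set()
            clusters[best_id].add(i)
            assignment[i] = best_id
            if best_id != old:
                changed = True
        if not changed:
            break                       # a full pass moved nothing
    return assignment

# Hypothetical one-dimensional dp elements.
labels = threshold_cluster([[0.0], [0.1], [5.0], [5.2], [0.05]], threshold=1.0)
```

The threshold controls how many clusters emerge: a smaller value yields more, tighter clusters, and a larger value merges more elements into each cluster.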

20. A method of deriving an acoustic word representation for use in speech recognition systems, comprising:
- receiving a first set of sequences of acoustic frames generated by one or more utterances of a given word, each of said frames having a set of n parameter values;
  
  using dynamic programming, independently of any previously derived acoustic model particular to said given word, to automatically derive from said first set of frame sequences an initial acoustic model of said given word comprised of an initial sequence of acoustic probability distribution models, said dynamic programming including;
  
  dividing each of said first set of frame sequences into a corresponding plurality of sub-sequences of frames independently of any previously derived acoustic model particular to said given word;
  
  calculating a probability distribution model for each group of corresponding sub-sequences, which model includes an n-dimensional probability distribution, each dimension of which is calculated from one of the n corresponding parameter values of the frames occurring in its group of corresponding sub-sequences;
  
  using dynamic programming to time align each of said first set of frame sequences against said sequence of probability distribution models;
  
  dividing each of said first set of frame sequences into a new corresponding plurality of sub-sequences of frames based on said time alignment against said sequence of probability distribution models;
  
  calculating a new probability distribution model, of the type described above, for each group of corresponding sub-sequences;
  
  repeating one or more times the steps of using dynamic programming to time align, dividing each of said first sets of frame sequences into a new corresponding plurality of sub-sequences, and calculating new probability distribution models; and
  
  storing the sequence of probability distributions calculated by the last repetition of these three steps as said initial acoustic word model (hereinafter referred to as the initial sequence of probability distribution models in this and depending claims);
  
  using dynamic programming to time align each of a second set of frame sequences generated by one or more utterances of said given word against said initial sequence of probability distribution models, so as to divide each of said second set of frame sequences into a corresponding plurality of new sub-sequences, with each of said new sub-sequences being associated with one of said probability distribution models; and
  
  calculating a dynamic programming element (hereinafter referred to as a dp element in this and depending claims) for each group of corresponding new sub-sequences, which dp element includes an n-dimensional probability distribution, each dimension of which is calculated from one of the n corresponding parameter values of the frames of its associated group of corresponding new sub-sequences.
- Dependent claims (21-27):
  - 21. A method of deriving an acoustic word representation as described in claim 20, wherein:
    - said second set of frame sequences includes a plurality of frame sequences derived from the utterance of said given word by each of m speakers, where m is an integer larger than one; and
      
      said using of dynamic programming to time align is used to time align said frame sequences from each of said m speakers against said initial sequences of probability distribution models, so as to divide each of said frame sequences from each speaker into a corresponding plurality of new sub-sequences, with the corresponding new sub-sequences from different frame sequences being time aligned against the same probability distribution model in said initial sequence.
  - 22. A method of deriving an acoustic word representation as described in claim 21, wherein information from corresponding new sub-sequences from said m speakers are combined to derive multiple speaker word models.
  - 23. A method of deriving an acoustic word representation as described in claim 21:
    - wherein the method described in claim 21 is performed for each of a plurality of words;
      
      wherein said calculating of a dp element includes calculating a separate dp element for each group of said corresponding new sub-sequences from the different frame sequences from said m speakers;
      
      further including grouping together all the dp elements from said m speakers which correspond to the same initial probability distribution model of the same word to form a parallel dp element having a probability distribution with n separate probability distributions for each of said m speakers; and
      
      further clustering said parallel dp elements into a plurality of multi-speaker clusters, said clustering including placing individual parallel dp elements into the cluster of such elements which has a probability distribution closest to the element's own probability distribution, as determined by a certain statistical metric, and calculating an n times m dimensional probability distribution for each multi-speaker cluster which is derived from the corresponding n times m dimensional probability distribution of each of the parallel dp elements placed within it.
  - 24. A method of deriving an acoustic word representation as described in claim 23, further including calculating a multi-speaker n-dimensional probability distribution for each of said multi-speaker clusters, the value of each dimension of which is derived from the corresponding parameter value from the frames of sub-sequences of frames associated with parallel dp elements which have been placed in that multi-speaker cluster.
  - 25. A method of deriving an acoustic word representation as described in claim 24, further including making an acoustic representation for a given word by taking said initial sequence of probability distribution models for that word and replacing each of the models in said initial sequence with the multi-speaker n-dimensional probability distribution for the multi-speaker cluster in which the parallel dp element corresponding to that model was placed.
  - 26. A method of deriving an acoustic word representation as described in claim 20, wherein:
    - said receiving of a first set of frame sequences includes receiving a plurality of such sequences generated by a plurality of utterances of said given word;
      
      said dividing of said first set of frame sequences into sub-sequences includes dividing each of said first set of frame sequences into a corresponding plurality of initial sub-sequences; and
      
      said calculating of a probability distribution model for each of said initial sub-sequences includes calculating one such model for each group of corresponding initial sub-sequences, each dimension of which is calculated from one of the corresponding parameter values of the frames occurring in said group of corresponding initial sub-sequences.
  - 27. A method of deriving an acoustic word representation as described in claim 26, wherein said receiving of a plurality of such frame sequences includes receiving one or more of such frame sequences for said given word from each of m speakers, where m is an integer larger than one.
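The iterative procedure in claim 20 (divide the frames, estimate models, time align by dynamic programming, re-divide, repeat) can be sketched as below. The sketch assumes an even initial division of each frame sequence, models that keep only a per-dimension mean, and squared distance as the alignment cost. It also assumes every sequence has at least as many frames as there are models. All function names are hypothetical.

```python
def linear_segment(num_frames, num_models):
    """Initial division: spread frames evenly over the models, in order."""
    return [t * num_models // num_frames for t in range(num_frames)]

def frame_cost(frame, mean):
    """Squared distance, an assumed stand-in for a negative log-likelihood."""
    return sum((x - m) ** 2 for x, m in zip(frame, mean))

def estimate_means(sequences, alignments, num_models):
    """Average, per model and dimension, the frames aligned to that model."""
    dim = len(sequences[0][0])
    sums = [[0.0] * dim for _ in range(num_models)]
    counts = [0] * num_models
    for seq, align in zip(sequences, alignments):
        for frame, k in zip(seq, align):
            counts[k] += 1
            for d in range(dim):
                sums[k][d] += frame[d]
    return [[s / counts[k] for s in sums[k]] for k in range(num_models)]

def dp_align(seq, means):
    """Monotonic alignment: each frame stays on a model or moves to the next."""
    T, K = len(seq), len(means)
    INF = float("inf")
    cost = [[INF] * K for _ in range(T)]
    back = [[0] * K for _ in range(T)]
    cost[0][0] = frame_cost(seq[0], means[0])
    for t in range(1, T):
        for k in range(min(t + 1, K)):
            stay = cost[t - 1][k]
            move = cost[t - 1][k - 1] if k > 0 else INF
            if move < stay:
                best, back[t][k] = move, k - 1
            else:
                best, back[t][k] = stay, k
            cost[t][k] = best + frame_cost(seq[t], means[k])
    align = [0] * T
    k = K - 1                       # the path must end on the last model
    for t in range(T - 1, -1, -1):
        align[t] = k
        k = back[t][k]
    return align

def train_word_model(sequences, num_models, iterations=3):
    """Alternate model estimation and DP re-alignment a fixed number of times."""
    alignments = [linear_segment(len(s), num_models) for s in sequences]
    means = estimate_means(sequences, alignments, num_models)
    for _ in range(iterations):
        alignments = [dp_align(s, means) for s in sequences]
        means = estimate_means(sequences, alignments, num_models)
    return means, alignments

# Two hypothetical utterances of one word, with one parameter per frame.
utterances = [
    [[0.0], [0.0], [0.0], [0.0], [10.0], [10.0]],
    [[0.0], [10.0], [10.0], [10.0]],
]
means, alignments = train_word_model(utterances, num_models=2)
```

In this toy case the even initial division puts some frames under the wrong model, and the dynamic programming pass moves the boundaries to where the frame values actually change.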

28. A method of recognizing which word from among a plurality of words a given utterance corresponds to, said method comprising:
- receiving a sequence of acoustic frames generated by the utterance of a given word, each of said frames having a corresponding set of n parameter values;
  
  storing an alphabet of sound symbols, each of which has stored in association with it an n-dimensional probability distribution, with one dimension corresponding to each said n parameter values associated with said frames, wherein said alphabet of sound symbols is derived by clustering similar sounds in different words into a single sound symbol;
  
  storing an acoustic spelling for each of said plurality of words, each of which spellings represents a sequence of one or more of said sound symbols, with a plurality of said sound symbols being used in the spelling of more than one word; and
  
  comparing the parameter values of said frame sequence against the sequence of corresponding probability distributions associated with said acoustic spelling for a given word to determine if the frame sequence corresponds to said word.
- Dependent claims (29-34):
  - 29. A speech recognition method as described in claim 28, further including training a speech recognition system to recognize words spoken by a given speaker, said training including:
    - receiving a sequence of training frames associated with a sequence of one or more training words having a known acoustic spelling;
      
      dividing said training frame sequence into a plurality of sub-sequences of frames, with each of said sub-sequences of frames being associated with one of said sound symbols in said training words; and
      
      calculating one of said n dimensional probability distributions for a given sound symbol from the parameter values of the frames associated with that given sound symbol by said dividing of said training frame sequence into sub-sequences, and associating that probability distribution with that sound symbol in said alphabet of sound symbols.
  - 30. A speech recognition method as described in claim 29, wherein:
    - said storing of an alphabet of sound symbols includes storing an initial n-dimensional probability distribution for each of a plurality of said sound symbols previous to said training by said given speaker; and
      
      said dividing of said training frame sequence includes time aligning said training frame sequence against the sequence of initial probability distributions associated with the acoustic spellings for said sequence of training words.
  - 31. A speech recognition method as described in claim 30, wherein said time aligning of said training frame sequence against the sequence of corresponding initial probability distributions includes making said time alignment by means of dynamic programming.
  - 32. A speech recognition method as described in claim 29, wherein said sequence of one or more training words includes substantially fewer words than are contained in said plurality of words for which spellings are stored.
  - 33. A speech recognition method as described in claim 32, wherein the acoustic spelling of said sequence of one or more training words includes all the sound symbols stored in said alphabet of sound symbols.
  - 34. A speech recognition method as described in claim 28, wherein each of said sound symbols has only one of said n-dimensional probability distributions stored in association with it.
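The comparison step of claim 28 can be sketched as follows. The example assumes each sound symbol stores a diagonal Gaussian (mean and standard deviation per dimension), that a frame sequence is scored against a spelling by the lowest-cost monotonic alignment, and that the recognizer simply picks the word with the lowest cost. The symbols `A` and `B`, the words, and the function names are hypothetical.

```python
import math

def neg_log_likelihood(frame, mean, std):
    """Negative log-likelihood of a frame under an assumed diagonal Gaussian."""
    return sum(
        math.log(s) + 0.5 * math.log(2 * math.pi) + 0.5 * ((x - m) / s) ** 2
        for x, m, s in zip(frame, mean, std)
    )

def spelling_cost(frames, spelling, alphabet):
    """Lowest-cost monotonic alignment of frames to the spelling's symbols."""
    models = [alphabet[symbol] for symbol in spelling]
    INF = float("inf")
    prev = [INF] * len(models)
    prev[0] = neg_log_likelihood(frames[0], *models[0])
    for frame in frames[1:]:
        cur = [INF] * len(models)
        for k, (mean, std) in enumerate(models):
            # Either stay on symbol k or advance from symbol k - 1.
            best = min(prev[k], prev[k - 1] if k > 0 else INF)
            if best < INF:
                cur[k] = best + neg_log_likelihood(frame, mean, std)
        prev = cur
    return prev[-1]                 # the alignment must end on the last symbol

def recognize(frames, spellings, alphabet):
    """Return the word whose acoustic spelling matches the frames best."""
    return min(spellings, key=lambda w: spelling_cost(frames, spellings[w], alphabet))

# Hypothetical alphabet of two sound symbols shared by two words.
alphabet = {"A": ([0.0], [1.0]), "B": ([5.0], [1.0])}
spellings = {"ab": ["A", "B"], "ba": ["B", "A"]}
best_word = recognize([[0.1], [0.0], [4.9], [5.2]], spellings, alphabet)
```

Because both words draw on the same two stored distributions, adding a word to the vocabulary in this scheme only requires storing its spelling, not new acoustic models.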

Current Assignee: Dragon Systems, Inc. (Microsoft Corporation)
Original Assignee: Dragon Systems, Inc. (Microsoft Corporation)
Inventors: Gillick, Laurence; Sturtevant, Dean; Baker, James K.; Baker, Janet M.; Roth, Robert S.
Primary Examiner(s): Harkcom, Gary V.
Assistant Examiner(s): Knepper, David D.
Application Number: US07/328,738
Time in Patent Office: 334 Days
Field of Search: 364/513.5, 364/513, 381/41-43
US Class Current: 704/245
CPC Class Codes:
G10L 15/063   Training
G10L 15/144   Training of HMMs
G10L 2015/0631   Creating reference template...
