Method for representing word models for use in speech recognition
First Claim
1. A method of deriving an acoustic word representation for use in speech recognition systems, said method comprising:
creating a word model for each of a plurality of words, each word model having a temporal sequence of acoustic models derived from one or more utterances of its associated word;
clustering the individual acoustic models from each of the plurality of word models, so as to place individual models into clusters of relatively similar models;
providing a cluster ID for each such cluster; and
creating a cluster spelling for a given word, said cluster spelling including a collection of cluster IDs indicating the clusters into which the sequence of acoustic models of said given word's word model have been placed by said clustering.
Abstract
A method is provided for deriving acoustic word representations for use in speech recognition. Initial word models are created, each formed of a sequence of acoustic sub-models. The acoustic sub-models from a plurality of word models are clustered, so as to group acoustically similar sub-models from different words, using, for example, the Kullback-Leibler information as a metric of similarity. Each word is then represented by a cluster spelling that identifies the clusters into which its acoustic sub-models were placed by the clustering. Speech recognition is performed by comparing sequences of frames from the speech to be recognized against the sequences of acoustic models associated with the clusters in the cluster spellings of individual words. The invention also provides a method for deriving a word representation which involves receiving a first set of frame sequences for a word; using dynamic programming to derive a corresponding initial sequence of probabilistic acoustic sub-models for the word, independently of any previously derived acoustic model particular to the word; using dynamic programming to time align each of a second set of frame sequences for the word into a succession of new sub-sequences corresponding to the initial sequence of models; and using these new sub-sequences to calculate new probabilistic sub-models.
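The abstract names the Kullback-Leibler information as one possible similarity metric between sub-models. The following is a minimal sketch of such a metric, assuming each sub-model is a diagonal Gaussian given as a pair of per-dimension means and variances. The abstract does not fix the distribution family, and the function names `kl_diag_gauss` and `symmetric_kl` are hypothetical.

```python
import math

def kl_diag_gauss(mean_p, var_p, mean_q, var_q):
    """KL(p || q) between two diagonal Gaussians, summed over dimensions.

    Assumption: sub-models are diagonal Gaussians; this is an
    illustrative choice, not something stated in the abstract.
    """
    total = 0.0
    for mp, vp, mq, vq in zip(mean_p, var_p, mean_q, var_q):
        total += 0.5 * (math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0)
    return total

def symmetric_kl(p, q):
    """Symmetrised divergence; p and q are (means, variances) pairs."""
    return kl_diag_gauss(*p, *q) + kl_diag_gauss(*q, *p)

# Two hypothetical one-dimensional sub-models taken from different words.
sub_model_a = ([0.0], [1.0])
sub_model_b = ([1.0], [1.0])
print(symmetric_kl(sub_model_a, sub_model_b))
```

Symmetrising the divergence is a design choice made here so that the distance does not depend on argument order; a one-sided divergence could equally be used.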
364 Citations
34 Claims
1. A method of deriving an acoustic word representation for use in speech recognition systems, said method comprising:
creating a word model for each of a plurality of words, each word model having a temporal sequence of acoustic models derived from one or more utterances of its associated word;
clustering the individual acoustic models from each of the plurality of word models, so as to place individual models into clusters of relatively similar models;
providing a cluster ID for each such cluster; and
creating a cluster spelling for a given word, said cluster spelling including a collection of cluster IDs indicating the clusters into which the sequence of acoustic models of said given word's word model have been placed by said clustering.
2. A method of deriving an acoustic word representation for use in speech recognition systems, said method comprising:
receiving a one or more sequences of acoustic frames for each of a plurality of words, each of said frames having a corresponding set of n parameter values;
using dynamic programming to derive from said one or more frame sequences associated with each such word, a corresponding sequence of dynamic programming elements (hereinafter referred to as dp elements in this and depending claims), said dynamic programming including;
creating a sequence of dp elements for each word, each having an n-dimensional probability distribution;
using one or more iterations of dynamic programming to seek a relatively optimal match between the successive probability distributions of the sequence of dp elements for a given word and the successive parameter values of the one or more frame sequences associated with that word, so as to divide each of the one or more frame sequences associated with a given word into a plurality of sub-sequences each associated with one of said dp elements, each of said iterations involving calculating a new n-dimensional probability distribution for individual dp elements, each dimension of a given dp element's distribution being calculated as a function of corresponding parameter values from frames matched with the given dp element by said iteration;
clustering the dp elements produced by said dynamic programming for each of said plurality of words into a plurality of clusters, said clustering including placing individual dp elements into the cluster of such elements which has a probability distribution closest to that element's own probability distribution, as determined by a certain statistical metric, and calculating an n-dimensional probability distribution for each cluster which is derived from the corresponding n-dimensional probability distribution of the dp elements placed within it; and
creating a sequence of such clusters to represent a given word, with successive clusters of the sequence corresponding to successive dp elements in the sequence of such elements derived for the word by said dynamic programming, and with each such cluster being the cluster into which its corresponding dp element is placed by said clustering.
Dependent claims: 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19
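As an informal illustration of the clustering and cluster-sequence steps recited in claim 2 (not part of the claim language), the sketch below places each dp element into the nearest existing cluster, or opens a new cluster when none is close enough, and records the resulting cluster IDs per word. Several things here are assumptions: dp elements are reduced to mean vectors, squared Euclidean distance stands in for the "certain statistical metric", the clustering is a single greedy pass with a hypothetical `threshold`, and the name `cluster_word_models` is invented for the example.

```python
def sq_dist(a, b):
    """Squared Euclidean distance; a stand-in for the statistical metric."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cluster_word_models(word_models, threshold):
    """Greedy single-pass clustering of dp elements across words.

    word_models: dict mapping word -> list of mean vectors (one per dp element).
    Returns (cluster_means, spellings), where spellings maps each word to
    the sequence of cluster IDs of its dp elements.
    """
    cluster_means = []  # list index doubles as the cluster ID
    members = []        # members[cid] = vectors placed in that cluster
    spellings = {}
    for word, elements in word_models.items():
        spelling = []
        for elem in elements:
            # Find the closest existing cluster, if any.
            best_id, best_d = None, None
            for cid, centre in enumerate(cluster_means):
                d = sq_dist(elem, centre)
                if best_d is None or d < best_d:
                    best_id, best_d = cid, d
            # Open a new cluster when nothing is within the threshold.
            if best_id is None or best_d > threshold:
                cluster_means.append(list(elem))
                members.append([])
                best_id = len(cluster_means) - 1
            members[best_id].append(elem)
            # Recompute the cluster model from the elements placed in it.
            n = len(members[best_id])
            cluster_means[best_id] = [
                sum(v[i] for v in members[best_id]) / n for i in range(len(elem))
            ]
            spelling.append(best_id)
        spellings[word] = spelling
    return cluster_means, spellings

# Hypothetical one-dimensional dp elements for two words.
models = {"cat": [[0.0], [5.0], [10.0]], "cot": [[0.1], [7.0], [10.1]]}
means, spellings = cluster_word_models(models, threshold=1.0)
print(spellings)  # {'cat': [0, 1, 2], 'cot': [0, 3, 2]}
```

In this toy input the first and last elements of the two words fall into shared clusters 0 and 2, while the middle elements remain distinct. A fuller implementation would likely iterate the assignment until cluster membership stops changing rather than rely on a single pass.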
20. A method of deriving an acoustic word representation for use in speech recognition systems, comprising:
receiving a first set of sequences of acoustic frames generated by one or more utterances of a given word, each of said frames having a set of n parameter values;
using dynamic programming, independently of any previously derived acoustic model particular to said given word, to automatically derive from said first set of frame sequences an initial acoustic model of said given word comprised of an initial sequence of acoustic probability distribution models, said dynamic programming including;
dividing each of said first set of frame sequences into a corresponding plurality of sub-sequences of frames independently of any previously derived acoustic model particular to said given word;
calculating a probability distribution model for each group of corresponding sub-sequences, which model includes an n-dimensional probability distribution, each dimension of which is calculated from one of the n corresponding parameter values of the frames occurring in its group of corresponding sub-sequences;
using dynamic programming to time align each of said first set of frame sequences against said sequence of probability distribution models;
dividing each of said first set of frame sequences into a new corresponding plurality of sub-sequences of frames based on said time alignment against said sequence of probability distribution models;
calculating a new probability distribution model, of the type described above, for each group of corresponding sub-sequences;
repeating one or more times the steps of using dynamic programming to time align, dividing each of said first sets of frame sequences into a new corresponding plurality of sub-sequences, and calculating new probability distribution models; and
storing the sequence of probability distributions calculated by the last repetition of these three steps as said initial acoustic word model (hereinafter referred to as the initial sequence of probability distribution models in this and depending claims);
using dynamic programming to time align each of a second set of frame sequences generated by one or more utterances of said given word against said initial sequence of probability distribution models, so as to divide each of said second set of frame sequences into a corresponding plurality of new sub-sequences, with each of said new sub-sequences being associated with one of said probability distribution models; and
calculating a dynamic programming element (hereinafter referred to as a dp element in this and depending claims) for each group of corresponding new sub-sequences, which dp element includes an n-dimensional probability distribution, each dimension of which is calculated from one of the n corresponding parameter values of the frames of its associated group of corresponding new sub-sequences.
Dependent claims: 21, 22, 23, 24, 25, 26, 27
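The iterative procedure recited in claim 20 (an initial division into sub-sequences, model estimation, dynamic-programming time alignment, and re-estimation) can be sketched informally as follows. This is not the claim language and not necessarily the patentee's implementation. The assumptions are: each probability distribution model is reduced to a mean vector, squared distance stands in for a likelihood score, the initial division is an even linear split, every sequence has at least as many frames as there are models, and all function names are hypothetical.

```python
def sq_dist(a, b):
    """Squared distance; a stand-in for a negative log-likelihood score."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def linear_segments(n_frames, n_models):
    """Boundaries that split n_frames into n_models roughly equal runs."""
    return [i * n_frames // n_models for i in range(n_models + 1)]

def segment_means(sequences, bounds_per_seq, n_models):
    """One mean vector per model, pooled over corresponding sub-sequences."""
    dim = len(sequences[0][0])
    means = []
    for k in range(n_models):
        frames = [f for seq, b in zip(sequences, bounds_per_seq)
                  for f in seq[b[k]:b[k + 1]]]
        means.append([sum(f[d] for f in frames) / len(frames) for d in range(dim)])
    return means

def dp_align(seq, means):
    """Left-to-right alignment; returns segment boundaries for seq."""
    T, K = len(seq), len(means)
    INF = float("inf")
    cost = [[INF] * K for _ in range(T)]
    back = [[0] * K for _ in range(T)]
    cost[0][0] = sq_dist(seq[0], means[0])
    for t in range(1, T):
        for k in range(min(t + 1, K)):
            stay = cost[t - 1][k]
            move = cost[t - 1][k - 1] if k > 0 else INF
            if move < stay:
                best, back[t][k] = move, k - 1
            else:
                best, back[t][k] = stay, k
            cost[t][k] = best + sq_dist(seq[t], means[k])
    # Trace back from the last frame in the last model.
    k, bounds = K - 1, [T]
    for t in range(T - 1, 0, -1):
        if back[t][k] != k:
            bounds.append(t)  # frame t is the first frame of model k
            k -= 1
    bounds.append(0)
    return bounds[::-1]

def train_initial_model(sequences, n_models, iterations=3):
    """Linear split, then repeated align / re-divide / re-estimate."""
    bounds = [linear_segments(len(s), n_models) for s in sequences]
    means = segment_means(sequences, bounds, n_models)
    for _ in range(iterations):
        bounds = [dp_align(s, means) for s in sequences]
        means = segment_means(sequences, bounds, n_models)
    return means, bounds

# Two hypothetical utterances of one word, with one-dimensional frames.
first_set = [[[0.0], [0.0], [0.0], [10.0], [10.0]],
             [[0.0], [10.0], [10.0], [10.0]]]
means, bounds = train_initial_model(first_set, n_models=2)
print(means, bounds)
```

Under the same assumptions, the final two steps of the claim would correspond to calling `dp_align` on each sequence of a second set against the stored `means`, then calling `segment_means` on the resulting boundaries to obtain the dp elements.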
28. A method of recognizing which word from among a plurality of words a given utterance corresponds to, said method comprising:
receiving a sequence of acoustic frames generated by the utterance of a given word, each of said frames having a corresponding set of n parameter values;
storing an alphabet of sound symbols, each of which has stored in association with it an n-dimensional probability distribution, with one dimension corresponding to each said n parameter values associated with said frames, wherein said alphabet of sound symbols is derived by clustering similar sounds in different words into a single sound symbol;
storing an acoustic spelling for each of said plurality of words, each of which spellings represents a sequence of one or more of said sound symbols, with a plurality of said sound symbols being used in the spelling of more than one word; and
comparing the parameter values of said frame sequence against the sequence of corresponding probability distributions associated with said acoustic spelling for a given word to determine if the frame sequence corresponds to said word.
Dependent claims: 29, 30, 31, 32, 33, 34
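As an informal illustration of the comparison step in claim 28 (again, not part of the claim language), the sketch below scores a frame sequence against each word's acoustic spelling and returns the best-scoring word. The assumptions are: each sound symbol carries a diagonal Gaussian, the comparison is a left-to-right dynamic-programming alignment that minimises negative log-likelihood, and the symbol names, words, and function names are all hypothetical.

```python
import math

def neg_log_likelihood(frame, symbol):
    """Negative log-likelihood of a frame under a diagonal Gaussian symbol."""
    mean, var = symbol
    return sum(0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
               for x, m, v in zip(frame, mean, var))

def spelling_score(frames, spelling, alphabet):
    """Lowest total cost of aligning frames, in order, with the spelling."""
    INF = float("inf")
    prev = [INF] * len(spelling)
    prev[0] = neg_log_likelihood(frames[0], alphabet[spelling[0]])
    for frame in frames[1:]:
        cur = []
        for k, sym in enumerate(spelling):
            # Either stay in the same symbol or advance from the previous one.
            best = min(prev[k], prev[k - 1] if k > 0 else INF)
            cur.append(best + neg_log_likelihood(frame, alphabet[sym]))
        prev = cur
    return prev[-1]

def recognize(frames, spellings, alphabet):
    """Return the word whose spelling gives the lowest alignment cost."""
    return min(spellings, key=lambda w: spelling_score(frames, spellings[w], alphabet))

# Hypothetical alphabet: symbol -> (means, variances), one dimension each.
alphabet = {"A": ([0.0], [1.0]), "B": ([5.0], [1.0]), "C": ([10.0], [1.0])}
# Symbol "A" is shared by the spellings of both words.
spellings = {"one": ["A", "B"], "two": ["A", "C"]}
utterance = [[0.1], [0.2], [9.8], [10.1]]
print(recognize(utterance, spellings, alphabet))  # two
```

Because the distributions belong to the shared symbols rather than to individual words, adding a word in this sketch only requires storing its spelling.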
Specification