Method for creating and using multiple-word sound models in speech recognition
Abstract
A first speech recognition method receives an acoustic description of an utterance to be recognized and scores a portion of that description against each of a plurality of cluster models representing similar sounds from different words. The resulting score for each cluster is used to calculate a word score for each word represented by that cluster. Preferably these word scores are used to prefilter vocabulary words, and the description of the utterance includes a succession of acoustic descriptions which are compared by linear time alignment against a succession of acoustic models. A second speech recognition method is also provided which matches an acoustic model with each of a succession of acoustic descriptions of an utterance to be recognized. Each of these models has a probability score for each vocabulary word. The probability scores for each word associated with the matching acoustic models are combined to form a total score for that word. The preferred speech recognition method calculates two separate word scores for each currently active vocabulary word from a common succession of sounds. Preferably the first score is calculated by a time alignment method, while the second score is calculated by a time independent method. Preferably this calculation of two separate word scores is used in one of multiple word-selecting phases of a recognition process, such as in the prefiltering phase.
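The prefiltering idea in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the function name `prefilter`, the cluster labels, the convention that a lower score means a better match, and the rule that a word takes the best score among its clusters are all assumptions made for illustration.

```python
from typing import Dict, List


def prefilter(cluster_scores: Dict[str, float],
              word_clusters: Dict[str, List[str]],
              keep: int) -> List[str]:
    """Return the `keep` best words; a lower score is assumed to be a better match."""
    # Each word inherits a score from the clusters that represent it
    # (taking the minimum is an illustrative assumption).
    word_scores = {
        word: min(cluster_scores[c] for c in clusters)
        for word, clusters in word_clusters.items()
    }
    # Only the surviving words would go on to the lengthier comparison.
    return sorted(word_scores, key=word_scores.get)[:keep]


# Hypothetical cluster scores, already computed against the utterance.
cluster_scores = {"st-": 2.1, "sp-": 5.4, "tr-": 9.0}
word_clusters = {"start": ["st-"], "stop": ["st-"],
                 "speak": ["sp-"], "train": ["tr-"]}
print(prefilter(cluster_scores, word_clusters, keep=3))
```

Because one cluster score is shared by every word the cluster represents, the number of acoustic comparisons grows with the number of clusters rather than with the size of the vocabulary.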
252 Citations
24 Claims
1. A prefiltering method for use in a speech recognition system, said method comprising:
receiving an acoustic description of an utterance to be recognized;
storing a vocabulary of words;
storing a plurality of probabilistic acoustic cluster models and using individual ones of said acoustic cluster models to represent at least a part of more than one vocabulary word;
comparing at least a portion of said acoustic description from said utterance against each of said cluster models, and producing a cluster likelihood score for each cluster model against which such a comparison is made;
using the cluster likelihood score produced for each cluster model to calculate a prefilter score for words represented by that cluster model; and
selecting a subset of said vocabulary words to undergo a more lengthy comparison against said utterance to be recognized based on the prefilter scores associated with said vocabulary words;
wherein:
said acoustic description of said utterance to be recognized includes a succession of acoustic descriptions representing a sequence of sounds associated with said utterance;
said cluster models each comprises a succession of probabilistic acoustic models, for modeling a sequence of sounds associated with each word represented by said cluster model;
said comparing includes comparing a succession of said acoustic descriptions from the utterance to be recognized against the succession of acoustic models from each of a plurality of cluster models and producing a cluster likelihood score for each such cluster model as a result of that comparison; and
said cluster models are wordstart cluster models, that is, models which only represent the initial portion of many words in said vocabulary.
Dependent claims: 2, 3

4. A prefiltering method for use in a speech recognition system, said method comprising:
receiving an acoustic description of an utterance to be recognized;
storing a vocabulary of words;
storing a plurality of probabilistic acoustic cluster models and using individual ones of said acoustic cluster models to represent at least a part of more than one vocabulary word;
comparing at least a portion of said acoustic description from said utterance against each of said cluster models, and producing a cluster likelihood score for each cluster model against which such a comparison is made;
using the cluster likelihood score produced for each cluster model to calculate a prefilter score for words represented by that cluster model; and
selecting a subset of said vocabulary words to undergo a more lengthy comparison against said utterance to be recognized based on the prefilter scores associated with said vocabulary words;
wherein:
said acoustic description of said utterance to be recognized includes a succession of acoustic descriptions representing a sequence of sounds associated with said utterance;
said cluster models each comprises a succession of probabilistic acoustic models, for modeling a sequence of sounds associated with each word represented by said cluster model;
said comparing includes comparing a succession of said acoustic descriptions from the utterance to be recognized against the succession of acoustic models from each of a plurality of cluster models and producing a cluster likelihood score for each such cluster model as a result of that comparison; and
said comparing includes using linear time alignment to compare successive descriptions from said utterance against corresponding successive models of said cluster models.

5. A prefiltering method for use in a speech recognition system, said method comprising:
receiving an acoustic description of an utterance to be recognized;
storing a vocabulary of words;
storing a plurality of probabilistic acoustic cluster models and using individual ones of said acoustic cluster models to represent at least a part of more than one vocabulary word;
comparing at least a portion of said acoustic description from said utterance against each of said cluster models, and producing a cluster likelihood score for each cluster model against which such a comparison is made;
using the cluster likelihood produced for each cluster model to calculate a prefilter score for words represented by that cluster model; and
selecting a subset of said vocabulary words to undergo a more lengthy comparison against said utterance to be recognized based on the prefilter scores associated with said vocabulary words;
wherein:
said receiving of an acoustic description of said utterance to be recognized includes receiving a sequence of individual frames, each describing said utterance during a brief period of time, and
said comparing includes deriving a series of smoothed frames from said sequence of individual frames, each of said smoothed frames being derived from a weighted average of a plurality of individual frames, and comparing at least one of said smoothed frames against said cluster models.

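The smoothed frames of claim 5 can be sketched as a weighted moving average over the individual frames. The window weights, the step between windows, and the name `smooth_frames` are hypothetical choices for illustration; the claim only requires that each smoothed frame be a weighted average of a plurality of individual frames.

```python
from typing import List, Sequence

Frame = List[float]


def smooth_frames(frames: Sequence[Frame],
                  weights: Sequence[float],
                  step: int) -> List[Frame]:
    """Derive smoothed frames, each a weighted average of len(weights) frames."""
    total = sum(weights)
    width = len(weights)
    smoothed = []
    # Slide a window over the frame sequence, advancing `step` frames each time.
    for start in range(0, len(frames) - width + 1, step):
        window = frames[start:start + width]
        smoothed.append([
            sum(w * f[i] for w, f in zip(weights, window)) / total
            for i in range(len(window[0]))
        ])
    return smoothed


# Five hypothetical two-parameter frames, smoothed three at a time.
frames = [[0.0, 4.0], [2.0, 4.0], [4.0, 4.0], [6.0, 0.0], [8.0, 0.0]]
print(smooth_frames(frames, weights=[1.0, 2.0, 1.0], step=2))
# → [[2.0, 4.0], [6.0, 1.0]]
```

Smoothing reduces the number of frames that must be compared against the cluster models and makes each comparison less sensitive to frame-level variation.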
6. A speech recognition method comprising:
receiving an acoustic description of an utterance to be recognized, including a succession of acoustic descriptions representing a sequence of sounds associated with the utterance;
storing a vocabulary of words;
storing a plurality of sound-sequence models and using individual ones of said sound-sequence models to represent at least a part of more than one word, each of said models comprising a succession of probabilistic acoustic models modeling a sequence of sounds associated with each word represented by said sound-sequence model;
using linear time alignment to compare a succession of acoustic descriptions from said utterance to be recognized against the succession of acoustic models from each of a plurality of sound-sequence models, and for producing a sound-sequence score for each such sound-sequence model as a result of its comparison;
using the sound-sequence score produced for a given sound-sequence model as a result of its comparison to calculate a word score for each of a plurality of words associated with that sound-sequence model.
Dependent claims: 7, 8

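Linear time alignment, as used in claims 4 and 6, pairs each acoustic description with a model position fixed by a linear mapping rather than by a warping search. The sketch below is illustrative only: representing each probabilistic acoustic model as a diagonal Gaussian (means and variances) and scoring by negative log likelihood are assumptions, as are all names and values.

```python
import math
from typing import Sequence, Tuple

Frame = Sequence[float]
# Assumed model form: (means, variances) per acoustic parameter.
Node = Tuple[Sequence[float], Sequence[float]]


def neg_log_likelihood(frame: Frame, node: Node) -> float:
    """Negative log likelihood of a frame under a diagonal Gaussian node."""
    means, variances = node
    return sum(
        0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
        for x, m, v in zip(frame, means, variances)
    )


def linear_alignment_score(frames: Sequence[Frame],
                           model: Sequence[Node]) -> float:
    """Score frames against a model sequence; lower means a closer match."""
    total = 0.0
    for k, frame in enumerate(frames):
        # Linear mapping from frame index to model index, with no warping.
        j = k * len(model) // len(frames)
        total += neg_log_likelihood(frame, model[j])
    return total


# Two hypothetical sound-sequence models with one parameter per node.
models = {
    "st-": [([0.0], [1.0]), ([5.0], [1.0])],
    "tr-": [([5.0], [1.0]), ([0.0], [1.0])],
}
utterance = [[0.2], [4.9]]
scores = {name: linear_alignment_score(utterance, m) for name, m in models.items()}
print(min(scores, key=scores.get))  # → st-
```

Because the alignment is fixed in advance, the cost is one frame-to-node comparison per frame, which is much cheaper than a search over possible time warpings.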
9. A method of making models to represent sounds of vocabulary words for use in speech recognition, said method comprising:
deriving an acoustic description of each of a plurality of vocabulary words, each acoustic description comprising a succession of acoustic descriptions representing a sequence of sounds associated with its corresponding word;
clustering the acoustic descriptions of said vocabulary words to derive a plurality of probabilistic multi-word cluster models, each comprising a succession of acoustic models derived from the corresponding succession of acoustic descriptions of the words whose descriptions have been grouped in that model's cluster
wherein said clustering to derive said multi-word cluster models includes:
clustering said acoustic descriptions of vocabulary words to derive a plurality of probabilistic multi-word clusters, each having a multi-word cluster model formed of a succession of cluster portions which represent successive temporal portions of the common sound sequence represented by the multi-word cluster model;
clustering the cluster portions from a plurality of multi-word cluster models to derive a plurality of cluster-portion cluster models; and
forming a record of which multi-word clusters have had their cluster portions placed in which cluster-portion cluster models.

10. A method of making models to represent sounds of vocabulary words for use in speech recognition, said method comprising:
deriving an acoustic description of each of a plurality of vocabulary words, each acoustic description comprising a succession of acoustic descriptions representing a sequence of sounds associated with its corresponding word;
clustering the acoustic descriptions of said vocabulary words to derive a plurality of probabilistic multi-word cluster models, each comprising a succession of acoustic models derived from the corresponding succession of acoustic descriptions of the words whose descriptions have been grouped in that model's cluster;
wherein:
said acoustic descriptions used in said clustering describe only the initial portion of some words in said vocabulary; and
thus said multi-word cluster models are wordstart models, that is, models which represent only the initial portions of some words in said vocabulary.
Dependent claims: 11

12. A speech recognition method comprising:
receiving a succession of acoustic descriptions, each of which describes one of a succession of sounds from an utterance to be recognized;
storing a plurality of acoustic models;
storing in association with each such model a separate word score for each of a plurality of vocabulary words, which score indicates the probability that its associated word corresponds to the utterance to be recognized given that the score's associated acoustic model is found to match an acoustic description from that utterance;
matching one of said acoustic models with each of said acoustic descriptions based on the relative closeness of said models with said descriptions;
calculating a total word score for each of said vocabulary words by combining the word scores for that word associated with each of the acoustic models which matches one of said acoustic descriptions; and
using the total word score produced for each vocabulary word to determine which of said vocabulary words most probably corresponds to the utterance to be recognized;
wherein said word score stored in association with each of the acoustic models for each of said vocabulary words indicates the probability that its word corresponds to the utterance to be recognized given that the word's model is found to match an acoustic description from that utterance, independent of the time within the succession of acoustic descriptions that the matching acoustic description occurs.

13. A speech recognition method comprising:
receiving a succession of acoustic descriptions, each of which describes one of a succession of sounds from an utterance to be recognized;
storing a plurality of acoustic models;
storing in association with each such model a separate word score for each of a plurality of vocabulary words, which score indicates the probability that its associated word corresponds to the utterance to be recognized given that the score's associated acoustic model is found to match an acoustic description from that utterance;
matching one of said acoustic models with each of said acoustic descriptions based on the relative closeness of said models with said descriptions;
calculating a total word score for each of said vocabulary words by combining the word scores for that word associated with each of the acoustic models which matches one of said acoustic descriptions; and
using the total word score produced for each vocabulary word to determine which of said vocabulary words most probably corresponds to the utterance to be recognized;
in which:
the word score for each vocabulary word stored in association with each acoustic model corresponds to a logarithm of the probability of that word corresponding to the utterance to be recognized given that its model is found to match an acoustic description from that utterance; and
said calculating of a total word score for each vocabulary word includes adding the word scores associated with that word from each of the selected acoustic models.
Dependent claims: 14

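Claim 13 stores word scores as logarithms of probabilities, so combining them reduces to addition. The following sketch shows that arithmetic under stated assumptions: the nearest-model match by squared Euclidean distance, the model names, and the probability values are all hypothetical, and every model is assumed to store a score for every word.

```python
import math
from typing import Dict, Sequence

Frame = Sequence[float]


def nearest_model(frame: Frame, models: Dict[str, Frame]) -> str:
    """Pick the closest model; squared Euclidean distance is an assumption."""
    return min(
        models,
        key=lambda name: sum((x - m) ** 2 for x, m in zip(frame, models[name])),
    )


def total_word_scores(frames: Sequence[Frame],
                      models: Dict[str, Frame],
                      log_scores: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Add, per word, the log scores stored with each matched model."""
    totals: Dict[str, float] = {}
    for frame in frames:
        matched = nearest_model(frame, models)
        for word, log_p in log_scores[matched].items():
            totals[word] = totals.get(word, 0.0) + log_p
    return totals


# Hypothetical models and stored log P(word | model matched).
models = {"A": [0.0], "B": [10.0]}
log_scores = {
    "A": {"yes": math.log(0.8), "no": math.log(0.2)},
    "B": {"yes": math.log(0.3), "no": math.log(0.7)},
}
frames = [[1.0], [9.0], [0.5]]  # matches A, B, A
totals = total_word_scores(frames, models, log_scores)
print(max(totals, key=totals.get))  # → yes
```

Since addition is commutative, the total for a word does not depend on the order in which the matching descriptions occur, which is the time independence described in claim 12.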
15. A speech recognition method comprising:
receiving a succession of acoustic descriptions, each of which describes one of a succession of sounds from an utterance to be recognized;
storing a plurality of acoustic models;
storing in association with each such model a separate word score for each of a plurality of vocabulary words, which score indicates the probability that its associated word corresponds to the utterance to be recognized given that the score's associated acoustic model is found to match an acoustic description from that utterance;
matching one of said acoustic models with each of said acoustic descriptions based on the relative closeness of said models with said descriptions;
calculating a total word score for each of said vocabulary words by combining the word scores for that word associated with each of the acoustic models which matches one of said acoustic descriptions; and
using the total word score produced for each vocabulary words by combining the word scores for that word associated with each of the acoustic models which matches one of said acoustic descriptions; and
said using the total word score produced for each vocabulary word to determine which of said vocabulary words most probably corresponds to the utterance to be recognized;
in which:
said using of said total word scores to determine which of said vocabulary words most probably corresponds to the utterance to be recognized includes using said total word scores to select which sub-set of said vocabulary words appear to warrant more extensive comparison against that utterance; and
performing such a more extensive comparison of the words of that sub-set against said utterance to determine which words in that sub-set most probably correspond to said utterance.

16. A speech recognition method comprising:
using a first method to calculate a first score for each of a plurality of vocabulary words based at least in part on descriptions of successive sounds from an utterance to be recognized, said scores indicating the probability that each vocabulary word corresponds to said utterance;
using a second method to calculate a second score for each said vocabulary words based at least in part on descriptions of the same successive sounds from said utterance, said second scores also indicating the probability that each vocabulary word corresponds to the utterance to be recognized;
combining said first and second scores for each of said vocabulary words to produce a combined score for that word;
using the combined score produced for each vocabulary word to determine which of said vocabulary words most probably corresponds to the utterance to be recognized;
wherein:
said first method calculates the first score for each vocabulary word by using a particular time alignment between the description of said successive sounds and a succession of acoustic models associated with that word and calculates a score based on that particular time alignment; and
said second method calculates the second score for each vocabulary word by comparing descriptions of said successive sounds against a plurality of acoustic models without using the particular time alignments used for each word by said first method.

17. A speech recognition method comprising:
using a first method to calculate a first score for each of a plurality of vocabulary words based at least in part on descriptions of successive sounds from an utterance to be recognized, said scores indicating the probability that each vocabulary word corresponds to said utterance;
using a second method to calculate a second score for each said vocabulary words based at least in part on descriptions of the same successive sounds from said utterance, said second scores also indicating the probability that each vocabulary word corresponds to the utterance to be recognized;
combining said first and second scores for each of said vocabulary words to produce a combined score for that word;
using the combined score produced for each vocabulary word to determine which of said vocabulary words most probably corresponds to the utterance to be recognized;
wherein:
said first method calculates the first score for each vocabulary word by comparing descriptions of said successive portions of said utterance against a succession of acoustic models associated with that word in a manner which makes the score associated with that word depend upon the order in which the sounds of those successive portions of the utterance are said; and
said second method calculates the second score for each vocabulary word by comparing descriptions of said successive sounds against a plurality of acoustic models in a manner which makes the score associated with the word independent of the order in which those successive sounds are said.
Dependent claims: 18

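Claims 16 and 17 combine an order-dependent score with an order-independent score for each word. The claims do not fix how the two are combined; the sketch below assumes a weighted sum of two scores on a comparable lower-is-better scale, and the function name, weight, and values are hypothetical.

```python
from typing import Dict


def combine_scores(first: Dict[str, float],
                   second: Dict[str, float],
                   weight: float = 0.5) -> Dict[str, float]:
    """Weighted sum of two per-word scores (the weighting rule is an assumption)."""
    # Only words scored by both methods receive a combined score.
    return {
        word: weight * first[word] + (1.0 - weight) * second[word]
        for word in first.keys() & second.keys()
    }


# Hypothetical scores; lower is assumed to mean a more probable word.
first = {"start": 2.0, "stop": 3.0, "train": 8.0}   # order-dependent method
second = {"start": 4.0, "stop": 1.0, "train": 9.0}  # order-independent method
combined = combine_scores(first, second)
print(min(combined, key=combined.get))  # → stop
```

In this example neither method alone settles the choice between "start" and "stop"; the combined score uses evidence from both.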
19. A speech recognition method comprising:
using a first method to calculate a first score for each of a plurality of vocabulary words based at least in part on descriptions of successive sounds from an utterance to be recognized, said scores indicating the probability that each vocabulary word corresponds to said utterance;
using a second method to calculate a second score for each of said vocabulary words based at least in part on descriptions of the same successive sounds from said utterance;
said second scores also indicating the probability that each vocabulary word corresponds to the utterance to be recognized;
combining said first and second scores for each of said vocabulary words to produce a combined score for that word;
using the combined score produced for each vocabulary word to determine which of said vocabulary words most probably corresponds to the utterance to be recognized;
wherein:
the method starts with an initial currently active vocabulary comprising a plurality of vocabulary words;
the method comprises a succession of word-selecting phases, each of which selects a sub-set of words from the currently active vocabulary, and makes that sub-set into the new currently active vocabulary;
one of said word-selecting phases comprises the using of said first and second methods to calculate first and second scores for each vocabulary word in the currently active vocabulary and then using the combined scores produced for each currently active vocabulary word to select said sub-set of that currently active vocabulary as the new currently active vocabulary.
Dependent claims: 20

21. A speech recognition method comprising:
storing a plurality of acoustic models, each of which represents a sound which occurs as part of one or more speech units;
finding a plurality of matches between acoustic models and successive portions of speech to be recognized;
in response to each such match, associating with a given period of the speech an evidence score for each of the one or more speech units; and
combining the one or more evidence scores for a given speech unit which are associated with a given region of the speech as a result of a plurality of such matches, to determine the probability that the given speech unit corresponds to that region of speech, with this combination being performed independently of the order in which the region scores are associated with the region of speech;
wherein each speech unit is a vocabulary word which the system is capable of recognizing.
Dependent claims: 22

23. A speech recognition method comprising:
storing a plurality of acoustic models, each of which represents a sound which occurs as part of one or more speech units;
finding a plurality of matches between acoustic models and successive portions of speech to be recognized;
in response to each such match, associating with a given period of the speech an evidence score for each of the one or more speech units; and
combining the one or more evidence scores for a given speech unit which are associated with a given region of the speech as a result of a plurality of such matches, to determine the probability that the given speech unit corresponds to that region of speech, with this combination being performed independently of the order in which the evidence scores are associated with the region of speech;
wherein:
the speech to be recognized is represented as a sequence of acoustic frames;
individual frames from the speech to be recognized are compared against a plurality of acoustic models to determine which frames match which acoustic models; and
the speech-unit-score information provided for a given speech unit and a given acoustic model is a value which indicates the probability that an individual acoustic frame taken randomly from a given portion of an utterance of the given speech unit will match the given acoustic model.

24. A speech recognition method comprising:
storing a plurality of acoustic models, each of which represents a sound which occurs as part of one or more speech units;
finding a plurality of matches between acoustic models and successive portions of speech to be recognized;
in response to each such match, associating with a given period of the speech an evidence score for each of the one or more speech units; and
combining the one or more evidence scores for a given speech unit which are associated with a given region of the speech as a result of a plurality of such matches, to determine the probability that the given speech unit corresponds to that region of speech, with this combination being performed independently of the order in which the evidence scores are associated with the region of speech;
in which the combining of evidence scores for each of a plurality of speech units is used to derive prefilter scores which are used to select which speech units receive a more detailed comparison against the speech to be recognized.

Specification