Method for speech recognition
Abstract
A method determines if a portion of speech corresponds to a speech pattern by time aligning both the speech and a plurality of speech pattern models against a common time-aligning model. This compensates for speech variation between the speech and the pattern models. The method then compares the resulting time-aligned speech model against the resulting time-aligned pattern models to determine which of the patterns most probably corresponds to the speech. Preferably there are a plurality of time-aligning models, each representing a group of somewhat similar sound sequences which occur in different words. Each of these time-aligning models is scored for similarity against a portion of speech, and the time-aligned speech model and time-aligned pattern models produced by time alignment with the best scoring time-aligning model are compared to determine the likelihood that each speech pattern corresponds to the portion of speech. This is performed for each successive portion of speech. When a portion of speech appears to correspond to a given speech pattern model, a range of likely start times is calculated for the vocabulary word associated with that model, and a word score is calculated to indicate the likelihood of that word starting in that range. The method uses a more computationally intensive comparison between the speech and selected vocabulary words, so as to more accurately determine which words correspond with which portions of the speech. When this more intensive comparison indicates the ending of a word at a given point in the speech, the method selects the best scoring vocabulary words whose range of start times overlaps that ending time, and performs the computationally intensive comparison on those selected words starting at that point in the speech.
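The time alignment described above, and the choice of the best scoring time-aligning model, can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes each acoustic description (frame) and each sub-model is a plain list of floats, that sub-models are visited strictly in order with at least one frame each, and that closeness is measured by squared Euclidean distance. The names `sq_dist`, `time_align`, and `best_time_aligning_model` are hypothetical.

```python
from math import inf


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def time_align(frames, sub_models):
    """Assign each frame to one sub-model, in order, minimizing total distance.

    Returns (total_cost, labels), where labels[t] is the index of the
    sub-model that frame t is aligned against. Assumption: every sub-model
    receives at least one frame.
    """
    T, K = len(frames), len(sub_models)
    if T < K:
        raise ValueError("need at least one frame per sub-model")
    cost = [[inf] * K for _ in range(T)]
    # stayed[t][k] is True when frame t kept the sub-model of frame t - 1.
    stayed = [[True] * K for _ in range(T)]
    cost[0][0] = sq_dist(frames[0], sub_models[0])
    for t in range(1, T):
        for k in range(min(t + 1, K)):
            stay = cost[t - 1][k]
            advance = cost[t - 1][k - 1] if k > 0 else inf
            stayed[t][k] = stay <= advance
            cost[t][k] = min(stay, advance) + sq_dist(frames[t], sub_models[k])
    # Trace back from the last frame, which must sit in the last sub-model.
    labels, k = [0] * T, K - 1
    for t in range(T - 1, -1, -1):
        labels[t] = k
        if t > 0 and not stayed[t][k]:
            k -= 1
    return cost[T - 1][K - 1], labels


def best_time_aligning_model(frames, models):
    """Return (name, cost, labels) for the lowest-cost time-aligning model."""
    scored = []
    for name, sub_models in models.items():
        total, labels = time_align(frames, sub_models)
        scored.append((total, name, labels))
    total, name, labels = min(scored)
    return name, total, labels
```

The returned `labels` list is the alignment from which a time-aligned speech model could then be derived, for example by averaging the frames assigned to each sub-model.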
85 Citations
20 Claims
1. A method of determining the probability that a given portion of speech to be recognized corresponds to a speech pattern, representing a common sound sequence occurring in one or more words, the method comprising:
time aligning a series of acoustic descriptions representing the speech to be recognized against a time-aligning model comprised of a series of acoustic sub-models;
deriving a time-aligned speech model having a series of acoustic sub-models, each of which is derived from the acoustic speech descriptions time aligned against a corresponding sub-model of the time-aligning model;
providing time-aligned acoustic models of each of a first class of speech patterns, each of which time-aligned pattern models is derived by;
time aligning a series of acoustic descriptions from one or more utterances of that speech pattern against the time-aligning model, and
deriving, for that pattern model, a series of acoustic sub-models, each of which is derived from the acoustic descriptions from those one or more utterances time aligned against a corresponding sub-model of the time-aligning model;
comparing the time-aligned speech model against each of a plurality of the time-aligned pattern models so as to produce a score for each such comparison as a function of how closely each sub-model of the speech model compares to its corresponding sub-model of a given pattern model;
selecting which speech patterns warrant a more computationally intensive comparison against the speech to be recognized in response to the scores produced for the comparisons between the speech model and the pattern models; and
performing that more computationally intensive comparison for the selected speech patterns in order to determine which of the selected speech patterns most probably corresponds to said speech to be recognized.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
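The comparing, selecting, and performing steps of claim 1 amount to a cheap prefilter followed by an expensive comparison on the survivors. The sketch below shows that flow under several assumptions that are not in the claim: the alignment is supplied as a list of sub-model indices (one per frame), each sub-model is derived as the mean of its aligned frames, scores are summed squared distances (lower is closer), and selection uses a fixed threshold. The `intensive_compare` callable is a hypothetical stand-in for the more computationally intensive comparison, which the claim does not specify.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def derive_time_aligned_model(frames, labels, n_sub_models):
    """Average the frames aligned against each sub-model of the aligning model."""
    dim = len(frames[0])
    sums = [[0.0] * dim for _ in range(n_sub_models)]
    counts = [0] * n_sub_models
    for frame, k in zip(frames, labels):
        counts[k] += 1
        for d, value in enumerate(frame):
            sums[k][d] += value
    if 0 in counts:
        raise ValueError("every sub-model needs at least one aligned frame")
    return [[s / counts[k] for s in sums[k]] for k in range(n_sub_models)]


def compare_models(speech_model, pattern_model):
    """Sum of distances between corresponding sub-models; lower is closer."""
    return sum(sq_dist(s, p) for s, p in zip(speech_model, pattern_model))


def prefilter(speech_model, pattern_models, threshold):
    """Score every pattern model and keep those at or below the threshold."""
    scores = {name: compare_models(speech_model, model)
              for name, model in pattern_models.items()}
    selected = sorted((n for n, s in scores.items() if s <= threshold),
                      key=scores.get)
    return scores, selected


def recognize(frames, labels, pattern_models, threshold, intensive_compare):
    """Run the prefilter, then the costly comparison on the selected patterns.

    intensive_compare(frames, name) is a hypothetical callable returning a
    cost (lower is better). Returns the best pattern name, or None if the
    prefilter selects nothing.
    """
    n_sub_models = len(next(iter(pattern_models.values())))
    speech_model = derive_time_aligned_model(frames, labels, n_sub_models)
    _, selected = prefilter(speech_model, pattern_models, threshold)
    if not selected:
        return None
    return min(selected, key=lambda name: intensive_compare(frames, name))
```

Because the speech model and every pattern model are derived against the same time-aligning model, sub-model `k` of one can be compared directly with sub-model `k` of the other, which is what keeps the prefilter cheap.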
14. A method of recognizing continuous speech comprising:
providing a plurality of acoustic cluster models, each of which includes a series of acoustic sub-models, said cluster models being derived by using dynamic programming to divide one or more utterances of each of a plurality of vocabulary words into series of corresponding segments;
deriving a series of sub-models for each such word, with each sub-model representing a group of corresponding segments from the one or more utterances of that word;
dividing the series of sub-models from the different words into clusters of relatively similar series; and
calculating a model for each such cluster which reflects the series of sub-models which have been grouped into that cluster;
performing a comparison between a portion of the speech to be recognized and each of a plurality of the cluster models, selecting the one or more cluster models against which the portion of speech compares most closely; and
performing further comparison between that portion of speech and the words whose series of sub-models have been associated with the one or more selected cluster models against which that portion of speech compares most closely.
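The clustering and cluster-based lookup in claim 14 can be sketched as follows. The claim does not name a clustering algorithm, so this example substitutes simple greedy "leader" clustering with a distance threshold. It further assumes that every word already has a series of sub-models of the same length (each a list of floats), that a cluster model is the element-wise mean of its members, and that similarity is summed squared distance. All function names are hypothetical.

```python
def series_dist(a, b):
    """Summed squared distance between two equal-length series of sub-models."""
    return sum(sum((x - y) ** 2 for x, y in zip(u, v)) for u, v in zip(a, b))


def mean_series(series_list):
    """Element-wise mean of several equal-shaped series of sub-models."""
    n = len(series_list)
    return [[sum(values) / n for values in zip(*sub_models)]
            for sub_models in zip(*series_list)]


def cluster_words(word_models, threshold):
    """Greedy leader clustering of word sub-model series (an assumed method).

    Each word joins the closest existing cluster if it lies within the
    threshold; otherwise it starts a new cluster. The cluster model is
    recomputed as the mean of its member series.
    """
    clusters = []
    for word, series in word_models.items():
        best = min(clusters,
                   key=lambda c: series_dist(series, c["model"]),
                   default=None)
        if best is not None and series_dist(series, best["model"]) <= threshold:
            best["words"].append(word)
            best["model"] = mean_series([word_models[w] for w in best["words"]])
        else:
            clusters.append({"words": [word],
                             "model": [list(v) for v in series]})
    return clusters


def candidate_words(speech_series, clusters, n_best=1):
    """Words belonging to the n_best cluster models closest to the speech."""
    ranked = sorted(clusters,
                    key=lambda c: series_dist(speech_series, c["model"]))
    return [word for c in ranked[:n_best] for word in c["words"]]
```

Only the words returned by `candidate_words` would then go on to the further comparison, so the number of expensive comparisons scales with the size of the selected clusters rather than with the whole vocabulary.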
15. A method of recognizing continuous speech comprising:
scanning a temporal acoustic representation of the speech for the occurrence of acoustic patterns, each of which occurs in one or more individual vocabulary words;
when such patterns are detected in the speech representation, calculating a range of start times in the representation at which it is most likely each of the one or more vocabulary words associated with that pattern started;
performing a computationally intensive comparison between a portion of the speech representation and each of a plurality of vocabulary words, including determining when there is a probability better than a given threshold that a match of a given vocabulary word against the speech representation has terminated at a given ending time in that representation, wherein said computationally intensive comparison between a given vocabulary word and a given portion of speech is more computationally intense than the scanning of that portion of speech for the occurrence of a given acoustic pattern; and
using those vocabulary words which have a range of start times which overlaps the given ending time as words against which to perform said intensive comparisons, starting approximately at the given ending time.
Dependent claims: 16, 17, 18, 19, 20
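The start-time bookkeeping in claim 15 can be illustrated with a short sketch. The claim does not say how the range of start times is calculated, so `start_range` below simply assumes that each pattern is known to begin between `min_offset` and `max_offset` frames after the start of its word. Times are frame indices, scores are prefilter scores where lower is better, and `max_words` is an arbitrary cap. All of these names and conventions are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class WordHypothesis:
    word: str
    start_lo: int   # earliest likely start frame
    start_hi: int   # latest likely start frame
    score: float    # prefilter score; lower is better in this sketch


def start_range(detect_time, min_offset, max_offset):
    """Likely word start range, given where its pattern was detected.

    Assumes the pattern begins between min_offset and max_offset frames
    after the start of the word (a hypothetical per-pattern statistic).
    """
    return detect_time - max_offset, detect_time - min_offset


def words_to_start_at(hypotheses, ending_time, max_words=3):
    """Best scoring words whose start range overlaps the given ending time."""
    overlapping = [h for h in hypotheses
                   if h.start_lo <= ending_time <= h.start_hi]
    overlapping.sort(key=lambda h: h.score)
    return [h.word for h in overlapping[:max_words]]
```

When the intensive comparison reports that a word has probably ended at frame `t`, calling `words_to_start_at(hypotheses, t)` yields the words on which to begin the next intensive comparison at approximately that frame.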
Specification