Speech recognition method
Abstract
Smoothed frame labeling associates phonetic frame labels with a given speech frame as a function of (a) the closeness with which the given frame compares to each of a plurality of acoustic models, (b) which frame labels correspond with a neighboring frame, and (c) transition probabilities which indicate, for the frame labels associated with the neighboring frame, which frame labels are probably associated with the given frame. The smoothed frame labeling is used to divide the speech into segments of frames having the same class of labels. The invention represents words as a collection of known diphone models, each of which models the sound before and after a boundary between segments derived by the smoothed frame labeling. At recognition time, the speech is divided into segments by smoothed frame labeling; diphone models are derived for each boundary between the resulting segments; and the resulting diphone models are compared against the known diphone models to determine which of the known diphone models match the segment boundaries in the speech. Then a combined-displaced-evidence method is used to determine which words occur in the speech. This method detects which acoustic patterns, in the form of the known diphone models, match various portions of the speech. In response to each such match, it associates with the speech an evidence score for each vocabulary word in which that pattern is known to occur. It displaces each such score from the location of its associated matched pattern by the known distance between that pattern and the beginning of the score's word. Then all the evidence scores for a word located in a given portion of the speech are combined to produce a score which indicates the probability of that word starting in that portion of the speech. This score is combined with a score produced by comparing a histogram from a portion of the speech against a histogram of each word. The resulting combined score determines whether a given word should undergo a more detailed comparison against the speech to be recognized.
225 Citations
29 Claims
-
1. A method of speech labeling comprising:
-
representing speech to be labeled as a sequence of acoustic frames; storing a plurality of acoustic frame models, each of which represents a certain class of sounds, with the acoustic models representing at least twenty different phonetic classes of sound; storing a plurality of transition probabilities, each of which indicates the probability that a frame associated with a first acoustic model will neighbor a frame associated with a second, but not necessarily different, acoustic model; associating one or more of the acoustic models with a given frame as a function of (a) the closeness with which the given frame compares to each of a plurality of the acoustic models, (b) an indication of which one or more of the acoustic models most probably correspond with a frame which neighbors the given frame, and (c) one or more transition probabilities which indicate, for one or more acoustic models associated with the neighboring frame, the probability that the given frame is associated with a given acoustic model; and labeling each frame in said sequence of acoustic frames with a label identifying said one or more associated acoustic models.
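The labeling step of claim 1 can be illustrated with a minimal sketch. It assumes one possible reading of the claim, a Viterbi-style dynamic program over log scores; the claim itself does not prescribe this procedure. The names `smooth_labels`, `frame_scores`, and `log_trans` are hypothetical, and the sketch assigns a single model per frame although the claim allows one or more.

```python
import math

def smooth_labels(frame_scores, log_trans):
    """Pick one acoustic model per frame (illustrative, not the patent's procedure).

    frame_scores[t][m]: log-likelihood of frame t under model m (the "closeness").
    log_trans[p][m]: log probability that a frame of model p is followed by model m.
    """
    n_models = len(frame_scores[0])
    best = list(frame_scores[0])   # best path score ending in each model
    back = []                      # back[t-1][m]: best predecessor of model m at frame t
    for scores in frame_scores[1:]:
        prev, best, ptr = best, [], []
        for m in range(n_models):
            # Choose the neighboring-frame model that best explains model m here.
            p = max(range(n_models), key=lambda q: prev[q] + log_trans[q][m])
            best.append(prev[p] + log_trans[p][m] + scores[m])
            ptr.append(p)
        back.append(ptr)
    # Trace the best-scoring path backwards to recover one label per frame.
    m = max(range(n_models), key=lambda q: best[q])
    path = [m]
    for ptr in reversed(back):
        m = ptr[m]
        path.append(m)
    path.reverse()
    return path

# Hypothetical data: two models with "sticky" transitions; frame 2 is a noisy outlier.
log_trans = [[math.log(0.9), math.log(0.1)],
             [math.log(0.1), math.log(0.9)]]
frame_scores = [[math.log(0.9), math.log(0.1)],
                [math.log(0.9), math.log(0.1)],
                [math.log(0.4), math.log(0.6)],
                [math.log(0.9), math.log(0.1)]]
labels = smooth_labels(frame_scores, log_trans)
```

In this example a frame-by-frame best match would label frame 2 with model 1, while the transition probabilities keep the smoothed labels on model 0 throughout.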
-
-
2. A method of speech segmentation comprising:
-
representing speech to be segmented as a sequence of acoustic frames; storing a plurality of acoustic frame models, each of which represents a certain type of sound, with each such acoustic model being associated with a class of one or more such acoustic models; storing a plurality of transition probabilities, each of which indicates the probability that a frame associated with a first acoustic model will neighbor a frame associated with a second, but not necessarily different, acoustic model; associating one or more of the acoustic models with each of a sequence of frames, with the acoustic model associated with each given frame of the sequence being selected as a function of (a) the closeness with which the given frame compares to each of a plurality of the acoustic models, (b) an indication of which one or more of the acoustic models most probably correspond with a frame which neighbors the given frame, and (c) one or more transition probabilities which indicate, for one or more acoustic models associated with the neighboring frame, the probability that the given frame is associated with a given acoustic model; comparing the class of the acoustic models associated with neighboring frames in the sequence of frames to detect where in that sequence one or more boundaries occur between regions associated with different classes of acoustic models; marking the subsequence of frames between each boundary in said sequence of frames and the next boundary in said sequence as a segment.
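The boundary-detection and segment-marking steps of claim 2 might be sketched as follows, assuming each frame already carries a single model label. The function name `segment_by_class` and the `model_class` mapping are hypothetical, and treating the runs before the first boundary and after the last boundary as segments is an assumption of this sketch.

```python
def segment_by_class(frame_labels, model_class):
    """Split labeled frames into runs whose labels share a class.

    frame_labels: one acoustic-model label per frame.
    model_class: maps each model label to its class.
    Returns a list of (start, end_exclusive, class) tuples.
    """
    segments = []
    start = 0
    for t in range(1, len(frame_labels) + 1):
        at_end = t == len(frame_labels)
        # A boundary falls between frames t-1 and t when their classes differ.
        if at_end or model_class[frame_labels[t]] != model_class[frame_labels[start]]:
            segments.append((start, t, model_class[frame_labels[start]]))
            start = t
    return segments

# Hypothetical labels: models 0, 1 and 3 belong to class "V", model 2 to class "S".
segments = segment_by_class([0, 1, 2, 2, 3], {0: "V", 1: "V", 2: "S", 3: "V"})
```

Note that frames 0 and 1 carry different model labels but fall in one segment, because the comparison is made on the class of the model rather than on the model itself.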
-
-
3. A method of word hypothesization in continuous speech comprising:
-
representing speech to be analyzed as a sequence of acoustic frames; storing a plurality of acoustic frame models, each of which represents a certain type of sound; storing a plurality of transition probabilities, each of which indicates the probability that a frame associated with a first acoustic model will neighbor a frame associated with a second, but not necessarily different, acoustic model; associating one or more of the acoustic models with each of a sequence of frames, with the acoustic model associated with each given frame of the sequence being selected as a function of (a) the closeness with which the given frame compares to each of a plurality of the acoustic models, (b) an indication of which one or more of the acoustic models most probably correspond with a frame which neighbors the given frame, and (c) one or more transition probabilities which indicate, for one or more acoustic models associated with the neighboring frame, the probability that the given frame is associated with a given acoustic model; using the one or more acoustic models associated with the individual frames in the sequence of frames to select which of a plurality of vocabulary word models are to be hypothesized as occurring at or near the speech represented by the said frames associated with said one or more acoustic models; and hypothesizing said selected vocabulary word models. - View Dependent Claims (4)
-
-
5. A method of speech-unit hypothesization in continuous speech comprising:
-
storing a model of each of a plurality of speech units, where each such speech unit model associates an individual diphone model with each of a plurality of segment boundaries associated with its speech unit, where each such segment boundary is located between two sound segments, each of which represents a succession of sounds from the speech unit which are relatively similar to each other, and where each diphone model includes a pre-boundary model of the sound preceding its associated segment boundary and a post-boundary model of the sound following that boundary; dividing a portion of speech to be analyzed into a plurality of segments of relatively acoustically similar portions; deriving a diphone model of each of a plurality of boundaries between such segments, with each such diphone model including a pre-boundary model of the sound preceding its associated segment boundary and a post-boundary model of the sound following that boundary; matching the diphone models derived from the speech to be analyzed against the diphone models associated with speech-unit models to determine which speech units most probably correspond to a given portion of the speech to be analyzed; and hypothesizing said most probable speech units. - View Dependent Claims (6, 7, 8, 9)
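A rough sketch of the diphone derivation and matching steps of claim 5 follows. The claim does not say how the pre-boundary and post-boundary models are built or compared; averaging a few frames on each side of the boundary and using squared Euclidean distance are assumptions made purely for illustration, and all names (`derive_diphones`, `closest_diphone`, `context`) are hypothetical.

```python
def mean_vector(frames):
    """Component-wise mean of a non-empty list of equal-length frame vectors."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]

def derive_diphones(frames, segments, context=2):
    """Build one (pre, post) pair per boundary between adjacent segments.

    segments: list of non-empty (start, end_exclusive) frame ranges in order.
    context: how many frames on each side of the boundary to average (assumed).
    """
    diphones = []
    for (s0, e0), (s1, e1) in zip(segments, segments[1:]):
        pre = mean_vector(frames[max(s0, e0 - context):e0])
        post = mean_vector(frames[s1:min(e1, s1 + context)])
        diphones.append((pre, post))
    return diphones

def closest_diphone(diphone, known):
    """Return the name of the known diphone model nearest to the derived one."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    pre, post = diphone
    return min(known, key=lambda k: dist(pre, known[k][0]) + dist(post, known[k][1]))

# Hypothetical one-dimensional frames with a single boundary between frames 1 and 2.
frames = [[0.0], [0.0], [1.0], [1.0]]
derived = derive_diphones(frames, [(0, 2), (2, 4)])
known = {"low-high": ([0.0], [1.0]), "high-low": ([1.0], [0.0])}
match = closest_diphone(derived[0], known)
```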
-
-
10. A method of speech-unit hypothesization in continuous speech comprising:
-
storing a plurality of acoustic patterns; associating with each such acoustic pattern the occurrences of that pattern which are known to occur in one or more speech units, and storing for each such known occurrence the speech unit in which it occurs and a temporal displacement indication which indicates the temporal distance, during utterances of that speech unit, between that occurrence and a given reference point in the speech unit; detecting a plurality of matches between such acoustic patterns and various portions of the speech to be analyzed; producing, in response to each such detected match, a temporal distribution of one or more evidence scores in association with each of one or more of said known occurrences of the matched acoustic pattern, with each such temporal distribution being produced for the given speech unit in which its associated known occurrence of the matched pattern occurs, and with each such temporal distribution being displaced in the speech relative to its associated matched acoustic pattern as a function of the temporal displacement indication associated with its known occurrence of the matched pattern; calculating a speech-unit-probability score indicating the probability that a given speech unit occurs in the speech to be recognized in association with a given combining time, including combining the evidence scores for the given speech unit, if any, which are associated with the given combining time in the speech, with the combining of evidence scores including, in some instances, the combining of evidence scores for the given speech unit from different temporal distributions which have different locations relative to the given portion of speech to be analyzed; and hypothesizing as occurring at the point in time corresponding to said given portion of speech, one or more speech units which have the highest speech-unit-probability scores for said given portion of speech. 
- View Dependent Claims (11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22)
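The displaced-evidence idea of claim 10 can be sketched under some simplifying assumptions: the reference point is the start of the speech unit, each temporal distribution is a small triangular spread around the displaced time, and evidence scores are combined by summation. None of these choices is stated in the claim, and the names below are hypothetical.

```python
from collections import defaultdict

def displaced_evidence(matches, occurrences, spread=1):
    """Accumulate evidence for each word at its implied start time.

    matches: list of (pattern, frame_time, score) found in the speech.
    occurrences: pattern -> list of (word, offset_from_word_start).
    Returns {word: {start_time: combined_score}}.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for pattern, time, score in matches:
        for word, offset in occurrences.get(pattern, []):
            center = time - offset  # displace the evidence back to the word start
            for dt in range(-spread, spread + 1):
                weight = 1.0 / (1 + abs(dt))  # assumed shape of the distribution
                totals[word][center + dt] += score * weight
    return totals

def best_word_at(totals, time):
    """Word with the highest combined evidence at the given time (totals must be non-empty)."""
    return max(totals, key=lambda w: totals[w].get(time, 0.0))

# Hypothetical patterns: "k-ae" occurs 2 frames into "cat"; "ae-t" occurs 6 frames
# into both "cat" and "bat".
occurrences = {"k-ae": [("cat", 2)], "ae-t": [("cat", 6), ("bat", 6)]}
matches = [("k-ae", 12, 1.0), ("ae-t", 16, 1.0)]
totals = displaced_evidence(matches, occurrences)
winner = best_word_at(totals, 10)
```

Both matches point to frame 10 as the start of "cat", so their evidence adds up there, while "bat" receives evidence from only one match.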
-
-
23. A method of speech-unit hypothesization comprising:
-
storing a plurality of acoustic models, each of which represents a given class of sounds which occurs as part of one or more speech units; associating with each of a plurality of speech units one or more of the acoustic models, with each occurrence of an acoustic model associated with a given speech unit having a corresponding temporal displacement indication which indicates the temporal displacement, during utterances of the speech unit, between the occurrence of that acoustic model and a given reference point in that speech unit; finding, for each of a plurality of the acoustic models, evidence for one or more matches of that acoustic model against successive portions of the speech to be analyzed; calculating a speech-unit-probability score, which indicates the probability that a given speech unit occurs in a given region of the speech, as a function of (a) the evidence found that one or more acoustic models associated with the given speech unit match one or more portions in the speech, (b) the location in the speech at which the evidence for such matches is found, and (c) the temporal displacement indications associated, for the given speech unit, with the acoustic model for which such evidence is found, where whether or not a contribution is made to a given speech-unit-probability score as a result of the evidence found for each of a plurality of matches of acoustic models against the speech to be analyzed is independent of what, if any, evidence is found for the match of the other acoustic models associated with that speech unit; and hypothesizing as occurring at or near the point in time corresponding to said given region of speech, one or more speech units which have the highest speech-unit-probability scores for said given region of speech. - View Dependent Claims (24, 25)
-
-
26. A method of speech-unit hypothesization comprising:
-
storing a plurality of acoustic models, each of which represents a given class of sound which occurs as part of one or more speech units; associating with each of a plurality of speech units one or more of the acoustic models, with each occurrence of an acoustic model associated with a given speech unit having a corresponding temporal displacement indication which indicates the temporal displacement, during utterances of the speech unit, between the occurrence of that acoustic model and a given reference point in that speech unit; finding, for each of a plurality of the acoustic models, evidence for one or more matches of that acoustic model against successive portions of speech to be analyzed; and as a result of the evidence found in the speech for each of a plurality of matches of acoustic models associated with a given speech unit, associating one or more evidence scores for the speech unit with the speech in a temporal distribution determined as a function of the temporal displacement indication for that acoustic model in that speech unit, with that temporal distribution being independent of the temporal distribution of the evidence scores associated with the speech as a result of any other matches of acoustic models associated with the given speech unit; combining the resulting one or more evidence scores for the given speech unit which are associated with a given combining time within the speech so as to calculate a speech-unit-probability score that indicates the probability that the given speech unit occurs in the speech in association with the combining time; and hypothesizing as occurring at said combining time one or more speech units which have the highest speech-unit-probability scores for said given combining time.
-
-
27. A method of speech-unit hypothesization comprising:
-
storing a plurality of acoustic models, each of which represents a given class of sounds which occurs as part of one or more speech units; associating with each of a plurality of speech units one or more of the acoustic models, with each occurrence of an acoustic model associated with a given speech unit having a corresponding temporal displacement indication which indicates the temporal displacement, during utterances of the speech unit, between the occurrence of that acoustic model and a given reference point in that speech unit; finding, for each of a plurality of the acoustic models, evidence for one or more matches of that acoustic model against successive portions of speech to be analyzed; calculating a speech-unit-probability score for a given speech unit, which indicates the probability that the given speech unit occurs in the speech in association with a given scoring time in the speech, said speech-unit-probability score being calculated by (1) associating with each of a plurality of acoustic models associated with the speech unit a range of expected times determined relative to the scoring time as a function of the temporal displacement indication associated with each such acoustic model for the given speech unit;
(2) producing an evidence score for each acoustic model for which evidence of a match is found during the range of expected times associated with that acoustic model and the given speech unit; and
(3) combining the evidence scores so produced for the given speech unit; and
hypothesizing as occurring at said given scoring time one or more speech units which have the highest speech-unit-probability scores for said given scoring time.
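Steps (1) to (3) of claim 27 score a speech unit from the point of view of a fixed scoring time. A minimal sketch follows, assuming the range of expected times is a fixed tolerance around the displaced time, that the strongest evidence in the range is taken per model, and that scores are combined by summation. These choices and the names `score_unit_at`, `unit_models`, and `tolerance` are illustrative assumptions.

```python
def score_unit_at(scoring_time, unit_models, match_evidence, tolerance=1):
    """Score one speech unit at one scoring time.

    unit_models: list of (model, offset) pairs, offset measured from the reference point.
    match_evidence: model -> {time: evidence_score} for matches found in the speech.
    """
    total = 0.0
    for model, offset in unit_models:
        # (1) range of expected times for this model, relative to the scoring time
        expected = range(scoring_time + offset - tolerance,
                         scoring_time + offset + tolerance + 1)
        # (2) evidence found for this model within its expected range
        found = [match_evidence.get(model, {}).get(t, 0.0) for t in expected]
        # (3) combine the per-model evidence (summation is an assumption)
        total += max(found)
    return total

# Hypothetical unit made of model "m1" at offset 0 and model "m2" at offset 3.
unit_models = [("m1", 0), ("m2", 3)]
match_evidence = {"m1": {5: 0.5}, "m2": {9: 0.25}}
score = score_unit_at(5, unit_models, match_evidence)
```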
-
-
28. A method of speech-unit hypothesization in continuous speech comprising:
-
storing a plurality of acoustic models, each of which represents a given class of sound which occurs as part of one or more speech units; storing a histogram for each of a plurality of speech units, with each such histogram indicating, for each of a plurality of acoustic models, the total probable number of matches between that acoustic model and a given portion of one or more utterances of that speech unit; calculating a histogram for a portion of speech to be analyzed which indicates the total probable number of matches between each of the given plurality of acoustic models and that portion of speech; comparing the histogram calculated for the portion of speech to be analyzed against the histograms for each of a plurality of speech units to calculate speech-unit-probability scores which indicate which speech units most probably correspond to the speech to be analyzed; and hypothesizing as occurring at or near the time corresponding to said portion of speech one or more speech units with the highest speech-unit-probability scores for said portion of speech;
wherein: the histogram which is calculated for the speech to be analyzed derives most of its information from a sampling window, that is, a portion of the speech to be analyzed which is approximately the same length as the portions of speech from which the histograms of the speech-unit models are made; this histogram is repeatedly re-calculated, with the sampling window being shifted relative to the speech to be analyzed in successive re-calculations; the resulting histograms which are calculated for each of a plurality of different positions of the sampling window are each compared against the histograms of the speech units to determine which of those speech units most probably correspond to each of those different sampling window positions. - View Dependent Claims (29)
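The sliding-window histogram comparison of claim 28 might look like the sketch below. It simplifies the claim in two labeled ways: the histograms hold hard counts of frame labels rather than "total probable" match counts, and histograms are compared by a sum of absolute differences. The function names and the `step` parameter are hypothetical.

```python
from collections import Counter

def window_histograms(frame_labels, window, step=1):
    """Yield (start, histogram) for each position of the sampling window."""
    for start in range(0, len(frame_labels) - window + 1, step):
        yield start, Counter(frame_labels[start:start + window])

def histogram_distance(h1, h2):
    """Sum of absolute count differences over all labels (assumed comparison)."""
    keys = set(h1) | set(h2)
    return sum(abs(h1.get(k, 0) - h2.get(k, 0)) for k in keys)

def hypothesize(frame_labels, word_histograms, window, step=1):
    """For each window position, pick the word whose stored histogram is closest."""
    results = []
    for start, hist in window_histograms(frame_labels, window, step):
        word = min(word_histograms,
                   key=lambda w: histogram_distance(hist, word_histograms[w]))
        results.append((start, word))
    return results

# Hypothetical frame labels and stored word histograms.
frame_labels = ["a", "a", "b", "c", "c", "c"]
word_histograms = {"ab": {"a": 2, "b": 1}, "cc": {"c": 3}}
hypotheses = hypothesize(frame_labels, word_histograms, window=3, step=3)
```

Because a histogram ignores the order of labels inside the window, this comparison is cheap, which fits its use in the abstract as a score that decides whether a word deserves a more detailed comparison.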
-
Specification