Speech recognition method

US 4,803,729 A
Filed: 04/03/1987
Issued: 02/07/1989
Est. Priority Date: 04/03/1987
Status: Expired due to Fees

Abstract

Smoothed frame labeling associates phonetic frame labels with a given speech frame as a function of (a) the closeness with which the given frame compares to each of a plurality of acoustic models, (b) which frame labels correspond with a neighboring frame, and (c) transition probabilities which indicate, for the frame labels associated with the neighboring frame, which frame labels are probably associated with the given frame. The smoothed frame labeling is used to divide the speech into segments of frames having the same class of labels. The invention represents words as a collection of known diphone models, each of which models the sound before and after a boundary between segments derived by the smoothed frame labeling. At recognition time, the speech is divided into segments by smoothed frame labeling; diphone models are derived for each boundary between the resulting segments; and the resulting diphone models are compared against the known diphone models to determine which of the known diphone models match the segment boundaries in the speech. Then a combined-displaced-evidence method is used to determine which words occur in the speech. This method detects which acoustic patterns, in the form of the known diphone models, match various portions of the speech. In response to each such match, it associates with the speech an evidence score for each vocabulary word in which that pattern is known to occur. It displaces each such score from the location of its associated matched pattern by the known distance between that pattern and the beginning of the score's word. Then all the evidence scores for a word located in a given portion of the speech are combined to produce a score which indicates the probability of that word starting in that portion of the speech. This score is combined with a score produced by comparing a histogram from a portion of the speech against a histogram of each word. The resulting combined score determines whether a given word should undergo a more detailed comparison against the speech to be recognized.

225 Citations

29 Claims

1. A method of speech labeling comprising:
- representing speech to be labeled as a sequence of acoustic frames;
  
  storing a plurality of acoustic frame models, each of which represents a certain class of sounds, with the acoustic models representing at least twenty different phonetic classes of sound;
  
  storing a plurality of transition probabilities, each of which indicates the probability that a frame associated with a first acoustic model will neighbor a frame associated with a second, but not necessarily different, acoustic model;
  
  associating one or more of the acoustic models with a given frame as a function of (a) the closeness with which the given frame compares to each of a plurality of the acoustic models, (b) an indication of which one or more of the acoustic models most probably correspond with a frame which neighbors the given frame, and (c) one or more transition probabilities which indicate, for one or more acoustic models associated with the neighboring frame, the probability that the given frame is associated with a given acoustic model; and
  
  labeling each frame in said sequence of acoustic frames with a label identifying said one or more associated acoustic models.
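
Elements (a) to (c) of claim 1 can be read as a Viterbi-style dynamic program over the frame sequence. The sketch below is one such reading, not the patent's stated implementation: `smooth_labels`, `frame_scores`, and `trans_prob` are illustrative names, closeness is assumed to be a log-likelihood, and all transition probabilities are assumed to be strictly positive.

```python
import math

def smooth_labels(frame_scores, trans_prob):
    """Pick one acoustic model per frame using closeness plus transitions.

    frame_scores[t][m]: log-likelihood of frame t under model m (assumed form of "closeness").
    trans_prob[p][m]: probability that a frame of model p is followed by a frame of model m.
    Returns a list with one model index per frame.
    """
    n_models = len(frame_scores[0])
    best = list(frame_scores[0])      # best path score ending in each model
    back = []                         # back-pointers, one list per later frame
    for scores in frame_scores[1:]:
        new_best, pointers = [], []
        for m in range(n_models):
            # Most probable predecessor model for model m at this frame.
            prev = max(range(n_models),
                       key=lambda p: best[p] + math.log(trans_prob[p][m]))
            new_best.append(best[prev] + math.log(trans_prob[prev][m]) + scores[m])
            pointers.append(prev)
        best = new_best
        back.append(pointers)
    # Trace the best path backwards from the best final model.
    label = max(range(n_models), key=lambda m: best[m])
    path = [label]
    for pointers in reversed(back):
        label = pointers[label]
        path.append(label)
    path.reverse()
    return path
```

With transitions that favor staying in the same model, a single frame that marginally prefers another model is relabeled to match its neighbors, which is the smoothing effect the abstract describes.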

2. A method of speech segmentation comprising:
- representing speech to be segmented as a sequence of acoustic frames;
  
  storing a plurality of acoustic frame models, each of which represents a certain type of sound, with each such acoustic model being associated with a class of one or more such acoustic models;
  
  storing a plurality of transition probabilities, each of which indicates the probability that a frame associated with a first acoustic model will neighbor a frame associated with a second, but not necessarily different, acoustic model;
  
  associating one or more of the acoustic models with each of a sequence of frames, with the acoustic model associated with each given frame of the sequence being selected as a function of (a) the closeness with which the given frame compares to each of a plurality of the acoustic models, (b) an indication of which one or more of the acoustic models most probably correspond with a frame which neighbors the given frame, and (c) one or more transition probabilities which indicate, for one or more acoustic models associated with the neighboring frame, the probability that the given frame is associated with a given acoustic model;
  
  comparing the class of the acoustic models associated with neighboring frames in the sequence of frames to detect where in that sequence one or more boundaries occur between regions associated with different classes of acoustic models;
  
  marking the subsequence of frames between each boundary in said sequence of frames and the next boundary in said sequence as a segment.
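
The boundary detection and segment marking steps of claim 2 can be sketched as a single pass over the frame labels. This is a minimal illustration under assumptions: `segment_by_class` and `model_class` are hypothetical names, and each frame is assumed to carry exactly one model label.

```python
def segment_by_class(labels, model_class):
    """Split a frame label sequence into segments of constant class.

    labels: one acoustic model id per frame.
    model_class: mapping from model id to its class.
    Returns (start_frame, end_frame_exclusive, class) triples.
    """
    segments = []
    start = 0
    for t in range(1, len(labels) + 1):
        # A boundary occurs at the end of the sequence or where the class changes.
        if t == len(labels) or model_class[labels[t]] != model_class[labels[start]]:
            segments.append((start, t, model_class[labels[start]]))
            start = t
    return segments
```

Because the comparison is on classes rather than on individual models, neighboring frames labeled with different models of the same class stay in one segment.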

3. A method of word hypothesization in continuous speech comprising:
- representing speech to be analyzed as a sequence of acoustic frames;
  
  storing a plurality of acoustic frame models, each of which represents a certain type of sound;
  
  storing a plurality of transition probabilities, each of which indicates the probability that a frame associated with a first acoustic model will neighbor a frame associated with a second, but not necessarily different, acoustic model;
  
  associating one or more of the acoustic models with each of a sequence of frames, with the acoustic model associated with each given frame of the sequence being selected as a function of (a) the closeness with which the given frame compares to each of a plurality of the acoustic models, (b) an indication of which one or more of the acoustic models most probably correspond with a frame which neighbors the given frame, and (c) one or more transition probabilities which indicate, for one or more acoustic models associated with the neighboring frame, the probability that the given frame is associated with a given acoustic model;
  
  using the one or more acoustic models associated with the individual frames in the sequence of frames to select which of a plurality of vocabulary word models are to be hypothesized as occurring at or near the speech represented by the said frames associated with said one or more acoustic models; and
  
  hypothesizing said selected vocabulary word models.
- Dependent claims (4)
- - 4. A continuous speech recognition method comprising said method of word hypothesization as described in claim 3 and further comprising:
    - using dynamic programming to match each of said hypothesized word models against the portion of said sequence of acoustic frames associated with said hypothesized word models to determine the probability of the word represented by each of said hypothesized word models as being the word corresponding to said portion of said sequence of acoustic frames; and
      
      selecting the sequence of words with the highest probability as computed by said dynamic programming as the recognized sequence of words.

5. A method of speech-unit hypothesization in continuous speech comprising:
- storing a model of each of a plurality of speech units, where each such speech unit model associates an individual diphone model with each of a plurality of segment boundaries associated with its speech unit, where each such segment boundary is located between two sound segments, each of which represents a succession of sounds from the speech unit which are relatively similar to each other, and where each diphone model includes a pre-boundary model of the sound preceding its associated segment boundary and a post-boundary model of the sound following that boundary;
  
  dividing a portion of speech to be analyzed into a plurality of segments of relatively acoustically similar portions;
  
  deriving a diphone model of each of a plurality of boundaries between such segments, with each such diphone model including a pre-boundary model of the sound preceding its associated segment boundary and a post-boundary model of the sound following that boundary;
  
  matching the diphone models derived from the speech to be analyzed against the diphone models associated with speech-unit models to determine which speech units most probably correspond to a given portion of the speech to be analyzed; and
  
  hypothesizing said most probable speech units.
- Dependent claims (6, 7, 8, 9)
- - 6. A continuous speech recognition method comprising said method of speech-unit hypothesization as described in claim 5 wherein said speech-units are words, and further comprising:
    - using dynamic programming to match each of said hypothesized words against the portion of said sequence of acoustic frames associated with said hypothesized word to determine the probability of each of said hypothesized words as being the word corresponding to said portion of said sequence of acoustic frames; and
      
      selecting the sequence of words with the highest probability as computed by said dynamic programming as the recognized sequence of words.
  - 7. A method of speech-unit hypothesization as described in claim 5, wherein said dividing of a portion of speech to be recognized into a plurality of segments comprises:
    - representing the speech to be analyzed as a sequence of acoustic frames;
      
      storing a plurality of acoustic frame models, each of which represents a certain type of sound, with each such acoustic frame model being associated with a class of one or more such acoustic frame models;
      
      storing a plurality of transition probabilities, each of which indicates the probability that a frame associated with a first acoustic frame model will neighbor a frame associated with a second, but not necessarily different, acoustic frame model;
      
      associating one or more of the acoustic frame models with each of a sequence of frames, with the acoustic frame model associated with each given frame of the sequence being selected as a function of (a) the closeness with which the given frame compares to each of a plurality of the acoustic frame models, (b) an indication of which one or more of the acoustic frame models most probably correspond with a frame which neighbors the given frame, and (c) one or more transition probabilities which indicate, for one or more acoustic models associated with the neighboring frame, the probability that the given frame is associated with a given acoustic model; and
      
      comparing the class of the acoustic frame models associated with neighboring frames in the sequence of frames to detect where in that sequence one or more boundaries occur between regions associated with different classes of acoustic frame models, wherein said boundaries are used in said step of dividing said portion of speech into a plurality of segments.
  - 8. A method of speech-unit hypothesization as described in claim 5, wherein the speech units are words.
  - 9. A method of speech-unit hypothesization as described in claim 5, wherein the diphone models associated with each speech unit which are compared against diphone models from the speech to be recognized are diphone-type models which are derived by:
    - dividing one or more prior utterances of the speech unit into segments of relatively acoustically similar portions;
      
      deriving initial diphone models for the boundaries which result from such segmentation, with each such initial diphone model having a pre-boundary model of the sound preceding its segment boundary and a post-boundary model of the sound following that boundary; and
      
      representing groups of relatively similar initial diphone models from different speech units with diphone-type models, with each initial diphone model in such a group being represented by a common diphone-type model, and with each diphone-type model having two sub-models, one representing the pre-boundary models of its associated initial diphone models, and one representing the post-boundary models of those initial diphone models.
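
The diphone derivation and matching steps of claim 5 can be illustrated as follows. The claims do not say what form the pre-boundary and post-boundary models take, so this sketch assumes each is the mean of a few frame parameter vectors on its side of the boundary, and assumes squared Euclidean distance for matching. All function names and the `width` parameter are hypothetical.

```python
def mean_vector(frames):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]

def derive_diphones(frames, boundaries, width=2):
    """Build one (pre_model, post_model) pair per segment boundary.

    frames: list of equal-length parameter vectors, one per frame.
    boundaries: frame indices at which a new segment starts.
    width: assumed number of frames averaged on each side of a boundary.
    """
    diphones = []
    for b in boundaries:
        pre = frames[max(0, b - width):b]
        post = frames[b:b + width]
        if pre and post:  # skip boundaries with no frames on one side
            diphones.append((mean_vector(pre), mean_vector(post)))
    return diphones

def diphone_distance(a, b):
    """Squared Euclidean distance summed over both halves of two diphone models."""
    return sum((x - y) ** 2
               for half_a, half_b in zip(a, b)
               for x, y in zip(half_a, half_b))

def best_match(observed, known):
    """Return the name of the known diphone model closest to the observed one."""
    return min(known, key=lambda name: diphone_distance(observed, known[name]))
```

Keeping the two halves separate means a model distinguishes the direction of a transition: a silence-to-vowel boundary does not match a vowel-to-silence one.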

10. A method of speech-unit hypothesization in continuous speech comprising:
- storing a plurality of acoustic patterns;
  
  associating with each such acoustic pattern the occurrences of that pattern which are known to occur in one or more speech units, and storing for each such known occurrence the speech unit in which it occurs and a temporal displacement indication which indicates the temporal distance, during utterances of that speech unit, between that occurrence and a given reference point in the speech unit;
  
  detecting a plurality of matches between such acoustic patterns and various portions of the speech to be analyzed;
  
  producing, in response to each such detected match, a temporal distribution of one or more evidence scores in association with each of one or more of said known occurrences of the matched acoustic pattern, with each such temporal distribution being produced for the given speech unit in which its associated known occurrence of the matched pattern occurs, and with each such temporal distribution being displaced in the speech relative to its associated matched acoustic pattern as a function of the temporal displacement indication associated with its known occurrence of the matched pattern;
  
  calculating a speech-unit-probability score indicating the probability that a given speech unit occurs in the speech to be recognized in association with a given combining time, including combining the evidence scores for the given speech unit, if any, which are associated with the given combining time in the speech, with the combining of evidence scores including, in some instances, the combining of evidence scores for the given speech unit from different temporal distributions which have different locations relative to the given portion of speech to be analyzed; and
  
  hypothesizing as occurring at the point in time corresponding to said given portion of speech, one or more speech units which have the highest speech-unit-probability scores for said given portion of speech.
- Dependent claims (11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22)
- - 11. A continuous speech recognition method comprising said method of speech-unit hypothesization as described in claim 10 wherein said speech-units are words, and further comprising:
    - using dynamic programming to match each of said hypothesized words against the portion of speech to be analyzed associated with said hypothesized word to determine the probability of each of said hypothesized words as being the word corresponding to said portion of speech; and
      
      selecting the sequence of words with the highest probability as computed by said dynamic programming as the recognized sequence of words.
  - 12. A method of speech-unit hypothesization as described in claim 10, wherein:
    - the speech to be analyzed is divided into a plurality of acoustic segments, each of which represents a portion of speech associated with a given class of sounds, with the length of time associated with a given segment being determined by the length of the speech signal associated with that segment's class of sound;
      
      the temporal displacement indication associated with a known occurrence of a given acoustic pattern in a given speech unit indicates the number of such segments, during utterances of that speech unit, which exist between the occurrence of the given acoustic pattern and a given reference point in the speech unit; and
      
      the temporal distribution of evidence scores associated with the speech for a given speech unit in association with the match of a given acoustic pattern is displaced relative to that match by a number of such segments determined as a function of the temporal displacement indication associated with that acoustic pattern for that speech unit.
  - 13. A method of speech-unit hypothesization as described in claim 10, wherein:
    - the speech to be analyzed is represented as a sequence of evenly timed acoustic frames;
      
      the temporal displacement indication associated with a known occurrence of a given acoustic pattern in a given speech unit indicates the number of frames, during utterances of the speech unit, between the known occurrence of the given acoustic pattern and a given reference point in the speech unit; and
      
      the temporal distribution of evidence scores associated with the speech for a given speech unit in association with the match of a given acoustic pattern is displaced relative to that match by a number of frames determined as a function of the temporal displacement indication associated with that acoustic pattern for that speech unit.
  - 14. A method of speech-unit hypothesization as described in claim 10, wherein the speech-unit-probability score calculated for a given speech unit is normalized as a function of the number of acoustic patterns for which matches are normally found as a result of an occurrence of the given speech unit in the speech to be analyzed.
  - 15. A method of speech-unit hypothesization as described in claim 10, wherein:
    - the acoustic patterns are diphone models;
      
      the diphone models associated with each speech unit are derived by dividing one or more prior utterances of the speech unit into segments of relatively acoustically similar portions, and each diphone model includes a pre-boundary model of the sound preceding one boundary between such segments and a post-boundary model of the sound following that boundary;
      
      said detecting of a plurality of matches between acoustic patterns and various portions of the speech includes:
      
      (1) dividing a portion of speech to be analyzed into a plurality of segments of relatively acoustically similar portions;
      
      (2) deriving a diphone model for each of a plurality of boundaries between such segments, with each such diphone model including a pre-boundary model of the sound preceding its segment boundary and a post-boundary model of the sound following that boundary; and
      
      (3) comparing the diphone models derived from the speech to be analyzed against the diphone models associated with speech units to determine which speech units most probably match the portions of the speech represented by the diphone model derived from the speech.
  - 16. A method of speech-unit hypothesization as described in claim 15, wherein said dividing of a portion of speech to be analyzed into a plurality of segments comprises:
    - representing the speech to be analyzed as a sequence of acoustic frames;
      
      storing a plurality of acoustic frame models, each of which represents a certain type of sound, with each such acoustic frame model being associated with a class of one or more such acoustic frame models;
      
      storing a plurality of transition probabilities, each of which indicates the probability that a frame associated with a first acoustic frame model will neighbor a frame associated with a second, but not necessarily different, acoustic frame model;
      
      associating one or more of the acoustic frame models with each of a sequence of frames, with the acoustic frame model associated with each given frame of the sequence being selected as a function of (a) the closeness with which the given frame compares to each of a plurality of the acoustic frame models, (b) an indication of which one or more of the acoustic frame models most probably correspond with a frame which neighbors the given frame, and (c) one or more transition probabilities which indicate, for one or more acoustic frame models associated with the neighboring frame, the probability that the given frame is associated with a given acoustic frame model;
      
      comparing the class of the acoustic frame models associated with neighboring frames in the sequence of frames to detect where in that sequence one or more boundaries occur between regions associated with different classes of acoustic frame models.
  - 17. A method of speech-unit hypothesization as described in claim 10, wherein each of a plurality of the speech units represents a word.
  - 18. A continuous speech recognition method comprising said method of speech-unit hypothesization as described in claim 10, wherein:
    - the calculating of a speech-unit-probability score of a given speech unit is repeated for each of one or more speech units for each of a sequence of said combining times associated with successive portions of said speech to be analyzed;
      
      the speech units with the best speech-unit-probability scores for each of said combining times are selected for a more intensive comparison against the speech to be recognized in the vicinity of their corresponding combining time; and
      
      the sequence of words with the best scores from said more intensive comparisons is selected as the recognized sequence of words.
  - 19. A method of speech-unit hypothesization as described in claim 10, wherein:
    - the speech to be analyzed is represented as a sequence of acoustic frames;
      
      the temporal distribution of one or more evidence scores, which is produced in association with a known occurrence of a given acoustic pattern in a given speech unit in response to a match of that acoustic pattern against a portion of the speech, associates all its evidence score with a single frame in the speech to be analyzed; and
      
      the combining time covers a range of frames, so that evidence scores, if any, for a given speech unit which are associated with different frames within the combining time's range of frames are combined to calculate the speech-unit-probability score.
  - 20. A method of speech-unit hypothesization as described in claim 10, wherein:
    - the speech to be analyzed is represented as a sequence of acoustic frames;
      
      the temporal distribution of one or more evidence scores, which is produced in association with a known occurrence of a given acoustic pattern in a given speech unit in response to a match of that acoustic pattern against a portion of the speech, associates its evidence score with a range of frames in the speech to be analyzed.
  - 21. A method of speech-unit hypothesization as described in claim 20, wherein the combining time is only one frame in length and the combining of evidence scores combines evidence scores whose ranges of frames overlap the frame of the combining time.
  - 22. A method of speech-unit hypothesization as described in claim 20, wherein said temporal distribution causes the amount of the evidence score associated with a given frame in said range of frames to be determined as a function of a probability distribution which represents the probability that the reference point of the given speech unit would be located, in a given utterance of the speech unit, at each of a plurality of temporal distances from the known occurrence of the acoustic pattern associated with the temporal distribution.
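
The combined-displaced-evidence idea in claim 10, in the frame-based variant of claims 13 and 19, can be sketched as a voting scheme. Each pattern match casts evidence for every word known to contain that pattern, placed at the frame where that word would have started, and evidence falling within a combining window is then pooled. The names `word_start_scores`, `occurrences`, and `window` are illustrative, and summation is an assumed combining rule; the claims do not fix one.

```python
from collections import defaultdict

def word_start_scores(matches, occurrences, window=1):
    """Combine displaced evidence into per-word start-time scores.

    matches: list of (frame, pattern_id, evidence) for detected pattern matches.
    occurrences: pattern_id -> list of (word, offset_frames), where offset_frames
        is the known distance from the word's start to the pattern.
    window: half-width, in frames, of the combining time.
    Returns {word: {start_frame: combined_score}}.
    """
    votes = defaultdict(lambda: defaultdict(float))
    for frame, pattern, evidence in matches:
        for word, offset in occurrences.get(pattern, []):
            # Displace the evidence back to the hypothesized word start.
            votes[word][frame - offset] += evidence
    combined = {}
    for word, by_frame in votes.items():
        combined[word] = {
            t: sum(s for u, s in by_frame.items() if abs(u - t) <= window)
            for t in by_frame
        }
    return combined
```

Two matches that individually point to slightly different start frames for the same word reinforce each other once they fall inside the same combining window, while a word supported by only one match keeps a lower score.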

23. A method of speech-unit hypothesization comprising:
- storing a plurality of acoustic models, each of which represents a given class of sounds which occurs as part of one or more speech units;
  
  associating with each of a plurality of speech units one or more of the acoustic models, with each occurrence of an acoustic model associated with a given speech unit having a corresponding temporal displacement indication which indicates the temporal displacement, during utterances of the speech unit, between the occurrence of that acoustic model and a given reference point in that speech unit;
  
  finding, for each of a plurality of the acoustic models, evidence for one or more matches of that acoustic model against successive portions of the speech to be analyzed;
  
  calculating a speech-unit-probability score, which indicates the probability that a given speech unit occurs in a given region of the speech, as a function of (a) the evidence found that one or more acoustic models associated with the given speech unit match one or more portions in the speech, (b) the location in the speech at which the evidence for such matches is found, and (c) the temporal displacement indications associated, for the given speech unit, with the acoustic model for which such evidence is found, where whether or not a contribution is made to a given speech-unit-probability score as a result of the evidence found for each of a plurality of matches of acoustic models against the speech to be analyzed is independent of what, if any, evidence is found for the match of the other acoustic models associated with that speech unit; and
  
  hypothesizing as occurring at or near the point in time corresponding to said given region of speech, one or more speech units which have the highest speech-unit-probability scores for said given region of speech.
- Dependent claims (24, 25)
- - 24. A continuous speech recognition method comprising said method of speech-unit hypothesization as described in claim 23 wherein said speech-units are words, and further comprising:
    - using dynamic programming to match each of said hypothesized words against the portion of said speech to be analyzed associated with said hypothesized word to determine the probability of each of said hypothesized words as being the word corresponding to said portion of speech; and
      
      selecting the sequence of words with the highest probability as computed by said dynamic programming as the recognized sequence of words.
  - 25. A method of speech-unit hypothesization as in claim 23, wherein the amount of the contribution made as a result of the evidence of each of said plurality of matches to a given speech-unit probability score is independent of the amount of the contributions made to that probability score as a result of the evidence of the other of said matches.

26. A method of speech-unit hypothesization comprising:
- storing a plurality of acoustic models, each of which represents a given class of sound which occurs as part of one or more speech units;
  
  associating with each of a plurality of speech units one or more of the acoustic models, with each occurrence of an acoustic model associated with a given speech unit having a corresponding temporal displacement indication which indicates the temporal displacement, during utterances of the speech unit, between the occurrence of that acoustic model and a given reference point in that speech unit;
  
  finding, for each of a plurality of the acoustic models, evidence for one or more matches of that acoustic model against successive portions of speech to be analyzed; and
  
  as a result of the evidence found in the speech for each of a plurality of matches of acoustic models associated with a given speech unit, associating one or more evidence scores for the speech unit with the speech in a temporal distribution determined as a function of the temporal displacement indication for that acoustic model in that speech unit, with that temporal distribution being independent of the temporal distribution of the evidence scores associated with the speech as a result of any other matches of acoustic models associated with the given speech unit;
  
  combining the resulting one or more evidence scores for the given speech unit which are associated with a given combining time within the speech so as to calculate a speech-unit-probability score that indicates the probability that the given speech unit occurs in the speech in association with the combining time; and
  
  hypothesizing as occurring at said combining time one or more speech units which have the highest speech-unit-probability scores for said given combining time.

27. A method of speech-unit hypothesization comprising:
- storing a plurality of acoustic models, each of which represents a given class of sounds which occurs as part of one or more speech units;
  
  associating with each of a plurality of speech units one or more of the acoustic models, with each occurrence of an acoustic model associated with a given speech unit having a corresponding temporal displacement indication which indicates the temporal displacement, during utterances of the speech unit, between the occurrence of that acoustic model and a given reference point in that speech unit;
  
  finding, for each of a plurality of the acoustic models, evidence for one or more matches of that acoustic model against successive portions of speech to be analyzed;
  
  calculating a speech-unit-probability score for a given speech unit, which indicates the probability that the given speech unit occurs in the speech in association with a given scoring time in the speech, said speech-unit-probability score being calculated by (1) associating with each of a plurality of acoustic models associated with the speech unit a range of expected times determined relative to the scoring time as a function of the temporal displacement indication associated with each such acoustic model for the given speech unit;
  
  (2) producing an evidence score for each acoustic model for which evidence of a match is found during the range of expected times associated with that acoustic model and the given speech unit; and
  
  (3) combining the evidence scores so produced for the given speech unit; and
  
  hypothesizing as occurring at said given scoring time one or more speech units which have the highest speech-unit-probability scores for said given scoring time.
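
Claim 27 approaches the same computation from the scoring time: for each acoustic model of a word, it looks for match evidence inside a range of expected times. A minimal sketch of steps (1) to (3), assuming a symmetric range of `tolerance` frames, taking the strongest hit per model, and summing the results. `score_at` and its parameters are hypothetical names.

```python
def score_at(scoring_time, word_models, matches, tolerance=2):
    """Score one word at one scoring time from evidence in expected ranges.

    word_models: list of (pattern_id, offset_frames) for the word.
    matches: pattern_id -> list of (frame, evidence) where matches were found.
    tolerance: assumed half-width of each range of expected times.
    """
    total = 0.0
    for pattern, offset in word_models:
        expected = scoring_time + offset            # step (1): center of the range
        hits = [e for f, e in matches.get(pattern, [])
                if abs(f - expected) <= tolerance]  # step (2): evidence in range
        if hits:
            total += max(hits)                      # step (3): combine by summing
    return total
```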

28. A method of speech-unit hypothesization in continuous speech comprising:
- storing a plurality of acoustic models, each of which represents a given class of sound which occurs as part of one or more speech units;
  
  storing a histogram for each of a plurality of speech units, with each such histogram indicating, for each of a plurality of acoustic models, the total probable number of matches between that acoustic model and a given portion of one or more utterances of that speech unit;
  
  calculating a histogram for a portion of speech to be analyzed which indicates the total probable number of matches between each of the given plurality of acoustic models and that portion of speech;
  
  comparing the histogram calculated for the portion of speech to be analyzed against the histograms for each of a plurality of speech units to calculate speech-unit-probability scores which indicate which speech units most probably correspond to the speech to be analyzed; and
  
  hypothesizing as occurring at or near the time corresponding to said portion of speech one or more speech units with the highest speech-unit-probability scores for said portion of speech;
  
  wherein:
  
  the histogram which is calculated for the speech to be analyzed derives most of its information from a sampling window, that is, a portion of the speech to be analyzed which is approximately the same length as the portions of speech from which the histograms of the speech-unit models are made;
  
  this histogram is repeatedly re-calculated, with the sampling window being shifted relative to the speech to be analyzed in successive re-calculations;
  
  the resulting histograms which are calculated for each of a plurality of different positions of the sampling window are each compared against the histograms of the speech units to determine which of those speech units most probably correspond to each of those different sampling window positions.
- Dependent claims (29)
- - 29. A speech recognition method comprising a method of speech-unit hypothesization as described in claim 28, wherein:
    - the speech-unit-probability scores calculated by comparing histograms of the speech to be analyzed against histograms of speech-unit models are used to select which one or more speech units receive a more computationally intensive comparison against the speech to be analyzed; and
      
      selecting as the recognized word sequence the word sequence which is determined by said more computationally intensive comparison to be the most probable word sequence corresponding to the speech to be analyzed.
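
The sliding-window histogram comparison of claim 28 can be sketched as below. This is a simplification under stated assumptions: the claim speaks of a "total probable number of matches", whereas the sketch uses hard counts of one label per frame, and it compares histograms by sum of absolute differences, which the claim does not specify. `prescreen`, `window`, and `step` are illustrative names.

```python
def histogram(labels, n_models):
    """Count how often each acoustic model label occurs in a run of frames."""
    counts = [0] * n_models
    for m in labels:
        counts[m] += 1
    return counts

def histogram_distance(h1, h2):
    """Sum of absolute count differences (an assumed comparison measure)."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

def prescreen(labels, word_histograms, window, n_models, step=1):
    """Find the closest word histogram at each sampling window position.

    labels: one acoustic model index per frame of the speech to be analyzed.
    word_histograms: word -> stored histogram over the same models.
    window: sampling window length in frames.
    Returns a list of (window_start, best_word).
    """
    results = []
    for start in range(0, len(labels) - window + 1, step):
        h = histogram(labels[start:start + window], n_models)
        best = min(word_histograms,
                   key=lambda w: histogram_distance(h, word_histograms[w]))
        results.append((start, best))
    return results
```

Because a histogram ignores the order of frames inside the window, the comparison is cheap, which suits its role in claim 29 as a filter ahead of the more computationally intensive comparison.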

Current Assignee: Dragon Systems, Inc. (Microsoft Corporation)
Original Assignee: Dragon Systems, Inc. (Microsoft Corporation)
Inventors: Baker, James K.
Primary Examiner(s): Roskoski, Bernard
Application Number: US07/034,843
Time in Patent Office: 676 Days
Field of Search: 381/41, 381/43, 381/45, 364/513.5
US Class Current: 704/241
CPC Class Codes: G10L 15/04 Segmentation; Word boundary...
