Hidden Markov model speech recognition arrangement
First Claim
1. A speech analyzer for recognizing an utterance as one of a plurality of reference patterns each having a frame sequence of acoustic feature signals comprising:
means for storing a set of K signals each representative of a prescribed acoustic feature of said plurality of reference patterns;
means for storing a plurality of templates each representative of an identified spoken reference pattern, the template of each spoken reference pattern comprising signals representative of a first state, a last state and a preselected number N-2 intermediate states between said first and last states of a constrained hidden Markov model of said spoken reference pattern, N being independent of the number of acoustic feature frames in the acoustic feature frame sequence of the identified spoken reference pattern, a plurality of first type signals each representative of the likelihood of a prescribed acoustic feature signal of a reference pattern frame being in a predetermined one of said states, and a plurality of second type signals each representative of the likelihood of a transition from a prescribed acoustic feature signal in one of said states to another of said states of said template;
means responsive to the utterance for forming a time frame sequence of acoustic feature signals representative of the speech pattern of the utterance;
means responsive to said utterance feature signal sequence and said stored prescribed acoustic feature signals for selecting a sequence of said prescribed feature signals representative of the utterance speech pattern;
means jointly responsive to said sequence of prescribed feature signals representative of the utterance and the reference pattern template N state constrained hidden Markov model signals for combining said utterance representative sequence of prescribed feature signal sequence with said reference pattern N state Markov model template signals to form a third type signal representative of the likelihood of the unknown utterance being the spoken reference pattern; and
means responsive to the third type signals for the plurality of reference patterns for generating a signal to identify the utterance as one of the plurality of reference patterns.
Abstract
A speech recognizer includes a plurality of stored constrained hidden Markov model reference templates and a set of stored signals representative of prescribed acoustic features of said plurality of reference patterns. Each Markov model template includes a set of N state signals. The number of states is preselected to be independent of the reference pattern acoustic features and preferably substantially smaller than the number of acoustic feature frames of the reference patterns. An input utterance is analyzed to form a sequence of said prescribed feature signals representative of the utterance. The utterance representative prescribed feature signal sequence is combined with the N state constrained hidden Markov model template signals to form a signal representative of the probability of the utterance being each reference pattern. The input speech pattern is identified as one of the reference patterns responsive to the probability representative signals.
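The step in which the utterance is converted into a sequence of prescribed feature signals can be illustrated with a small vector quantization sketch. This is only an illustration under stated assumptions: the function names `squared_distance` and `quantize`, the two-dimensional toy features, and the squared Euclidean distance are all hypothetical choices made here, and the abstract does not say which distance measure is used.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance (an assumed measure, not taken from the text)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(frames: Sequence[Sequence[float]],
             prototypes: Sequence[Sequence[float]]) -> List[int]:
    """Replace each utterance frame by the index of its nearest stored prototype."""
    symbols = []
    for frame in frames:
        nearest = min(range(len(prototypes)),
                      key=lambda k: squared_distance(frame, prototypes[k]))
        symbols.append(nearest)
    return symbols


# Toy data: K = 3 stored prototype features and a 4-frame utterance.
prototypes = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
frames = [[0.1, -0.2], [0.9, 1.2], [3.5, 0.3], [1.1, 0.8]]
symbols = quantize(frames, prototypes)
print(symbols)
```

The resulting index sequence is what would then be scored against each N state template.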
15 Claims
1. A speech analyzer for recognizing an utterance as one of a plurality of reference patterns each having a frame sequence of acoustic feature signals comprising:
means for storing a set of K signals each representative of a prescribed acoustic feature of said plurality of reference patterns;
means for storing a plurality of templates each representative of an identified spoken reference pattern, the template of each spoken reference pattern comprising signals representative of a first state, a last state and a preselected number N-2 intermediate states between said first and last states of a constrained hidden Markov model of said spoken reference pattern, N being independent of the number of acoustic feature frames in the acoustic feature frame sequence of the identified spoken reference pattern, a plurality of first type signals each representative of the likelihood of a prescribed acoustic feature signal of a reference pattern frame being in a predetermined one of said states, and a plurality of second type signals each representative of the likelihood of a transition from a prescribed acoustic feature signal in one of said states to another of said states of said template;
means responsive to the utterance for forming a time frame sequence of acoustic feature signals representative of the speech pattern of the utterance;
means responsive to said utterance feature signal sequence and said stored prescribed acoustic feature signals for selecting a sequence of said prescribed feature signals representative of the utterance speech pattern;
means jointly responsive to said sequence of prescribed feature signals representative of the utterance and the reference pattern template N state constrained hidden Markov model signals for combining said utterance representative sequence of prescribed feature signal sequence with said reference pattern N state Markov model template signals to form a third type signal representative of the likelihood of the unknown utterance being the spoken reference pattern; and
means responsive to the third type signals for the plurality of reference patterns for generating a signal to identify the utterance as one of the plurality of reference patterns.
Dependent claims: 2, 3, 4, 5, 6, 7
8. A method for recognizing an utterance as one of a plurality of reference patterns each having a time frame sequence of acoustic feature signals comprising the steps of:
storing a set of K signals each representative of a prescribed acoustic feature of said plurality of reference patterns;
storing a plurality of templates each representative of an identified spoken reference pattern, the template of each spoken reference pattern comprising signals representative of a first state, a last state and a preselected number N-2 of intermediate states between said first and last states of a constrained hidden Markov model of said spoken reference pattern, N being independent of the number of acoustic feature frames in the acoustic feature frame sequences of the identified spoken reference patterns, a plurality of first type signals each representative of the likelihood of a prescribed acoustic feature of a reference pattern frame being in a predetermined one of said states, and a plurality of second type signals each representative of the likelihood of a transition from a prescribed acoustic feature signal in one of said states to another of said states of said template;
forming a time frame sequence of acoustic feature signals representative of the speech pattern of the utterance;
selecting a sequence of said prescribed feature signals representative of the utterance speech pattern responsive to the utterance feature signal sequence and the K stored prescribed acoustic feature signals;
combining said sequence of prescribed feature signals representative of the utterance and the N state constrained hidden Markov model signals of the reference pattern template to form a third type signal representative of the likelihood of the unknown utterance being the spoken reference pattern; and
generating a signal to identify the utterance as one of the reference patterns responsive to the third type signals for the plurality of reference patterns.
Dependent claims: 9, 10, 11, 12, 13, 14
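The stored template described in claim 8 (N states, first type signals per state and feature, second type signals per state pair) can be pictured as a small data structure. The sketch below is hypothetical throughout: `WordTemplate`, `left_to_right_transitions`, `is_valid`, and the `max_jump` parameter are names introduced here for illustration. It also assumes that "constrained" means a left-to-right model with no backward transitions, which the claim text itself does not spell out, and the uniform probabilities are placeholders rather than trained values.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WordTemplate:
    """One reference pattern: an N state model over K prototype features."""
    word: str
    emission: List[List[float]]    # emission[j][k]: "first type" signals
    transition: List[List[float]]  # transition[i][j]: "second type" signals

    @property
    def num_states(self) -> int:
        return len(self.transition)


def left_to_right_transitions(n: int, max_jump: int = 1) -> List[List[float]]:
    """Uniform placeholder transitions from state i to states i..i+max_jump.

    Assumption: the constraint forbids moving back to an earlier state.
    """
    rows = []
    for i in range(n):
        allowed = range(i, min(i + max_jump, n - 1) + 1)
        p = 1.0 / len(allowed)
        rows.append([p if j in allowed else 0.0 for j in range(n)])
    return rows


def is_valid(t: WordTemplate, k: int, tol: float = 1e-9) -> bool:
    """Check shapes, that every row is a distribution, and no backward moves."""
    n = t.num_states
    if len(t.emission) != n or any(len(r) != k for r in t.emission):
        return False
    if any(len(r) != n for r in t.transition):
        return False
    rows_ok = all(abs(sum(r) - 1.0) <= tol for r in t.emission + t.transition)
    no_backward = all(t.transition[i][j] == 0.0
                      for i in range(n) for j in range(i))
    return rows_ok and no_backward


# N = 3 states and K = 2 prototype features, chosen without reference to
# how many frames any spoken example of the word contains.
template = WordTemplate("zero", [[0.5, 0.5]] * 3, left_to_right_transitions(3))
print(is_valid(template, k=2))
```

Note that N is fixed when the template is built, which mirrors the claim's requirement that N be independent of the number of frames in the reference pattern.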
15. A speech analyzer for recognizing an utterance as one of a plurality of vocabulary words comprising:
a first memory for storing a set of K vector quantized prototype signals each representative of a linear predictive acoustic feature in the frame sequence of acoustic features of utterances of said plurality of vocabulary words;
a second memory for storing a plurality of vocabulary reference templates, each template corresponding to an N state constrained hidden Markov model of a vocabulary word and including;
a signal corresponding to an initial state of said constrained hidden Markov model,
signals corresponding to N-2 intermediate states of said constrained hidden Markov model,
a signal corresponding to the Nth final state of said constrained hidden Markov model, the number of states N being preselected to be less than the number of acoustic features in the sequence of acoustic features of the shortest vocabulary word,
a set of first type signals each representative of the probability of a prototype feature signal being in a predetermined state of said constrained hidden Markov model, and
a set of second type signals each representative of the probability of transition between a predetermined pair of said vocabulary word constrained hidden Markov model states;
first means responsive to the utterance for forming an M time frame sequence of linear predictive acoustic feature signals representative of the speech pattern of the utterance;
second means operative responsive to said speech pattern feature signals and said stored prototype acoustic feature signals for generating a sequence of M prototype acoustic feature signals representative of said utterance speech pattern;
said second means being jointly responsive to said sequence of M prototype feature signals representative of the utterance and the signals of the N state constrained hidden Markov model of the vocabulary word template for forming a third type signal representative of the likelihood of the unknown utterance being the vocabulary word including means for producing a sequence of speech pattern frame processing interval signals,
said second means being operative in the first frame processing interval responsive to the first frame prototype feature signal, the vocabulary word Markov model first state and first type signals for forming a signal representative of the likelihood of the first frame prototype feature signal being in the vocabulary word Markov model first state,
and operative in each of the second to the Mth speech pattern frame processing intervals responsive to the Markov model state signals, the current frame prototype feature signals, the first type and second type signals, and the likelihood signals of the immediately preceding frame processing interval for forming a set of signals each representative of the likelihood of the current frame prototype feature signal being in a prescribed state of the vocabulary word Markov model,
and means responsive to the likelihood signal corresponding to the Nth final state in the Mth speech pattern frame processing interval for generating the third type signal for said vocabulary word representative of the likelihood of the utterance being the vocabulary word; and
means responsive to the third type signals for the plurality of vocabulary words for generating a signal identifying the utterance as the vocabulary word having the largest third type signal.
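The frame-by-frame processing recited in claim 15 can be sketched as a recursion over M frames. This is a minimal illustration under stated assumptions, not the patented implementation: the names `word_likelihood` and `recognize` are hypothetical, the toy probabilities are invented, and the per-frame update sums over predecessor states (the standard forward recursion), whereas the claim does not say whether contributions are summed or maximized.

```python
from typing import Dict, List, Sequence, Tuple

Matrix = List[List[float]]


def word_likelihood(symbols: Sequence[int],
                    emission: Matrix, transition: Matrix) -> float:
    """Likelihood of ending in the Nth state after the Mth frame."""
    n = len(transition)
    # First frame interval: only the first state is considered.
    alpha = [0.0] * n
    alpha[0] = emission[0][symbols[0]]
    # Frames 2..M: combine the previous frame's likelihoods with the
    # transition and emission probabilities (summing is an assumption).
    for o in symbols[1:]:
        alpha = [sum(alpha[i] * transition[i][j] for i in range(n))
                 * emission[j][o] for j in range(n)]
    return alpha[n - 1]


def recognize(symbols: Sequence[int],
              templates: Dict[str, Tuple[Matrix, Matrix]]
              ) -> Tuple[str, Dict[str, float]]:
    """Score every vocabulary word and pick the one with the largest score."""
    scores = {word: word_likelihood(symbols, e, a)
              for word, (e, a) in templates.items()}
    return max(scores, key=scores.get), scores


# Toy vocabulary: N = 2 states, K = 2 prototype symbols, M = 3 frames.
transition = [[0.5, 0.5], [0.0, 1.0]]
templates = {
    "A": ([[0.9, 0.1], [0.1, 0.9]], transition),
    "B": ([[0.1, 0.9], [0.9, 0.1]], transition),
}
best, scores = recognize([0, 0, 1], templates)
print(best, scores)
```

For long utterances a practical version would likely work with log probabilities to avoid numerical underflow; that refinement is omitted here to keep the recursion easy to read.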
Specification