Speaker-independent word recognizer
Abstract
Speaker-independent word recognition is performed, based on a small acoustically distinct vocabulary, with minimal hardware requirements. After a simple preconditioning filter, the zero crossing intervals of the input speech are measured and sorted by duration, to provide a rough measure of the frequency distribution within each input frame. The distribution of zero crossing intervals is transformed into a binary feature vector, which is compared with each reference template using a modified Hamming distance measure. A dynamic time warping algorithm is used to permit recognition of various speaker rates, and to economize on the reference template storage requirements. A mask vector with each reference vector on a template is used to ignore insignificant (or speaker-dependent) features of the words detected.
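The matching stage described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: binary vectors are packed into Python integers, a mask bit of 1 is assumed to mean "significant", the "modified Hamming distance" is approximated by a plain masked bit count, and the time warping is the textbook dynamic-programming recurrence without slope constraints or length normalization. The names `masked_hamming`, `dtw_distance`, and `recognize` are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

# A template is a time-ordered list of (reference_vector, mask_vector) pairs.
Template = List[Tuple[int, int]]


def masked_hamming(feature: int, reference: int, mask: int) -> int:
    """Count differing bits, ignoring positions whose mask bit is 0.

    Assumption: mask bit 1 marks a significant element, 0 an insignificant
    (for example speaker-dependent) one.
    """
    return bin((feature ^ reference) & mask).count("1")


def dtw_distance(features: Sequence[int], template: Template) -> float:
    """Accumulated masked distance along the best time alignment.

    Textbook dynamic time warping: each input frame may repeat, skip ahead,
    or advance in step with the template, so slower and faster utterances
    can still align with one stored template.
    """
    n, m = len(features), len(template)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            ref, mask = template[j - 1]
            d = masked_hamming(features[i - 1], ref, mask)
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def recognize(features: Sequence[int], vocabulary: Dict[str, Template]) -> str:
    """Return the vocabulary word whose template is the closest match."""
    return min(vocabulary, key=lambda word: dtw_distance(features, vocabulary[word]))


# Hypothetical two-word vocabulary with 4-bit vectors and all bits significant.
vocabulary = {
    "yes": [(0b1100, 0b1111), (0b0011, 0b1111)],
    "no": [(0b1111, 0b1111), (0b0000, 0b1111)],
}
# The first frame is repeated, as it might be for a slower speaker.
print(recognize([0b1100, 0b1100, 0b0011], vocabulary))
```

Packing each vector into an integer keeps the per-frame comparison to one XOR, one AND, and a bit count, which is in the spirit of the minimal hardware requirements the abstract mentions.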
16 Claims
1. A word recognition system for identifying a spoken word represented by an analog speech signal, said word recognition system comprising:
signal processing means for receiving an analog input speech signal and for producing feature vectors from the input speech signal to provide a sequence of feature vectors at predetermined speech frame intervals as an output therefrom;
memory means storing a plurality of reference templates of digital speech data respectively representative of individual words and comprising the vocabulary of the word recognition system, each of said reference templates being defined by a predetermined plurality of reference vectors arranged in a predetermined sequence and comprising an acoustic description of an individual word in a time-ordered sequence, each of said reference templates being further defined by at least one mask vector respectively associated with each said sequence of reference vectors and being indicative of the significance of portions of the reference vector sequence associated therewith in establishing the identity of the word represented by the reference template of which said at least one mask vector is a component;
means operably associated with said signal processing means for comparing each feature vector of said input speech signal with the corresponding reference vectors of each of said reference templates to provide a distance measure with respect to each of the feature vectors and the predetermined reference vector sequences defining acoustic descriptions of the respective words included in the vocabulary of the word recognition system, said comparing means being responsive to the status of the respective mask vectors comprising components of said plurality of reference templates to ignore elements of reference vectors included in respective reference templates which are indicated by the associated mask vector to be insignificant so as to provide said distance measure based upon significant elements of the reference vectors as included in the predetermined reference vector sequences; and
word recognizing means operably associated with said comparing means for determining which one of the plurality of the reference templates is the closest match to said input speech signal based upon the distance measures of said reference vector sequences and successively received feature vectors corresponding to respective speech frames.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 16.
9. A word recognition system for identifying a spoken word represented by an analog speech signal, said word recognition system comprising:
signal conditioning means for receiving an analog input speech signal and performing filtering and signal processing operations thereon to place the input speech signal in a format compatible with the determination of feature aspects thereof;
means operably coupled to the output of said signal conditioning means for extracting feature vectors from said conditioned input speech signal to provide a sequence of feature vectors at predetermined speech frame intervals;
memory means storing a plurality of reference templates of digital speech data respectively representative of individual words and comprising the vocabulary of the word recognition system, each of said reference templates being defined by a predetermined plurality of reference vectors arranged in a predetermined sequence and comprising an acoustic description of an individual word in a time-ordered sequence, each of said reference templates being further defined by a plurality of mask vectors corresponding in number to said predetermined plurality of reference vectors and respectively associated with a corresponding reference vector of said plurality of reference vectors, each said mask vector being indicative of the significance of the reference vector associated therewith in establishing the identity of the word represented by the reference template in which the reference vector occurs;
means operably associated with said feature vector extracting means for comparing each feature vector of said input speech signal with the corresponding reference vectors of each of said reference templates to provide a distance measure with respect to each of said feature vectors and the predetermined reference vector sequences defining acoustic descriptions of the respective words included in the vocabulary of the word recognition system, said comparing means being responsive to the status of the respective mask vectors associated with the reference vectors to ignore elements of each said reference vector which are indicated by the associated mask vector corresponding thereto to be insignificant so as to provide said distance measure based upon significant elements of the reference vectors as included in the predetermined reference vector sequences; and
word recognition means for determining which one of the plurality of the reference templates is the closest match to said input speech signal based upon the distance measures between each of said reference vector sequences and successively received feature vectors corresponding to respective speech frames.
10. A method for recognizing speech comprising:
receiving an analog input speech signal;
processing said analog input speech signal to provide a sequence of feature vectors from said input speech signal at predetermined speech frame intervals;
associating at least one mask vector with each sequence of a plurality of reference vectors which have been organized in sequence with each of said reference vector sequences corresponding to a word which can be recognized, with said mask vector being indicative of the significance of portions of the reference vector sequence with which it is associated in establishing the identity of the word to which the respective reference vector sequence corresponds;
comparing each of said feature vectors with each of said plurality of reference vectors in relation to the status of the respective mask vector associated with each said reference vector sequence;
determining a distance measure with respect to each of said reference vectors for each successive feature vector in said sequence of feature vectors in response to the comparison therebetween wherein portions of each said reference vector sequence indicated by the associated at least one mask vector corresponding thereto to be insignificant are ignored such that said distance measure is based upon significant portions of the reference vector sequence; and
recognizing words in accordance with the distance measures between each of said reference vector sequences and successively received feature vectors corresponding to respective speech frames.
Dependent claims: 11, 12, 13.
14. A method for recognizing speech comprising:
receiving an analog input speech signal;
conditioning said analog speech signal to produce a sequence of rectangular waveforms of polarity signs alternating between plus and minus polarities as a digital waveform signal;
counting the number of polarity transitions in the digital waveform signal to obtain a zero-crossing count for each frame of the digital waveform signal;
measuring the time duration intervals between zero-crossings of the digital waveform signal;
providing a sequence of binary feature vectors based upon the measurements of the time duration intervals between zero-crossings of the digital waveform signal and corresponding to respective frames of the digital waveform signal;
associating at least one mask vector with each sequence of a plurality of reference vectors which have been organized in sequences with each of said reference vector sequences corresponding to a word which can be recognized, wherein said at least one mask vector is indicative of the significance of portions of the reference vector sequence with which it is associated in establishing the identity of the word to which the respective reference vector sequence corresponds;
comparing each of said feature vectors with each of said plurality of reference vectors organized in sequences and said at least one mask vector associated therewith;
determining a distance measure with respect to each of said reference vectors for each successive feature vector in said sequence of said feature vectors in response to the comparison therebetween, wherein portions of each said reference vector sequence indicated by the associated at least one mask vector corresponding thereto as being insignificant are ignored such that said distance measure is based upon significant portions of the respective reference vector sequence; and
recognizing words in accordance with the distance measures between each of said reference vector sequences and successively received feature vectors corresponding to respective speech frames.
Dependent claim: 15.
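The feature-extraction steps recited in claim 14 (hard-limiting the signal, counting polarity transitions, measuring the intervals between zero crossings, and forming a binary feature vector per frame) might be sketched as below. This is only an illustration under assumptions: the frame is taken to be a list of already-sampled values, and the duration bin edges, the count threshold, and the function names `zero_crossings`, `crossing_intervals`, and `binary_feature_vector` are hypothetical choices that the claim text does not specify.

```python
from typing import List, Sequence


def zero_crossings(frame: Sequence[float]) -> List[int]:
    """Return the sample indices at which the polarity of the frame changes.

    Hard-limiting to +1/-1 stands in for the "rectangular waveform" of
    alternating polarity signs; the length of the returned list is the
    zero-crossing count for the frame.
    """
    signs = [1 if s >= 0 else -1 for s in frame]
    return [i for i in range(1, len(signs)) if signs[i] != signs[i - 1]]


def crossing_intervals(frame: Sequence[float]) -> List[int]:
    """Return the durations, in samples, between successive zero crossings."""
    crossings = zero_crossings(frame)
    return [b - a for a, b in zip(crossings, crossings[1:])]


def binary_feature_vector(
    frame: Sequence[float],
    bin_edges: Sequence[int] = (2, 4, 8, 16),  # assumed duration bins, in samples
    threshold: int = 1,                        # assumed minimum count to set a bit
) -> List[int]:
    """Sort intervals into duration bins and threshold each bin to one bit.

    Short intervals indicate high-frequency content and long intervals
    low-frequency content, so the bit pattern is a rough spectral sketch.
    """
    counts = [0] * (len(bin_edges) + 1)
    for duration in crossing_intervals(frame):
        bin_index = sum(duration > edge for edge in bin_edges)
        counts[bin_index] += 1
    return [1 if c >= threshold else 0 for c in counts]


# A square wave that flips polarity every 2 samples falls in the shortest bin.
fast = [1, 1, -1, -1, 1, 1, -1, -1]
print(len(zero_crossings(fast)), binary_feature_vector(fast))
```

A real front end would also need the preconditioning filter and a fixed frame clock; both are left out here to keep the sketch focused on the interval sorting.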
Specification