Speaker-independent word recognition method and system based upon zero-crossing rate and energy measurement of analog speech signal
First Claim
1. A word recognition system for identifying a spoken word independent of the speaker thereof, wherein the spoken word is represented by an analog speech signal, said word recognition system comprising:
signal conditioning means including energy measuring circuit means and zero-crossing detector means for receiving an input analog speech signal and providing word-discrimination information as a sequence of feature vectors based solely upon energy measurements as provided by said energy measuring circuit means and the zero-crossing rate of the input analog speech signal as determined by said zero-crossing detector means to the exclusion of other speech parameters;
memory means storing a plurality of reference templates of digital speech data respectively representative of individual words comprising the vocabulary of the word recognition system, the vocabulary consisting of a relatively small number of words with each of the words included in the vocabulary being represented by a reference template, each of said reference templates corresponding to a word acoustically distinct from other words included in the vocabulary;
each of said reference templates being defined by a predetermined plurality of reference vectors arranged in a predetermined sequence and comprising an acoustic description of an individual word in a time-ordered sequence of acoustic events, each reference vector corresponding to one of the acoustic events as determined by a zero-crossing rate and an energy measurement of a reference analog speech signal corresponding to an individual word and representing a plurality of probabilistic events corresponding in number to the total number of values potentially assumable by a feature vector such that each of the probabilistic events is based upon the relative likelihood of occurrence of an acoustic event therein as compared to the other probabilistic events of the same reference vector;
means operably coupled to the outputs of said energy measuring circuit means and said zero-crossing detector means of said signal conditioning means for extracting feature vectors from said input analog speech signal, an acoustic event being described by the value of each feature vector;
means operably associated with said feature vector extracting means for comparing each feature vector of said input analog speech signal with the corresponding reference vectors of each of said reference templates to provide a distance measure with respect to each of the reference vectors in the predetermined sequences defining acoustic descriptions of the respective words included in the vocabulary of the word recognition system; and
means for determining which one of the plurality of reference templates is the closest match to said input analog speech signal based upon a cumulative cost profile as defined by the respective distance measures provided by comparisons of each feature vector of said input analog speech signal with the reference vectors included in the predetermined sequences of reference vectors defining the plurality of reference templates.
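The comparison and decision elements of the claim can be illustrated with a minimal sketch. It is not the patented implementation: the negative-log-probability distance, the linear time alignment, and the names `distance`, `cumulative_cost`, `recognize`, and `TEMPLATES` are all assumptions made for illustration, as are the toy probability values.

```python
import math

def distance(feature, reference_vector, floor=1e-6):
    """Cost of one feature value under one reference vector.

    Assumption: the distance measure is the negative log of the
    probability the reference vector assigns to that feature value.
    """
    return -math.log(max(reference_vector[feature], floor))

def cumulative_cost(features, template):
    """Return the running (cumulative) cost of an utterance against a template."""
    if not features:
        raise ValueError("utterance has no feature vectors")
    profile, total = [], 0.0
    for i, feature in enumerate(features):
        # Assumed linear time alignment: spread the utterance evenly
        # over the template's time-ordered reference vectors.
        j = i * len(template) // len(features)
        total += distance(feature, template[j])
        profile.append(total)
    return profile

def recognize(features, templates):
    """Pick the word whose template gives the lowest final cumulative cost."""
    costs = {w: cumulative_cost(features, t)[-1] for w, t in templates.items()}
    return min(costs, key=costs.get), costs

# Hypothetical two-word vocabulary; a feature vector can take values 0, 1, or 2.
# Each reference vector holds one probability per possible feature value.
TEMPLATES = {
    "yes": [[0.1, 0.1, 0.8], [0.8, 0.1, 0.1]],
    "no": [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]],
}

word, costs = recognize([2, 2, 0, 0], TEMPLATES)
print(word, costs)
```

Each reference vector here has as many entries as a feature vector has possible values, mirroring the claim language. A dynamic-programming time alignment could replace the linear mapping; the claim text quoted here does not say which alignment is used.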
Abstract
Speaker-independent word recognition method and system for identifying individual spoken words based upon an acoustically distinct vocabulary of a limited number of words. The word recognition system may employ memory storage associated with a microprocessor or microcomputer in which reference templates of digital speech data representative of a limited number of words comprising the word vocabulary are stored. The word recognition system accepts an input analog speech signal from a microphone as derived from a single-word voice command spoken by any speaker. The analog speech signal is directed to an energy measuring circuit and a zero-crossing detector for determining a sequence of feature vectors based upon the zero-crossing rate and energy measurements of the sampled analog speech signal. The sequence of feature vectors is then input to the microprocessor or microcomputer for individual comparison with the feature vectors included in each of the reference templates as stored in the memory portion of the microprocessor or microcomputer. Comparison of the sequence of feature vectors as determined from the input analog speech signal with the feature vectors included in the plurality of reference templates produces a cumulative cost profile for enabling logic circuitry within the microprocessor or microcomputer to make a decision as to the identity of the spoken word. The word recognition system may be incorporated within an electronic device which is also equipped with speech synthesis capability such that the electronic device is able to recognize simple words as spoken thereto and to provide an audible comment via speech synthesis which is related to the spoken word.
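The feature extraction described in the abstract can be sketched in software. The system described uses an energy measuring circuit and a zero-crossing detector, so this is only a software analogue under stated assumptions. The frame length, the quantization thresholds, the encoding of the two measurements into one integer, and the names `frame_features`, `quantize`, and `feature_vectors` are all hypothetical.

```python
import math

def frame_features(samples, frame_len=160):
    """Yield (mean energy, zero-crossing count) for each full frame."""
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        # A zero crossing is a sign change between neighbouring samples.
        crossings = sum(
            1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
        )
        yield energy, crossings

def quantize(value, thresholds):
    """Return the index of the first threshold the value falls below."""
    for level, limit in enumerate(thresholds):
        if value < limit:
            return level
    return len(thresholds)

def feature_vectors(samples, frame_len=160,
                    energy_thresholds=(0.01, 0.1),
                    zcr_thresholds=(10, 40)):
    """Encode each frame as one integer built only from energy and ZCR levels.

    The thresholds are illustrative assumptions, not values from the patent.
    """
    n_zcr = len(zcr_thresholds) + 1
    return [
        quantize(e, energy_thresholds) * n_zcr + quantize(z, zcr_thresholds)
        for e, z in frame_features(samples, frame_len)
    ]

# Hypothetical input: 0.1 s of a loud 100 Hz tone sampled at 8 kHz.
tone = [math.sin(2 * math.pi * 100 * n / 8000) for n in range(800)]
print(feature_vectors(tone))
```

With two thresholds per measurement, a feature vector can take nine values, so each frame reduces to a single small integer that a template can be compared against.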
20 Claims
1. A word recognition system for identifying a spoken word independent of the speaker thereof, wherein the spoken word is represented by an analog speech signal (recited in full under First Claim above).
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
12. An electronic device comprising:
integrated circuit means including memory means having digital speech data stored therein, a first portion of said memory means being devoted to a plurality of reference templates of digital speech data respectively representative of individual words comprising the vocabulary of a word recognition capability, the vocabulary consisting of a relatively small number of words with each of the words included in the vocabulary being represented by a reference template defined by a predetermined plurality of reference vectors arranged in a predetermined sequence and comprising an acoustic description of an individual word in a time-ordered sequence, each of said reference templates corresponding to a word acoustically distinct from other words included in the vocabulary, said memory means having a second portion thereof devoted to digital speech data from which words, phrases and sentences of synthesized speech may be derived, controller means for selectively accessing digital speech data from said first portion of said memory means devoted to said plurality of reference templates and from said second portion of said memory means devoted to said digital speech data from which synthesized speech may be derived, and speech synthesizer means operably coupled to said controller means and to said memory means for selectively accessing digital speech data in response to instructions from said controller means and generating analog speech signals representative of human speech in response to the selectively accessed digital speech data from said memory means;
signal conditioning means for receiving an input analog speech signal representative of a spoken word and providing word-discrimination information as a sequence of feature vectors defining acoustic descriptions of the word;
said controller means and said first portion of said memory means devoted to said plurality of reference templates cooperating to define word recognition means for receiving said word-discrimination information representative of said input analog speech signal;
said controller means including comparator means for comparing each feature vector of said input analog speech signal with the corresponding reference vectors of each of said reference templates stored within said first portion of said memory means to provide a distance measure with respect to each of the reference vectors in the predetermined sequences defining acoustic descriptions of the respective words as represented by said plurality of reference templates;
said controller means further including logic circuit means for determining which one of the plurality of reference templates is the closest match to said input analog speech signal based upon a cumulative cost profile as defined by the respective distance measures provided by comparisons of each feature vector of said input analog speech signal with the reference vectors included in the predetermined sequences of reference vectors defining the plurality of reference templates;
said controller means being responsive to the recognition of the word represented by said input analog speech signal based upon the particular reference template decided upon by said logic circuit means to selectively access digital speech data from the second portion of said memory means reflective of the word recognition;
said speech synthesizer means being responsive to the selectively accessed digital speech data reflective of the word recognition for generating analog speech signals representative of human speech in some way related to the recognized word; and
audio means coupled to the output of said speech synthesizer means for producing audible human speech from said analog speech signals generated by said speech synthesizer means having some relationship to the recognized word.
Dependent claims: 13, 14, 15, 16
17. A method for recognizing individual words of speech independent of the speaker thereof from a vocabulary consisting of a limited number of words, said method comprising the steps of:
establishing a data base from a population of different speakers uttering the same list of a limited number of individual words to be included in the vocabulary for which speaker-independent word recognition is to be applicable;
determining a sequence of time-ordered acoustic events based upon a zero-crossing rate and an energy measurement as the sole speech parameters corresponding to each word for each of the different speakers whose utterances are included in the data base;
assigning a probability distribution function of each acoustic event included in each of the sequences of time-ordered acoustic events corresponding to respective individual words to be included in the vocabulary based upon the statistical averages obtained from the data base of the population of different speakers;
representing a plurality of probabilistic acoustic events based upon the probability distribution function of a given acoustic event included in the sequence of acoustic events corresponding to an individual word as a reference vector such that each of the probabilistic acoustic events is based upon the relative likelihood of occurrence of an acoustic event therein;
arranging a set of a plurality of reference vectors in a sequence comprising an acoustic description of an individual word;
generating a reference template of digital speech data representative of an individual word from each set of reference vectors;
measuring an input analog speech signal by substantially simultaneously obtaining energy measurements of said input analog speech signal and sensing zero-crossings of said input analog speech signal to provide word-discrimination information as a sequence of feature vectors, each of which is based solely upon an energy measurement and a zero-crossing rate and defines an acoustic event;
extracting said sequence of feature vectors;
comparing each feature vector of said input analog speech signal as extracted with the corresponding reference vectors of each of a plurality of said reference templates of digital speech data respectively representative of individual words comprising the vocabulary of the limited number of words;
determining a distance measure with respect to the corresponding reference vectors in the predetermined sequences defining acoustic descriptions of the respective words for each feature vector of said input analog speech signal as a result of the comparisons therebetween; and
recognizing a word as represented by the input analog speech signal on the basis of determining which one of said plurality of reference templates is the closest match to the input analog speech signal based upon a cumulative cost profile as defined by the respective distance measures provided by the comparison of each feature vector of said input analog speech signal with the reference vectors included in predetermined sequences of reference vectors defining said plurality of reference templates.
Dependent claims: 18, 19, 20
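The template-building steps of claim 17 (data base of speakers, probability distribution per acoustic event, reference vectors arranged in sequence) can be sketched as follows. This is a rough illustration under assumptions the claim does not state: utterances are already encoded as sequences of integer feature values, they are brought to a common length by linear resampling, and probabilities are estimated as smoothed relative frequencies. The names `resample` and `build_template` and the `smoothing` parameter are hypothetical.

```python
from collections import Counter

def resample(features, n_vectors):
    """Linearly pick n_vectors entries so every utterance has the same length.

    Assumes `features` is non-empty.
    """
    return [features[i * len(features) // n_vectors] for i in range(n_vectors)]

def build_template(utterances, n_vectors, n_values, smoothing=1.0):
    """Build one word's reference template from many speakers' utterances.

    Each reference vector is a probability distribution over the n_values
    values a feature vector can take, estimated at one time position.
    Additive smoothing (an assumption) keeps unseen values above zero.
    """
    aligned = [resample(u, n_vectors) for u in utterances]
    total = len(aligned) + smoothing * n_values
    template = []
    for position in range(n_vectors):
        counts = Counter(u[position] for u in aligned)
        template.append(
            [(counts[v] + smoothing) / total for v in range(n_values)]
        )
    return template

# Hypothetical data base: three speakers saying the same word, with feature
# values in {0, 1, 2} and different utterance lengths.
speakers = [[2, 2, 0, 0], [2, 0], [2, 2, 2, 0, 0, 0]]
for reference_vector in build_template(speakers, n_vectors=2, n_values=3):
    print(reference_vector)
```

Estimating each reference vector from many speakers is what makes the template speaker-independent: the distribution records how likely each feature value is at that point in the word across the population, not for any one voice.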
Specification