Low cost speech recognition system and method
Abstract
A low cost speech recognition system generates frames of received speech having binary feature components. The received speech frames are compared with reference templates, and error values representing the difference between the received speech and the reference templates are generated. At the end of an utterance, if one template resulted in a sufficiently small error value, the word represented by that template is selected as the recognized word.
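The end-of-utterance decision described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `select_word`, the dictionary layout, and the `max_error` acceptance threshold are hypothetical. The abstract says only that a template must produce a "sufficiently small" error value.

```python
from typing import Dict, Optional


def select_word(errors: Dict[str, int], max_error: int) -> Optional[str]:
    """Return the word whose template has the lowest error, or None.

    `errors` maps each candidate word to the error value its reference
    template produced for the utterance (hypothetical layout).
    `max_error` is an assumed acceptance threshold.
    """
    if not errors:
        return None
    best_word = min(errors, key=errors.get)
    # Reject the utterance if even the best template is too far off.
    return best_word if errors[best_word] <= max_error else None


print(select_word({"yes": 7, "no": 31, "stop": 18}, max_error=10))   # yes
print(select_word({"yes": 27, "no": 31, "stop": 18}, max_error=10))  # None
```

Returning `None` rather than the nearest word models the case in which no template matched well enough to be reported as recognized.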
22 Claims
1. A system for recognizing speech, comprising:
a digitizer for sampling analog speech signals at predetermined intervals and generating a digital representation thereof as digital speech signals;
a feature extractor coupled to said digitizer for grouping the digital speech signals into frames and generating a transform of the digital speech signals as grouped in each frame, wherein the transform has a plurality of feature coefficients, and wherein each feature coefficient has a corresponding binary feature coefficient indicating whether the feature coefficient has a value greater or less than a preselected threshold for that feature coefficient;
a queue coupled to said feature extractor for receiving frames of binary feature coefficients as speech frames and arranging them in consecutive order;
a comparator coupled to said queue for comparing a plurality of speech frames with a plurality of reference templates having frames of binary feature coefficients and generating a plurality of error values indicating the closeness of the match therebetween, wherein the reference templates are respectively representative of different words; and
a decision controller coupled to said comparator for receiving the results of the comparisons, and for selecting a best match between a portion of a speech utterance as represented by said speech frames and the reference templates.
Dependent claims: 2, 3, 4, 5, 6
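One way to picture the queue and comparator of claim 1 is sketched below. Several details are assumptions made for illustration: each speech frame is packed into a Python integer used as a bitmask, the error value is the Hamming distance (number of differing bits) summed over frames, and the names `hamming` and `template_error`, the queue length, and the sample bit patterns are hypothetical. The claim itself requires only error values "indicating the closeness of the match".

```python
from collections import deque
from typing import Deque, Dict, List, Sequence


def hamming(a: int, b: int) -> int:
    """Count the bit positions in which two binary frames differ."""
    return bin(a ^ b).count("1")


def template_error(frames: Sequence[int], template: Sequence[int]) -> int:
    """Sum the frame-by-frame Hamming distances between the template and
    the most recent len(template) frames.

    Assumes len(frames) >= len(template).
    """
    recent = list(frames)[-len(template):]
    return sum(hamming(f, t) for f, t in zip(recent, template))


# Queue holding the most recent speech frames in consecutive order.
queue: Deque[int] = deque(maxlen=4)
for frame in (0b10110010, 0b10110110, 0b01001101, 0b01001001):
    queue.append(frame)

# Hypothetical reference templates, one per word.
templates: Dict[str, List[int]] = {
    "on": [0b10110010, 0b10110010, 0b01001101, 0b01001101],
    "off": [0b00001111, 0b00001111, 0b11110000, 0b11110000],
}

errors = {word: template_error(queue, t) for word, t in templates.items()}
best = min(errors, key=errors.get)
print(errors, best)  # {'on': 2, 'off': 22} on
```

Because the features are single bits, the comparison reduces to an XOR and a bit count, operations that are inexpensive on simple hardware.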
7. A method for recognizing spoken words, comprising the steps of:
(a) digitizing an analog speech signal representing an utterance of speech by sampling the analog speech signal at preselected intervals to generate digital speech signals;
grouping the digital speech signals into frames and transforming each frame of digital speech signals into a speech frame comprising a plurality of binary coefficients indicating acoustic features;
providing a plurality of reference templates respectively representative of different words, each reference template having a plurality of frames of binary coefficients;
comparing respective speech frames of binary coefficients with the reference templates, and generating error values indicating the magnitude of the differences therebetween; and
selecting a reference template which generates the lowest error value as the recognized word.
Dependent claims: 8, 9, 10, 15, 16, 17
11. A method for enrolling speech for use with a speech recognition system, comprising the steps of:
selecting a word to be enrolled and determining an expected length in speech frames necessary for the representation thereof;
receiving an utterance of the selected word in the form of an analog speech signal;
digitizing the analog speech signal representative of the utterance by collecting samples thereof at preselected intervals as digital speech signals;
grouping the digital speech signals into frames having a predetermined time duration;
extracting binary features for each frame of digital speech signals to form respective speech frames of binary features corresponding to each of the frames of digital speech signals;
comparing the length of the utterance as represented by speech frames of binary features to the expected length; and
if the utterance as represented by speech frames of binary features has a length in speech frames within a preselected amount of the expected length, enrolling the speech frames representing the utterance as a reference template.
Dependent claims: 12, 13, 14
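The length check at the end of claim 11 can be sketched as below. The function name `enroll`, the symmetric `tolerance` in frames, and returning `None` on rejection are assumptions for illustration; the claim says only that the utterance is enrolled when its length is "within a preselected amount" of the expected length, and does not say what happens otherwise.

```python
from typing import Dict, List, Optional


def enroll(frames: List[int], expected_len: int, tolerance: int) -> Optional[List[int]]:
    """Return a copy of `frames` to serve as a reference template if its
    length is within `tolerance` frames of `expected_len`, else None."""
    if abs(len(frames) - expected_len) <= tolerance:
        return list(frames)
    return None


# Hypothetical template store keyed by word.
templates: Dict[str, List[int]] = {}

utterance = [0b1011, 0b1001, 0b0110, 0b0100, 0b0101]  # 5 binary frames
template = enroll(utterance, expected_len=6, tolerance=1)
if template is not None:
    templates["stop"] = template

print(sorted(templates))                               # ['stop']
print(enroll(utterance, expected_len=9, tolerance=1))  # None
```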
18. A method for recognizing spoken words, comprising the steps of:
digitizing an analog speech signal representing an utterance of speech by sampling the analog speech signal at preselected intervals to generate digital speech data;
grouping the samples of digital speech data into frames having a plurality of samples of digital speech data;
transforming the frames of digital speech data into a cepstrum transform having a plurality of cepstral parameters for each frame which define respective feature coefficients;
comparing each cepstral parameter with a preselected threshold value;
assigning a first or a second value to a binary feature coefficient depending upon whether the cepstral parameter is greater or less than the preselected threshold value corresponding thereto;
assembling a plurality of said binary feature coefficients obtained from respective comparisons of all of the cepstral parameters included in a frame with preselected threshold values as a string of said binary feature coefficients representing a frame of speech data;
providing a plurality of reference templates respectively representative of different words, each reference template having a plurality of frames of binary feature coefficients;
comparing speech frames of binary feature coefficients with the reference templates, and generating error values indicating the magnitude of the differences therebetween; and
selecting a reference template which generates the lowest error value as the recognized word.
Dependent claims: 19
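The thresholding steps of claim 18 can be illustrated with the sketch below. The claim does not say how the cepstrum is computed, so the real cepstrum (inverse DFT of the log-magnitude spectrum) is used here purely as a stand-in. The names `real_cepstrum` and `binarize`, the mapping of "greater than threshold" to 1, the handling of equality as 0, and the all-zero thresholds in the usage lines are likewise assumptions.

```python
import cmath
import math
from typing import List, Sequence


def real_cepstrum(frame: Sequence[float], n_coeffs: int) -> List[float]:
    """First `n_coeffs` real-cepstrum values of one frame of samples."""
    n = len(frame)
    # Log-magnitude spectrum via a direct DFT (adequate for short frames).
    log_mag = []
    for k in range(n):
        s = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame))
        log_mag.append(math.log(abs(s) + 1e-12))  # small offset avoids log(0)
    # Inverse DFT of the log-magnitude spectrum; only the real part is kept.
    return [
        sum(v * cmath.exp(2j * math.pi * k * q / n) for k, v in enumerate(log_mag)).real / n
        for q in range(n_coeffs)
    ]


def binarize(coeffs: Sequence[float], thresholds: Sequence[float]) -> int:
    """Pack one bit per coefficient into an int: 1 if the coefficient
    exceeds its own threshold, otherwise 0 (first coefficient is the MSB)."""
    bits = 0
    for c, t in zip(coeffs, thresholds):
        bits = (bits << 1) | (1 if c > t else 0)
    return bits


print(binarize([0.5, -0.2, 0.1, 0.0], [0.0, 0.0, 0.2, -0.1]))  # 9, i.e. 0b1001

frame = [math.sin(0.3 * i) for i in range(32)]  # synthetic 32-sample frame
speech_frame = binarize(real_cepstrum(frame, 8), [0.0] * 8)
print(format(speech_frame, "08b"))
```

Each frame thus collapses to a short bit string, one bit per cepstral parameter, which is the form the reference templates are compared against.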
20. A method for recognizing spoken words, comprising the steps of:
digitizing an analog speech signal representing an utterance of speech by sampling the analog speech signal at preselected intervals to generate digital speech data;
grouping the samples of digital speech data into frames having a plurality of samples of digital speech data;
transforming each frame of digital speech data into a plurality of speech parameters defining respective feature coefficients;
comparing each of the plurality of speech parameters included in a respective speech frame with a preselected threshold value;
assigning a first or a second value as a binary coefficient corresponding to the respective speech parameter depending upon whether the speech parameter is greater or less than the preselected threshold value corresponding thereto;
assembling a plurality of said binary coefficients obtained from respective comparisons of all of the speech parameters included in a speech frame with preselected threshold values as a string of said binary coefficients representing a speech frame;
arranging a plurality of speech frames, each comprising a plurality of binary coefficients, in consecutive order in a queue;
providing a plurality of reference templates respectively representative of different words, each reference template having a plurality of frames of binary coefficients, wherein the reference template frames have a time duration twice as long as the speech frames of binary coefficients but are represented by the same number of binary coefficients;
comparing alternate ones of the consecutive speech frames of binary coefficients as arranged in the queue with the reference templates, and generating error values indicating the magnitude of the differences therebetween; and
selecting a reference template which generates the lowest error value as the recognized word.
Dependent claims: 21, 22
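The alternate-frame comparison in claim 20 can be pictured with the sketch below. Because each template frame spans twice the duration of a speech frame, every second queued frame is paired with consecutive template frames. The alignment chosen here (ending at the newest queued frame and stepping backward by two), the Hamming-distance error, and the names `hamming` and `alternate_frame_error` are assumptions; the claim does not fix these details.

```python
from typing import Sequence


def hamming(a: int, b: int) -> int:
    """Count the bit positions in which two binary frames differ."""
    return bin(a ^ b).count("1")


def alternate_frame_error(queue: Sequence[int], template: Sequence[int]) -> int:
    """Compare every other queued frame with consecutive template frames.

    The last template frame is aligned with the newest queued frame, so a
    template of T frames spans the newest 2*T - 1 queued frames.
    """
    needed = 2 * len(template) - 1
    if len(queue) < needed:
        raise ValueError("not enough speech frames queued for this template")
    picked = list(queue)[-needed::2]  # every second frame, ending at the newest
    return sum(hamming(f, t) for f, t in zip(picked, template))


queued = [0b1111, 0b0000, 0b1100, 0b0001, 0b0011]  # consecutive speech frames
template = [0b1111, 0b1100, 0b0111]                # frames twice as long
print(alternate_frame_error(queued, template))     # 1
```

Skipping alternate frames lets a template cover the same stretch of speech with half as many frame comparisons.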
Specification