Low cost speech recognition system and method

US 4,910,784 A
Filed: 07/30/1987
Issued: 03/20/1990
Est. Priority Date: 07/30/1987
Status: Expired due to Term

Abstract

A low cost speech recognition system generates frames of received speech having binary feature components. The received speech frames are compared with reference templates, and error values representing the difference between the received speech and the reference templates are generated. At the end of an utterance, if one template resulted in a sufficiently small error value, the word represented by that template is selected as the recognized word.

22 Claims

1. A system for recognizing speech, comprising:
- a digitizer for sampling analog speech signals at predetermined intervals and generating a digital representation thereof as digital speech signals;
  
  a feature extractor coupled to said digitizer for grouping the digital speech signals into frames and generating a transform of the digital speech signals as grouped in each frame, wherein the transform has a plurality of feature coefficients, and wherein each feature coefficient has a corresponding binary feature coefficient indicating whether the feature coefficient has a value greater or less than a preselected threshold for that feature coefficient;
  
  a queue coupled to said feature extractor for receiving frames of binary feature coefficients as speech frames and arranging them in consecutive order;
  
  a comparator coupled to said queue for comparing a plurality of speech frames with a plurality of reference templates having frames of binary feature coefficients and generating a plurality of error values indicating the closeness of the match therebetween, wherein the reference templates are respectively representative of different words; and
  
  a decision controller coupled to said comparator for receiving the results of the comparisons, and for selecting a best match between a portion of a speech utterance as represented by said speech frames and the reference templates.
  - 2. The system of claim 1, wherein said decision controller further includes means for detecting the beginning and end of an utterance as defined by its acoustic energy levels, and wherein said decision controller selects a best match only after an utterance is completed.
  - 3. The system of claim 2, wherein said decision controller selects a best match only if at least one comparison in said queue has an error less than a predetermined threshold, and wherein an utterance is rejected otherwise.
  - 4. The system of claim 3, wherein an utterance is rejected if the two comparisons having the lowest errors have error values which are within a preselected range of each other.
  - 5. The system of claim 1, wherein said comparator computes an exclusive-OR between each frame of each reference template and a corresponding speech frame of binary feature coefficients in said queue, and wherein the error value indicates the number of bits which do not match between respective reference templates and the corresponding speech frame of binary feature coefficients.
  - 6. The system of claim 1, wherein only alternate speech frames of binary feature coefficients in said queue are used by said comparator for each comparison step with the plurality of reference templates.
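
The comparator of claim 5 and the rejection rules of claims 3 and 4 can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the function names, the packing of each frame into an integer, and the parameters `max_error` and `min_margin` are assumptions, and treating "within a preselected range" as an inclusive difference is also an assumption.

```python
from typing import Dict, Optional, Sequence


def frame_error(speech_frame: int, template_frame: int) -> int:
    """Count the bits that differ between two binary feature frames (XOR, then popcount)."""
    return bin(speech_frame ^ template_frame).count("1")


def template_error(speech: Sequence[int], template: Sequence[int]) -> int:
    """Sum the per-frame bit mismatches over corresponding frames."""
    if len(speech) != len(template):
        raise ValueError("speech and template must have the same number of frames")
    return sum(frame_error(s, t) for s, t in zip(speech, template))


def decide(errors: Dict[str, int], max_error: int, min_margin: int) -> Optional[str]:
    """Return the best-matching word, or None if the utterance is rejected.

    max_error and min_margin are hypothetical names for the thresholds
    described in claims 3 and 4.
    """
    if not errors:
        return None
    ranked = sorted(errors.items(), key=lambda item: item[1])
    best_word, best_error = ranked[0]
    if best_error >= max_error:
        return None  # no template matched closely enough (claim 3)
    if len(ranked) > 1 and ranked[1][1] - best_error <= min_margin:
        return None  # two best candidates are too close to call (claim 4)
    return best_word
```

For example, `decide({"yes": 3, "no": 20}, max_error=10, min_margin=4)` selects `"yes"`, while `{"yes": 3, "no": 5}` with the same thresholds is rejected as ambiguous.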

7. A method for recognizing spoken words, comprising the steps of:
- (a) digitizing an analog speech signal representing an utterance of speech by sampling the analog speech signal at preselected intervals to generate digital speech signals;
  
  grouping the digital speech signals into frames and transforming each frame of digital speech signals into a speech frame comprising a plurality of binary coefficients indicating acoustic features;
  
  providing a plurality of reference templates respectively representative of different words, each reference template having a plurality of frames of binary coefficients;
  
  comparing respective speech frames of binary coefficients with the reference templates, and generating error values indicating the magnitude of the differences therebetween; and
  
  selecting a reference template which generates the lowest error value as the recognized word.
  - 8. The method of claim 7, wherein a reference template is selected as the recognized word only if its error value is less than a predetermined value.
  - 9. The method of claim 7, wherein the reference template frames have a time duration twice as long as the speech frames of binary coefficients but are represented by the same number of binary coefficients, wherein only alternate ones of consecutive speech frames of binary coefficients are compared with the reference templates in the generation of said error values.
  - 10. The method of claim 7, wherein the comparison of speech frames of binary coefficients with the reference templates comprises performing an exclusive OR between corresponding speech and reference template frames, whereby the generated error values are the Hamming distance between the corresponding speech and reference template frames.
  - 15. The method of claim 7, wherein the transforming of each frame of digital speech signals into a speech frame comprising a plurality of binary coefficients comprises initially transforming each frame of digital speech signals into a plurality of speech parameters defining respective feature coefficients;
    - comparing each of the plurality of speech parameters included in a respective speech frame with a preselected threshold value;
      
      assigning a first or a second value as a binary coefficient corresponding to the respective speech parameter depending upon whether the speech parameter is greater or less than the preselected threshold value corresponding thereto; and
      
      assembling a plurality of said binary coefficients obtained from respective comparisons of all of the speech parameters included in a speech frame with preselected threshold values as a string of said binary coefficients representing a speech frame.
  - 16. The method of claim 15, wherein the first and second values assignable as a binary coefficient are respectively "1" and "0".
  - 17. The method of claim 16, wherein each string of binary coefficients representing a speech frame is of eight data bits in length.
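
The thresholding steps of claims 15 to 17 can be illustrated with a short Python sketch. The claims do not say which of "1" and "0" corresponds to "greater", how a value equal to its threshold is treated, or how the bits are ordered in the string; the choices below (1 for greater, 0 otherwise, first parameter in the most significant bit) and the name `binarize_frame` are assumptions made only for illustration.

```python
from typing import Sequence


def binarize_frame(params: Sequence[float], thresholds: Sequence[float]) -> int:
    """Pack one binary coefficient per speech parameter into an integer.

    Assumptions: bit = 1 when the parameter exceeds its own threshold,
    0 otherwise; the first parameter lands in the most significant bit.
    """
    if len(params) != len(thresholds):
        raise ValueError("one threshold is required per speech parameter")
    bits = 0
    for value, threshold in zip(params, thresholds):
        bits = (bits << 1) | (1 if value > threshold else 0)
    return bits


# Eight parameters yield an eight-bit frame, as in claim 17.
frame = binarize_frame([0.5, -0.2, 1.0, 0.0, 0.3, -1.0, 2.0, 0.1], [0.0] * 8)
```

With all thresholds at zero, the example parameters produce the bit string `10101011`.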

11. A method for enrolling speech for use with a speech recognition system, comprising the steps of:
- selecting a word to be enrolled and determining an expected length in speech frames necessary for the representation thereof;
  
  receiving an utterance of the selected word in the form of an analog speech signal;
  
  digitizing the analog speech signal representative of the utterance by collecting samples thereof at preselected intervals as digital speech signals;
  
  grouping the digital speech signals into frames having a predetermined time duration;
  
  extracting binary features for each frame of digital speech signals to form respective speech frames of binary features corresponding to each of the frames of digital speech signals;
  
  comparing the length of the utterance as represented by speech frames of binary features to the expected length; and
  
  if the utterance as represented by speech frames of binary features has a length in speech frames within a preselected amount of the expected length, enrolling the speech frames representing the utterance as a reference template.
  - 12. The method of claim 11, wherein the reception of an utterance of the selected word in the form of an analog speech signal and the digitizing of the analog speech signal are performed a plurality of times, and wherein the extraction of binary features with respect to each set of digital speech signals resulting therefrom is used to create a composite set of binary feature frames.
  - 13. The method of claim 12, wherein the plurality of utterances as received in the form of analog speech signals are made by a single speaker.
  - 14. The method of claim 12, wherein the plurality of utterances as received in the form of analog speech signals are made by different speakers.
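
The length check at the end of claim 11 can be sketched as below. The function name `enroll` and the parameter `tolerance` (standing in for the "preselected amount") are hypothetical, and the frames are assumed to be already-extracted binary feature frames packed as integers. The composite template of claim 12 is not sketched, because the claims do not state how the repeated utterances are combined.

```python
from typing import List, Optional, Sequence


def enroll(frames: Sequence[int], expected_length: int, tolerance: int) -> Optional[List[int]]:
    """Accept the utterance as a reference template only if its length in
    frames is within `tolerance` frames of the expected length."""
    if abs(len(frames) - expected_length) <= tolerance:
        return list(frames)
    return None  # utterance too short or too long; not enrolled
```

A nine-frame utterance is enrolled against an expected length of ten frames with a tolerance of two, whereas a five-frame utterance is not.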

18. A method for recognizing spoken words, comprising the steps of:
- digitizing an analog speech signal representing an utterance of speech by sampling the analog speech signal at preselected intervals to generate digital speech data;
  
  grouping the samples of digital speech data into frames having a plurality of samples of digital speech data;
  
  transforming the frames of digital speech data into a cepstrum transform having a plurality of cepstral parameters for each frame which define respective feature coefficients;
  
  comparing each cepstral parameter with a preselected threshold value;
  
  assigning a first or a second value to a binary feature coefficient depending upon whether the cepstral parameter is greater or less than the preselected threshold value corresponding thereto;
  
  assembling a plurality of said binary feature coefficients obtained from respective comparisons of all of the cepstral parameters included in a frame with preselected threshold values as a string of said binary feature coefficients representing a frame of speech data;
  
  providing a plurality of reference templates respectively representative of different words, each reference template having a plurality of frames of binary feature coefficients;
  
  comparing speech frames of binary feature coefficients with the reference templates, and generating error values indicating the magnitude of the differences therebetween; and
  
  selecting a reference template which generates the lowest error value as the recognized word.
  - 19. The method of claim 18, further including analyzing each frame of digital speech data to determine linear predictive coding speech parameters for each frame;
    - and thereafter transforming the linear predictive coding speech parameters for each frame into a cepstrum transform having a plurality of cepstral parameters for each frame which define the respective feature coefficients.
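
Claim 19 does not specify how the linear predictive coding parameters are converted to cepstral parameters. One standard textbook recursion, shown here only as an assumed illustration, applies to an all-pole model H(z) = 1 / (1 - Σ a_k z^-k) with the gain term ignored; the sign convention, the function name `lpc_to_cepstrum`, and the parameter `n_ceps` are assumptions, not details taken from the patent.

```python
from typing import List, Sequence


def lpc_to_cepstrum(a: Sequence[float], n_ceps: int) -> List[float]:
    """Convert predictor coefficients a_1..a_p into cepstral coefficients c_1..c_n.

    Uses c_n = a_n + sum_{k=1}^{n-1} (k/n) * c_k * a_{n-k},
    where a_n is taken as 0 for n > p.
    """
    p = len(a)
    c: List[float] = []
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        # Only terms with 1 <= n - k <= p contribute.
        for k in range(max(1, n - p), n):
            acc += (k / n) * c[k - 1] * a[n - k - 1]
        c.append(acc)
    return c
```

As a sanity check, a single-pole model with a_1 = 0.5 should give c_n = 0.5^n / n, which the recursion reproduces.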

20. A method for recognizing spoken words, comprising the steps of:
- digitizing an analog speech signal representing an utterance of speech by sampling the analog speech signal at preselected intervals to generate digital speech data;
  
  grouping the samples of digital speech data into frames having a plurality of samples of digital speech data;
  
  transforming each frame of digital speech data into a plurality of speech parameters defining respective feature coefficients;
  
  comparing each of the plurality of speech parameters included in a respective speech frame with a preselected threshold value;
  
  assigning a first or a second value as a binary coefficient corresponding to the respective speech parameter depending upon whether the speech parameter is greater or less than the preselected threshold value corresponding thereto;
  
  assembling a plurality of said binary coefficients obtained from respective comparisons of all of the speech parameters included in a speech frame with preselected threshold values as a string of said binary coefficients representing a speech frame;
  
  arranging a plurality of speech frames, each comprising a plurality of binary coefficients, in consecutive order in a queue;
  
  providing a plurality of reference templates respectively representative of different words, each reference template having a plurality of frames of binary coefficients, wherein the reference template frames have a time duration twice as long as the speech frames of binary coefficients but are represented by the same number of binary coefficients;
  
  comparing alternate ones of the consecutive speech frames of binary coefficients as arranged in the queue with the reference templates, and generating error values indicating the magnitude of the differences therebetween; and
  
  selecting a reference template which generates the lowest error value as the recognized word.
  - 21. The method as set forth in claim 20, further including shifting all speech frames already arranged in the queue one position within the queue in response to the insertion of a new speech frame into the queue during the filling of the queue to maintain the plurality of speech frames as arranged therein in consecutive order.
  - 22. The method of claim 21, wherein the comparing of alternate ones of the consecutive speech frames of binary coefficients as arranged in the queue with the reference templates continues until all of the reference templates have been compared to the alternate speech frames of the queue;
    - thereafter arranging a subsequent plurality of speech frames in consecutive order in a succeeding queue; and
      
      comparing alternate ones of the consecutive speech frames in the succeeding queue with the reference templates in the generation of said error values.
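
The queue behavior of claims 20 to 22 can be sketched as follows. This is an assumed illustration: the class `FrameQueue`, its fixed `capacity`, the choice to anchor the alternate frames on the newest frame, and the use of a bit-mismatch count as the error value are all choices made here rather than details stated in these claims.

```python
from typing import List, Optional, Sequence


class FrameQueue:
    """Holds the most recent speech frames; index 0 is the newest frame."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.frames: List[int] = []

    def push(self, frame: int) -> None:
        # Inserting at the front shifts every stored frame by one position.
        self.frames.insert(0, frame)
        del self.frames[self.capacity:]  # discard frames beyond capacity

    def alternate_frames(self, count: int) -> List[int]:
        """Return up to `count` alternate frames, oldest first, ending at the newest."""
        picked = self.frames[0:2 * count:2]
        return picked[::-1]


def queue_error(queue: FrameQueue, template: Sequence[int]) -> Optional[int]:
    """Compare alternate queued frames with a template whose frames each
    span twice the duration of a speech frame. Returns None until the
    queue holds enough frames."""
    speech = queue.alternate_frames(len(template))
    if len(speech) < len(template):
        return None
    return sum(bin(s ^ t).count("1") for s, t in zip(speech, template))
```

After frames 1 through 7 are pushed into a six-frame queue, the three alternate frames are 3, 5, and 7, so a template of `[3, 5, 7]` yields an error of zero.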

Current Assignee: Texas Instruments, Inc.
Original Assignee: Texas Instruments, Inc.
Inventors: Rajasekaran, P. K.; McMahan, Michael L.; Doddington, George R.; Anderson, Wallace
Primary Examiner(s): Kemeny, Emanuel S.
Application Number: US07/079,563
Time in Patent Office: 964 Days
Field of Search: 381/42-43
US Class Current: 704/251
CPC Class Codes: G10L 15/00 Speech recognition G10L17/0...
