Speech recognition apparatus for consumer electronic applications

US 5,790,754 A
Filed: 10/21/1994
Issued: 08/04/1998
Est. Priority Date: 10/21/1994
Status: Expired due to Term

First Claim

1. A method for recognizing an utterance spoken by a user comprising the steps of:

capturing the utterance as an input audio signal;

converting the input audio signal to a digitized representation;

using a single pole digital difference filter repeatedly to obtain a plurality of filtered waveforms from the digitized representation, wherein said single pole digital difference filter is in the form;
Y(n) = AY(n-1) + BX(n) + CX(n-1);

extracting estimates of a plurality of acoustic parameters from the plurality of filtered waveforms at successive sampling points;

determining an end time of the spoken utterance and a duration of the spoken utterance;

thereafter time-normalizing said estimates so that the spoken utterance extends over a predetermined number of time intervals; and

analyzing the time-normalized estimates to identify the utterance.
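
The difference equation in the claim can be illustrated with a minimal Python sketch. The function name `difference_filter`, the zero initial conditions, and the coefficient values are assumptions made for illustration; the claim itself does not fix any of them.

```python
def difference_filter(x, a, b, c):
    """Apply Y(n) = A*Y(n-1) + B*X(n) + C*X(n-1) to a sequence of samples.

    Assumes zero initial conditions: Y(-1) = 0 and X(-1) = 0.
    """
    y = []
    y_prev = 0.0
    x_prev = 0.0
    for x_n in x:
        y_n = a * y_prev + b * x_n + c * x_prev
        y.append(y_n)
        y_prev = y_n
        x_prev = x_n
    return y


# Hypothetical coefficients, chosen only for illustration.
# Impulse response of a simple smoothing filter: [0.5, 0.25, 0.125, 0.0625]
lowpassed = difference_filter([1.0, 0.0, 0.0, 0.0], a=0.5, b=0.5, c=0.0)
```

With A = 0, B = 1, C = -1 the same routine reduces to a first difference, which shows how one routine can take on different filtering roles by changing only the three parameters.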

Abstract

A spoken word or phrase recognition device. The device does not require a digital signal processor, large RAM, or extensive analog circuitry. The input audio signal is digitized and passed recursively through a digital difference filter to produce a multiplicity of filtered output waveforms. These waveforms are processed in real time by a microprocessor to generate a pattern that is recognized by a neural network pattern classifier that operates in software in the microprocessor. By application of additional techniques, this device has been shown to recognize an unknown speaker saying a digit from zero through nine with an accuracy greater than 99%. Because of the recognition accuracy and cost-effective design, the device may be used in cost sensitive applications such as toys, electronic learning aids, and consumer electronic products.

21 Claims

1. A method for recognizing an utterance spoken by a user comprising the steps of:
- capturing the utterance as an input audio signal;
  
  converting the input audio signal to a digitized representation;
  using a single pole digital difference filter repeatedly to obtain a plurality of filtered waveforms from the digitized representation, wherein said single pole digital difference filter is in the form;
  Y(n) = AY(n-1) + BX(n) + CX(n-1);
  extracting estimates of a plurality of acoustic parameters from the plurality of filtered waveforms at successive sampling points;
  
  determining an end time of the spoken utterance and a duration of the spoken utterance;
  
  thereafter time-normalizing said estimates so that the spoken utterance extends over a predetermined number of time intervals; and
  
  analyzing the time-normalized estimates to identify the utterance.
- Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 18)
  - 2. The method of claim 1 wherein said step of using a single pole digital difference filter comprises the substeps of:
    - configuring the single pole digital difference filter using a predetermined set of filter parameters;
      
      applying the digitized representation to an input of the single pole digital difference filter;
      
      thereafter modifying the set of filter parameters of the single pole digital difference filter to reconfigure the digital difference filter;
      
      directing a previous output signal of the single pole digital difference filter to the input of the single pole digital difference filter to obtain a new output signal of the single pole digital filter; and
      
      repeating said modifying step and said directing step to simulate a bandpass filter having one of said plurality of waveforms as an output as a cascade of successive filter stages.
  - 3. The method of claim 1 wherein said extracting step occurs concurrently with said capturing step.
  - 4. The method of claim 1 wherein said analyzing step is performed using a pretrained neural network.
  - 5. The method of claim 4 wherein said neural network is a feedforward, multilayer, nonlinear classifier.
  - 6. The method of claim 1 wherein said analyzing step comprises the substep of:
    - assigning the utterance to one of a plurality of categories, said plurality of categories including a plurality of selected word classes and a class that includes all utterances not included in the plurality of selected word classes.
  - 7. The method of claim 4 wherein said neural network operates as a Bayes classifier.
  - 8. The method of claim 4 further comprising the step of:
    - training the neural network with training data using a numerical optimization procedure.
  - 9. The method of claim 8 wherein said training step comprises the substeps of:
    - initializing weights of the neural network to be a set of initial values; and
      
      thereafter varying the neural network weights in response to training data input to the neural network until an optimal weight set is obtained.
  - 10. The method of claim 9 wherein said step of training further comprises the substeps of:
    - repeating said steps of initializing and varying using different sets of initial values in said initializing step to obtain a plurality of optimal weight sets; and
      
      thereafter selecting from the plurality of optimal weight sets, an optimal weight set that adjusts the neural network to have a minimum error given the training data as input.
  - 11. The method of claim 1 wherein one of said acoustic parameters is a zero crossing rate of one of said plurality of filtered waveforms.
  - 12. The method of claim 1 wherein one of said acoustic parameters is a power measure of one of said plurality of filtered waveforms.
  - 13. The method of claim 1 wherein one of said acoustic parameters is said duration of the spoken utterance.
  - 14. The method of claim 1 wherein said step of analyzing comprises the substep of:
    - comparing the utterance to a stored pattern for speaker verification.
  - 18. The method of claim 1 wherein A, B, and C are selected so that no multiplication by one of A, B, or C required for calculating a single term of Y(n) requires more than two shifts and two adds.
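
Claim 18 constrains A, B, and C so that each multiplication reduces to a few shifts and adds. The following Python sketch shows what such arithmetic might look like on integer samples. The coefficient values (3/4, 7/8, 1/8) and all function names are hypothetical, subtraction is treated here as an add, and right shifts truncate, so results are approximations for inputs that are not multiples of the divisor.

```python
def times_three_quarters(x: int) -> int:
    """Approximate 0.75 * x as x/2 + x/4: two shifts and one add."""
    return (x >> 1) + (x >> 2)


def times_seven_eighths(x: int) -> int:
    """Approximate 0.875 * x as x - x/8: one shift and one subtract."""
    return x - (x >> 3)


def lowpass_step(y_prev: int, x_n: int) -> int:
    """One step of Y(n) = (7/8)*Y(n-1) + (1/8)*X(n), with C = 0.

    Uses only shifts, adds, and subtracts; no multiply instruction is needed.
    """
    return times_seven_eighths(y_prev) + (x_n >> 3)
```

Coefficients of this kind let the filter run on a microprocessor without a hardware multiplier, which matches the cost-sensitive applications named in the abstract.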

15. Apparatus for recognizing an utterance spoken by a user, said apparatus comprising;
- an analog-to-digital converter that converts an audio signal into a digitized representation;
  a repeatedly accessed single pole digital difference filter that obtains a plurality of filtered waveforms from the digitized representation, wherein said single pole digital difference filter is in the form;
  Y(n) = AY(n-1) + BX(n) + CX(n-1);
  a feature extractor that extracts estimates of a plurality of acoustic parameters from the plurality of filtered waveforms at successive sampling points;
  
  a timer that determines an end time of the spoken utterance and a duration of the spoken utterance;
  
  a time-normalizer that time-normalizes said estimates so that the spoken utterance extends over a predetermined number of time intervals; and
  
  a classifier that analyzes the time-normalized estimates to identify the spoken utterance.
- Dependent Claims (16, 17, 21)
  - 16. The apparatus of claim 15 further comprising a digital difference filter controller that:
    - configures the digital difference filter using a predetermined set of filter parameters;
      
      applies the digitized representation to an input of the digital difference filter;
      
      modifies the filter parameters of the digital difference filter to reconfigure the digital difference filter;
      
      directs a previous output signal of the digital difference filter to the input of the digital difference filter to obtain a new output signal of the digital filter; and
      
      repeats said modifying step and said directing step to simulate a bandpass filter having one of said plurality of waveforms as an output.
  - 17. The apparatus of claim 15 wherein said classifier is a pretrained neural network.
  - 21. The apparatus of claim 15 wherein A, B, and C are selected so that no multiplication by one of A, B, or C required for calculating a single term of Y(n) requires more than two shifts and two adds.
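
Claims 2 and 16 describe reconfiguring one filter and feeding its previous output back to its input so that successive passes behave like a cascade of stages forming a bandpass filter. A minimal Python sketch of that idea follows. The names `run_stage`, `cascade`, and `BANDPASS_STAGES`, the zero initial state, and the two parameter sets are assumptions chosen for illustration, not values from the patent.

```python
def run_stage(x, a, b, c):
    """One pass of Y(n) = A*Y(n-1) + B*X(n) + C*X(n-1), zero initial state."""
    y, y_prev, x_prev = [], 0.0, 0.0
    for x_n in x:
        y_prev = a * y_prev + b * x_n + c * x_prev
        y.append(y_prev)
        x_prev = x_n
    return y


def cascade(x, stages):
    """Reuse the same filter routine once per parameter set.

    Each pass takes the previous pass's output as its input.
    """
    signal = list(x)
    for a, b, c in stages:
        signal = run_stage(signal, a, b, c)
    return signal


# Hypothetical parameter sets (A, B, C).
BANDPASS_STAGES = [
    (0.5, 1.0, -1.0),  # high-pass stage: zero response to a constant input
    (0.5, 0.5, 0.5),   # low-pass stage: zero response at half the sample rate
]
```

Running `cascade(samples, BANDPASS_STAGES)` rejects both a constant input and an input alternating at half the sample rate while passing frequencies in between, which is the bandpass behavior the claims describe. A different list of parameter sets would yield a different band, so several calls can produce the plurality of filtered waveforms.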

19. A method for recognizing an utterance spoken by a user comprising the steps of:
- capturing the utterance as an input audio signal;
  
  converting the input audio signal to a digitized representation;
  using a digital difference filter to obtain a plurality of filtered waveforms from the digitized representation, wherein said single pole digital difference filter is in the form;
  Y(n) = AY(n-1) + BX(n) + CX(n-1);
  extracting, concurrently with said capturing step, estimates of a plurality of acoustic parameters from the plurality of filtered waveforms at successive sampling points, concurrently with said capturing step;
  
  determining an end time of the spoken utterance and a duration of the spoken utterance;
  
  thereafter time-normalizing said estimates so that the spoken utterance extends over a predetermined number of time intervals; and
  
  analyzing the time-normalized estimates to identify the utterance.
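
The extracting step operates on filtered waveforms at successive sampling points, and claims 11 and 12 name a zero crossing rate and a power measure as possible acoustic parameters. The Python sketch below computes both per fixed-length frame. The frame-based layout, the function names, and the use of mean squared amplitude as the power measure are assumptions. It also processes a stored waveform in one batch for clarity, whereas the claim extracts estimates concurrently with capture.

```python
def zero_crossing_count(frame):
    """Count sign changes between consecutive samples in one frame."""
    return sum(
        1 for prev, cur in zip(frame, frame[1:]) if (prev < 0) != (cur < 0)
    )


def mean_power(frame):
    """Mean squared amplitude of one frame (one possible power measure)."""
    return sum(s * s for s in frame) / len(frame)


def extract_features(waveform, frame_len):
    """Return (zero crossings, power) for each complete frame of the waveform.

    Trailing samples that do not fill a whole frame are ignored.
    """
    features = []
    for start in range(0, len(waveform) - frame_len + 1, frame_len):
        frame = waveform[start:start + frame_len]
        features.append((zero_crossing_count(frame), mean_power(frame)))
    return features
```

For example, `extract_features([1, -1, 1, -1, 2, 2, 2, 2], 4)` yields one frame with three sign changes and unit power, followed by one frame with no sign changes and power 4.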

20. Apparatus for recognizing an utterance spoken by a user, said apparatus comprising;
- an audio input device that accepts speech input from the user including the utterance and provides an electrical signal responsive to the speech input;
  
  an analog-to-digital converter that converts the electrical signal into a digitized representation;
  a single pole digital difference filter that obtains a plurality of filtered waveforms from the digitized representation, wherein said single pole digital difference filter is in the form;
  Y(n) = AY(n-1) + BX(n) + CX(n-1);
  a feature extractor that extracts estimates of a plurality of acoustic parameters from the plurality of filtered waveforms at successive sampling points;
  
  a timer that determines an end time of the spoken utterance and a duration of the spoken utterance;
  
  a time-normalizer that time-normalizes said estimates so that the spoken utterance extends over a predetermined number of time intervals; and
  
  a classifier that analyzes the time-normalized estimates to identify the spoken utterance.
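
The time-normalizer maps an utterance of any duration onto a predetermined number of time intervals so that the classifier always sees an input of fixed size. One simple way to do this, sketched below in Python, is to average the per-frame estimates that fall into each interval. The averaging scheme and the name `time_normalize` are assumptions; the claims do not specify how the normalization is carried out.

```python
def time_normalize(frames, num_intervals):
    """Map a list of per-frame values onto exactly num_intervals values.

    Each output value is the mean of the frames in its interval. When there
    are fewer frames than intervals, frames are repeated.
    """
    n = len(frames)
    if n == 0:
        raise ValueError("utterance contains no frames")
    normalized = []
    for i in range(num_intervals):
        start = i * n // num_intervals
        # Guarantee at least one frame per interval.
        end = max(start + 1, (i + 1) * n // num_intervals)
        chunk = frames[start:end]
        normalized.append(sum(chunk) / len(chunk))
    return normalized
```

The end time and duration determined by the timer fix `len(frames)`, which is why normalization has to wait until the utterance has ended.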

Current Assignee: Sensory, Inc.
Original Assignee: Sensory, Inc.
Inventors: Mozer, Forrest S.; Mozer, Todd F.; Mozer, Michael C.
Primary Examiner(s): MacDonald, Allen R.
Assistant Examiner(s): Opsasnick, Michael N.
Application Number: US08/327,455
Time in Patent Office: 1,383 Days
Field of Search: 395/2.09, 395/2.1, 395/2.14, 395/2.15, 395/2.4, 395/2.41, 395/2.57, 395/2.58, 395/2.6, 395/2.62, 395/2.63, 395/21-24
US Class Current: 704/232
CPC Class Codes: G10L 15/16 using artificial neural net...
