Enhancing Speech Recognition Using Visual Information

US 20110224979A1
Filed: 03/09/2010
Published: 09/15/2011
Est. Priority Date: 03/09/2010
Status: Active Grant

Abstract

A speech recognition device uses visual information to narrow down the range of likely adaptation parameters even before a speaker makes an utterance. Images of the speaker and/or the environment are collected using an image capturing device and then processed to extract biometric features and environmental features. The extracted biometric and environmental features are then used to estimate adaptation parameters. A voice sample may also be collected to refine the adaptation parameters for more accurate speech recognition.
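
The narrowing-then-refining idea in the abstract can be illustrated with a small Bayesian sketch. This is not the patent's implementation: the warping-factor grid, the speaker classes, the prior values, and all function names below are assumptions made up for illustration. A visually estimated speaker class selects a prior distribution over candidate warping factors before any speech is heard, and a later voice sample reweights that distribution.

```python
# Hypothetical sketch: every name and number here is illustrative,
# none of it is taken from the patent.

# Candidate frequency-warping factors (assumed grid).
WARP_FACTORS = [0.88, 0.94, 1.00, 1.06, 1.12]

# Assumed prior weights over WARP_FACTORS for each visually estimated class.
PRIORS = {
    "adult": [0.30, 0.40, 0.20, 0.08, 0.02],
    "child": [0.02, 0.08, 0.20, 0.40, 0.30],
}


def normalize(weights):
    """Scale non-negative weights so that they sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return [w / total for w in weights]


def prior_from_class(speaker_class):
    """Return the prior over warping factors for a visually estimated class."""
    return normalize(PRIORS[speaker_class])


def refine_with_voice_sample(prior, likelihoods):
    """Bayes update: the posterior is proportional to prior * likelihood."""
    if len(prior) != len(likelihoods):
        raise ValueError("prior and likelihoods must have the same length")
    return normalize([p * l for p, l in zip(prior, likelihoods)])


def most_likely_factor(distribution):
    """Return the warping factor with the highest probability."""
    best = max(range(len(distribution)), key=distribution.__getitem__)
    return WARP_FACTORS[best]


prior = prior_from_class("adult")
print(most_likely_factor(prior))      # 0.94, chosen before any utterance

# Assumed per-factor likelihoods computed from a short voice sample.
posterior = refine_with_voice_sample(prior, [0.1, 0.1, 0.1, 0.9, 0.1])
print(most_likely_factor(posterior))  # 1.06, after the voice sample
```

Representing the parameter as a probability vector rather than a single value lets the visual estimate stay tentative: a voice sample that disagrees with the visual class can still shift the result.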

24 Claims

1. A method of performing speech recognition, comprising:
- capturing one or more images;
  
  extracting environmental features affecting reverberation of an audio signal or noise in the audio signal from the captured one or more images, the audio signal including a speaker's utterance;
  
  performing dereverberation or noise cancellation processing on the audio signal based on an environment adaptation parameter, the environment adaptation parameter determined from the extracted environmental features; and
  
  producing speech elements by processing the processed audio signal.
- - 2. The method of claim 1, further comprising:
    - extracting biometric features of the speaker from the one or more images;
      
      determining a biometric adaptation parameter based on the extracted biometric features; and
      
      performing adaptation based on the biometric adaptation parameter, the speech elements are produced based on the adaptation.
  - 3. The method of claim 2, wherein the biometric adaptation parameter is a vector representing a probability distribution of a warping factor for frequency warping the processed audio signal.
  - 4. The method of claim 2, wherein the biometric features comprise at least one of height, age, gender, weight and ethnicity of the speaker.
  - 5. The method of claim 2, further comprising estimating a class of the speaker based on the biometric features, the biometric adaptation parameter determined based on the estimated class of the speaker.
  - 6. The method of claim 2, further comprising:
    - receiving a voice sample of the speaker; and
      
      updating the biometric adaptation parameter based on the voice sample of the speaker.
  - 7. The method of claim 2, further comprising:
    - receiving training voice samples, environment information and biometric data of speakers of the training voice samples;
      
      generating acoustic models based on the received training voice samples;
      
      determining first correlation between values of the environment adaptation parameter and the environment information, the environment adaptation parameter determined based further on the first correlation; and
      
      determining second correlation between values of the biometric adaptation parameter and the biometric data, the biometric adaptation parameter determined based further on the second correlation.
  - 8. The method of claim 1, further comprising:
    - receiving a voice sample of the speaker; and
      
      updating the environment adaptation parameter based on the voice sample of the speaker.
  - 9. The method of claim 1, wherein extracting the environmental features comprises performing Simultaneous Localization and Mapping (SLAM) processing on the one or more images.
  - 10. The method of claim 1, wherein the environment adaptation parameter comprises Spectral Subtraction (SS) parameter.
  - 11. The method of claim 1, wherein the environmental features comprise at least one of a size of an enclosed area where the audio signal is generated, a configuration of the enclosed area, a location of the microphone within the enclosed area, and a location of the speaker within the enclosed area.
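
Claims 1, 10 and 11 together describe noise processing driven by a parameter derived from visual features such as room size. The sketch below shows generic magnitude-domain spectral subtraction with an over-subtraction factor. The mapping from room volume to that factor (`alpha_from_room_volume`), its constants, and the function names are hypothetical; the claims do not specify how the Spectral Subtraction parameter is computed.

```python
def alpha_from_room_volume(volume_m3):
    """Hypothetical mapping from a visually estimated room volume (m^3)
    to an over-subtraction factor between 1.0 and 3.0.

    The linear form and the 200 m^3 cap are assumptions for illustration.
    """
    clamped = min(max(volume_m3, 0.0), 200.0)
    return 1.0 + 2.0 * clamped / 200.0


def spectral_subtraction(noisy_mag, noise_mag, alpha=1.0, floor=0.05):
    """Subtract a scaled noise magnitude estimate from each frequency bin.

    noisy_mag: magnitudes of the noisy speech spectrum, one per bin
    noise_mag: estimated noise magnitudes, one per bin
    alpha:     over-subtraction factor (the environment-dependent parameter)
    floor:     fraction of the noisy magnitude kept as a spectral floor,
               which prevents negative magnitudes
    """
    if len(noisy_mag) != len(noise_mag):
        raise ValueError("spectra must have the same number of bins")
    return [max(y - alpha * n, floor * y) for y, n in zip(noisy_mag, noise_mag)]


# Usage: a 100 m^3 room yields alpha = 2.0 under the assumed mapping.
alpha = alpha_from_room_volume(100.0)
cleaned = spectral_subtraction([1.0, 0.5, 0.2], [0.1, 0.1, 0.1], alpha=alpha)
print(cleaned)
```

In the third bin the subtraction would reach zero, so the spectral floor takes over and keeps a small fraction of the original magnitude.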

12. A speech recognition device, comprising:
- an image capturing module configured to capture one or more images;
  
  a feature extractor coupled to the image capturing module, the feature extractor configured to extract environmental features affecting reverberation of an audio signal or noise in the audio signal from the captured one or more images, the audio signal including a speaker's utterance;
  
  an environment parameter estimator coupled to the feature extractor, the environment parameter estimator configured to determine an environment adaptation parameter based on the extracted environmental features;
  
  an audio signal processor coupled to the environment parameter estimator, the audio signal processor configured to perform dereverberation or noise cancellation processing on an audio signal based on the environment adaptation parameter; and
  
  a speech recognition engine coupled to the audio signal processor, the speech recognition engine configured to recognize speech elements based on the processed audio signal.
- - 13. The device of claim 12, further comprising a biometric parameter estimator coupled to the feature extractor, the biometric parameter estimator configured to determine a biometric adaptation parameter based on biometric features extracted from the one or more images, the speech recognition engine further configured to perform adaptation based on the biometric adaptation parameter and recognize speech elements based on the adaptation.
  - 14. The device of claim 13, wherein the biometric adaptation parameter is a vector representing a probability distribution of a warping factor for frequency warping the processed audio signal.
  - 15. The device of claim 13, wherein the biometric features comprise at least one of height, age, gender, weight and ethnicity of the speaker.
  - 16. The device of claim 13, further comprising a speaker class estimator configured to estimate a class of the speaker based on the biometric features, the biometric parameter estimator further configured to determine the biometric adaptation parameter based on the estimated class.
  - 17. The device of claim 13, further comprising a biometric parameter modifier configured to update the biometric adaptation parameter based on a voice sample of the speaker.
  - 18. The device of claim 13, further comprising an acoustic trainer configured to:
    - receive training voice samples, environment information and biometric data of speakers of the training voice samples;
      
      generate acoustic models based on the received training voice samples;
      
      determine first correlation between values of the environment adaptation parameter and the environment information, the environment parameter estimator configured to determine the environment adaptation parameter based further on the first correlation; and
      
      determine second correlation between values of the biometric adaptation parameter and the biometric data, the biometric parameter estimator further configured to determine the biometric adaptation parameter based further on the second correlation.
  - 19. The device of claim 12, further comprising an environment parameter modifier configured to update the environment adaptation parameter based on a voice sample of the speaker.
  - 20. The device of claim 12, wherein the feature extractor is configured to perform Simultaneous Localization and Mapping (SLAM) processing on the one or more images.
  - 21. The device of claim 12, wherein the environment adaptation parameter comprises Spectral Subtraction (SS) parameter.
  - 22. The device of claim 12, wherein the environmental features comprise at least one of a size of an enclosed area where the audio signal is generated, a configuration of the enclosed area, a location of the microphone within the enclosed area, and a location of the speaker within the enclosed area.
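
Claim 22 names the size of the enclosed area as an environmental feature. One way such a feature could feed an environment parameter estimator is through Sabine's formula, a standard room-acoustics approximation for reverberation time. The claims do not state that Sabine's formula is used; the box-shaped room, the single average absorption coefficient, and the function name below are assumptions for illustration.

```python
def sabine_rt60(length_m, width_m, height_m, avg_absorption):
    """Estimate reverberation time (seconds) of a box-shaped room.

    Sabine's formula in metric units: RT60 = 0.161 * V / (S * a), where
    V is the room volume in m^3, S is the total surface area in m^2, and
    a is the average absorption coefficient of the surfaces.
    """
    if min(length_m, width_m, height_m) <= 0:
        raise ValueError("room dimensions must be positive")
    if not 0.0 < avg_absorption <= 1.0:
        raise ValueError("avg_absorption must be in (0, 1]")
    volume = length_m * width_m * height_m
    surface = 2.0 * (length_m * width_m
                     + length_m * height_m
                     + width_m * height_m)
    return 0.161 * volume / (surface * avg_absorption)


# Usage: room dimensions as they might be estimated from captured images.
rt60 = sabine_rt60(5.0, 4.0, 3.0, avg_absorption=0.2)
print(round(rt60, 3))  # 0.514
```

For the 5 m by 4 m by 3 m room, V = 60 m^3 and S = 94 m^2, so RT60 = 0.161 * 60 / (94 * 0.2), roughly 0.51 seconds.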

23. A computer-readable storage medium structured to store instructions executable by a processor in speech recognition device, the instructions, when executed cause the processor to:
- capture one or more images;
  
  extract environmental features affecting reverberation of an audio signal or noise in the audio signal from the captured one or more images, the audio signal including a speaker's utterance;
  
  perform dereverberation or noise cancelling processing on an audio signal based on an environment adaptation parameter, the environment adaptation parameter determined from the extracted environmental features; and
  
  recognize speech elements based on the processed audio signal.
- - 24. The computer-readable storage medium of claim 23, further comprising instructions to:
    - extract biometric features of the speaker from the one or more images;
      
      determine a biometric adaptation parameter based on the extracted biometric features; and
      
      perform adaptation based on the biometric adaptation parameter, the recognizing of speech elements based on the adaptation.
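
The instruction sequence of claims 23 and 24 can be read as a small pipeline. The sketch below wires the steps together with injected callables. The class name, field names, and the stub functions in the usage example are all hypothetical stand-ins, not components described in the patent; real feature extraction and recognition are far more involved.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class VisualAdaptationPipeline:
    """Hypothetical wiring of the claimed steps; each stage is injected."""
    extract_environment: Callable   # images -> environmental features
    estimate_env_param: Callable    # features -> environment adaptation parameter
    clean_audio: Callable           # (audio, parameter) -> processed audio
    recognize: Callable             # (processed audio, biometric parameter) -> speech elements
    extract_biometrics: Optional[Callable] = None   # images -> biometric features
    estimate_bio_param: Optional[Callable] = None   # features -> biometric parameter

    def run(self, images, audio):
        # Claim 23: environment features drive dereverberation or noise processing.
        env_features = self.extract_environment(images)
        env_param = self.estimate_env_param(env_features)
        processed = self.clean_audio(audio, env_param)

        # Claim 24: the biometric adaptation step is applied only when configured.
        bio_param = None
        if self.extract_biometrics and self.estimate_bio_param:
            bio_param = self.estimate_bio_param(self.extract_biometrics(images))

        return self.recognize(processed, bio_param)


# Usage with trivial stubs in place of real vision and recognition components.
pipeline = VisualAdaptationPipeline(
    extract_environment=lambda images: {"image_count": len(images)},
    estimate_env_param=lambda features: 0.5 * features["image_count"],
    clean_audio=lambda audio, param: [x - param for x in audio],
    recognize=lambda audio, bio: ["loud" if x > 0 else "quiet" for x in audio],
)
print(pipeline.run(["img1", "img2"], [2.0, 0.5]))  # ['loud', 'quiet']
```

Passing the stages in as callables keeps the optional biometric path of claim 24 separate from the environment path of claim 23, so either can be replaced without touching the other.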

Current Assignee: Honda Motor Co., Ltd. (Honda Motor Company)
Original Assignee: Honda Motor Co., Ltd. (Honda Motor Company)
Inventors: Raux, Antoine R.
Granted Patent: US 8,660,842 B2
US Class Current: 704/233

CPC Class Codes

G10L 15/07   to the speaker

G10L 2021/02082   the noise being echo, rever...

G10L 2021/02163   Only one microphone

G10L 21/0208   Noise filtering
