Enhancing Speech Recognition Using Visual Information
Abstract
A speech recognition device uses visual information to narrow the range of likely adaptation parameters even before a speaker makes an utterance. Images of the speaker and/or the environment are collected using an image capturing device and then processed to extract biometric features and environmental features. The extracted biometric and environmental features are then used to estimate adaptation parameters. A voice sample may also be collected to refine the adaptation parameters for more accurate speech recognition.
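One way to picture how environmental features could narrow an adaptation parameter before any audio is heard is the sketch below. It is not taken from the patent: it assumes the image pipeline yields rough room dimensions and an average surface absorption coefficient (the `RoomFeatures` type is hypothetical), and it swaps in Sabine's classical formula to turn those into an estimated reverberation time (RT60). The `rt60_search_range` helper and its 30% tolerance are likewise illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class RoomFeatures:
    """Hypothetical environmental features extracted from images."""
    length_m: float
    width_m: float
    height_m: float
    absorption: float  # average absorption coefficient of the surfaces, in (0, 1]


def estimate_rt60(room: RoomFeatures) -> float:
    """Estimate reverberation time in seconds with Sabine's formula.

    RT60 = 0.161 * V / (S * alpha), with V in cubic metres and S in square metres.
    """
    volume = room.length_m * room.width_m * room.height_m
    surface = 2.0 * (
        room.length_m * room.width_m
        + room.length_m * room.height_m
        + room.width_m * room.height_m
    )
    if surface <= 0 or room.absorption <= 0:
        raise ValueError("room dimensions and absorption must be positive")
    return 0.161 * volume / (surface * room.absorption)


def rt60_search_range(rt60: float, tolerance: float = 0.3) -> Tuple[float, float]:
    """Return a narrowed (low, high) range around the visual estimate.

    The tolerance is an assumed value; a later voice sample could refine
    the parameter within this range.
    """
    return rt60 * (1.0 - tolerance), rt60 * (1.0 + tolerance)


if __name__ == "__main__":
    room = RoomFeatures(length_m=5.0, width_m=4.0, height_m=3.0, absorption=0.2)
    rt60 = estimate_rt60(room)
    print(f"estimated RT60: {rt60:.2f} s, search range: {rt60_search_range(rt60)}")
```

The point of the sketch is only that a coarse visual estimate bounds the parameter search, so that a dereverberation stage does not have to start from an unconstrained guess.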
60 Citations
24 Claims
1. A method of performing speech recognition, comprising:
capturing one or more images;
extracting environmental features affecting reverberation of an audio signal or noise in the audio signal from the captured one or more images, the audio signal including a speaker's utterance;
performing dereverberation or noise cancellation processing on the audio signal based on an environment adaptation parameter, the environment adaptation parameter determined from the extracted environmental features; and
producing speech elements by processing the processed audio signal.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11

12. A speech recognition device, comprising:
an image capturing module configured to capture one or more images;
a feature extractor coupled to the image capturing module, the feature extractor configured to extract environmental features affecting reverberation of an audio signal or noise in the audio signal from the captured one or more images, the audio signal including a speaker's utterance;
an environment parameter estimator coupled to the feature extractor, the environment parameter estimator configured to determine an environment adaptation parameter based on the extracted environmental features;
an audio signal processor coupled to the environment parameter estimator, the audio signal processor configured to perform dereverberation or noise cancellation processing on an audio signal based on the environment adaptation parameter; and
a speech recognition engine coupled to the audio signal processor, the speech recognition engine configured to recognize speech elements based on the processed audio signal.
Dependent claims: 13, 14, 15, 16, 17, 18, 19, 20, 21, 22

23. A computer-readable storage medium structured to store instructions executable by a processor in a speech recognition device, the instructions, when executed, cause the processor to:
capture one or more images;
extract environmental features affecting reverberation of an audio signal or noise in the audio signal from the captured one or more images, the audio signal including a speaker's utterance;
perform dereverberation or noise cancelling processing on an audio signal based on an environment adaptation parameter, the environment adaptation parameter determined from the extracted environmental features; and
recognize speech elements based on the processed audio signal.
Dependent claims: 24
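The module chain recited in the independent claims (image capture, feature extraction, environment parameter estimation, audio processing, recognition) can be sketched as a simple wiring of callables. Everything below is an illustrative assumption rather than the patented implementation: the class and function names are hypothetical, the "feature" is a toy flag, the audio processor is a plain noise gate standing in for real noise cancellation, and the recognizer is a stub that only reports whether any signal survived.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Image = List[List[float]]  # hypothetical grayscale image, values in [0, 1]
Audio = List[float]        # hypothetical mono samples in [-1, 1]


@dataclass
class SpeechRecognitionDevice:
    """Wires the five stages together in the order the claims recite them."""
    capture_images: Callable[[], List[Image]]
    extract_features: Callable[[List[Image]], Dict[str, bool]]
    estimate_parameter: Callable[[Dict[str, bool]], float]
    process_audio: Callable[[Audio, float], Audio]
    recognize: Callable[[Audio], List[str]]

    def run(self, audio: Audio) -> List[str]:
        images = self.capture_images()
        features = self.extract_features(images)
        parameter = self.estimate_parameter(features)
        cleaned = self.process_audio(audio, parameter)
        return self.recognize(cleaned)


def noise_gate(audio: Audio, threshold: float) -> Audio:
    """Toy stand-in for noise cancellation: zero samples below the threshold."""
    return [s if abs(s) >= threshold else 0.0 for s in audio]


def demo_device() -> SpeechRecognitionDevice:
    """Build a device from placeholder stages (all assumptions)."""
    return SpeechRecognitionDevice(
        capture_images=lambda: [[[0.2, 0.4], [0.6, 0.8]]],
        # Pretend a bright scene implies a noisy, machine-filled room.
        extract_features=lambda imgs: {
            "noisy_scene": sum(map(sum, imgs[0])) / 4.0 > 0.4
        },
        # Environment adaptation parameter: the gate threshold.
        estimate_parameter=lambda f: 0.05 if f["noisy_scene"] else 0.01,
        process_audio=noise_gate,
        recognize=lambda a: ["<speech>"] if any(s != 0.0 for s in a) else [],
    )


if __name__ == "__main__":
    print(demo_device().run([0.01, 0.5, -0.02, -0.7]))
```

Passing each stage in as a callable keeps the coupling between modules explicit, which mirrors how the device claim describes each component as coupled to the previous one.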
Specification