Enhancing speech recognition using visual information
Abstract
A speech recognition device uses visual information to narrow down the range of likely adaptation parameters even before a speaker makes an utterance. Images of the speaker and/or the environment are collected using an image capturing device and then processed to extract biometric features and environmental features. The extracted biometric and environmental features are then used to estimate adaptation parameters. A voice sample may also be collected to refine the adaptation parameters for more accurate speech recognition.
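The two-stage idea in the abstract (narrow the parameter range from images first, then refine with a voice sample) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and does not come from the patent text: the room classes, the numeric reverberation-time ranges, and the `prior_from_image` and `refine_with_voice_sample` helpers are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ParameterRange:
    """Closed interval of plausible values for one adaptation parameter."""
    low: float
    high: float

    def clamp(self, value: float) -> float:
        """Pull a value back inside the interval."""
        return min(max(value, self.low), self.high)


# Hypothetical lookup: visually estimated room class -> plausible
# reverberation-time range in seconds. The numbers are illustrative only.
RT60_RANGE_BY_ROOM = {
    "car_cabin": ParameterRange(0.05, 0.15),
    "office": ParameterRange(0.3, 0.7),
    "hall": ParameterRange(1.0, 2.5),
}


def prior_from_image(room_class: str) -> ParameterRange:
    """Stage 1: narrow the parameter range before any utterance is heard."""
    return RT60_RANGE_BY_ROOM[room_class]


def refine_with_voice_sample(prior: ParameterRange,
                             audio_estimate: Optional[float]) -> float:
    """Stage 2: refine within the visual prior if a voice sample exists.

    With no sample, fall back to the midpoint of the visual range.
    With a sample, keep its estimate but constrain it to the range.
    """
    if audio_estimate is None:
        return (prior.low + prior.high) / 2
    return prior.clamp(audio_estimate)
```

For example, `refine_with_voice_sample(prior_from_image("office"), 1.4)` constrains an implausibly large audio-only estimate to the upper end of the visually derived range. Clamping is only one possible refinement rule; the abstract does not say how the refinement is carried out.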
24 Claims
1. A method of performing speech recognition, comprising:
capturing one or more images;
extracting environmental features affecting reverberation of an audio signal or noise in the audio signal from the captured one or more images, the environmental features including at least a configuration of an enclosed area in which a speaker is located, the audio signal including the speaker's utterance;
determining an environment adaptation parameter based on the extracted environment features;
performing dereverberation or noise cancellation processing on the audio signal including the speaker's utterance based on the environment adaptation parameter; and
producing speech elements by processing the processed audio signal.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
12. A speech recognition device, comprising:
an image capturing module configured to capture one or more images;
a feature extractor coupled to the image capturing module, the feature extractor configured to extract environmental features affecting reverberation of an audio signal or noise in the audio signal from the captured one or more images, the environmental features including at least a configuration of an enclosed area in which a speaker is located, the audio signal including a speaker's utterance;
an environment parameter estimator coupled to the feature extractor, the environment parameter estimator configured to determine an environment adaptation parameter based on the extracted environment features and determine an environment adaptation parameter based on the extracted environmental features;
an audio signal processor coupled to the environment parameter estimator, the audio signal processor configured to perform dereverberation or noise cancellation processing on the audio signal including the speaker's utterance based on the environment adaptation parameter; and
a speech recognition engine coupled to the audio signal processor, the speech recognition engine configured to recognize speech elements based on the processed audio signal.
Dependent claims: 13, 14, 15, 16, 17, 18, 19, 20, 21, 22
23. A non-transitory computer-readable storage medium structured to store instructions executable by a processor in speech recognition device, the instructions, when executed cause the processor to:
capture one or more images;
extract environmental features affecting reverberation of an audio signal or noise in the audio signal from the captured one or more images, the environmental features including at least a configuration of an enclosed area in which a speaker is located, the audio signal including the speaker's utterance;
determine an environment adaptation parameter based on the extracted environment features;
perform dereverberation or noise cancellation processing on the audio signal including the speaker's utterance based on the environment adaptation parameter; and
recognize speech elements based on the processed audio signal.
Dependent claims: 24
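The independent claims recite determining an environment adaptation parameter from the configuration of the enclosed area, but the claim text above does not say how that parameter is computed. As one possible illustration, the Python sketch below derives a reverberation time from room dimensions using Sabine's approximation, a standard room-acoustics formula that is substituted here as an assumption and is not taken from the patent. The `EnclosureConfig` type, the `mean_absorption` field, and the `dereverb_filter_taps` helper are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnclosureConfig:
    """Hypothetical image-derived description of the enclosed area (metres)."""
    width: float
    depth: float
    height: float
    mean_absorption: float  # assumed average absorption coefficient, 0 < a <= 1

    @property
    def volume(self) -> float:
        return self.width * self.depth * self.height

    @property
    def surface_area(self) -> float:
        w, d, h = self.width, self.depth, self.height
        return 2 * (w * d + w * h + d * h)


def estimate_rt60(room: EnclosureConfig) -> float:
    """Sabine's approximation in SI units: RT60 = 0.161 * V / (S * a)."""
    if not 0 < room.mean_absorption <= 1:
        raise ValueError("mean_absorption must be in (0, 1]")
    return 0.161 * room.volume / (room.surface_area * room.mean_absorption)


def dereverb_filter_taps(rt60_s: float, sample_rate_hz: int = 16_000) -> int:
    """Illustrative choice: a filter length spanning the estimated tail."""
    return max(1, round(rt60_s * sample_rate_hz))


# A 5 m x 4 m x 3 m room with an assumed mean absorption of 0.2.
room = EnclosureConfig(width=5.0, depth=4.0, height=3.0, mean_absorption=0.2)
rt60 = estimate_rt60(room)          # roughly 0.51 s
taps = dereverb_filter_taps(rt60)   # filter length in samples at 16 kHz
```

The estimated value plays the role of the environment adaptation parameter; a downstream dereverberation or noise cancellation stage would consume it, for instance to size its filter as shown by `dereverb_filter_taps`.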
Specification