Audio-only backoff in audio-visual speech recognition system
Abstract
Techniques for performing audio-visual speech recognition, with improved recognition performance, in a degraded visual environment. For example, in one aspect of the invention, a technique for use in accordance with an audio-visual speech recognition system for improving a recognition performance thereof includes the steps/operations of: (i) selecting between an acoustic-only data model and an acoustic-visual data model based on a condition associated with a visual environment; and (ii) decoding at least a portion of an input spoken utterance using the selected data model. Advantageously, during periods of degraded visual conditions, the audio-visual speech recognition system is able to decode (recognize) input speech data using audio-only data, thus avoiding recognition inaccuracies that may result from performing speech recognition based on acoustic-visual data models and degraded visual data.
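The two operations in the abstract, selecting a model and then decoding with it, can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the `DataModel` type, the scalar `degradation` score, the `threshold` comparison, and the toy decoders are hypothetical and do not come from the patent text, which does not specify how degradation is measured in this section.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

Features = Sequence[float]


@dataclass(frozen=True)
class DataModel:
    """Hypothetical stand-in for a trained model: a name plus a decode function."""
    name: str
    decode: Callable[[Features, Features], str]


def select_model(degradation: float, threshold: float,
                 audio_only: DataModel, audio_visual: DataModel) -> DataModel:
    """Back off to the acoustic-only model when the visual data is too degraded.

    The scalar score and the threshold test are assumptions of this sketch.
    """
    return audio_only if degradation > threshold else audio_visual


def recognize(audio: Features, visual: Features, degradation: float,
              threshold: float, audio_only: DataModel,
              audio_visual: DataModel) -> str:
    """Step (i): select a data model; step (ii): decode with the selected model."""
    model = select_model(degradation, threshold, audio_only, audio_visual)
    return model.decode(audio, visual)


# Toy decoders that only report which inputs they would consume.
AUDIO_ONLY = DataModel(
    "acoustic-only",
    lambda a, v: f"decoded {len(a)} audio features",
)
AUDIO_VISUAL = DataModel(
    "acoustic-visual",
    lambda a, v: f"decoded {len(a)} audio + {len(v)} visual features",
)
```

With a threshold of 0.5, calling `recognize` with a degradation score of 0.9 routes the utterance to the acoustic-only decoder, while a score of 0.1 keeps the acoustic-visual decoder.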
22 Claims
1. A method of using a computer processor to improve speech recognition performance in an audio-visual speech recognition system comprising the steps of:
receiving audio data and visual data associated with an input spoken utterance;
using the computer processor to select between an acoustic-only data model and an acoustic-visual data model based on a level of degradation of the visual data; and
using the computer processor to decode at least a portion of at least one of the audio data and the visual data associated with the input spoken utterance using the selected data model.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
10. Apparatus to improve speech recognition performance in an audio-visual speech recognition system, the apparatus comprising:
a memory; and
at least one processor coupled to the memory and operative to:
(i) receive audio data and visual data associated with an input spoken utterance;
(ii) select between an acoustic-only data model and an acoustic-visual data model based on a level of degradation of the visual data; and
(iii) decode at least a portion of at least one of the audio data and the visual data associated with the input spoken utterance using the selected data model.
Dependent claims: 11, 12, 13, 14, 15, 16, 17, 18
19. An article of manufacture for use with a computer processor to improve speech recognition performance in an audio-visual speech recognition system, comprising a machine readable medium containing one or more programs which when executed implement the steps of:
receiving audio data and visual data associated with an input spoken utterance;
using the computer processor to select between an acoustic-only data model and an acoustic-visual data model based on a level of degradation of the visual data; and
using the computer processor to decode at least a portion of at least one of the audio data and the visual data associated with the input spoken utterance using the selected data model.
Dependent claims: 20
21. An audio-visual speech recognition system, comprising:
a memory; and
at least one processor coupled to the memory and operative to:
(i) receive audio data and visual data associated with an input spoken utterance;
(ii) select between an acoustic-only data model and an acoustic-visual data model based on a level of degradation of the visual data; and
(iii) decode at least a portion of at least one of the audio data and the visual data associated with the input spoken utterance using the selected data model, wherein the acoustic-only data model and the acoustic-visual data model are stored in the memory such that model selection is made by shifting one or more pointers to one or more memory locations where the selected model is located.
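The pointer-based selection recited in claim 21 can be loosely illustrated in Python, where rebinding an object reference plays the role of shifting a pointer. This is a sketch under stated assumptions, not the patented implementation: the `ModelStore` class and its `select` method are hypothetical names, and the models are treated as opaque objects that are already loaded.

```python
class ModelStore:
    """Hypothetical holder that keeps both models resident in memory
    and exposes a single 'active' reference to one of them."""

    def __init__(self, audio_only: object, audio_visual: object) -> None:
        self._audio_only = audio_only
        self._audio_visual = audio_visual
        # Assumption of this sketch: start with the acoustic-visual model.
        self.active = audio_visual

    def select(self, visual_degraded: bool) -> object:
        # Only the reference is rebound; no model data is copied or reloaded,
        # which is the Python analogue of shifting a pointer.
        self.active = self._audio_only if visual_degraded else self._audio_visual
        return self.active
```

Because both models stay loaded, switching costs a single reference assignment regardless of model size, which makes frequent switching cheap.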
22. A method of using a computer processor to improve speech recognition performance in a speech recognition system comprising the steps of:
receiving one or more frames of audio data and visual data associated with an input spoken utterance;
using the computer processor to select for a given frame between a first data model and at least a second data model based on a level of degradation of the visual data; and
using the computer processor to decode at least a portion of at least one of the audio data and the visual data associated with the input spoken utterance for the given frame using the selected data model.
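Claim 22 makes the selection per frame and allows more than two data models. One way this could look, as a sketch only, is to map each frame's degradation level onto an ordered list of models with sorted cut points. The function name, the use of a scalar degradation level per frame, and the cut-point scheme are all assumptions introduced for illustration.

```python
from bisect import bisect_right
from typing import List, Sequence


def select_models_per_frame(degradation: Sequence[float],
                            thresholds: Sequence[float],
                            model_names: Sequence[str]) -> List[str]:
    """Pick one model name per frame from its visual degradation level.

    `thresholds` must be sorted ascending; N thresholds split the range
    into N + 1 bands, so exactly N + 1 model names are required.
    """
    if len(model_names) != len(thresholds) + 1:
        raise ValueError("need exactly one more model than thresholds")
    if list(thresholds) != sorted(thresholds):
        raise ValueError("thresholds must be sorted in ascending order")
    # bisect_right returns the index of the band each level falls into.
    return [model_names[bisect_right(thresholds, level)] for level in degradation]
```

For example, with thresholds `[0.3, 0.7]` and models `["first", "second", "third"]`, frames with degradation levels `[0.1, 0.5, 0.9]` select `["first", "second", "third"]` respectively. With a single threshold and two models this reduces to the two-way choice described in the other independent claims.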
Specification