Audio-only backoff in audio-visual speech recognition system

US 7,251,603 B2
Filed: 06/23/2003
Issued: 07/31/2007
Est. Priority Date: 06/23/2003
Status: Active Grant

First Claim

1. A method of using a computer processor to improve speech recognition performance in an audio-visual speech recognition system comprising the steps of:

receiving audio data and visual data associated with an input spoken utterance;

using the computer processor to select between an acoustic-only data model and an acoustic-visual data model based on a level of degradation of the visual data; and

using the computer processor to decode at least a portion of at least one of the audio data and the visual data associated with the input spoken utterance using the selected data model.

3 Assignments

0 Petitions

Abstract

Techniques for performing audio-visual speech recognition, with improved recognition performance, in a degraded visual environment. For example, in one aspect of the invention, a technique for use in accordance with an audio-visual speech recognition system for improving a recognition performance thereof includes the steps/operations of: (i) selecting between an acoustic-only data model and an acoustic-visual data model based on a condition associated with a visual environment; and (ii) decoding at least a portion of an input spoken utterance using the selected data model. Advantageously, during periods of degraded visual conditions, the audio-visual speech recognition system is able to decode (recognize) input speech data using audio-only data, thus avoiding recognition inaccuracies that may result from performing speech recognition based on acoustic-visual data models and degraded visual data.

22 Claims

1. A method of using a computer processor to improve speech recognition performance in an audio-visual speech recognition system comprising the steps of:
- receiving audio data and visual data associated with an input spoken utterance;
  
  using the computer processor to select between an acoustic-only data model and an acoustic-visual data model based on a level of degradation of the visual data; and
  
  using the computer processor to decode at least a portion of at least one of the audio data and the visual data associated with the input spoken utterance using the selected data model.
- Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9)
  - 2. The method of claim 1, further comprising the step of storing the acoustic-only data model and the acoustic-visual data model in memory such that model selection is made by shifting one or more pointers to one or more memory locations where the selected model is located.
  - 3. The method of claim 1, wherein the model selection step is based on a likelihood ratio test.
  - 4. The method of claim 3, wherein the model selection step further comprises selecting the acoustic-only data model when a result of the likelihood test is not greater than a threshold value.
  - 5. The method of claim 3, wherein the model selection step further comprises selecting the acoustic-visual data model when a result of the likelihood test is not less than a threshold value.
  - 6. The method of claim 5, wherein the threshold value is based on a cost associated with a recognition error.
  - 7. The method of claim 3, wherein the likelihood ratio test is based on one or more observations of a given visual feature.
  - 8. The method of claim 7, wherein the given visual feature is associated with the mouth region of a speaker of the input utterance.
  - 9. The method of claim 1, wherein model selection is performed at a rate substantially equivalent to an observation rate associated with the audio-visual speech recognition system.
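
Claims 3 to 8 describe model selection through a likelihood ratio test on observations of a visual feature, compared against a threshold. The following is a minimal sketch of that idea, not the patented implementation. It assumes a scalar visual feature (for example, a mouth-region detection confidence) modelled by one Gaussian for clean video and one for degraded video; the function names, the Gaussian parameters, and the default threshold are all hypothetical.

```python
import math


def gaussian_logpdf(x, mean, std):
    """Log density of a univariate Gaussian with the given mean and std."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)


def log_likelihood_ratio(observations, clean=(0.8, 0.1), degraded=(0.3, 0.2)):
    """Sum over observations of log p(o | clean) - log p(o | degraded).

    The (mean, std) pairs are illustrative assumptions, not values
    taken from the patent.
    """
    return sum(
        gaussian_logpdf(o, *clean) - gaussian_logpdf(o, *degraded)
        for o in observations
    )


def select_model(observations, log_threshold=0.0):
    """Pick the acoustic-visual model only when the test exceeds the threshold.

    Otherwise back off to the acoustic-only model (compare claim 4).
    """
    llr = log_likelihood_ratio(observations)
    return "acoustic-visual" if llr > log_threshold else "acoustic-only"


if __name__ == "__main__":
    print(select_model([0.85, 0.80, 0.78]))  # feature values near the clean mean
    print(select_model([0.20, 0.35, 0.10]))  # feature values near the degraded mean
```

Working in the log domain keeps the sum numerically stable when several observations are combined; the threshold is then a log threshold.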

10. Apparatus to improve speech recognition performance in an audio-visual speech recognition system, the apparatus comprising:
- a memory; and
  
  at least one processor coupled to the memory and operative to:
  
  (i) receive audio data and visual data associated with an input spoken utterance;
  
  (ii) select between an acoustic-only data model and an acoustic-visual data model based on a level of degradation of the visual data; and
  
  (iii) decode at least a portion of at least one of the audio data and the visual data associated with the input spoken utterance using the selected data model.
- Dependent Claims (11, 12, 13, 14, 15, 16, 17, 18)
  - 11. The apparatus of claim 10, wherein the acoustic-only data model and the acoustic-visual data model are stored in the memory such that model selection is made by shifting one or more pointers to one or more memory locations where the selected model is located.
  - 12. The apparatus of claim 10, wherein the model selection operation is based on a likelihood ratio test.
  - 13. The apparatus of claim 12, wherein the model selection operation further comprises selecting the acoustic-only data model when a result of the likelihood test is not greater than a threshold value.
  - 14. The apparatus of claim 12, wherein the model selection operation further comprises selecting the acoustic-visual data model when a result of the likelihood test is not less than a threshold value.
  - 15. The apparatus of claim 14, wherein the threshold value is based on a cost associated with a recognition error.
  - 16. The apparatus of claim 12, wherein the likelihood ratio test is based on one or more observations of a given visual feature.
  - 17. The apparatus of claim 16, wherein the given visual feature is associated with the mouth region of a speaker of the input utterance.
  - 18. The apparatus of claim 10, wherein model selection is performed at a rate substantially equivalent to an observation rate associated with the audio-visual speech recognition system.
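
Claims 6 and 15 state only that the threshold is based on a cost associated with a recognition error; the listing does not give a formula. One standard way to derive such a threshold, shown here purely as an assumption, is the Bayes minimum-risk form of the likelihood ratio test. The symbols are hypothetical: $H_1$ means the visual data is usable, $H_0$ means it is degraded, $o$ is the visual feature observation, and $C_{ij}$ is the cost of deciding $H_i$ when $H_j$ is true.

```latex
\Lambda(o) = \frac{p(o \mid H_1)}{p(o \mid H_0)}
\;\underset{H_0}{\overset{H_1}{\gtrless}}\;
\eta = \frac{(C_{10} - C_{00})\, P(H_0)}{(C_{01} - C_{11})\, P(H_1)}
```

Under this reading, raising $C_{10}$ (the cost of trusting degraded video) raises $\eta$, so the system backs off to the acoustic-only model more readily.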

19. An article of manufacture for use with a computer processor to improve speech recognition performance in an audio-visual speech recognition system, comprising a machine readable medium containing one or more programs which when executed implement the steps of:
- receiving audio data and visual data associated with an input spoken utterance;
  
  using the computer processor to select between an acoustic-only data model and an acoustic-visual data model based on a level of degradation of the visual data; and
  
  using the computer processor to decode at least a portion of at least one of the audio data and the visual data associated with the input spoken utterance using the selected data model.
- Dependent Claims (20)
  - 20. The article of claim 19, further comprising the step of storing the acoustic-only data model and the acoustic-visual data model in memory such that model selection is made by shifting one or more pointers to one or more memory locations where the selected model is located.

21. An audio-visual speech recognition system, comprising:
- a memory; and
  
  at least one processor coupled to the memory and operative to:
  
  (i) receive audio data and visual data associated with an input spoken utterance;
  
  (ii) select between an acoustic-only data model and an acoustic-visual data model based on a level of degradation of the visual data; and
  
  (iii) decode at least a portion of at least one of the audio data and the visual data associated with the input spoken utterance using the selected data model, wherein the acoustic-only data model and the acoustic-visual data model are stored in the memory such that model selection is made by shifting one or more pointers to one or more memory locations where the selected model is located.
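
Claim 21 (like claims 2, 11, and 20) has both models resident in memory, with selection made by shifting pointers rather than loading or copying a model. Python has object references rather than raw pointers, so the sketch below only approximates the idea by rebinding a reference. The class name `ModelStore`, its methods, and the dictionary stand-ins for models are hypothetical.

```python
class ModelStore:
    """Holds both models in memory; selection only rebinds a reference."""

    def __init__(self, acoustic_only, acoustic_visual):
        self._models = {
            "acoustic-only": acoustic_only,
            "acoustic-visual": acoustic_visual,
        }
        # Start with the acoustic-visual model (an arbitrary choice for this sketch).
        self.active = acoustic_visual

    def select(self, name):
        """Point `active` at the named model. No model data is copied."""
        self.active = self._models[name]
        return self.active


if __name__ == "__main__":
    # Placeholder objects standing in for trained model parameters.
    ao = {"kind": "acoustic-only"}
    av = {"kind": "acoustic-visual"}
    store = ModelStore(ao, av)
    store.select("acoustic-only")
    print(store.active is ao)  # identity check: same object, not a copy
```

Because switching costs a single reference assignment, it is cheap enough to repeat at every observation, which fits the selection rate described in claims 9 and 18.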

22. A method of using a computer processor to improve speech recognition performance in a speech recognition system comprising the steps of:
- receiving one or more frames of audio data and visual data associated with an input spoken utterance;
  
  using the computer processor to select for a given frame between a first data model and at least a second data model based on a level of degradation of the visual data; and
  
  using the computer processor to decode at least a portion of at least one of the audio data and the visual data associated with the input spoken utterance for the given frame using the selected data model.
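
Claim 22 applies the selection frame by frame. The sketch below illustrates that loop under stated assumptions: each frame carries a hypothetical `visual_degradation` score in [0, 1], the first model is taken to be an acoustic-visual decoder and the second an acoustic-only decoder, and both decoders are plain callables supplied by the caller. None of these names or the default cutoff come from the patent.

```python
from typing import Callable, List, NamedTuple, Sequence


class Frame(NamedTuple):
    audio: Sequence[float]
    visual: Sequence[float]
    visual_degradation: float  # hypothetical score in [0, 1]; higher is worse


def decode_frames(
    frames: Sequence[Frame],
    audio_visual_model: Callable[[Sequence[float], Sequence[float]], str],
    audio_only_model: Callable[[Sequence[float]], str],
    max_degradation: float = 0.5,
) -> List[str]:
    """Decode each frame with the model selected for that frame."""
    results = []
    for frame in frames:
        if frame.visual_degradation > max_degradation:
            # Visual data too degraded: back off to audio only.
            results.append(audio_only_model(frame.audio))
        else:
            results.append(audio_visual_model(frame.audio, frame.visual))
    return results


if __name__ == "__main__":
    # Stub decoders that only report which model handled the frame.
    frames = [
        Frame([0.1], [0.9], visual_degradation=0.1),
        Frame([0.2], [0.0], visual_degradation=0.9),
    ]
    print(decode_frames(frames, lambda a, v: "AV", lambda a: "A"))
```

The degradation score could come from any detector, including a likelihood ratio test of the kind named in claim 3.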

Current Assignee
Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee
International Business Machines Corporation
Inventors
Haas, Norman; Connell, Jonathan H.; Potamianos, Gerasimos; Marcheret, Etienne; Neti, Chalapathy Venkata
Primary Examiner(s)
ARMSTRONG, ANGELA A

Application Number

US10/601,350
Publication Number

US 20040260554A1
Time in Patent Office

1,499 Days
Field of Search

704/231, 704/236-240, 704/251, 704/255, 704/270, 704/276, 704/275, 382/115, 382/159
US Class Current

704/270
CPC Class Codes

G10L 15/25 using position of the lips,...