Method and apparatus for audio-visual speech detection and recognition

US 6,816,836 B2
Filed: 08/30/2002
Issued: 11/09/2004
Est. Priority Date: 08/06/1999
Status: Expired due to Term

Abstract

Techniques for providing speech recognition comprise the steps of processing a video signal associated with an arbitrary content video source, processing an audio signal associated with the video signal, and recognizing at least a portion of the processed audio signal, using at least a portion of the processed video signal, to generate an output signal representative of the audio signal.

56 Claims

1. A method of providing speech recognition, the method comprising the steps of:
- processing a video signal associated with an arbitrary content video source;
  
  processing an audio signal associated with the video signal; and
  
  recognizing at least a portion of the processed audio signal, using at least a portion of the processed video signal, to generate an output signal representative of the audio signal.
- Dependent claims (2-26):
  - 2. The method of claim 1, wherein the audio signal is representative of conversational speech.
  - 3. The method of claim 1, wherein the video signal processing operation comprises the step of detecting whether the video signal associated with the arbitrary content video source contains one or more face candidates.
  - 4. The method of claim 3, wherein the video signal processing operation further comprises the step of detecting one or more facial features on one or more detected face candidates.
  - 5. The method of claim 4, wherein the video signal processing operation further comprises the step of detecting whether the one or more detected facial features are characteristic of respective face candidates in frontal poses.
  - 6. The method of claim 5, wherein at least one of face, facial feature and pose detection employ Fisher linear discriminant (FLD) analysis.
  - 7. The method of claim 5, wherein at least one of face, facial feature and pose detection employ a distance from face space (DFFS) measure.
  - 8. The method of claim 3, wherein the video signal processing operation comprises the step of extracting visual feature vectors from the video signal when a face is detected, the visual feature vectors representing one or more facial features of the detected face.
  - 9. The method of claim 8, wherein the visual feature vectors represent grey scale parameters.
  - 10. The method of claim 8, wherein the visual feature vectors represent geometric features.
  - 11. The method of claim 8, wherein the visual feature vectors are normalized with respect to pose estimates.
  - 12. The method of claim 8, wherein the audio signal processing operation comprises the step of extracting audio feature vectors from the associated audio signal, the audio feature vectors representing one or more acoustic features of the audio signal.
  - 13. The method of claim 12, wherein probabilities are assigned to phonemes based on the audio feature vectors.
  - 14. The method of claim 13, wherein probabilities are assigned to one of phonemes and visemes based on the visual feature vectors.
  - 15. The method of claim 14, wherein respective joint probabilities are formed from corresponding probabilities associated with the visual feature vectors and the audio feature vectors.
  - 16. The method of claim 15, wherein, in forming the joint probabilities, the corresponding probabilities associated with the visual feature vectors and the audio feature vectors are assigned weights based on a confidence measure.
  - 17. The method of claim 16, wherein the joint probabilities are used to search for an appropriate output signal representative of the audio signal.
  - 18. The method of claim 17, wherein the acoustic probabilities are used to search for a word hypothesis representative of the audio signal.
  - 19. The method of claim 18, wherein the word hypothesis is re-calculated using the visual probabilities to generate an appropriate output signal representative of the audio signal.
  - 20. The method of claim 12, wherein visual feature vectors are combined with corresponding audio feature vectors to form audio-visual feature vectors.
  - 21. The method of claim 20, wherein probabilities are assigned to one of phonemes and visemes based on the audio-visual feature vectors.
  - 22. The method of claim 21, wherein the probabilities are used to search for an appropriate output signal representative of the audio signal.
  - 23. The method of claim 1, wherein at least one of the video signal and the audio signal are compressed signals.
  - 24. The method of claim 1, wherein compressed signals are decompressed prior to processing operations.
  - 25. The method of claim 1, wherein the arbitrary content video source provides MPEG-2 standard signals.
  - 26. The method of claim 1, wherein the video signal includes at least one of visible electromagnetic spectrum images, non-visible electromagnetic spectrum images, and images from other sensing techniques.
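
Claims 13 through 16 describe assigning probabilities to phonemes (or visemes) from the audio and visual feature vectors separately, then forming joint probabilities weighted by a confidence measure. The claims do not fix a formula, so the following is only a minimal sketch under stated assumptions: a log-linear combination with a single weight `audio_weight` in [0, 1], and made-up posterior values. The function name `fuse_scores` and all numbers are hypothetical.

```python
import math


def fuse_scores(audio_probs, visual_probs, audio_weight):
    """Combine per-class audio and visual probabilities log-linearly.

    audio_weight stands in for a confidence-based weight (an assumption:
    the claims only say that weights depend on a confidence measure).
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    visual_weight = 1.0 - audio_weight
    eps = 1e-12  # guards against log(0)

    log_joint = {}
    for label in audio_probs:
        log_joint[label] = (
            audio_weight * math.log(audio_probs[label] + eps)
            + visual_weight * math.log(visual_probs[label] + eps)
        )

    # Renormalize so the fused scores sum to 1.
    peak = max(log_joint.values())
    unnormalized = {k: math.exp(v - peak) for k, v in log_joint.items()}
    total = sum(unnormalized.values())
    return {k: v / total for k, v in unnormalized.items()}


# Hypothetical posteriors: noisy audio slightly prefers /d/,
# while visible lip closure strongly suggests /b/.
audio = {"b": 0.40, "d": 0.45, "g": 0.15}
visual = {"b": 0.80, "d": 0.10, "g": 0.10}

fused = fuse_scores(audio, visual, audio_weight=0.5)
best = max(fused, key=fused.get)
```

Setting `audio_weight` to 1.0 reduces the result to the audio-only ranking, and 0.0 to the visual-only ranking; a confidence measure (for example, one derived from estimated acoustic noise) would choose a value in between.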

27. Apparatus for providing speech recognition, the apparatus comprising:
- at least one processor operable to:
  
  (i) process a video signal associated with an arbitrary content video source;
  
  (ii) process an audio signal associated with the video signal; and
  
  (iii) recognize at least a portion of the processed audio signal, using at least a portion of the processed video signal, to generate an output signal representative of the audio signal; and
  
  memory, coupled to the at least one processor, for storing at least a portion of results associated with at least one of the processing and recognizing operations.
- Dependent claims (28-53):
  - 28. The apparatus of claim 27, wherein the audio signal is representative of conversational speech.
  - 29. The apparatus of claim 27, wherein the video signal processing operation comprises the step of detecting whether the video signal associated with the arbitrary content video source contains one or more face candidates.
  - 30. The apparatus of claim 29, wherein the video signal processing operation further comprises the step of detecting one or more facial features on one or more detected face candidates.
  - 31. The apparatus of claim 30, wherein the video signal processing operation further comprises the step of detecting whether the one or more detected facial features are characteristic of respective face candidates in frontal poses.
  - 32. The apparatus of claim 31, wherein at least one of face, facial feature and pose detection employ Fisher linear discriminant (FLD) analysis.
  - 33. The apparatus of claim 31, wherein at least one of face, facial feature and pose detection employ a distance from face space (DFFS) measure.
  - 34. The apparatus of claim 29, wherein the video signal processing operation comprises the step of extracting visual feature vectors from the video signal when a face is detected, the visual feature vectors representing one or more facial features of the detected face.
  - 35. The apparatus of claim 34, wherein the visual feature vectors represent grey scale parameters.
  - 36. The apparatus of claim 34, wherein the visual feature vectors represent geometric features.
  - 37. The apparatus of claim 34, wherein the visual feature vectors are normalized with respect to pose estimates.
  - 38. The apparatus of claim 34, wherein the audio signal processing operation comprises the step of extracting audio feature vectors from the associated audio signal, the audio feature vectors representing one or more acoustic features of the audio signal.
  - 39. The apparatus of claim 38, wherein probabilities are assigned to phonemes based on the audio feature vectors.
  - 40. The apparatus of claim 39, wherein probabilities are assigned to one of phonemes and visemes based on the visual feature vectors.
  - 41. The apparatus of claim 40, wherein respective joint probabilities are formed from corresponding probabilities associated with the visual feature vectors and the audio feature vectors.
  - 42. The apparatus of claim 41, wherein, in forming the joint probabilities, the corresponding probabilities associated with the visual feature vectors and the audio feature vectors are assigned weights based on a confidence measure.
  - 43. The apparatus of claim 42, wherein the joint probabilities are used to search for an appropriate output signal representative of the audio signal.
  - 44. The apparatus of claim 39, wherein the acoustic probabilities are used to search for a word hypothesis representative of the audio signal.
  - 45. The apparatus of claim 44, wherein the word hypothesis is re-calculated using the visual probabilities to generate an appropriate output signal representative of the audio signal.
  - 46. The apparatus of claim 38, wherein visual feature vectors are combined with corresponding audio feature vectors to form audio-visual feature vectors.
  - 47. The apparatus of claim 46, wherein probabilities are assigned to one of phonemes and visemes based on the audio-visual feature vectors.
  - 48. The apparatus of claim 47, wherein the probabilities are used to search for an appropriate output signal representative of the audio signal.
  - 49. The apparatus of claim 27, wherein at least one of the video signal and the audio signal are compressed signals.
  - 50. The apparatus of claim 27, wherein compressed signals are decompressed prior to processing operations.
  - 51. The apparatus of claim 27, wherein the arbitrary content video source provides MPEG-2 standard signals.
  - 52. The apparatus of claim 27, wherein the video signal includes at least one of visible electromagnetic spectrum images, non-visible electromagnetic spectrum images, and images from other sensing techniques.
  - 53. The apparatus of claim 27, wherein a speech detection decision is made using information from at least one of the video signal, the audio signal, and a correlation between the video signal and the audio signal.
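
Claim 53 refers to a speech detection decision that may use a correlation between the video signal and the audio signal. As an illustration only, the sketch below assumes two per-frame measurements that the claim does not name: a short-time audio energy and a mouth-opening measure taken from the detected face. It computes their Pearson correlation and compares it with a threshold. The function names, the choice of features, and the threshold of 0.5 are all hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def speech_detected(audio_energy, mouth_opening, threshold=0.5):
    """Declare speech when audio energy tracks mouth movement closely enough."""
    return pearson(audio_energy, mouth_opening) >= threshold


# Hypothetical per-frame measurements over six frames.
energy = [0.1, 0.8, 0.9, 0.2, 0.7, 0.1]
talking_mouth = [0.0, 0.6, 0.7, 0.1, 0.5, 0.0]  # opens when the audio is loud
still_mouth = [0.25] * 6                         # no lip movement at all

speaking = speech_detected(energy, talking_mouth)
background = speech_detected(energy, still_mouth)
```

In this toy data the moving mouth correlates strongly with the audio energy, so speech is declared, while the motionless mouth yields no correlation, suggesting the sound comes from another source.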

54. A method of providing speech recognition, the method comprising the steps of:
- processing an image signal associated with an arbitrary content image source;
  
  processing an audio signal associated with the image signal; and
  
  recognizing at least a portion of the processed audio signal, using at least a portion of the processed image signal, to generate an output signal representative of the audio signal.

55. Apparatus for providing speech recognition, the apparatus comprising:
- at least one processor operable to:
  
  (i) process an image signal associated with an arbitrary content image source, (ii) process an audio signal associated with the image signal, and (iii) recognize at least a portion of the processed audio signal, using at least a portion of the processed image signal, to generate an output signal representative of the audio signal; and
  
  memory, coupled to the at least one processor, for storing at least a portion of results associated with at least one of the processing and recognizing operations.

56. Apparatus for providing speech recognition, the apparatus comprising:
- means for processing a video signal associated with an arbitrary content video source;
  
  means for processing an audio signal associated with the video signal; and
  
  means for recognizing at least a portion of the processed audio signal, using at least a portion of the processed video signal, to generate an output signal representative of the audio signal.

Current Assignee: Nuance Communications, Inc. (Microsoft Corporation)
Original Assignee: International Business Machines Corporation
Inventors: Senior, Andrew William; Neti, Chalapathy Venkata; de Cuetos, Philippe Christian; Maes, Stephane Herman; Basu, Sankar
Primary Examiner(s): ABEBE, DANIEL DEMELASH
Application Number: US10/231,676
Publication Number: US 20030018475A1
Time in Patent Office: 802 Days
Field of Search: 704/246, 704/250, 704/251, 704/270, 704/275, 382/100, 382/118
US Class Current: 704/270
CPC Class Codes:
G06V 40/161   Detection; Localisation; No...
G06V 40/20   Movements or behaviour, e.g...
G10L 15/25   using position of the lips,...
G10L 25/78   Detection of presence or ab...
