Methods and apparatus for audio-visual speech detection and recognition

US 6,594,629 B1
Filed: 08/06/1999
Issued: 07/15/2003
Est. Priority Date: 08/06/1999
Status: Expired due to Term

Abstract

In a first aspect of the invention, methods and apparatus for providing speech recognition comprise the steps of processing a video signal associated with an arbitrary content video source, processing an audio signal associated with the video signal, and decoding the processed audio signal in conjunction with the processed video signal to generate a decoded output signal representative of the audio signal. In a second aspect of the invention, methods and apparatus for providing speech detection in accordance with a speech recognition system comprise the steps of processing a video signal associated with a video source to detect whether one or more features associated with the video signal are representative of speech, and processing an audio signal associated with the video signal in accordance with the speech recognition system to generate a decoded output signal representative of the audio signal when the one or more features associated with the video signal are representative of speech. Speech detection may also be performed using information from both the video path and the audio path simultaneously.

21 Claims

1. A method of providing speech detection in accordance with a speech recognition system, the method comprising the steps of:
- processing a video signal associated with an arbitrary content video source to detect whether one or more features associated with the video signal are representative of speech, wherein at least one of the one or more features represents verbal motion associated with formation of speech; and
  
  processing an audio signal associated with the video signal in accordance with the speech recognition system to generate an output signal representative of the video signal when at least one of the one or more features associated with the video signal is representative of speech.

2. The method of claim 1, wherein the video signal processing operation comprises the step of detecting if one of a mouth opening and mouth motion occurs in the video signal.
3. The method of claim 2, wherein the audio signal processing operation comprises the steps of enabling a microphone of the speech recognition system and buffering the audio signal when a mouth opening is detected.
4. The method of claim 3, wherein the video signal processing operation further comprises the step of performing mouth opening pattern recognition on subsequent portions of the video signal.
5. The method of claim 4, wherein the audio signal processing operation comprises the step of decoding the buffered audio signal when the mouth opening pattern recognition operation indicates that the subsequent portions of the video signal are representative of speech.
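
The flow recited in claims 2 through 5 (detect a mouth opening, enable the microphone and buffer audio, run pattern recognition on later frames, decode the buffer only if those frames look like speech) can be illustrated with a small state machine. This is a minimal sketch, not the patented implementation: the class name `VisualSpeechGate`, the frame-count thresholds, and the per-frame boolean inputs are all hypothetical, and the claims do not specify how long to wait before accepting or discarding the buffer.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import List, Optional


class State(Enum):
    IDLE = auto()       # microphone disabled, nothing buffered
    BUFFERING = auto()  # mouth opening seen; audio is being buffered


@dataclass
class VisualSpeechGate:
    """Hypothetical gate: buffer audio on mouth opening, release it once
    later frames look like speech, drop it otherwise."""
    confirm_frames: int = 3  # assumed: speech-like frames needed to accept
    reject_frames: int = 5   # assumed: non-speech frames before giving up
    state: State = State.IDLE
    buffer: List[bytes] = field(default_factory=list)
    _speech_like: int = 0
    _not_speech: int = 0

    def step(self, mouth_open: bool, speech_pattern: bool,
             audio_chunk: bytes) -> Optional[List[bytes]]:
        """Feed one video observation with its synchronized audio chunk.

        Returns the buffered audio when the visual evidence confirms speech
        (a caller would hand it to the decoder), otherwise None.
        """
        if self.state is State.IDLE:
            if mouth_open:
                # Mouth opening detected: start buffering (cf. claim 3).
                self.state = State.BUFFERING
                self.buffer = [audio_chunk]
                self._speech_like = 0
                self._not_speech = 0
            return None

        # BUFFERING: keep audio while the pattern recognizer watches the mouth.
        self.buffer.append(audio_chunk)
        if speech_pattern:
            self._speech_like += 1
        else:
            self._not_speech += 1

        if self._speech_like >= self.confirm_frames:
            # Visual pattern confirmed: release the buffer (cf. claim 5).
            released, self.buffer = self.buffer, []
            self.state = State.IDLE
            return released
        if self._not_speech >= self.reject_frames:
            # Not speech after all: discard the buffered audio.
            self.buffer = []
            self.state = State.IDLE
        return None
```

In this sketch the audio captured from the moment the mouth opens is preserved, so the start of an utterance is not lost while the visual recognizer is still deciding.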

6. A method of providing speech detection in accordance with a speech recognition system, the method comprising the steps of:
- processing a video signal associated with a video source;
  
  processing an audio signal associated with the video signal;
  
  comparing the processed audio signal with the processed video signal to determine a level of correlation between the signals; and
  
  making a speech detection decision based on the correlation level.
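
Claim 6 can be illustrated by correlating a per-frame audio measurement with a per-frame visual measurement and thresholding the result. The claim does not say which features are compared or how correlation is measured, so the following is only a sketch under stated assumptions: audio energy and a mouth-opening measure as the two sequences, Pearson correlation as the correlation level, and a hypothetical threshold of 0.5.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences.

    Returns 0.0 when either sequence is constant (correlation undefined).
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences of at least 2 samples")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0.0 or vy == 0.0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def detect_speech(audio_energy: Sequence[float],
                  mouth_opening: Sequence[float],
                  threshold: float = 0.5) -> bool:
    """Assumed decision rule: speech is present when the audio energy
    tracks the mouth opening closely enough."""
    return pearson(audio_energy, mouth_opening) >= threshold


# Example: audio energy rises and falls with the mouth opening.
energy = [0.1, 0.9, 0.8, 0.1, 0.7]
mouth = [0.0, 1.0, 0.9, 0.1, 0.8]
speaking = detect_speech(energy, mouth)
```

A closed, motionless mouth yields a constant visual sequence, which this sketch treats as zero correlation, so loud background audio alone would not trigger a detection.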

7. Apparatus for providing speech detection in accordance with a speech recognition system, the apparatus comprising:
- at least one processor operable to:
  
  (i) process a video signal associated with an arbitrary content video source to detect whether one or more features associated with the video signal are representative of speech, wherein at least one of the one or more features represents verbal motion associated with formation of speech, and (ii) process an audio signal associated with the video signal in accordance with the speech recognition system to generate an output signal representative of the audio signal when at least one of the one or more features associated with the video signal is representative of speech; and
  
  memory, coupled to the at least one processor, for storing at least a portion of results associated with at least one of the processing operations.

8. Apparatus for providing speech detection in accordance with a speech recognition system, the apparatus comprising:
- at least one processor operable to:
  
  (i) process a video signal associated with a video source;
  
  (ii) process an audio signal associated with the video signal;
  
  (iii) compare the processed audio signal with the processed video signal to determine a level of correlation between the signals; and
  
  (iv) make a speech detection decision based on the correlation level; and
  
  memory, coupled to the at least one processor, for storing at least a portion of results associated with at least one of the processing, comparing and decision making operations.

9. A method of providing speech recognition, the method comprising the steps of:
- processing a video signal associated with an arbitrary content video source;
  
  processing an audio signal associated with the video signal; and
  
  recognizing the processed audio signal in conjunction with the processed video signal to generate an output signal representative of the audio signal;
  
  wherein the video signal processing step includes the steps of:
  
  detecting whether the video signal associated with the arbitrary content video source contains one or more face candidates;
  
  detecting one or more facial features on one or more detected face candidates; and
  
  detecting whether the one or more detected facial features are characteristic of respective face candidates in frontal poses, wherein the frontal pose detection step includes making a determination to discard certain face candidates.

10. The method of claim 9, wherein the discard determination step comprises the steps of:
11. The method of claim 9, wherein the discard determination step comprises the step of discarding face candidates having a low facial feature score for a facial feature such as at least one of eyes, nose, and mouth.
12. The method of claim 9, wherein the discard determination step comprises the step of discarding face candidates based on a ratio of the distance between each eye and the center of the nose.
13. The method of claim 9, wherein the discard determination step comprises the step of discarding face candidates based on a ratio of the distance between each eye and the side of the face region.
14. The method of claim 9, wherein the discard determination step comprises the step of matching face candidates to at least one of a face template and a pose template.
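
The discard tests in claims 11 and 12 can be sketched as a simple filter over face candidates. Everything specific here is an assumption made for illustration: the `FaceCandidate` structure, scores normalized to the range 0 to 1, and the two thresholds. The claims only state that candidates are discarded for a low facial feature score or based on a ratio of the eye-to-nose distances; the intuition used below is that in a frontal pose the two eyes are roughly equidistant from the center of the nose.

```python
import math
from dataclasses import dataclass
from typing import Dict, Tuple

Point = Tuple[float, float]


@dataclass
class FaceCandidate:
    """Hypothetical container for one detected face candidate."""
    left_eye: Point
    right_eye: Point
    nose: Point                       # center of the nose
    feature_scores: Dict[str, float]  # assumed detector scores in [0, 1]


def eye_nose_ratio(c: FaceCandidate) -> float:
    """Shorter eye-to-nose distance divided by the longer one.

    Close to 1.0 for a frontal face, smaller as the head turns.
    """
    d_left = math.dist(c.left_eye, c.nose)
    d_right = math.dist(c.right_eye, c.nose)
    longer = max(d_left, d_right)
    return min(d_left, d_right) / longer if longer > 0 else 0.0


def keep_candidate(c: FaceCandidate,
                   min_score: float = 0.5,
                   min_ratio: float = 0.8) -> bool:
    """Return False when the candidate should be discarded."""
    # Low score for any facial feature (cf. claim 11).
    if any(score < min_score for score in c.feature_scores.values()):
        return False
    # Asymmetric eye-to-nose distances suggest a non-frontal pose (cf. claim 12).
    return eye_nose_ratio(c) >= min_ratio


scores = {"eyes": 0.9, "nose": 0.8, "mouth": 0.7}
frontal = FaceCandidate((30.0, 40.0), (70.0, 40.0), (50.0, 60.0), scores)
turned = FaceCandidate((30.0, 40.0), (70.0, 40.0), (65.0, 60.0), scores)
```

With these assumed thresholds, `keep_candidate(frontal)` accepts the symmetric face and `keep_candidate(turned)` rejects the one whose nose sits much closer to one eye. The same pattern would extend to the eye-to-face-edge ratio of claim 13 given the bounds of the face region.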

15. Apparatus for providing speech recognition, the apparatus comprising:
- at least one processor operable to:
  
  (i) process a video signal associated with an arbitrary content video source, (ii) process an audio signal associated with the video signal, and (iii) recognize the processed audio signal in conjunction with the processed video signal to generate an output signal representative of the audio signal;
  
  wherein the video signal processing operation includes:
  
  detecting whether the video signal associated with the arbitrary content video source contains one or more face candidates;
  
  detecting one or more facial features on one or more detected face candidates; and
  
  detecting whether the one or more detected facial features are characteristic of respective face candidates in frontal poses, wherein the frontal pose detection operation includes making a determination to discard certain face candidates; and
  
  memory, coupled to the at least one processor, for storing at least a portion of results associated with at least one of the processing and recognizing operations.

16. The apparatus of claim 15, wherein the discard determination operation comprises:
17. The apparatus of claim 15, wherein the discard determination operation comprises discarding face candidates having a low facial feature score for a facial feature such as at least one of eyes, nose, and mouth.
18. The apparatus of claim 15, wherein the discard determination operation comprises discarding face candidates based on a ratio of the distance between each eye and the center of the nose.
19. The apparatus of claim 15, wherein the discard determination operation comprises discarding face candidates based on a ratio of the distance between each eye and the side of the face region.
20. The apparatus of claim 15, wherein the discard determination operation comprises matching face candidates to at least one of a face template and a pose template.

21. Apparatus for providing speech detection in accordance with a speech recognition system, the apparatus comprising:
- means for processing a video signal associated with an arbitrary content video source to detect whether one or more features associated with the video signal are representative of speech, wherein at least one of the one or more features represents verbal motion associated with formation of speech; and
  
  means for processing an audio signal associated with the video signal in accordance with the speech recognition system to generate an output signal representative of the audio signal when at least one of the one or more features associated with the video signal is representative of speech.

Current Assignee: Nuance Communications, Inc. (Microsoft Corporation)
Original Assignee: International Business Machines Corporation
Inventors: Senior, Andrew William; Maes, Stephane Herman; Neti, Chalapathy Venkata; Basu, Sankar; de Cuetos, Philippe Christian
Primary Examiner(s): ABEBE, DANIEL DEMELASH
Application Number: US09/369,707
Time in Patent Office: 1,439 Days
Field of Search: 704/270, 704/231, 704/251, 704/233, 704/246, 382/115, 382/118, 379/88.01
US Class Current: 704/251
CPC Class Codes:
G06V 40/161   Detection; Localisation; No...
G06V 40/20   Movements or behaviour, e.g...
G10L 15/25   using position of the lips,...
G10L 25/78   Detection of presence or ab...