Methods and apparatus for audio-visual speech detection and recognition
Abstract
In a first aspect of the invention, methods and apparatus for providing speech recognition comprise the steps of processing a video signal associated with an arbitrary content video source, processing an audio signal associated with the video signal, and decoding the processed audio signal in conjunction with the processed video signal to generate a decoded output signal representative of the audio signal. In a second aspect of the invention, methods and apparatus for providing speech detection in accordance with a speech recognition system comprise the steps of processing a video signal associated with a video source to detect whether one or more features associated with the video signal are representative of speech, and processing an audio signal associated with the video signal in accordance with the speech recognition system to generate a decoded output signal representative of the audio signal when the one or more features associated with the video signal are representative of speech. Speech detection may also be performed using information from both the video path and the audio path simultaneously.
21 Claims
-
1. A method of providing speech detection in accordance with a speech recognition system, the method comprising the steps of:
-
processing a video signal associated with an arbitrary content video source to detect whether one or more features associated with the video signal are representative of speech, wherein at least one of the one or more features represents verbal motion associated with formation of speech; and
processing an audio signal associated with the video signal in accordance with the speech recognition system to generate an output signal representative of the audio signal when at least one of the one or more features associated with the video signal is representative of speech.
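The gating behaviour recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: the per-frame feature vectors, the `motion_score` measure, the `motion_threshold` value, and the `recognize` callback are all hypothetical stand-ins chosen for illustration. Audio is passed to the recognizer only for frames where the visual features indicate verbal motion.

```python
from typing import Any, Callable, List, Optional, Sequence


def motion_score(prev: Sequence[float], cur: Sequence[float]) -> float:
    """Mean absolute change between two consecutive visual feature vectors.

    Assumption: the vectors hold mouth-region measurements, so a large
    change is treated as a proxy for verbal motion.
    """
    if not cur:
        return 0.0
    return sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur)


def gated_recognition(
    video_features: Sequence[Sequence[float]],
    audio_chunks: Sequence[Any],
    recognize: Callable[[Any], Any],
    motion_threshold: float = 0.1,  # hypothetical value, not from the source
) -> List[Optional[Any]]:
    """Run the recognizer on an audio chunk only when the matching video
    frame shows motion above the threshold; otherwise emit None."""
    outputs: List[Optional[Any]] = []
    prev: Optional[Sequence[float]] = None
    for feats, chunk in zip(video_features, audio_chunks):
        speaking = prev is not None and motion_score(prev, feats) > motion_threshold
        outputs.append(recognize(chunk) if speaking else None)
        prev = feats
    return outputs
```

In this sketch the first frame never triggers recognition because there is no earlier frame to compare against; a real system would likely smooth the decision over a window of frames.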
-
-
6. A method of providing speech detection in accordance with a speech recognition system, the method comprising the steps of:
-
processing a video signal associated with a video source;
processing an audio signal associated with the video signal;
comparing the processed audio signal with the processed video signal to determine a level of correlation between the signals; and
making a speech detection decision based on the correlation level.
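One way to picture the comparison in claim 6 is a correlation test between a per-frame audio measurement and a per-frame visual measurement. The sketch below uses the Pearson correlation coefficient, which is an assumption: the claim only requires "a level of correlation" and does not name a measure. The input series (`audio_energy`, `mouth_opening`) and the `threshold` default are likewise hypothetical.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no correlation information
    return cov / math.sqrt(var_x * var_y)


def detect_speech(
    audio_energy: Sequence[float],
    mouth_opening: Sequence[float],
    threshold: float = 0.5,  # hypothetical decision threshold
) -> bool:
    """Declare speech when the audio and visual series move together."""
    return pearson(audio_energy, mouth_opening) > threshold
```

A static face paired with loud audio (for example, background noise) yields a correlation of zero here, so no speech is declared.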
-
-
7. Apparatus for providing speech detection in accordance with a speech recognition system, the apparatus comprising:
-
at least one processor operable to:
(i) process a video signal associated with an arbitrary content video source to detect whether one or more features associated with the video signal are representative of speech, wherein at least one of the one or more features represents verbal motion associated with formation of speech, and (ii) process an audio signal associated with the video signal in accordance with the speech recognition system to generate an output signal representative of the audio signal when at least one of the one or more features associated with the video signal is representative of speech; and
memory, coupled to the at least one processor, for storing at least a portion of results associated with at least one of the processing operations.
-
-
8. Apparatus for providing speech detection in accordance with a speech recognition system, the apparatus comprising:
at least one processor operable to:
(i) process a video signal associated with a video source;
(ii) process an audio signal associated with the video signal;
(iii) compare the processed audio signal with the processed video signal to determine a level of correlation between the signals; and
(iv) make a speech detection decision based on the correlation level; and
memory, coupled to the at least one processor, for storing at least a portion of results associated with at least one of the processing, comparing and decision making operations.
-
9. A method of providing speech recognition, the method comprising the steps of:
-
processing a video signal associated with an arbitrary content video source;
processing an audio signal associated with the video signal; and
recognizing the processed audio signal in conjunction with the processed video signal to generate an output signal representative of the audio signal;
wherein the video signal processing step includes the steps of:
detecting whether the video signal associated with the arbitrary content video source contains one or more face candidates;
detecting one or more facial features on one or more detected face candidates; and
detecting whether the one or more detected facial features are characteristic of respective face candidates in frontal poses, wherein the frontal pose detection step includes making a determination to discard certain face candidates.
-
10. The method of claim 9, wherein the discard determination step comprises the steps of:
combining facial feature scores of a face candidate;
comparing the combination to a threshold value; and
discarding the face candidate when the score combination is not greater than the threshold value.
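The three steps above (combine, compare, discard) can be sketched as follows. The combination rule is an assumption: a weighted sum is used here, whereas the claim does not say how the scores are combined. The feature names and the `weights` parameter are hypothetical.

```python
from typing import Dict, List, Optional


def combined_score(
    scores: Dict[str, float],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Weighted sum of per-feature scores; missing weights default to 1.0."""
    weights = weights or {}
    return sum(weights.get(name, 1.0) * value for name, value in scores.items())


def keep_face_candidates(
    candidates: List[Dict[str, float]],
    threshold: float,
    weights: Optional[Dict[str, float]] = None,
) -> List[Dict[str, float]]:
    """Keep a candidate only when its combined score is strictly greater
    than the threshold, i.e. discard it when the score is not greater."""
    return [c for c in candidates if combined_score(c, weights) > threshold]
```

Note the strict inequality: a candidate whose combined score exactly equals the threshold is discarded, matching the "not greater than" wording.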
-
-
11. The method of claim 9, wherein the discard determination step comprises the step of discarding face candidates having a low facial feature score for a facial feature such as at least one of eyes, nose, and mouth.
-
12. The method of claim 9, wherein the discard determination step comprises the step of discarding face candidates based on a ratio of the distance between each eye and the center of the nose.
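As an illustration of the ratio test in claim 12, the sketch below assumes that in a frontal pose the two eye-to-nose distances are roughly equal, so their ratio is close to 1. That interpretation, the 2-D point representation, and the `max_ratio` tolerance are assumptions made for the example and are not stated in the claim.

```python
import math
from typing import Tuple

Point = Tuple[float, float]  # (x, y) pixel coordinates, assumed representation


def eye_nose_ratio(left_eye: Point, right_eye: Point, nose: Point) -> float:
    """Ratio of the larger eye-to-nose distance to the smaller one (>= 1)."""
    d_left = math.dist(left_eye, nose)
    d_right = math.dist(right_eye, nose)
    smaller, larger = sorted((d_left, d_right))
    if smaller == 0:
        return math.inf  # degenerate geometry: an eye coincides with the nose
    return larger / smaller


def is_frontal(
    left_eye: Point,
    right_eye: Point,
    nose: Point,
    max_ratio: float = 1.25,  # hypothetical tolerance
) -> bool:
    """Treat the candidate as frontal when the distances are nearly symmetric."""
    return eye_nose_ratio(left_eye, right_eye, nose) <= max_ratio
```

A candidate for which `is_frontal` returns `False` would be discarded under this reading of the claim.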
-
13. The method of claim 9, wherein the discard determination step comprises the step of discarding face candidates based on a ratio of the distance between each eye and the side of the face region.
-
14. The method of claim 9, wherein the discard determination step comprises the step of matching face candidates to at least one of a face template and a pose template.
-
15. Apparatus for providing speech recognition, the apparatus comprising:
-
at least one processor operable to:
(i) process a video signal associated with an arbitrary content video source, (ii) process an audio signal associated with the video signal, and (iii) recognize the processed audio signal in conjunction with the processed video signal to generate an output signal representative of the audio signal;
wherein the video signal processing operation includes:
detecting whether the video signal associated with the arbitrary content video source contains one or more face candidates;
detecting one or more facial features on one or more detected face candidates; and
detecting whether the one or more detected facial features are characteristic of respective face candidates in frontal poses, wherein the frontal pose detection operation includes making a determination to discard certain face candidates; and
memory, coupled to the at least one processor, for storing at least a portion of results associated with at least one of the processing and recognizing operations.
-
16. The apparatus of claim 15, wherein the discard determination operation comprises:
combining facial feature scores of a face candidate;
comparing the combination to a threshold value; and
discarding the face candidate when the score combination is not greater than the threshold value.
-
-
17. The apparatus of claim 15, wherein the discard determination operation comprises discarding face candidates having a low facial feature score for a facial feature such as at least one of eyes, nose, and mouth.
-
18. The apparatus of claim 15, wherein the discard determination operation comprises discarding face candidates based on a ratio of the distance between each eye and the center of the nose.
-
19. The apparatus of claim 15, wherein the discard determination operation comprises discarding face candidates based on a ratio of the distance between each eye and the side of the face region.
-
20. The apparatus of claim 15, wherein the discard determination operation comprises matching face candidates to at least one of a face template and a pose template.
-
21. Apparatus for providing speech detection in accordance with a speech recognition system, the apparatus comprising:
-
means for processing a video signal associated with an arbitrary content video source to detect whether one or more features associated with the video signal are representative of speech, wherein at least one of the one or more features represents verbal motion associated with formation of speech; and
means for processing an audio signal associated with the video signal in accordance with the speech recognition system to generate an output signal representative of the audio signal when at least one of the one or more features associated with the video signal is representative of speech.
-
Specification