RGB/DEPTH CAMERA FOR IMPROVING SPEECH RECOGNITION
2 Assignments
0 Petitions
Abstract
A system and method are disclosed for facilitating speech recognition through the processing of visual speech cues. These speech cues may include the position of the lips, tongue and/or teeth during speech. In one embodiment, upon capture of a frame of data by an image capture device, the system identifies a speaker and a location of the speaker. The system then focuses in on the speaker to get a clear image of the speaker's mouth. The system includes a visual speech cues engine which operates to recognize and distinguish sounds based on the captured position of the speaker's lips, tongue and/or teeth. The visual speech cues data may be synchronized with the audio data to ensure the visual speech cues engine is processing image data which corresponds to the correct audio data.
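The synchronization of visual speech cues with audio described in the abstract can be pictured as timestamp matching. The following is a minimal sketch only, assuming both streams carry sorted timestamps in seconds; the function name `align_frames_to_audio` and the `tolerance` value are hypothetical and do not come from the patent.

```python
from bisect import bisect_left


def align_frames_to_audio(frame_times, audio_times, tolerance=0.02):
    """Pair each video frame timestamp with the nearest audio timestamp.

    Both lists hold seconds and must be sorted ascending. A frame with no
    audio timestamp within `tolerance` seconds is paired with None.
    """
    pairs = []
    for t in frame_times:
        i = bisect_left(audio_times, t)
        # Only the audio timestamps just before and just after t can be nearest.
        candidates = audio_times[max(i - 1, 0):i + 1]
        best = min(candidates, key=lambda a: abs(a - t), default=None)
        if best is not None and abs(best - t) <= tolerance:
            pairs.append((t, best))
        else:
            pairs.append((t, None))
    return pairs
```

A frame paired with None has no matching audio, so a downstream visual speech cues engine could skip it rather than process image data against the wrong sound.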
144 Citations
20 Claims
1. In a system comprising a computing environment coupled to a capture device for capturing information from a scene, a method of recognizing phonemes from image data, comprising:
a) receiving information from the scene including image data and audio data;
b) capturing image data relating to a position of at least one of the speaker's lips, tongue and/or teeth; and
c) comparing the image data captured in said step b) against stored rules to identify a phoneme indicated by the image data captured in said step b).
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
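The comparison of captured image data against stored rules in claim 1 can be illustrated with a small rule table. This is a sketch under stated assumptions, not the patent's rule set: the `MouthShape` fields, the thresholds, and the labels are all hypothetical, and the labels group sounds that look alike on the lips (for example /f/ and /v/), since image data alone may not separate them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MouthShape:
    """Hypothetical features extracted from one image of the mouth."""
    lip_opening: float   # 0.0 = closed, 1.0 = fully open (assumed scale)
    lip_rounding: float  # 0.0 = spread, 1.0 = fully rounded (assumed scale)
    teeth_on_lip: bool   # upper teeth touching the lower lip


# Stored rules, checked in order; the first matching rule wins.
# Thresholds are illustrative placeholders, not measured values.
RULES = [
    ("f/v", lambda m: m.teeth_on_lip),
    ("p/b/m", lambda m: m.lip_opening < 0.05),
    ("oo", lambda m: m.lip_rounding > 0.7 and m.lip_opening < 0.4),
    ("ah", lambda m: m.lip_opening > 0.6),
]


def identify_phoneme(shape):
    """Return the label of the first rule the shape satisfies, else None."""
    for label, matches in RULES:
        if matches(shape):
            return label
    return None
```

Returning None when no rule matches leaves the decision to the audio-based recognizer instead of forcing a guess from the image.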
10. In a system comprising a computing environment coupled to a capture device for capturing information from a scene, a method of recognizing phonemes from image data, comprising:
a) receiving information from the scene including image data and audio data;
b) identifying a speaker in the scene;
c) locating a position of the speaker within the scene;
d) measuring a plurality of parameters to determine whether a clarity threshold is met for obtaining image data relating to a position of at least one of the speaker's lips, tongue and/or teeth;
e) capturing image data relating to a position of at least one of the speaker's lips, tongue and/or teeth if it is determined in said step d) that the clarity threshold is met; and
f) identifying a phoneme indicated by the image data captured in said step e) if it is determined in said step d) that the clarity threshold is met.
Dependent claims: 11, 12, 13, 14, 15, 16
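Steps d) through f) of claim 10 gate phoneme identification on a clarity threshold computed from several measured parameters. The sketch below shows one way such a gate might look. The claim does not enumerate the parameters, so the names `distance_m`, `mouth_width_px`, and `brightness`, and all limits, are assumptions made for illustration.

```python
# Hypothetical acceptable range (low, high) for each measured parameter.
LIMITS = {
    "distance_m": (0.0, 2.0),                 # speaker distance from the camera
    "mouth_width_px": (40.0, float("inf")),   # mouth size in the image
    "brightness": (0.3, 1.0),                 # normalized scene brightness
}


def clarity_threshold_met(params, limits=LIMITS):
    """True only if every required parameter is present and within range."""
    return all(
        name in params and low <= params[name] <= high
        for name, (low, high) in limits.items()
    )


def phoneme_if_clear(params, identify):
    """Run the `identify` callable only when the clarity threshold is met."""
    if not clarity_threshold_met(params):
        return None  # skip visual identification for this frame
    return identify()
```

For example, `phoneme_if_clear({"distance_m": 3.5, "mouth_width_px": 64, "brightness": 0.6}, identify)` returns None without calling `identify`, because the assumed distance limit is exceeded.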
17. A computer-readable storage medium for programming a processor to perform a method of recognizing phonemes from image data, the method comprising:
a) capturing image data and audio data from a capture device;
b) setting a frame rate at which the capture device captures images based on a frame rate determined to capture movement required to determine lip, tongue and/or teeth positions in forming a phoneme;
c) setting a resolution of the image data to a resolution that does not result in latency in the frame rate set in said step b);
d) prompting a user to move to a position close enough to the capture device for the resolution set in said step c) to obtain an image of the user's lips, tongue and/or teeth with enough clarity to discern between different phonemes;
e) capturing image data from the user relating to a position of at least one of the speaker's lips, tongue and/or teeth; and
f) identifying a phoneme based on the image data captured in said step e).
Dependent claims: 18, 19, 20
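Steps b) through d) of claim 17 tie together frame rate, resolution, and user distance. A rough sketch of that trade-off follows, assuming a fixed processing cost per pixel and a simple pinhole camera model. The function names, the per-pixel cost, the 5 cm mouth width, and the 40-pixel minimum are hypothetical values chosen for illustration, not figures from the patent.

```python
import math


def pick_resolution(frame_rate_hz, resolutions, seconds_per_pixel):
    """Largest (width, height) whose processing time fits one frame period."""
    budget = 1.0 / frame_rate_hz
    fitting = [
        (w, h) for (w, h) in resolutions
        if w * h * seconds_per_pixel <= budget
    ]
    return max(fitting, key=lambda r: r[0] * r[1], default=None)


def mouth_width_px(distance_m, image_width_px, hfov_deg, mouth_width_m=0.05):
    """Approximate on-image mouth width using a pinhole camera model."""
    focal_px = (image_width_px / 2) / math.tan(math.radians(hfov_deg) / 2)
    return mouth_width_m * focal_px / distance_m


def should_prompt_closer(distance_m, image_width_px, hfov_deg, min_mouth_px=40):
    """True if the mouth would span too few pixels at the current distance."""
    return mouth_width_px(distance_m, image_width_px, hfov_deg) < min_mouth_px
```

With these assumed numbers, a user 3 m from a 640-pixel-wide camera with a 57-degree horizontal field of view would be prompted to move closer, while a user at 0.5 m would not.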
Specification