RGB/DEPTH CAMERA FOR IMPROVING SPEECH RECOGNITION

US 20110311144A1
Filed: 06/17/2010
Published: 12/22/2011
Est. Priority Date: 06/17/2010
Status: Abandoned Application

Abstract

A system and method are disclosed for facilitating speech recognition through the processing of visual speech cues. These speech cues may include the position of the lips, tongue and/or teeth during speech. In one embodiment, upon capture of a frame of data by an image capture device, the system identifies a speaker and a location of the speaker. The system then focuses in on the speaker to get a clear image of the speaker's mouth. The system includes a visual speech cues engine which operates to recognize and distinguish sounds based on the captured position of the speaker's lips, tongue and/or teeth. The visual speech cues data may be synchronized with the audio data to ensure the visual speech cues engine is processing image data which corresponds to the correct audio data.
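
The synchronization mentioned in the abstract could, for instance, be done by time stamping both streams and pairing each image frame with the nearest audio sample (claim 13 names time stamps as the mechanism). The Python sketch below is only an illustration under that assumption; the function names, the seconds-based time stamps, and the 20 ms default tolerance are hypothetical and do not come from the application.

```python
from bisect import bisect_left
from typing import List, Optional, Tuple


def nearest_audio_index(audio_ts: List[float], frame_ts: float,
                        tolerance: float) -> Optional[int]:
    """Return the index of the audio time stamp closest to frame_ts.

    audio_ts must be sorted ascending. Returns None when no audio time
    stamp lies within `tolerance` seconds of the frame (assumed policy).
    """
    if not audio_ts:
        return None
    i = bisect_left(audio_ts, frame_ts)
    # Only the neighbours on either side of the insertion point can be nearest.
    candidates = [j for j in (i - 1, i) if 0 <= j < len(audio_ts)]
    best = min(candidates, key=lambda j: abs(audio_ts[j] - frame_ts))
    return best if abs(audio_ts[best] - frame_ts) <= tolerance else None


def synchronize(frame_ts: List[float], audio_ts: List[float],
                tolerance: float = 0.02) -> List[Tuple[int, Optional[int]]]:
    """Pair every frame index with its matching audio index, or None."""
    return [(k, nearest_audio_index(audio_ts, t, tolerance))
            for k, t in enumerate(frame_ts)]
```

Frames that receive `None` have no audio close enough in time; how such frames should be handled is left open here.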

144 Citations

20 Claims

1. In a system comprising a computing environment coupled to a capture device for capturing information from a scene, a method of recognizing phonemes from image data, comprising:
- a) receiving information from the scene including image data and audio data;
  
  b) capturing image data relating to a position of at least one of the speaker's lips, tongue and/or teeth; and
  
  c) comparing the image data captured in said step b) against stored rules to identify a phoneme indicated by the image data captured in said step b).
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9):
  - 2. The method of claim 1, further comprising the steps of:
    - d) identifying a speaker in the scene,
    - e) locating a position of the speaker within the scene,
    - f) obtaining greater image detail on speaker within the scene relative to other areas of the scene, and
    - g) synchronizing the image data to the audio data.
  - 3. The method of claim 2, further comprising the step h) of processing the audio data by a speech recognition engine for recognizing speech from audio data.
  - 4. The method of claim 3, said step f) of comparing the captured image data against stored rules to identify a phoneme occurring contemporaneously with said step h) of processing the audio data by a speech recognition engine.
  - 5. The method of claim 3, said step f) of comparing the captured image data against stored rules to identify a phoneme occurring after the speech recognition engine is unable to identify a phoneme from the audio data in said step h).
  - 6. The method of claim 1, said step f) of comparing the captured image data against stored rules to identify a phoneme comprising the step j) of iteratively comparing data for the current frame and past frames of image data against the stored rules.
  - 7. The method of claim 6, said step j) of iteratively comparing data for the current frame and past frames of image data against the stored rules comprising selecting the number of past frames based on a frame rate at which image data is captured.
  - 8. The method of claim 1, said step b) of identifying a speaker in the scene comprising the step of analyzing image data and comparing that to a location of the source of audio data.
  - 9. The method of claim 1, said step c) of obtaining greater image detail on the one or more areas of interest within the scene comprising the step of performing one of a mechanical zoom or digital zoom to focus on at least one area of interest in the one or more areas of interest.
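
Claims 6 and 7 describe comparing the current frame and a number of past frames against the stored rules, with the number of past frames selected from the capture frame rate. A minimal Python sketch of that idea follows. The 0.2 second window, the `FrameWindowMatcher` class, the matcher callback, and the representation of a frame as a list of mouth features are all illustrative assumptions, not details taken from the application.

```python
from collections import deque
from typing import Callable, Deque, List, Optional, Sequence

# Hypothetical: each frame is reduced to a vector of mouth features,
# e.g. [lip_opening, lip_width].
Features = Sequence[float]
Matcher = Callable[[List[Features]], Optional[str]]


def past_frame_count(frame_rate_hz: float, window_seconds: float = 0.2) -> int:
    """Number of past frames to retain for a given frame rate.

    window_seconds is an assumed duration; a higher frame rate simply
    yields more frames covering the same span of time.
    """
    return max(1, round(frame_rate_hz * window_seconds))


class FrameWindowMatcher:
    """Keeps the current frame plus recent past frames and tests them
    against a rule matcher, longest window first."""

    def __init__(self, frame_rate_hz: float, matcher: Matcher) -> None:
        # +1 so the buffer holds the current frame as well as the past ones.
        self.frames: Deque[Features] = deque(
            maxlen=past_frame_count(frame_rate_hz) + 1)
        self.matcher = matcher

    def push(self, features: Features) -> Optional[str]:
        """Add the current frame and return a phoneme if any window matches."""
        self.frames.append(features)
        window = list(self.frames)
        # Iterate from the full window down to the current frame alone.
        for start in range(len(window)):
            phoneme = self.matcher(window[start:])
            if phoneme is not None:
                return phoneme
        return None
```

The matcher is deliberately left as a callback, since the claims do not specify how the stored rules are evaluated.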

10. In a system comprising a computing environment coupled to a capture device for capturing information from a scene, a method of recognizing phonemes from image data, comprising:
- a) receiving information from the scene including image data and audio data;
  
  b) identifying a speaker in the scene;
  
  c) locating a position of the speaker within the scene;
  
  d) measuring a plurality of parameters to determine whether a clarity threshold is met for obtaining image data relating to a position of at least one of the speaker's lips, tongue and/or teeth;
  
  e) capturing image data relating to a position of at least one of the speaker's lips, tongue and/or teeth if it is determined in said step d) that the clarity threshold is met; and
  
  f) identifying a phoneme indicated by the image data captured in said step e) if it is determined in said step d) that the clarity threshold is met.
- Dependent claims (11, 12, 13, 14, 15, 16):
  - 11. The method of claim 10, said step d) of measuring a plurality of parameters to determine whether a clarity threshold is met comprises the step of measuring at least one of:
    - d1) a resolution of the image data,
    - d2) a distance between the speaker and the capture device, and
    - d3) an amount of light energy incident on the speaker.
  - 12. The method of claim 11, wherein parameter d1) may vary inversely with parameters d2) and d3) and the clarity threshold is still met.
  - 13. The method of claim 10, further comprising the step g) of synchronizing the image data to the audio data by the step of time stamping the image data and audio data and comparing time stamps.
  - 14. The method of claim 13, further comprising the step h) of processing the audio data by a speech recognition engine for recognizing speech from audio data.
  - 15. The method of claim 14, said step f) comprising the step of comparing the captured image data against stored rules to identify a phoneme, said step f) occurring contemporaneously with said step h) of processing the audio data by a speech recognition engine.
  - 16. The method of claim 14, said step f) comprising the step of comparing the captured image data against stored rules to identify a phoneme, said step f) occurring after the speech recognition engine is unable to identify a phoneme from the audio data in said step h).
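
Claims 10 and 11 gate phoneme identification on a clarity threshold measured from image resolution, speaker distance, and incident light. The claims do not give a formula, so the Python sketch below uses a purely hypothetical score in which the three parameters can trade off against one another. The field names, units, reference light level, and threshold value are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ClarityReading:
    resolution_px: int   # d1) horizontal image resolution, in pixels (assumed unit)
    distance_m: float    # d2) speaker-to-camera distance, in metres (assumed unit)
    light_lux: float     # d3) light incident on the speaker, in lux (assumed unit)


def clarity_score(r: ClarityReading, reference_lux: float = 200.0) -> float:
    """Hypothetical score: pixels per metre of distance, reduced in dim light.

    The light factor saturates at 1.0, so extra light beyond reference_lux
    does not raise the score further.
    """
    if r.distance_m <= 0:
        raise ValueError("distance_m must be positive")
    light_factor = min(1.0, max(0.0, r.light_lux) / reference_lux)
    return (r.resolution_px / r.distance_m) * light_factor


def clarity_threshold_met(r: ClarityReading, threshold: float = 600.0) -> bool:
    """True when the combined score reaches the (assumed) threshold."""
    return clarity_score(r) >= threshold
```

Under these assumed numbers, a 640-pixel image of a speaker at 1 m passes while the same image at 2 m does not, which illustrates how a single combined threshold lets one parameter compensate for another.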

17. A computer-readable storage medium for programming a processor to perform a method of recognizing phonemes from image data, the method comprising:
- a) capturing image data and audio data from a capture device;
  
  b) setting a frame rate at which the capture device captures images based on a frame rate determined to capture movement required to determine lip, tongue and/or teeth positions in forming a phoneme;
  
  c) setting a resolution of the image data to a resolution that does not result in latency in the frame rate set in said step b);
  
  d) prompting a user to move to a position close enough to the capture device for the resolution set in said step c) to obtain an image of the user's lips, tongue and/or teeth with enough clarity to discern between different phonemes;
  
  e) capturing image data from the user relating to a position of at least one of the speaker's lips, tongue and/or teeth; and
  
  f) identifying a phoneme based on the image data captured in said step e).
- Dependent claims (18, 19, 20):
  - 18. The computer-readable storage medium of claim 17, further comprising the step of generating stored rules including information on the position of lips, tongue and/or teeth in mouthing a phoneme, the stored rules used for comparison against captured image data to determine whether the image data indicates a phoneme defined in a stored rule, the stored rules further including a confidence threshold indicating how closely captured image data needs to match the information in the stored rule in order for the image data to indicate the phoneme defined in the stored rule.
  - 19. The computer-readable storage medium of claim 18, further comprising the step of iteratively comparing data for the current frame and past frames of image data against the stored rules to identify a phoneme.
  - 20. The computer-readable storage medium of claim 17, further comprising the step g) of processing the audio data by a speech recognition engine for recognizing speech from audio data, said step f) of identifying a phoneme based on the captured image data performed only upon the speech recognition engine failing to recognize speech from the audio data.
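
Claims 18 and 20 together describe stored rules that carry a per-rule confidence threshold, consulted only when the audio speech recognition engine fails. The Python sketch below shows one way this could look. The `PhonemeRule` structure, the feature-vector templates, and the distance-based similarity measure are hypothetical choices for illustration, since the application does not specify how closeness of match is computed.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class PhonemeRule:
    phoneme: str
    template: Sequence[float]     # hypothetical mouth-shape features for the phoneme
    confidence_threshold: float   # minimum similarity (0..1) required to match


def similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Assumed similarity measure: 1.0 for identical vectors, falling
    toward 0.0 as the Euclidean distance grows."""
    return 1.0 / (1.0 + math.dist(a, b))


def match_rules(features: Sequence[float],
                rules: List[PhonemeRule]) -> Optional[str]:
    """Return the best-scoring phoneme whose own threshold is met, else None."""
    best_phoneme: Optional[str] = None
    best_score = 0.0
    for rule in rules:
        score = similarity(features, rule.template)
        if score >= rule.confidence_threshold and score > best_score:
            best_phoneme, best_score = rule.phoneme, score
    return best_phoneme


def recognize(audio_result: Optional[str], features: Sequence[float],
              rules: List[PhonemeRule]) -> Optional[str]:
    """Use the audio result when there is one; fall back to the visual
    rules only when audio recognition returned nothing."""
    if audio_result is not None:
        return audio_result
    return match_rules(features, rules)
```

Here `audio_result` stands in for whatever the speech recognition engine returns, with `None` representing a failure to recognize speech.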

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Tardif, John A.
Application Number: US12/817,854
Publication Number: US 20110311144A1
US Class Current: 382/195
CPC Class Codes: G10L 15/25 using position of the lips,...
