AUGMENTING SPEECH RECOGNITION WITH DEPTH IMAGING

US 20140122086A1
Filed: 10/26/2012
Published: 05/01/2014
Est. Priority Date: 10/26/2012
Status: Abandoned Application

Abstract

Embodiments related to the use of depth imaging to augment speech recognition are disclosed. For example, one disclosed embodiment provides, on a computing device, a method including receiving depth information of a physical space from a depth camera, receiving audio information from one or more microphones, identifying a set of one or more possible spoken words from the audio information, determining a speech input for the computing device based upon comparing the set of one or more possible spoken words from the audio information and the depth information, and taking an action on the computing device based upon the speech input determined.
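
The flow summarized in the abstract can be sketched as a small pipeline. This is an illustrative sketch only, not the applicants' implementation: the `depth_agreement` callable, the multiplicative scoring rule, the `actions` table, and all numeric scores are hypothetical stand-ins for whatever recognizer and depth analysis a real system would use.

```python
from typing import Callable, Dict, List, Tuple

# A candidate pairs a possible spoken word with its audio confidence in [0, 1].
Candidate = Tuple[str, float]


def determine_speech_input(candidates: List[Candidate],
                           depth_agreement: Callable[[str], float]) -> str:
    """Choose the candidate that best agrees with both audio and depth data.

    depth_agreement is a hypothetical callable returning a score in [0, 1]
    for how consistent a word is with the depth information.
    """
    if not candidates:
        raise ValueError("no possible spoken words were identified")
    best_word, _ = max(candidates, key=lambda c: c[1] * depth_agreement(c[0]))
    return best_word


def handle_utterance(candidates: List[Candidate],
                     depth_agreement: Callable[[str], float],
                     actions: Dict[str, Callable[[], str]]) -> str:
    """Determine the speech input, then take the action mapped to it."""
    speech_input = determine_speech_input(candidates, depth_agreement)
    action = actions.get(speech_input)
    return action() if action is not None else "no action"


# Example with made-up scores: audio slightly prefers "pay", depth favors "play".
candidates = [("play", 0.48), ("pay", 0.52)]
depth_scores = {"play": 0.9, "pay": 0.2}
actions = {"play": lambda: "playback started"}
result = handle_utterance(candidates, lambda w: depth_scores.get(w, 0.5), actions)
```

Passing the depth comparison in as a callable keeps the sketch neutral about which depth cue is used (mouth movement, gesture, objects in the room, and so on).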

20 Claims

1. On a computing device, a method for recognizing speech of a user, comprising:
- receiving depth information of a physical space from a depth camera;
  
  receiving audio information from one or more microphones;
  
  identifying a set of one or more possible spoken words from the audio information;
  
  determining a speech input for the computing device based upon comparing the set of one or more possible spoken words from the audio information and the depth information; and
  
  taking an action on the computing device based upon the speech input determined.
- Dependent claims (2, 3, 4, 5, 6, 7, 8)
  - 2. The method of claim 1, further comprising identifying contextual elements in one or more of the depth information from a depth camera, audio information from a directional microphone, and image information from a visible light camera, and comparing the set of one or more possible spoken words from the audio information to the contextual elements to determine the speech input.
  - 3. The method of claim 2, wherein identifying the contextual elements comprises one or more of determining an identity of the user based on one or more of the depth information and information from a visible light camera, determining an emotional state of the user, determining a physical state of the user, determining a gesture performed by the user, and identifying one or more objects in a physical space of the user.
  - 4. The method of claim 1, further comprising identifying a set of one or more possible spoken sounds and/or words from the depth information and comparing the set of one or more possible spoken words identified via the audio information to the set of one or more possible spoken sounds and/or words identified via the depth information to determine the speech input.
  - 5. The method of claim 4, wherein identifying the set of one or more possible spoken sounds and/or words from the depth information further comprises identifying one or more mouth, tongue, and/or throat movements of the user, and identifying the set of one or more possible spoken sounds and/or words based on the movements.
  - 6. The method of claim 1, wherein the speech input comprises one or more of a command and content to be displayed on a display device, and wherein taking the action comprises one or more of performing the command and sending the content to the display device.
  - 7. The method of claim 1, further comprising identifying which user of a plurality of users is speaking based on one or more of mouth movements and gaze direction.
  - 8. The method of claim 1, wherein the speech input is content to be stored, and wherein taking the action comprises storing the content.
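
The comparison recited in claims 4 and 5 (audio-derived candidate words against candidates derived from mouth movements in the depth data) can be illustrated with a simple score fusion. This is a minimal sketch under stated assumptions: the log-linear weighting, the `visual_weight` and `floor` parameters, and the example probabilities are all hypothetical, and the claims do not prescribe any particular fusion rule.

```python
import math
from typing import Dict


def fuse_candidates(audio: Dict[str, float],
                    visual: Dict[str, float],
                    visual_weight: float = 0.5,
                    floor: float = 1e-3) -> str:
    """Return the audio candidate with the best combined log score.

    audio  maps each possible word to its audio probability.
    visual maps words consistent with the observed mouth movements to a
    probability; words absent from it fall back to `floor` (an assumption).
    """
    if not audio:
        raise ValueError("audio candidate set is empty")

    def score(word: str) -> float:
        a = math.log(max(audio[word], floor))
        v = math.log(max(visual.get(word, floor), floor))
        return (1.0 - visual_weight) * a + visual_weight * v

    return max(audio, key=score)


# Made-up example: audio alone prefers "newt", but the lips were seen to
# close at the start of the word, which fits "mute" or "boot" and not "newt".
audio_candidates = {"mute": 0.40, "newt": 0.45, "boot": 0.15}
visual_candidates = {"mute": 0.60, "boot": 0.30}
fused = fuse_candidates(audio_candidates, visual_candidates)
audio_only = fuse_candidates(audio_candidates, visual_candidates, visual_weight=0.0)
```

Setting `visual_weight` to 0 reduces the sketch to audio-only recognition, which makes it easy to see what the depth-derived candidates contribute.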

9. On a computing device, a method for recognizing speech of a user, comprising:
- receiving depth image information of a physical space from a depth camera;
  
  receiving audio information from one or more microphones;
  
  identifying one or more spoken words from the audio information;
  
  identifying one or more contextual elements from the depth image information;
  
  determining whether the one or more spoken words are intended as a user input to the computing system based upon the one or more contextual elements;
  
  performing an action via the computing device if it is determined that the spoken words are intended as a user input; and
  
  not performing the action via the computing device if it is determined that the spoken words are not intended as a user input.
- Dependent claims (10, 11, 12, 13, 14)
  - 10. The method of claim 9, wherein the one or more contextual elements comprise a user gesture, and wherein determining whether the one or more spoken words are intended as the user input further comprises determining that the one or more spoken words are intended to be a user input if the user gesture is directed toward a speech recognition system device.
  - 11. The method of claim 9, wherein the one or more contextual elements comprise an orientation of a head of the user, and wherein determining whether the one or more spoken words are intended as the user input further comprises determining that the one or more spoken words are intended as the user input if the head of the user is orientated toward a speech recognition system device.
  - 12. The method of claim 9, wherein the one or more contextual elements comprise an emotion of the user.
  - 13. The method of claim 9, wherein determining whether the one or more spoken words are intended as the user input further comprises determining whether the spoken words are intended as the user input based on the one or more spoken words matching a recognized user input.
  - 14. The method of claim 9, further comprising identifying that the user is speaking based on the depth information, and responsive to identifying that the user is speaking, commencing identifying the one or more spoken words.
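
The gating step of claim 9, together with the cues named in claims 10, 11, and 13, can be sketched as a predicate over contextual elements. This is only an illustration: the `Context` fields, the 20-degree yaw threshold, and the choice to require both attention and a recognized command are assumptions, not details taken from the application.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class Context:
    """Hypothetical contextual elements extracted from depth images."""
    head_yaw_deg: float                  # 0 means the head faces the device
    gesture_toward_device: bool = False  # e.g. pointing at the device


def is_intended_input(words: str, ctx: Context, known_commands: Set[str],
                      max_yaw_deg: float = 20.0) -> bool:
    """Treat speech as input only if the user attends to the device
    (head orientation or gesture) and the words match a recognized input."""
    attending = abs(ctx.head_yaw_deg) <= max_yaw_deg or ctx.gesture_toward_device
    return attending and words in known_commands


def maybe_perform(words: str, ctx: Context, known_commands: Set[str]) -> Optional[str]:
    """Perform the action if the words are intended as input, else do nothing."""
    if is_intended_input(words, ctx, known_commands):
        return f"performed: {words}"
    return None


commands = {"pause", "play"}
facing = maybe_perform("pause", Context(head_yaw_deg=5.0), commands)
looking_away = maybe_perform("pause", Context(head_yaw_deg=75.0), commands)
pointing = maybe_perform("pause", Context(head_yaw_deg=75.0, gesture_toward_device=True), commands)
```

Returning `None` in the non-intended case mirrors the final claim element, in which the action is explicitly not performed.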

15. A method for recognizing speech of a user, comprising:
- receiving depth information of a physical space from a depth camera;
  
  receiving audio information from one or more microphones;
  
  identifying one or more of a mouth, tongue, and throat of the user from the depth information;
  
  identifying one or more of mouth movements, tongue movements, and throat movements of the user;
  
  determining that the user is speaking based on the identified movements;
  
  responsive to the determination that the user is speaking, identifying a speech input from the received audio information; and
  
  taking an action on the computing device in response to identifying the speech input.
- Dependent claims (16, 17, 18, 19, 20)
  - 16. The method of claim 15, further comprising identifying a set of one or more possible spoken sounds and/or words from the depth information and comparing a set of one or more possible spoken words identified via the audio information to the set of one or more possible spoken sounds and/or words identified via the depth information to determine the speech input.
  - 17. The method of claim 16, wherein the set of one or more possible spoken sounds and/or words is identified based on the identified mouth movements, tongue movements, and/or throat movements of the user.
  - 18. The method of claim 17, wherein a boundary between possible spoken sounds and/or words is determined based on identified hand movements of the user.
  - 19. The method of claim 15, wherein the speech input comprises a command, and wherein taking the action comprises performing the command.
  - 20. The method of claim 15, wherein the speech input comprises content to be displayed on a display device, and wherein taking the action comprises sending the content to the display device.
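
Claim 15 starts audio recognition only after movement seen in the depth data indicates that the user is speaking. A minimal sketch of that gating follows, assuming a per-frame mouth-opening measurement in millimetres has already been extracted from the depth images. The standard-deviation test, the 2 mm threshold, and the `recognizer` callable are hypothetical choices made for illustration.

```python
from statistics import pstdev
from typing import Callable, Optional, Sequence


def is_speaking(mouth_opening_mm: Sequence[float], min_std_mm: float = 2.0) -> bool:
    """Guess that the user is speaking when the mouth opening varies enough
    across recent depth frames. The threshold is an assumed value."""
    if len(mouth_opening_mm) < 2:
        return False
    return pstdev(mouth_opening_mm) >= min_std_mm


def recognize_if_speaking(mouth_opening_mm: Sequence[float],
                          audio: bytes,
                          recognizer: Callable[[bytes], str]) -> Optional[str]:
    """Run the (hypothetical) audio recognizer only when speech is detected."""
    if not is_speaking(mouth_opening_mm):
        return None
    return recognizer(audio)


# Made-up measurements: a nearly still mouth versus one opening and closing.
still = [3.0, 3.1, 2.9, 3.0]
moving = [2.0, 9.0, 3.0, 11.0, 4.0]
stub_recognizer = lambda audio: "pause"  # stand-in for a real recognizer
ignored = recognize_if_speaking(still, b"", stub_recognizer)
heard = recognize_if_speaking(moving, b"", stub_recognizer)
```

Gating on visual evidence first means background audio, such as a television in the room, never reaches the recognizer unless the tracked user's mouth is also moving.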

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Kapur, Jay; Tashev, Ivan; Hodges, Stephen Edward; Seltzer, Mike

Application Number: US13/662,293
Publication Number: US 20140122086A1
Field of Search
US Class Current: 704/275
CPC Class Codes:
A63F 13/213   comprising photodetecting m...
A63F 13/424   involving acoustic input si...
A63F 2300/1081   Input via voice recognition
A63F 2300/1087   comprising photodetecting m...
G06F 3/017   Gesture based interaction, ...
G10L 15/24   Speech recognition using no...
G10L 2015/227   of the speaker; Human-fact...
