Using visual cues to disambiguate speech inputs

US 9,190,058 B2
Filed: 01/25/2013
Issued: 11/17/2015
Est. Priority Date: 01/25/2013
Status: Active Grant

Abstract

Embodiments related to recognizing speech inputs are disclosed. One disclosed embodiment provides a method for recognizing a speech input including receiving depth information of a physical space from a depth camera, determining an identity of a user in the physical space based on the depth information, receiving audio information from one or more microphones, and determining a speech input from the audio input. If the speech input comprises an ambiguous term, the ambiguous term in the speech input is compared to one or more of depth image data received from the depth image sensor and digital content consumption information for the user to identify an unambiguous term corresponding to the ambiguous term. After identifying the unambiguous term, an action is taken on the computing device based on the speech input and the unambiguous term.
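
The flow summarized in the abstract can be sketched roughly as follows. This is an illustrative sketch only, not the patented implementation: `Observation`, `AMBIGUOUS_TERMS`, and `resolve_speech` are hypothetical names, the inputs stand in for the outputs of real user-identification, gesture, and speech recognizers, and preferring the gesture cue over content history is an assumption made for the example.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical set of terms treated as ambiguous; the patent does not enumerate one.
AMBIGUOUS_TERMS = {"that", "this", "him", "her", "it"}


@dataclass
class Observation:
    """Stand-in for already-processed sensor data (all fields are assumptions)."""
    user_id: str                       # identity derived from the depth/image data
    speech: str                        # text recognized from the microphone audio
    pointed_at: Optional[str] = None   # object or person the user gestures toward
    recently_consumed: list = field(default_factory=list)  # content history, newest last


def resolve_speech(obs: Observation) -> str:
    """Replace each ambiguous term with an unambiguous one when a cue is available."""
    resolved = []
    for word in obs.speech.split():
        if word.lower() not in AMBIGUOUS_TERMS:
            resolved.append(word)
        elif obs.pointed_at is not None:
            # Visual cue: the gesture target disambiguates the term.
            resolved.append(obs.pointed_at)
        elif obs.recently_consumed:
            # Fallback cue: the most recently consumed content item.
            resolved.append(obs.recently_consumed[-1])
        else:
            # No cue available; leave the term as spoken.
            resolved.append(word)
    return " ".join(resolved)
```

In this sketch the computing device would then act on the resolved string, which corresponds to the final "taking an action" step.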

19 Claims

1. On a computing device, a method for recognizing a speech input, the method comprising:
- receiving image information of a physical space from one or more cameras;
  
  determining an identity of a user in the physical space based on the image information;
  
  receiving audio information from one or more microphones;
  
  determining a speech input from the audio input;
  
  if the speech input comprises an ambiguous term, then comparing the ambiguous term in the speech input to digital content consumption information for the user to identify an unambiguous term corresponding to the ambiguous term, the digital content consumption information comprising social network information obtained from a remote service, the social network information including contacts from a social network, and wherein identifying the unambiguous term comprises identifying another user from the social network information; and
  
  after identifying the unambiguous term, taking an action on the computing device based on the speech input and the unambiguous term.
- Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9)
  - 2. The method of claim 1, wherein the digital content consumption information for the user further comprises past content consumption information for the user, and wherein identifying the unambiguous term comprises identifying a content item referred to by the ambiguous term from the historical content consumption information.
  - 3. The method of claim 1, further comprising identifying one or more gestures performed by the user via the image information, and utilizing the one or more gestures to identify the unambiguous term.
  - 4. The method of claim 3, wherein the one or more gestures indicate another person referred to in the ambiguous speech input.
  - 5. The method of claim 4, further comprising identifying the other person indicated by the one or more gestures.
  - 6. The method of claim 3, wherein the one or more gestures indicate an object referred to in the ambiguous speech input.
  - 7. The method of claim 1, wherein the identity of the user is further determined based on information received from the one or more microphones.
  - 8. The method of claim 1, further comprising identifying one or more other persons in the physical environment.
  - 9. The method of claim 1, wherein the digital content consumption information further comprises user preference information based on past digital content consumption.
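
As a rough illustration of the social-network step in claim 1, the sketch below matches an ambiguous spoken name (for example, "John" in "call John") against a user's contacts. It is a minimal sketch under stated assumptions: the `Contact` record, the `interactions` field, and the tie-breaking heuristic are hypothetical and are not specified by the claims, and no real social network API is used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contact:
    """Hypothetical contact record as it might come back from a remote service."""
    full_name: str
    interactions: int  # assumed count of recent interactions with the user


def resolve_contact(ambiguous_name, contacts):
    """Return the contact whose name contains the spoken fragment, or None.

    When several contacts match, the one with the most interactions wins.
    That ranking is an illustrative heuristic, not part of the claims.
    """
    needle = ambiguous_name.lower()
    matches = [c for c in contacts if needle in c.full_name.lower().split()]
    if not matches:
        return None
    return max(matches, key=lambda c: c.interactions)
```

For instance, with contacts "John Smith" (3 interactions) and "John Doe" (12 interactions), the fragment "John" would resolve to "John Doe" under this heuristic.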

10. On a computing device, a method for recognizing speech of a user, comprising:
- receiving depth information of a physical space from a depth camera;
  
  identifying one or more gestures performed by the user based on the depth information;
  
  receiving audio information from one or more microphones;
  
  determining a speech input from the audio input;
  
  if the speech input comprises an ambiguous term, then utilizing one or more of the one or more gestures and social network information obtained from a remote service to identify an unambiguous term corresponding to the ambiguous term, the social network information including contacts from a social network, and wherein identifying the unambiguous term comprises identifying another user from the social network information; and
  
  after identifying the unambiguous term, taking an action on the computing device based on the speech input and the unambiguous term.
- Dependent Claims (11, 12, 13, 14, 15, 16)
  - 11. The method of claim 10, wherein the speech input comprises a command, and wherein the one or more gestures comprise the user pointing to an object in the physical space such that the unambiguous term is an identity of the object.
  - 12. The method of claim 10, wherein the speech input comprises a command, and wherein the one or more gestures comprise the user pointing to another user in the physical space such that the unambiguous term is an identity of the other user.
  - 13. The method of claim 10, wherein the speech input comprises a command, and wherein the one or more gestures comprise the user pointing to a user interface element displayed on a display device such that the unambiguous term is a selected user interface element at which the user is pointing.
  - 14. The method of claim 10, wherein the one or more gestures include a proactive gesture performed prior to the user speaking the speech input to indicate a context of the speech input.
  - 15. The method of claim 10, wherein the one or more gestures include a reactive gesture performed after the user makes the speech input.
  - 16. The method of claim 10, further comprising additionally utilizing digital content consumption information of the user to disambiguate the speech input.
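
Claims 14 and 15 distinguish a proactive gesture (made before the speech input) from a reactive gesture (made after it). One way to picture this is to associate the utterance with the nearest gesture in time, as in the sketch below. The `Gesture` type, the timestamps in seconds, and the 2-second `window` are assumptions chosen for illustration; the claims do not specify any timing rule.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Gesture:
    """Hypothetical pointing gesture with a timestamp in seconds."""
    target: str   # identity of the object, person, or UI element pointed at
    time: float


def pick_gesture(gestures, speech_start, speech_end, window=2.0) -> Optional[Gesture]:
    """Pick the gesture closest in time to the utterance, within `window` seconds."""

    def gap(g):
        if g.time < speech_start:
            return speech_start - g.time  # proactive: gesture precedes the speech
        if g.time > speech_end:
            return g.time - speech_end    # reactive: gesture follows the speech
        return 0.0                        # gesture made while speaking

    candidates = [g for g in gestures if gap(g) <= window]
    return min(candidates, key=gap) if candidates else None
```

The `target` of the selected gesture would then serve as the unambiguous term, such as the identity of the object in claim 11 or of the other user in claim 12.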

17. A storage device comprising instructions executable by a logic subsystem to:
- receive depth information of a physical space from a depth camera;
  
  determine an identity of a user in the physical space based on the depth information;
  
  identify one or more gestures performed by the user based on the depth information;
  
  receive audio information from one or more microphones;
  
  determine a speech input from the audio input;
  
  if the speech input comprises an ambiguous term, then utilize one or more of digital content consumption information for the user and the one or more gestures to identify an unambiguous term corresponding to the ambiguous term, the digital content consumption information including social network information obtained from a remote service, the social network information including contacts from a social network, and wherein identifying the unambiguous term comprises identifying another user from the social network information; and
  
  after identifying the unambiguous term, take an action on the computing device based on the speech input and the unambiguous term.
- Dependent Claims (18, 19)
  - 18. The storage device of claim 17, wherein the instructions are executable to determine that the one or more gestures indicate one or more of another person in the physical space and an object in the physical space.
  - 19. The storage device of claim 17, wherein the digital content information further comprises one or more of historical content consumption data and user preference data.

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Inventors: Klein, Christian
Primary Examiner(s): PULLIAS, JESSE SCOTT
Application Number: US13/750,674
Publication Number: US 20140214415A1
Time in Patent Office: 1,026 Days
Field of Search: 704/231-257, 704/270-275
US Class Current: 1/1

CPC Class Codes
G06F 2203/0381   Multimodal input, i.e. inte...
G06F 3/017   Gesture based interaction, ...
G06F 3/0304   Detection arrangements usin...
G06F 3/167   Audio in a user interface, ...
G10L 15/22   Procedures used during a sp...
G10L 15/24   Speech recognition using no...
G10L 2015/223   Execution procedure of a sp...
