Compound gesture-speech commands
First Claim
1. An automated method of initiating a machine action based on a combination of sounds and gestures made by one or more users, the method comprising:
using a depth determining camera to capture a first three-dimensional body pose made and/or a first three-dimensional body action performed by a respective at least one of the one or more users;
identifying a first pre-specified three-dimensional gesture based on the first three-dimensional body pose and/or the first three-dimensional body action of the respective at least one user, and determining a confidence level for identification of the first pre-specified three-dimensional gesture;
assigning a weight to the first pre-specified three-dimensional gesture based on the confidence level for identification of the first pre-specified three-dimensional gesture;
detecting a first set of one or more sounds made by the respective at least one or at least another of the one or more users, the first set of one or more sounds being made in combination with the first respective three-dimensional body pose and/or the first three-dimensional body action of the respective at least one user;
recognizing a first voice command based on the first set of one or more sounds and determining a confidence level for recognition of the first voice command;
assigning a weight to the first voice command based on the confidence level for recognition of the first voice command;
automatically identifying a first command pre-associated with a compound combination of the first pre-specified three-dimensional gesture and the first voice command by determining that the weight of the first pre-specified three-dimensional gesture is greater than the weight of the first voice command, and verifying that the first voice command is within a set of voice commands associated with the first pre-specified three-dimensional gesture;
in response to the first command, initiating performance by an instructable machine of a first machine action that has been predetermined to be commanded by the first command;
identifying a second pre-specified three-dimensional gesture based on a second three-dimensional body pose and/or a second three-dimensional body action of the respective at least one user;
assigning a weight to the second pre-specified three-dimensional gesture;
recognizing a second voice command based on a second set of one or more sounds;
assigning a weight to the second voice command;
automatically identifying a second command pre-associated with a compound combination of the second pre-specified three-dimensional gesture and the second voice command by determining that the weight of the second voice command is greater than the weight of the second pre-specified three-dimensional gesture, and verifying that the second pre-specified three-dimensional gesture is within a set of gestures associated with the second voice command; and
in response to the second command, initiating performance by the instructable machine of a second machine action that has been predetermined to be commanded by the second command.
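The first-command identification steps above can be sketched as follows. This is a minimal illustration only: the gesture names, voice commands, lookup tables, and the choice of using the raw confidence as the weight are all hypothetical, not taken from the patent.

```python
# Hypothetical lookup tables for illustration only; the claim does not
# name any concrete gestures, voice commands, or machine actions.
GESTURE_TO_VOICE_SET = {
    "wave": {"stop", "quit"},      # voice commands pre-associated with "wave"
    "point": {"play", "select"},
}
COMPOUND_TO_COMMAND = {
    ("wave", "stop"): "pause_playback",
    ("point", "play"): "start_playback",
}

def identify_first_command(gesture, gesture_confidence, voice, voice_confidence):
    """Sketch of claim 1's first branch: weights are taken directly from
    the recognition confidences (one possible weighting scheme), and a
    command is identified only when the gesture outweighs the voice input
    AND the voice command is in the gesture's pre-associated set."""
    gesture_weight = gesture_confidence
    voice_weight = voice_confidence
    if gesture_weight > voice_weight and \
            voice in GESTURE_TO_VOICE_SET.get(gesture, set()):
        return COMPOUND_TO_COMMAND.get((gesture, voice))
    return None  # no compound command identified
```

The second branch of the claim mirrors this with the roles reversed: when the voice weight dominates, the gesture is verified against a set of gestures associated with the voice command.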
Abstract
A multimedia entertainment system combines both gestures and voice commands to provide an enhanced control scheme. A user's body position or motion may be recognized as a gesture, and may be used to provide context to recognize user generated sounds, such as speech input. Likewise, speech input may be recognized as a voice command, and may be used to provide context to recognize a body position or motion as a gesture. Weights may be assigned to the inputs to facilitate processing. When a gesture is recognized, a limited set of voice commands associated with the recognized gesture are loaded for use. Further, additional sets of voice commands may be structured in a hierarchical manner such that speaking a voice command from one set of voice commands leads to the system loading a next set of voice commands.
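The hierarchical loading of voice-command sets described in the abstract might look like the following sketch; the command tree and set contents are invented for illustration and are not drawn from the patent.

```python
# Invented command hierarchy: speaking a recognized command from the
# active set causes the next associated set to be loaded.
COMMAND_TREE = {
    "music": {"play", "shuffle"},
    "play": {"next track", "previous track", "stop"},
}
ROOT_SET = {"music", "video", "games"}

def next_command_set(active_set, spoken_command):
    """Return the set of voice commands to load after an utterance;
    unrecognized speech leaves the active set unchanged."""
    if spoken_command in active_set:
        return COMMAND_TREE.get(spoken_command, set())
    return active_set
```

Starting from `ROOT_SET`, saying "music" loads `{"play", "shuffle"}`; saying "play" then loads the playback controls, so each utterance narrows the recognizer's active vocabulary.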
18 Claims
1. (Independent claim; reproduced in full above as the First Claim. Dependent claims: 2-15.)
16. A machine system comprising:
a display configured to display content;
a depth sensor configured to capture depth information about real world objects;
a sound sensor configured to capture sounds made by real world objects; and
at least one processor in operative communication with the display, with the depth sensor, and with the sound sensor, the processor being configured to:
use the depth sensor to capture depth aspects of respective three-dimensional body poses made and/or three-dimensional body actions performed by a respective at least one of one or more users present in a field of view of the depth sensor;
identify a pre-specified three-dimensional gesture based on the depth aspects captured by the depth sensor with respect to the three-dimensional body poses and/or three-dimensional body actions of the respective at least one user, and determine a confidence level for identification of the pre-specified three-dimensional gesture;
assign a weight to the gesture based on the confidence level for identification of the pre-specified three-dimensional gesture;
use the sound sensor to detect one or more sounds made by the respective at least one or at least another of the one or more users present in the field of view of the depth sensor, the one or more sounds being made in combination with the respective three-dimensional poses and/or three-dimensional actions of the respective at least one user;
recognize a voice command based on the one or more sounds and determine a confidence level for recognition of the voice command;
assign a weight to the voice command based on the confidence level for recognition of the voice command;
use the identified pre-specified three-dimensional gesture to automatically identify a command pre-associated with a compound combination of the pre-specified three-dimensional gesture and the voice command by, when the weight of the pre-specified three-dimensional gesture is greater than the weight of the voice command, verifying that the voice command is within a set of voice commands associated with the pre-specified three-dimensional gesture, and, when the weight of the voice command is greater than the weight of the pre-specified three-dimensional gesture, verifying that the pre-specified three-dimensional gesture is within a set of gestures associated with the voice command; and
in response to the automatically identified command, initiate performance by the at least one processor or a different computing system of a machine action that has been predetermined to be commanded by the automatically identified command.
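Claim 16's two-way arbitration, in which whichever modality carries the greater weight gates which pre-associated set is consulted, can be sketched as below. The association tables passed in are hypothetical placeholders.

```python
def arbitrate_compound_command(gesture, gesture_weight, voice, voice_weight,
                               voice_sets, gesture_sets, compound_commands):
    """Sketch of claim 16's two-way verification: the higher-weight
    input selects which pre-associated set is checked before the
    compound command is looked up."""
    if gesture_weight > voice_weight:
        # Gesture dominates: the voice command must belong to the
        # gesture's associated voice-command set.
        verified = voice in voice_sets.get(gesture, set())
    else:
        # Voice dominates: the gesture must belong to the voice
        # command's associated gesture set.
        verified = gesture in gesture_sets.get(voice, set())
    return compound_commands.get((gesture, voice)) if verified else None
```

Note that either path can succeed for the same (gesture, voice) pair; the weighting only decides which side's association table acts as the gatekeeper.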
17. A computer-readable storage device having computer-readable instructions embedded therein, the instructions being executable by a processor to provide an automated method of initiating a machine action based on a combination of sounds and gestures made by one or more users, the method comprising:
using a depth determining camera to capture a respective three-dimensional body pose made and/or a three-dimensional body action performed by a respective at least one of the one or more users;
identifying a pre-specified three-dimensional gesture based on the three-dimensional body pose and/or three-dimensional body action of the respective at least one user, and determining a confidence level for identification of the pre-specified three-dimensional gesture;
assigning a weight to the gesture based on the confidence level for identification of the pre-specified three-dimensional gesture;
detecting one or more sounds made by the respective at least one or at least another of the one or more users, the one or more sounds being made in combination with the respective three-dimensional pose and/or three-dimensional action of the respective at least one user;
recognizing a voice command based on the one or more sounds and determining a confidence level for recognition of the voice command;
assigning a weight to the voice command based on the confidence level for recognition of the voice command;
using the identified pre-specified three-dimensional gesture to automatically identify a command pre-associated with a compound combination of the pre-specified three-dimensional gesture and the voice command by, when the weight of the pre-specified three-dimensional gesture is greater than the weight of the voice command, verifying that the voice command is within a set of voice commands associated with the pre-specified three-dimensional gesture, and, when the weight of the voice command is greater than the weight of the pre-specified three-dimensional gesture, verifying that the pre-specified three-dimensional gesture is within a set of gestures associated with the voice command; and
in response to the automatically identified command, initiating performance by an instructable machine of a machine action that has been predetermined to be commanded by the automatically identified command.
(Dependent claim: 18.)
Specification