Compound gesture-speech commands
First Claim
1. An automated method of initiating a machine action based on a combination of sounds and gestures made by one or more users, the method comprising:
using a depth determining camera to capture a first three-dimensional body pose made and/or a first three-dimensional body action performed by a respective at least one of the one or more users;
identifying a first pre-specified three-dimensional gesture based on the first three-dimensional body pose and/or the first three-dimensional body action of the respective at least one user, and determining a confidence level for identification of the first pre-specified three-dimensional gesture;
assigning a weight to the first pre-specified three-dimensional gesture based on the confidence level for identification of the first pre-specified three-dimensional gesture;
detecting a first set of one or more sounds made by the respective at least one or at least another of the one or more users, the first set of one or more sounds being made in combination with the first respective three-dimensional body pose and/or the first three-dimensional body action of the respective at least one user;
recognizing a first voice command based on the first set of one or more sounds and determining a confidence level for recognition of the first voice command;
assigning a weight to the first voice command based on the confidence level for recognition of the first voice command;
automatically identifying a first command pre-associated with a compound combination of the first pre-specified three-dimensional gesture and the first voice command by determining that the weight of the first pre-specified three-dimensional gesture is greater than the weight of the first voice command, and verifying that the first voice command is within a set of voice commands associated with the first pre-specified three-dimensional gesture;
in response to the first command, initiating performance by an instructable machine of a first machine action that has been predetermined to be commanded by the first command;
identifying a second pre-specified three-dimensional gesture based on a second three-dimensional body pose and/or a second three-dimensional body action of the respective at least one user;
assigning a weight to the second pre-specified three-dimensional gesture;
recognizing a second voice command based on a second set of one or more sounds;
assigning a weight to the second voice command;
automatically identifying a second command pre-associated with a compound combination of the second pre-specified three-dimensional gesture and the second voice command by determining that the weight of the second voice command is greater than the weight of the second pre-specified three-dimensional gesture, and verifying that the second pre-specified three-dimensional gesture is within a set of gestures associated with the second voice command; and
in response to the second command, initiating performance by the instructable machine of a second machine action that has been predetermined to be commanded by the second command.
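The first-command identification steps above can be sketched as follows. This is a minimal illustration only: the gesture names, voice commands, lookup tables, and the choice of using the raw confidence as the weight are all hypothetical, not taken from the patent.

```python
# Hypothetical lookup tables for illustration only; the claim does not
# name any concrete gestures, voice commands, or machine actions.
GESTURE_TO_VOICE_SET = {
    "wave": {"stop", "quit"},      # voice commands pre-associated with "wave"
    "point": {"play", "select"},
}
COMPOUND_TO_COMMAND = {
    ("wave", "stop"): "pause_playback",
    ("point", "play"): "start_playback",
}

def identify_first_command(gesture, gesture_confidence, voice, voice_confidence):
    """Sketch of claim 1's first branch: weights are taken directly from
    the recognition confidences (one possible weighting scheme), and a
    command is identified only when the gesture outweighs the voice input
    AND the voice command is in the gesture's pre-associated set."""
    gesture_weight = gesture_confidence
    voice_weight = voice_confidence
    if gesture_weight > voice_weight and \
            voice in GESTURE_TO_VOICE_SET.get(gesture, set()):
        return COMPOUND_TO_COMMAND.get((gesture, voice))
    return None  # no compound command identified
```

The second branch of the claim mirrors this with the roles reversed: when the voice weight dominates, the gesture is verified against a set of gestures associated with the voice command.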
Abstract
A multimedia entertainment system combines both gestures and voice commands to provide an enhanced control scheme. A user's body position or motion may be recognized as a gesture, and may be used to provide context to recognize user generated sounds, such as speech input. Likewise, speech input may be recognized as a voice command, and may be used to provide context to recognize a body position or motion as a gesture. Weights may be assigned to the inputs to facilitate processing. When a gesture is recognized, a limited set of voice commands associated with the recognized gesture are loaded for use. Further, additional sets of voice commands may be structured in a hierarchical manner such that speaking a voice command from one set of voice commands leads to the system loading a next set of voice commands.
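The hierarchical loading of voice-command sets described in the abstract might look like the following sketch; the command tree and set contents are invented for illustration and are not drawn from the patent.

```python
# Invented command hierarchy: speaking a recognized command from the
# active set causes the next associated set to be loaded.
COMMAND_TREE = {
    "music": {"play", "shuffle"},
    "play": {"next track", "previous track", "stop"},
}
ROOT_SET = {"music", "video", "games"}

def next_command_set(active_set, spoken_command):
    """Return the set of voice commands to load after an utterance;
    unrecognized speech leaves the active set unchanged."""
    if spoken_command in active_set:
        return COMMAND_TREE.get(spoken_command, set())
    return active_set
```

Starting from `ROOT_SET`, saying "music" loads `{"play", "shuffle"}`; saying "play" then loads the playback controls, so each utterance narrows the recognizer's active vocabulary.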
18 Claims
1. (Independent claim; reproduced in full above as the First Claim. Dependent claims: 2-15.)
16. A machine system comprising:
a display configured to display content;
a depth sensor configured to capture depth information about real world objects;
a sound sensor configured to capture sounds made by real world objects; and
at least one processor in operative communication with the display, with the depth sensor, and with the sound sensor, the processor being configured to:
use the depth sensor to capture depth aspects of respective three-dimensional body poses made and/or three-dimensional body actions performed by a respective at least one of one or more users present in a field of view of the depth sensor;
identify a pre-specified three-dimensional gesture based on the depth aspects captured by the depth sensor with respect to the three-dimensional body poses and/or three-dimensional body actions of the respective at least one user, and determine a confidence level for identification of the pre-specified three-dimensional gesture;
assign a weight to the gesture based on the confidence level for identification of the pre-specified three-dimensional gesture;
use the sound sensor to detect one or more sounds made by the respective at least one or at least another of the one or more users present in the field of view of the depth sensor, the one or more sounds being made in combination with the respective three-dimensional poses and/or three-dimensional actions of the respective at least one user;
recognize a voice command based on the one or more sounds and determine a confidence level for recognition of the voice command;
assign a weight to the voice command based on the confidence level for recognition of the voice command;
use the identified pre-specified three-dimensional gesture to automatically identify a command pre-associated with a compound combination of the pre-specified three-dimensional gesture and the voice command by, when the weight of the pre-specified three-dimensional gesture is greater than the weight of the voice command, verifying that the voice command is within a set of voice commands associated with the pre-specified three-dimensional gesture, and, when the weight of the voice command is greater than the weight of the pre-specified three-dimensional gesture, verifying that the pre-specified three-dimensional gesture is within a set of gestures associated with the voice command; and
in response to the automatically identified command, initiate performance by the at least one processor or a different computing system of a machine action that has been predetermined to be commanded by the automatically identified command.
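Claim 16's two-way arbitration, in which whichever modality carries the greater weight gates which pre-associated set is consulted, can be sketched as below. The association tables passed in are hypothetical placeholders.

```python
def arbitrate_compound_command(gesture, gesture_weight, voice, voice_weight,
                               voice_sets, gesture_sets, compound_commands):
    """Sketch of claim 16's two-way verification: the higher-weight
    input selects which pre-associated set is checked before the
    compound command is looked up."""
    if gesture_weight > voice_weight:
        # Gesture dominates: the voice command must belong to the
        # gesture's associated voice-command set.
        verified = voice in voice_sets.get(gesture, set())
    else:
        # Voice dominates: the gesture must belong to the voice
        # command's associated gesture set.
        verified = gesture in gesture_sets.get(voice, set())
    return compound_commands.get((gesture, voice)) if verified else None
```

Note that either path can succeed for the same (gesture, voice) pair; the weighting only decides which side's association table acts as the gatekeeper.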
17. A computer-readable storage device having computer-readable instructions embedded therein, the instructions being executable by a processor to provide an automated method of initiating a machine action based on a combination of sounds and gestures made by one or more users, the method comprising:
using a depth determining camera to capture a respective three-dimensional body pose made and/or a three-dimensional body action performed by a respective at least one of the one or more users;
identifying a pre-specified three-dimensional gesture based on the three-dimensional body pose and/or three-dimensional body action of the respective at least one user, and determining a confidence level for identification of the pre-specified three-dimensional gesture;
assigning a weight to the gesture based on the confidence level for identification of the pre-specified three-dimensional gesture;
detecting one or more sounds made by the respective at least one or at least another of the one or more users, the one or more sounds being made in combination with the respective three-dimensional pose and/or three-dimensional action of the respective at least one user;
recognizing a voice command based on the one or more sounds and determining a confidence level for recognition of the voice command;
assigning a weight to the voice command based on the confidence level for recognition of the voice command;
using the identified pre-specified three-dimensional gesture to automatically identify a command pre-associated with a compound combination of the pre-specified three-dimensional gesture and the voice command by, when the weight of the pre-specified three-dimensional gesture is greater than the weight of the voice command, verifying that the voice command is within a set of voice commands associated with the pre-specified three-dimensional gesture, and, when the weight of the voice command is greater than the weight of the pre-specified three-dimensional gesture, verifying that the pre-specified three-dimensional gesture is within a set of gestures associated with the voice command; and
in response to the automatically identified command, initiating performance by an instructable machine of a machine action that has been predetermined to be commanded by the automatically identified command.
(Dependent claim: 18.)
Specification