Speech-inclusive device interfaces
Abstract
A user can provide input to a computing device through various combinations of speech, movement, and/or gestures. The computing device can capture audio data and analyze it to detect any speech information it contains. The device can simultaneously capture image or video information, which can be used to assist in analyzing the audio information. For example, the image information can be used to determine when someone is speaking, and the movement of the person's lips can be analyzed to help determine the words that were spoken. Any gestures or other motions can assist in the determination as well. By combining multiple types of data to determine user input, the accuracy of a process such as speech recognition can be improved, and lengthy application training processes can be avoided.
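As a rough illustration of the audio-visual fusion the abstract describes, the sketch below blends per-candidate audio match scores with lip-movement (visual) scores into a single ranking. The candidate words, the 0.0-1.0 score range, and the weighting are illustrative assumptions, not the patent's actual implementation.

```python
def fuse_scores(audio_scores, visual_scores, audio_weight=0.6):
    """Combine per-candidate audio and visual (lip-reading) match scores,
    each assumed to lie in 0.0-1.0, into one confidence per candidate word."""
    combined = {}
    for word, a in audio_scores.items():
        v = visual_scores.get(word, 0.0)  # visual score, if lips were analyzed
        combined[word] = audio_weight * a + (1.0 - audio_weight) * v
    return combined

def best_candidate(audio_scores, visual_scores):
    """Pick the candidate word with the highest fused confidence."""
    combined = fuse_scores(audio_scores, visual_scores)
    return max(combined, key=combined.get)
```

In this toy setup, two words that sound alike ("call" vs. "tall") can be disambiguated by the visual channel, which is the benefit the abstract attributes to combining the two data types.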
34 Claims
1. A method of determining user input to a computing device, comprising:
capturing audio data using at least one audio capture element of the computing device;
concurrent with capturing audio data, capturing image data of the user using at least one image capture element of the computing device; and
using at least one algorithm executing on a processor of the computing device:
detecting a presence of speech information contained in the captured audio data;
in the captured image data, detecting mouth movement of the user of the computing device at a time during which the speech information was detected;
in response to detecting the presence of speech information and detecting mouth movement, identifying at least one user input based at least in part on a combination of the speech information and the mouth movement for a defined period of time, wherein the identifying of the at least one user input includes comparing the mouth movement in the captured image data to one or more word formation models, the one or more word formation models capable of being personalized for the user over a duration of time based at least in part on the speech information and the mouth movement of the user; and
providing the user input for processing if a confidence level of the identified user input exceeds a minimum threshold, wherein the confidence level of the identified user input is relative to the combination of the speech information and the mouth movement, and wherein the confidence level of the identified user input is based at least in part on a metric indicating a level of matching between the identified user input and an input term.
View Dependent Claims (2, 3)
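The confidence-threshold gating recited in claim 1 might be sketched as follows. The viseme-set representation of a word formation model, the 0.5/0.5 audio-visual blend, and the 0.7 minimum threshold are hypothetical choices for illustration only; the claim does not specify any of them.

```python
def identify_user_input(speech_text, mouth_movement, word_formation_models,
                        min_confidence=0.7):
    """Return the recognized input term only if its confidence exceeds the
    minimum threshold; otherwise return None (input not provided for
    processing). `word_formation_models` maps each input term to a set of
    mouth-movement (viseme) features — a hypothetical model representation."""
    best_term, best_conf = None, 0.0
    for term, model in word_formation_models.items():
        # Metric indicating the level of matching between the identified
        # input and this term: fraction of the term's mouth-movement
        # features observed, blended with an exact audio match.
        visual_match = len(mouth_movement & model) / max(len(model), 1)
        audio_match = 1.0 if term == speech_text else 0.0
        confidence = 0.5 * visual_match + 0.5 * audio_match
        if confidence > best_conf:
            best_term, best_conf = term, confidence
    return best_term if best_conf > min_confidence else None
```

Because the confidence is computed from both channels, neither noisy audio alone nor ambiguous lip movement alone is enough to push a wrong term over the threshold.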
4. A method of determining input to a computing device, comprising:
capturing audio data using at least one audio capture element of the computing device;
capturing image data using at least one image capture element of the computing device;
analyzing at least one of the audio data and the image data to determine whether a person interacting with the computing device is generating speech that is perceptible by the computing device; and
if the speech is perceptible by the computing device, analyzing a combination of the audio data and the image data to determine at least a portion of the content of the speech, wherein the analyzed combination of audio data and image data are for a substantially same period of time, and wherein the analyzing includes, in part, comparing at least a portion of the image data to one or more word formation models that are capable of being personalized for the person over a duration of time based at least in part on the speech generated by the person; and
if at least a portion of the content corresponds to an input to the computing device, processing the input, wherein the at least the portion corresponds to the input when the at least the portion matches the input at least at a defined confidence level, the defined confidence level being relative to the audio data and the image data.
View Dependent Claims (5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24)
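The "personalized for the person over a duration of time" limitation implies a per-user model that is updated from the user's own observed utterances rather than trained up front. A minimal sketch, assuming mouth movement has already been reduced to a fixed-length feature vector (the feature extraction itself is not specified here and is purely an assumption):

```python
class WordFormationModel:
    """Hypothetical per-user word formation model: for each word, keep a
    running mean of the mouth-movement feature vectors observed when the
    user spoke that word, so the model drifts toward this user over time."""

    def __init__(self):
        self.features = {}  # word -> (mean feature vector, sample count)

    def update(self, word, mouth_features):
        """Fold one observed utterance of `word` into the running mean."""
        mean, n = self.features.get(word, ([0.0] * len(mouth_features), 0))
        new_mean = [(m * n + f) / (n + 1) for m, f in zip(mean, mouth_features)]
        self.features[word] = (new_mean, n + 1)
```

Each call to `update` corresponds to one detected co-occurrence of speech and mouth movement, which is the signal the claims say drives personalization.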
25. A computing device, comprising:
a processor;
a memory device including instructions operable to be executed by the processor to perform a set of actions, enabling the processor to:
capture audio data using at least one audio capture element in communication with the computing device;
capture image data using at least one image capture element in communication with the computing device;
analyze at least one of the audio data and the image data to determine whether a person interacting with the computing device is generating speech that is perceptible by the computing device; and
if the person is generating speech, analyze a combination of the audio data and the image data to determine at least a portion of the content of the speech, wherein the analyzing includes, in part, comparing at least a portion of the image data to one or more word formation models that are capable of being personalized for the person over a duration of time based at least in part on the speech generated by the person; and
if the content corresponds to input to the computing device, process the input on the computing device, wherein the content corresponds to the input when the content matches the input at least at a defined confidence level, the defined confidence level being relative to the audio data and the image data.
View Dependent Claims (26, 27, 28)
29. A non-transitory computer-readable storage medium storing processor-executable instructions for controlling a computing device, comprising:
program code for accessing audio data;
program code for accessing image data;
program code for analyzing at least one of the audio data and the image data to determine whether a person interacting with the computing device is generating speech that is perceptible by the computing device; and
program code for, if the person is generating speech, analyzing a combination of the audio data and the image data to determine at least a portion of the content of the speech, wherein the analyzing includes, in part, comparing at least a portion of the image data to one or more word formation models that are capable of being personalized for the person over a duration of time based at least in part on the speech generated by the person; and
if the content corresponds to an input to the computing device, providing the input for processing, wherein the content corresponds to the input when the content matches the input at least at a defined confidence level, the defined confidence level being relative to the audio data and the image data.
View Dependent Claims (30, 31, 32, 33, 34)
Specification