Speech-inclusive device interfaces
Abstract
A user can provide input to a computing device through various combinations of speech, movement, and/or gestures. The computing device can capture audio data and analyze it to detect any speech information it contains. The device can simultaneously capture image or video information, which can be used to assist in analyzing the audio information. For example, the image information can be used to determine when someone is speaking, and the movement of the person's lips can be analyzed to help determine the words that were spoken. Any gestures or other motions can assist in the determination as well. By combining multiple types of data to determine user input, the accuracy of a process such as speech recognition can be improved, and lengthy application training processes can be avoided.
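As a rough illustration of the audio-visual fusion the abstract describes, the sketch below blends per-candidate audio match scores with lip-movement (visual) scores into a single ranking. The candidate words, the 0.0-1.0 score range, and the weighting are illustrative assumptions, not the patent's actual implementation.

```python
def fuse_scores(audio_scores, visual_scores, audio_weight=0.6):
    """Combine per-candidate audio and visual (lip-reading) match scores,
    each assumed to lie in 0.0-1.0, into one confidence per candidate word."""
    combined = {}
    for word, a in audio_scores.items():
        v = visual_scores.get(word, 0.0)  # visual score, if lips were analyzed
        combined[word] = audio_weight * a + (1.0 - audio_weight) * v
    return combined

def best_candidate(audio_scores, visual_scores):
    """Pick the candidate word with the highest fused confidence."""
    combined = fuse_scores(audio_scores, visual_scores)
    return max(combined, key=combined.get)
```

In this toy setup, two words that sound alike ("call" vs. "tall") can be disambiguated by the visual channel, which is the benefit the abstract attributes to combining the two data types.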
34 Claims
1. A method of determining user input to a computing device, comprising:
capturing audio data using at least one audio capture element of the computing device;
concurrent with capturing audio data, capturing image data of the user using at least one image capture element of the computing device; and
using at least one algorithm executing on a processor of the computing device:
detecting a presence of speech information contained in the captured audio data;
in the captured image data, detecting mouth movement of the user of the computing device at a time during which the speech information was detected;
in response to detecting the presence of speech information and detecting mouth movement, identifying at least one user input based at least in part on a combination of the speech information and the mouth movement for a defined period of time, wherein the identifying of the at least one user input includes comparing the mouth movement in the captured image data to one or more word formation models, the one or more word formation models capable of being personalized for the user over a duration of time based at least in part on the speech information and the mouth movement of the user; and
providing the user input for processing if a confidence level of the identified user input exceeds a minimum threshold, wherein the confidence level of the identified user input is relative to the combination of the speech information and the mouth movement, and wherein the confidence level of the identified user input is based at least in part on a metric indicating a level of matching between the identified user input and an input term.
View Dependent Claims (2, 3)
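The confidence-threshold gating recited in claim 1 might be sketched as follows. The viseme-set representation of a word formation model, the 0.5/0.5 audio-visual blend, and the 0.7 minimum threshold are hypothetical choices for illustration only; the claim does not specify any of them.

```python
def identify_user_input(speech_text, mouth_movement, word_formation_models,
                        min_confidence=0.7):
    """Return the recognized input term only if its confidence exceeds the
    minimum threshold; otherwise return None (input not provided for
    processing). `word_formation_models` maps each input term to a set of
    mouth-movement (viseme) features — a hypothetical model representation."""
    best_term, best_conf = None, 0.0
    for term, model in word_formation_models.items():
        # Metric indicating the level of matching between the identified
        # input and this term: fraction of the term's mouth-movement
        # features observed, blended with an exact audio match.
        visual_match = len(mouth_movement & model) / max(len(model), 1)
        audio_match = 1.0 if term == speech_text else 0.0
        confidence = 0.5 * visual_match + 0.5 * audio_match
        if confidence > best_conf:
            best_term, best_conf = term, confidence
    return best_term if best_conf > min_confidence else None
```

Because the confidence is computed from both channels, neither noisy audio alone nor ambiguous lip movement alone is enough to push a wrong term over the threshold.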
4. A method of determining input to a computing device, comprising:
capturing audio data using at least one audio capture element of the computing device;
capturing image data using at least one image capture element of the computing device;
analyzing at least one of the audio data and the image data to determine whether a person interacting with the computing device is generating speech that is perceptible by the computing device; and
if the speech is perceptible by the computing device, analyzing a combination of the audio data and the image data to determine at least a portion of the content of the speech, wherein the analyzed combination of audio data and image data are for a substantially same period of time, and wherein the analyzing includes, in part, comparing at least a portion of the image data to one or more word formation models that are capable of being personalized for the person over a duration of time based at least in part on the speech generated by the person; and
if at least a portion of the content corresponds to an input to the computing device, processing the input, wherein the at least the portion corresponds to the input when the at least the portion matches the input at least at a defined confidence level, the defined confidence level being relative to the audio data and the image data.
View Dependent Claims (5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24)
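The "personalized for the person over a duration of time" limitation implies a per-user model that is updated from the user's own observed utterances rather than trained up front. A minimal sketch, assuming mouth movement has already been reduced to a fixed-length feature vector (the feature extraction itself is not specified here and is purely an assumption):

```python
class WordFormationModel:
    """Hypothetical per-user word formation model: for each word, keep a
    running mean of the mouth-movement feature vectors observed when the
    user spoke that word, so the model drifts toward this user over time."""

    def __init__(self):
        self.features = {}  # word -> (mean feature vector, sample count)

    def update(self, word, mouth_features):
        """Fold one observed utterance of `word` into the running mean."""
        mean, n = self.features.get(word, ([0.0] * len(mouth_features), 0))
        new_mean = [(m * n + f) / (n + 1) for m, f in zip(mean, mouth_features)]
        self.features[word] = (new_mean, n + 1)
```

Each call to `update` corresponds to one detected co-occurrence of speech and mouth movement, which is the signal the claims say drives personalization.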
25. A computing device, comprising:
a processor;
a memory device including instructions operable to be executed by the processor to perform a set of actions, enabling the processor to:
capture audio data using at least one audio capture element in communication with the computing device;
capture image data using at least one image capture element in communication with the computing device;
analyze at least one of the audio data and the image data to determine whether a person interacting with the computing device is generating speech that is perceptible by the computing device; and
if the person is generating speech, analyze a combination of the audio data and the image data to determine at least a portion of the content of the speech, wherein the analyzing includes, in part, comparing at least a portion of the image data to one or more word formation models that are capable of being personalized for the person over a duration of time based at least in part on the speech generated by the person; and
if the content corresponds to input to the computing device, process the input on the computing device, wherein the content corresponds to the input when the content matches the input at least at a defined confidence level, the defined confidence level being relative to the audio data and the image data.
View Dependent Claims (26, 27, 28)
29. A non-transitory computer-readable storage medium storing processor-executable instructions for controlling a computing device, comprising:
program code for accessing audio data;
program code for accessing image data;
program code for analyzing at least one of the audio data and the image data to determine whether a person interacting with the computing device is generating speech that is perceptible by the computing device; and
program code for, if the person is generating speech, analyzing a combination of the audio data and the image data to determine at least a portion of the content of the speech, wherein the analyzing includes, in part, comparing at least a portion of the image data to one or more word formation models that are capable of being personalized for the person over a duration of time based at least in part on the speech generated by the person; and
if the content corresponds to an input to the computing device, providing the input for processing, wherein the content corresponds to the input when the content matches the input at least at a defined confidence level, the defined confidence level being relative to the audio data and the image data.
View Dependent Claims (30, 31, 32, 33, 34)
Specification