Method and apparatus for using image data to aid voice recognition

US 10,311,868 B2
Filed: 03/21/2017
Issued: 06/04/2019
Est. Priority Date: 05/24/2013
Status: Active Grant

Abstract

A device performs a method for using image data to aid voice recognition. The method includes the device capturing (302) image data of a vicinity of the device and adjusting (304), based on the image data, a set of parameters for voice recognition performed by the device (102). The set of parameters for the device performing voice recognition includes, but is not limited to: a trigger threshold of a trigger for voice recognition; a set of beamforming parameters; a database for voice recognition; and/or an algorithm for voice recognition. The algorithm may include using noise suppression or using acoustic beamforming.
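
The adjustment step in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the patent: the `RecognitionParams` type, the `adjust_params` function, the use of a face count as the image-derived signal, the 0.25 step, and the policy itself (one visible face lowers the trigger threshold, no visible face raises it, several faces enable noise suppression).

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RecognitionParams:
    """Hypothetical subset of the voice-recognition parameters named in the abstract."""
    trigger_threshold: float  # confidence a trigger must reach before recognition starts
    noise_suppression: bool   # whether the recognition algorithm applies noise suppression


STEP = 0.25  # illustrative adjustment size, not a value from the patent


def adjust_params(params: RecognitionParams, faces_in_view: int) -> RecognitionParams:
    """Return a copy of params adjusted for the number of faces seen in the image data."""
    if faces_in_view == 0:
        # Assumed policy: nobody is visible, so require a stronger trigger.
        return replace(params, trigger_threshold=min(1.0, params.trigger_threshold + STEP))
    if faces_in_view == 1:
        # Assumed policy: a single visible user makes a trigger more plausible.
        return replace(params, trigger_threshold=max(0.0, params.trigger_threshold - STEP))
    # Assumed policy: several people in view, so keep the threshold and suppress noise.
    return replace(params, noise_suppression=True)
```

The sketch returns a new parameter object rather than mutating the input, so the caller can compare the parameters before and after each captured image.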

21 Claims

1. A computer-implemented method comprising:
- receiving, by a computing device, audio data corresponding to a user speaking in a vicinity of the computing device;
  
  obtaining, by the computing device, a first image that includes a first representation of the user and that was captured during receipt of a first portion of the audio data;
  
  obtaining, by the computing device, a second image that includes a second representation of the user and that was captured during receipt of a second portion of the audio data;
  
  determining, by the computing device, a first feature of the first representation of the user by analyzing the first image;
  
  determining, by the computing device, a second feature of the second representation of the user by analyzing the second image;
  
  based on the first feature of the first representation of the user included in the first image, obtaining, by the computing device, a transcription of the first portion of the audio data;
  
  based on the second feature of the second representation of the user included in the second image, bypassing, by the computing device, obtaining a transcription of the second portion of the audio data; and
  
  providing, for output by the computing device, the transcription of a portion of the audio data.
- Dependent claims (2, 3, 4, 5, 6, 7, 21):
  - 2. The method of claim 1, wherein the first feature of the first representation of the user included in the first image includes both eyes, a nose, and a mouth of the user.
  - 3. The method of claim 1, wherein the second feature of the second representation of the user included in the second image includes less than both eyes, a nose, and a mouth of the user.
  - 4. The method of claim 1, wherein:
    - the user is an authorized user of the computing device, and
    - providing the transcription of the portion of the audio data based on the user being an authorized user of the computing device.
  - 5. The method of claim 1, wherein the first feature of the first representation of the user included in the first image includes both pupils of the user.
  - 6. The method of claim 1, wherein the computing device obtains the first image and the second image using an infrared camera.
  - 7. The method of claim 1, wherein an accuracy of speech recognition performed on the first portion and the second portion of the audio data varies according to a number of other users who are in each of the first image and the second image.
  - 21. The method of claim 1, wherein:
    - obtaining a transcription of the first portion of the audio data comprises performing speech recognition on the first portion of the audio data, and
    - bypassing obtaining a transcription of the second portion of the audio data comprises bypassing performing speech recognition on the second portion of the audio data.

8. A system comprising:
- one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising:
  
  receiving, by a computing device, audio data corresponding to a user speaking in a vicinity of the computing device;
  
  obtaining, by the computing device, a first image that includes a first representation of the user and that was captured during receipt of a first portion of the audio data;
  
  obtaining, by the computing device, a second image that includes a second representation of the user and that was captured during receipt of a second portion of the audio data;
  
  determining, by the computing device, a first feature of the first representation of the user by analyzing the first image;
  
  determining, by the computing device, a second feature of the second representation of the user by analyzing the second image;
  
  based on the first feature of the first representation of the user included in the first image, obtaining, by the computing device, a transcription of the first portion of the audio data;
  
  based on the second feature of the second representation of the user included in the second image, bypassing, by the computing device, obtaining a transcription of the second portion of the audio data; and
  
  providing, for output by the computing device, the transcription of a portion of the audio data.
- Dependent claims (9, 10, 11, 12, 13, 14):
  - 9. The system of claim 8, wherein the first feature of the first representation of the user included in the first image includes both eyes, a nose, and a mouth of the user.
  - 10. The system of claim 8, wherein the second feature of the second representation of the user included in the second image includes less than both eyes, a nose, and a mouth of the user.
  - 11. The system of claim 8, wherein:
    - the user is an authorized user of the computing device, and
    - providing the transcription of the portion of the audio data based on the user being an authorized user of the computing device.
  - 12. The system of claim 8, wherein the first feature of the first representation of the user included in the first image includes both pupils of the user.
  - 13. The system of claim 8, wherein the computing device obtains the first image and the second image using an infrared camera.
  - 14. The system of claim 8, wherein an accuracy of speech recognition performed on the first portion and the second portion of the audio data varies according to a number of other users who are in each of the first image and the second image.

15. A non-transitory computer-readable medium storing software comprising instructions executable by one or more computers which, upon such execution, cause the one or more computers to perform operations comprising:
- receiving, by a computing device, audio data corresponding to a user speaking in a vicinity of the computing device;
  
  obtaining, by the computing device, a first image that includes a first representation of the user and that was captured during receipt of a first portion of the audio data;
  
  obtaining, by the computing device, a second image that includes a second representation of the user and that was captured during receipt of a second portion of the audio data;
  
  determining, by the computing device, a first feature of the first representation of the user by analyzing the first image;
  
  determining, by the computing device, a second feature of the second representation of the user by analyzing the second image;
  
  based on the first feature of the first representation of the user included in the first image, obtaining, by the computing device, a transcription of the first portion of the audio data;
  
  based on the second feature of the second representation of the user included in the second image, bypassing, by the computing device, obtaining a transcription of the second portion of the audio data; and
  
  providing, for output by the computing device, the transcription of a portion of the audio data.
- Dependent claims (16, 17, 18, 19, 20):
  - 16. The medium of claim 15, wherein the first feature of the first representation of the user included in the first image includes both eyes, a nose, and a mouth of the user.
  - 17. The medium of claim 15, wherein the second feature of the second representation of the user included in the second image includes less than both eyes, a nose, and a mouth of the user.
  - 18. The medium of claim 15, wherein:
    - the user is an authorized user of the computing device, and
    - providing the transcription of the portion of the audio data based on the user being an authorized user of the computing device.
  - 19. The medium of claim 15, wherein the first feature of the first representation of the user included in the first image includes both pupils of the user.
  - 20. The medium of claim 15, wherein the computing device obtains the first image and the second image using an infrared camera.
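
The per-portion gating recited in independent claims 1, 8, and 15 can be illustrated with a minimal sketch. It assumes the facial features have already been extracted from each image by some detector that is not shown, and it borrows the "both eyes, a nose, and a mouth" criterion from the dependent claims as the deciding feature. The names `Segment`, `FULL_FACE`, `transcribe_gated`, and the `recognizer` callback are hypothetical and chosen only for illustration; the claims do not prescribe this structure.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Iterable, List

# Assumed feature labels: a portion is transcribed only when all four are visible.
FULL_FACE: FrozenSet[str] = frozenset({"left_eye", "right_eye", "nose", "mouth"})


@dataclass(frozen=True)
class Segment:
    """One portion of the audio plus the features found in the image captured with it."""
    audio: bytes
    visible_features: FrozenSet[str]


def transcribe_gated(
    segments: Iterable[Segment],
    recognizer: Callable[[bytes], str],
) -> List[str]:
    """Transcribe portions whose image shows a full face; bypass recognition otherwise."""
    transcripts: List[str] = []
    for segment in segments:
        if FULL_FACE <= segment.visible_features:
            # Full face visible: obtain a transcription of this portion.
            transcripts.append(recognizer(segment.audio))
        # Otherwise the recognizer is never called for this portion.
    return transcripts
```

Passing the recognizer in as a callback keeps the sketch independent of any particular speech-recognition engine and makes it easy to confirm that bypassed portions never reach the recognizer.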

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Technology Holdings LLC (Alphabet Inc.)
Inventors: Zurek, Robert A.; Schuster, Adrian M.; Shau, Fu-Lin; Wu, Jincheng
Primary Examiner(s): Riley, Marcus T
Application Number: US15/464,704
Publication Number: US 20170193996A1
Time in Patent Office: 805 Days
Field of Search: None
US Class Current
CPC Class Codes

B60N 2/002   Seats provided with an occu...
G06F 3/013   Eye tracking input arrangem...
G06V 20/59   inside of a vehicle, e.g. r...
G06V 40/166   using acquisition arrangements
G06V 40/18   Eye characteristics, e.g. o...
G06V 40/19   Sensors therefor
G06V 40/20   Movements or behaviour, e.g...
G10L 15/20   Speech recognition techniqu...
G10L 15/22   Procedures used during a sp...
G10L 15/24   Speech recognition using no...
G10L 15/25   using position of the lips,...
G10L 15/26   Speech to text systems G10L...
G10L 2015/223   Execution procedure of a sp...
G10L 2015/227   of the speaker; Human-fact...
G10L 2021/02166   Microphone arrays; Beamforming
G10L 21/0208   Noise filtering
G10L 25/78   Detection of presence or ab...
H04R 2430/20   Processing of the output si...
H04R 2460/07   Use of position data from w...
H04R 2499/11   Transducers incorporated or...
