Method and apparatus for using image data to aid voice recognition
First Claim
1. A computer-implemented method comprising:
receiving, by a computing device, audio data corresponding to a user speaking in a vicinity of the computing device;
obtaining, by the computing device, a first image that includes a first representation of the user and that was captured during receipt of a first portion of the audio data;
obtaining, by the computing device, a second image that includes a second representation of the user and that was captured during receipt of a second portion of the audio data;
determining, by the computing device, a first feature of the first representation of the user by analyzing the first image;
determining, by the computing device, a second feature of the second representation of the user by analyzing the second image;
based on the first feature of the first representation of the user included in the first image, obtaining, by the computing device, a transcription of the first portion of the audio data;
based on the second feature of the second representation of the user included in the second image, bypassing, by the computing device, obtaining a transcription of the second portion of the audio data; and
providing, for output by the computing device, the transcription of a portion of the audio data.
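The steps of claim 1 can be sketched in code as a per-portion gate on transcription. This is a minimal illustration under stated assumptions, not the patented implementation: the claim does not say which image feature is determined, so `user_faces_device`, the `Portion` type, the dictionary used as a stand-in for image data, and `fake_transcriber` are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Portion:
    """One portion of the audio data plus the image captured while it was received."""
    audio: bytes
    image: dict  # stand-in for real image data (assumption)


def user_faces_device(image: dict) -> bool:
    # Hypothetical feature: the claim does not specify which feature is determined.
    return bool(image.get("facing_device", False))


def transcribe_portions(
    portions: List[Portion],
    feature: Callable[[dict], bool],
    transcribe: Callable[[bytes], str],
) -> List[Optional[str]]:
    """Transcribe a portion only when its image feature allows it; otherwise bypass."""
    results: List[Optional[str]] = []
    for portion in portions:
        if feature(portion.image):
            results.append(transcribe(portion.audio))
        else:
            results.append(None)  # transcription bypassed for this portion
    return results


def fake_transcriber(audio: bytes) -> str:
    # Placeholder for a real speech recognizer (assumption).
    return audio.decode("utf-8")
```

Passing two portions, one whose image has the hypothetical feature and one whose image lacks it, yields a transcription for the first and `None` for the second, mirroring the "obtaining" and "bypassing" steps.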
3 Assignments
0 Petitions
Abstract
A device performs a method for using image data to aid voice recognition. The method includes the device capturing (302) image data of a vicinity of the device and adjusting (304), based on the image data, a set of parameters for voice recognition performed by the device (102). The set of parameters includes, but is not limited to: a trigger threshold of a trigger for voice recognition; a set of beamforming parameters; a database for voice recognition; and/or an algorithm for voice recognition. The algorithm may include using noise suppression or using acoustic beamforming.
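The parameter adjustment described in the abstract can be illustrated with a small sketch. The abstract names the adjustable parameters but not the rules for adjusting them, so everything below is an assumption made for illustration: the `VoiceParams` and `SceneInfo` types, the 0.2 threshold step, and the policy of steering the beam toward a detected person and falling back to noise suppression otherwise.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class VoiceParams:
    trigger_threshold: float = 0.5          # trigger threshold for voice recognition
    beam_angle_deg: Optional[float] = None  # one example beamforming parameter
    database: str = "default"               # database used for voice recognition
    algorithm: str = "noise_suppression"    # or "acoustic_beamforming"


@dataclass(frozen=True)
class SceneInfo:
    """Hypothetical summary of what the captured image data shows."""
    person_detected: bool
    person_angle_deg: Optional[float] = None


def adjust_params(params: VoiceParams, scene: SceneInfo) -> VoiceParams:
    """Return adjusted parameters based on image data (illustrative policy only)."""
    if scene.person_detected:
        # Assumed policy: make triggering easier and aim the beam at the person.
        return replace(
            params,
            trigger_threshold=max(0.0, params.trigger_threshold - 0.2),
            beam_angle_deg=scene.person_angle_deg,
            algorithm="acoustic_beamforming",
        )
    # Assumed policy: nobody in view, so make triggering harder and drop the beam.
    return replace(
        params,
        trigger_threshold=min(1.0, params.trigger_threshold + 0.2),
        beam_angle_deg=None,
        algorithm="noise_suppression",
    )
```

Returning a new immutable `VoiceParams` rather than mutating in place is a design choice of this sketch; it keeps each adjustment easy to compare against the previous settings.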
57 Citations
21 Claims
1. A computer-implemented method comprising:
receiving, by a computing device, audio data corresponding to a user speaking in a vicinity of the computing device;
obtaining, by the computing device, a first image that includes a first representation of the user and that was captured during receipt of a first portion of the audio data;
obtaining, by the computing device, a second image that includes a second representation of the user and that was captured during receipt of a second portion of the audio data;
determining, by the computing device, a first feature of the first representation of the user by analyzing the first image;
determining, by the computing device, a second feature of the second representation of the user by analyzing the second image;
based on the first feature of the first representation of the user included in the first image, obtaining, by the computing device, a transcription of the first portion of the audio data;
based on the second feature of the second representation of the user included in the second image, bypassing, by the computing device, obtaining a transcription of the second portion of the audio data; and
providing, for output by the computing device, the transcription of a portion of the audio data.
Dependent claims: 2, 3, 4, 5, 6, 7, 21
8. A system comprising:
one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising:
receiving, by a computing device, audio data corresponding to a user speaking in a vicinity of the computing device;
obtaining, by the computing device, a first image that includes a first representation of the user and that was captured during receipt of a first portion of the audio data;
obtaining, by the computing device, a second image that includes a second representation of the user and that was captured during receipt of a second portion of the audio data;
determining, by the computing device, a first feature of the first representation of the user by analyzing the first image;
determining, by the computing device, a second feature of the second representation of the user by analyzing the second image;
based on the first feature of the first representation of the user included in the first image, obtaining, by the computing device, a transcription of the first portion of the audio data;
based on the second feature of the second representation of the user included in the second image, bypassing, by the computing device, obtaining a transcription of the second portion of the audio data; and
providing, for output by the computing device, the transcription of a portion of the audio data.
Dependent claims: 9, 10, 11, 12, 13, 14
15. A non-transitory computer-readable medium storing software comprising instructions executable by one or more computers which, upon such execution, cause the one or more computers to perform operations comprising:
receiving, by a computing device, audio data corresponding to a user speaking in a vicinity of the computing device;
obtaining, by the computing device, a first image that includes a first representation of the user and that was captured during receipt of a first portion of the audio data;
obtaining, by the computing device, a second image that includes a second representation of the user and that was captured during receipt of a second portion of the audio data;
determining, by the computing device, a first feature of the first representation of the user by analyzing the first image;
determining, by the computing device, a second feature of the second representation of the user by analyzing the second image;
based on the first feature of the first representation of the user included in the first image, obtaining, by the computing device, a transcription of the first portion of the audio data;
based on the second feature of the second representation of the user included in the second image, bypassing, by the computing device, obtaining a transcription of the second portion of the audio data; and
providing, for output by the computing device, the transcription of a portion of the audio data.
Dependent claims: 16, 17, 18, 19, 20
Specification