VIDEO ANALYSIS BASED LANGUAGE MODEL ADAPTATION
Abstract
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for receiving audio data obtained by a microphone of a wearable computing device, wherein the audio data encodes a user utterance, receiving image data obtained by a camera of the wearable computing device, identifying one or more image features based on the image data, identifying one or more concepts based on the one or more image features, selecting one or more terms associated with a language model used by a speech recognizer to generate transcriptions, adjusting one or more probabilities associated with the language model that correspond to one or more of the selected terms based on the relevance of one or more of the selected terms to the one or more concepts, and obtaining a transcription of the user utterance using the speech recognizer.
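The probability adjustment described in the abstract can be illustrated with a minimal sketch. It assumes a toy unigram language model stored as a dictionary and a hypothetical relevance score per term. The function name `adjust_probabilities`, the `boost` parameter, and the exponential scaling rule are illustrative assumptions and are not taken from the patent.

```python
def adjust_probabilities(unigram_probs, relevance, boost=2.0):
    """Scale each term's probability by boost**relevance, then renormalize.

    unigram_probs: dict mapping term -> probability (toy language model).
    relevance: dict mapping term -> relevance in [0, 1] to the detected concept.
    Terms missing from `relevance` are treated as irrelevant (relevance 0).
    """
    scaled = {
        term: p * (boost ** relevance.get(term, 0.0))
        for term, p in unigram_probs.items()
    }
    total = sum(scaled.values())
    # Renormalize so the adjusted values still form a probability distribution.
    return {term: p / total for term, p in scaled.items()}


# Hypothetical toy model and relevance scores for the concept "cooking".
lm = {"flour": 0.1, "floor": 0.4, "whisk": 0.1, "risk": 0.4}
relevance_to_cooking = {"flour": 1.0, "whisk": 1.0}
adjusted = adjust_probabilities(lm, relevance_to_cooking, boost=4.0)
```

In this sketch the relevant terms ("flour", "whisk") gain probability mass at the expense of acoustically similar but irrelevant terms ("floor", "risk"). A production recognizer would more likely adjust n-gram or class-based weights than a flat unigram table.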
25 Claims
1. A computer-implemented method comprising:
receiving audio data obtained by a microphone of a wearable computing device, wherein the audio data encodes an utterance of a user;
receiving image data obtained by a camera of the wearable computing device;
identifying one or more image features based on the image data;
classifying the image data as pertaining to a particular activity, based at least on the one or more image features, wherein the particular activity is unrelated to providing an explicit user input to the wearable computing device;
selecting one or more terms associated with a language model used by a speech recognizer to generate transcriptions;
adjusting one or more probabilities associated with the language model that correspond to one or more of the selected terms based on the relevance of one or more of the selected terms to the particular activity; and
obtaining, as an output of the speech recognizer that uses the adjusted probabilities, a transcription of the user utterance.
Dependent claims: 2, 3, 4, 21, 25

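The classification and term-selection steps of claim 1 can be sketched as follows. This is a minimal illustration, assuming image features arrive as string labels and that activities are matched by simple set overlap. The tables `ACTIVITY_FEATURES` and `ACTIVITY_TERMS` and both function names are hypothetical; the claim does not prescribe any particular classifier.

```python
# Hypothetical mapping from activities to image features that suggest them.
ACTIVITY_FEATURES = {
    "cooking": {"stove", "pan", "knife", "cutting board"},
    "cycling": {"handlebar", "road", "helmet"},
}

# Hypothetical mapping from activities to language-model terms worth boosting.
ACTIVITY_TERMS = {
    "cooking": ["flour", "whisk", "simmer"],
    "cycling": ["gear", "brake", "route"],
}


def classify_activity(image_features):
    """Return the activity whose feature set overlaps most with the input.

    Returns None when no known activity shares any feature with the input.
    """
    detected = set(image_features)
    best, best_overlap = None, 0
    for activity, features in ACTIVITY_FEATURES.items():
        overlap = len(detected & features)
        if overlap > best_overlap:
            best, best_overlap = activity, overlap
    return best


def select_terms(activity):
    """Return the terms associated with an activity (empty list if unknown)."""
    return ACTIVITY_TERMS.get(activity, [])


activity = classify_activity(["pan", "stove", "window"])
terms = select_terms(activity)
```

Here the features "pan" and "stove" lead to the activity "cooking", and the selected terms are the ones whose language-model probabilities would then be adjusted.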
5. A system comprising:
one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising:
receiving audio data encoding an utterance of a user;
receiving image data;
classifying the image data as pertaining to a particular activity, based at least on a result of analyzing the image data, wherein the particular activity is unrelated to providing an explicit user input to the one or more computers;
influencing a speech recognizer based at least on classifying the image data as pertaining to the particular activity; and
obtaining a transcription of the user utterance using the influenced speech recognizer.
Dependent claims: 6, 7, 8, 9, 10, 11, 12, 22

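Claim 5 speaks more broadly of "influencing a speech recognizer" without specifying how. One possible reading, sketched below, is to rescore a recognizer's candidate transcriptions so that hypotheses containing activity-related words are preferred. The `(text, log_score)` hypothesis format, the `rescore` function, and the additive `bonus` are assumptions made for illustration only.

```python
import math


def rescore(hypotheses, activity_terms, bonus=1.0):
    """Return the best transcription after biasing toward activity terms.

    hypotheses: list of (text, log_score) pairs, a hypothetical n-best format.
    activity_terms: set of lowercase words related to the classified activity.
    bonus: log-score increment added for each activity-related word.
    """
    def biased_score(item):
        text, log_score = item
        hits = sum(1 for word in text.lower().split() if word in activity_terms)
        return log_score + bonus * hits

    return max(hypotheses, key=biased_score)[0]


# Hypothetical n-best list in which the unbiased recognizer prefers "floor".
n_best = [("add more floor", math.log(0.5)), ("add more flour", math.log(0.4))]
best = rescore(n_best, {"flour", "whisk"})
unbiased = rescore(n_best, set())
```

With the cooking terms supplied, the second hypothesis wins; with an empty term set, the original ranking is kept.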
13. A computer readable storage device encoded with a computer program, the program comprising instructions that, if executed by one or more computers, cause the one or more computers to perform operations comprising:
receiving audio data encoding an utterance of a user;
receiving image data;
classifying the image data as pertaining to a particular activity, based at least on a result of analyzing the image data, wherein the particular activity is unrelated to providing an explicit user input to the one or more computers;
influencing a speech recognizer based at least on classifying the image data as pertaining to the particular activity; and
obtaining a transcription of the user utterance using the influenced speech recognizer.
Dependent claims: 14, 15, 16, 17, 18, 19, 20, 23

24. (canceled)
Specification