VIDEO ANALYSIS BASED LANGUAGE MODEL ADAPTATION

US 20140379346A1
Filed: 06/21/2013
Published: 12/25/2014
Est. Priority Date: 06/21/2013
Status: Abandoned Application

Abstract

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for receiving audio data obtained by a microphone of a wearable computing device, wherein the audio data encodes a user utterance, receiving image data obtained by a camera of the wearable computing device, identifying one or more image features based on the image data, identifying one or more concepts based on the one or more image features, selecting one or more terms associated with a language model used by a speech recognizer to generate transcriptions, adjusting one or more probabilities associated with the language model that correspond to one or more of the selected terms based on the relevance of one or more of the selected terms to the one or more concepts, and obtaining a transcription of the user utterance using the speech recognizer.

37 Citations

25 Claims

1. A computer-implemented method comprising:
- receiving audio data obtained by a microphone of a wearable computing device, wherein the audio data encodes an utterance of a user;
  
  receiving image data obtained by a camera of the wearable computing device;
  
  identifying one or more image features based on the image data;
  
  classifying the image data as pertaining to a particular activity, based at least on the one or more image features, wherein the particular activity is unrelated to providing an explicit user input to the wearable computing device;
  
  selecting one or more terms associated with a language model used by a speech recognizer to generate transcriptions;
  
  adjusting one or more probabilities associated with the language model that correspond to one or more of the selected terms based on the relevance of one or more of the selected terms to the particular activity; and
  
  obtaining, as an output of the speech recognizer that uses the adjusted probabilities, a transcription of the user utterance.
- Dependent claims (2, 3, 4, 21, 25):
  - 2. The method of claim 1, wherein classifying the image data as pertaining to the activity comprises:
    - obtaining a result of performing at least an optical character recognition process on the image data; and
      
      classifying the image data as pertaining to the particular activity based at least on the result.
  - 3. The method of claim 1, wherein classifying the image data as pertaining to the particular activity comprises:
    - obtaining a result of performing a feature matching process on the image data; and
      
      classifying the image data as pertaining to the particular activity based at least on the result.
  - 4. The method of claim 1, wherein classifying the image data as pertaining to the particular activity comprises:
    - obtaining a result of performing a shape matching process on the image data; and
      
      classifying the image data as pertaining to the particular activity based at least on the result.
  - 21. The method of claim 1, wherein classifying the image data as pertaining to the particular activity comprises:
    - classifying the image data as pertaining to the particular activity without performing an optical character recognition process on the image data.
  - 25. The method of claim 1, wherein the particular activity is one of driving, running, shopping, or attending a concert.
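
The adjusting step of claim 1 can be illustrated with a minimal sketch. It assumes a toy unigram language model stored as a dictionary and a hypothetical per-activity relevance score between 0 and 1 for each term; the function name `adjust_unigram_probs`, the `strength` parameter, and the multiplicative boost are illustrative assumptions, not the scheme described in the application.

```python
from typing import Dict


def adjust_unigram_probs(
    lm: Dict[str, float],
    relevance: Dict[str, float],
    strength: float = 1.0,
) -> Dict[str, float]:
    """Boost terms relevant to the detected activity, then renormalize.

    Assumption: each probability is scaled by (1 + strength * relevance),
    where relevance defaults to 0.0 for terms unrelated to the activity.
    """
    scaled = {
        term: prob * (1.0 + strength * relevance.get(term, 0.0))
        for term, prob in lm.items()
    }
    total = sum(scaled.values())
    # Renormalize so the adjusted values still form a distribution.
    return {term: prob / total for term, prob in scaled.items()}


# Hypothetical toy data: acoustically confusable pairs while "driving".
base_lm = {"brake": 0.1, "break": 0.4, "lane": 0.1, "lame": 0.4}
driving_relevance = {"brake": 1.0, "lane": 1.0}

adjusted = adjust_unigram_probs(base_lm, driving_relevance, strength=4.0)
```

With these toy numbers, "brake" ends up more probable than "break" after adjustment, which is the kind of bias toward activity-relevant terms that the claim describes.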

5. A system comprising:
- one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising:
  
  receiving audio data encoding an utterance of a user;
  
  receiving image data;
  
  classifying the image data as pertaining to a particular activity, based at least on a result of analyzing the image data, wherein the particular activity is unrelated to providing an explicit user input to the one or more computers;
  
  influencing a speech recognizer based at least on classifying the image data as pertaining to the particular activity; and
  
  obtaining a transcription of the user utterance using the influenced speech recognizer.
- Dependent claims (6, 7, 8, 9, 10, 11, 12, 22):
  - 6. The system of claim 5, wherein classifying the image data as pertaining to the particular activity comprises:
    - obtaining a result of performing at least an optical character recognition process on the image data; and
      
      classifying the image data as pertaining to the particular activity based at least on the result.
  - 7. The system of claim 5, wherein classifying the image data as pertaining to the particular activity comprises:
    - obtaining a result of performing a feature recognition process on the image data; and
      
      classifying the image data as pertaining to the particular activity based at least on the result.
  - 8. The system of claim 5, wherein classifying the image data as pertaining to the particular activity comprises:
    - obtaining a result of performing a shape matching process on the image data; and
      
      classifying the image data as pertaining to the particular activity based at least on the result.
  - 9. The system of claim 5, wherein influencing the speech recognizer based at least on classifying the image data as pertaining to the particular activity comprises:
    - selecting one or more terms associated with a language model; and
      
      adjusting one or more probabilities associated with the language model that correspond to one or more of the selected terms based on the relevance of one or more of the selected terms to the particular activity, wherein the speech recognizer uses the language model comprising the adjusted probabilities to generate the transcription.
  - 10. The system of claim 5, wherein influencing the speech recognizer based at least on classifying the image data as pertaining to the particular activity, comprises:
    - selecting a language model associated with the particular activity, wherein the speech recognizer uses the selected language model to generate the transcription.
  - 11. The system of claim 5, wherein influencing the speech recognizer based at least on classifying the image data as pertaining to the particular activity comprises:
    - selecting a language model associated with the particular activity; and
      
      interpolating the language model associated with the particular activity with a general language model, wherein the speech recognizer uses the interpolated language model to generate the transcription.
  - 12. The system of claim 5, wherein:
    - the audio data encoding the utterance of the user is obtained by a microphone of a wearable computing device; and
      
      the image data is obtained by a camera of the wearable computing device.
  - 22. The system of claim 5, wherein classifying the image data as pertaining to the activity comprises:
    - identifying, without performing an optical character recognition process on the image data, one or more image features associated with the image data; and
      
      classifying the image data as pertaining to the particular activity based at least on the one or more identified image features.
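
Claims 10 and 11 describe selecting an activity-specific language model and interpolating it with a general one. The following is a minimal sketch of linear interpolation, assuming toy unigram models stored as dictionaries; the function name `interpolate_lms`, the `weight` parameter, and the example vocabularies are hypothetical, and the application does not state which interpolation scheme is used.

```python
from typing import Dict


def interpolate_lms(
    general: Dict[str, float],
    activity: Dict[str, float],
    weight: float,
) -> Dict[str, float]:
    """Linearly interpolate two unigram models.

    P(term) = (1 - weight) * P_general(term) + weight * P_activity(term)
    Terms missing from one model are treated as having probability 0.0 there.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be in [0, 1]")
    vocab = set(general) | set(activity)
    return {
        term: (1.0 - weight) * general.get(term, 0.0)
        + weight * activity.get(term, 0.0)
        for term in vocab
    }


# Hypothetical toy models for a user classified as attending a concert.
general_lm = {"play": 0.5, "pay": 0.5}
concert_lm = {"play": 0.6, "encore": 0.4}

mixed = interpolate_lms(general_lm, concert_lm, weight=0.25)
```

Setting `weight` to 1.0 reduces to using the selected activity model alone, as in claim 10; intermediate values keep general vocabulary available while favoring activity-related terms.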

13. A computer readable storage device encoded with a computer program, the program comprising instructions that, if executed by one or more computers, cause the one or more computers to perform operations comprising:
- receiving audio data encoding an utterance of a user;
  
  receiving image data;
  
  classifying the image data as pertaining to a particular activity, based at least on a result of analyzing the image data, wherein the particular activity is unrelated to providing an explicit user input to the one or more computers;
  
  influencing a speech recognizer based at least on classifying the image data as pertaining to the particular activity; and
  
  obtaining a transcription of the user utterance using the influenced speech recognizer.
- Dependent claims (14, 15, 16, 17, 18, 19, 20, 23):
  - 14. The device of claim 13, wherein classifying the image data as pertaining to the particular activity comprises:
    - obtaining a result of performing at least an optical character recognition process on the image data; and
      
      classifying the image data as pertaining to the particular activity based at least on the result.
  - 15. The device of claim 13, wherein classifying the image data as pertaining to the particular activity comprises:
    - obtaining a result of performing a feature recognition process on the image data; and
      
      classifying the image data as pertaining to the particular activity based at least on the result.
  - 16. The device of claim 13, wherein classifying the image data as pertaining to the particular activity comprises:
    - obtaining a result of performing a shape matching process on the image data; and
      
      classifying the image data as pertaining to the particular activity based at least on the result.
  - 17. The device of claim 13, wherein influencing the speech recognizer based at least on classifying the image data as pertaining to the particular activity comprises:
    - selecting one or more terms associated with a language model; and
      
      adjusting one or more probabilities associated with the language model that correspond to one or more of the selected terms based on the relevance of one or more of the selected terms to the particular activity, wherein the speech recognizer uses the language model comprising the adjusted probabilities to generate the transcription.
  - 18. The device of claim 13, wherein influencing the speech recognizer based at least on classifying the image data as pertaining to the particular activity comprises:
    - selecting a language model associated with the particular activity, wherein the speech recognizer uses the selected language model to generate the transcription.
  - 19. The device of claim 13, wherein influencing the speech recognizer based at least on classifying the image data as pertaining to the particular activity comprises:
    - selecting a language model associated with the particular activity; and
      
      interpolating the language model associated with the particular activity with a general language model, wherein the speech recognizer uses the interpolated language model to generate the transcription.
  - 20. The device of claim 13, wherein:
    - the audio data encoding the utterance of the user is obtained by a microphone of a wearable computing device; and
      
      the image data is obtained by a camera of the wearable computing device.
  - 23. The device of claim 13, wherein classifying the image data as pertaining to the particular activity comprises:
    - identifying, without performing an optical character recognition process on the image data, one or more image features associated with the image data; and
      
      classifying the image data as pertaining to the particular activity based at least on the one or more identified image features.

24. (canceled)
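
The classifying step recited in claims 1, 5, and 13 can be sketched as a simple cue-counting classifier over labels already extracted from the image (for example by OCR, feature matching, or shape matching). Everything here is a hypothetical illustration: the `ACTIVITY_CUES` table, the label strings, and the function `classify_activity` are assumptions, and the activity names are taken from claim 25. A real system would likely use a trained classifier rather than a lookup table.

```python
from typing import Iterable, Optional

# Hypothetical mapping from activities (per claim 25) to visual cues.
ACTIVITY_CUES = {
    "driving": {"steering wheel", "dashboard", "road"},
    "running": {"trail", "running shoes", "track"},
    "shopping": {"shelf", "price tag", "cart"},
    "attending a concert": {"stage", "crowd", "spotlight"},
}


def classify_activity(image_features: Iterable[str]) -> Optional[str]:
    """Return the activity whose cues overlap most with the image features.

    Returns None when no cue matches, so the caller can fall back to the
    unadjusted language model.
    """
    observed = set(image_features)
    scores = {
        activity: len(cues & observed)
        for activity, cues in ACTIVITY_CUES.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


activity = classify_activity(["road", "dashboard", "tree"])
unknown = classify_activity(["cloud"])
```

Here `activity` is `"driving"` because two of its cues are present, while `unknown` is `None` because no cue matches.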

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Aleksic, Petar; Lei, Xin
Application Number: US13/923,545
Publication Number: US 20140379346A1
US Class Current: 704/251
CPC Class Codes:
G06V 30/274   Syntactic or semantic conte...
G06V 40/20   Movements or behaviour, e.g...
G10L 15/183   using context dependencies,...
G10L 15/24   Speech recognition using no...
G10L 15/25   using position of the lips,...
