SENSOR ENHANCED SPEECH RECOGNITION

US 20150364139A1
Filed: 06/11/2014
Published: 12/17/2015
Est. Priority Date: 06/11/2014
Status: Active Grant

Abstract

A system for sensor enhanced speech recognition is disclosed. The system may obtain visual content or other content associated with a user and an environment of the user. Additionally, the system may obtain, from the visual content, metadata associated with the user and the environment of the user. The system may also determine, based on the visual content and metadata, if the user is speaking. If the user is determined to be speaking, the system may obtain audio content associated with the user and the environment. The system may then adapt, based on the visual content, audio content, and metadata, one or more acoustic models that match the user and the environment. Once the one or more acoustic models are adapted and loaded, the system may enhance a speech recognition process or other process associated with the user.
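
The order of operations in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the `Metadata` class, the `lips_moving` flag used as a stand-in for the speaking decision, and the four callables passed to `run_pipeline` are all hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Metadata:
    """Hypothetical container for metadata derived from visual content."""
    lips_moving: bool                      # assumed stand-in for "is the user speaking?"
    attributes: dict = field(default_factory=dict)


def run_pipeline(
    frame: bytes,
    extract_metadata: Callable[[bytes], Metadata],
    capture_audio: Callable[[], bytes],
    adapt_model: Callable[[bytes, bytes, Metadata], str],
    recognize: Callable[[bytes, str], str],
) -> Optional[str]:
    """Run the steps in the order the abstract lists them."""
    # Steps 1-2: visual content is given; derive metadata from it.
    metadata = extract_metadata(frame)

    # Step 3: decide from the visual content and metadata whether the user speaks.
    if not metadata.lips_moving:
        return None                        # no audio is captured in this branch

    # Step 4: audio is obtained only once the user is determined to be speaking.
    audio = capture_audio()

    # Step 5: adapt an acoustic model from visual content, audio, and metadata.
    model = adapt_model(frame, audio, metadata)

    # Step 6: use the adapted model in the recognition step.
    return recognize(audio, model)
```

In this sketch the audio capture callable is never invoked when the speaking check fails, which mirrors the conditional wording "if the user is determined to be speaking".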

20 Claims

1. A system, comprising:
- a memory that stores instructions;
  
  a processor that executes the instructions to perform operations, the operations comprising:
  
  obtaining visual content associated with a user and an environment of the user;
  
  obtaining, from the visual content, metadata associated with the user and the environment of the user;
  
  determining, based on the visual content and metadata, if the user is speaking;
  
  obtaining, if the user is determined to be speaking, audio content associated with the user and the environment;
  
  adapting, based on the visual content, audio content, and metadata, an acoustic model corresponding to the user and the environment; and
  
  enhancing, by utilizing the acoustic model, a speech recognition process utilized for processing speech of the user.
  - 2. The system of claim 1, wherein the operations further comprise determining a location of the user based on the metadata.
  - 3. The system of claim 2, wherein the operations further comprise adapting the acoustic model based on the location of the user.
  - 4. The system of claim 1, wherein the operations further comprise determining a distance between the user and a device utilized by the user, and wherein the operations further comprise adapting the acoustic model based on the distance determined between the user and the device utilized by the user.
  - 5. The system of claim 1, wherein the operations further comprise determining an orientation of a face of the user with respect to a device utilized by the user, and wherein the operations further comprise adapting the acoustic model based on the orientation of the face of the user with respect to the device.
  - 6. The system of claim 1, wherein the operations further comprise determining, based on the visual content and metadata, a language being spoken by the user, and wherein the operations further comprise adapting the acoustic model based on the language being spoken by the user.
  - 7. The system of claim 1, wherein the operations further comprise determining a velocity of the user based on a global positioning sensor, and wherein the operations further comprise adapting the acoustic model based on the velocity of the user.
  - 8. The system of claim 1, wherein the operations further comprise determining a gender of the user based on the visual content and metadata, and wherein the operations further comprise adapting the acoustic model based on the gender.
  - 9. The system of claim 1, wherein the operations further comprise creating a user profile of the user based on the metadata, and wherein the operations further comprise adapting the acoustic model based on the user profile.
  - 10. The system of claim 1, wherein the operations further comprise not obtaining the audio content associated with the user and the environment if the user is determined to not be speaking.
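
Dependent claims 2 through 7 each tie the adaptation of the acoustic model to one determined attribute (location, distance, face orientation, language, velocity). A minimal sketch of collecting those attributes into adaptation hints follows. The `UserContext` class, the `adaptation_hints` function, the hint names, and every numeric threshold are assumptions made for illustration; the claims do not specify any of them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class UserContext:
    """Hypothetical record of attributes the dependent claims mention."""
    location: Optional[str] = None        # claims 2-3
    distance_m: Optional[float] = None    # claim 4: user-to-device distance
    face_yaw_deg: Optional[float] = None  # claim 5: face orientation vs. device
    language: Optional[str] = None        # claim 6
    velocity_mps: Optional[float] = None  # claim 7


def adaptation_hints(ctx: UserContext) -> dict:
    """Map whichever attributes were determined to model-adaptation hints.

    Thresholds below are arbitrary illustrative values, not from the patent.
    """
    hints = {}
    if ctx.location is not None:
        hints["environment"] = ctx.location
    if ctx.distance_m is not None:
        hints["far_field"] = ctx.distance_m > 1.0
    if ctx.face_yaw_deg is not None:
        hints["off_axis"] = abs(ctx.face_yaw_deg) > 30.0
    if ctx.language is not None:
        hints["language"] = ctx.language
    if ctx.velocity_mps is not None:
        hints["in_motion"] = ctx.velocity_mps > 2.0
    return hints
```

Attributes that were not determined are simply left out of the result, so a caller could adapt on any subset of them, in the way each dependent claim adds one attribute to claim 1.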

11. A method, comprising:
- obtaining visual content associated with a user and an environment of the user;
  
  obtaining, from the visual content, metadata associated with the user and the environment of the user;
  
  determining, based on the visual content and metadata, if the user is speaking;
  
  obtaining, if the user is determined to be speaking, audio content associated with the user and the environment;
  
  adapting, based on the visual content, audio content, and metadata, an acoustic model corresponding to the user and the environment; and
  
  enhancing, by utilizing the acoustic model and by utilizing instructions from memory that are executed by a processor, a speech recognition process utilized for processing speech of the user.
  - 12. The method of claim 11, further comprising determining a location of the user based on the metadata.
  - 13. The method of claim 12, further comprising adapting the acoustic model based on the location of the user.
  - 14. The method of claim 11, further comprising determining a distance between the user and a device utilized by the user, and further comprising adapting the acoustic model based on the distance determined between the user and the device utilized by the user.
  - 15. The method of claim 11, further comprising determining an orientation of a face of the user with respect to a device utilized by the user, and further comprising adapting the acoustic model based on the orientation of the face of the user with respect to the device.
  - 16. The method of claim 11, further comprising determining, based on the visual content and metadata, a language being spoken by the user, and further comprising adapting the acoustic model based on the language being spoken by the user.
  - 17. The method of claim 11, further comprising determining a velocity of the user based on a global positioning sensor, and further comprising adapting the acoustic model based on the velocity of the user.
  - 18. The method of claim 11, further comprising determining an age of the user based on the visual content and metadata, and further comprising adapting the acoustic model based on the age.
  - 19. The method of claim 11, further comprising creating a user profile of the user based on the metadata, and further comprising adapting the acoustic model based on the user profile.
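
Claims 7 and 17 recite determining a velocity of the user based on a global positioning sensor, without saying how. One common way to estimate speed from two timestamped position fixes is the haversine great-circle distance divided by the elapsed time. The sketch below assumes that approach; the function names, the `(time_s, lat_deg, lon_deg)` tuple layout, and the spherical-Earth radius are illustrative choices, not taken from the patent.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius; spherical approximation


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def speed_mps(fix_a: tuple, fix_b: tuple) -> float:
    """Average speed in m/s between two fixes of the form (time_s, lat_deg, lon_deg)."""
    t1, lat1, lon1 = fix_a
    t2, lat2, lon2 = fix_b
    dt = t2 - t1
    if dt <= 0:
        raise ValueError("second fix must be later than the first")
    return haversine_m(lat1, lon1, lat2, lon2) / dt
```

A speed estimated this way could then feed whatever velocity-dependent adaptation the system applies, for example choosing a model suited to a moving vehicle.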

20. A computer-readable device comprising instructions, which when executed by a processor, cause the processor to perform operations comprising:
- obtaining visual content associated with a user and an environment of the user;
  
  obtaining, from the visual content, metadata associated with the user and the environment of the user;
  
  determining, based on the visual content and metadata, if the user is speaking;
  
  obtaining, if the user is determined to be speaking, audio content associated with the user and the environment;
  
  adapting, based on the visual content, audio content, and metadata, an acoustic model corresponding to the user and the environment; and
  
  enhancing, by utilizing the acoustic model, a speech recognition process utilized for processing speech of the user.

Current Assignee: AT&T Intellectual Property I LP (AT&T, Inc.)
Original Assignee: AT&T Intellectual Property I LP (AT&T, Inc.)
Inventors: Dimitriadis, Dimitrios; Bowen, Donald J.; Gilbert, Mazin E.; Schroeter, Horst J.
Granted Patent: US 9,870,500 B2
US Class Current: 1/1
CPC Class Codes

G06V 20/10   Terrestrial scenes scenes u...

G06V 40/20   Movements or behaviour, e.g...

G10L 15/065   Adaptation

G10L 2015/227   of the speaker; Human-fact...

G10L 2015/228   of application context
