Sensor enhanced speech recognition

US 10,083,350 B2
Filed: 01/11/2018
Issued: 09/25/2018
Est. Priority Date: 06/11/2014
Status: Active Grant

Abstract

A system for sensor enhanced speech recognition is disclosed. The system may obtain visual content or other content associated with a user and an environment of the user. Additionally, the system may obtain, from the visual content, metadata associated with the user and the environment of the user. The system may also determine, based on the visual content and metadata, whether the user is speaking. If the user is determined to be speaking, the system may obtain audio content associated with the user and the environment. The system may then adapt, based on the visual content, audio content, and metadata, one or more acoustic models that match the user and the environment. Once the one or more acoustic models are adapted and loaded, the system may enhance a speech recognition process or other process associated with the user.
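
As a rough illustration of the flow described in the abstract, the following sketch gates audio capture on a visually derived speaking flag and then picks an acoustic model for the environment. Everything named here (`VisualMetadata`, `ACOUSTIC_MODELS`, `run_pipeline`, the environment labels, and the model names) is hypothetical and chosen for illustration; the patent does not specify data structures or model names.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class VisualMetadata:
    """Hypothetical metadata derived from visual content."""
    user_is_speaking: bool  # assumed to come from e.g. lip-movement analysis
    environment: str        # assumed label such as "car" or "office"


# Hypothetical mapping from environment label to acoustic model name.
ACOUSTIC_MODELS = {"car": "am_car", "office": "am_office"}
DEFAULT_MODEL = "am_generic"


def adapt_acoustic_model(meta: VisualMetadata) -> str:
    """Pick the model matching the environment, falling back to a default."""
    return ACOUSTIC_MODELS.get(meta.environment, DEFAULT_MODEL)


def run_pipeline(
    meta: VisualMetadata, capture_audio: Callable[[], bytes]
) -> Optional[Tuple[str, bytes]]:
    """Capture audio only if the user appears to be speaking."""
    if not meta.user_is_speaking:
        return None  # audio is never requested for a silent user
    audio = capture_audio()
    return adapt_acoustic_model(meta), audio
```

Passing the capture function as a callable means no audio is requested when the visual cue indicates that nobody is speaking, which mirrors the conditional audio step in the abstract.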

21 Claims

1. A system, comprising:
- a memory that stores instructions;
  
  a processor that executes the instructions to perform operations, the operations comprising:
  
  obtaining, from visual content, metadata associated with a user and an environment of the user;
  
  identifying, based on the visual content and the metadata, an interferer and a location of the interferer in the environment;
  
  obtaining audio content associated with the user and the environment;
  
  enhancing a speech recognition process utilized for processing speech of the user that is within the audio content;
  
  cancelling, after identifying the interferer and the location of the interferer in the environment and by utilizing an audio profile corresponding to the interferer, noise generated by the interferer that interferes with the speech of the user; and
  
  adjusting, based on a user profile of the user and a location profile corresponding to a location of the user, a feature of an application executing the speech recognition process so as to tailor the application to the user, wherein adjusting the feature of the application comprises adjusting at least an audio feature of the application based on the user profile and the location profile.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13):
  - 2. The system of claim 1, wherein the operations further comprise obtaining the visual content associated with the user and the environment of the user.
  - 3. The system of claim 1, wherein the operations further comprise determining, based on the visual content or the metadata, if the user is speaking.
  - 4. The system of claim 1, wherein the operations further comprise determining, based on the visual content and the metadata, a type of a first device being utilized in the environment.
  - 5. The system of claim 4, wherein the operations further comprise adapting an acoustic model corresponding to the user and the environment based on the visual content, the audio content, the metadata, and the type of the first device being utilized in the environment.
  - 6. The system of claim 5, wherein the operations further comprise enhancing, by utilizing the acoustic model, the speech recognition process utilized for processing the speech of the user that is within the audio content.
  - 7. The system of claim 1, wherein the operations further comprise determining, based on the visual content and the metadata, a distance between a first device being utilized in the environment and the user.
  - 8. The system of claim 7, wherein the operations further comprise adjusting, based on the distance between the first device and the user, a ratio of direct to indirect audio signals occurring in the environment and a level of reverberation associated with the audio content in the environment.
  - 9. The system of claim 1, wherein the operations further comprise adapting, based on the user profile of the user, a user interface of the application executing the speech recognition process utilized for processing the speech of the user, wherein the user interface is adapted based on the visual content, the audio content, the metadata, or a combination thereof.
  - 10. The system of claim 9, wherein the operations further comprise adapting the user interface of the application by enhancing a visual aspect of the user interface itself to reflect a characteristic of the user, the location of the user, an interest of the user, or a combination thereof.
  - 11. The system of claim 1, wherein the operations further comprise determining an orientation of a face of the user with respect to a device utilized by the user, and wherein the operations further comprise adapting an acoustic model based on the orientation of the face of the user with respect to the device.
  - 12. The system of claim 1, wherein the operations further comprise creating the user profile of the user based on the metadata.
  - 13. The system of claim 1, wherein the operations further comprise not obtaining the audio content associated with the user and the environment if the user is determined to not be speaking.
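
Claims 7 and 8 tie the ratio of direct to indirect audio signals to the distance between the device and the user. One textbook way to relate the two, offered here only as a sketch and not as the patent's method, assumes that direct energy falls with the square of distance while reverberant energy stays roughly constant across the room. Under that assumption the ratio in decibels is 20 * log10(d_c / d), where the critical distance d_c (the distance at which direct and reverberant energy are equal) is a room-dependent parameter assumed to be known. The function name and its parameters are hypothetical.

```python
import math


def direct_to_reverberant_ratio_db(
    distance_m: float, critical_distance_m: float
) -> float:
    """Estimate the direct-to-reverberant ratio (dB) at a given distance.

    Assumes inverse-square decay of direct sound and a constant reverberant
    field; critical_distance_m is an assumed, room-specific input.
    """
    if distance_m <= 0 or critical_distance_m <= 0:
        raise ValueError("distances must be positive")
    return 20.0 * math.log10(critical_distance_m / distance_m)
```

In this model a user at the critical distance yields 0 dB, and each doubling of distance lowers the ratio by about 6 dB, which suggests why a distance estimate from visual content could inform how much reverberation to expect in the audio.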

14. A method, comprising:
- extracting, from visual content, metadata associated with a user and an environment of the user;
  
  detecting, based on the visual content and the metadata, an interferer and a location of the interferer in the environment;
  
  obtaining audio content associated with the user and the environment;
  
  enhancing a speech recognition process utilized for processing speech of the user that is within the audio content, wherein the enhancing is performed by utilizing instructions from a memory that are executed by a processor;
  
  cancelling, after identifying the interferer and the location of the interferer in the environment and by utilizing an audio profile corresponding to the interferer, noise generated by the interferer that interferes with the speech of the user; and
  
  modifying, based on a user profile of the user and a location profile corresponding to a location of the user, a feature of an application executing the speech recognition process so as to tailor the application to the user, wherein modifying the feature of the application comprises modifying at least an audio feature of the application based on the user profile and the location profile.
- Dependent claims (15, 16, 17, 18, 19, 20):
  - 15. The method of claim 14, further comprising determining, based on the visual content and metadata, a language being spoken by the user, and wherein the operations further comprise adapting an acoustic model utilized to enhance the speech recognition process based on the language being spoken by the user.
  - 16. The method of claim 14, further comprising determining a velocity of the user, and wherein the operations further comprise adapting an acoustic model utilized to enhance the speech recognition process based on the velocity of the user.
  - 17. The method of claim 14, further comprising determining the location of the user based on the metadata.
  - 18. The method of claim 17, further comprising adapting an acoustic model utilized to enhance the speech recognition process based on the location of the user.
  - 19. The method of claim 14, further comprising determining an age of the user based on the visual content and the metadata, and further comprising adapting an acoustic model utilized to enhance the speech recognition process based on the age.
  - 20. The method of claim 14, further comprising creating the user profile of the user based on the metadata, and further comprising adapting an acoustic model utilized to enhance the speech recognition process based on the user profile.
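
The final step of claim 14 modifies an audio feature of the application using both a user profile and a location profile. A minimal sketch of one such adjustment follows, assuming the audio feature is playback volume. The profile fields (`preferred_volume`, `ambient_noise_db`, `max_volume`), the 50 dB quiet threshold, and the boost rule are all invented for illustration and are not taken from the patent.

```python
from dataclasses import dataclass


@dataclass
class UserProfile:
    preferred_volume: int  # hypothetical field, 0-100 scale


@dataclass
class LocationProfile:
    ambient_noise_db: float  # hypothetical typical noise level at the location
    max_volume: int          # hypothetical cap, e.g. lower in a library


QUIET_THRESHOLD_DB = 50.0  # assumed level below which no boost is applied
BOOST_PER_DB = 1           # assumed volume points added per dB above threshold


def adjust_volume(user: UserProfile, location: LocationProfile) -> int:
    """Combine both profiles into one volume setting."""
    excess_db = max(0.0, location.ambient_noise_db - QUIET_THRESHOLD_DB)
    boosted = user.preferred_volume + round(excess_db * BOOST_PER_DB)
    # Clamp to the range allowed at this location.
    return max(0, min(boosted, location.max_volume))
```

The user profile supplies the starting point and the location profile supplies both the correction and the limit, so neither profile alone determines the result.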

21. A non-transitory computer-readable device comprising instructions, which when executed by a processor, cause the processor to perform operations comprising:
- extracting, from visual content, metadata associated with a user and an environment of the user;
  
  identifying, based on the visual content and the metadata, an interferer and a location of the interferer in the environment;
  
  capturing audio content associated with the user and the environment;
  
  enhancing a speech recognition process utilized for processing speech of the user that is within the audio content;
  
  cancelling, after identifying the interferer and the location of the interferer in the environment and by utilizing an audio profile corresponding to the interferer, noise generated by the interferer that interferes with the speech of the user; and
  
  modifying, based on a user profile of the user and a location profile corresponding to a location of the user, a feature of an application executing the speech recognition process so as to tailor the application to the user, wherein modifying the feature of the application comprises modifying at least an audio feature of the application based on the user profile and the location profile.
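
The cancelling step in the independent claims relies on an audio profile corresponding to the identified interferer. The claim text does not say how the profile is applied; the sketch below substitutes magnitude spectral subtraction, a standard noise-reduction technique, operating on per-band magnitudes. The profile table, interferer labels, band values, and function name are hypothetical.

```python
from typing import Dict, List, Sequence

# Hypothetical per-band noise magnitudes for known interferer types.
AUDIO_PROFILES: Dict[str, List[float]] = {
    "television": [1.0, 3.5, 0.5],
    "fan": [2.0, 0.5, 0.25],
}


def cancel_interferer(
    noisy_bands: Sequence[float], interferer: str, floor: float = 0.0
) -> List[float]:
    """Subtract the interferer's profile from each band, clamped at a floor."""
    profile = AUDIO_PROFILES[interferer]  # KeyError if the type is unknown
    if len(profile) != len(noisy_bands):
        raise ValueError("band count mismatch between signal and profile")
    # Clamping avoids negative magnitudes where the profile exceeds the signal.
    return [max(n - p, floor) for n, p in zip(noisy_bands, profile)]
```

For example, `cancel_interferer([5.0, 3.0, 1.0], "television")` returns `[4.0, 0.0, 0.5]`: the middle band is clamped to the floor because the profile exceeds the observed magnitude there.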

Current Assignee: AT&T Intellectual Property I LP (AT&T, Inc.)
Original Assignee: AT&T Intellectual Property I LP (AT&T, Inc.)
Inventors: Dimitriadis, Dimitrios; Bowen, Donald J.; Gilbert, Mazin E.; Schroeter, Horst J.
Primary Examiner(s): Leland, III, Edwin S

Application Number: US15/868,546
Publication Number: US 20180137348A1
Time in Patent Office: 257 Days
Field of Search: 704231
CPC Class Codes:
- G06V 20/10   Terrestrial scenes scenes u...
- G06V 40/20   Movements or behaviour, e.g...
- G10L 15/065   Adaptation
- G10L 2015/227   of the speaker; Human-fact...
- G10L 2015/228   of application context
