SENSOR ENHANCED SPEECH RECOGNITION
First Claim
1. A system, comprising:
a memory that stores instructions;
a processor that executes the instructions to perform operations, the operations comprising:
obtaining visual content associated with a user and an environment of the user;
obtaining, from the visual content, metadata associated with the user and the environment of the user;
determining, based on the visual content and metadata, if the user is speaking;
obtaining, if the user is determined to be speaking, audio content associated with the user and the environment;
adapting, based on the visual content, audio content, and metadata, an acoustic model corresponding to the user and the environment; and
enhancing, by utilizing the acoustic model, a speech recognition process utilized for processing speech of the user.
Abstract
A system for sensor enhanced speech recognition is disclosed. The system may obtain visual content or other content associated with a user and an environment of the user. Additionally, the system may obtain, from the visual content, metadata associated with the user and the environment of the user. The system may also determine, based on the visual content and metadata, whether the user is speaking. If the user is determined to be speaking, the system may obtain audio content associated with the user and the environment. The system may then adapt, based on the visual content, audio content, and metadata, one or more acoustic models that match the user and the environment. Once the one or more acoustic models are adapted and loaded, the system may enhance a speech recognition process or other process associated with the user.
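The order of operations in the abstract can be sketched as a small control flow. This is only an illustrative sketch and not the patented implementation: the dictionary layout of the visual content, the `mouth_moving` cue, the `AcousticModel` class, and every function name below are hypothetical placeholders. The one point taken from the text is the gating: audio is obtained only after the visual content and metadata indicate that the user is speaking, and the acoustic model is keyed to both the user and the environment.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass
class Metadata:
    user_id: str        # who is in view (hypothetical field)
    environment: str    # e.g. "car" or "office" (hypothetical labels)
    mouth_moving: bool  # hypothetical visual cue for speech activity


@dataclass
class AcousticModel:
    user_id: str
    environment: str
    adaptation_count: int = 0  # stands in for real model parameters


ModelStore = Dict[Tuple[str, str], AcousticModel]


def extract_metadata(visual: dict) -> Metadata:
    # Placeholder: a real system would run detection on camera frames.
    return Metadata(visual["user_id"], visual["environment"], visual["mouth_moving"])


def is_speaking(visual: dict, meta: Metadata) -> bool:
    # Placeholder decision based only on the assumed visual cue.
    return meta.mouth_moving


def adapt_model(models: ModelStore, visual: dict, audio: bytes, meta: Metadata) -> AcousticModel:
    # Select (or create) the model for this user and environment, then
    # record one adaptation step in place of real parameter updates.
    key = (meta.user_id, meta.environment)
    model = models.setdefault(key, AcousticModel(meta.user_id, meta.environment))
    model.adaptation_count += 1
    return model


def recognize(audio: bytes, model: AcousticModel) -> str:
    # Placeholder recognizer: reports which adapted model was used.
    return f"transcript[{model.user_id}/{model.environment}]"


def process(visual: dict, capture_audio: Callable[[], bytes], models: ModelStore) -> Optional[str]:
    meta = extract_metadata(visual)
    if not is_speaking(visual, meta):
        return None                  # no audio is obtained when nobody is speaking
    audio = capture_audio()          # audio is obtained only after the visual gate
    model = adapt_model(models, visual, audio, meta)
    return recognize(audio, model)
```

Passing `capture_audio` as a callable keeps the sketch faithful to the conditional wording of the claims: the audio source is never touched unless the speaking check succeeds.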
20 Claims
1. A system, comprising:
a memory that stores instructions;
a processor that executes the instructions to perform operations, the operations comprising:
obtaining visual content associated with a user and an environment of the user;
obtaining, from the visual content, metadata associated with the user and the environment of the user;
determining, based on the visual content and metadata, if the user is speaking;
obtaining, if the user is determined to be speaking, audio content associated with the user and the environment;
adapting, based on the visual content, audio content, and metadata, an acoustic model corresponding to the user and the environment; and
enhancing, by utilizing the acoustic model, a speech recognition process utilized for processing speech of the user.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
11. A method, comprising:
obtaining visual content associated with a user and an environment of the user;
obtaining, from the visual content, metadata associated with the user and the environment of the user;
determining, based on the visual content and metadata, if the user is speaking;
obtaining, if the user is determined to be speaking, audio content associated with the user and the environment;
adapting, based on the visual content, audio content, and metadata, an acoustic model corresponding to the user and the environment; and
enhancing, by utilizing the acoustic model and by utilizing instructions from memory that are executed by a processor, a speech recognition process utilized for processing speech of the user.
Dependent claims: 12, 13, 14, 15, 16, 17, 18, 19
20. A computer-readable device comprising instructions, which when executed by a processor, cause the processor to perform operations comprising:
obtaining visual content associated with a user and an environment of the user;
obtaining, from the visual content, metadata associated with the user and the environment of the user;
determining, based on the visual content and metadata, if the user is speaking;
obtaining, if the user is determined to be speaking, audio content associated with the user and the environment;
adapting, based on the visual content, audio content, and metadata, an acoustic model corresponding to the user and the environment; and
enhancing, by utilizing the acoustic model, a speech recognition process utilized for processing speech of the user.
Specification