INTEGRATED LOCAL AND CLOUD BASED SPEECH RECOGNITION
Abstract
A system for integrating local speech recognition with cloud-based speech recognition in order to provide an efficient natural user interface is described. In some embodiments, a computing device determines a direction associated with a particular person within an environment and generates an audio recording associated with the direction. The computing device then performs local speech recognition on the audio recording in order to detect a first utterance spoken by the particular person and to detect one or more keywords within the first utterance. The first utterance may be detected by applying voice activity detection techniques to the audio recording. The first utterance and the one or more keywords are subsequently transferred to a server which may identify speech sounds within the first utterance associated with the one or more keywords and adapt one or more speech recognition techniques based on the identified speech sounds.
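The abstract notes that the first utterance may be detected by applying voice activity detection techniques to the audio recording. The text here does not name a specific algorithm, so the following is only a minimal sketch of one common approach, a fixed energy threshold over non-overlapping frames. The function names, the frame length of 160 samples, and the threshold value are assumptions chosen for illustration.

```python
from typing import List, Sequence, Tuple


def frame_energies(samples: Sequence[float], frame_len: int) -> List[float]:
    """Mean squared amplitude of each full, non-overlapping frame."""
    return [
        sum(s * s for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def detect_utterances(samples: Sequence[float],
                      frame_len: int = 160,
                      threshold: float = 0.01) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of runs of frames above the threshold.

    Hypothetical helper: a real detector would typically add smoothing
    and an adaptive noise floor.
    """
    energies = frame_energies(samples, frame_len)
    segments: List[Tuple[int, int]] = []
    start = None
    for idx, energy in enumerate(energies):
        if energy > threshold and start is None:
            start = idx * frame_len            # speech begins at this frame
        elif energy <= threshold and start is not None:
            segments.append((start, idx * frame_len))  # speech ended
            start = None
    if start is not None:                      # recording ended mid-utterance
        segments.append((start, len(energies) * frame_len))
    return segments
```

Each returned segment could then be passed to keyword detection and, together with any detected keywords, forwarded to the server.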
20 Claims
1. A method for performing speech recognition, comprising:
acquiring a plurality of audio signals from a plurality of microphones, each of the plurality of audio signals is associated with a different microphone of the plurality of microphones, the plurality of audio signals is associated with a first environment;
determining one or more directions within the first environment, the first environment includes one or more persons, each of the one or more directions is associated with a different person of the one or more persons;
generating one or more audio recordings based on the plurality of audio signals, a first audio recording of the one or more audio recordings is generated by applying audio signal processing techniques to the plurality of audio signals such that sounds originating from a first direction of the one or more directions are amplified while other sounds originating from one or more other directions are attenuated;
performing local speech recognition on each of the one or more audio recordings, the performing local speech recognition includes detecting a first utterance and detecting one or more keywords within the first utterance, the first utterance is detected by applying one or more speech detection techniques to a first audio recording of the one or more audio recordings;
transmitting the first utterance and the one or more keywords to a second computing device, the second computing device performs a speech recognition technique on the first utterance, the speech recognition technique detects one or more words within the first utterance; and
receiving a first response from the second computing device based on the first utterance.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14
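Claim 1 recites audio signal processing that amplifies sounds from a first direction while attenuating sounds from other directions, without naming a technique. One well-known way to get that behavior is delay-and-sum beamforming. The sketch below assumes a linear microphone array, a far-field source, and whole-sample delays. The function names and parameters are hypothetical and are not taken from the patent.

```python
import math
from typing import List, Sequence


def steering_delays(mic_x_m: Sequence[float], angle_deg: float,
                    sample_rate: int, speed_of_sound: float = 343.0) -> List[int]:
    """Per-microphone arrival delays, in whole samples, for a far-field source.

    mic_x_m: microphone positions along a line, in metres (assumed geometry).
    angle_deg: source direction measured from broadside (0 = straight ahead,
    positive toward +x). The earliest microphone gets delay 0.
    """
    # A plane wave arriving from the +x side reaches larger-x microphones first.
    raw = [-x * math.sin(math.radians(angle_deg)) / speed_of_sound for x in mic_x_m]
    earliest = min(raw)
    return [round((t - earliest) * sample_rate) for t in raw]


def delay_and_sum(signals: Sequence[Sequence[float]],
                  delays: Sequence[int]) -> List[float]:
    """Advance each channel by its delay and average the aligned channels.

    Sound from the steered direction adds coherently; sound from other
    directions is misaligned and partially cancels in the average.
    """
    length = min(len(sig) - d for sig, d in zip(signals, delays))
    return [
        sum(sig[n + d] for sig, d in zip(signals, delays)) / len(signals)
        for n in range(max(length, 0))
    ]
```

Running `delay_and_sum` once per determined direction would yield one audio recording per person, which matches the structure of the generating step.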
15. An electronic device for integrating local and cloud-based speech recognition, comprising:
a capture device including a plurality of microphones, the capture device acquires one or more sounds from the plurality of microphones, the one or more sounds are associated with a first environment; and
one or more processors, the one or more processors determine one or more directions within the first environment, the first environment includes one or more persons, each of the one or more directions is associated with a different person of the one or more persons, the one or more processors generate one or more audio recordings based on the one or more sounds, each of the one or more audio recordings is associated with a different direction of the one or more directions, the one or more processors detect a first utterance within a first audio recording of the one or more audio recordings by applying one or more speech detection techniques to the first audio recording, the one or more processors detect one or more keywords within the first utterance, the one or more processors transmit the first utterance and the one or more keywords to a second computing device, the second computing device performs a speech recognition technique on the first utterance based on the one or more keywords, the speech recognition technique detects one or more words within the first utterance, the one or more processors receive a first response from the second computing device based on the first utterance.
Dependent claims: 16, 17
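Claim 15 has the processors transmit the first utterance together with the detected keywords to a second computing device. The claim does not define a wire format, so the sketch below simply assumes a JSON message carrying base64-encoded PCM audio. The field names (`sample_rate`, `keywords`, `audio_b64`) and both helper functions are hypothetical.

```python
import base64
import json
from typing import List, Tuple


def build_request(utterance_pcm: bytes, keywords: List[str],
                  sample_rate: int = 16000) -> str:
    """Package an utterance and its locally detected keywords as JSON.

    Assumed format for illustration only; not specified by the patent.
    """
    payload = {
        "sample_rate": sample_rate,
        "keywords": keywords,
        # Raw audio bytes are not valid JSON, so encode them as base64 text.
        "audio_b64": base64.b64encode(utterance_pcm).decode("ascii"),
    }
    return json.dumps(payload)


def parse_request(message: str) -> Tuple[bytes, List[str], int]:
    """Server-side inverse of build_request."""
    payload = json.loads(message)
    audio = base64.b64decode(payload["audio_b64"])
    return audio, payload["keywords"], payload["sample_rate"]
```

Sending the keywords alongside the audio lets the receiving side condition its recognition on them, as the claim describes.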
18. One or more storage devices containing processor readable code for programming one or more processors to perform a method for integrating local and cloud-based speech recognition comprising the steps of:
receiving one or more sounds from a plurality of microphones, the one or more sounds are associated with a first environment during a first time period;
receiving one or more depth images of the first environment, the one or more depth images are associated with the first environment during the first time period;
determining one or more locations within the first environment, the first environment includes one or more persons, each of the one or more directions is associated with a different person of the one or more persons, the determining one or more locations includes performing skeletal tracking based on the one or more depth images for each of the one or more persons;
generating one or more audio recordings based on the one or more sounds, each of the one or more audio recordings is associated with a different location of the one or more locations;
performing local speech recognition on each of the one or more audio recordings, the performing local speech recognition includes detecting a first utterance and detecting one or more keywords within the first utterance, the first utterance is detected by applying one or more speech detection techniques to a first audio recording of the one or more audio recordings;
transmitting the first utterance and the one or more keywords to a second computing device, the second computing device performs a speech recognition technique on the first utterance based on the one or more keywords, the speech recognition technique detects one or more words within the first utterance; and
receiving a first response from the second computing device based on the first utterance.
Dependent claims: 19, 20
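Claim 18 determines each person's location by skeletal tracking on depth images and then generates an audio recording per location. One plausible link between the two steps is to convert a tracked head position into a horizontal steering angle for the microphone array. The sketch below assumes the capture device sits at the origin with x to the right, y up, and z pointing into the room. The coordinate convention and the function name are assumptions for illustration.

```python
import math
from typing import Tuple


def azimuth_from_location(head_xyz_m: Tuple[float, float, float]) -> float:
    """Horizontal angle, in degrees, from the sensor's forward axis to a person.

    head_xyz_m: hypothetical tracked head position in metres, in an assumed
    sensor frame (x right, y up, z forward). 0 means straight ahead,
    positive means to the sensor's right.
    """
    x, _y, z = head_xyz_m
    # atan2 handles z == 0 (a person directly to the side) without dividing.
    return math.degrees(math.atan2(x, z))
```

For example, a head tracked at `(1.0, 0.2, 1.0)` lies about 45 degrees to the right of the sensor, and that angle could be used to steer the array toward the speaker.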
Specification