Directed audio for speech recognition
Abstract
Techniques are described for selecting audio from locations that are most likely to be sources of spoken commands or words. Directional audio signals are generated to emphasize sounds from different regions of an environment. The directional audio signals are processed by an automated speech recognizer to generate recognition confidence values corresponding to each of the different regions, and the region resulting in the highest recognition confidence value is selected as the region most likely to contain a user who is speaking commands.
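The selection step in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the `BeamResult` type, the region labels, and the sample confidence values are all hypothetical, and confidence is assumed to be a number in [0, 1].

```python
from dataclasses import dataclass


@dataclass
class BeamResult:
    region: str        # hypothetical label for one region of the environment
    text: str          # text recognized from that region's directional signal
    confidence: float  # recognition confidence, assumed to lie in [0, 1]


def select_region(results):
    """Return the result with the highest confidence, or None if there are none."""
    if not results:
        return None
    return max(results, key=lambda r: r.confidence)


# Made-up recognition results for three regions.
results = [
    BeamResult("north", "turn on the lights", 0.91),
    BeamResult("east", "turn on the light", 0.62),
    BeamResult("south", "", 0.05),
]

best = select_region(results)
print(best.region, best.text)
```

The region whose result is returned is treated as the one most likely to contain the speaking user.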
20 Claims
1. A system comprising:
- a microphone array that produces audio signals in response to capturing audio from an environment;
- an audio beamformer that is responsive to the audio signals to produce a plurality of directionally focused audio signals corresponding respectively to different directions relative to the microphone array;
- a speech recognizer configured to:
  - recognize speech from each of the directionally focused audio signals to create text streams of recognized speech; and
  - generate a confidence value for each of the text streams of the recognized speech, wherein each confidence value indicates an estimated accuracy of a respective text stream of the recognized speech from a respective one of the directional audio signals, the confidence value generated based at least in part on expected speech input for an available command lexicon associated with the system; and
- a selector configured to compare each confidence value and to select at least one of the text streams of the recognized speech from at least one of the directionally focused audio signals based at least in part on the comparing.
Dependent claims: 2, 3, 4, 5
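Claim 1 does not say how the beamformer produces its directionally focused signals. One common technique is delay-and-sum beamforming, sketched below under simplifying assumptions: delays are whole, non-negative sample counts, and the function names, direction labels, and sample data are hypothetical.

```python
def delay_and_sum(channels, delays):
    """Average the microphone channels after advancing channel i by delays[i] samples.

    Sound arriving from the steered direction lines up across channels and
    adds coherently; sound from other directions is attenuated.
    """
    if len(channels) != len(delays):
        raise ValueError("need exactly one delay per channel")
    if any(d < 0 for d in delays):
        raise ValueError("delays must be non-negative sample counts")
    # Only keep output samples for which every shifted channel has data.
    n = min(len(ch) - d for ch, d in zip(channels, delays))
    if n <= 0:
        return []
    return [
        sum(ch[d + t] for ch, d in zip(channels, delays)) / len(channels)
        for t in range(n)
    ]


def beamform_all(channels, steering):
    """Produce one focused signal per direction from a table of per-channel delays."""
    return {name: delay_and_sum(channels, delays) for name, delays in steering.items()}


# Two microphones; a pulse reaches mic 0 at sample 2 and mic 1 at sample 3.
channels = [
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0],
]
# Hypothetical steering table: direction label -> delay per microphone.
steering = {"left": [0, 1], "right": [1, 0]}

beams = beamform_all(channels, steering)
print(max(beams["left"]), max(beams["right"]))  # 1.0 0.5
```

The "left" beam matches the pulse's arrival pattern, so its peak is twice that of the "right" beam. Each focused signal would then be passed to the speech recognizer.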
6. A method, comprising:
- receiving audio signals corresponding respectively to different regions of an environment;
- analyzing the audio signals to recognize speech from each of the audio signals, wherein analyzing the audio signals comprises concurrently generating a plurality of different text streams that each correspond to a different one of the audio signals;
- generating a confidence value for each of the plurality of different text streams corresponding to the audio signals, wherein the confidence value for a particular text stream indicates an estimated accuracy of the recognized speech from said particular text stream, the confidence value generated based at least in part on expected speech input for a predetermined command lexicon; and
- selecting at least one of the plurality of different text streams associated with the recognized speech from at least one of the audio signals based at least in part on a respective confidence value.
Dependent claims: 7, 8, 9, 10, 11, 12
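Claim 6 bases the confidence value at least in part on expected speech input for a predetermined command lexicon, without fixing a formula. The sketch below shows one possible reading: it blends a recognizer's raw score with the text's similarity to the closest lexicon entry. The lexicon contents, the equal weighting, and the use of Python's `difflib.SequenceMatcher` as the similarity measure are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical command lexicon; a real system would define its own.
COMMAND_LEXICON = ["turn on the lights", "turn off the lights", "play music"]


def lexicon_match(text, lexicon):
    """Similarity in [0, 1] between the text and its closest lexicon command."""
    text = text.strip().lower()
    if not text:
        return 0.0
    return max(
        (SequenceMatcher(None, text, cmd).ratio() for cmd in lexicon),
        default=0.0,
    )


def confidence(raw_score, text, lexicon, weight=0.5):
    """Blend the recognizer's raw score with how well the text fits the lexicon.

    `weight` is an illustrative tuning parameter, not a value from the claims.
    """
    return (1 - weight) * raw_score + weight * lexicon_match(text, lexicon)


# Same raw score, but only the first text resembles an expected command.
print(confidence(0.8, "play music", COMMAND_LEXICON))
print(confidence(0.8, "zzz qqq", COMMAND_LEXICON))
```

With equal raw scores, the text stream that looks like an expected command receives the higher confidence, which is the effect the claim language describes.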
13. One or more non-transitory computer-readable media storing computer-executable instructions that, when executed by one or more processors, cause the one or more processors to perform acts comprising:
- receiving a plurality of audio signals that are focused respectively on different regions of an environment;
- analyzing the audio signals to recognize speech from each of the audio signals by concurrently generating a plurality of different text streams that each correspond to a different one of the audio signals;
- generating a confidence value for each of the plurality of different text streams corresponding to the audio signals by:
  - comparing different text streams against each other to determine that a first text stream produces similar or identical results as a second text stream of the different text streams; and
  - increasing a first confidence value associated with the first text stream based at least in part on the first text stream producing similar or identical results as the second text stream, wherein the confidence value for a particular text stream indicates an estimated accuracy of the recognized speech from the particular text stream;
- comparing confidence values generated for each of the plurality of different text streams associated with the different audio signals; and
- selecting one or more of the audio signals based at least in part on the comparing.
Dependent claims: 14, 15, 16, 17, 18, 19, 20
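The agreement step in claim 13 can be sketched as follows: a text stream's confidence is raised when another stream produces similar or identical text, and the audio signal with the highest resulting confidence is selected. The similarity threshold, the boost amount, the signal identifiers, and the use of `difflib.SequenceMatcher` are illustrative assumptions; the claim does not specify them.

```python
from difflib import SequenceMatcher


def boost_by_agreement(streams, threshold=0.9, boost=0.1):
    """Raise a stream's confidence when another stream yields similar text.

    `streams` maps a signal id to a (text, confidence) pair. The threshold
    and boost values are hypothetical tuning parameters.
    """
    boosted = {}
    for sid, (text, conf) in streams.items():
        agrees = any(
            other != sid
            and text
            and SequenceMatcher(None, text, other_text).ratio() >= threshold
            for other, (other_text, _) in streams.items()
        )
        boosted[sid] = min(1.0, conf + boost) if agrees else conf
    return boosted


def select_signal(streams, **kwargs):
    """Return the id of the signal with the highest boosted confidence."""
    boosted = boost_by_agreement(streams, **kwargs)
    if not boosted:
        return None
    return max(boosted, key=boosted.get)


# Made-up recognition results for three focused signals.
streams = {
    "beam0": ("play music", 0.70),
    "beam1": ("play music", 0.65),
    "beam2": ("pay music now", 0.72),
}
boosted = boost_by_agreement(streams)
print(select_signal(streams))  # beam0
```

In this example "beam2" has the highest raw confidence, but "beam0" and "beam1" agree with each other, so "beam0" is selected after the boost.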
Specification