Directed audio for speech recognition

US 9,076,450 B1
Filed: 09/21/2012
Issued: 07/07/2015
Est. Priority Date: 09/21/2012
Status: Active Grant

Abstract

Techniques are described for selecting audio from locations that are most likely to be sources of spoken commands or words. Directional audio signals are generated to emphasize sounds from different regions of an environment. The directional audio signals are processed by an automated speech recognizer to generate recognition confidence values corresponding to each of the different regions, and the region resulting in the highest recognition confidence value is selected as the region most likely to contain a user who is speaking commands.
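The selection step described in the abstract can be sketched as follows. This is a minimal illustration rather than the patented implementation: the `RegionResult` type, the region names, and the confidence numbers are hypothetical, and the recognizer is assumed to have already produced one result per region.

```python
from dataclasses import dataclass


@dataclass
class RegionResult:
    """Hypothetical container for one region's recognition output."""
    region: str        # label for the region the directional signal emphasizes
    text: str          # recognized text for that region
    confidence: float  # recognition confidence, assumed to lie in [0, 1]


def select_region(results):
    """Return the result with the highest recognition confidence."""
    if not results:
        raise ValueError("no recognition results to choose from")
    return max(results, key=lambda r: r.confidence)


# Made-up values, for illustration only.
results = [
    RegionResult("couch", "play music", 0.91),
    RegionResult("kitchen", "hey using", 0.34),
    RegionResult("doorway", "", 0.05),
]
best = select_region(results)
print(best.region, best.text)
```

With these made-up values the "couch" region is chosen, since it carries the highest confidence.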

212 Citations

20 Claims

1. A system comprising:
- a microphone array that produces audio signals in response to capturing audio from an environment;
  
  an audio beamformer that is responsive to the audio signals to produce a plurality of directionally focused audio signals corresponding respectively to different directions relative to the microphone array;
  
  a speech recognizer configured to:
  
  recognize speech from each of the directionally focused audio signals to create text streams of recognized speech; and
  
  generate a confidence value for each of the text streams of the recognized speech, wherein each confidence value indicates an estimated accuracy of a respective text stream of the recognized speech from a respective one of the directional audio signals, the confidence value generated based at least in part on expected speech input for an available command lexicon associated with the system; and
  
  a selector configured to compare each confidence value and to select at least one of the text streams of the recognized speech from at least one of the directionally focused audio signals based at least in part on the comparing.
- Dependent claims (2, 3, 4, 5):
  - 2. The system of claim 1, wherein the directionally focused audio signals correspond to different regions of the environment.
  - 3. The system of claim 1, wherein the directionally focused audio signals correspond to different regions of the environment, and wherein the different regions are selected based on predicted locations of human activity within the environment.
  - 4. The system of claim 1, wherein the speech recognizer is configured to concurrently recognize different speech from the directionally focused audio signals.
  - 5. The system of claim 1, further comprising one or more instances of the audio beamformer configured to concurrently produce the plurality of directionally focused audio signals.
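
Claim 1 recites a confidence value generated "based at least in part on expected speech input for an available command lexicon" and a selector that compares those values. The claims do not say how the score is computed. One possible reading, shown here purely as an assumption, scores each text stream by its string similarity to the closest command using Python's `difflib.SequenceMatcher`. The lexicon contents and the function names `lexicon_confidence` and `select_stream` are hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical command lexicon; the claims do not list actual commands.
COMMAND_LEXICON = ["play music", "stop", "volume up", "volume down"]


def lexicon_confidence(text, lexicon=COMMAND_LEXICON):
    """Score a text stream by its similarity to the closest expected command.

    Returns a value in [0, 1]; 1.0 means the text matches a command exactly.
    """
    normalized = " ".join(text.lower().split())
    if not normalized:
        return 0.0
    return max(SequenceMatcher(None, normalized, cmd).ratio() for cmd in lexicon)


def select_stream(text_streams):
    """Compare the confidence values and return (index, text, confidence) of the best."""
    scored = [(i, t, lexicon_confidence(t)) for i, t in enumerate(text_streams)]
    return max(scored, key=lambda item: item[2])


# One text stream per beam direction (made-up recognizer output).
streams = ["lay mu sick", "play music", "the weather"]
print(select_stream(streams))
```

A practical recognizer would likely combine a lexicon match like this with acoustic and language-model scores; the sketch isolates only the lexicon-based part.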

6. A method, comprising:
- receiving audio signals corresponding respectively to different regions of an environment;
  
  analyzing the audio signals to recognize speech from each of the audio signals, wherein analyzing the audio signals comprises concurrently generating a plurality of different text streams that each correspond to a different one of the audio signals;
  
  generating a confidence value for each of the plurality of different text streams corresponding to the audio signals, wherein the confidence value for a particular text stream indicates an estimated accuracy of the recognized speech from said particular text stream, the confidence value generated based at least in part on expected speech input for a predetermined command lexicon; and
  
  selecting at least one of the plurality of different text streams associated with the recognized speech from at least one of the audio signals based at least in part on a respective confidence value.
- Dependent claims (7, 8, 9, 10, 11, 12):
  - 7. The method of claim 6, further comprising:
    - receiving microphone signals from an array of spaced microphones; and
      
      processing the received microphone signals to produce the audio signals.
  - 8. The method of claim 6, further comprising:
    - receiving microphone signals from an array of spaced microphones; and
      
      beamforming the received microphone signals to produce the audio signals, wherein the beamforming emphasizes audio from a particular direction relative to audio from other directions.
  - 9. The method of claim 6, wherein the selecting comprises selecting the at least one of the plurality of different text streams associated with the recognized speech from the audio signal having a highest confidence value.
  - 10. The method of claim 6, wherein analyzing the audio signals comprises concurrently recognizing speech from each of the audio signals.
  - 11. The method of claim 6, wherein receiving the audio signals comprises concurrently receiving the audio signals.
  - 12. The method of claim 6, further comprising generating the received audio signals by a plurality of microphones that capture spoken words of different people to generate the received audio signals.
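
Claim 8 recites beamforming that "emphasizes audio from a particular direction relative to audio from other directions" but does not name an algorithm. Delay-and-sum is one common technique and is used here only as an assumed example. The sketch uses integer sample delays and plain Python lists; the function name `delay_and_sum` and the test signals are hypothetical.

```python
def delay_and_sum(mic_signals, delays):
    """Delay-and-sum beamformer with integer sample delays.

    mic_signals: list of equal-length sample lists, one per microphone.
    delays: delays[m] is how many samples later the target sound arrives
            at microphone m than at the earliest microphone.
    Returns the time-aligned average of the microphone signals.
    """
    if len(mic_signals) != len(delays):
        raise ValueError("need one delay per microphone")
    n = len(mic_signals[0])
    out = []
    for t in range(n):
        total = 0.0
        for signal, d in zip(mic_signals, delays):
            idx = t + d  # advance later-arriving channels to line them up
            if 0 <= idx < n:
                total += signal[idx]
        out.append(total / len(mic_signals))
    return out


# A unit pulse reaches mic 0 at sample 2, mic 1 at sample 3, and mic 2 at sample 4.
mics = [
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1, 0],
]
steered = delay_and_sum(mics, [0, 1, 2])    # steered toward the source
unsteered = delay_and_sum(mics, [0, 0, 0])  # no steering
print(max(steered), max(unsteered))
```

When the delays match the source direction, the pulses add coherently and the peak is three times larger than in the unsteered average. Running the same function with several delay sets would yield several directional signals, one per direction.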

13. One or more non-transitory computer-readable media storing computer-executable instructions that, when executed by one or more processors, cause the one or more processors to perform acts comprising:
- receiving a plurality of audio signals that are focused respectively on different regions of an environment;
  
  analyzing the audio signals to recognize speech from each of the audio signals by concurrently generating a plurality of different text streams that each correspond to a different one of the audio signals;
  
  generating a confidence value for each of the plurality of different text streams corresponding to the audio signals by:
  
  comparing different text streams against each other to determine that a first text stream produces similar or identical results as a second text stream of the different text streams; and
  
  increasing a first confidence value associated with the first text stream based at least in part on the first text stream producing similar or identical results as the second text stream, wherein the confidence value for a particular text stream indicates an estimated accuracy of the recognized speech from the particular text stream;
  
  comparing confidence values generated for each of the plurality of different text streams associated with the different audio signals; and
  
  selecting one or more of the audio signals based at least in part on the comparing.
- Dependent claims (14, 15, 16, 17, 18, 19, 20):
  - 14. The one or more non-transitory computer-readable media of claim 13, wherein the generating comprises comparing the recognized speech to expected speech.
  - 15. The one or more non-transitory computer-readable media of claim 13, the acts further comprising:
    - receiving microphone signals from an array of spaced microphones; and
      
      processing the received microphone signals to produce the audio signals.
  - 16. The one or more non-transitory computer-readable media of claim 13, the acts further comprising:
    - receiving microphone signals from an array of spaced microphones; and
      
      beamforming the received microphone signals to produce the audio signals, wherein the beamforming emphasizes sound from a particular direction relative to sound from other directions.
  - 17. The one or more non-transitory computer-readable media of claim 13, wherein the selecting comprises selecting the audio signal having a highest confidence value.
  - 18. The one or more non-transitory computer-readable media of claim 13, wherein analyzing the audio signals is performed concurrently on each of the audio signals.
  - 19. The one or more non-transitory computer-readable media of claim 13, wherein receiving the audio signals comprises concurrently receiving the audio signals.
  - 20. The one or more non-transitory computer-readable media of claim 13, wherein the selecting one or more of the audio signals based at least in part on the comparing further comprises selecting at least one of the plurality of different text streams based at least in part on the comparing.
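
Claim 13 generates confidence values by comparing text streams against each other and increasing the confidence of a stream that produces "similar or identical results" as another stream. A minimal sketch of that idea follows. The similarity measure (`difflib.SequenceMatcher`), the `threshold` of 0.8, the `boost` of 0.1, and the function name `boost_by_agreement` are all assumptions made for illustration; the claim does not specify any of them.

```python
from difflib import SequenceMatcher


def boost_by_agreement(streams, threshold=0.8, boost=0.1):
    """Raise the confidence of text streams that agree with another stream.

    streams: list of (text, base_confidence) pairs, one per audio signal.
    threshold and boost are illustrative values, not taken from the patent.
    Returns a list of adjusted confidences, capped at 1.0.
    """
    adjusted = []
    for i, (text, conf) in enumerate(streams):
        agrees = any(
            SequenceMatcher(None, text, other).ratio() >= threshold
            for j, (other, _) in enumerate(streams)
            if j != i and text and other
        )
        adjusted.append(min(1.0, conf + boost) if agrees else conf)
    return adjusted


# Made-up recognizer output for three audio signals.
streams = [
    ("turn on the lights", 0.70),
    ("turn on the light", 0.65),
    ("burn the rice", 0.72),
]
confidences = boost_by_agreement(streams)
best = max(range(len(streams)), key=lambda i: confidences[i])
print(confidences, "-> selected audio signal", best)
```

In this made-up case the third stream starts with the highest base confidence, but the first two streams corroborate each other, so the first audio signal is selected after the boost.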

Current Assignee: Amazon Technologies, Inc. (Amazon.com, Inc.)
Original Assignee: Amazon Technologies, Inc. (Amazon.com, Inc.)
Inventors: Crump, Edward Dietz; Pollack, Joshua; Sadek, Ramy S.
Primary Examiner(s): PULLIAS, JESSE SCOTT

Application Number: US13/624,667
Time in Patent Office: 1,019 Days
Field of Search: 704/231-257, 381/92
US Class Current: 1/1
CPC Class Codes:
G10L 15/32   Multiple recognisers used i...
G10L 15/34   Adaptation of a single reco...
G10L 2021/02166   Microphone arrays; Beamforming
H04R 1/406   microphones
H04R 2430/20   Processing of the output si...
H04R 3/005   for combining the signals o...
