Neural network based beam selection

US 10,134,421 B1
Filed: 04/30/2018
Issued: 11/20/2018
Est. Priority Date: 08/04/2016
Status: Active Grant



Abstract

A neural network model, such as a deep neural network (DNN), is trained to perform beam selection in a microphone array-based speech processing system. Training uses many different speech examples, each labeled with position or direction information relative to a training microphone array, so that the DNN learns to recognize the direction of incoming speech. At runtime, the trained DNN may process input audio data from a microphone array and output to a beam selector an indicator of the beam to select for further processing. The DNN may be configured to output a beam index and/or coordinates (or other position data) corresponding to an estimated location of the detected speech. The DNN may also be configured to output acoustic unit data corresponding to speech units (for example, phonemes or senones, such as those of a detected wakeword or other word).
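
As an illustration of the last step in the abstract, where the network hands a beam indicator to the beam selector, the following is a minimal sketch and not the patented implementation. It assumes the network emits one raw score per beam; the names `softmax`, `select_beam`, and `beam_logits`, and the score values, are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw per-beam scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_beam(logits: Sequence[float]) -> Tuple[int, float]:
    """Return (beam index, probability) for the highest-scoring beam."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]


# Hypothetical network outputs for a four-beam array (assumed values).
beam_logits = [0.2, 2.5, -1.0, 0.3]
index, confidence = select_beam(beam_logits)
print(index)  # 1
```

The returned index plays the role of the "beam index" mentioned in the abstract; a network that outputs coordinates instead would need a separate mapping from position to beam.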

20 Claims

1. A computer-implemented method comprising:
- capturing, during a first time period, first audio using a first microphone;
  
  capturing, during the first time period, second audio using a second microphone;
  
  determining first audio data corresponding to the first audio;
  
  determining second audio data corresponding to the second audio;
  
  determining, using at least the first audio data and the second audio data, third audio data corresponding to a first direction;
  
  determining, using at least the first audio data and the second audio data, fourth audio data corresponding to a second direction; and
  
  processing the third audio data and the fourth audio data with a neural network classifier to determine that the third audio data better represents speech than does the fourth audio data.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10)
  - 2. The computer-implemented method of claim 1, further comprising:
    - before processing the third audio data and the fourth audio data with the neural network classifier, processing the first audio data and the second audio data with the neural network classifier to determine that at least one of the first audio data and the second audio data represents the speech.
  - 3. The computer-implemented method of claim 1, further comprising:
    - processing the first audio data and the second audio data with the neural network classifier to determine a distance between a source of the speech and the first microphone.
  - 4. The computer-implemented method of claim 1, further comprising:
    - processing the first audio data and the second audio data with the neural network classifier to determine an angle between a source of the speech and the first microphone; and
      
      selecting, based on the angle, a beam associated with the first microphone.
  - 5. The computer-implemented method of claim 1, further comprising:
    - before processing the third audio data and the fourth audio data with the neural network classifier, processing the first audio data and the second audio data with the neural network classifier to determine that the first audio data has a higher signal-to-noise ratio (SNR) than the second audio data.
  - 6. The computer-implemented method of claim 1, further comprising:
    - processing the first audio data and the second audio data with the neural network classifier to determine a speech unit corresponding to the speech; and
      
      determining that the speech unit corresponds to at least a portion of a wakeword or at least a portion of a command.
  - 7. The computer-implemented method of claim 1, further comprising:
    - wherein determining the third audio data comprises applying a first set of beamformer coefficients to the first audio data and the second audio data, and wherein determining the fourth audio data comprises applying a second set of beamformer coefficients to the first audio data and the second audio data.
  - 8. The computer-implemented method of claim 1, further comprising:
    - capturing, during a second time period, third audio using the first microphone;
      
      capturing, during the second time period, fourth audio using the second microphone;
      
      determining fifth audio data corresponding to the third audio;
      
      determining sixth audio data corresponding to the fourth audio;
      
      determining, using at least the fifth audio data and the sixth audio data, seventh audio data corresponding to the first direction; and
      
      sending, to a server device, the seventh audio data.
  - 9. The computer-implemented method of claim 1, wherein processing the third audio data and the fourth audio data comprises:
    - determining a first frame of the third audio data;
      
      determining a second frame preceding the first frame; and
      
      determining, based at least on the second frame, that the first frame represents at least part of the speech.
  - 10. The computer-implemented method of claim 1, further comprising:
    - after processing the third audio data and the fourth audio data with the neural network classifier, determining, using a speech recognition engine, text data corresponding to the speech.
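
The flow of claims 1 and 7 (two microphone signals, two coefficient sets, two directional signals, then a comparison) can be sketched roughly as below. This is a simplified illustration under stated assumptions: each coefficient set is a single real weight per microphone, whereas practical beamformers typically use delays or per-frequency filters, and `energy_score` is a placeholder standing in for the neural network classifier, which the claims do not specify in this form. All function names and sample values are hypothetical.

```python
from typing import Callable, List, Sequence


def apply_beamformer(
    mic1: Sequence[float],
    mic2: Sequence[float],
    coefficients: Sequence[float],
) -> List[float]:
    """Weighted sum of two microphone signals, one weight per microphone."""
    w1, w2 = coefficients
    return [w1 * a + w2 * b for a, b in zip(mic1, mic2)]


def energy_score(beam: Sequence[float]) -> float:
    """Placeholder for the classifier (assumption): mean signal energy."""
    return sum(x * x for x in beam) / len(beam)


def better_beam(
    beam_a: Sequence[float],
    beam_b: Sequence[float],
    score: Callable[[Sequence[float]], float] = energy_score,
) -> int:
    """Return 0 if beam_a scores at least as high as beam_b, else 1."""
    return 0 if score(beam_a) >= score(beam_b) else 1


# First and second audio data: the same in-phase signal at both microphones.
first_audio = [1.0, -1.0, 1.0, -1.0]
second_audio = [1.0, -1.0, 1.0, -1.0]

# Third and fourth audio data: one beam per coefficient set (assumed weights).
third_audio = apply_beamformer(first_audio, second_audio, (0.5, 0.5))
fourth_audio = apply_beamformer(first_audio, second_audio, (0.5, -0.5))

print(better_beam(third_audio, fourth_audio))  # 0
```

With these assumed weights the first set reinforces the in-phase source and the second cancels it, so the first beam wins the comparison. Swapping in a trained model only requires passing a different `score` callable.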

11. A system comprising:
- at least one processor;
  
  at least one memory including instructions that, when executed by the at least one processor, cause the system to:
  
  capture, during a first time period, first audio using a first microphone;
  
  capture, during the first time period, second audio using a second microphone;
  
  determine first audio data corresponding to the first audio;
  
  determine second audio data corresponding to the second audio;
  
  determine, using at least the first audio data and the second audio data, third audio data corresponding to a first direction;
  
  determine, using at least the first audio data and the second audio data, fourth audio data corresponding to a second direction; and
  
  process the third audio data and the fourth audio data with a neural network classifier to determine that the third audio data better represents speech than does the fourth audio data.
- Dependent claims (12, 13, 14, 15, 16, 17, 18, 19, 20)
  - 12. The system of claim 11, wherein the instructions further cause the system to:
    - before processing the third audio data and the fourth audio data with the neural network classifier, process the first audio data and the second audio data with the neural network classifier to determine that at least one of the first audio data and the second audio data represents the speech.
  - 13. The system of claim 11, wherein the instructions further cause the system to:
    - process the first audio data and the second audio data with the neural network classifier to determine a distance between a source of the speech and the first microphone.
  - 14. The system of claim 11, wherein the instructions further cause the system to:
    - process the first audio data and the second audio data with the neural network classifier to determine an angle between a source of the speech and the first microphone; and
      
      select, based on the angle, a beam associated with the first microphone.
  - 15. The system of claim 11, wherein the instructions further cause the system to:
    - before processing the third audio data and the fourth audio data with the neural network classifier, process the first audio data and the second audio data with the neural network classifier to determine that the first audio data has a higher signal-to-noise ratio (SNR) than the second audio data.
  - 16. The system of claim 11, wherein the instructions further cause the system to:
    - process the first audio data and the second audio data with the neural network classifier to determine a speech unit corresponding to the speech; and
      
      determine that the speech unit corresponds to at least a portion of a wakeword or at least a portion of a command.
  - 17. The system of claim 11, wherein the instructions that cause the system to determine the third audio data further cause the system to apply a first set of beamformer coefficients to the first audio data and the second audio data, and wherein the instructions that cause the system to determine the fourth audio data further cause the system to apply a second set of beamformer coefficients to the first audio data and the second audio data.
  - 18. The system of claim 11, wherein the instructions further cause the system to:
    - capture, during a second time period, third audio using the first microphone;
      
      capture, during the second time period, fourth audio using the second microphone;
      
      determine fifth audio data corresponding to the third audio;
      
      determine sixth audio data corresponding to the fourth audio;
      
      determine, using at least the fifth audio data and the sixth audio data, seventh audio data corresponding to the first direction; and
      
      send, to a server device, the seventh audio data.
  - 19. The system of claim 11, wherein the instructions that cause the system to process the third audio data and the fourth audio data further cause the system to:
    - determine a first frame of the third audio data;
      
      determine a second frame preceding the first frame; and
      
      determine, based at least on the second frame, that the first frame represents at least part of the speech.
  - 20. The system of claim 11, wherein the instructions further cause the system to:
    - after processing the third audio data and the fourth audio data with the neural network classifier, determine, using a speech recognition engine, text data corresponding to the speech.
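
Claims 9 and 19 describe deciding whether a frame represents speech based at least on a preceding frame. One way to prepare such input, shown here only as a sketch, is to split the beamformed audio into frames and pair each frame with the one before it. The frame length, the non-overlapping framing, and the zero-filled context for the very first frame are all assumptions, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Frame = List[float]


def split_frames(samples: Sequence[float], frame_len: int) -> List[Frame]:
    """Cut audio into consecutive non-overlapping frames, dropping any remainder."""
    count = len(samples) // frame_len
    return [list(samples[i * frame_len:(i + 1) * frame_len]) for i in range(count)]


def with_preceding(frames: Sequence[Frame]) -> List[Tuple[Frame, Frame]]:
    """Pair each frame with the frame before it.

    The first frame has no predecessor, so it is paired with a zero frame
    (an assumption made for this sketch).
    """
    pairs = []
    for i, frame in enumerate(frames):
        previous = frames[i - 1] if i > 0 else [0.0] * len(frame)
        pairs.append((previous, frame))
    return pairs


# Seven samples with a frame length of 2: the last sample is dropped.
samples = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
frames = split_frames(samples, 2)
pairs = with_preceding(frames)
print(len(pairs))  # 3
```

Each `(previous, current)` pair could then be passed to a classifier so that its decision about the current frame takes the earlier frame into account.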


Current Assignee: Amazon Technologies, Inc. (Amazon.com, Inc.)
Original Assignee: Amazon Technologies, Inc. (Amazon.com, Inc.)
Inventors: Sundaram, Shiva Kumar
Primary Examiner(s): Baker, Charlotte M
Application Number: US15/967,185
Time in Patent Office: 204 Days
Field of Search: 704232
US Class Current
CPC Class Codes

G01S 11/14   using ultrasonic, sonic, or...

G01S 2205/01   specially adapted for speci...

G01S 3/8083   determining direction of so...

G01S 5/28   by co-ordinating position l...

G06N 3/044   Recurrent networks, e.g. Ho...

G06N 3/045   Combinations of networks

G10L 17/04   Training, enrolment or mode...

G10L 17/08   Use of distortion metrics o...

G10L 2021/02166   Microphone arrays; Beamforming

G10L 21/028   using properties of sound s...

G10L 25/30   using neural networks

G10L 25/78   Detection of presence or ab...
