Trigger word based beam selection

US 10,304,475 B1
Filed: 08/14/2017
Issued: 05/28/2019
Est. Priority Date: 08/14/2017
Status: Active Grant

Abstract

An audio capture device incorporates a beamformer and beam-specific trigger word detection. Audio data from each beam is processed by a low-power trigger word detector, such as a neural network or other trained model, to detect whether the audio data (such as an audio frame or a feature vector corresponding to it) likely includes part of a trigger word. The beam that either most strongly represents a trigger word portion or represents a trigger word portion earliest in time may be selected for further processing, such as speech processing or confirmation by a more robust, more power-intensive trigger word detector.
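The two selection criteria named in the abstract (strongest representation, or earliest in time) can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the patent: the `Detection` record, the `select_beam` function, the 0.5 threshold, and the tie-breaking rule are assumptions made only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Detection:
    """One per-beam detector output (hypothetical record layout)."""
    beam: int       # index of the beam (look direction)
    time_s: float   # start time of the scored audio, in seconds
    score: float    # detector likelihood, assumed to lie in [0, 1]


def select_beam(detections: List[Detection],
                threshold: float = 0.5,
                mode: str = "earliest") -> Optional[int]:
    """Return the beam index to forward, or None if no beam triggered."""
    # Keep only beams whose score exceeds the (assumed) threshold.
    hits = [d for d in detections if d.score > threshold]
    if not hits:
        return None
    if mode == "earliest":
        # Earliest trigger wins; ties go to the higher score (an assumption).
        best = min(hits, key=lambda d: (d.time_s, -d.score))
    elif mode == "strongest":
        best = max(hits, key=lambda d: d.score)
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return best.beam


detections = [
    Detection(beam=0, time_s=0.30, score=0.62),
    Detection(beam=1, time_s=0.34, score=0.91),
    Detection(beam=2, time_s=0.28, score=0.40),  # below threshold, ignored
]
earliest_beam = select_beam(detections, mode="earliest")
strongest_beam = select_beam(detections, mode="strongest")
```

With these made-up numbers the two modes disagree: beam 0 triggers first, while beam 1 has the higher score, which is why the choice of criterion matters.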

109 Citations

20 Claims

1. A computer-implemented method comprising:
- receiving input audio data corresponding to input audio captured by a microphone array;
  
  performing beamforming on the input audio data to determine first beamformed audio data corresponding to a first direction and second beamformed audio data corresponding to a second direction;
  
  processing the first beamformed audio data to determine a first plurality of feature vectors corresponding to a first time period;
  
  processing the first plurality of feature vectors using a first neural network to determine a first score, the first score corresponding to a likelihood that at least a first portion of a wakeword is represented in the first beamformed audio data corresponding to the first time period;
  
  processing the second beamformed audio data to determine a second plurality of feature vectors corresponding to a second time period;
  
  processing the second plurality of feature vectors using a second neural network to determine a second score, the second score corresponding to a likelihood that at least a second portion of the wakeword is represented in the second beamformed audio data corresponding to the second time period;
  
  determining, based on the first score exceeding a threshold, that the first portion of the wakeword is represented in the first beamformed audio data;
  
  determining, based on the second score exceeding the threshold, that the second portion of the wakeword is represented in the second beamformed audio data;
  
  determining that the first portion of the wakeword represented in the first beamformed audio data corresponds to the first time period;
  
  determining that the second portion of the wakeword represented in the second beamformed audio data corresponds to the second time period;
  
  selecting the first beamformed audio data in response to the first time period being prior to the second time period; and
  
  sending the first beamformed audio data for further processing.
- - 2. The computer-implemented method of claim 1, further comprising:
    - sending the first beamformed audio data to a wakeword component;
      
      operating the wakeword component to compare the first beamformed audio data to a stored audio signature corresponding to the wakeword;
      
      determining, using the wakeword component, that the first beamformed audio data comprises the wakeword; and
      
      sending the first beamformed audio data to a remote device for speech processing.
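The confirmation step of claim 2 (compare the selected beam's audio to a stored signature, then forward it) might look roughly like the sketch below. The claim does not say how the comparison is performed; cosine similarity, the 0.8 cutoff, and the `confirm_and_forward` and `send` names are stand-ins chosen purely for illustration.

```python
import math
from typing import Callable, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def confirm_and_forward(beam_audio: Sequence[float],
                        signature: Sequence[float],
                        send: Callable[[Sequence[float]], None],
                        min_similarity: float = 0.8) -> bool:
    """Second-stage check: forward the audio only if it matches the signature.

    Assumption: the stored "audio signature" is a vector comparable to the
    beam audio representation; the patent text does not specify its form.
    """
    if cosine_similarity(beam_audio, signature) < min_similarity:
        return False
    send(beam_audio)  # e.g. hand off to a remote speech-processing service
    return True


sent = []
accepted = confirm_and_forward([1.0, 2.0, 3.0], [1.0, 2.0, 3.1], sent.append)
rejected = confirm_and_forward([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], sent.append)
```

The point of the cascade is that the more expensive comparison only runs on the single beam already picked by the cheap per-beam detectors.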

3. A computer-implemented method comprising:
- receiving input audio data corresponding to input audio captured by a microphone array;
  
  performing beamforming on the input audio data to determine first beamformed audio data corresponding to a first direction and second beamformed audio data corresponding to a second direction;
  
  determining at least one first feature vector corresponding to a first portion of the first beamformed audio data and at least one second feature vector corresponding to a first portion of the second beamformed audio data;
  
  using a first trained model to process the at least one first feature vector to determine a first score, the first score corresponding to a likelihood that at least a first portion of a wakeword is represented in the first portion of the first beamformed audio data;
  
  using a second trained model to process the at least one second feature vector to determine a second score, the second score corresponding to a likelihood that at least a second portion of the wakeword is represented in the first portion of the second beamformed audio data;
  
  determining, based on at least the first score exceeding a threshold, that at least the first portion of the wakeword is represented in the first portion of the first beamformed audio data;
  
  determining, based on at least the second score exceeding the threshold, that at least the second portion of the wakeword is represented in the first portion of the second beamformed audio data; and
  
  selecting, based at least on the first score and the second score, at least a second portion of the first beamformed audio data for further processing by a speech processing component configured to identify a command to perform an action.
- - 4. The computer-implemented method of claim 3, further comprising:
    - determining that at least the first portion of the wakeword represented in the first portion of the first beamformed audio data corresponds to a first time period;
      
      determining a plurality of scores, including the second score, wherein each of the plurality of scores corresponds to audio data captured within a time window after the first time period; and
      
      determining the first score is greater than each of the plurality of scores.
  - 5. The computer-implemented method of claim 3, wherein:
    - the first portion of the first beamformed audio data corresponds to a first time period;
      
      the first portion of the second beamformed audio data corresponds to the first time period; and
      
      the method further comprises determining that the first score is greater than the second score.
  - 6. The computer-implemented method of claim 3, wherein the first portion of the first beamformed audio data and the first portion of the second beamformed audio data correspond to a first time period and the method further comprises:
    - determining a third portion of the first beamformed audio data and a second portion of the second beamformed audio data, wherein the third portion of the first beamformed audio data and the second portion of the second beamformed audio data correspond to a second time period after the first time period;
      
      using the first trained model to process the third portion of the first beamformed audio data to determine a third score;
      
      using the second trained model to process the second portion of the second beamformed audio data to determine a fourth score;
      
      determining that the fourth score is greater than the third score;
      
      determining a difference between the fourth score and the third score;
      
      determining that the difference does not exceed a threshold; and
      
      selecting, based at least in part on the difference not exceeding the threshold, at least a fourth portion of the first beamformed audio data for further processing.
  - 8. The computer-implemented method of claim 3, further comprising:
    - sending at least the first portion of the first beamformed audio data to a further wakeword detection component; and
      
      by the further wakeword detection component, comparing the first portion of the first beamformed audio data to a stored audio signature corresponding to the wakeword to determine that the first portion of the first beamformed audio data represents the wakeword.
  - 9. The computer-implemented method of claim 8, wherein comparing the first portion of the first beamformed audio data to the stored audio signature with the further wakeword detection component uses more computing power than using the first trained model to process the at least one first feature vector to determine the first score.
  - 10. The computer-implemented method of claim 3, further comprising:
    - sending the at least one first feature vector to a further wakeword detection component.
  - 11. The computer-implemented method of claim 3, further comprising:
    - sending the second portion of the first beamformed audio data to the speech processing component.
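Two of the dependent claims above describe comparisons that are easy to misread in prose: claim 4 checks that the first score beats every score in a following time window, and claim 6 keeps the first beam when a rival beam's later score is higher by no more than a threshold. A minimal sketch, assuming hypothetical function names and a 0.1 margin; switching beams when the margin is exceeded is an assumption, since claim 6 only recites the case where the first beam is kept.

```python
from typing import Sequence


def is_peak_in_window(first_score: float, later_scores: Sequence[float]) -> bool:
    """Claim 4 style check: the first score exceeds every score in the window."""
    return all(first_score > s for s in later_scores)


def keep_or_switch(current_beam: int, current_score: float,
                   rival_beam: int, rival_score: float,
                   margin: float = 0.1) -> int:
    """Claim 6 style check: stay on the current beam unless the rival leads
    by more than `margin` (the switch branch is an illustrative assumption)."""
    difference = rival_score - current_score
    if difference > margin:
        return rival_beam
    return current_beam


peak = is_peak_in_window(0.90, [0.70, 0.85])
not_peak = is_peak_in_window(0.90, [0.95])
stay = keep_or_switch(current_beam=0, current_score=0.80,
                      rival_beam=1, rival_score=0.85)
switch = keep_or_switch(current_beam=0, current_score=0.60,
                        rival_beam=1, rival_score=0.85)
```

The margin acts as hysteresis: small score fluctuations between adjacent beams do not cause the selected beam to flip back and forth.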

7. A computer-implemented method, comprising:
- receiving input audio data corresponding to input audio captured by a microphone array;
  
  determining first audio data corresponding to a first direction and second audio data corresponding to a second direction;
  
  using a first trained model to process the first audio data to determine a first score, the first score corresponding to a likelihood that at least a first portion of a wakeword is represented in the first audio data;
  
  using a second trained model to process the second audio data to determine a second score, the second score corresponding to a likelihood that at least a second portion of the wakeword is represented in the second audio data;
  
  determining, based on at least the first score exceeding a threshold, that at least the first portion of the wakeword is represented in the first audio data;
  
  determining, based on at least the second score exceeding the threshold, that at least the second portion of the wakeword is represented in the second audio data;
  
  determining that at least the first portion of the wakeword represented in the first audio data corresponds to a first time period;
  
  determining that at least the second portion of the wakeword represented in the second audio data corresponds to a second time period; and
  
  selecting, based at least in part on the first score, the second score, and the first time period being before the second time period, further audio data corresponding to the first direction for further processing.
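Claim 7 ties each above-threshold score to a time period and prefers the direction whose wakeword portion appears first. One way to picture this is with per-direction score streams, one score per frame. The sketch below is an assumption-laden illustration: the 10 ms frame hop, the direction labels, the function names, and the score-based tie-break are all hypothetical.

```python
from typing import Dict, Optional, Sequence, Tuple


def first_crossing(scores: Sequence[float], threshold: float,
                   hop_s: float) -> Optional[Tuple[float, float]]:
    """Return (time in seconds, score) of the first frame above threshold."""
    for index, score in enumerate(scores):
        if score > threshold:
            return (index * hop_s, score)
    return None


def select_direction(score_streams: Dict[str, Sequence[float]],
                     threshold: float = 0.5,
                     hop_s: float = 0.01) -> Optional[str]:
    """Pick the direction whose detector crosses the threshold earliest."""
    crossings = {}
    for direction, scores in score_streams.items():
        hit = first_crossing(scores, threshold, hop_s)
        if hit is not None:
            crossings[direction] = hit
    if not crossings:
        return None
    # Earliest time first; on equal times the higher score wins (assumption).
    return min(crossings, key=lambda d: (crossings[d][0], -crossings[d][1]))


streams = {
    "north": [0.1, 0.2, 0.7, 0.9],  # crosses at frame 2
    "east":  [0.1, 0.6, 0.8, 0.9],  # crosses at frame 1
    "south": [0.1, 0.1, 0.2, 0.3],  # never crosses
}
chosen = select_direction(streams)
```

Here "east" is chosen because its detector fires one frame earlier than "north", even though both eventually reach the same score.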

12. A device comprising:
- at least one processor;
  
  at least one microphone array comprising a plurality of microphones; and
  
  at least one memory including instructions operable to be executed by the at least one processor to configure the device to:
  
  receive input audio data corresponding to input audio captured by the at least one microphone array,
  
  perform beamforming on the input audio data to determine first beamformed audio data corresponding to a first direction and second beamformed audio data corresponding to a second direction,
  
  determine at least one first feature vector corresponding to a first portion of the first beamformed audio data and at least one second feature vector corresponding to a first portion of the second beamformed audio data,
  
  use a first trained model to process the at least one first feature vector to determine a first score, the first score corresponding to a likelihood that at least a first portion of a wakeword is represented in the first portion of the first beamformed audio data,
  
  use a second trained model to process the at least one second feature vector to determine a second score, the second score corresponding to a likelihood that at least a second portion of the wakeword is represented in the first portion of the second beamformed audio data,
  
  determine, based on at least the first score exceeding a threshold, that at least the first portion of the wakeword is represented in the first portion of the first beamformed audio data,
  
  determine, based on at least the second score exceeding the threshold, that at least the second portion of the wakeword is represented in the first portion of the second beamformed audio data, and
  
  select, based at least on the first score and the second score, at least a second portion of the first beamformed audio data for further processing by a speech processing component configured to identify a command to perform an action.
- - 13. The device of claim 12, wherein the memory includes additional instructions operable to be executed by the at least one processor to further configure the device to:
    - determine that at least the first portion of the wakeword represented in the first portion of the first beamformed audio data corresponds to a first time period;
      
      determine a plurality of scores, including the second score, wherein each of the plurality of scores corresponds to audio data captured within a time window after the first time period; and
      
      determine the first score is greater than each of the plurality of scores.
  - 14. The device of claim 12, wherein:
    - the first portion of the first beamformed audio data corresponds to a first time period;
      
      the first portion of the second beamformed audio data corresponds to the first time period; and
      
      the memory includes additional instructions operable to be executed by the at least one processor to further configure the device to determine that the first score is greater than the second score.
  - 15. The device of claim 12, wherein the first portion of the first beamformed audio data and the first portion of the second beamformed audio data correspond to a first time period and the memory includes additional instructions operable to be executed by the at least one processor to further configure the device to:
    - determine a third portion of the first beamformed audio data and a second portion of the second beamformed audio data, wherein the third portion of the first beamformed audio data and the second portion of the second beamformed audio data correspond to a second time period after the first time period;
      
      use the first trained model to process the third portion of the first beamformed audio data to determine a third score;
      
      use the second trained model to process the second portion of the second beamformed audio data to determine a fourth score;
      
      determine that the fourth score is greater than the third score;
      
      determine a difference between the fourth score and the third score;
      
      determine that the difference does not exceed a threshold; and
      
      select, based at least in part on the difference not exceeding the threshold, at least a fourth portion of the first beamformed audio data for further processing.
  - 17. The device of claim 12, wherein the memory includes additional instructions operable to be executed by the at least one processor to further configure the device to:
    - send at least the first portion of the first beamformed audio data to a further wakeword detection component; and
      
      by the further wakeword detection component, compare the first portion of the first beamformed audio data to a stored audio signature corresponding to the wakeword to determine that the first portion of the first beamformed audio data represents the wakeword.
  - 18. The device of claim 17, wherein execution of the additional instructions to compare the first portion of the first beamformed audio data to the stored audio signature with the further wakeword detection component uses more computing power than execution of the instructions to use the first trained model to process the at least one first feature vector to determine the first score.
  - 19. The device of claim 12, wherein the memory includes additional instructions operable to be executed by the at least one processor to further configure the device to:
    - send the at least one first feature vector to a further wakeword detection component.
  - 20. The device of claim 12, wherein the memory includes additional instructions operable to be executed by the at least one processor to further configure the device to:
    - send the second portion of the first beamformed audio data to the speech processing component.

16. A device, comprising:
- at least one processor;
  
  at least one microphone array comprising a plurality of microphones; and
  
  at least one memory including instructions operable to be executed by the at least one processor to configure the device to:
  
  receive input audio data corresponding to input audio captured by the at least one microphone array,
  
  determine first audio data corresponding to a first direction and second audio data corresponding to a second direction,
  
  use a first trained model to process the first audio data to determine a first score, the first score corresponding to a likelihood that at least a first portion of a wakeword is represented in the first audio data,
  
  use a second trained model to process the second audio data to determine a second score, the second score corresponding to a likelihood that at least a second portion of the wakeword is represented in the second audio data,
  
  determine, based on at least the first score exceeding a threshold, that at least the first portion of the wakeword is represented in the first audio data,
  
  determine, based on at least the second score exceeding the threshold, that at least the second portion of the wakeword is represented in the second audio data,
  
  determine that at least the first portion of the wakeword represented in the first audio data corresponds to a first time period,
  
  determine that at least the second portion of the wakeword represented in the second audio data corresponds to a second time period, and
  
  select, based at least in part on the first score, the second score, and the first time period being before the second time period, further audio data corresponding to the first direction for further processing.

Current Assignee
Amazon Technologies, Inc. (Amazon.com, Inc.)
Original Assignee
Amazon Technologies, Inc. (Amazon.com, Inc.)
Inventors
Wang, Rui; Chhetri, Amit Singh; Li, Xiaoxue; Kristjansson, Trausti Thor; Hilmes, Philip Ryan
Primary Examiner(s)
Thomas-Homescu, Anne L

Application Number

US15/676,273
Time in Patent Office

652 Days
CPC Class Codes

G01S 3/80   using ultrasonic, sonic or ...

G10L 15/02   Feature extraction for spee...

G10L 15/142   Hidden Markov Models [HMMs]

G10L 15/16   using artificial neural net...

G10L 15/22   Procedures used during a sp...

G10L 15/26   Speech to text systems G10L...

G10L 2015/088   Word spotting

G10L 2015/223   Execution procedure of a sp...

G10L 2021/02166   Microphone arrays; Beamforming

G10L 21/0208   Noise filtering

G10L 21/0216   characterised by the method...

G10L 25/30   using neural networks

H04R 1/406   microphones

H04R 2201/401   2D or 3D arrays of transducers

H04R 2430/23   Direction finding using a s...

H04R 2430/25   Array processing for suppre...

H04R 3/005   for combining the signals o...
