Trigger word based beam selection
First Claim
1. A computer-implemented method comprising:
receiving input audio data corresponding to input audio captured by a microphone array;
performing beamforming on the input audio data to determine first beamformed audio data corresponding to a first direction and second beamformed audio data corresponding to a second direction;
processing the first beamformed audio data to determine a first plurality of feature vectors corresponding to a first time period;
processing the first plurality of feature vectors using a first neural network to determine a first score, the first score corresponding to a likelihood that at least a first portion of a wakeword is represented in the first beamformed audio data corresponding to the first time period;
processing the second beamformed audio data to determine a second plurality of feature vectors corresponding to a second time period;
processing the second plurality of feature vectors using a second neural network to determine a second score, the second score corresponding to a likelihood that at least a second portion of the wakeword is represented in the second beamformed audio data corresponding to the second time period;
determining, based on the first score exceeding a threshold, that the first portion of the wakeword is represented in the first beamformed audio data;
determining, based on the second score exceeding the threshold, that the second portion of the wakeword is represented in the second beamformed audio data;
determining that the first portion of the wakeword represented in the first beamformed audio data corresponds to the first time period;
determining that the second portion of the wakeword represented in the second beamformed audio data corresponds to the second time period;
selecting the first beamformed audio data in response to the first time period being prior to the second time period; and
sending the first beamformed audio data for further processing.
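The selection step of claim 1 can be illustrated with a minimal sketch. It assumes the per-beam neural networks have already produced one score per frame; the function names, the beam labels, and the use of a frame index as the "time period" are all hypothetical and do not come from the patent.

```python
from typing import Dict, Optional, Sequence


def first_detection_frame(scores: Sequence[float], threshold: float) -> Optional[int]:
    """Return the index of the first frame whose score exceeds the threshold."""
    for frame_index, score in enumerate(scores):
        if score > threshold:
            return frame_index
    return None  # no wakeword portion detected on this beam


def select_earliest_beam(
    beam_scores: Dict[str, Sequence[float]], threshold: float
) -> Optional[str]:
    """Pick the beam on which a wakeword portion was detected earliest."""
    detections = {}
    for beam, scores in beam_scores.items():
        frame = first_detection_frame(scores, threshold)
        if frame is not None:
            detections[beam] = frame
    if not detections:
        return None
    # The smallest frame index corresponds to the earliest time period.
    return min(detections, key=detections.get)


# Hypothetical per-frame scores for two beams.
example_scores = {
    "beam_0": [0.1, 0.2, 0.9, 0.95],
    "beam_1": [0.1, 0.1, 0.3, 0.8],
}
selected = select_earliest_beam(example_scores, threshold=0.5)
```

In this example `beam_0` crosses the threshold one frame before `beam_1`, so its audio would be the one sent on for further processing.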
Abstract
An audio capture device incorporates a beamformer and beam-specific trigger word detection. Audio data from each beam is processed by a low-power trigger word detector, such as a neural network or other trained model, to detect whether the audio data (such as an audio frame or a corresponding feature vector) likely includes part of a trigger word. The beam that either most strongly represents a trigger word portion or represents a trigger word portion earliest in time may be selected for further processing, such as speech processing or confirmation by a more robust, more power-intensive trigger word detector.
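The two-stage arrangement described in the abstract (a cheap detector on every beam, then a costlier confirmation) might be sketched as follows. Both detectors are passed in as plain callables because the abstract does not specify their form; `cascade_detect`, `cheap_score`, and `robust_confirm` are hypothetical names, and choosing the strongest beam is only one of the two selection options the abstract mentions.

```python
from typing import Callable, Dict, Optional, Sequence

Frame = Sequence[float]


def cascade_detect(
    beam_frames: Dict[str, Frame],
    cheap_score: Callable[[Frame], float],
    robust_confirm: Callable[[Frame], bool],
    threshold: float,
) -> Optional[str]:
    """Score every beam with a cheap detector, then confirm only the best one."""
    scores = {beam: cheap_score(frame) for beam, frame in beam_frames.items()}
    candidates = {beam: s for beam, s in scores.items() if s > threshold}
    if not candidates:
        return None  # the expensive detector is never invoked
    # Assumption: pick the beam that most strongly represents the trigger word.
    best = max(candidates, key=candidates.get)
    return best if robust_confirm(beam_frames[best]) else None
```

The point of the design is that the power-intensive detector runs at most once per candidate event, and only on the single beam the low-power stage has already flagged.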
109 Citations
20 Claims
3. A computer-implemented method comprising:
receiving input audio data corresponding to input audio captured by a microphone array;
performing beamforming on the input audio data to determine first beamformed audio data corresponding to a first direction and second beamformed audio data corresponding to a second direction;
determining at least one first feature vector corresponding to a first portion of the first beamformed audio data and at least one second feature vector corresponding to a first portion of the second beamformed audio data;
using a first trained model to process the at least one first feature vector to determine a first score, the first score corresponding to a likelihood that at least a first portion of a wakeword is represented in the first portion of the first beamformed audio data;
using a second trained model to process the at least one second feature vector to determine a second score, the second score corresponding to a likelihood that at least a second portion of the wakeword is represented in the first portion of the second beamformed audio data;
determining, based on at least the first score exceeding a threshold, that at least the first portion of the wakeword is represented in the first portion of the first beamformed audio data;
determining, based on at least the second score exceeding the threshold, that at least the second portion of the wakeword is represented in the first portion of the second beamformed audio data; and
selecting, based at least on the first score and the second score, at least a second portion of the first beamformed audio data for further processing by a speech processing component configured to identify a command to perform an action.
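Claim 3 scores a first portion of each beam and then forwards a second portion of the chosen beam. A rough sketch of that split is shown below, under the assumption that the beam with the highest above-threshold score wins (the claim only says the selection is based on the scores). The function name and the sample-count parameter are hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def select_and_forward(
    beam_audio: Dict[str, Sequence[float]],
    first_portion_scores: Dict[str, float],
    first_portion_len: int,
    threshold: float,
) -> Optional[Tuple[str, List[float]]]:
    """Choose a beam from first-portion scores; return its remaining audio."""
    above = {b: s for b, s in first_portion_scores.items() if s > threshold}
    if not above:
        return None
    # Assumption: the strongest score decides which beam is selected.
    best = max(above, key=above.get)
    # The "second portion" is taken here as everything after the scored samples.
    return best, list(beam_audio[best][first_portion_len:])


audio = {"beam_0": [1, 2, 3, 4, 5], "beam_1": [6, 7, 8, 9, 10]}
result = select_and_forward(
    audio, {"beam_0": 0.7, "beam_1": 0.9}, first_portion_len=2, threshold=0.5
)
```

Here `result` names `beam_1` and carries the samples that follow the scored portion, which is what a downstream speech processing component would receive.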
7. A computer-implemented method, comprising:
receiving input audio data corresponding to input audio captured by a microphone array;
determining first audio data corresponding to a first direction and second audio data corresponding to a second direction;
using a first trained model to process the first audio data to determine a first score, the first score corresponding to a likelihood that at least a first portion of a wakeword is represented in the first audio data;
using a second trained model to process the second audio data to determine a second score, the second score corresponding to a likelihood that at least a second portion of the wakeword is represented in the second audio data;
determining, based on at least the first score exceeding a threshold, that at least the first portion of the wakeword is represented in the first audio data;
determining, based on at least the second score exceeding the threshold, that at least the second portion of the wakeword is represented in the second audio data;
determining that at least the first portion of the wakeword represented in the first audio data corresponds to a first time period;
determining that at least the second portion of the wakeword represented in the second audio data corresponds to a second time period; and
selecting, based at least in part on the first score, the second score, and the first time period being before the second time period, further audio data corresponding to the first direction for further processing.
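Claim 7 bases the selection on both the scores and the relative timing. One possible way to combine the two is sketched below: detections below the threshold are discarded, the earliest remaining detection wins, and detections that start within a small window of the earliest are treated as ties broken by score. The tie window is purely an illustrative assumption and is not part of the claim; the `Detection` record and `pick_direction` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Detection:
    direction: str   # label of the beam or look direction
    score: float     # wakeword-portion likelihood from the trained model
    start_ms: int    # start of the time period in which the portion was detected


def pick_direction(
    detections: Iterable[Detection], threshold: float, tie_window_ms: int = 0
) -> Optional[str]:
    """Prefer the earliest above-threshold detection; break near-ties by score."""
    valid = [d for d in detections if d.score > threshold]
    if not valid:
        return None
    earliest = min(d.start_ms for d in valid)
    # Assumption: detections this close to the earliest count as simultaneous.
    near_earliest = [d for d in valid if d.start_ms - earliest <= tie_window_ms]
    return max(near_earliest, key=lambda d: d.score).direction
```

With `tie_window_ms=0` this reduces to strict earliest-first selection; a wider window lets a clearly stronger beam win when arrival times are nearly identical.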
12. A device comprising:
at least one processor;
at least one microphone array comprising a plurality of microphones; and
at least one memory including instructions operable to be executed by the at least one processor to configure the device to:
receive input audio data corresponding to input audio captured by the at least one microphone array,
perform beamforming on the input audio data to determine first beamformed audio data corresponding to a first direction and second beamformed audio data corresponding to a second direction,
determine at least one first feature vector corresponding to a first portion of the first beamformed audio data and at least one second feature vector corresponding to a first portion of the second beamformed audio data,
use a first trained model to process the at least one first feature vector to determine a first score, the first score corresponding to a likelihood that at least a first portion of a wakeword is represented in the first portion of the first beamformed audio data,
use a second trained model to process the at least one second feature vector to determine a second score, the second score corresponding to a likelihood that at least a second portion of the wakeword is represented in the first portion of the second beamformed audio data,
determine, based on at least the first score exceeding a threshold, that at least the first portion of the wakeword is represented in the first portion of the first beamformed audio data,
determine, based on at least the second score exceeding the threshold, that at least the second portion of the wakeword is represented in the first portion of the second beamformed audio data, and
select, based at least on the first score and the second score, at least a second portion of the first beamformed audio data for further processing by a speech processing component configured to identify a command to perform an action.
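The claims do not say which beamforming technique is used. As one common possibility, a delay-and-sum beamformer with integer sample delays can produce one output per look direction from the same microphone channels. The sketch below is an assumption for illustration only; `delay_and_sum` and the convention that each delay is the number of samples by which a channel lags the reference microphone are hypothetical.

```python
from typing import List, Sequence


def delay_and_sum(
    channels: Sequence[Sequence[float]], delays: Sequence[int]
) -> List[float]:
    """Align each microphone channel by its steering delay, then average."""
    if len(channels) != len(delays):
        raise ValueError("one delay per channel is required")
    n = len(channels[0])
    out = [0.0] * n
    for channel, delay in zip(channels, delays):
        for t in range(n):
            src = t + delay  # read ahead to undo the channel's arrival lag
            if 0 <= src < n:
                out[t] += channel[src]
    return [value / len(channels) for value in out]


# A pulse that reaches microphone 1 one sample after microphone 0.
mics = [[0, 0, 1, 0, 0], [0, 0, 0, 1, 0]]
first_beam = delay_and_sum(mics, delays=[0, 1])   # steered toward the source
second_beam = delay_and_sum(mics, delays=[1, 0])  # steered elsewhere
```

Steering toward the source adds the pulse coherently (peak of 1.0 in `first_beam`), while the mismatched steering in `second_beam` spreads it into two half-height peaks, which is why a per-beam detector would tend to score the first beam higher.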
16. A device, comprising:
at least one processor;
at least one microphone array comprising a plurality of microphones; and
at least one memory including instructions operable to be executed by the at least one processor to configure the device to:
receive input audio data corresponding to input audio captured by the at least one microphone array,
determine first audio data corresponding to a first direction and second audio data corresponding to a second direction,
use a first trained model to process the first audio data to determine a first score, the first score corresponding to a likelihood that at least a first portion of a wakeword is represented in the first audio data,
use a second trained model to process the second audio data to determine a second score, the second score corresponding to a likelihood that at least a second portion of the wakeword is represented in the second audio data,
determine, based on at least the first score exceeding a threshold, that at least the first portion of the wakeword is represented in the first audio data,
determine, based on at least the second score exceeding the threshold, that at least the second portion of the wakeword is represented in the second audio data,
determine that at least the first portion of the wakeword represented in the first audio data corresponds to a first time period,
determine that at least the second portion of the wakeword represented in the second audio data corresponds to a second time period, and
select, based at least in part on the first score, the second score, and the first time period being before the second time period, further audio data corresponding to the first direction for further processing.
Specification