Keyword spotting using multi-task configuration
Abstract
An approach to keyword spotting makes use of acoustic parameters that are trained on a keyword spotting task as well as on a second speech recognition task, for example, a large vocabulary continuous speech recognition task. The parameters may be optimized according to a weighted measure that weighs the keyword spotting task more highly than the other task, and that weighs utterances of a keyword more highly than utterances of other speech. In some applications, a keyword spotter configured with the acoustic parameters is used for trigger or wake word detection.
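The abstract's measure of quality can be read as a convex combination of a keyword-spotting loss and a second-task loss, with utterances of the keyword up-weighted inside the keyword-spotting term. A minimal sketch of that reading follows; all names (`combined_quality`, `lam`, `kw_weight`) and the exact averaging scheme are illustrative assumptions, not the patent's implementation:

```python
def combined_quality(kws_losses, lvcsr_losses, is_keyword, lam=0.7, kw_weight=2.0):
    """Weighted combined measure of quality (lower is better).

    lam is the task weighting parameter: it weighs the keyword-spotting
    (KWS) contribution against the LVCSR contribution. kw_weight up-weighs
    utterances that actually contain the keyword within the KWS term.
    """
    # Per-utterance weights inside the KWS task: keyword utterances count more.
    w = [kw_weight if k else 1.0 for k in is_keyword]
    kws = sum(wi * li for wi, li in zip(w, kws_losses)) / sum(w)
    lvcsr = sum(lvcsr_losses) / len(lvcsr_losses)
    # Weighted combination of the two task contributions.
    return lam * kws + (1.0 - lam) * lvcsr
```

Setting `lam` above 0.5 corresponds to the abstract's option of weighing the keyword-spotting task more highly than the other task.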
24 Claims
1. A method for automated detection of a trigger word in an automated speech recognition system, the method comprising:
during a training stage, accepting acoustic training data and corresponding transcription training data, the acoustic training data including processed audio input for a first set of utterances for a trigger word detection task and processed audio input for a second set of utterances for a large vocabulary speech recognition task, and executing a computer-implemented acoustic parameter training procedure to determine acoustic configuration parameters that yield a best measure of quality of the acoustic configuration parameters on multiple tasks, the multiple tasks including the trigger word detection task and the large vocabulary speech recognition task, wherein the acoustic configuration parameters comprise parameters of an artificial neural network comprising a plurality of layers, each layer comprising a plurality of units, and wherein a first part of the acoustic configuration parameters is associated with both the trigger word detection task and the large vocabulary speech recognition task and characterizes multiple layers of the artificial neural network, a second part of the acoustic configuration parameters is associated with the trigger word detection task, and a third part of the acoustic configuration parameters is associated with the large vocabulary speech recognition task, and each of the second and third parts characterizes at least one layer of the artificial neural network;

after the training stage, storing the first part and the second part of the acoustic configuration parameters for configuration of the automated speech recognition system for use during a runtime stage in a user device associated with a user; and

during the runtime stage, at the user device, processing a signal representing an acoustic input monitored at the user device in an environment, the acoustic input including an utterance of a trigger word by the user, the processing including processing successive sections of the signal to form successive numerical feature vectors representing the acoustic content of respective sections of the acoustic input, mapping each successive numerical feature vector to form a corresponding state distribution using the first part and the second part of the acoustic configuration parameters, without using the third part of the acoustic configuration parameters, the mapping including computing the state distribution using a numerical computation of the artificial neural network that includes the values of the first part and the second part of the acoustic configuration parameters, and using the state distributions as input to a keyword spotter to detect the utterance of the trigger word in the acoustic input;

wherein the measure of quality of the acoustic configuration parameters includes a first contribution associated with the trigger word detection task and a second contribution associated with the large vocabulary speech recognition task, and wherein the measure of quality comprises a weighted combination of the first and the second contributions according to a task weighting parameter.

Dependent claims: 2.
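Claim 1's runtime mapping can be pictured as a forward pass through the shared trunk layers (the first part of the parameters) followed by the trigger-word output layer (the second part), with the LVCSR head (the third part) never stored on the device. A toy pure-Python sketch under those assumptions; the layer shapes, helper names, and the use of ReLU and softmax are illustrative, not taken from the patent:

```python
import math

def relu(v):
    return [max(0.0, x) for x in v]

def dense(v, weights, bias):
    # One fully connected layer: weights is a list of rows, bias a list.
    return [sum(wi * xi for wi, xi in zip(row, v)) + b
            for row, b in zip(weights, bias)]

def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def state_distribution(feature_vec, shared, kws_head):
    """Map one numerical feature vector to a state distribution.

    shared: list of (weights, bias) pairs -- the first part of the
            acoustic configuration parameters (trunk layers shared
            by both tasks).
    kws_head: one (weights, bias) pair -- the second part, specific
            to the trigger word detection task. The third part
            (LVCSR head) is deliberately absent.
    """
    h = feature_vec
    for w, b in shared:          # first part: shared trunk layers
        h = relu(dense(h, w, b))
    w, b = kws_head              # second part: trigger-word output layer
    return softmax(dense(h, w, b))
```

The resulting distributions would then be fed, frame by frame, to the keyword spotter as in the final limitation of the claim.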
3. A method for automated speech recognition comprising storing acoustic configuration parameters for configuration of an automated speech recognition system for detection of a keyword in an acoustic input using the automated speech recognition system configured with the acoustic configuration parameters, the acoustic configuration parameters having been determined by a first process comprising:
accepting acoustic training data and corresponding transcription training data, wherein the acoustic training data includes data for a first set of utterances associated with a keyword spotting task and data for a second set of utterances associated with a large vocabulary continuous speech recognition (LVCSR) task;

executing an acoustic parameter training procedure to determine acoustic configuration parameters from both the first set of utterances and the second set of utterances according to a value of a combined measure of quality of the acoustic configuration parameters on multiple tasks including the keyword spotting task and the LVCSR task, the value of the combined measure of quality being determined from both the first set of utterances and the second set of utterances; and

keyword spotting using the acoustic configuration parameters, the keyword spotting including processing, after the determining of the acoustic configuration parameters, data representing an acoustic input monitored in an environment to detect spoken occurrences of a keyword in the acoustic input;

wherein the acoustic configuration parameters include a first part of the acoustic configuration parameters associated with both the keyword spotting task and the LVCSR task and that depends on both the first set of utterances and the second set of utterances, and a second part of the acoustic configuration parameters associated with the keyword spotting task; and

wherein the combined measure of quality of the acoustic configuration parameters includes a first contribution associated with the trigger word detection task and a second contribution associated with the large vocabulary speech recognition task, and wherein the measure of quality comprises a weighted combination of the first and the second contributions according to a task weighting parameter.

Dependent claims: 4-16.
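The training procedure of claim 3 implies that the shared part of the parameters receives gradient contributions from both tasks, while each task-specific part is updated only by its own task, all under the task weighting parameter. A toy illustration with scalar "parts" and quadratic stand-in losses; everything here (the targets, `lam`, `lr`, the function name `train`) is an assumption for demonstration, not the patent's procedure:

```python
def train(lam=0.7, lr=0.1, steps=200):
    """Gradient descent on lam*(shared+kws-1)^2 + (1-lam)*(shared+lvcsr-3)^2.

    shared stands for the first part of the acoustic configuration
    parameters, kws for the second part, lvcsr for the third part.
    """
    shared, kws, lvcsr = 5.0, 5.0, 5.0
    for _ in range(steps):
        # Gradients of the two toy task losses.
        g_kws = 2.0 * (shared + kws - 1.0)      # KWS task wants shared+kws == 1
        g_lvcsr = 2.0 * (shared + lvcsr - 3.0)  # LVCSR task wants shared+lvcsr == 3
        # The shared part is trained by BOTH tasks, weighted by lam.
        shared -= lr * (lam * g_kws + (1.0 - lam) * g_lvcsr)
        # Each task-specific part is trained by its own task only.
        kws -= lr * lam * g_kws
        lvcsr -= lr * (1.0 - lam) * g_lvcsr
    return shared, kws, lvcsr
```

After training, both task objectives are satisfied simultaneously because the shared part settles at a compromise that each head then corrects for its own task.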
17. A method for automated speech recognition comprising:
storing acoustic configuration parameters in a storage, the acoustic configuration parameters including a first part of the acoustic configuration parameters associated with a keyword detection task and an LVCSR task, a second part of the acoustic configuration parameters associated with the keyword detection task, and a third part of the acoustic configuration parameters associated with the LVCSR task;
processing successive sections of an acoustic input received at a device to form corresponding successive first numerical representations of the sections;
processing the successive numerical representations using the acoustic configuration parameters to form successive first distributions and successive second distributions, each numerical representation being associated with a corresponding first distribution of the successive first distributions and with a corresponding second distribution of the successive second distributions,

wherein processing a numerical representation of the successive numerical representations to form a first distribution of the successive first distributions uses the first part of the acoustic configuration parameters and the second part of the acoustic configuration parameters and not the third part of the acoustic configuration parameters, and

wherein processing a numerical representation of the successive numerical representations to form a second distribution of the successive second distributions uses the first part of the acoustic configuration parameters and the third part of the acoustic configuration parameters and not the second part of the acoustic configuration parameters; and
performing the keyword detection task using the successive first distributions as input, and performing the LVCSR task using the successive second distributions as input,

wherein the acoustic configuration parameters comprise parameters of an artificial neural network comprising a plurality of layers, each layer comprising a plurality of units, and wherein the first part of the acoustic configuration parameters characterizes multiple layers of the artificial neural network, and the second part and the third part of the acoustic configuration parameters each characterize at least one layer of the artificial neural network.

Dependent claims: 18-22.
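In claim 17 each feature vector passes through the shared first part once and then through both task-specific heads, yielding the first distribution (parts 1 + 2) and the second distribution (parts 1 + 3) per frame. A pure-Python sketch under those assumptions; dimensions, helper names, and the single shared layer are illustrative only:

```python
import math

def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def linear(v, weights, bias):
    return [sum(wi * xi for wi, xi in zip(row, v)) + b
            for row, b in zip(weights, bias)]

def both_distributions(frames, shared, kws_head, lvcsr_head):
    """For each successive numerical representation, form the first
    (keyword detection) and second (LVCSR) distributions.

    shared, kws_head, lvcsr_head are each a (weights, bias) pair standing
    for the first, second, and third parts of the parameters.
    """
    out = []
    for f in frames:
        # Shared computation: the first part is evaluated once per frame.
        h = [max(0.0, x) for x in linear(f, *shared)]
        first = softmax(linear(h, *kws_head))      # uses parts 1 + 2 only
        second = softmax(linear(h, *lvcsr_head))   # uses parts 1 + 3 only
        out.append((first, second))
    return out
```

This mirrors the claim's point that the two tasks share the trunk computation while each output distribution excludes the other task's head.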
23. Software stored on non-transitory machine-readable media, the software comprising instructions executable by one or more processors to:
accept acoustic training data and corresponding transcription training data, wherein the acoustic training data includes data for a first set of utterances associated with a keyword spotting task and data for a second set of utterances associated with a large vocabulary continuous speech recognition (LVCSR) task;

execute an acoustic parameter training procedure to determine acoustic configuration parameters from both the first set of utterances and the second set of utterances according to a value of a combined measure of quality of the acoustic configuration parameters on multiple tasks including the keyword spotting task and the LVCSR task, the value of the combined measure of quality being determined from both the first set of utterances and the second set of utterances; and

use the determined acoustic configuration parameters for automated speech recognition to detect a keyword in an acoustic input;

wherein the acoustic configuration parameters include a first part of the acoustic configuration parameters associated with both the keyword spotting task and the LVCSR task and that depends on both the first set of utterances and the second set of utterances, and a second part of the acoustic configuration parameters associated with the keyword spotting task; and

wherein the combined measure of quality of the acoustic configuration parameters includes a first contribution associated with the trigger word detection task and a second contribution associated with the large vocabulary speech recognition task, and wherein the measure of quality comprises a weighted combination of the first and the second contributions according to a task weighting parameter.
24. A speech recognition system comprising:
a storage for training data; and

a computer-implemented trainer configured to accept acoustic training data and corresponding transcription training data, wherein the acoustic training data includes data for a first set of utterances associated with a keyword spotting task and data for a second set of utterances associated with a large vocabulary continuous speech recognition (LVCSR) task, and to execute an acoustic parameter training procedure to determine acoustic configuration parameters from both the first set of utterances and the second set of utterances according to a value of a combined measure of quality of the acoustic configuration parameters on multiple tasks including the keyword spotting task and the LVCSR task, the value of the combined measure of quality being determined from both the first set of utterances and the second set of utterances; and

a computer-implemented automated speech recognition system configured to use the determined acoustic configuration parameters to detect a keyword in an acoustic input;

wherein the acoustic configuration parameters include a first part of the acoustic configuration parameters associated with both the keyword spotting task and the LVCSR task and that depends on both the first set of utterances and the second set of utterances, and a second part of the acoustic configuration parameters associated with the keyword spotting task; and

wherein the combined measure of quality of the acoustic configuration parameters includes a first contribution associated with the trigger word detection task and a second contribution associated with the large vocabulary speech recognition task, and wherein the measure of quality comprises a weighted combination of the first and the second contributions according to a task weighting parameter.