Keyword spotting using multi-task configuration
Abstract
An approach to keyword spotting makes use of acoustic parameters that are trained on a keyword spotting task as well as on a second speech recognition task, for example, a large vocabulary continuous speech recognition task. The parameters may be optimized according to a weighted measure that weighs the keyword spotting task more highly than the other task, and that weighs utterances of a keyword more highly than utterances of other speech. In some applications, a keyword spotter configured with the acoustic parameters is used for trigger or wake word detection.
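The abstract's measure of quality can be read as a convex combination of a keyword-spotting loss and a second-task loss, with utterances of the keyword up-weighted inside the keyword-spotting term. A minimal sketch of that reading follows; all names (`combined_quality`, `lam`, `kw_weight`) and the exact averaging scheme are illustrative assumptions, not the patent's implementation:

```python
def combined_quality(kws_losses, lvcsr_losses, is_keyword, lam=0.7, kw_weight=2.0):
    """Weighted combined measure of quality (lower is better).

    lam is the task weighting parameter: it weighs the keyword-spotting
    (KWS) contribution against the LVCSR contribution. kw_weight up-weighs
    utterances that actually contain the keyword within the KWS term.
    """
    # Per-utterance weights inside the KWS task: keyword utterances count more.
    w = [kw_weight if k else 1.0 for k in is_keyword]
    kws = sum(wi * li for wi, li in zip(w, kws_losses)) / sum(w)
    lvcsr = sum(lvcsr_losses) / len(lvcsr_losses)
    # Weighted combination of the two task contributions.
    return lam * kws + (1.0 - lam) * lvcsr
```

Setting `lam` above 0.5 corresponds to the abstract's option of weighing the keyword-spotting task more highly than the other task.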
24 Claims
1. A method for automated detection of a trigger word in an automated speech recognition system, the method comprising:
during a training stage, accepting acoustic training data and corresponding transcription training data, the acoustic training data including processed audio input for a first set of utterances for a trigger word detection task and processed audio input for a second set of utterances for a large vocabulary speech recognition task, and executing a computer-implemented acoustic parameter training procedure to determine acoustic configuration parameters that yield a best measure of quality of the acoustic configuration parameters on multiple tasks, the multiple tasks including the trigger word detection task and the large vocabulary speech recognition task, wherein the acoustic configuration parameters comprise parameters of an artificial neural network comprising a plurality of layers, each layer comprising a plurality of units, and wherein a first part of the acoustic configuration parameters is associated with both the trigger word detection task and the large vocabulary speech recognition task and characterizes multiple layers of the artificial neural network, a second part of the acoustic configuration parameters is associated with the trigger word detection task, and a third part of the acoustic configuration parameters is associated with the large vocabulary speech recognition task, and each of the second and third parts characterizes at least one layer of the artificial neural network;

after the training stage, storing the first part and the second part of the acoustic configuration parameters for configuration of the automated speech recognition system for use during a runtime stage in a user device associated with a user; and

during the runtime stage, at the user device, processing a signal representing an acoustic input monitored at the user device in an environment, the acoustic input including an utterance of a trigger word by the user, the processing including processing successive sections of the signal to form successive numerical feature vectors representing the acoustic content of respective sections of the acoustic input, mapping each successive numerical feature vector to form a corresponding state distribution using the first part and the second part of the acoustic configuration parameters, without using the third part of the acoustic configuration parameters, the mapping including computing the state distribution using a numerical computation of the artificial neural network that includes the values of the first part and the second part of the acoustic configuration parameters, and using the state distributions as input to a keyword spotter to detect the utterance of the trigger word in the acoustic input;

wherein the measure of quality of the acoustic configuration parameters includes a first contribution associated with the trigger word detection task and a second contribution associated with the large vocabulary speech recognition task, and wherein the measure of quality comprises a weighted combination of the first and the second contributions according to a task weighting parameter.

Dependent claims: 2.
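Claim 1's runtime mapping can be pictured as a forward pass through the shared trunk layers (the first part of the parameters) followed by the trigger-word output layer (the second part), with the LVCSR head (the third part) never stored on the device. A toy pure-Python sketch under those assumptions; the layer shapes, helper names, and the use of ReLU and softmax are illustrative, not taken from the patent:

```python
import math

def relu(v):
    return [max(0.0, x) for x in v]

def dense(v, weights, bias):
    # One fully connected layer: weights is a list of rows, bias a list.
    return [sum(wi * xi for wi, xi in zip(row, v)) + b
            for row, b in zip(weights, bias)]

def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def state_distribution(feature_vec, shared, kws_head):
    """Map one numerical feature vector to a state distribution.

    shared: list of (weights, bias) pairs -- the first part of the
            acoustic configuration parameters (trunk layers shared
            by both tasks).
    kws_head: one (weights, bias) pair -- the second part, specific
            to the trigger word detection task. The third part
            (LVCSR head) is deliberately absent.
    """
    h = feature_vec
    for w, b in shared:          # first part: shared trunk layers
        h = relu(dense(h, w, b))
    w, b = kws_head              # second part: trigger-word output layer
    return softmax(dense(h, w, b))
```

The resulting distributions would then be fed, frame by frame, to the keyword spotter as in the final limitation of the claim.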
3. A method for automated speech recognition comprising storing acoustic configuration parameters for configuration of an automated speech recognition system for detection of a keyword in an acoustic input using the automated speech recognition system configured with the acoustic configuration parameters, the acoustic configuration parameters having been determined by a first process comprising:
accepting acoustic training data and corresponding transcription training data, wherein the acoustic training data includes data for a first set of utterances associated with a keyword spotting task and data for a second set of utterances associated with a large vocabulary continuous speech recognition (LVCSR) task;

executing an acoustic parameter training procedure to determine acoustic configuration parameters from both the first set of utterances and the second set of utterances according to a value of a combined measure of quality of the acoustic configuration parameters on multiple tasks including the keyword spotting task and the LVCSR task, the value of the combined measure of quality being determined from both the first set of utterances and the second set of utterances; and

keyword spotting using the acoustic configuration parameters, the keyword spotting including processing, after the determining of the acoustic configuration parameters, data representing an acoustic input monitored in an environment to detect spoken occurrences of a keyword in the acoustic input;

wherein the acoustic configuration parameters include a first part of the acoustic configuration parameters associated with both the keyword spotting task and the LVCSR task and that depends on both the first set of utterances and the second set of utterances, and a second part of the acoustic configuration parameters associated with the keyword spotting task; and

wherein the combined measure of quality of the acoustic configuration parameters includes a first contribution associated with the trigger word detection task and a second contribution associated with the large vocabulary speech recognition task, and wherein the measure of quality comprises a weighted combination of the first and the second contributions according to a task weighting parameter.

Dependent claims: 4-16.
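The training procedure of claim 3 implies that the shared part of the parameters receives gradient contributions from both tasks, while each task-specific part is updated only by its own task, all under the task weighting parameter. A toy illustration with scalar "parts" and quadratic stand-in losses; everything here (the targets, `lam`, `lr`, the function name `train`) is an assumption for demonstration, not the patent's procedure:

```python
def train(lam=0.7, lr=0.1, steps=200):
    """Gradient descent on lam*(shared+kws-1)^2 + (1-lam)*(shared+lvcsr-3)^2.

    shared stands for the first part of the acoustic configuration
    parameters, kws for the second part, lvcsr for the third part.
    """
    shared, kws, lvcsr = 5.0, 5.0, 5.0
    for _ in range(steps):
        # Gradients of the two toy task losses.
        g_kws = 2.0 * (shared + kws - 1.0)      # KWS task wants shared+kws == 1
        g_lvcsr = 2.0 * (shared + lvcsr - 3.0)  # LVCSR task wants shared+lvcsr == 3
        # The shared part is trained by BOTH tasks, weighted by lam.
        shared -= lr * (lam * g_kws + (1.0 - lam) * g_lvcsr)
        # Each task-specific part is trained by its own task only.
        kws -= lr * lam * g_kws
        lvcsr -= lr * (1.0 - lam) * g_lvcsr
    return shared, kws, lvcsr
```

After training, both task objectives are satisfied simultaneously because the shared part settles at a compromise that each head then corrects for its own task.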
17. A method for automated speech recognition comprising:
storing acoustic configuration parameters in a storage, the acoustic configuration parameters including a first part of the acoustic configuration parameters associated with a keyword detection task and an LVCSR task, a second part of the acoustic configuration parameters associated with the keyword detection task, and a third part of the acoustic configuration parameters associated with the LVCSR task;
processing successive sections of an acoustic input received at a device to form corresponding successive first numerical representations of the sections;
processing the successive numerical representations using the acoustic configuration parameters to form successive first distributions and successive second distributions, each numerical representation being associated with a corresponding first distribution of the successive first distributions and with a corresponding second distribution of the successive second distributions,

wherein processing a numerical representation of the successive numerical representations to form a first distribution of the successive first distributions uses the first part of the acoustic configuration parameters and the second part of the acoustic configuration parameters and not the third part of the acoustic configuration parameters, and

wherein processing a numerical representation of the successive numerical representations to form a second distribution of the successive second distributions uses the first part of the acoustic configuration parameters and the third part of the acoustic configuration parameters and not the second part of the acoustic configuration parameters; and
performing the keyword detection task using the successive first distributions as input, and performing the LVCSR task using the successive second distributions as input,

wherein the acoustic configuration parameters comprise parameters of an artificial neural network comprising a plurality of layers, each layer comprising a plurality of units, and wherein the first part of the acoustic configuration parameters characterizes multiple layers of the artificial neural network, and the second part and the third part of the acoustic configuration parameters each characterize at least one layer of the artificial neural network.

Dependent claims: 18-22.
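In claim 17 each feature vector passes through the shared first part once and then through both task-specific heads, yielding the first distribution (parts 1 + 2) and the second distribution (parts 1 + 3) per frame. A pure-Python sketch under those assumptions; dimensions, helper names, and the single shared layer are illustrative only:

```python
import math

def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def linear(v, weights, bias):
    return [sum(wi * xi for wi, xi in zip(row, v)) + b
            for row, b in zip(weights, bias)]

def both_distributions(frames, shared, kws_head, lvcsr_head):
    """For each successive numerical representation, form the first
    (keyword detection) and second (LVCSR) distributions.

    shared, kws_head, lvcsr_head are each a (weights, bias) pair standing
    for the first, second, and third parts of the parameters.
    """
    out = []
    for f in frames:
        # Shared computation: the first part is evaluated once per frame.
        h = [max(0.0, x) for x in linear(f, *shared)]
        first = softmax(linear(h, *kws_head))      # uses parts 1 + 2 only
        second = softmax(linear(h, *lvcsr_head))   # uses parts 1 + 3 only
        out.append((first, second))
    return out
```

This mirrors the claim's point that the two tasks share the trunk computation while each output distribution excludes the other task's head.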
23. Software stored on non-transitory machine-readable media, the software comprising instructions executable by one or more processors to:
accept acoustic training data and corresponding transcription training data, wherein the acoustic training data includes data for a first set of utterances associated with a keyword spotting task and data for a second set of utterances associated with a large vocabulary continuous speech recognition (LVCSR) task;

execute an acoustic parameter training procedure to determine acoustic configuration parameters from both the first set of utterances and the second set of utterances according to a value of a combined measure of quality of the acoustic configuration parameters on multiple tasks including the keyword spotting task and the LVCSR task, the value of the combined measure of quality being determined from both the first set of utterances and the second set of utterances; and

use the determined acoustic configuration parameters for automated speech recognition to detect a keyword in an acoustic input;

wherein the acoustic configuration parameters include a first part of the acoustic configuration parameters associated with both the keyword spotting task and the LVCSR task and that depends on both the first set of utterances and the second set of utterances, and a second part of the acoustic configuration parameters associated with the keyword spotting task; and

wherein the combined measure of quality of the acoustic configuration parameters includes a first contribution associated with the trigger word detection task and a second contribution associated with the large vocabulary speech recognition task, and wherein the measure of quality comprises a weighted combination of the first and the second contributions according to a task weighting parameter.
24. A speech recognition system comprising:
a storage for training data; and

a computer-implemented trainer configured to accept acoustic training data and corresponding transcription training data, wherein the acoustic training data includes data for a first set of utterances associated with a keyword spotting task and data for a second set of utterances associated with a large vocabulary continuous speech recognition (LVCSR) task, and to execute an acoustic parameter training procedure to determine acoustic configuration parameters from both the first set of utterances and the second set of utterances according to a value of a combined measure of quality of the acoustic configuration parameters on multiple tasks including the keyword spotting task and the LVCSR task, the value of the combined measure of quality being determined from both the first set of utterances and the second set of utterances; and

a computer-implemented automated speech recognition system configured to use the determined acoustic configuration parameters to detect a keyword in an acoustic input;

wherein the acoustic configuration parameters include a first part of the acoustic configuration parameters associated with both the keyword spotting task and the LVCSR task and that depends on both the first set of utterances and the second set of utterances, and a second part of the acoustic configuration parameters associated with the keyword spotting task; and

wherein the combined measure of quality of the acoustic configuration parameters includes a first contribution associated with the trigger word detection task and a second contribution associated with the large vocabulary speech recognition task, and wherein the measure of quality comprises a weighted combination of the first and the second contributions according to a task weighting parameter.