Keyword spotting using multi-task configuration

  • US 10,304,440 B1
  • Filed: 06/30/2016
  • Issued: 05/28/2019
  • Est. Priority Date: 07/10/2015
  • Status: Active Grant
First Claim

1. A method for automated detection of a trigger word in an automated speech recognition system, the method comprising:

  • during a training stage,

      • accepting acoustic training data and corresponding transcription training data, the acoustic training data including processed audio input for a first set of utterances for a trigger word detection task and processed audio input for a second set of utterances for a large vocabulary speech recognition task, and

      • executing a computer-implemented acoustic parameter training procedure to determine acoustic configuration parameters that yield a best measure of quality of the acoustic configuration parameters on multiple tasks, the multiple tasks including the trigger word detection task and the large vocabulary speech recognition task, the acoustic configuration parameters comprise parameters of an artificial neural network comprising a plurality of layers, each layer comprising a plurality of units, and wherein a first part of the acoustic configuration parameters is associated with both the trigger word detection task and the large vocabulary speech recognition task and characterizes multiple layers of the artificial neural network, and a second part of the acoustic configuration parameters is associated with the trigger word detection task, and a third part of the acoustic configuration parameters is associated with the large vocabulary speech recognition task, and each of the second and third parts characterize at least one layer of the artificial neural network;

  • after the training stage, storing the first part and the second part of the acoustic configuration parameters for configuration of the automated speech recognition system for use during a runtime stage in a user device associated with a user; and

  • during the runtime stage, at the user device,

      • processing a signal representing an acoustic input monitored at the user device in an environment, the acoustic input including an utterance of a trigger word by the user, the processing including processing successive sections of the signal to form successive numerical feature vectors representing the acoustic content of respective sections of the acoustic input,

      • mapping each successive numerical feature vector to form a corresponding state distribution using the first part and the second part of the acoustic configuration parameters, without using the third part of the acoustic configuration parameters, the mapping including computing the state distribution using a numerical computation of the artificial neural network that includes the values of the first part and the second part of the acoustic configuration parameters, and

      • using the state distributions as input to a keyword spotter to detect the utterance of the trigger word in the acoustic input;

    wherein the measure of quality of the acoustic configuration parameters includes a first contribution associated with trigger word detection task and a second contribution associated with the large vocabulary speech recognition task, and wherein the measure of quality comprises a weighted combination of the first and the second contribution according to a task weighting parameter.
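
The parameter split recited in the claim can be illustrated with a small sketch. It is not the patented implementation: the layer sizes, the softmax output, the cross-entropy contributions, and the convex form `gamma * kw + (1 - gamma) * lv` of the weighted combination are all assumptions made for illustration, as are the names `shared`, `kw_head`, `lv_head`, and `gamma`. The claim itself only requires a weighted combination according to a task weighting parameter.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Return (weights, biases) for one fully connected layer (hypothetical sizes)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x, layer, relu=True):
    """Apply one layer; ReLU is an assumed hidden nonlinearity."""
    weights, biases = layer
    out = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]
    return [max(0.0, v) for v in out] if relu else out


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def state_distribution(x, shared, head):
    """Map a feature vector to a state distribution via shared layers plus one head."""
    for layer in shared:
        x = dense(x, layer)
    return softmax(dense(x, head, relu=False))


def mean_cross_entropy(batch, shared, head):
    """Average negative log-probability of the labelled state over a batch."""
    total = sum(-math.log(state_distribution(x, shared, head)[t]) for x, t in batch)
    return total / len(batch)


def multitask_loss(kw_batch, lv_batch, shared, kw_head, lv_head, gamma):
    """Weighted combination of the two task contributions (assumed convex form)."""
    kw = mean_cross_entropy(kw_batch, shared, kw_head)  # trigger word contribution
    lv = mean_cross_entropy(lv_batch, shared, lv_head)  # large vocabulary contribution
    return gamma * kw + (1.0 - gamma) * lv


rng = random.Random(0)
shared = [init_layer(4, 8, rng), init_layer(8, 8, rng)]  # first part: multiple shared layers
kw_head = init_layer(8, 3, rng)    # second part: trigger word output layer
lv_head = init_layer(8, 10, rng)   # third part: large vocabulary output layer

# Toy labelled frames standing in for the two sets of training utterances.
kw_batch = [([rng.random() for _ in range(4)], rng.randrange(3)) for _ in range(5)]
lv_batch = [([rng.random() for _ in range(4)], rng.randrange(10)) for _ in range(5)]

loss = multitask_loss(kw_batch, lv_batch, shared, kw_head, lv_head, gamma=0.7)

# Runtime: only the first and second parts are stored and used on the device.
deployed = (shared, kw_head)
frame = [0.1, 0.2, 0.3, 0.4]
dist = state_distribution(frame, *deployed)
```

A training procedure would adjust all three parts to minimize `loss`; the optimizer is omitted here. At runtime `lv_head` is never touched, and `dist` is what would be passed to the keyword spotter.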
