Low latency and memory efficient keyword spotting
First Claim
1. A system comprising:
- a computer-readable memory storing executable instructions; and
one or more processors in communication with the computer-readable memory, wherein the one or more processors are programmed by the executable instructions to at least:
obtain a stream of audio data regarding an utterance of a user;
generate a sequence of feature vectors based at least partly on the audio data;
update a keyword score based at least partly on a first score indicating a probability that a particular feature vector of the sequence of feature vectors corresponds to a model for a keyword;
update a background score based at least partly on a second score indicating a probability that the particular feature corresponds to a background model;
generate traceback data regarding a relationship between the particular feature vector and a prior feature vector of the sequence of feature vectors;
determine a difference between a time associated with the traceback data and a time associated with previously-stored traceback data;
determine, based on the difference exceeding an expected maximum length of time to utter the keyword, to overwrite the previously-stored traceback data in memory;
overwrite the previously-stored traceback data in memory with the traceback data; and
determine, using the traceback data, that the stream of audio data likely comprises the keyword based at least partly on a difference between the keyword score and the background score.
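The overwrite steps recited above can be illustrated with a fixed-size buffer. This is only a minimal sketch under stated assumptions, not the claimed implementation: the names `Traceback`, `TracebackBuffer`, `prev_state`, `max_keyword_ms`, and `frame_step_ms` are hypothetical, times are assumed to be integer milliseconds, and frames are assumed to arrive at a fixed step.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Traceback:
    frame_time_ms: int  # time of the frame this entry describes (assumed unit: ms)
    prev_state: int     # hypothetical back-pointer relating this frame to the prior frame


class TracebackBuffer:
    """Fixed-size store that reuses slots whose entries are too old to belong to a keyword."""

    def __init__(self, max_keyword_ms: int, frame_step_ms: int) -> None:
        self.max_keyword_ms = max_keyword_ms
        # One slot per frame of the longest expected keyword, plus one spare,
        # so a slot is only revisited after its entry has aged past the window.
        self.slots: List[Optional[Traceback]] = [None] * (max_keyword_ms // frame_step_ms + 1)
        self.next_slot = 0

    def store(self, entry: Traceback) -> bool:
        """Store entry; return True if previously stored traceback data was overwritten."""
        old = self.slots[self.next_slot]
        overwrote = False
        if old is not None:
            # Difference between the new entry's time and the stored entry's time.
            age_ms = entry.frame_time_ms - old.frame_time_ms
            if age_ms <= self.max_keyword_ms:
                # The old entry could still be part of a keyword; refuse to discard it.
                raise RuntimeError("stored traceback is still inside the keyword window")
            overwrote = True
        self.slots[self.next_slot] = entry
        self.next_slot = (self.next_slot + 1) % len(self.slots)
        return overwrote
```

With `max_keyword_ms=30` and `frame_step_ms=10` the buffer has four slots, so the fifth frame is the first to overwrite older data. Memory use stays constant no matter how long the audio stream runs.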
Abstract
Features are disclosed for spotting keywords in utterance audio data without requiring the entire utterance to first be processed. Likelihoods that a portion of the utterance audio data corresponds to the keyword may be compared to likelihoods that the portion corresponds to background audio (e.g., general speech and/or non-speech sounds). The difference in the likelihoods may be determined, and a keyword detection may be triggered when the difference exceeds a threshold, or shortly thereafter. Traceback information and other data may be stored during the process so that a second speech processing pass may be performed. For efficient management of system memory, traceback information may be stored only for those frames that may encompass a keyword; the traceback information for older frames may be overwritten by traceback information for newer frames.
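The comparison of keyword and background likelihoods described in the abstract might look roughly like the following. It is a simplified sketch, not the patented method: it assumes per-frame log-likelihoods are already available as `(keyword, background)` pairs, accumulates them as plain running sums, and uses the hypothetical names `spot_keyword`, `frame_scores`, and `threshold`.

```python
from typing import Iterable, Optional, Tuple


def spot_keyword(frame_scores: Iterable[Tuple[float, float]], threshold: float) -> Optional[int]:
    """Return the index of the first frame at which the keyword triggers, else None.

    frame_scores yields (keyword_log_likelihood, background_log_likelihood)
    per frame and is consumed lazily, so later frames are never read once
    the trigger fires.
    """
    keyword_score = 0.0
    background_score = 0.0
    for index, (kw_ll, bg_ll) in enumerate(frame_scores):
        keyword_score += kw_ll        # update the keyword score with this frame
        background_score += bg_ll     # update the background score with this frame
        if keyword_score - background_score > threshold:
            return index              # trigger without waiting for the end of the utterance
    return None
```

Because the function returns as soon as the difference exceeds the threshold, detection latency depends on where the keyword ends rather than on the length of the utterance.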
23 Claims
1. A system comprising: (recited in full under First Claim above)
Dependent claims: 2, 3, 18, 19
4. A computer-implemented method comprising:
generating, by a speech recognition system comprising one or more computing devices configured to execute specific instructions, a sequence of feature vectors based at least partly on audio data, the sequence of feature vectors comprising a first feature vector and a second feature vector;
processing, using a model for a keyword, the first feature vector to determine whether the audio data corresponds to the keyword, wherein the processing comprises generating traceback data regarding at least the first feature vector and the second feature vector;
determining a difference between a time associated with the traceback data and a time associated with previously-stored traceback data;
determining, based at least partly on the difference exceeding an expected maximum length of time to utter the keyword, to overwrite the previously-stored traceback data in memory;
overwriting the previously-stored traceback data in memory with the traceback data; and
determining, using the traceback data, that the audio data likely corresponds to audio of the keyword based at least partly on a difference between a keyword score and a background score.
Dependent claims: 5, 6, 7, 8, 9, 10, 20, 21
11. One or more non-transitory computer readable media comprising executable code that, when executed, cause one or more computing devices to perform a process comprising:
generating, by a speech recognition system comprising one or more computing devices configured to execute specific instructions, a sequence of feature vectors based at least partly on audio data, the sequence of feature vectors comprising a first feature vector and a second feature vector;
processing, using a model for a keyword, the first feature vector to determine whether the audio data corresponds to the keyword, wherein the processing comprises generating traceback data regarding at least the first feature vector and the second feature vector;
determining a difference between a time associated with the traceback data and a time associated with previously-stored traceback data;
determining, based at least partly on the difference exceeding an expected maximum length of time to utter the keyword, to overwrite the previously-stored traceback data in memory;
overwriting the previously-stored traceback data in memory with the traceback data; and
determining, using the traceback data, that the audio data likely corresponds to audio of the keyword based at least partly on a difference between a keyword score and a background score.
Dependent claims: 12, 13, 14, 15, 16, 17, 22, 23
Specification