LOW LATENCY AND MEMORY EFFICIENT KEYWORD SPOTTING

US 20170098442A1
Filed: 07/11/2016
Published: 04/06/2017
Est. Priority Date: 05/28/2013
Status: Active Grant

Abstract

Features are disclosed for spotting keywords in utterance audio data without requiring the entire utterance to first be processed. Likelihoods that a portion of the utterance audio data corresponds to the keyword may be compared to likelihoods that the portion corresponds to background audio (e.g., general speech and/or non-speech sounds). The difference in the likelihoods may be determined, and keyword detection may be triggered when the difference exceeds a threshold, or shortly thereafter. Traceback information and other data may be stored during the process so that a second speech processing pass may be performed. For efficient management of system memory, traceback information may be stored only for those frames that may encompass a keyword; the traceback information for older frames may be overwritten by traceback information for newer frames.
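
The memory-management idea in the abstract (keeping traceback information only for recent frames and overwriting older entries) can be illustrated with a fixed-capacity ring buffer. This is a minimal sketch, not the patented implementation: the names `TracebackEntry`, `TracebackRing`, and the `prev_state` field are hypothetical, and the capacity of 4 is chosen only to keep the example small.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TracebackEntry:
    frame_index: int  # absolute index of the frame in the audio stream
    prev_state: int   # hypothetical payload: best predecessor state for this frame


class TracebackRing:
    """Fixed-capacity store; entries for older frames are overwritten by newer ones."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._slots: List[Optional[TracebackEntry]] = [None] * capacity

    def put(self, entry: TracebackEntry) -> None:
        # The slot is chosen by frame index modulo capacity, so frame i
        # overwrites whatever was stored for frame i - capacity.
        self._slots[entry.frame_index % len(self._slots)] = entry

    def get(self, frame_index: int) -> Optional[TracebackEntry]:
        entry = self._slots[frame_index % len(self._slots)]
        # Return the entry only if it still belongs to the requested frame.
        if entry is not None and entry.frame_index == frame_index:
            return entry
        return None


ring = TracebackRing(capacity=4)
for i in range(6):
    ring.put(TracebackEntry(frame_index=i, prev_state=i * 10))
```

After six frames, entries for frames 0 and 1 have been overwritten by frames 4 and 5, while frames 2 through 5 remain retrievable. In practice the capacity would be set to cover the longest span of frames that may encompass a keyword.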

71 Citations

20 Claims

1. A system comprising:
- a computer-readable memory storing executable instructions; and
  
  one or more processors in communication with the computer-readable memory, wherein the one or more processors are programmed by the executable instructions to at least:
  
  obtain a sequence of feature vectors, wherein the sequence of feature vectors represents at least a portion of a stream of audio data;
  
  generate a keyword score based at least partly on a likelihood that a particular feature vector of the sequence of feature vectors represents audio data corresponding to a keyword;
  
  generate a background score based at least partly on a likelihood that the particular feature vector represents audio data corresponding to background audio;
  
  determine that a difference between the keyword score and the background score is greater than differences associated with feature vectors preceding the particular feature vector in a subset of the sequence of feature vectors, wherein the particular feature vector is in a center of the subset;
  
  determine that the difference is greater than differences associated with feature vectors subsequent to the particular feature vector in the subset; and
  
  generate data indicating the particular feature vector corresponds to an end of the keyword.
- Dependent claims (2, 3, 4, 5, 6, 7)
  - 2. The system of claim 1, wherein the one or more processors are further programmed by the executable instructions to determine that the particular feature vector corresponds to the end of the keyword based at least partly on the difference being greater than a threshold.
  - 3. The system of claim 1, wherein the one or more processors are further programmed by the executable instructions to at least determine a size of the subset based at least partly on an expected length of time for the keyword to be uttered.
  - 4. The system of claim 1, wherein the one or more processors are further programmed by the executable instructions to suppress, for a period of time, generation of second data indicating a second feature vector of the sequence of feature vectors, subsequent to the subset of the sequence of feature vectors, corresponds to an end of the keyword.
  - 5. The system of claim 1, wherein the executable instructions to generate the keyword score comprise instructions to generate the keyword score using a hidden Markov model of audio data that corresponds to the keyword, and wherein the executable instructions to generate the background score comprise instructions to generate the background score using a hidden Markov model of audio data that does not correspond to the keyword.
  - 6. The system of claim 1, wherein the one or more processors are further programmed by the executable instructions to generate traceback data linking the particular feature vector to a previous feature vector preceding the particular feature vector in the subset.
  - 7. The system of claim 6, wherein the one or more processors are further programmed by the executable instructions to determine, based at least partly on speech recognition processing using the traceback data, that the sequence of feature vectors represents audio data corresponding to the keyword.
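
The test recited in claims 1 and 2 (the difference for a frame exceeds the differences for every other frame in a subset centered on it, and also exceeds a threshold) can be sketched as a local-maximum search. The function name `find_keyword_end`, the `half_window` parameter, and the sample scores below are illustrative assumptions, not values from the patent.

```python
from typing import Optional, Sequence


def find_keyword_end(diffs: Sequence[float], half_window: int,
                     threshold: float) -> Optional[int]:
    """Return the first index whose difference is above the threshold and
    strictly greater than every other difference in its centered window."""
    for center in range(half_window, len(diffs) - half_window):
        value = diffs[center]
        if value <= threshold:
            continue
        before = diffs[center - half_window:center]
        after = diffs[center + 1:center + half_window + 1]
        if all(value > d for d in before) and all(value > d for d in after):
            return center
    return None


# Hypothetical per-frame log scores, for illustration only.
keyword_scores = [-9.0, -8.0, -6.0, -3.0, -1.0, -2.5, -5.0, -8.0, -9.0]
background_scores = [-4.0] * 9
diffs = [k - b for k, b in zip(keyword_scores, background_scores)]
end_frame = find_keyword_end(diffs, half_window=2, threshold=1.0)
```

With these sample scores the difference peaks at frame index 4, which is the only frame that is both above the threshold and larger than its two neighbors on each side.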

8. A computer-implemented method comprising:
- under control of one or more computing devices configured with specific computer-executable instructions,
  
  generating a first score based at least partly on a likelihood that a frame, of a window of sequential frames of audio data, comprises audio data corresponding to a keyword, wherein the window comprises the frame and an equal quantity of (1) frames before the frame and (2) frames after the frame;
  
  generating a second score based at least partly on a likelihood that the frame comprises audio data corresponding to background audio;
  
  determining a difference between the first score and the second score; and
  
  determining that the frame corresponds to an end of the keyword based at least partly on the difference being greater than differences determined for the frames before the frame, and differences determined for the frames after the frame.
- Dependent claims (9, 10, 11, 12, 13, 14)
  - 9. The computer-implemented method of claim 8, wherein determining that the frame corresponds to the end of the keyword is further based at least partly on the difference being greater than a threshold.
  - 10. The computer-implemented method of claim 8, further comprising determining a size of the window based at least partly on an expected length of time for the keyword to be uttered.
  - 11. The computer-implemented method of claim 8, further comprising suppressing, for a period of time, determining that a second frame, different than the frame, corresponds to an end of the keyword.
  - 12. The computer-implemented method of claim 8, wherein generating the first score comprises using a hidden Markov model of audio data that corresponds to the keyword, and wherein generating the second score comprises using a hidden Markov model of audio data that does not correspond to the keyword.
  - 13. The computer-implemented method of claim 8, further comprising generating traceback data linking the frame to a frame of the frames before the frame.
  - 14. The computer-implemented method of claim 13, further comprising confirming, based at least partly on speech recognition processing using the traceback data, that the frame corresponds to the end of the keyword.
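
Claims 8 through 11 can be read together as a streaming procedure: size the window from the expected keyword duration, test the center frame once enough later frames have arrived, and suppress further detections for a period after a hit. The sketch below is one possible arrangement under stated assumptions. The sizing rule (a window roughly as long as the expected keyword), the class `StreamingDetector`, and all parameter names are hypothetical; the claims do not specify them.

```python
from collections import deque
from typing import Deque, List, Optional


def window_half_size(expected_keyword_seconds: float,
                     frame_shift_seconds: float) -> int:
    """Assumed rule: the full window spans about one expected keyword length."""
    frames = int(round(expected_keyword_seconds / frame_shift_seconds))
    return frames // 2


class StreamingDetector:
    def __init__(self, half_window: int, threshold: float,
                 suppress_frames: int) -> None:
        self.half = half_window
        self.threshold = threshold
        self.suppress_frames = suppress_frames
        # Holds the center frame plus an equal number of frames on each side.
        self.window: Deque[float] = deque(maxlen=2 * half_window + 1)
        self.frames_seen = 0
        self.suppressed_until = -1  # detections at or before this index are dropped

    def push(self, diff: float) -> Optional[int]:
        """Feed one keyword-minus-background difference; return the index of
        a detected keyword end, or None."""
        self.window.append(diff)
        self.frames_seen += 1
        if len(self.window) < 2 * self.half + 1:
            return None
        center_index = self.frames_seen - 1 - self.half
        center = self.window[self.half]
        others = [d for i, d in enumerate(self.window) if i != self.half]
        is_peak = center > self.threshold and all(center > d for d in others)
        if is_peak and center_index > self.suppressed_until:
            self.suppressed_until = center_index + self.suppress_frames
            return center_index
        return None


def run(diffs: List[float], suppress_frames: int) -> List[int]:
    detector = StreamingDetector(half_window=1, threshold=2.0,
                                 suppress_frames=suppress_frames)
    hits = [detector.push(d) for d in diffs]
    return [h for h in hits if h is not None]


stream = [0.0, 1.0, 5.0, 1.0, 0.0, 1.0, 6.0, 1.0, 0.0, 0.0]
with_suppression = run(stream, suppress_frames=10)
without_suppression = run(stream, suppress_frames=0)
```

Because the center frame can only be judged after `half_window` later frames have arrived, a detection in this sketch is reported that many frames after the peak itself. The suppression period keeps the second peak in `stream` from producing a second detection.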

15. Non-transitory computer readable storage comprising executable instructions that, when executed, cause one or more computing devices to perform a process comprising:
- generating a first score based at least partly on a likelihood that a frame, of a window of sequential frames of audio data, comprises audio data corresponding to a keyword, wherein the window comprises the frame and an equal quantity of (1) frames before the frame and (2) frames after the frame;
  
  generating a second score based at least partly on a likelihood that the frame comprises audio data corresponding to background audio;
  
  determining a difference between the first score and the second score; and
  
  determining that the frame corresponds to an end of the keyword based at least partly on the difference being greater than differences determined for the frames before the frame, and differences determined for the frames after the frame.
- Dependent claims (16, 17, 18, 19, 20)
  - 16. The non-transitory computer readable storage of claim 15, wherein determining that the frame corresponds to the end of the keyword is further based at least partly on the difference being greater than a threshold.
  - 17. The non-transitory computer readable storage of claim 15, wherein the process further comprises determining a size of the window based at least partly on an expected length of time for the keyword to be uttered.
  - 18. The non-transitory computer readable storage of claim 15, wherein the process further comprises suppressing, for a period of time, determining that a second frame, different than the frame, corresponds to an end of the keyword.
  - 19. The non-transitory computer readable storage of claim 15, wherein generating the first score comprises using a hidden Markov model of audio data that corresponds to the keyword, and wherein generating the second score comprises using a hidden Markov model of audio data that does not correspond to the keyword.
  - 20. The non-transitory computer readable storage of claim 15, wherein the process further comprises confirming, based at least partly on speech recognition processing using traceback data linking the frame to a frame of the frames before the frame, that the frame corresponds to the end of the keyword.
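
Claims 5, 12, and 19 recite hidden Markov models for both the keyword score and the background score. As a rough illustration of how per-frame scores of that kind could be produced, the sketch below runs a log-domain Viterbi update over a left-to-right keyword model and accumulates a one-state background score. It is heavily simplified and entirely an assumption: the keyword path is forced to start at the first frame, the background model has a single state with no transition cost, and the emission log-likelihoods are made-up inputs rather than the output of a real acoustic model.

```python
import math
from typing import List, Optional, Sequence

NEG_INF = float("-inf")


def viterbi_step(prev: Sequence[float], emit: Sequence[float],
                 log_stay: float, log_advance: float) -> List[float]:
    """One frame of a left-to-right HMM: each state is reached either by
    staying in place or by advancing from the preceding state."""
    cur: List[float] = []
    for state, emission in enumerate(emit):
        stay = prev[state] + log_stay
        advance = prev[state - 1] + log_advance if state > 0 else NEG_INF
        cur.append(max(stay, advance) + emission)
    return cur


def score_differences(keyword_emissions: Sequence[Sequence[float]],
                      background_emissions: Sequence[float],
                      log_stay: float, log_advance: float) -> List[float]:
    """Per-frame (keyword score - background score).

    keyword_emissions[t][s]: log-likelihood of frame t under keyword state s.
    background_emissions[t]: log-likelihood of frame t under the background model.
    The keyword score at frame t is the best path ending in the final state.
    """
    prev: Optional[List[float]] = None
    background = 0.0
    diffs: List[float] = []
    for emit, bg in zip(keyword_emissions, background_emissions):
        if prev is None:
            # Simplification: the keyword path must begin in state 0 at frame 0.
            prev = [emit[0]] + [NEG_INF] * (len(emit) - 1)
        else:
            prev = viterbi_step(prev, emit, log_stay, log_advance)
        background += bg
        diffs.append(prev[-1] - background)
    return diffs


# Hypothetical two-state keyword model over four frames.
half = math.log(0.5)
kw = [[-1.0, -5.0], [-1.0, -5.0], [-5.0, -1.0], [-5.0, -1.0]]
bg = [-3.0, -3.0, -3.0, -3.0]
hmm_diffs = score_differences(kw, bg, log_stay=half, log_advance=half)
```

The resulting differences could then be passed to a windowed peak test of the kind described in the independent claims. A practical spotter would also let the keyword model be entered at any frame rather than only at the start of the stream.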

Current Assignee
Amazon Technologies, Inc. (Amazon.com, Inc.)
Original Assignee
Amazon Technologies, Inc. (Amazon.com, Inc.)
Inventors
Hoffmeister, Bjorn

Granted Patent

US 9,852,729 B2
CPC Class Codes

G10L 15/02   Feature extraction for spee...

G10L 15/08   Speech classification or se...

G10L 15/142   Hidden Markov Models [HMMs]

G10L 15/22   Procedures used during a sp...

G10L 2015/088   Word spotting

G10L 2015/223   Execution procedure of a sp...
