Low latency and memory efficient keyword spotting

US 9,390,708 B1
Filed: 05/28/2013
Issued: 07/12/2016
Est. Priority Date: 05/28/2013
Status: Active Grant

Abstract

Features are disclosed for spotting keywords in utterance audio data without requiring the entire utterance to first be processed. Likelihoods that a portion of the utterance audio data corresponds to the keyword may be compared to likelihoods that the portion corresponds to background audio (e.g., general speech and/or non-speech sounds). The difference in the likelihoods may be determined, and a keyword detection may be triggered when the difference exceeds a threshold, or shortly thereafter. Traceback information and other data may be stored during the process so that a second speech processing pass may be performed. For efficient management of system memory, traceback information may be stored only for those frames that may encompass a keyword; the traceback information for older frames may be overwritten by traceback information for newer frames.
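
The comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the function name `first_trigger_frame`, the use of per-frame log-likelihoods, and the simple running sums are all assumptions made for illustration. A real keyword spotter would typically obtain these scores from HMM decoding rather than plain accumulation.

```python
from typing import Iterable, Optional, Tuple


def first_trigger_frame(
    frame_scores: Iterable[Tuple[float, float]],
    threshold: float,
) -> Optional[int]:
    """Return the index of the first frame at which the keyword score
    exceeds the background score by more than `threshold`, else None.

    Each item of `frame_scores` is a hypothetical pair
    (keyword_log_likelihood, background_log_likelihood) for one frame.
    """
    keyword_score = 0.0
    background_score = 0.0
    for index, (kw_loglik, bg_loglik) in enumerate(frame_scores):
        # Update both running scores with the current frame's evidence.
        keyword_score += kw_loglik
        background_score += bg_loglik
        # Trigger as soon as the difference exceeds the threshold,
        # without waiting for the rest of the utterance.
        if keyword_score - background_score > threshold:
            return index
    return None


if __name__ == "__main__":
    scores = [(-1.0, -1.0), (-0.5, -2.0), (-0.5, -2.0)]
    print(first_trigger_frame(scores, threshold=2.0))
```

Because the function consumes an iterable frame by frame, it can return before the stream ends, which is the low-latency property the abstract emphasizes.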

23 Claims

1. A system comprising:
- a computer-readable memory storing executable instructions; and
  
  one or more processors in communication with the computer-readable memory, wherein the one or more processors are programmed by the executable instructions to at least:
  
  obtain a stream of audio data regarding an utterance of a user;
  
  generate a sequence of feature vectors based at least partly on the audio data;
  
  update a keyword score based at least partly on a first score indicating a probability that a particular feature vector of the sequence of feature vectors corresponds to a model for a keyword;
  
  update a background score based at least partly on a second score indicating a probability that the particular feature corresponds to a background model;
  
  generate traceback data regarding a relationship between the particular feature vector and a prior feature vector of the sequence of feature vectors;
  
  determine a difference between a time associated with the traceback data and a time associated with previously-stored traceback data;
  
  determine, based on the difference exceeding an expected maximum length of time to utter the keyword, to overwrite the previously-stored traceback data in memory;
  
  overwrite the previously-stored traceback data in memory with the traceback data; and
  
  determine, using the traceback data, that the stream of audio data likely comprises the keyword based at least partly on a difference between the keyword score and the background score.
- Dependent claims (2, 3, 18, 19):
  - 2. The system of claim 1, wherein the model for the keyword comprises a Gaussian mixture model and a hidden Markov model.
  - 3. The system of claim 1, wherein the one or more processors are further programmed to determine an end of the keyword based at least in part on a largest difference between the keyword score and the background score over a period of time.
  - 18. The system of claim 1, wherein the stream of audio data comprises a sequence of frames of audio data, wherein the particular feature vector comprises information about acoustic features of a particular frame of the sequence of frames of audio data.
  - 19. The system of claim 18, wherein the time associated with the traceback data comprises a first window of time corresponding to a first subset of the sequence of frames, and wherein the time associated with the previously-stored traceback data comprises a second window of time, prior to the first window of time, corresponding to a second subset of the sequence of frames.
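
The overwrite decision recited in claim 1 can be sketched as follows. This is an illustrative reading, not the claimed system: the names `TracebackStore`, `Slot`, and `max_keyword_seconds` are hypothetical, times are assumed to be in seconds, and appending a new slot when no stored data is old enough is an assumption loosely analogous to the "allocating a new memory block" language of the dependent method claims.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class Slot:
    time: float  # time associated with the stored traceback data
    data: Any    # the traceback data itself (opaque in this sketch)


class TracebackStore:
    """Keeps traceback data only for frames that could still be part of a keyword."""

    def __init__(self, max_keyword_seconds: float) -> None:
        # Expected maximum length of time to utter the keyword.
        self.max_keyword_seconds = max_keyword_seconds
        self.slots: List[Slot] = []

    def store(self, time: float, data: Any) -> bool:
        """Store new traceback data. Return True if an old slot was overwritten."""
        if self.slots:
            oldest = min(range(len(self.slots)), key=lambda i: self.slots[i].time)
            # Difference between the new data's time and the stored data's time.
            if time - self.slots[oldest].time > self.max_keyword_seconds:
                # Too old to belong to a keyword ending now: reuse its memory.
                self.slots[oldest] = Slot(time, data)
                return True
        # Nothing is stale yet, so the store grows by one slot (assumption).
        self.slots.append(Slot(time, data))
        return False


if __name__ == "__main__":
    store = TracebackStore(max_keyword_seconds=1.0)
    for t in (0.0, 0.5, 1.2):
        print(t, store.store(t, data=f"traceback@{t}"))
```

With this policy the number of slots stays bounded by roughly the number of frames that fit in one keyword duration, regardless of how long the audio stream runs.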

4. A computer-implemented method comprising:
- generating, by a speech recognition system comprising one or more computing devices configured to execute specific instructions, a sequence of feature vectors based at least partly on audio data, the sequence of feature vectors comprising a first feature vector and a second feature vector;
  
  processing, using a model for a keyword, the first feature vector to determine whether the audio data corresponds to the keyword, wherein the processing comprises generating traceback data regarding at least the first feature vector and the second feature vector;
  
  determining a difference between a time associated with the traceback data and a time associated with previously-stored traceback data;
  
  determining, based at least partly on the difference exceeding an expected maximum length of time to utter the keyword, to overwrite the previously-stored traceback data in memory;
  
  overwriting the previously-stored traceback data in memory with the traceback data; and
  
  determining, using the traceback data, that the audio data likely corresponds to audio of the keyword based at least partly on a difference between a keyword score and a background score.
- Dependent claims (5, 6, 7, 8, 9, 10, 20, 21):
  - 5. The computer-implemented method of claim 4, wherein the traceback data indicates, for a first state associated with the first feature vector, a second state that preceded the first state.
  - 6. The computer-implemented method of claim 4, further comprising:
    - generating additional traceback data regarding at least a third feature vector and a fourth feature vector;
      
      determining that a difference, between a time associated with the traceback data and a time associated with the additional traceback data, does not exceed the expected maximum length of time of the keyword; and
      
      allocating a new memory block.
  - 7. The computer-implemented method of claim 6, further comprising storing the additional traceback data in the new memory block.
  - 8. The computer-implemented method of claim 4, wherein the keyword score is based at least partly on a likelihood that the audio data corresponds to a model for the keyword, and wherein the background score is based at least partly on a likelihood that the audio data corresponds to a background model.
  - 9. The computer-implemented method of claim 4, wherein determining that the audio data corresponds to the keyword comprises:
    - identifying a largest difference between the keyword score and the background score in a window encompassing a plurality of feature vectors; and
      
      determining that the largest difference exceeds a threshold.
  - 10. The computer-implemented method of claim 4, further comprising determining, based at least partly on a second speech recognition pass using the traceback data, whether the audio data corresponds to the keyword.
  - 20. The computer-implemented method of claim 4, wherein generating traceback data comprises:
    - generating a pointer from a current state to a state immediately preceding the current state; and
      
      generating a likelihood score indicating a likelihood that a portion of audio data associated with the feature vector corresponds to the current state.
  - 21. The computer-implemented method of claim 4, wherein the audio data comprises a sequence of frames of a recording of an utterance, wherein the time associated with the traceback data comprises a first window of time corresponding to a first subset of the sequence of frames, and wherein the time associated with the previously-stored traceback data comprises a second window of time, prior to the first window of time, corresponding to a second subset of the sequence of frames.
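
The windowed decision of claim 9 (find the largest keyword/background difference in a window, then compare it to a threshold) can be sketched in Python. The function name `detect_in_window` and the representation of scores as parallel per-frame lists are assumptions for illustration only. Returning the index of the largest difference also loosely reflects claim 3, where the largest difference is used to determine the end of the keyword.

```python
from typing import Optional, Sequence


def detect_in_window(
    keyword_scores: Sequence[float],
    background_scores: Sequence[float],
    threshold: float,
) -> Optional[int]:
    """Return the frame index with the largest (keyword - background)
    difference in the window if that difference exceeds `threshold`,
    otherwise None.
    """
    differences = [k - b for k, b in zip(keyword_scores, background_scores)]
    if not differences:
        return None
    # Identify the largest difference in the window.
    best_index = max(range(len(differences)), key=differences.__getitem__)
    # Decide based on whether that largest difference exceeds the threshold.
    if differences[best_index] > threshold:
        return best_index
    return None


if __name__ == "__main__":
    print(detect_in_window([1.0, 4.0, 3.0], [1.0, 1.0, 2.0], threshold=2.0))
```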

11. One or more non-transitory computer readable media comprising executable code that, when executed, cause one or more computing devices to perform a process comprising:
- generating, by a speech recognition system comprising one or more computing devices configured to execute specific instructions, a sequence of feature vectors based at least partly on audio data, the sequence of feature vectors comprising a first feature vector and a second feature vector;
  
  processing, using a model for a keyword, the first feature vector to determine whether the audio data corresponds to the keyword, wherein the processing comprises generating traceback data regarding at least the first feature vector and the second feature vector;
  
  determining a difference between a time associated with the traceback data and a time associated with previously-stored traceback data;
  
  determining, based at least partly on the difference exceeding an expected maximum length of time to utter the keyword, to overwrite the previously-stored traceback data in memory;
  
  overwriting the previously-stored traceback data in memory with the traceback data; and
  
  determining, using the traceback data, that the audio data likely corresponds to audio of the keyword based at least partly on a difference between a keyword score and a background score.
- Dependent claims (12, 13, 14, 15, 16, 17, 22, 23):
  - 12. The one or more non-transitory computer readable media of claim 11, wherein the traceback data indicates, for a first state associated with the first feature vector, a second state that preceded the first state.
  - 13. The one or more non-transitory computer readable media of claim 11, wherein the process further comprises:
    - generating additional traceback data regarding at least a third feature vector and a fourth feature vector;
      
      determining that a difference, between a time associated with the traceback data and a time associated with the additional traceback data, does not exceed the expected maximum length of time of the keyword; and
      
      allocating a new memory block.
  - 14. The one or more non-transitory computer readable media of claim 13, wherein the process further comprises storing the additional traceback data in the new memory block.
  - 15. The one or more non-transitory computer readable media of claim 11, wherein the keyword score is based at least partly on a likelihood that the audio data corresponds to a model for the keyword, and wherein the background score is based at least partly on a likelihood that the audio data corresponds to a background model.
  - 16. The one or more non-transitory computer readable media of claim 11, wherein determining that the audio data corresponds to the keyword comprises:
    - identifying a largest difference between the keyword score and the background score in a window encompassing a plurality of feature vectors; and
      
      determining that the largest difference exceeds a threshold.
  - 17. The one or more non-transitory computer readable media of claim 11, wherein the process further comprises determining, based at least partly on a second speech recognition pass using the traceback data, whether the audio data corresponds to the keyword.
  - 22. The one or more non-transitory computer-readable media of claim 11, wherein generating traceback data comprises:
    - generating a pointer from a current state to a state immediately preceding the current state; and
      
      generating a likelihood score indicating a likelihood that a portion of audio data associated with the feature vector corresponds to the current state.
  - 23. The one or more non-transitory computer-readable media of claim 11, wherein the audio data comprises a sequence of frames of a recording of an utterance, wherein the time associated with the traceback data comprises a first window of time corresponding to a first subset of the sequence of frames, and wherein the time associated with the previously-stored traceback data comprises a second window of time, prior to the first window of time, corresponding to a second subset of the sequence of frames.
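
Claims 20 and 22 describe traceback data as a pointer to the immediately preceding state plus a likelihood score for the current state. One way to picture that structure, and how a later pass might walk it, is the sketch below. The names `TracebackEntry` and `backtrace`, the integer state identifiers, and the per-frame dictionary layout are all hypothetical choices, not details taken from the patent.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class TracebackEntry:
    # Pointer to the state immediately preceding the current state
    # (None for the first frame, which has no predecessor).
    previous_state: Optional[int]
    # Likelihood score that this frame's audio corresponds to the current state.
    likelihood: float


def backtrace(frames: List[Dict[int, TracebackEntry]], final_state: int) -> List[int]:
    """Follow predecessor pointers from `final_state` in the last frame
    back to the first frame and return the state path in time order.
    """
    path: List[int] = []
    state: Optional[int] = final_state
    for frame in reversed(frames):
        if state is None:
            break
        path.append(state)
        state = frame[state].previous_state
    path.reverse()
    return path


if __name__ == "__main__":
    frames = [
        {0: TracebackEntry(None, -1.0)},
        {0: TracebackEntry(0, -2.0), 1: TracebackEntry(0, -1.5)},
        {1: TracebackEntry(1, -2.0), 2: TracebackEntry(1, -1.8)},
    ]
    print(backtrace(frames, final_state=2))
```

Storing only a predecessor pointer and a score per state keeps each frame's traceback record small, which fits the memory-efficiency goal stated in the abstract.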

Current Assignee: Amazon Technologies, Inc. (Amazon.com, Inc.)
Original Assignee: Amazon Technologies, Inc. (Amazon.com, Inc.)
Inventors: Hoffmeister, Bjorn
Primary Examiner(s): Shah, Paras D
Assistant Examiner(s): Thomas-Homescu, Anne
Application Number: US13/903,814
Time in Patent Office: 1,141 Days
Field of Search: 704/256, 704/238, 704/242, 704/240, 704/252, 704/231, 704/251, 704/254, 704/236, 704/277, 704/256.5, 704/270.1, 704/241, 704/275, 704/234
US Class Current: 1/1
CPC Class Codes:
G10L 15/02   Feature extraction for spee...
G10L 15/08   Speech classification or se...
G10L 15/142   Hidden Markov Models [HMMs]
G10L 15/22   Procedures used during a sp...
G10L 2015/088   Word spotting
G10L 2015/223   Execution procedure of a sp...
