Enhanced endpoint detection for speech recognition
First Claim
1. A method for reducing latency in speech recognition, the method comprising:
receiving audio input data representing an utterance;
performing automatic speech recognition (ASR) processing on the audio input data to generate ASR output;
determining a first ending to the utterance in the audio input data at a first time corresponding to non-speech detected in the audio input data;
determining a first portion of the ASR output, the first portion corresponding to the audio input data up to the first ending;
providing the first portion of the ASR output to a natural language understanding (NLU) module to obtain a first NLU result;
storing the first NLU result;
determining a second ending to the user's speech in the audio input data at a second time after the first time;
determining a second portion of the ASR output, the second portion corresponding to the audio input data up to the second ending;
comparing the first portion to the second portion; and
(1) if the first portion is the same as the second portion, initiating a first action to be executed on a first device, the first action based on the first NLU result, and (2) if the first portion is not the same as the second portion:
discarding the first NLU result, providing the second ASR output to the NLU module to obtain a second NLU result, and initiating a second action to be executed on the first device, the second action based on the second NLU result.
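The claim above describes a speculative pipeline: NLU runs on an early ASR hypothesis while the final endpoint is still being confirmed, and the cached result is used only if the hypothesis does not change. A minimal sketch of that control flow, assuming hypothetical helper callables (`run_asr`, `run_nlu`, `detect_endpoint`, `execute_action` are stand-ins, not identifiers from the patent):

```python
def recognize_with_early_endpoint(audio, detect_endpoint, run_asr,
                                  run_nlu, execute_action):
    """Speculative endpointing flow sketched from claim 1.

    `audio` is a sequence of frames; `detect_endpoint(audio, start)`
    returns a frame index for the next ending after `start`.
    """
    asr_output = run_asr(audio)

    # First (early) ending: the first stretch of non-speech.
    first_end = detect_endpoint(audio, start=0)
    first_portion = asr_output[:first_end]

    # Speculatively run NLU on the early hypothesis and cache the result.
    first_nlu = run_nlu(first_portion)

    # Second (final) ending, determined at a later time.
    second_end = detect_endpoint(audio, start=first_end)
    second_portion = asr_output[:second_end]

    if first_portion == second_portion:
        # Early hypothesis confirmed: the cached NLU result is ready,
        # hiding the NLU latency that would otherwise follow endpointing.
        return execute_action(first_nlu)

    # Hypothesis changed: discard the cached result and redo NLU.
    second_nlu = run_nlu(second_portion)
    return execute_action(second_nlu)
```

The latency saving comes from overlapping NLU with the tail of silence detection rather than serializing the two stages.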
Abstract
Determining the end of an utterance for purposes of automatic speech recognition (ASR) may be improved with a system that provides early results and/or incorporates semantic tagging. Early ASR results of an incoming utterance may be prepared based at least in part on an estimated endpoint and processed by a natural language understanding (NLU) process while final results, based at least in part on a final endpoint, are determined. If the early results match the final results, the early NLU results are already prepared for early execution. The endpoint may also be determined based at least in part on the content of the utterance, as represented by semantic tagging output from ASR processing. If the tagging indicates completion of a logical statement, an endpoint may be declared, or a threshold for silent frames prior to declaring an endpoint may be adjusted.
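The abstract's second idea, adjusting the silence threshold when semantic tags signal a complete statement, can be sketched as follows. The tag names and frame counts are illustrative assumptions, not values from the patent:

```python
# Hypothetical tags that an ASR semantic tagger might emit when the
# recognized text forms a complete logical statement.
COMPLETE_TAGS = {"<command_end>", "<query_end>"}

def silence_threshold_frames(semantic_tags, base=30, short=10):
    """Return how many consecutive silent frames to require before
    declaring an endpoint.

    If the most recent semantic tag marks a complete logical statement,
    the threshold is lowered so the endpoint is declared sooner; an
    incomplete statement keeps the longer, more conservative threshold.
    """
    if semantic_tags and semantic_tags[-1] in COMPLETE_TAGS:
        return short
    return base
```

A real system would feed this threshold into its voice-activity detector, so "play some music <command_end>" endpoints quickly while "play some ..." waits longer for the user to finish.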
22 Claims
1. A method for reducing latency in speech recognition, the method comprising:
receiving audio input data representing an utterance;
performing automatic speech recognition (ASR) processing on the audio input data to generate ASR output;
determining a first ending to the utterance in the audio input data at a first time corresponding to non-speech detected in the audio input data;
determining a first portion of the ASR output, the first portion corresponding to the audio input data up to the first ending;
providing the first portion of the ASR output to a natural language understanding (NLU) module to obtain a first NLU result;
storing the first NLU result;
determining a second ending to the user's speech in the audio input data at a second time after the first time;
determining a second portion of the ASR output, the second portion corresponding to the audio input data up to the second ending;
comparing the first portion to the second portion; and
(1) if the first portion is the same as the second portion, initiating a first action to be executed on a first device, the first action based on the first NLU result, and (2) if the first portion is not the same as the second portion:
discarding the first NLU result, providing the second ASR output to the NLU module to obtain a second NLU result, and initiating a second action to be executed on the first device, the second action based on the second NLU result.
- View Dependent Claims (2, 3, 4)
5. A system comprising:
at least one processor coupled to a memory, the memory including instructions operable to be executed by the at least one processor to perform a set of actions, configuring the at least one processor:
to receive audio input data, including a speech utterance;
to perform automatic speech recognition (ASR) processing on the audio input data to generate ASR output;
to determine a first ending of the utterance in the audio input data;
to generate a first speech recognition output including a first portion of the ASR output that corresponds to the audio input data from a beginning of the audio input data up to the first ending;
to process the first speech recognition output to obtain a first speech processing result; and
(1) if the audio input data after the first ending does not include speech, to initiate a first action to be executed on a first device based at least in part on the first speech processing result, and (2) if the audio input data after the first ending includes speech:
to discard the first speech processing result, to determine a second ending of the utterance in the audio input data after the first ending, to generate a second speech recognition output including a second portion of the ASR output that corresponds to the audio input data from the beginning of the audio input data up to the second ending, to process the second speech recognition output to obtain a second speech processing result, and to initiate a second action that is different from the first action to be executed on the first device based at least in part on the second speech processing result.
- View Dependent Claims (6, 7, 8, 9, 10, 11)
12. A computer-implemented method comprising:
receiving audio input data including speech;
performing automatic speech recognition (ASR) processing on the audio input data to obtain a speech recognition output;
determining, based at least in part on the speech recognition output, a likelihood that the speech recognition output includes a complete user command;
determining, based at least in part on the likelihood, a threshold length of non-speech that is processed before indicating an ending of the speech;
determining an ending to the speech at a time after the threshold length of non-speech is detected in the audio input data; and
after determining the ending, initiating an action to be executed on a first device based at least in part on the speech recognition output and the ending.
- View Dependent Claims (13, 14, 15)
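Claim 12 maps a completeness likelihood to a non-speech threshold and then endpoints once that much contiguous non-speech is observed. A minimal sketch of both steps, with assumed frame counts and a linear likelihood-to-threshold mapping (neither is specified by the patent):

```python
def threshold_from_likelihood(p_complete, min_frames=10, max_frames=50):
    """Map the likelihood that the output is a complete command to a
    required run of non-speech frames: the more likely the command is
    complete, the less silence is needed before endpointing."""
    p = min(max(p_complete, 0.0), 1.0)  # clamp to [0, 1]
    return round(max_frames - p * (max_frames - min_frames))

def find_ending(is_speech_frames, threshold):
    """Return the frame index at which `threshold` consecutive
    non-speech frames have been observed, or None if that never
    happens within the input."""
    run = 0
    for i, is_speech in enumerate(is_speech_frames):
        run = 0 if is_speech else run + 1
        if run >= threshold:
            return i
    return None
```

With 10 ms frames, a likelihood of 1.0 would endpoint after 100 ms of silence and a likelihood of 0.0 after 500 ms, under the assumed bounds.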
16. A system, comprising:
at least one processor coupled to a memory, the memory including instructions operable to be executed by the at least one processor to perform a set of actions, configuring the at least one processor:
to receive audio input data including speech;
to perform automatic speech recognition (ASR) processing on the audio input data to obtain a speech recognition output;
to determine, based at least in part on the speech recognition output, a likelihood that the speech recognition output includes a complete user command;
to determine, based at least in part on the likelihood, a threshold length of non-speech that is processed before indicating an ending of the speech;
to determine an ending of an utterance in the audio input data after the threshold length of non-speech is detected in the audio input data; and
to, after determining the ending, initiate an action to be executed on a first device based at least in part on the speech recognition output and the ending.
- View Dependent Claims (17, 18, 19, 20, 21, 22)
Specification