Enhanced endpoint detection for speech recognition
First Claim
1. A method for reducing latency in speech recognition, the method comprising:
receiving audio input data representing an utterance;
performing automatic speech recognition (ASR) processing on the audio input data to generate ASR output;
determining a first ending to the utterance in the audio input data at a first time corresponding to non-speech detected in the audio input data;
determining a first portion of the ASR output, the first portion corresponding to the audio input data up to the first ending;
providing the first portion of the ASR output to a natural language understanding (NLU) module to obtain a first NLU result;
storing the first NLU result;
determining a second ending to the user's speech in the audio input data at a second time after the first time;
determining a second portion of the ASR output, the second portion corresponding to the audio input data up to the second ending;
comparing the first portion to the second portion; and
(1) if the first portion is the same as the second portion, initiating a first action to be executed on a first device, the first action based on the first NLU result, and (2) if the first portion is not the same as the second portion:
discarding the first NLU result, providing the second ASR output to the NLU module to obtain a second NLU result, and initiating a second action to be executed on the first device, the second action based on the second NLU result.
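The claim above describes a speculative pipeline: NLU runs on an early ASR hypothesis while the final endpoint is still being confirmed, and the cached result is used only if the hypothesis does not change. A minimal sketch of that control flow, assuming hypothetical helper callables (`run_asr`, `run_nlu`, `detect_endpoint`, `execute_action` are stand-ins, not identifiers from the patent):

```python
def recognize_with_early_endpoint(audio, detect_endpoint, run_asr,
                                  run_nlu, execute_action):
    """Speculative endpointing flow sketched from claim 1.

    `audio` is a sequence of frames; `detect_endpoint(audio, start)`
    returns a frame index for the next ending after `start`.
    """
    asr_output = run_asr(audio)

    # First (early) ending: the first stretch of non-speech.
    first_end = detect_endpoint(audio, start=0)
    first_portion = asr_output[:first_end]

    # Speculatively run NLU on the early hypothesis and cache the result.
    first_nlu = run_nlu(first_portion)

    # Second (final) ending, determined at a later time.
    second_end = detect_endpoint(audio, start=first_end)
    second_portion = asr_output[:second_end]

    if first_portion == second_portion:
        # Early hypothesis confirmed: the cached NLU result is ready,
        # hiding the NLU latency that would otherwise follow endpointing.
        return execute_action(first_nlu)

    # Hypothesis changed: discard the cached result and redo NLU.
    second_nlu = run_nlu(second_portion)
    return execute_action(second_nlu)
```

The latency saving comes from overlapping NLU with the tail of silence detection rather than serializing the two stages.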
Abstract
Determining the end of an utterance for purposes of automatic speech recognition (ASR) may be improved with a system that provides early results and/or incorporates semantic tagging. Early ASR results of an incoming utterance may be prepared based at least in part on an estimated endpoint and processed by a natural language understanding (NLU) process while final results, based at least in part on a final endpoint, are determined. If the early results match the final results, the early NLU results are already prepared for early execution. The endpoint may also be determined based at least in part on the content of the utterance, as represented by semantic tagging output from ASR processing. If the tagging indicates completion of a logical statement, an endpoint may be declared, or a threshold for silent frames prior to declaring an endpoint may be adjusted.
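The abstract's second idea, adjusting the silence threshold when semantic tags signal a complete statement, can be sketched as follows. The tag names and frame counts are illustrative assumptions, not values from the patent:

```python
# Hypothetical tags that an ASR semantic tagger might emit when the
# recognized text forms a complete logical statement.
COMPLETE_TAGS = {"<command_end>", "<query_end>"}

def silence_threshold_frames(semantic_tags, base=30, short=10):
    """Return how many consecutive silent frames to require before
    declaring an endpoint.

    If the most recent semantic tag marks a complete logical statement,
    the threshold is lowered so the endpoint is declared sooner; an
    incomplete statement keeps the longer, more conservative threshold.
    """
    if semantic_tags and semantic_tags[-1] in COMPLETE_TAGS:
        return short
    return base
```

A real system would feed this threshold into its voice-activity detector, so "play some music <command_end>" endpoints quickly while "play some ..." waits longer for the user to finish.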
22 Claims
1. A method for reducing latency in speech recognition, the method comprising:
receiving audio input data representing an utterance;
performing automatic speech recognition (ASR) processing on the audio input data to generate ASR output;
determining a first ending to the utterance in the audio input data at a first time corresponding to non-speech detected in the audio input data;
determining a first portion of the ASR output, the first portion corresponding to the audio input data up to the first ending;
providing the first portion of the ASR output to a natural language understanding (NLU) module to obtain a first NLU result;
storing the first NLU result;
determining a second ending to the user's speech in the audio input data at a second time after the first time;
determining a second portion of the ASR output, the second portion corresponding to the audio input data up to the second ending;
comparing the first portion to the second portion; and
(1) if the first portion is the same as the second portion, initiating a first action to be executed on a first device, the first action based on the first NLU result, and (2) if the first portion is not the same as the second portion:
discarding the first NLU result, providing the second ASR output to the NLU module to obtain a second NLU result, and initiating a second action to be executed on the first device, the second action based on the second NLU result.
- View Dependent Claims (2, 3, 4)
5. A system comprising:
at least one processor coupled to a memory, the memory including instructions operable to be executed by the at least one processor to perform a set of actions, configuring the at least one processor:
to receive audio input data, including a speech utterance;
to perform automatic speech recognition (ASR) processing on the audio input data to generate ASR output;
to determine a first ending of the utterance in the audio input data;
to generate a first speech recognition output including a first portion of the ASR output that corresponds to the audio input data from a beginning of the audio input data up to the first ending;
to process the first speech recognition output to obtain a first speech processing result; and
(1) if the audio input data after the first ending does not include speech, to initiate a first action to be executed on a first device based at least in part on the first speech processing result, and (2) if the audio input data after the first ending includes speech:
to discard the first speech processing result, to determine a second ending of the utterance in the audio input data after the first ending, to generate a second speech recognition output including a second portion of the ASR output that corresponds to the audio input data from the beginning of the audio input data up to the second ending, to process the second speech recognition output to obtain a second speech processing result, and to initiate a second action that is different from the first action to be executed on the first device based at least in part on the second speech processing result.
- View Dependent Claims (6, 7, 8, 9, 10, 11)
12. A computer-implemented method comprising:
receiving audio input data including speech;
performing automatic speech recognition (ASR) processing on the audio input data to obtain a speech recognition output;
determining, based at least in part on the speech recognition output, a likelihood that the speech recognition output includes a complete user command;
determining, based at least in part on the likelihood, a threshold length of non-speech that is processed before indicating an ending of the speech;
determining an ending to the speech at a time after the threshold length of non-speech is detected in the audio input data; and
after determining the ending, initiating an action to be executed on a first device based at least in part on the speech recognition output and the ending.
- View Dependent Claims (13, 14, 15)
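Claim 12 maps a completeness likelihood to a non-speech threshold and then endpoints once that much contiguous non-speech is observed. A minimal sketch of both steps, with assumed frame counts and a linear likelihood-to-threshold mapping (neither is specified by the patent):

```python
def threshold_from_likelihood(p_complete, min_frames=10, max_frames=50):
    """Map the likelihood that the output is a complete command to a
    required run of non-speech frames: the more likely the command is
    complete, the less silence is needed before endpointing."""
    p = min(max(p_complete, 0.0), 1.0)  # clamp to [0, 1]
    return round(max_frames - p * (max_frames - min_frames))

def find_ending(is_speech_frames, threshold):
    """Return the frame index at which `threshold` consecutive
    non-speech frames have been observed, or None if that never
    happens within the input."""
    run = 0
    for i, is_speech in enumerate(is_speech_frames):
        run = 0 if is_speech else run + 1
        if run >= threshold:
            return i
    return None
```

With 10 ms frames, a likelihood of 1.0 would endpoint after 100 ms of silence and a likelihood of 0.0 after 500 ms, under the assumed bounds.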
16. A system, comprising:
at least one processor coupled to a memory, the memory including instructions operable to be executed by the at least one processor to perform a set of actions, configuring the at least one processor:
to receive audio input data including speech;
to perform automatic speech recognition (ASR) processing on the audio input data to obtain a speech recognition output;
to determine, based at least in part on the speech recognition output, a likelihood that the speech recognition output includes a complete user command;
to determine, based at least in part on the likelihood, a threshold length of non-speech that is processed before indicating an ending of the speech;
to determine an ending of an utterance in the audio input data after the threshold length of non-speech is detected in the audio input data; and
to, after determining the ending, initiate an action to be executed on a first device based at least in part on the speech recognition output and the ending.
- View Dependent Claims (17, 18, 19, 20, 21, 22)
Specification