Direction-based speech endpointing
First Claim
1. A computer-implemented method for determining an utterance endpoint during automatic speech recognition (ASR) processing, the method comprising:
receiving audio comprising speech;
determining audio data based on the audio;
determining a source direction corresponding to the audio data;
determining a duration associated with the audio data, wherein the duration indicates how long the audio has been continuously received from the source direction;
performing ASR processing on the audio data to determine:
a plurality of hypotheses, wherein each hypothesis of the plurality of hypotheses includes at least one word or a representation of at least one word potentially corresponding to the audio data, and
for each of the plurality of hypotheses, a respective probability that the respective hypothesis corresponds to an utterance represented in the audio data;
determining, for each of the plurality of hypotheses, a representation of a respective number of audio frames corresponding to non-speech immediately preceding a first point;
calculating, for each of the plurality of hypotheses, a respective weighted pause duration by multiplying the respective probability of a respective hypothesis by the respective number of audio frames of the respective hypothesis;
calculating a cumulative expected pause duration by summing the respective weighted pause durations for each of the plurality of hypotheses;
calculating an adjusted cumulative score using the cumulative expected pause duration; and
designating the first point as corresponding to a likely endpoint as a result of the adjusted cumulative score exceeding a first threshold.
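The expected-pause arithmetic in the claim above can be sketched as follows. The hypothesis list, probabilities, frame counts, and the threshold value are illustrative assumptions, not values from the patent, and the function names are invented for this sketch.

```python
# Sketch of the claimed endpoint test, assuming each ASR hypothesis carries
# a posterior probability and a count of trailing non-speech audio frames.
# All names and numbers here are illustrative, not from the patent itself.

def expected_pause_duration(hypotheses):
    """Sum of probability-weighted trailing-pause lengths over all hypotheses."""
    return sum(h["prob"] * h["trailing_nonspeech_frames"] for h in hypotheses)

def is_likely_endpoint(hypotheses, threshold):
    """Designate the current point a likely endpoint once the cumulative
    expected pause duration exceeds a threshold."""
    return expected_pause_duration(hypotheses) > threshold

# Illustrative N-best list: two hypotheses that have fallen silent for a while,
# and one that expects more words (few trailing non-speech frames).
nbest = [
    {"prob": 0.6, "trailing_nonspeech_frames": 40},
    {"prob": 0.3, "trailing_nonspeech_frames": 35},
    {"prob": 0.1, "trailing_nonspeech_frames": 5},
]

print(expected_pause_duration(nbest))          # 0.6*40 + 0.3*35 + 0.1*5 ≈ 35
print(is_likely_endpoint(nbest, threshold=30))
```

The point of weighting by hypothesis probability is that a low-probability hypothesis still expecting more words (small trailing pause) only mildly delays the endpoint decision, while agreement among high-probability hypotheses that speech has stopped drives the expected pause up quickly.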
Abstract
A system for determining an endpoint of an utterance during automatic speech recognition (ASR) processing that accounts for the direction and duration of the incoming speech. Beamformers of the ASR system may identify a source direction of the audio. The system may track the duration speech has been received from that source direction so that if speech is detected in another direction, the original source speech may be weighted differently for purposes of determining an endpoint of the utterance. Speech from a new direction may be discarded or treated like non-speech for purposes of determining an endpoint of speech from an original direction.
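The direction rule described in the abstract can be illustrated with a toy per-frame labeler. The frame representation (a direction index plus a speech flag) and the function name are assumptions made for this sketch, not the patent's actual data model.

```python
# Toy illustration of the abstract's direction rule: once an utterance is
# being tracked from one beamformer direction, speech frames arriving from a
# different direction are treated as non-speech for endpointing purposes.
# The (direction, is_speech) frame format is an assumption for this sketch.

def label_frames_for_endpointing(frames, tracked_direction):
    """Return per-frame speech/non-speech labels as seen by the endpointer."""
    labels = []
    for direction, is_speech in frames:
        # Speech only counts as speech if it comes from the tracked direction.
        labels.append(is_speech and direction == tracked_direction)
    return labels

# Direction 0 is the original talker; direction 2 is an interrupting talker.
frames = [(0, True), (0, True), (2, True), (2, True), (0, False), (0, False)]
print(label_frames_for_endpointing(frames, tracked_direction=0))
# → [True, True, False, False, False, False]
```

Under this gating, the interrupting talker's frames extend the original talker's apparent pause rather than resetting it, which is what lets the endpointer close out the original utterance.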
216 Citations
29 Claims
1. A computer-implemented method for determining an utterance endpoint during automatic speech recognition (ASR) processing, the method comprising:
receiving audio comprising speech;
determining audio data based on the audio;
determining a source direction corresponding to the audio data;
determining a duration associated with the audio data, wherein the duration indicates how long the audio has been continuously received from the source direction;
performing ASR processing on the audio data to determine:
a plurality of hypotheses, wherein each hypothesis of the plurality of hypotheses includes at least one word or a representation of at least one word potentially corresponding to the audio data, and
for each of the plurality of hypotheses, a respective probability that the respective hypothesis corresponds to an utterance represented in the audio data;
determining, for each of the plurality of hypotheses, a representation of a respective number of audio frames corresponding to non-speech immediately preceding a first point;
calculating, for each of the plurality of hypotheses, a respective weighted pause duration by multiplying the respective probability of a respective hypothesis by the respective number of audio frames of the respective hypothesis;
calculating a cumulative expected pause duration by summing the respective weighted pause durations for each of the plurality of hypotheses;
calculating an adjusted cumulative score using the cumulative expected pause duration; and
designating the first point as corresponding to a likely endpoint as a result of the adjusted cumulative score exceeding a first threshold.
View Dependent Claims (2, 3, 4)
5. A computer-implemented method comprising:
determining that received audio data corresponding to at least one utterance includes first audio data, wherein the first audio data corresponds to a first source direction;
performing automatic speech recognition processing on the first audio data to determine a first hypothesis including one or more of at least one first word or a representation of at least one first word potentially corresponding to the first audio data;
determining that a first portion of the first audio data corresponds to speech;
determining a first value representing a first time duration of the first portion of the first audio data;
determining a first duration weight factor based at least in part on the first value;
determining, in the first hypothesis, a representation of first non-speech, the first non-speech following the first portion of the first audio data;
determining a second value representing a second time duration of the first non-speech;
determining a first pause duration value by using the first duration weight factor to adjust the second value; and
determining an endpoint based at least in part on the first pause duration value.
View Dependent Claims (6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18)
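Claim 5's duration-weighted pause can be sketched as below. The weight schedule (the longer speech has continued, the more strongly a following pause counts toward an endpoint) is one plausible reading of the claim, and every name and constant is hypothetical.

```python
# Sketch of claim 5's pause adjustment under an assumed weight schedule:
# speech that has continued longer from the source direction makes a given
# trailing pause count more toward the endpoint decision. All constants
# (ramp, cap, threshold) are invented for illustration.

def duration_weight(speech_seconds, ramp_seconds=2.0, max_weight=2.0):
    """First duration weight factor, derived from how long speech has run."""
    return min(max_weight, 1.0 + speech_seconds / ramp_seconds)

def weighted_pause(speech_seconds, pause_seconds):
    """First pause duration value: the raw pause adjusted by the weight."""
    return duration_weight(speech_seconds) * pause_seconds

def endpoint_detected(speech_seconds, pause_seconds, threshold=0.8):
    return weighted_pause(speech_seconds, pause_seconds) >= threshold

# After 3 s of speech (weight capped at 2.0), a 0.5 s pause triggers:
print(endpoint_detected(3.0, 0.5))  # → True
# Early in the utterance (weight 1.0), the same pause does not:
print(endpoint_detected(0.0, 0.5))  # → False
```

The effect is that brief mid-sentence pauses early in an utterance are tolerated, while the same pause after a long stretch of speech is more readily taken as the end of the utterance.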
19. A computing system comprising:
at least one processor; and
a computer-readable medium encoded with instructions operable to be executed by the at least one processor to cause the computing system to perform a set of actions comprising:
determining that received audio data corresponding to at least one utterance includes first audio data, wherein the first audio data corresponds to a first source direction;
performing automatic speech recognition processing on the first audio data to determine a first hypothesis including one or more of at least one first word or a representation of at least one first word potentially corresponding to the first audio data;
determining that a first portion of the first audio data corresponds to speech;
determining a first value representing a first time duration of the first portion of the first audio data;
determining a first duration weight factor based at least in part on the first value;
determining, in the first hypothesis, a representation of first non-speech, the first non-speech following the first portion of the first audio data;
determining a second value representing a second time duration of the first non-speech;
determining a first pause duration value by using the first duration weight factor to adjust the second value; and
determining an endpoint based at least in part on the first pause duration value.
View Dependent Claims (20, 21, 22, 23, 24, 25, 26, 27, 28, 29)
Specification