Language model speech endpointing
First Claim
1. A computer-implemented method for determining an endpoint during automatic speech recognition (ASR) processing, the method comprising:
receiving audio data representing speech detected using a microphone of a mobile device;
performing ASR processing on the audio data to determine a plurality of hypotheses;
determining, for each of the plurality of hypotheses, a respective probability that the respective hypothesis corresponds to the audio data;
determining, for each of the plurality of hypotheses, a respective number of non-speech audio frames immediately preceding a first point in the audio data;
determining, for each of the plurality of hypotheses, a respective score by multiplying the probability of the respective hypothesis by a factor corresponding to the number of non-speech audio frames of the respective hypothesis;
determining a cumulative score by summing the respective scores for each of the plurality of hypotheses;
determining that the cumulative score exceeds a first threshold; and
designating the first point as corresponding to a likely endpoint as a result of the cumulative score exceeding the first threshold.
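The scoring steps of claim 1 can be sketched as a short computation: weight each hypothesis's trailing non-speech by that hypothesis's probability, sum across hypotheses, and compare to a threshold. The names below (`Hypothesis`, `FRAME_FACTOR`, `SCORE_THRESHOLD`) and the specific constant values are illustrative assumptions, not values specified by the claim.

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    probability: float               # probability the hypothesis matches the audio
    trailing_nonspeech_frames: int   # non-speech frames immediately preceding the point

FRAME_FACTOR = 0.01     # assumed per-frame weighting factor (claim leaves this open)
SCORE_THRESHOLD = 0.5   # assumed "first threshold"

def is_likely_endpoint(hypotheses):
    """Return True if the cumulative probability-weighted non-speech score
    exceeds the threshold, i.e. the first point is a likely endpoint."""
    cumulative = sum(
        h.probability * (h.trailing_nonspeech_frames * FRAME_FACTOR)
        for h in hypotheses
    )
    return cumulative > SCORE_THRESHOLD
```

For example, two hypotheses with probabilities 0.6 and 0.4 that have accumulated 80 and 100 trailing non-speech frames yield a cumulative score of 0.6·0.8 + 0.4·1.0 = 0.88, which exceeds the assumed threshold.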
Abstract
An automatic speech recognition (ASR) system detects an endpoint of an utterance using the active hypotheses under consideration by a decoder. The ASR system calculates the amount of non-speech detected by a plurality of hypotheses and weights the non-speech duration by the probability of each hypothesis. When the aggregate weighted non-speech exceeds a threshold, an endpoint may be declared.
87 Citations
22 Claims
1. A computer-implemented method for determining an endpoint during automatic speech recognition (ASR) processing, the method comprising:
receiving audio data representing speech detected using a microphone of a mobile device;
performing ASR processing on the audio data to determine a plurality of hypotheses;
determining, for each of the plurality of hypotheses, a respective probability that the respective hypothesis corresponds to the audio data;
determining, for each of the plurality of hypotheses, a respective number of non-speech audio frames immediately preceding a first point in the audio data;
determining, for each of the plurality of hypotheses, a respective score by multiplying the probability of the respective hypothesis by a factor corresponding to the number of non-speech audio frames of the respective hypothesis;
determining a cumulative score by summing the respective scores for each of the plurality of hypotheses;
determining that the cumulative score exceeds a first threshold; and
designating the first point as corresponding to a likely endpoint as a result of the cumulative score exceeding the first threshold.
View Dependent Claims (2, 3, 4)
5. A computer-implemented method, comprising:
receiving audio data;
performing speech recognition processing on the audio data to determine a plurality of hypotheses, each hypothesis of the plurality comprising at least one of a representation of a respective subword unit or a representation of a respective word, wherein the plurality of hypotheses includes:
a first hypothesis comprising a first representation of first non-speech preceding a first point in the audio data, and
a second hypothesis comprising a second representation of second non-speech preceding the first point in the audio data;
determining a first probability corresponding to the first hypothesis;
determining a second probability corresponding to the second hypothesis;
determining a first weighted duration value by using the first probability to adjust a first value representing a first time duration of the first non-speech;
determining a second weighted duration value by using the second probability to adjust a second value representing a second time duration of the second non-speech;
combining at least the first weighted duration value and the second weighted duration value to determine a third value representing an expected non-speech time duration preceding the first point in the audio data;
determining that the third value exceeds a threshold duration value; and
determining, based at least in part on determining that the third value exceeds the threshold duration value, that an endpoint of speech occurs at the first point in the audio data.
View Dependent Claims (6, 7, 8, 9, 10, 17, 18, 19)
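Claim 5 recasts the computation as an expected non-speech duration: each hypothesis's trailing pause length is weighted by its probability, and the weighted values are combined and compared to a threshold duration. A minimal sketch, assuming durations in seconds and an illustrative threshold (the claim does not fix either unit or value):

```python
def expected_pause_seconds(hypotheses):
    """hypotheses: list of (probability, trailing_nonspeech_seconds) pairs.
    Returns the probability-weighted (expected) pause duration preceding
    the current point in the audio."""
    return sum(p * d for p, d in hypotheses)

def endpoint_detected(hypotheses, threshold_s=0.8):
    # threshold_s is an assumed threshold duration value, not from the claim
    return expected_pause_seconds(hypotheses) > threshold_s
```

For instance, hypotheses with probabilities 0.7 and 0.3 seeing 1.2 s and 0.5 s of trailing non-speech give an expected pause of 0.99 s, which would trigger the assumed 0.8 s threshold; the same probabilities with 0.4 s and 0.2 s pauses (expected 0.34 s) would not.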
11. A computing system, comprising:
at least one processor;
a memory including instructions operable to be executed by the at least one processor to cause the computing system to perform a set of actions comprising:
receiving audio data;
performing speech recognition processing on the audio data to determine a plurality of hypotheses, each hypothesis of the plurality comprising at least one of a representation of a respective subword unit or a representation of a respective word, wherein the plurality of hypotheses includes:
a first hypothesis comprising a first representation of first non-speech preceding a first point in the audio data, and
a second hypothesis comprising a second representation of second non-speech preceding the first point in the audio data;
determining a first probability corresponding to the first hypothesis;
determining a second probability corresponding to the second hypothesis;
determining a first weighted duration value by using the first probability to adjust a first value representing a first time duration of the first non-speech;
determining a second weighted duration value by using the second probability to adjust a second value representing a second time duration of the second non-speech;
combining at least the first weighted duration value and the second weighted duration value to determine a third value representing an expected non-speech time duration preceding the first point in the audio data;
determining that the third value exceeds a threshold duration value; and
determining, based at least in part on determining that the third value exceeds the threshold duration value, that an endpoint of speech occurs at the first point in the audio data.
View Dependent Claims (12, 13, 14, 15, 16, 20, 21, 22)
Specification