Method and apparatus for recognition-based barge-in detection in the context of subword-based automatic speech recognition
Abstract
Robust, multi-faceted sub-word method for rapidly and reliably detecting a barge-in condition of a speaker talking while an automated audio prompt is being played. This sub-word method allows for rapid stopping of the prompt to improve automatic speech recognition and reduce speaker confusion and/or frustration. An automatic speech recognition system (ASR) that practices such a method is also presented.
16 Claims
1. A method comprising the steps of:
a. determining if a speech utterance has started, if an utterance has not started then obtaining next frame and re-running step a, otherwise continuing to step b;
b. obtaining a speech frame of the speech utterance that represents a frame period that is next in time;
c. extracting features from the speech frame;
d. computing likelihood scores for all active sub-word models for the present frame of speech;
e. performing dynamic programming to build a speech recognition network of likely sub-word paths;
f. performing a beam search using the speech recognition network;
g. updating a decoding tree of the speech utterance after the beam search;
h. finding the best scoring sub-word path of said likely sub-word paths and determining a number of sub-words in said best scoring sub-word path;
i. determining if said best scoring sub-word path has a sub-word length greater than a minimum number of sub-words and if the best scoring path is greater proceeding to step j, otherwise returning to step b;
j. determining if recorded root is a sub-string of best path and if recorded root is not a sub-string of best path recording best path as recorded root and returning to step b, otherwise proceeding to step k;
k. determining if the recorded root has remained stable for a threshold number of additional sub-words and if said root of said best scoring path has not remained stable for the threshold number returning to step b otherwise proceeding to step l;
l. declaring barge-in;
m. disabling any prompt that is playing; and
n. backtracking through the best scoring path to obtain a string having a greatest likelihood of corresponding to the utterance; and
outputting the string.
in parallel with step i, determining if a number of sub-words in said best path exceeds a maximum number of sub-words, and if said maximum number has been exceeded proceeding to step l and if said maximum number has not been exceeded returning to step b.
4. The method of claim 3, further comprising the step of:
in parallel with step i, determining if a speech endpoint has been reached, if yes said speech endpoint has been reached then begin backtracking to obtain recognized string and declaring barge-in and proceeding to step m, and if no said speech endpoint has not been reached then proceeding to step b.
5. The method of claim 1, further comprising the step of:
in parallel with step i, determining if a speech endpoint has been reached, if yes said speech endpoint has been reached then begin backtracking to obtain recognized string and declaring barge-in and proceeding to step m, and if no said speech endpoint has not been reached then proceeding to step b.
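For illustration only, and not as part of the claims, the decision logic of steps i through l, together with the maximum-length and endpoint checks of the dependent claims, might be sketched as follows. The class name `BargeInDetector`, its parameter names, and the default values are hypothetical. The sketch also assumes that "sub-string" means the recorded root is a prefix of the current best path, and that "stable for a threshold number of additional sub-words" means the best path has grown by at least that many sub-words beyond the root.

```python
from typing import List, Optional, Sequence


class BargeInDetector:
    """Hypothetical sketch of the root-stability test (steps i to l)."""

    def __init__(self, min_subwords: int = 3, stable_subwords: int = 2,
                 max_subwords: int = 12) -> None:
        self.min_subwords = min_subwords        # step i minimum (assumed value)
        self.stable_subwords = stable_subwords  # step k threshold (assumed value)
        self.max_subwords = max_subwords        # maximum-length check (assumed value)
        self.root: Optional[List[str]] = None   # the "recorded root"

    def update(self, best_path: Sequence[str],
               endpoint_reached: bool = False) -> bool:
        """Return True when barge-in should be declared for this frame."""
        n = len(best_path)
        # Parallel checks: speech endpoint reached, or path longer than maximum.
        if endpoint_reached or n > self.max_subwords:
            return True
        # Step i: the best path must exceed the minimum number of sub-words.
        if n <= self.min_subwords:
            return False
        path = list(best_path)
        # Step j: if the recorded root is not a prefix of the best path,
        # record the best path as the new root and keep decoding.
        if self.root is None or path[:len(self.root)] != self.root:
            self.root = path
            return False
        # Step k: the root is stable once enough additional sub-words follow it.
        return n - len(self.root) >= self.stable_subwords


# Usage: feed the best-scoring sub-word path after each decoded frame.
detector = BargeInDetector(min_subwords=2, stable_subwords=2, max_subwords=10)
frames = [
    ["k", "ao"],
    ["k", "ao", "l"],
    ["k", "ao", "l", "m"],
    ["k", "ao", "l", "m", "ay"],
]
decisions = [detector.update(path) for path in frames]
```

In this usage the detector records the root at the second frame and reports barge-in only at the fourth frame, once two further sub-words have extended the unchanged root.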
6. A method for speech recognition using barge-in comprising the steps of:
a. determining if a speech utterance has started, if an utterance has not started then returning to the beginning of step a, otherwise continuing to step b;
b. getting a speech frame that represents a frame period that is next in time;
c. extracting features from the speech frame;
d. using the features extracted from the present speech frame to score sub-word models of a speech recognition grammar;
e. dynamically programming an active network of sub-word sequences using a Viterbi algorithm;
f. pruning unlikely sub-word sequences and extending likely sub-word sequences to update the active network;
g. updating a decoding tree to said likely sub-word sequences;
h. finding the best scoring sub-word path of said likely sub-word paths and determining a number of sub-words in said best scoring sub-word path;
i. determining if said best scoring sub-word path has a sub-word length greater than a minimum number of sub-words and if the best scoring path is greater proceeding to step j, otherwise returning to step b;
j. determining if recorded root is a sub-string of best path and if recorded root is not a sub-string of best path recording best path as recorded root and returning to step b, otherwise proceeding to step k;
k. determining if the recorded root has remained stable for a threshold number of additional sub-words and if said root of said best scoring path has not remained stable for the threshold number returning to step b otherwise proceeding to step l;
l. declaring barge-in;
m. disabling any prompt that is playing; and
n. outputting the string corresponding to said best scoring path.
in parallel with step i, determining if a number of sub-words in said best path exceeds a maximum number of sub-words, and if said maximum number has been exceeded proceeding to step l and if said maximum number has not been exceeded returning to step b.
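As a non-limiting illustration of steps d through f of claim 6, one frame of Viterbi dynamic programming followed by beam pruning might look like the sketch below. The function name `viterbi_beam_step`, the dictionary-based data layout, and the log-probability values are assumptions made for the example and are not taken from the specification.

```python
from typing import Dict, List, Tuple


def viterbi_beam_step(
    active: Dict[str, float],
    transitions: Dict[str, List[Tuple[str, float]]],
    frame_loglik: Dict[str, float],
    beam: float,
) -> Dict[str, float]:
    """Advance the active network by one frame and prune it.

    active       maps each active sub-word state to its best log score so far
    transitions  maps a state to (next_state, transition log probability) pairs
    frame_loglik maps a state to the log likelihood of the present frame
    beam         keeps only states scoring within this margin of the best one
    """
    extended: Dict[str, float] = {}
    for state, score in active.items():
        for nxt, trans_lp in transitions.get(state, []):
            candidate = score + trans_lp + frame_loglik[nxt]
            # Viterbi recursion: keep only the best score entering each state.
            if candidate > extended.get(nxt, float("-inf")):
                extended[nxt] = candidate
    if not extended:
        return {}
    best = max(extended.values())
    # Beam pruning: drop sequences that fall too far below the best score.
    return {s: v for s, v in extended.items() if v >= best - beam}


# Usage with made-up scores: state "b" falls outside the beam and is pruned.
pruned = viterbi_beam_step(
    active={"a": 0.0},
    transitions={"a": [("a", -1.0), ("b", -2.0)]},
    frame_loglik={"a": -1.0, "b": -10.0},
    beam=5.0,
)
```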
9. The method of claim 8, wherein step h further comprises:
examining all viable sub-word sequences contained in the decoding tree for the present speech frame;
traversing through pointers that are associated with sub-word sequences of the decoding tree; and
counting a number of sub-words in the best scoring sub-word sequence path.
10. The method of claim 9, wherein only pointers that are associated with sub-word sequences of the decoding tree that have speech content are traversed.
11. The method of claim 6, wherein step h further comprises:
examining all viable sub-word sequences contained in the decoding tree for the present speech frame;
traversing through pointers that are associated with sub-word sequences of the decoding tree; and
counting a number of sub-words in the best scoring sub-word sequence path.
12. The method of claim 11, wherein only pointers that are associated with sub-word sequences of the decoding tree that have speech content are traversed.
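By way of example only, the pointer traversal and sub-word count recited in claims 9 through 12 might be sketched as follows. The `Node` structure, its field names, and the `is_speech` flag used to skip entries without speech content (such as silence) are hypothetical and are not drawn from the specification.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """One entry of a decoding tree (hypothetical layout)."""
    label: str                       # sub-word identity, e.g. a phoneme symbol
    score: float                     # accumulated log score of the path ending here
    parent: Optional["Node"] = None  # back-pointer to the preceding sub-word
    is_speech: bool = True           # False for silence or other non-speech entries


def best_subword_path(active_leaves: List[Node]) -> List[str]:
    """Return the sub-word labels on the best-scoring path, oldest first."""
    # Examine all viable sequences for the present frame and pick the best.
    best = max(active_leaves, key=lambda n: n.score)
    labels: List[str] = []
    node: Optional[Node] = best
    # Traverse back-pointers, counting only entries that have speech content.
    while node is not None:
        if node.is_speech:
            labels.append(node.label)
        node = node.parent
    labels.reverse()
    return labels


# Usage: two competing hypotheses that share a leading silence entry.
sil = Node("sil", -1.0, None, is_speech=False)
k = Node("k", -3.0, sil)
ao = Node("ao", -5.0, k)
g = Node("g", -9.0, sil)
path = best_subword_path([ao, g])
subword_count = len(path)
```

The length of the returned list is the sub-word count that step i compares against the minimum number of sub-words.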
13. An apparatus for automatic speech recognition of a speech utterance to declare barge-in comprising:
means for determining if the speech utterance has started, means responsive to said speech utterance start determining means for obtaining a speech frame of the speech utterance that represents a frame period that is next in time;
means for extracting features from said speech frame;
means for performing dynamic programming to build a speech recognition network of likely sub-word paths;
means for performing a beam search using the speech recognition network;
means for updating a decoding tree of the speech utterance after the beam search;
means for finding the best scoring sub-word path of said likely sub-word paths and determining a number of sub-words in said best scoring sub-word path; and
means for determining if said best scoring sub-word path has a sub-word length greater than a minimum number of sub-words;
means responsive to a condition that the best scoring path is greater recording a root of a sub-word sequence corresponding to said best scoring path for determining if a count of times the recorded root has remained stable for a threshold number of additional sub-words;
means responsive to a condition of the root of said best scoring path has remained stable during at least the threshold number of additional phonemes and declaring barge-in and disabling any prompt that is playing when the recorded count exceeds the threshold number.
means for backtracking through the best scoring path to obtain a string having a greatest likelihood of corresponding to the utterance; and
outputting the string.
15. The apparatus of claim 14, wherein all said means comprise a system having a processor running a program stored in connected memory.
16. The apparatus of claim 13, wherein all said means comprise a system having a processor running a program stored in connected memory.
Specification