System and method for multimodal utterance detection
Abstract
The disclosure describes a system and method for detecting one or more segments of desired speech utterances from an audio stream using timings of events from other modes that are correlated to the timings of the desired segments of speech. The redundant information from other modes results in highly accurate and robust utterance detection.
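The core idea in the abstract, selecting speech segments by correlating their timings with events from another mode, can be sketched as follows. This is a minimal illustration and not the patent's implementation: the `Segment` type, the `select_by_events` function, the 0.3-second `tolerance`, and the sample timings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """A candidate speech segment with start and end times in seconds."""
    start: float
    end: float


def select_by_events(segments, event_times, tolerance=0.3):
    """Keep segments whose span, padded by `tolerance`, contains an event.

    `event_times` stands in for timings from another mode (for example,
    key presses). The padding value is an assumption for illustration.
    """
    selected = []
    for seg in segments:
        lo, hi = seg.start - tolerance, seg.end + tolerance
        if any(lo <= t <= hi for t in event_times):
            selected.append(seg)
    return selected


# Hypothetical data: three candidate segments, two key-press timestamps.
candidates = [Segment(0.2, 0.9), Segment(1.5, 2.1), Segment(3.0, 3.4)]
key_presses = [0.5, 3.1]
detected = select_by_events(candidates, key_presses)
```

Here the middle segment has no nearby event and is dropped, so it is treated as undesired audio rather than a detected utterance.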
23 Claims
1. A computer-implemented speech utterance detection method, the method comprising:
a) receiving a continuous stream of input;
b) generating at least one feature from the continuous stream of input;
c) generating a plurality of continuous stream segments using the at least one feature;
d) processing the continuous stream segments using characteristics of the continuous stream of input to create a plurality of candidate segments;
e) receiving a discrete set of inputs, wherein the discrete set of inputs comprise at least one multimodal input from a user interface, wherein each input of the discrete set is associated with a discrete timing information;
f) separating the candidate segments into at least one separated output, wherein separating comprises grouping a portion of the candidate segments into the at least one separated output by correlating at least one stream timing information associated with the candidate segments with at least one discrete timing information; and
g) outputting the separated output from the continuous stream of input, wherein the separated output represents a detected speech utterance, thereby selecting the detected speech utterance based on discrete timing information that is non-speech data.
Dependent claims: 2-15
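Steps b) through d) of claim 1 can be illustrated with a small sketch. The claim does not name a particular feature or segmentation rule, so everything here is an assumption chosen for illustration: frame energy as the feature, a fixed threshold for segmentation, and gap merging as the processing that yields candidate segments. The function names and sample values are hypothetical.

```python
def frame_energies(samples, frame_len):
    """Step b (assumed feature): mean squared amplitude per non-overlapping frame."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def segment_frames(energies, threshold):
    """Step c: runs of frames above `threshold`, as half-open (start, end) frame indices."""
    segments, start = [], None
    for i, energy in enumerate(energies):
        if energy > threshold and start is None:
            start = i
        elif energy <= threshold and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(energies)))
    return segments


def merge_close(segments, max_gap):
    """Step d (assumed rule): merge segments separated by at most `max_gap` frames."""
    merged = []
    for start, end in segments:
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged


# Hypothetical signal: loud frames at indices 1, 3, and 7 (two samples per frame).
samples = [0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0,
           0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0]
energies = frame_energies(samples, frame_len=2)
segments = segment_frames(energies, threshold=0.5)
candidate_segments = merge_close(segments, max_gap=1)
```

The resulting candidate segments carry frame indices that can be converted to times and then correlated with the discrete timing information of steps e) through g).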
16. A system comprising:
a memory for storing computer-readable instructions associated with speech detection; and
a processor programmed to execute the computer-readable instructions to enable the operation of the speech detection, wherein when the computer-readable instructions are executed, the speech detection is programmed to;
a) receive a continuous stream of input;
b) generate at least one feature from the continuous stream of input;
c) generate a plurality of continuous stream segments using the at least one feature;
d) process the continuous stream segments using characteristics of the continuous stream to create a plurality of candidate segments;
e) receive a discrete set of inputs, wherein the discrete set of inputs comprise at least one multimodal input from a user interface, wherein each input of the discrete set is associated with a discrete timing information;
f) separate the candidate segments into at least one separated output, wherein the speech detection is programmed to separate the candidate segments into separated output by correlating at least one stream timing information associated with the candidate segments with at least one discrete timing information; and
g) output the separated output, wherein the separated output represents a detected speech utterance, thereby selecting the detected speech utterance based on discrete timing information that is non-speech data.
Dependent claims: 17-23
Specification