System and method for multimodal utterance detection

US 9,922,640 B2
Filed: 02/03/2014
Issued: 03/20/2018
Est. Priority Date: 10/17/2008
Status: Active Grant

Abstract

The disclosure describes a system and method for detecting one or more segments of desired speech utterances in an audio stream, using the timings of events from other modes that are correlated with the timings of the desired speech segments. The redundant information from the other modes makes utterance detection highly accurate and robust.
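
As an illustration of the idea in the abstract, the following is a minimal Python sketch, not the patented implementation. It assumes that candidate speech segments are already available as start/stop times and that the other mode supplies key-press timestamps on the same clock. `Segment`, `select_segments`, and the 0.5-second `max_gap` tolerance are hypothetical names and values chosen for this example; the closest-match rule follows the wording of claim 9.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds from the start of the stream
    stop: float


def distance_to_segment(t: float, seg: Segment) -> float:
    """Return 0 if t falls inside the segment, else the gap to the nearer edge."""
    if seg.start <= t <= seg.stop:
        return 0.0
    return min(abs(t - seg.start), abs(t - seg.stop))


def select_segments(segments, event_times, max_gap=0.5):
    """For each discrete event, keep the closest segment within max_gap seconds."""
    chosen = []
    for t in event_times:
        if not segments:
            break
        best = min(segments, key=lambda s: distance_to_segment(t, s))
        if distance_to_segment(t, best) <= max_gap and best not in chosen:
            chosen.append(best)
    return sorted(chosen, key=lambda s: s.start)


# Three candidate segments from the audio stream, two key presses from the UI.
segments = [Segment(0.2, 0.9), Segment(2.0, 2.8), Segment(5.1, 5.6)]
key_presses = [2.1, 5.0]
selected = select_segments(segments, key_presses)
print(selected)
```

In this toy run the first segment has no nearby key press and is dropped, which is the sense in which non-speech timing selects the desired utterance.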

23 Claims

1. A computer-implemented speech utterance detection method, the method comprising:
- a) receiving a continuous stream of input;
  
  b) generating at least one feature from the continuous stream of input;
  
  c) generating a plurality of continuous stream segments using the at least one feature;
  
  d) processing the continuous stream segments using characteristics of the continuous stream of input to create a plurality of candidate segments;
  
  e) receiving a discrete set of inputs, wherein the discrete set of inputs comprise at least one multimodal input from a user interface, wherein each input of the discrete set is associated with a discrete timing information;
  
  f) separating the candidate segments into at least one separated output, wherein separating comprises grouping a portion of the candidate segments into the at least one separated output by correlating at least one stream timing information associated with the candidate segments with at least one discrete timing information; and
  
  g) outputting the separated output from the continuous stream of input, wherein the separated output represents a detected speech utterance, thereby selecting the detected speech utterance based on discrete timing information that is non-speech data.
  - 2. The computer-implemented method of claim 1, wherein the continuous stream of input includes an audio stream.
  - 3. The computer-implemented method of claim 1, wherein the continuous stream of input includes a brain wave recording.
  - 4. The computer-implemented method of claim 1, wherein the continuous stream of input includes a recording by an eye-tracking system of at least one object viewed by a user's eye.
  - 5. The computer-implemented method of claim 1, wherein one input of the discrete set of inputs comprises a touch event by a user.
  - 6. The computer-implemented method of claim 1, wherein one input of the discrete set of inputs comprises a gesture by a user.
  - 7. The computer-implemented method of claim 1, wherein one input of the discrete set of inputs comprises one of an eye track, a voice command, and brain wave detection via the user interface.
  - 8. The computer-implemented method of claim 1, wherein correlating the at least one stream timing information with the at least one discrete timing information comprises synchronizing the stream timing information and the discrete timing information.
  - 9. The computer-implemented method of claim 1, wherein correlating the at least one stream timing information with the at least one discrete timing information comprises determining a closest match between the stream timing information and the discrete timing information.
  - 10. The computer-implemented method of claim 1, wherein outputting the separated output from the continuous stream of input comprises outputting a speech utterance representative of the detected speech utterance.
  - 11. The computer-implemented method of claim 1, wherein outputting the separated output from the continuous stream of input comprises outputting a string of text representative of the detected speech utterance.
  - 12. The computer-implemented method of claim 1, wherein the selected speech segment is input for speech recognition.
  - 13. The computer-implemented method of claim 1, wherein generating the at least one feature from the continuous stream of input includes at least one of the following:
    - 1) signal processing to generate acoustic-phonetic or prosodic features,
      
      2) speech recognition processing to generate acoustic-linguistic or textual features.
  - 14. The computer-implemented method of claim 1, wherein generating continuous stream segments includes associating timing markers with the continuous stream segments, wherein the timing markers include start and stop times for each continuous stream segment.
  - 15. The computer-implemented method of claim 1, wherein processing the continuous stream segments using characteristics of the continuous stream of input to create candidate segments includes discarding portions of the continuous stream segments based on at least one of the following characteristics:
    - 1) discarded portion includes a short impulse of noise;
      
      2) discarded portion includes low energy;
      
      3) discarded portion lacks a desired format;
      
      4) discarded portion lacks a desired pitch; or
      
      5) discarded portion lacks a prosodic structure.

16. A system comprising:
- a memory for storing computer-readable instructions associated with speech detection; and
  
  a processor programmed to execute the computer-readable instructions to enable the operation of the speech detection, wherein when the computer-readable instructions are executed, the speech detection is programmed to:
  
  a) receive a continuous stream of input;
  
  b) generate at least one feature from the continuous stream of input;
  
  c) generate a plurality of continuous stream segments using the at least one feature;
  
  d) process the continuous stream segments using characteristics of the continuous stream to create a plurality of candidate segments;
  
  e) receive a discrete set of inputs, wherein the discrete set of inputs comprise at least one multimodal input from a user interface, wherein each input of the discrete set is associated with a discrete timing information;
  
  f) separate the candidate segments into at least one separated output, wherein the speech detection is programmed to separate the candidate segments into separated output by correlating at least one stream timing information associated with the candidate segments with at least one discrete timing information; and
  
  g) output the separated output, wherein the separated output represents a detected speech utterance, thereby selecting the detected speech utterance based on discrete timing information that is non-speech data.
  - 17. The system of claim 16, wherein the continuous stream of input includes an audio stream.
  - 18. The system of claim 16, wherein the continuous stream of input includes a recording by an eye-tracking system of at least one object viewed by a user's eye.
  - 19. The system of claim 16, wherein one input of the discrete set of inputs comprises a touch event by a user.
  - 20. The system of claim 16, wherein one input of the discrete set of inputs comprises a gesture by a user.
  - 21. The system of claim 16, wherein generate at least one feature from the continuous stream of input includes at least one of the following:
    - 1) signal processing to generate acoustic-phonetic or prosodic features,
      
      2) speech recognition processing to generate acoustic-linguistic or textual features.
  - 22. The system of claim 16, wherein generate continuous stream segments includes associating timing markers with the continuous stream segments, wherein the timing markers include start and stop times for each continuous stream segment.
  - 23. The system of claim 16, wherein processing the continuous stream segments using characteristics of the continuous stream of input to create a plurality of candidate segments includes discarding portions of the continuous stream segments based on at least one of the following characteristics:
    - 1) discarded portion includes a short impulse of noise;
      
      2) discarded portion includes low energy;
      
      3) discarded portion lacks a desired format;
      
      4) discarded portion lacks a desired pitch; or
      
      5) discarded portion lacks a prosodic structure.
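
Steps b) through d) of the independent claims, together with the first two discard criteria of claims 15 and 23, can be pictured with a small Python sketch. This is only one possible reading under stated assumptions: the feature is taken to be frame energy, segments are runs of frames above a threshold, and all function names, thresholds, and the toy signal are hypothetical. The claims do not prescribe any of these choices.

```python
def frame_energies(samples, frame_len):
    """Step b (sketch): mean squared amplitude of each non-overlapping frame."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def segment(energies, frame_sec, threshold):
    """Step c (sketch): runs of frames above threshold as (start, stop, mean_energy)."""
    segments, run_start = [], None
    for i in range(len(energies) + 1):
        active = i < len(energies) and energies[i] > threshold
        if active and run_start is None:
            run_start = i  # a new run begins; remember its first frame
        elif not active and run_start is not None:
            run = energies[run_start:i]
            # Timing markers: start and stop times in seconds (as in claims 14 and 22).
            segments.append((run_start * frame_sec, i * frame_sec, sum(run) / len(run)))
            run_start = None
    return segments


def candidates(segments, min_dur, min_energy):
    """Step d (sketch): discard short impulses and low-energy segments."""
    return [s for s in segments if s[1] - s[0] >= min_dur and s[2] >= min_energy]


# Toy signal: a one-frame click, a loud three-frame burst, a quiet three-frame burst.
samples = ([0] * 4 + [1] * 4 + [0] * 4 + [1] * 12 + [0] * 4
           + [0.5] * 12 + [0] * 4)
energies = frame_energies(samples, frame_len=4)
segs = segment(energies, frame_sec=0.25, threshold=0.1)
kept = candidates(segs, min_dur=0.5, min_energy=0.5)
print(kept)
```

The click is rejected for its duration and the quiet burst for its energy, leaving one candidate segment whose start and stop times can then be correlated with the discrete input timings of step f).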

Current Assignee: Ashwin P. Rao
Original Assignee: Ashwin P. Rao
Inventors: Rao, Ashwin P
Primary Examiner(s): AZAD, ABUL K
Application Number: US14/171,735
Publication Number: US 20140222430A1
Time in Patent Office: 1,506 Days
Field of Search: 704/231-257, 704/270
CPC Class Codes:
G10L 15/04 Segmentation; Word boundary...
G10L 2015/088 Word spotting
