METHODS AND APPARATUS FOR REDUCING LATENCY IN SPEECH RECOGNITION APPLICATIONS

US 20160351196A1
Filed: 05/26/2015
Published: 12/01/2016
Est. Priority Date: 05/26/2015
Status: Active Grant

Abstract

Methods and apparatus for reducing latency in speech recognition applications. The method comprises receiving first audio comprising speech from a user of a computing device, detecting an end of speech in the first audio, generating an ASR result based, at least in part, on a portion of the first audio prior to the detected end of speech, determining whether a valid action can be performed by a speech-enabled application installed on the computing device using the ASR result, and processing second audio when it is determined that a valid action cannot be performed by the speech-enabled application using the ASR result.

20 Claims

1. A computing device including a speech-enabled application installed thereon, the computing device comprising:
- an input interface configured to receive first audio comprising speech from a user of the computing device;
  
  an automatic speech recognition (ASR) engine configured to:
  
  detect based, at least in part, on a threshold time for endpointing, an end of speech in the first audio; and
  
  generate a first ASR result based, at least in part, on a portion of the first audio prior to the detected end of speech; and
  
  at least one processor programmed to:
  
  determine whether a valid action can be performed by the speech-enabled application using the first ASR result; and
  
  instruct the ASR engine to process second audio when it is determined that a valid action cannot be performed by the speech-enabled application using the first ASR result.
  - 2. The computing device of claim 1, wherein determining whether a valid action can be performed by the speech-enabled application using the first ASR result is based, at least in part, on a natural language understanding (NLU) result generated using the first ASR result.
  - 3. The computing device of claim 2, wherein the processor is further programmed to submit the NLU result to the speech-enabled application, and wherein determining whether a valid action can be performed by the speech-enabled application using the first ASR result comprises receiving an indication from the speech-enabled application that a valid action cannot be performed in response to submitting the NLU result to the speech-enabled application.
  - 4. The computing device of claim 1, wherein the input interface is further configured to receive the second audio, wherein the second audio includes audio recorded after the detected end of speech in the first audio, and wherein the ASR engine is further configured to process the second audio.
  - 5. The computing device of claim 4, wherein processing the second audio comprises:
    - determining whether the second audio includes speech; and
      
      generating a second ASR result based, at least in part, on at least a portion of the second audio in response to determining that the second audio comprises speech.
  - 6. The computing device of claim 5, wherein generating the second ASR result comprises generating the second ASR result based, at least in part, on at least a portion of the first audio and at least a portion of the second audio.
  - 7. The computing device of claim 5, wherein the at least one processor is further programmed to:
    - determine whether a valid action can be performed by the speech-enabled application using a natural language understanding (NLU) result generated based, at least in part, on at least a portion of the first ASR result and at least a portion of the second ASR result; and
      
      instruct the speech-enabled application to perform the valid action in response to determining that the valid action can be performed using the NLU result.
  - 8. The computing device of claim 5, further comprising:
    - at least one storage device configured to store one or more prefixes, each of which is associated with a corresponding threshold time for endpointing; and
      
      wherein determining whether a valid action can be performed by the speech-enabled application comprises determining whether the speech in the first audio includes a prefix of the one or more prefixes stored on the at least one storage device.
  - 9. The computing device of claim 8, wherein the ASR engine is further configured to:
    - process a plurality of time segments of the first audio prior to detecting the end of speech in the first audio, and wherein determining whether the speech in the first audio includes a prefix stored on the at least one storage device comprises comparing output of the ASR engine determined based on the processed plurality of time segments to the one or more prefixes stored on the at least one storage device.
  - 10. The computing device of claim 8, wherein the at least one processor is further programmed to:
    - update the threshold time used by the ASR engine for endpointing in response to determining that the speech in the first audio includes a prefix stored on the at least one storage device, wherein updating the threshold time comprises instructing the ASR engine to use the threshold time for endpointing associated with the prefix stored on the at least one storage device identified in the speech in the first audio to detect an end of speech in the first audio.
  - 11. The computing device of claim 1, wherein the at least one processor is further programmed to:
    - create a first hint based, at least in part, on the first ASR result, wherein the first hint prompts the user for speech input corresponding to a valid action that can be performed by the speech-enabled application; and
      
      present the first hint via a user interface of the computing device.
  - 12. The computing device of claim 11, wherein the input interface is further configured to receive the second audio, wherein the ASR engine is further configured to process the second audio to generate a second ASR result, and wherein the at least one processor is further programmed to:
    - create a second hint based, at least in part, on the first ASR result and/or the second ASR result, wherein the second hint prompts the user for speech input corresponding to a valid action that can be performed by the speech-enabled application; and
      
      present the second hint via a user interface of the computing device.
  - 13. The computing device of claim 11, wherein presenting the first hint via the user interface comprises visually displaying the first hint on the user interface, and wherein the first hint hints of additional information to supplement the first audio to perform the valid action.
  - 14. The computing device of claim 11, wherein the input interface is further configured to receive the second audio, wherein the ASR engine is further configured to process the second audio, wherein processing the second audio comprises performing ASR processing on the second audio based, at least in part, on information included in the first hint.
  - 15. The computing device of claim 1, further comprising:
    - at least one storage device configured to store at least one data structure including information describing a plurality of natural language understanding (NLU) results and corresponding ASR output used to generate the plurality of NLU results;
      
      wherein the at least one processor is further programmed to:
      
      determine whether to add the first ASR result and a corresponding NLU result generated using the first ASR result to the at least one data structure stored on the at least one storage device; and
      
      add the first ASR result and the corresponding NLU result generated using the first ASR result to the at least one data structure stored on the at least one storage device in response to determining that the first ASR result and the corresponding NLU result should be added.
  - 16. The computing device of claim 15, wherein determining whether to add the first ASR result and the corresponding NLU result generated using the first ASR result to the at least one data structure comprises:
    - determining a number of times the corresponding NLU result has been received by the computing device from an NLU engine remotely located from the computing device; and
      
      determining that the first ASR result and the corresponding NLU result should be added to the at least one data structure when the number of times the corresponding NLU result has been received by the computing device exceeds a threshold value.
  - 17. The computing device of claim 15, wherein the input interface is further configured to receive third audio including speech from the user of the computing device, wherein the ASR engine is further configured to generate a second ASR result based, at least in part, on at least a portion of the third audio, and wherein the processor is further programmed to:
    - identify an ASR output stored in the at least one data structure corresponding to the second ASR result; and
      
      submit the NLU result corresponding to the identified ASR output stored in the at least one data structure to the speech-enabled application to enable the speech-enabled application to perform an action based on the submitted NLU result.
  - 18. The computing device of claim 17, wherein the at least one processor is programmed to submit the NLU result corresponding to the identified ASR output stored in the at least one data structure to the at least one data structure without sending a request for remote NLU processing of the third audio to an NLU engine remotely located from the computing device.
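
Claims 15 to 18 recite a locally stored data structure that pairs ASR output with NLU results, so that a later utterance can be resolved without a request to a remote NLU engine. The following is a minimal sketch of that idea and not the patented implementation. The names `LocalNluCache`, `understand`, and `remote_nlu`, the use of plain strings for ASR and NLU results, and the default threshold of 3 are all assumptions made for illustration; the claims only require that a count "exceeds a threshold value".

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class LocalNluCache:
    """Hypothetical local store mapping ASR output to NLU results."""

    min_count: int = 3  # assumed value; the claims do not give a number
    # How many times each NLU result has come back from the remote engine.
    _seen: Dict[str, int] = field(init=False, default_factory=dict)
    # ASR text -> NLU result, for pairs that passed the count test.
    _entries: Dict[str, str] = field(init=False, default_factory=dict)

    def record_remote_result(self, asr_text: str, nlu_result: str) -> bool:
        """Count a remote NLU result and store the pair once the count
        exceeds min_count. Returns True if the pair is now stored."""
        self._seen[nlu_result] = self._seen.get(nlu_result, 0) + 1
        if self._seen[nlu_result] > self.min_count:
            self._entries[asr_text] = nlu_result
            return True
        return False

    def lookup(self, asr_text: str) -> Optional[str]:
        """Return the stored NLU result for this ASR output, if any."""
        return self._entries.get(asr_text)


def understand(
    asr_text: str,
    cache: LocalNluCache,
    remote_nlu: Callable[[str], str],
) -> str:
    """Resolve an ASR result locally when possible, remotely otherwise."""
    cached = cache.lookup(asr_text)
    if cached is not None:
        return cached  # no remote NLU request is sent
    result = remote_nlu(asr_text)
    cache.record_remote_result(asr_text, result)
    return result
```

In this sketch the remote engine is passed in as a callable, so the caching policy can be exercised without a network connection.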

19. A method, comprising:
- receiving, by an input interface of a computing device, first audio comprising speech from a user of the computing device;
  
  detecting, by an automatic speech recognition (ASR) engine of the computing device, an end of speech in the first audio;
  
  generating, by the ASR engine, an ASR result based, at least in part, on a portion of the first audio prior to the detected end of speech;
  
  determining whether a valid action can be performed by a speech-enabled application installed on the computing device using the ASR result; and
  
  instructing the ASR engine to process second audio when it is determined that a valid action cannot be performed by the speech-enabled application using the ASR result.
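
The steps of the method in claim 19 can be sketched as a small loop. This is a toy model under stated assumptions, not the claimed implementation: audio is represented as a sequence of frames that are either a recognized word or `None` for silence, each frame is assumed to last `FRAME_MS` milliseconds, and `is_valid_action` stands in for the speech-enabled application's check. The function names, the 300 ms default threshold, and the `max_rounds` limit are hypothetical. Combining the first and second results follows the variant described in claims 6 and 7.

```python
from typing import Callable, Iterable, Iterator, List, Optional

FRAME_MS = 100  # assumed frame duration for this toy model


def read_until_endpoint(
    frames: Iterator[Optional[str]], threshold_ms: int
) -> List[str]:
    """Consume frames until trailing silence reaches threshold_ms.

    Returns the words heard before the detected end of speech. Silence
    before any word has been heard never triggers an endpoint.
    """
    words: List[str] = []
    silence_ms = 0
    for frame in frames:
        if frame is None:
            silence_ms += FRAME_MS
            if words and silence_ms >= threshold_ms:
                break  # end of speech detected
        else:
            words.append(frame)
            silence_ms = 0
    return words


def recognize(
    frames: Iterable[Optional[str]],
    is_valid_action: Callable[[str], bool],
    threshold_ms: int = 300,
    max_rounds: int = 2,
) -> Optional[str]:
    """Return text the application can act on, or None if none was heard."""
    stream = iter(frames)  # shared iterator, so later rounds resume here
    heard: List[str] = []
    for _ in range(max_rounds):
        new_words = read_until_endpoint(stream, threshold_ms)
        if not new_words:
            break  # no further speech arrived
        heard.extend(new_words)
        text = " ".join(heard)
        if is_valid_action(text):
            return text
        # Otherwise keep listening: process the audio after the endpoint.
    return None
```

A short endpointing threshold keeps latency low for complete commands, while the validity check lets the loop recover when the user merely paused mid-utterance.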

20. A computer-readable storage medium encoded with a plurality of instructions that, when executed by a computing device, performs a method, the method comprising:
- receiving first audio comprising speech from a user of the computing device;
  
  detecting an end of speech in the first audio;
  
  generating an ASR result based, at least in part, on a portion of the first audio prior to the detected end of speech;
  
  determining whether a valid action can be performed by a speech-enabled application installed on the computing device using the ASR result; and
  
  processing second audio when it is determined that a valid action cannot be performed by the speech-enabled application using the ASR result.
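
The end-of-speech detection recited in the claims relies on a threshold time for endpointing, and claims 8 to 10 associate stored prefixes with their own thresholds. A minimal sketch of such a lookup follows. The table contents, the millisecond values, the function name `endpoint_threshold_ms`, and the matching rule (the partial ASR output ends with a stored prefix on a word boundary, longest match wins) are assumptions for illustration; the claims do not specify them.

```python
from typing import Dict, Mapping

DEFAULT_THRESHOLD_MS = 300  # assumed default endpointing threshold

# Hypothetical prefixes after which a user is likely to pause before
# finishing the command, each with a longer endpointing threshold.
PREFIX_THRESHOLDS_MS: Dict[str, int] = {
    "call": 800,
    "send a text to": 1200,
}


def endpoint_threshold_ms(
    partial_text: str,
    table: Mapping[str, int] = PREFIX_THRESHOLDS_MS,
    default_ms: int = DEFAULT_THRESHOLD_MS,
) -> int:
    """Pick the endpointing threshold for the partial ASR output so far."""
    # Normalize case and whitespace before comparing.
    text = " ".join(partial_text.lower().split())
    best_len = -1
    best_ms = default_ms
    for prefix, threshold in table.items():
        # Match only whole words at the end of what has been heard.
        matches = text == prefix or text.endswith(" " + prefix)
        if matches and len(prefix) > best_len:
            best_len = len(prefix)
            best_ms = threshold
    return best_ms
```

Called on each partial result, a function like this would let the endpointer wait longer after "call" than after a complete command such as "call mom".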

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Nuance Communications, Inc. (Microsoft Corporation)
Inventors: Fanty, Mark
Granted Patent: US 9,666,192 B2
US Class Current: 1/1
CPC Class Codes:
- G06F 3/167: Audio in a user interface, ...
- G10L 15/01: Assessment or evaluation of...
- G10L 15/04: Segmentation; Word boundary...
- G10L 15/05: Word boundary detection
- G10L 15/08: Speech classification or se...
- G10L 15/1815: Semantic context, e.g. disa...
- G10L 15/183: using context dependencies,...
- G10L 15/22: Procedures used during a sp...
- G10L 15/30: Distributed recognition, e....
- G10L 17/00: Speaker identification or v...
- G10L 2015/223: Execution procedure of a sp...
- G10L 2015/225: Feedback of the input speech
- G10L 2015/228: of application context
- G10L 25/87: Detection of discrete point...
