Reducing speech recognition latency

US 9,514,747 B1
Filed: 08/28/2013
Issued: 12/06/2016
Est. Priority Date: 08/28/2013
Status: Active Grant

Abstract

In an automatic speech recognition (ASR) processing system, ASR processing may be configured to reduce a latency of returning speech results to a user. The latency may be determined by comparing a time stamp of an utterance in process to a current time. Latency may also be estimated based on an endpoint of the utterance or other considerations such as how difficult the utterance may be to process. To improve latency the ASR system may be configured to adjust various processing parameters, such as graph pruning factors, path weights, ASR models, etc. Latency checks and corrections may occur dynamically for a particular utterance while it is being processed, thus allowing the ASR system to adjust to rapidly changing latency conditions.
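
The latency check described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the `shrink` factor, and the `min_beam` floor are hypothetical, and a beam width stands in for a generic graph pruning factor.

```python
import time


def current_latency(time_stamp: float, now: float) -> float:
    """Seconds elapsed between the audio time stamp and the current time."""
    return now - time_stamp


def next_beam_width(beam: float, latency: float, target: float,
                    shrink: float = 0.8, min_beam: float = 2.0) -> float:
    """Return a narrower beam when latency exceeds the target, else keep it.

    `shrink` and `min_beam` are illustrative values, not taken from the patent.
    """
    if latency > target:
        return max(min_beam, beam * shrink)
    return beam


# Example: audio stamped 1.5 s ago, checked against a 1.0 s target latency.
now = time.monotonic()
latency = current_latency(now - 1.5, now)
beam = next_beam_width(10.0, latency, target=1.0)
```

Calling `next_beam_width` once per processed portion would let the setting follow latency as it changes during a single utterance.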

23 Claims

1. A method for dynamically adjusting speech recognition processing to reduce latency, the method comprising:
- receiving a first portion of an audio input corresponding to an utterance;
  
  identifying a time stamp associated with the first portion;
  
  performing speech recognition processing on the first portion using a first graph pruning factor;
  
  identifying a current time of processing of the first portion;
  
  determining a current latency of the utterance by comparing the time stamp to the current time;
  
  determining a property of a second portion of the audio input prior to performing speech recognition processing on the second portion, the property comprising an estimated difficulty of speech recognition processing, the estimated difficulty based on a percentage of the second portion of the audio input that has a signal to noise ratio below a threshold;
  
  determining an estimated latency based at least in part on the property of the second portion and the current latency;
  
  comparing the estimated latency to a target latency;
  
  determining a second graph pruning factor based at least in part on the comparing;
  
  performing additional speech recognition processing on the second portion using the second graph pruning factor; and
  
  outputting speech processing results.
  - 2. The method of claim 1, wherein the first graph pruning factor comprises at least one of a maximum number of paths of a graph to process or a threshold score for selecting paths of a graph to process.
  - 3. The method of claim 1, wherein determining the estimated latency of the utterance is further based at least in part on an endpoint location.
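
The two kinds of graph pruning factor named in claim 2 can be illustrated with a small sketch. It assumes each candidate path is a `(label, score)` pair where a higher score is better; the function name `prune_paths` and its parameters `max_paths` and `min_score` are hypothetical and do not come from the patent.

```python
from typing import List, Optional, Tuple

Path = Tuple[str, float]  # (label, score); higher score is assumed better


def prune_paths(paths: List[Path],
                max_paths: Optional[int] = None,
                min_score: Optional[float] = None) -> List[Path]:
    """Keep only the paths that survive the given pruning factors."""
    # Rank best-first so a count limit keeps the highest-scoring paths.
    ranked = sorted(paths, key=lambda p: p[1], reverse=True)
    if min_score is not None:
        # Threshold score: drop every path scoring below the threshold.
        ranked = [p for p in ranked if p[1] >= min_score]
    if max_paths is not None:
        # Maximum number of paths: keep at most this many.
        ranked = ranked[:max_paths]
    return ranked


paths = [("a", -1.0), ("b", -3.0), ("c", -7.0), ("d", -2.0)]
top_two = prune_paths(paths, max_paths=2)
above_threshold = prune_paths(paths, min_score=-3.0)
```

Lowering `max_paths` or raising `min_score` leaves fewer paths to extend, which is how a second, tighter pruning factor would trade accuracy for speed.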

4. A computing device, comprising:
- at least one processor;
  
  a memory device including instructions operable to be executed by the at least one processor to perform a set of actions, configuring the at least one processor:
  
  to receive a first portion of audio data;
  
  to perform, beginning at a first time and with a first frame, speech processing on the first portion using a first value of a speech processing parameter;
  
  to determine, at a second time, a current location, in the first portion of the audio data, of data being processed during the speech processing;
  
  to determine a second frame at the current location;
  
  to determine a first number of frames between the first frame and the second frame;
  
  to determine a first processing rate based at least in part on the first number of frames, the first time and the second time;
  
  to estimate, based on the current location and the first processing rate, a speech processing latency corresponding to processing of a second portion of the audio data;
  
  to set the speech processing parameter to a second value based at least in part on the speech processing latency; and
  
  to perform speech processing on the second portion of the audio data using the second value of the speech processing parameter.
  - 5. The computing device of claim 4, wherein the at least one processor is further configured:
    - to estimate a second estimated speech processing latency based on the speech processing on the second portion;
      
      to perform second speech processing on the first portion using a third value of the speech processing parameter; and
      
      to select speech recognition results from the speech processing or the second speech processing based at least in part on the second estimated speech processing latency of the speech processing or an estimated speech processing latency of the second speech processing.
  - 6. The computing device of claim 4, wherein the at least one processor is further configured:
    - to identify a timestamp of processing associated with a beginning of the first portion, the timestamp corresponding to the first time;
      
      to determine a current time of processing associated with the current location, the current time corresponding to the second time;
      
      to determine a difference between the timestamp and the current time; and
      
      to determine the first processing rate based at least in part on the first number of frames and the difference.
  - 7. The computing device of claim 4, wherein the at least one processor is further configured:
    - to determine an endpoint of the audio; and
      
      to estimate the speech processing latency based at least in part on the current location, the endpoint, and the first processing rate.
  - 8. The computing device of claim 4, wherein the at least one processor is further configured to determine a property of the second portion of the audio data prior to performing speech processing on the second portion, and wherein the at least one processor is configured to estimate the speech processing latency based at least in part on the property.
  - 9. The computing device of claim 8, wherein the property of the second portion comprises at least one of a level of noise, or an estimated difficulty of speech recognition processing.
  - 10. The computing device of claim 4, wherein the at least one processor is configured to estimate the speech processing latency based at least in part on a load of the computing device.
  - 11. The computing device of claim 4, wherein the speech processing parameter comprises at least one of a graph pruning factor, a weight of a graph path, a grammar size, a numerical-precision parameter, a Gaussian mixture-component-count parameter, a frame-rate parameter, a score-caching parameter, an intent-difficulty parameter, a user-class parameter, an audio-quality parameter, a server-load parameter or a number of features of an audio frame to be processed.
  - 12. The computing device of claim 4, wherein the at least one processor is configured to set the speech processing parameter based at least in part on a speaker of the audio data.
  - 13. The computing device of claim 4, wherein the at least one processor is further configured:
    - to determine that the first processing rate is below a threshold; and
      
      to set the speech processing parameter to the second value because the first processing rate is below the threshold.
  - 14. The computing device of claim 4, wherein the at least one processor is further configured:
    - to perform speech processing on the first portion of the audio data from the current location using the second value of the speech processing parameter; and
      
      to perform speech processing on a third portion of the audio data using the first value of the speech processing parameter.
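
The frame-based rate calculation in claims 4, 6, 7 and 13 can be sketched as below. The claims do not give a formula for the latency estimate, so this sketch assumes one: the time needed to process the frames remaining before the endpoint at the measured rate. All function and parameter names are hypothetical.

```python
def processing_rate(first_frame: int, second_frame: int,
                    first_time: float, second_time: float) -> float:
    """Frames processed per second between two points in time."""
    elapsed = second_time - first_time
    if elapsed <= 0:
        raise ValueError("second_time must be later than first_time")
    return (second_frame - first_frame) / elapsed


def remaining_processing_seconds(current_frame: int, endpoint_frame: int,
                                 rate: float) -> float:
    """Assumed latency proxy: time to reach the endpoint at the current rate."""
    remaining = max(0, endpoint_frame - current_frame)
    return remaining / rate


def choose_value(rate: float, rate_threshold: float,
                 first_value: float, second_value: float) -> float:
    """Switch to the second parameter value when the rate is below a threshold."""
    return second_value if rate < rate_threshold else first_value


# 200 frames handled in 2 s gives 100 frames/s; 300 frames remain before the endpoint.
rate = processing_rate(0, 200, first_time=10.0, second_time=12.0)
estimate = remaining_processing_seconds(200, 500, rate)
value = choose_value(rate, rate_threshold=150.0, first_value=12.0, second_value=8.0)
```

With these illustrative numbers the rate falls below the threshold, so the second value is selected for the next portion of audio.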

15. A non-transitory computer-readable storage medium storing processor-executable instructions for controlling a computing device, comprising:
- program code to receive a first portion of audio data;
  
  program code to perform speech processing on the first portion using a first value of a speech processing parameter;
  
  program code to determine a property of a second portion of the audio data, the property comprising an estimated difficulty of speech recognition processing, the estimated difficulty based on a percentage of the second portion of the audio data that has a signal to noise ratio below a threshold;
  
  program code to estimate, based at least in part on the property, a speech processing latency corresponding to processing of the audio data;
  
  program code to set the speech processing parameter to a second value based at least in part on the speech processing latency; and
  
  program code to perform speech processing on a second portion of the audio data using the second value of the speech processing parameter.
  - 16. The non-transitory computer-readable storage medium of claim 15, further comprising:
    - program code to estimate a second estimated speech processing latency based on the speech processing on the second portion;
      
      program code to perform second speech processing on the first portion using a third value of the speech processing parameter; and
      
      program code to select speech recognition results from the speech processing or the second speech processing based at least in part on the second estimated speech processing latency of the speech processing or an estimated speech processing latency of the second speech processing.
  - 17. The non-transitory computer-readable storage medium of claim 15, further comprising program code to identify a timestamp of the first portion and to determine a current time of processing, and wherein the program code is further configured to estimate the speech processing latency based at least in part on the timestamp and the current time.
  - 18. The non-transitory computer-readable storage medium of claim 17, further comprising program code to determine an endpoint of the audio data, and wherein the program code is further configured to estimate the speech processing latency based at least in part on the endpoint.
  - 19. The non-transitory computer-readable storage medium of claim 15, wherein the program code is further configured to estimate the speech processing latency based at least in part on a load of the computing device.
  - 20. The non-transitory computer-readable storage medium of claim 15, wherein the speech processing parameter comprises at least one of a graph pruning factor, a weight of a graph path, a grammar size, a numerical-precision parameter, a Gaussian mixture-component-count parameter, a frame-rate parameter, a score-caching parameter, an intent-difficulty parameter, a user-class parameter, an audio-quality parameter, a server-load parameter or a number of features of an audio frame to be processed.
  - 21. The non-transitory computer-readable storage medium of claim 15, wherein the program code is further configured to set the speech processing parameter based at least in part on a speaker of the audio data.
  - 22. The non-transitory computer-readable storage medium of claim 15, further comprising:
    - program code to perform speech processing on the first portion of the audio data, from a current location of data being processed during speech processing, using the second value of the speech processing parameter; and
      
      program code to perform speech processing on a third portion of the audio data using the first value of the speech processing parameter.
  - 23. The non-transitory computer-readable storage medium of claim 15, further comprising:
    - program code to determine a first number of frames in the second portion of the audio data;
      
      program code to determine a first signal to noise ratio of a first frame of the first number of frames;
      
      program code to determine that the first signal to noise ratio is below the threshold;
      
      program code to determine a second number of frames of the second portion of the audio data that have a signal to noise ratio below the threshold; and
      
      program code to determine the estimated difficulty of speech recognition processing based on the first number of frames and the second number of frames.
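
The difficulty estimate in claim 23 reduces to counting low-SNR frames. The sketch below assumes the per-frame signal to noise ratios are already available in decibels; how they are measured is outside its scope, and the function name and the 10 dB threshold in the example are hypothetical.

```python
from typing import Sequence


def estimated_difficulty(frame_snrs_db: Sequence[float],
                         threshold_db: float) -> float:
    """Percentage of frames whose signal to noise ratio is below the threshold."""
    total_frames = len(frame_snrs_db)  # the "first number of frames"
    if total_frames == 0:
        return 0.0
    # The "second number of frames": those with an SNR below the threshold.
    noisy_frames = sum(1 for snr in frame_snrs_db if snr < threshold_db)
    return 100.0 * noisy_frames / total_frames


# Two of the four frames fall below 10 dB.
difficulty = estimated_difficulty([20.0, 5.0, 3.0, 15.0], threshold_db=10.0)
```

A higher percentage would feed into a higher latency estimate for the second portion, which in turn could justify a tighter parameter value before that portion is processed.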

Current Assignee: Amazon Technologies, Inc. (Amazon.com, Inc.)
Original Assignee: Amazon Technologies, Inc. (Amazon.com, Inc.)
Inventors: Bisani, Michael Maximilian Emanuel; Secker-Walker, Hugh Evan; Basye, Kenneth John; Rosen, Alexander David
Primary Examiner(s): Adesanya, Olujimi
Application Number: US14/011,898
Time in Patent Office: 1,196 Days
Field of Search: 704/249, 704/246, 704/270, 704/240, 704/254, 704/244, 704/231
US Class Current: 1/1
CPC Class Codes:
G10L 15/08   Speech classification or se...
G10L 2015/085   Methods for reducing search...
G10L 25/60   for measuring the quality o...
