Direction-based speech endpointing
First Claim
1. A computer-implemented method for determining an utterance endpoint during automatic speech recognition (ASR) processing, the method comprising:
receiving audio comprising speech;
determining audio data based on the audio;
determining a source direction corresponding to the audio data;
determining a duration associated with the audio data, wherein the duration indicates how long the audio has been continuously received from the source direction;
performing ASR processing on the audio data to determine:
a plurality of hypotheses, wherein each hypothesis of the plurality of hypotheses includes at least one word or a representation of at least one word potentially corresponding to the audio data, and
for each of the plurality of hypotheses, a respective probability that the respective hypothesis corresponds to an utterance represented in the audio data;
determining, for each of the plurality of hypotheses, a representation of a respective number of audio frames corresponding to non-speech immediately preceding a first point;
calculating, for each of the plurality of hypotheses, a respective weighted pause duration by multiplying the respective probability of a respective hypothesis by the respective number of audio frames of the respective hypothesis;
calculating a cumulative expected pause duration by summing the respective weighted pause durations for each of the plurality of hypotheses;
calculating an adjusted cumulative score using the cumulative expected pause duration; and
designating the first point as corresponding to a likely endpoint as a result of the adjusted cumulative score exceeding a first threshold.
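The expected-pause arithmetic in the claim above can be sketched as follows. The hypothesis list, probabilities, frame counts, and the threshold value are illustrative assumptions, not values from the patent, and the function names are invented for this sketch.

```python
# Sketch of the claimed endpoint test, assuming each ASR hypothesis carries
# a posterior probability and a count of trailing non-speech audio frames.
# All names and numbers here are illustrative, not from the patent itself.

def expected_pause_duration(hypotheses):
    """Sum of probability-weighted trailing-pause lengths over all hypotheses."""
    return sum(h["prob"] * h["trailing_nonspeech_frames"] for h in hypotheses)

def is_likely_endpoint(hypotheses, threshold):
    """Designate the current point a likely endpoint once the cumulative
    expected pause duration exceeds a threshold."""
    return expected_pause_duration(hypotheses) > threshold

# Illustrative N-best list: two hypotheses that have fallen silent for a while,
# and one that expects more words (few trailing non-speech frames).
nbest = [
    {"prob": 0.6, "trailing_nonspeech_frames": 40},
    {"prob": 0.3, "trailing_nonspeech_frames": 35},
    {"prob": 0.1, "trailing_nonspeech_frames": 5},
]

print(expected_pause_duration(nbest))          # 0.6*40 + 0.3*35 + 0.1*5 ≈ 35
print(is_likely_endpoint(nbest, threshold=30))
```

The point of weighting by hypothesis probability is that a low-probability hypothesis still expecting more words (small trailing pause) only mildly delays the endpoint decision, while agreement among high-probability hypotheses that speech has stopped drives the expected pause up quickly.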
Abstract
A system for determining an endpoint of an utterance during automatic speech recognition (ASR) processing that accounts for the direction and duration of the incoming speech. Beamformers of the ASR system may identify a source direction of the audio. The system may track the duration speech has been received from that source direction so that if speech is detected in another direction, the original source speech may be weighted differently for purposes of determining an endpoint of the utterance. Speech from a new direction may be discarded or treated like non-speech for purposes of determining an endpoint of speech from an original direction.
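The direction rule described in the abstract can be illustrated with a toy per-frame labeler. The frame representation (a direction index plus a speech flag) and the function name are assumptions made for this sketch, not the patent's actual data model.

```python
# Toy illustration of the abstract's direction rule: once an utterance is
# being tracked from one beamformer direction, speech frames arriving from a
# different direction are treated as non-speech for endpointing purposes.
# The (direction, is_speech) frame format is an assumption for this sketch.

def label_frames_for_endpointing(frames, tracked_direction):
    """Return per-frame speech/non-speech labels as seen by the endpointer."""
    labels = []
    for direction, is_speech in frames:
        # Speech only counts as speech if it comes from the tracked direction.
        labels.append(is_speech and direction == tracked_direction)
    return labels

# Direction 0 is the original talker; direction 2 is an interrupting talker.
frames = [(0, True), (0, True), (2, True), (2, True), (0, False), (0, False)]
print(label_frames_for_endpointing(frames, tracked_direction=0))
# → [True, True, False, False, False, False]
```

Under this gating, the interrupting talker's frames extend the original talker's apparent pause rather than resetting it, which is what lets the endpointer close out the original utterance.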
216 Citations
29 Claims
1. A computer-implemented method for determining an utterance endpoint during automatic speech recognition (ASR) processing, the method comprising:
receiving audio comprising speech;
determining audio data based on the audio;
determining a source direction corresponding to the audio data;
determining a duration associated with the audio data, wherein the duration indicates how long the audio has been continuously received from the source direction;
performing ASR processing on the audio data to determine:
a plurality of hypotheses, wherein each hypothesis of the plurality of hypotheses includes at least one word or a representation of at least one word potentially corresponding to the audio data, and
for each of the plurality of hypotheses, a respective probability that the respective hypothesis corresponds to an utterance represented in the audio data;
determining, for each of the plurality of hypotheses, a representation of a respective number of audio frames corresponding to non-speech immediately preceding a first point;
calculating, for each of the plurality of hypotheses, a respective weighted pause duration by multiplying the respective probability of a respective hypothesis by the respective number of audio frames of the respective hypothesis;
calculating a cumulative expected pause duration by summing the respective weighted pause durations for each of the plurality of hypotheses;
calculating an adjusted cumulative score using the cumulative expected pause duration; and
designating the first point as corresponding to a likely endpoint as a result of the adjusted cumulative score exceeding a first threshold.
View Dependent Claims (2, 3, 4)
5. A computer-implemented method comprising:
determining that received audio data corresponding to at least one utterance includes first audio data, wherein the first audio data corresponds to a first source direction;
performing automatic speech recognition processing on the first audio data to determine a first hypothesis including one or more of at least one first word or a representation of at least one first word potentially corresponding to the first audio data;
determining that a first portion of the first audio data corresponds to speech;
determining a first value representing a first time duration of the first portion of the first audio data;
determining a first duration weight factor based at least in part on the first value;
determining, in the first hypothesis, a representation of first non-speech, the first non-speech following the first portion of the first audio data;
determining a second value representing a second time duration of the first non-speech;
determining a first pause duration value by using the first duration weight factor to adjust the second value; and
determining an endpoint based at least in part on the first pause duration value.
View Dependent Claims (6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18)
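Claim 5's duration-weighted pause can be sketched as below. The weight schedule (the longer speech has continued, the more strongly a following pause counts toward an endpoint) is one plausible reading of the claim, and every name and constant is hypothetical.

```python
# Sketch of claim 5's pause adjustment under an assumed weight schedule:
# speech that has continued longer from the source direction makes a given
# trailing pause count more toward the endpoint decision. All constants
# (ramp, cap, threshold) are invented for illustration.

def duration_weight(speech_seconds, ramp_seconds=2.0, max_weight=2.0):
    """First duration weight factor, derived from how long speech has run."""
    return min(max_weight, 1.0 + speech_seconds / ramp_seconds)

def weighted_pause(speech_seconds, pause_seconds):
    """First pause duration value: the raw pause adjusted by the weight."""
    return duration_weight(speech_seconds) * pause_seconds

def endpoint_detected(speech_seconds, pause_seconds, threshold=0.8):
    return weighted_pause(speech_seconds, pause_seconds) >= threshold

# After 3 s of speech (weight capped at 2.0), a 0.5 s pause triggers:
print(endpoint_detected(3.0, 0.5))  # → True
# Early in the utterance (weight 1.0), the same pause does not:
print(endpoint_detected(0.0, 0.5))  # → False
```

The effect is that brief mid-sentence pauses early in an utterance are tolerated, while the same pause after a long stretch of speech is more readily taken as the end of the utterance.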
19. A computing system comprising:
at least one processor; and
a computer-readable medium encoded with instructions operable to be executed by the at least one processor to cause the computing system to perform a set of actions comprising:
determining that received audio data corresponding to at least one utterance includes first audio data, wherein the first audio data corresponds to a first source direction;
performing automatic speech recognition processing on the first audio data to determine a first hypothesis including one or more of at least one first word or a representation of at least one first word potentially corresponding to the first audio data;
determining that a first portion of the first audio data corresponds to speech;
determining a first value representing a first time duration of the first portion of the first audio data;
determining a first duration weight factor based at least in part on the first value;
determining, in the first hypothesis, a representation of first non-speech, the first non-speech following the first portion of the first audio data;
determining a second value representing a second time duration of the first non-speech;
determining a first pause duration value by using the first duration weight factor to adjust the second value; and
determining an endpoint based at least in part on the first pause duration value.
View Dependent Claims (20, 21, 22, 23, 24, 25, 26, 27, 28, 29)
Specification