Language model speech endpointing
First Claim
1. A computer-implemented method for determining an endpoint during automatic speech recognition (ASR) processing, the method comprising:
receiving audio data representing speech detected using a microphone of a mobile device;
performing ASR processing on the audio data to determine a plurality of hypotheses;
determining, for each of the plurality of hypotheses, a respective probability that the respective hypothesis corresponds to the audio data;
determining, for each of the plurality of hypotheses, a respective number of non-speech audio frames immediately preceding a first point in the audio data;
determining, for each of the plurality of hypotheses, a respective score by multiplying the probability of the respective hypothesis by a factor corresponding to the number of non-speech audio frames of the respective hypothesis;
determining a cumulative score by summing the respective scores for each of the plurality of hypotheses;
determining that the cumulative score exceeds a first threshold; and
designating the first point as corresponding to a likely endpoint as a result of the cumulative score exceeding the first threshold.
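The scoring steps of claim 1 can be sketched as a short computation: weight each hypothesis's trailing non-speech by that hypothesis's probability, sum across hypotheses, and compare to a threshold. The names below (`Hypothesis`, `FRAME_FACTOR`, `SCORE_THRESHOLD`) and the specific constant values are illustrative assumptions, not values specified by the claim.

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    probability: float               # probability the hypothesis matches the audio
    trailing_nonspeech_frames: int   # non-speech frames immediately preceding the point

FRAME_FACTOR = 0.01     # assumed per-frame weighting factor (claim leaves this open)
SCORE_THRESHOLD = 0.5   # assumed "first threshold"

def is_likely_endpoint(hypotheses):
    """Return True if the cumulative probability-weighted non-speech score
    exceeds the threshold, i.e. the first point is a likely endpoint."""
    cumulative = sum(
        h.probability * (h.trailing_nonspeech_frames * FRAME_FACTOR)
        for h in hypotheses
    )
    return cumulative > SCORE_THRESHOLD
```

For example, two hypotheses with probabilities 0.6 and 0.4 that have accumulated 80 and 100 trailing non-speech frames yield a cumulative score of 0.6·0.8 + 0.4·1.0 = 0.88, which exceeds the assumed threshold.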
Abstract
An automatic speech recognition (ASR) system detects an endpoint of an utterance using the active hypotheses under consideration by a decoder. The ASR system calculates the amount of non-speech detected by a plurality of hypotheses and weights the non-speech duration by the probability of each hypothesis. When the aggregate weighted non-speech exceeds a threshold, an endpoint may be declared.
87 Citations
22 Claims
1. A computer-implemented method for determining an endpoint during automatic speech recognition (ASR) processing, the method comprising:
receiving audio data representing speech detected using a microphone of a mobile device;
performing ASR processing on the audio data to determine a plurality of hypotheses;
determining, for each of the plurality of hypotheses, a respective probability that the respective hypothesis corresponds to the audio data;
determining, for each of the plurality of hypotheses, a respective number of non-speech audio frames immediately preceding a first point in the audio data;
determining, for each of the plurality of hypotheses, a respective score by multiplying the probability of the respective hypothesis by a factor corresponding to the number of non-speech audio frames of the respective hypothesis;
determining a cumulative score by summing the respective scores for each of the plurality of hypotheses;
determining that the cumulative score exceeds a first threshold; and
designating the first point as corresponding to a likely endpoint as a result of the cumulative score exceeding the first threshold.
View Dependent Claims (2, 3, 4)
5. A computer-implemented method, comprising:
receiving audio data;
performing speech recognition processing on the audio data to determine a plurality of hypotheses, each hypothesis of the plurality comprising at least one of a representation of a respective subword unit or a representation of a respective word, wherein the plurality of hypotheses includes:
a first hypothesis comprising a first representation of first non-speech preceding a first point in the audio data, and
a second hypothesis comprising a second representation of second non-speech preceding the first point in the audio data;
determining a first probability corresponding to the first hypothesis;
determining a second probability corresponding to the second hypothesis;
determining a first weighted duration value by using the first probability to adjust a first value representing a first time duration of the first non-speech;
determining a second weighted duration value by using the second probability to adjust a second value representing a second time duration of the second non-speech;
combining at least the first weighted duration value and the second weighted duration value to determine a third value representing an expected non-speech time duration preceding the first point in the audio data;
determining that the third value exceeds a threshold duration value; and
determining, based at least in part on determining that the third value exceeds the threshold duration value, that an endpoint of speech occurs at the first point in the audio data.
View Dependent Claims (6, 7, 8, 9, 10, 17, 18, 19)
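Claim 5 recasts the computation as an expected non-speech duration: each hypothesis's trailing pause length is weighted by its probability, and the weighted values are combined and compared to a threshold duration. A minimal sketch, assuming durations in seconds and an illustrative threshold (the claim does not fix either unit or value):

```python
def expected_pause_seconds(hypotheses):
    """hypotheses: list of (probability, trailing_nonspeech_seconds) pairs.
    Returns the probability-weighted (expected) pause duration preceding
    the current point in the audio."""
    return sum(p * d for p, d in hypotheses)

def endpoint_detected(hypotheses, threshold_s=0.8):
    # threshold_s is an assumed threshold duration value, not from the claim
    return expected_pause_seconds(hypotheses) > threshold_s
```

For instance, hypotheses with probabilities 0.7 and 0.3 seeing 1.2 s and 0.5 s of trailing non-speech give an expected pause of 0.99 s, which would trigger the assumed 0.8 s threshold; the same probabilities with 0.4 s and 0.2 s pauses (expected 0.34 s) would not.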
11. A computing system, comprising:
at least one processor;
a memory including instructions operable to be executed by the at least one processor to cause the computing system to perform a set of actions comprising:
receiving audio data;
performing speech recognition processing on the audio data to determine a plurality of hypotheses, each hypothesis of the plurality comprising at least one of a representation of a respective subword unit or a representation of a respective word, wherein the plurality of hypotheses includes:
a first hypothesis comprising a first representation of first non-speech preceding a first point in the audio data, and
a second hypothesis comprising a second representation of second non-speech preceding the first point in the audio data;
determining a first probability corresponding to the first hypothesis;
determining a second probability corresponding to the second hypothesis;
determining a first weighted duration value by using the first probability to adjust a first value representing a first time duration of the first non-speech;
determining a second weighted duration value by using the second probability to adjust a second value representing a second time duration of the second non-speech;
combining at least the first weighted duration value and the second weighted duration value to determine a third value representing an expected non-speech time duration preceding the first point in the audio data;
determining that the third value exceeds a threshold duration value; and
determining, based at least in part on determining that the third value exceeds the threshold duration value, that an endpoint of speech occurs at the first point in the audio data.
View Dependent Claims (12, 13, 14, 15, 16, 20, 21, 22)
Specification