Speech endpointing based on word comparisons

US 10,546,576 B2
Filed: 10/09/2018
Issued: 01/28/2020
Est. Priority Date: 04/23/2014
Status: Active Grant

Abstract

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for speech endpointing based on word comparisons are described. In one aspect, a method includes the actions of obtaining a transcription of an utterance. The actions further include determining, as a first value, a quantity of text samples in a collection of text samples that (i) include terms that match the transcription, and (ii) do not include any additional terms. The actions further include determining, as a second value, a quantity of text samples in the collection of text samples that (i) include terms that match the transcription, and (ii) include one or more additional terms. The actions further include classifying the utterance as a likely incomplete utterance or not a likely incomplete utterance based at least on comparing the first value and the second value.
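The comparison described in the abstract can be illustrated with a minimal Python sketch. Several details here are assumptions made only for the example and are not stated in the abstract: the names `tokenize`, `count_matches`, `classify_utterance`, and `SAMPLES`; lowercase whitespace tokenization; treating "match" as an in-order prefix match; and the decision rule that labels an utterance likely incomplete when the second value exceeds the first. The abstract itself says only that the classification is based at least on comparing the two values.

```python
from typing import Iterable, List, Tuple


def tokenize(text: str) -> List[str]:
    # Assumption: terms are lowercase, whitespace-separated words.
    return text.lower().split()


def count_matches(transcription: str, samples: Iterable[str]) -> Tuple[int, int]:
    """Return (first_value, second_value) for a transcription.

    first_value:  samples whose terms match the transcription and that
                  contain no additional terms.
    second_value: samples that begin with the transcription's terms and
                  contain one or more additional terms.
    """
    query = tokenize(transcription)
    first_value = 0
    second_value = 0
    for sample in samples:
        terms = tokenize(sample)
        if terms == query:
            first_value += 1
        elif len(terms) > len(query) and terms[:len(query)] == query:
            second_value += 1
    return first_value, second_value


def classify_utterance(transcription: str, samples: Iterable[str]) -> str:
    first_value, second_value = count_matches(transcription, samples)
    # Assumed decision rule: more "continued" samples than exact ones
    # suggests the speaker has probably not finished.
    if second_value > first_value:
        return "likely incomplete"
    return "not likely incomplete"


# Hypothetical collection of text samples, e.g. past queries.
SAMPLES = [
    "what is the weather",
    "what is the weather",
    "what is the weather today",
    "what is the meaning of life",
    "what is the time",
    "call mom",
]

print(classify_utterance("what is", SAMPLES))
print(classify_utterance("what is the weather", SAMPLES))
```

With this hypothetical collection, "what is" has no exact matches and several longer matches, so the sketch labels it likely incomplete, while "what is the weather" has more exact matches than longer ones. A real system could use a ratio or a tuned threshold instead of a strict comparison.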

20 Claims

1. A computer-implemented method comprising:
- receiving, by a computing device, audio that includes an utterance spoken by a user;
  
  determining, by the computing device, that an energy level of the audio that includes the utterance is above a threshold energy level;
  
  while the computing device receives the audio that includes the utterance and while the energy level of the audio that includes the utterance remains above the threshold energy level, determining, by the computing device, to delay designating an endpoint of the utterance spoken by the user until the energy level of the audio that includes the utterance is below the threshold energy level;
  
  while the computing device receives the audio that includes the utterance and while the energy level of the audio that includes the utterance remains above the threshold energy level, obtaining, by the computing device, a transcription of the utterance spoken by the user; and
  
  while the computing device receives the audio that includes the utterance, while the energy level of the audio that includes the utterance remains above the threshold energy level, and based on the transcription of the utterance, overriding, by the computing device, the determination to delay designating an endpoint of the utterance spoken by the user and designating an endpoint of the utterance spoken by the user.
- Dependent claims (2, 3, 4, 5, 6, 7):
  - 2. The method of claim 1, comprising:
    - determining, by the computing device, that the utterance is likely complete, wherein the computing device overrides the determination to delay designating an endpoint of the utterance spoken by the user and designates an endpoint of the utterance based on determining that the utterance is likely complete.
  - 3. The method of claim 1, comprising:
    - comparing, by the computing device, the transcription to a collection of text samples, wherein the computing device overrides the determination to delay designating an endpoint of the utterance spoken by the user and designates an endpoint of the utterance based on comparing the transcription to a collection of text samples.
  - 4. The method of claim 3, comprising:
    - based on comparing the transcription to a collection of text samples, determining a number of text samples in the collection of text samples that match the transcription and do not include any additional terms, wherein the computing device overrides the determination to delay designating an endpoint of the utterance spoken by the user and designates an endpoint of the utterance based on the number of text samples in the collection of text samples that match the transcription and do not include any additional terms.
  - 5. The method of claim 4, comprising:
    - determining that terms in each of the text samples in the collection of text samples that match the transcription and do not include any additional terms occur in a same order as in the transcription, wherein the computing device overrides the determination to delay designating an endpoint of the utterance spoken by the user and designates an endpoint of the utterance based on determining that the terms in each of the text samples in the collection of text samples that match the transcription and do not include any additional terms occur in the same order as in the transcription.
  - 6. The method of claim 3, comprising:
    - based on comparing the transcription to a collection of text samples, determining a number of text samples in the collection of text samples that match the transcription and include one or more additional terms, wherein the computing device overrides the determination to delay designating an endpoint of the utterance spoken by the user and designates an endpoint of the utterance based on the number of text samples in the collection of text samples that match the transcription and include one or more additional terms.
  - 7. The method of claim 1, wherein:
    - receiving the audio that includes the utterance spoken by the user and that has the energy level that is above the threshold energy level comprises receiving, by a microphone of the computing device, the audio that includes the utterance, and overriding the determination to delay designating an endpoint of the utterance spoken by the user and designating an endpoint of the utterance comprises deactivating, by the computing device, the microphone while the energy level of the audio that includes the utterance and that is received by the microphone remains above the threshold energy level.
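
The flow recited in claim 1 can be sketched roughly as follows. This is a hedged illustration rather than the patented implementation: the function names `rms_energy` and `find_endpoint`, the use of root-mean-square amplitude as the "energy level", the frame-by-frame loop, and the `transcribe` and `is_likely_complete` callbacks are all hypothetical stand-ins introduced for the example.

```python
import math
from typing import Callable, Iterable, List, Optional, Sequence


def rms_energy(frame: Sequence[float]) -> float:
    # Assumption: "energy level" is modeled as RMS amplitude of a frame.
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def find_endpoint(
    frames: Iterable[Sequence[float]],
    threshold: float,
    transcribe: Callable[[List[Sequence[float]]], str],
    is_likely_complete: Callable[[str], bool],
) -> Optional[int]:
    """Return the index of the frame at which an endpoint is designated.

    Returns None if no endpoint is designated before the audio runs out.
    """
    received: List[Sequence[float]] = []
    speech_started = False
    for index, frame in enumerate(frames):
        received.append(frame)
        if rms_energy(frame) > threshold:
            speech_started = True
            # Default decision: delay the endpoint while energy stays
            # above the threshold. Override it when the transcription
            # obtained so far looks complete.
            if is_likely_complete(transcribe(received)):
                return index
        elif speech_started:
            # Energy fell below the threshold after speech was heard:
            # the ordinary, energy-based endpoint.
            return index
    return None


# Hypothetical stand-ins for a streaming recognizer and a completeness test.
WORDS = ["call", "mom", "right", "now"]


def fake_transcribe(received: List[Sequence[float]]) -> str:
    return " ".join(WORDS[:len(received)])


loud_frames = [[0.5, -0.5]] * 4
print(find_endpoint(loud_frames, 0.1, fake_transcribe, lambda t: t == "call mom"))
```

In this toy run every frame is above the threshold, yet the endpoint is designated at the second frame because the stub transcription is judged complete there. A caller following claim 7 might respond to that return value by deactivating the microphone.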

8. A system comprising:
- one or more computers; and
  
  one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising:
  
  receiving, by a computing device, audio that includes an utterance spoken by a user;
  
  determining, by the computing device, that an energy level of the audio that includes the utterance is above a threshold energy level;
  
  while the computing device receives the audio that includes the utterance and while the energy level of the audio that includes the utterance remains above the threshold energy level, determining, by the computing device, to delay designating an endpoint of the utterance spoken by the user until the energy level of the audio that includes the utterance is below the threshold energy level;
  
  while the computing device receives the audio that includes the utterance and while the energy level of the audio that includes the utterance remains above the threshold energy level, obtaining, by the computing device, a transcription of the utterance spoken by the user; and
  
  while the computing device receives the audio that includes the utterance, while the energy level of the audio that includes the utterance remains above the threshold energy level, and based on the transcription of the utterance, overriding, by the computing device, the determination to delay designating an endpoint of the utterance spoken by the user and designating an endpoint of the utterance spoken by the user.
- Dependent claims (9, 10, 11, 12, 13, 14):
  - 9. The system of claim 8, wherein the operations comprise:
    - determining, by the computing device, that the utterance is likely complete, wherein the computing device overrides the determination to delay designating an endpoint of the utterance spoken by the user and designates an endpoint of the utterance based on determining that the utterance is likely complete.
  - 10. The system of claim 8, wherein the operations comprise:
    - comparing, by the computing device, the transcription to a collection of text samples, wherein the computing device overrides the determination to delay designating an endpoint of the utterance spoken by the user and designates an endpoint of the utterance based on comparing the transcription to a collection of text samples.
  - 11. The system of claim 10, wherein the operations comprise:
    - based on comparing the transcription to a collection of text samples, determining a number of text samples in the collection of text samples that match the transcription and do not include any additional terms, wherein the computing device overrides the determination to delay designating an endpoint of the utterance spoken by the user and designates an endpoint of the utterance based on the number of text samples in the collection of text samples that match the transcription and do not include any additional terms.
  - 12. The system of claim 11, wherein the operations comprise:
    - determining that terms in each of the text samples in the collection of text samples that match the transcription and do not include any additional terms occur in a same order as in the transcription, wherein the computing device overrides the determination to delay designating an endpoint of the utterance spoken by the user and designates an endpoint of the utterance based on determining that the terms in each of the text samples in the collection of text samples that match the transcription and do not include any additional terms occur in the same order as in the transcription.
  - 13. The system of claim 10, wherein the operations comprise:
    - based on comparing the transcription to a collection of text samples, determining a number of text samples in the collection of text samples that match the transcription and include one or more additional terms, wherein the computing device overrides the determination to delay designating an endpoint of the utterance spoken by the user and designates an endpoint of the utterance based on the number of text samples in the collection of text samples that match the transcription and include one or more additional terms.
  - 14. The system of claim 8, wherein:
    - receiving the audio that includes the utterance spoken by the user and that has the energy level that is above the threshold energy level comprises receiving, by a microphone of the computing device, the audio that includes the utterance, and overriding the determination to delay designating an endpoint of the utterance spoken by the user and designating an endpoint of the utterance comprises deactivating, by the computing device, the microphone while the energy level of the audio that includes the utterance and that is received by the microphone remains above the threshold energy level.

15. A non-transitory computer-readable medium storing software comprising instructions executable by one or more computers which, upon such execution, cause the one or more computers to perform operations comprising:
- receiving, by a computing device, audio that includes an utterance spoken by a user;
  
  determining, by the computing device, that an energy level of the audio that includes the utterance is above a threshold energy level;
  
  while the computing device receives the audio that includes the utterance and while the energy level of the audio that includes the utterance remains above the threshold energy level, determining, by the computing device, to delay designating an endpoint of the utterance spoken by the user until the energy level of the audio that includes the utterance is below the threshold energy level;
  
  while the computing device receives the audio that includes the utterance and while the energy level of the audio that includes the utterance remains above the threshold energy level, obtaining, by the computing device, a transcription of the utterance spoken by the user; and
  
  while the computing device receives the audio that includes the utterance, while the energy level of the audio that includes the utterance remains above the threshold energy level, and based on the transcription of the utterance, overriding, by the computing device, the determination to delay designating an endpoint of the utterance spoken by the user and designating an endpoint of the utterance spoken by the user.
- Dependent claims (16, 17, 18, 19, 20):
  - 16. The medium of claim 15, wherein the operations comprise:
    - determining, by the computing device, that the utterance is likely complete, wherein the computing device overrides the determination to delay designating an endpoint of the utterance spoken by the user and designates an endpoint of the utterance based on determining that the utterance is likely complete.
  - 17. The medium of claim 15, wherein the operations comprise:
    - comparing, by the computing device, the transcription to a collection of text samples, wherein the computing device overrides the determination to delay designating an endpoint of the utterance spoken by the user and designates an endpoint of the utterance based on comparing the transcription to a collection of text samples.
  - 18. The medium of claim 17, wherein the operations comprise:
    - based on comparing the transcription to a collection of text samples, determining a number of text samples in the collection of text samples that match the transcription and do not include any additional terms, wherein the computing device overrides the determination to delay designating an endpoint of the utterance spoken by the user and designates an endpoint of the utterance based on the number of text samples in the collection of text samples that match the transcription and do not include any additional terms.
  - 19. The medium of claim 17, wherein the operations comprise:
    - based on comparing the transcription to a collection of text samples, determining a number of text samples in the collection of text samples that match the transcription and include one or more additional terms, wherein the computing device overrides the determination to delay designating an endpoint of the utterance spoken by the user and designates an endpoint of the utterance based on the number of text samples in the collection of text samples that match the transcription and include one or more additional terms.
  - 20. The medium of claim 15, wherein:
    - receiving the audio that includes the utterance spoken by the user and that has the energy level that is above the threshold energy level comprises receiving, by a microphone of the computing device, the audio that includes the utterance, and overriding the determination to delay designating an endpoint of the utterance spoken by the user and designating an endpoint of the utterance comprises deactivating, by the computing device, the microphone while the energy level of the audio that includes the utterance and that is received by the microphone remains above the threshold energy level.

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google LLC (Alphabet Inc.)
Inventors: Buchanan, Michael; Gupta, Pravir Kumar; Tandiono, Christopher Bo
Primary Examiner(s): Zhu, Richard Z
Application Number: US16/154,875
Publication Number: US 20190043480A1
Time in Patent Office: 476 Days
Field of Search: 704248
US Class Current
CPC Class Codes

G06F 40/216   using statistical methods

G06F 40/253   Grammatical analysis; Style...

G06F 40/284   Lexical analysis, e.g. toke...

G10L 15/04   Segmentation; Word boundary...

G10L 15/05   Word boundary detection

G10L 15/22   Procedures used during a sp...

G10L 15/26   Speech to text systems G10L...

G10L 17/06   Decision making techniques;...

G10L 2015/088   Word spotting

G10L 2015/223   Execution procedure of a sp...

G10L 2025/783   based on threshold decision

G10L 25/51   for comparison or discrimin...

G10L 25/78   Detection of presence or ab...

G10L 25/87   Detection of discrete point...

G10L 25/90   Pitch determination of spee...
