Speech endpointing based on word comparisons
First Claim
1. A computer-implemented method comprising:
receiving, from a given user and by a microphone of a mobile device that includes (i) the microphone, (ii) an automated speech recognition system, and (iii) an end of utterance detector that is configured to identify an endpoint of an utterance spoken by a user in response to determining that a speaker has stopped speaking for a fixed duration, a first utterance;
determining, by the end of utterance detector, that the given user has stopped speaking for the fixed duration after the first utterance;
generating, by the automated speech recognition system, a first transcription of the first utterance;
based on the first transcription of the first utterance, maintaining the microphone in an active state without endpointing the first utterance;
after the given user has stopped speaking for at least the fixed duration after the first utterance, receiving, by the microphone and from the given user, a second utterance;
generating, by the automated speech recognition system, a second transcription of the second utterance;
based on both the first transcription and the second transcription, deactivating the microphone and endpointing the second utterance;
in response to endpointing the second utterance, submitting, by the mobile device, a single search query that includes both the first transcription and the second transcription;
receiving, by the mobile device, search results in response to the single search query that includes both the first transcription and the second transcription; and
providing, for output by the mobile device, the search results.
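The claimed flow can be sketched as follows. This is an illustrative reading of the claim, not the patented implementation: after each fixed-duration pause the device endpoints only if the accumulated transcription looks complete; otherwise the microphone stays active and a later utterance is merged into a single search query. All helper names and the completeness test are hypothetical.

```python
# Hypothetical sketch of the claimed endpointing flow. The completeness
# check here (membership in a set of known full queries) stands in for
# whatever classifier the device actually uses.

FIXED_PAUSE_SECONDS = 1.0  # example value; the claim requires only "a fixed duration"

def looks_complete(transcription: str, complete_queries: set[str]) -> bool:
    """Hypothetical completeness check: treat known full queries as complete."""
    return transcription in complete_queries

def endpoint_session(utterances: list[str], complete_queries: set[str]) -> str:
    """Accumulate transcriptions across fixed-duration pauses until the combined
    text looks complete, then endpoint and return the single combined query."""
    collected: list[str] = []
    for transcription in utterances:  # each item arrives after a fixed-duration pause
        collected.append(transcription)
        combined = " ".join(collected)
        if looks_complete(combined, complete_queries):
            # Endpoint: deactivate the microphone and submit one search query
            # that includes every transcription received so far.
            return combined
    return " ".join(collected)  # no confident endpoint; return what was collected

# Example: "what is" alone is not a known complete query, so the microphone
# stays active; the second utterance completes the query and triggers endpointing.
query = endpoint_session(["what is", "the weather"], {"what is the weather"})
print(query)
```

In this reading, the decision to keep the microphone active after the first pause is driven by the first transcription alone, while the endpointing decision is driven by the first and second transcriptions together, matching the two "based on" limitations in the claim.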
Abstract
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for speech endpointing based on word comparisons are described. In one aspect, a method includes the actions of obtaining a transcription of an utterance. The actions further include determining, as a first value, a quantity of text samples in a collection of text samples that (i) include terms that match the transcription, and (ii) do not include any additional terms. The actions further include determining, as a second value, a quantity of text samples in the collection of text samples that (i) include terms that match the transcription, and (ii) include one or more additional terms. The actions further include classifying the utterance as a likely incomplete utterance or not a likely incomplete utterance based at least on comparing the first value and the second value.
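The abstract's classification step can be sketched directly. This is a minimal sketch under assumptions: "terms that match the transcription" is read as a whitespace-token prefix match, and the comparison rule (more continuations than exact matches means likely incomplete) is one plausible choice, since the abstract only says the two values are compared.

```python
# Minimal sketch of the abstract's two-count classification, assuming
# whitespace tokenization and a simple majority comparison rule.

def classify_incomplete(transcription: str, text_samples: list[str]) -> bool:
    """Return True if the utterance is classified as likely incomplete."""
    terms = transcription.split()
    exact_matches = 0         # first value: samples matching with no additional terms
    continuation_matches = 0  # second value: samples matching plus one or more additional terms
    for sample in text_samples:
        sample_terms = sample.split()
        if sample_terms[:len(terms)] == terms:
            if len(sample_terms) == len(terms):
                exact_matches += 1
            else:
                continuation_matches += 1
    # Assumed comparison rule; the abstract states only that the values are compared.
    return continuation_matches > exact_matches

samples = ["what is the weather", "what is the time", "what is"]
print(classify_incomplete("what is", samples))           # continuations dominate
print(classify_incomplete("what is the time", samples))  # only an exact match
```

Intuitively, if users who said these words usually went on to say more, the current utterance is probably unfinished and endpointing should wait.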
23 Claims
1. A computer-implemented method comprising:
receiving, from a given user and by a microphone of a mobile device that includes (i) the microphone, (ii) an automated speech recognition system, and (iii) an end of utterance detector that is configured to identify an endpoint of an utterance spoken by a user in response to determining that a speaker has stopped speaking for a fixed duration, a first utterance;
determining, by the end of utterance detector, that the given user has stopped speaking for the fixed duration after the first utterance;
generating, by the automated speech recognition system, a first transcription of the first utterance;
based on the first transcription of the first utterance, maintaining the microphone in an active state without endpointing the first utterance;
after the given user has stopped speaking for at least the fixed duration after the first utterance, receiving, by the microphone and from the given user, a second utterance;
generating, by the automated speech recognition system, a second transcription of the second utterance;
based on both the first transcription and the second transcription, deactivating the microphone and endpointing the second utterance;
in response to endpointing the second utterance, submitting, by the mobile device, a single search query that includes both the first transcription and the second transcription;
receiving, by the mobile device, search results in response to the single search query that includes both the first transcription and the second transcription; and
providing, for output by the mobile device, the search results.
(Dependent claims: 2-19)
20. A system comprising:
one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising:
receiving, from a given user and by a microphone of a mobile device that includes (i) the microphone, (ii) an automated speech recognition system, and (iii) an end of utterance detector that is configured to identify an endpoint of an utterance spoken by a user in response to determining that a speaker has stopped speaking for a fixed duration, a first utterance;
determining, by the end of utterance detector, that the given user has stopped speaking for the fixed duration after the first utterance;
generating, by the automated speech recognition system, a first transcription of the first utterance;
based on the first transcription of the first utterance, maintaining the microphone in an active state without endpointing the first utterance;
after the given user has stopped speaking for at least the fixed duration after the first utterance, receiving, by the microphone and from the given user, a second utterance;
generating, by the automated speech recognition system, a second transcription of the second utterance;
based on both the first transcription and the second transcription, deactivating the microphone and endpointing the second utterance;
in response to endpointing the second utterance, submitting, by the mobile device, a single search query that includes both the first transcription and the second transcription;
receiving, by the mobile device, search results in response to the single search query that includes both the first transcription and the second transcription; and
providing, for output by the mobile device, the search results.
(Dependent claims: 21 and 22)
23. A non-transitory computer-readable medium storing software comprising instructions executable by one or more computers which, upon such execution, cause the one or more computers to perform operations comprising:
receiving, from a given user and by a microphone of a mobile device that includes (i) the microphone, (ii) an automated speech recognition system, and (iii) an end of utterance detector that is configured to identify an endpoint of an utterance spoken by a user in response to determining that a speaker has stopped speaking for a fixed duration, a first utterance;
determining, by the end of utterance detector, that the given user has stopped speaking for the fixed duration after the first utterance;
generating, by the automated speech recognition system, a first transcription of the first utterance;
based on the first transcription of the first utterance, maintaining the microphone in an active state without endpointing the first utterance;
after the given user has stopped speaking for at least the fixed duration after the first utterance, receiving, by the microphone and from the given user, a second utterance;
generating, by the automated speech recognition system, a second transcription of the second utterance;
based on both the first transcription and the second transcription, deactivating the microphone and endpointing the second utterance;
in response to endpointing the second utterance, submitting, by the mobile device, a single search query that includes both the first transcription and the second transcription;
receiving, by the mobile device, search results in response to the single search query that includes both the first transcription and the second transcription; and
providing, for output by the mobile device, the search results.
Specification