Speech recognition using repeated utterances

US 9,123,339 B1
Filed: 11/23/2010
Issued: 09/01/2015
Est. Priority Date: 11/23/2010
Status: Active Grant

Abstract

Subject matter described in this specification can be embodied in methods, computer program products and systems relating to speech-to-text conversion. A first spoken input is received from a user of an electronic device (an “original utterance”). Based on the original utterance, a first set of character string candidates is determined, each of which represents the original utterance converted to textual characters, and a selection of one or more of the character string candidates is provided in a format for display to the user. A second spoken input is received from the user and a determination is made that the second spoken input is a repeat utterance of the original utterance. Based on this determination and using the original utterance and the repeat utterance, a second set of character string candidates is determined.

23 Claims

1. A computer-implemented method, comprising:
- receiving, by a computing system and at a first time, a first spoken input from a user of an electronic device, the first spoken input comprising an original utterance by the user;
  
  based on the original utterance, determining, by the computing system, a first set of character string candidates wherein each character string candidate represents the first spoken input converted to textual characters, and wherein determining the first set of character string candidates comprises using a speech recognizer to determine a first word lattice that represents the first set of character string candidates and a first set of probabilities, each probability corresponding to a character string candidate;
  
  providing, for display to the user, a selection of one or more of the character string candidates in response to receiving the first spoken input;
  
  receiving, by the computing system and at a second time, a second spoken input from the user;
  
  determining, by the computing system, that the second spoken input is a repeat utterance of the original utterance;
  
  based on determining that the second spoken input is a repeat utterance of the original utterance, and using the original utterance and the repeat utterance, determining, by the computing system, a second set of character string candidates, wherein determining the second set of character string candidates using the original utterance and the repeat utterance comprises:
  
  using the speech recognizer and the first word lattice as a language model to determine a second word lattice that represents the second set of character string candidates and a second set of probabilities, each probability corresponding to a character string candidate of the second set of character string candidates;
  
  determining an intersection or union of the first word lattice and the second word lattice and, for each character string candidate included in the intersection or union, determining a combined probability based on the probabilities from the first set of probabilities and the second set of probabilities that correspond to the character string candidate; and
  
  determining a third set of character string candidates based on the intersection or union and the determined combined probabilities.
  - 2. The method of claim 1, wherein determining that the second spoken input is a repeat utterance of the original utterance comprises determining that the second spoken input was received within a predetermined time period of the first time.
  - 3. The method of claim 1, wherein determining that the second spoken input is a repeat utterance of the original utterance comprises generating a score that indicates an acoustic alignment between a first sequence of vectors that represents the first spoken input and a second sequence of vectors that represents the second spoken input and determining that the score exceeds a predetermined threshold score.
  - 4. The method of claim 1, wherein determining that the second spoken input is a repeat utterance of the original utterance comprises determining whether the first and second inputs were spoken by different voices.
  - 5. The method of claim 1, wherein determining the first word lattice and the second word lattice comprises using Hidden Markov Modeling.
  - 6. The method of claim 1, wherein the determining a combined probability includes weighting the probabilities from the first word lattice and the probabilities from the second word lattice differently when combining the probabilities.
  - 7. The method of claim 1, further comprising selecting for provision to the electronic device, a character string that is in the third set of character string candidates.
  - 8. The method of claim 7, wherein the selected character string is determined to be a better match for one or more of the original and repeat utterances than other character strings that appear in the third set of character string candidates.
  - 9. The method of claim 1, further comprising transmitting to the electronic device for selection by the user a list of one or more character strings that are determined to be present in the third set of character string candidates.
  - 10. The method of claim 1, further comprising transmitting to the electronic device for selection by the user a list of one or more character strings that are determined to be present in the second set of character string candidates.
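
The lattice-combination step recited in claim 1, together with the unequal weighting of claim 6, can be illustrated with a minimal Python sketch. It assumes each word lattice has already been expanded into a dictionary that maps candidate strings to probabilities. The function name `combine_candidates`, the weights, the floor probability used for the union case, and the log-linear combination rule are all illustrative assumptions; the claims do not specify how the combined probability is computed.

```python
import math

def combine_candidates(first, second, mode="intersection",
                       w_first=0.4, w_second=0.6, floor=1e-6):
    """Combine two {candidate: probability} maps into a ranked third set.

    Hypothetical helper: each dict stands in for a word lattice that has
    been expanded into candidate strings with probabilities.
    """
    if mode == "intersection":
        keys = first.keys() & second.keys()
    elif mode == "union":
        keys = first.keys() | second.keys()
    else:
        raise ValueError("mode must be 'intersection' or 'union'")

    combined = {}
    for cand in keys:
        # In union mode a candidate may be missing from one lattice; it then
        # receives a small floor probability (an assumption of this sketch).
        p1 = first.get(cand, floor)
        p2 = second.get(cand, floor)
        # Weighted log-linear combination (assumed rule, weights are arbitrary).
        combined[cand] = math.exp(w_first * math.log(p1) + w_second * math.log(p2))

    total = sum(combined.values())
    if total == 0:
        return []
    # Renormalize and sort best-first to form the third candidate set.
    return sorted(((c, p / total) for c, p in combined.items()),
                  key=lambda cp: cp[1], reverse=True)

# Made-up example data for the original and repeat utterances.
first = {"recognize speech": 0.5, "wreck a nice beach": 0.4, "recognize peach": 0.1}
second = {"recognize speech": 0.6, "recognize beach": 0.3, "wreck a nice beach": 0.1}
third = combine_candidates(first, second)
```

In this sketch the intersection keeps only candidates supported by both utterances, while `mode="union"` keeps every candidate and penalizes those seen only once. Giving the repeat utterance the larger weight is a design choice made here for illustration.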

11. A computer-implemented method, comprising:
- receiving, by a computing system and at a first time, a first spoken input from a user of an electronic device, the spoken input comprising an original utterance;
  
  based on the original utterance, determining, by the computing system, a first set of character string candidates and a confidence level corresponding to each character string candidate in the set, wherein each character string candidate represents the first spoken input converted to text;
  
  determining, by the computing system, that less than a threshold number of character string candidates in the set have a corresponding confidence level that meets or exceeds a predetermined threshold level and in response to the determination, requesting the user to provide a second spoken input;
  
  determining, using a speech recognizer, a first word lattice that represents the first set of character string candidates and a first set of probabilities, each probability corresponding to a character string candidate in the first word lattice;
  
  receiving, by the computing system and at a second time, the second spoken input from the user;
  
  determining, by the computing system and using the speech recognizer and the first word lattice as a language model, a second word lattice that represents a second set of character string candidates and a second set of probabilities, each probability corresponding to a character string candidate in the first word lattice;
  
  determining an intersection or union of the first word lattice and the second word lattice and, for each character string candidate included in the intersection or union, determining a combined probability based on the probabilities from the first set of probabilities and the second set of probabilities that correspond to the character string candidate;
  
  determining a third set of character string candidates based on the intersection or union and the determined combined probabilities;
  
  determining, by the computing system, a selection of one or more character string candidates from the third set of character string candidates; and
  
  transmitting, by the computing system, the selection of one or more character string candidates to the electronic device for display to the user.
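
The confidence test in claim 11, which triggers a request for a second spoken input, can be sketched as follows. The function name `needs_repeat`, the threshold values, and the prompt text are hypothetical; the claim only requires that fewer than a threshold number of candidates meet or exceed a predetermined confidence level.

```python
def needs_repeat(candidates, confidence_threshold=0.7, min_confident=1):
    """Return True when fewer than `min_confident` candidates reach the threshold.

    `candidates` is a list of (text, confidence) pairs. Both thresholds are
    illustrative values, not taken from the patent.
    """
    confident = sum(1 for _, conf in candidates if conf >= confidence_threshold)
    return confident < min_confident

# Made-up first-pass result in which no candidate is confident enough.
first_pass = [("call mom", 0.42), ("call tom", 0.38), ("cold mom", 0.20)]

if needs_repeat(first_pass):
    prompt = "Sorry, could you say that again?"  # hypothetical prompt text
else:
    prompt = None
```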

12. A non-transitory computer-readable medium having instructions encoded thereon, which, when executed by a processor, cause the processor to perform operations comprising:
- receiving, at a first time, a first spoken input from a user of an electronic device, the first spoken input comprising an original utterance by the user;
  
  based on the original utterance, determining a first set of character string candidates wherein each character string candidate represents the first spoken input converted to textual characters, and wherein determining the first set of character string candidates comprises using a speech recognizer to determine a first word lattice that represents the first set of character string candidates and a first set of probabilities, each probability corresponding to a character string candidate;
  
  providing for display to the user a selection of one or more of the character string candidates in response to receiving the first spoken input;
  
  receiving, at a second time, a second spoken input from the user;
  
  determining that the second spoken input is a repeat utterance of the original utterance;
  
  based on determining that the second spoken input is a repeat utterance of the original utterance, and using the original utterance and the repeat utterance, determining a second set of character string candidates, wherein determining the second set of character string candidates using the original utterance and the repeat utterance comprises:
  
  using the speech recognizer and the first word lattice as a language model to determine a second word lattice that represents the second set of character string candidates and a second set of probabilities, each probability corresponding to a character string candidate of the second set of character string candidates;
  
  determining an intersection or union of the first word lattice and the second word lattice and, for each character string candidate included in the intersection or union, determining a combined probability based on the probabilities from the first set of probabilities and the second set of probabilities that correspond to the character string candidate; and
  
  determining a third set of character string candidates based on the intersection or union and the determined combined probabilities.
  - 13. The non-transitory computer-readable medium of claim 12, wherein determining that the second spoken input is a repeat utterance of the original utterance comprises determining the second spoken input was received within a predetermined time period of the first time.
  - 14. The non-transitory computer-readable medium of claim 12, wherein determining that the second spoken input is a repeat utterance of the original utterance comprises generating a score that indicates an acoustic alignment between a first sequence of vectors that represents the first spoken input and a second sequence of vectors that represents the second spoken input and determining that the score exceeds a predetermined threshold score.
  - 15. The non-transitory computer-readable medium of claim 12, wherein determining that the second spoken input is a repeat utterance of the original utterance comprises determining whether the first and second inputs were spoken by different voices.
  - 16. The non-transitory computer-readable medium of claim 12, further comprising instructions to cause the processor to perform operations comprising:
    - transmitting to the electronic device for selection by the user a list of one or more character strings that are determined to be present in the third set of character string candidates.
  - 17. The non-transitory computer-readable medium of claim 12, further comprising instructions to cause the processor to perform operations comprising:
    - transmitting to the electronic device for selection by the user a list of one or more character strings that are determined to be present in the third set of character string candidates.
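
Two of the repeat-detection tests in the dependent claims, the time window (claims 2, 13, 19) and the acoustic alignment score (claims 3, 14, 20), can be sketched together. The claims present them as separate alternatives; combining them in one function is a choice made here for brevity. Dynamic time warping is used as one possible alignment measure, which is an assumption, as are the function names, the mapping from distance to score, and the threshold values.

```python
import math

def dtw_distance(seq_a, seq_b):
    """Length-normalized dynamic time warping distance between two
    sequences of feature vectors (each vector is a tuple of floats)."""
    if not seq_a or not seq_b:
        raise ValueError("both sequences must be non-empty")
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(seq_a[i - 1], seq_b[j - 1])  # Euclidean frame distance
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m] / (n + m)

def alignment_score(seq_a, seq_b):
    """Map the distance to a similarity in (0, 1]; higher means better aligned."""
    return 1.0 / (1.0 + dtw_distance(seq_a, seq_b))

def is_repeat(t_first, t_second, feats_first, feats_second,
              max_gap_s=15.0, min_score=0.5):
    """Treat the second input as a repeat if it arrives within the time
    window and its alignment score exceeds the threshold (assumed values)."""
    within_window = 0.0 <= (t_second - t_first) <= max_gap_s
    return within_window and alignment_score(feats_first, feats_second) > min_score

# Made-up feature sequences: `b` is a slower rendition of `a`, `c` is unrelated.
a = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
b = [(0.0, 0.0), (1.0, 1.0), (1.0, 1.0), (2.0, 2.0)]
c = [(5.0, 5.0), (6.0, 6.0), (7.0, 7.0)]
```

With these inputs, `is_repeat(0.0, 5.0, a, b)` accepts the slower rendition, while a 60-second gap or the unrelated sequence `c` is rejected.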

18. A system comprising:
- one or more computers;
  
  one or more data storage devices coupled to the one or more computers and storing instructions, which, when executed by the processor cause the one or more computers to perform operations comprising:
  
  receiving, at a first time, a first spoken input from a user of an electronic device, the first spoken input comprising an original utterance by the user;
  
  based on the original utterance, determining a first set of character string candidates wherein each character string candidate represents the first spoken input converted to textual characters, and wherein determining the first set of character string candidates comprises using a speech recognizer to determine a first word lattice that represents the first set of character string candidates and a first set of probabilities, each probability corresponding to a character string candidate;
  
  providing for display to the user a selection of one or more of the character string candidates in response to receiving the first spoken input;
  
  receiving, at a second time, a second spoken input from the user;
  
  determining that the second spoken input is a repeat utterance of the original utterance;
  
  based on determining that the second spoken input is a repeat utterance of the original utterance, and using the original utterance and the repeat utterance, determining a second set of character string candidates, wherein determining the second set of character string candidates using the original utterance and the repeat utterance comprises:
  
  using the speech recognizer and the first word lattice as a language model to determine a second word lattice that represents the second set of character string candidates and a second set of probabilities, each probability corresponding to a character string candidate of the second set of character string candidates;
  
  determining an intersection or union of the first word lattice and the second word lattice and, for each character string candidate included in the intersection or union, determining a combined probability based on the probabilities from the first set of probabilities and the second set of probabilities that correspond to the character string candidate; and
  
  determining a third set of character string candidates based on the intersection or union and the determined combined probabilities.
  - 19. The system of claim 18, wherein determining that the second spoken input is a repeat utterance of the original utterance comprises determining that the second spoken input was received within a predetermined time period of the first time.
  - 20. The system of claim 18, wherein determining that the second spoken input is a repeat utterance of the original utterance comprises generating a score that indicates an acoustic alignment between a first sequence of vectors that represents the first spoken input and a second sequence of vectors that represents the second spoken input and determining that the score exceeds a predetermined threshold score.
  - 21. The system of claim 18, wherein determining that the second spoken input is a repeat utterance of the original utterance comprises determining whether the first and second inputs were spoken by different voices.
  - 22. The system of claim 18, the instructions further comprising instructions to cause the processor to perform operations comprising:
    - transmitting to the electronic device for selection by the user a list of one or more character strings that are determined to be present in the third set of character string candidates.
  - 23. The system of claim 18, the instructions further comprising instructions to cause the processor to perform operations comprising:
    - transmitting to the electronic device for selection by the user a list of one or more character strings that are determined to be present in the third set of character string candidates.
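
The independent claims recite using the first word lattice as a language model when recognizing the second spoken input. One plausible reading, sketched below under stated assumptions, is to estimate word-sequence probabilities from the first-pass candidates and use them to rescore second-pass hypotheses. The bigram model, the additive smoothing, the function names, and the acoustic scores are all hypothetical; the claims do not describe how the lattice is converted into a language model.

```python
import math
from collections import defaultdict

def bigram_lm_from_candidates(candidates, smoothing=1e-3):
    """Build a smoothed bigram scorer from {candidate: probability}.

    The dict stands in for the first word lattice (an assumption).
    Returns a function mapping a text string to a log probability.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for text, prob in candidates.items():
        words = ["<s>"] + text.split() + ["</s>"]
        for prev, word in zip(words, words[1:]):
            counts[prev][word] += prob  # probability-weighted bigram count
    vocab = {w for nxt in counts.values() for w in nxt}

    def log_prob(text):
        words = ["<s>"] + text.split() + ["</s>"]
        total = 0.0
        for prev, word in zip(words, words[1:]):
            row = counts.get(prev, {})
            # Additive smoothing, with one extra slot for unseen words.
            denom = sum(row.values()) + smoothing * (len(vocab) + 1)
            total += math.log((row.get(word, 0.0) + smoothing) / denom)
        return total

    return log_prob

def rescore(acoustic_log_scores, lm, lm_weight=1.0):
    """Add the weighted language-model score to each acoustic log score."""
    return {text: score + lm_weight * lm(text)
            for text, score in acoustic_log_scores.items()}

# Made-up data: first-pass candidates and second-pass acoustic log scores.
first_pass = {"call mom": 0.6, "call tom": 0.4}
second_pass = {"call mom": -12.0, "cold mop": -11.5}

lm = bigram_lm_from_candidates(first_pass)
rescored = rescore(second_pass, lm)
best = max(rescored, key=rescored.get)
```

Here the hypothesis "cold mop" has the slightly better acoustic score, but the model derived from the first utterance favors "call mom", so the rescored ranking prefers the candidate consistent with both inputs.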

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Shaw, Hayden; Kristjansson, Trausti; Senior, Andrew W.
Primary Examiner(s): He, Jialong
Application Number: US12/953,344
Time in Patent Office: 1,743 Days
Field of Search: 704/251, 704/254, 704/256, 704/270
US Class Current: 1/1
CPC Class Codes:
G10L 15/10 using distance or distortio...
G10L 15/18 using natural language mode...
G10L 15/22 Procedures used during a sp...
G10L 2015/085 Methods for reducing search...
