PRONUNCIATION LEARNING THROUGH CORRECTION LOGS

US 20150243278A1
Filed: 02/21/2014
Published: 08/27/2015
Est. Priority Date: 02/21/2014
Status: Active Grant

Abstract

A new pronunciation learning system for dynamically learning new pronunciations assisted by user correction logs. The user correction logs provide a record of speech recognition events and subsequent user behavior that implicitly confirms or rejects the recognition result and/or shows the user's intended words via subsequent input. The system analyzes the correction logs and distills them down to a set of words which lack acceptable pronunciations. Hypothetical pronunciations, constrained by spelling and other linguistic knowledge, are generated for each of the words. Offline recognition determines the hypothetical pronunciations with a good acoustical match to the audio data likely to contain the words. The matching pronunciations are aggregated and adjudicated to select new pronunciations for the words to improve general or personalized recognition models.
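
The generation step in the abstract (hypothetical pronunciations constrained by spelling) can be illustrated with a minimal Python sketch. The letter-to-phoneme table, the function name `hypothetical_pronunciations`, and the `limit` parameter are assumptions made for illustration; the source does not specify how candidates are enumerated, and a real system would use multi-letter graphemes and scored linguistic rules.

```python
from itertools import product

# Hypothetical letter-to-phoneme options (illustrative only, not from the patent).
GRAPHEME_PHONEMES = {
    "c": ["k", "s"],
    "a": ["ae", "ey"],
    "t": ["t"],
}


def hypothetical_pronunciations(word, table=GRAPHEME_PHONEMES, limit=50):
    """Enumerate candidate phoneme strings for a word, treating each letter as one grapheme.

    Letters missing from the table fall back to themselves as a placeholder phone.
    At most `limit` candidates are returned.
    """
    options = [table.get(ch, [ch]) for ch in word.lower()]
    candidates = []
    for combo in product(*options):
        candidates.append(" ".join(combo))
        if len(candidates) >= limit:
            break
    return candidates


print(hypothetical_pronunciations("cat"))
```

Each candidate would then be placed in a recognition grammar and matched against the stored audio, which is outside the scope of this sketch.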

20 Claims

1. A method for dynamically learning new pronunciations for speech recognition assisted by subsequent user inputs, the method comprising:
- generating hypothetical pronunciations for a misrecognized word in spoken utterances based on a predicted intended word derived from corresponding subsequent user inputs;
  
  recognizing misrecognized spoken utterances using a language model containing the hypothetical pronunciations to find matching hypothetical pronunciations; and
  
  accepting a new pronunciation for the predicted intended word from the matching hypothetical pronunciations.
  - 2. The method of claim 1 further comprising the act of determining that a spoken utterance was misrecognized based on a user confirmation.
  - 3. The method of claim 1 further comprising the act of determining that subsequent user inputs correspond to spoken utterances.
  - 4. The method of claim 3 wherein the act of determining that subsequent user inputs correspond to spoken utterances containing the misrecognized word further comprises the acts of:
    - determining that tasks initiated by spoken utterances were not successfully completed;
      
      determining that tasks initiated by subsequent user inputs were successfully completed; and
      
      pairing tasks initiated by spoken utterances that were not successfully completed with tasks initiated by subsequent user inputs that were successfully completed into successive input pairs.
  - 5. The method of claim 4 wherein the act of determining that subsequent user inputs correspond to spoken utterances containing the misrecognized word further comprises the acts of determining that successive input pairs correspond to spoken utterances containing the misrecognized word and subsequent user inputs correcting the misrecognized word using at least one of session features, lexical features, and phonetic features.
  - 6. The method of claim 4 wherein the act of determining that subsequent user inputs correspond to spoken utterances containing the misrecognized word further comprises the acts of determining that subsequent user inputs are closely related in time to spoken utterances containing the misrecognized word.
  - 7. The method of claim 4 wherein the act of determining that subsequent user inputs correspond to spoken utterances containing the misrecognized word further comprises the acts of determining that subsequent user inputs are lexically similar to spoken utterances containing the misrecognized word.
  - 8. The method of claim 4 wherein the act of determining that subsequent user inputs correspond to spoken utterances containing the misrecognized word further comprises the acts of determining that subsequent user inputs are phonetically similar to spoken utterances containing the misrecognized word.
  - 9. The method of claim 3 wherein the act of accepting a new pronunciation for the predicted intended word from the matching hypothetical pronunciations further comprises the act of accepting the most frequent matching hypothetical pronunciation as the new pronunciation for the predicted intended word.
  - 10. The method of claim 9 wherein the act of selecting a new pronunciation for the predicted intended word from the matching hypothetical pronunciations further comprises the acts of:
    - determining how often phonemes occur in the matching hypothetical pronunciations for each grapheme; and
      
      accepting the most frequent matching hypothetical pronunciation as the new pronunciation if the most frequent phonemes for each common phone occur in the most frequent matching hypothetical pronunciation.
  - 11. The method of claim 1 wherein the act of generating hypothetical pronunciations for a misrecognized word in spoken utterances based on a predicted intended word from corresponding subsequent user inputs further comprises the acts of selecting possible phonemes corresponding to graphemes in the predicted intended word according to linguistic knowledge.
  - 12. The method of claim 1 further comprising the act of collecting audio of recognized spoken utterances, subsequent user inputs following spoken utterances, and information indicating whether tasks initiated by the spoken utterances and subsequent user inputs were successful.
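
The pairing logic in claims 4 through 7 (a failed voice-initiated task followed closely by a successful, lexically similar input) might be sketched as follows. This is a minimal illustration, not the claimed implementation: the `InputEvent` record, the `max_gap` and `min_similarity` thresholds, and the use of `difflib.SequenceMatcher` as the lexical similarity measure are all assumptions.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class InputEvent:
    text: str         # recognition result or typed text
    timestamp: float  # seconds since session start
    is_voice: bool    # True if the input was a spoken utterance
    succeeded: bool   # True if the task it initiated completed successfully


def pair_corrections(events, max_gap=30.0, min_similarity=0.5):
    """Return (voice_event, correction_event) pairs from a time-ordered event list.

    A pair is kept when a failed voice input is immediately followed by a
    successful input that is close in time and lexically similar.
    Both thresholds are illustrative assumptions.
    """
    pairs = []
    for prev, nxt in zip(events, events[1:]):
        if not (prev.is_voice and not prev.succeeded and nxt.succeeded):
            continue
        if nxt.timestamp - prev.timestamp > max_gap:
            continue
        similarity = SequenceMatcher(None, prev.text.lower(), nxt.text.lower()).ratio()
        if similarity >= min_similarity:
            pairs.append((prev, nxt))
    return pairs
```

A phonetic similarity measure (claim 8) could replace or supplement the character-level ratio used here.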

13. A speech recognition system for assisted dynamic learning of new pronunciations for user input, the speech recognition system comprising:
- a recognition event store storing recognition event data for tasks initiated by spoken utterances, the recognition event data including audio data of the spoken utterances, the recognition results obtained by decoding the spoken utterances, subsequent user inputs, and indicators of whether outcomes of the recognition results and the subsequent user inputs were accepted or rejected by users;
  
  an event classifier operable to classify recognition results as misrecognized spoken utterances based on an indication that outcomes of recognition results were not accepted while outcomes of subsequent user inputs were successful and a determination that a subsequent user input and recognition result pair from a single source have significant similarity, and operable to identify misrecognized portions of the recognition results from the subsequent user inputs;
  
  a pronunciation generator operable to generate hypothetical pronunciations for the misrecognized portions using the corresponding portions of the subsequent user inputs;
  
  a speech recognizer operable to match hypothetical pronunciations with the audio data of spoken utterances that produced recognition results classified as misrecognized spoken utterances; and
  
  an aggregation adjudicator operable to select new pronunciations for the misrecognized words from the matching pronunciations.
  - 14. The speech recognition system of claim 13 wherein the aggregation adjudicator is further operable to aggregate the matching pronunciations and select a new pronunciation based on the occurrence frequency of the matching pronunciations.
  - 15. The speech recognition system of claim 14 wherein the aggregation adjudicator is operable to confirm a selection of the new pronunciation by determining that the new pronunciation is constructed from the most popular phonetic units for each common phone in the aggregated matching pronunciations.
  - 16. The speech recognition system of claim 13 wherein the event classifier uses temporal proximity of the subsequent user input and recognition result pair and at least one lexical or phonetic similarity measure to determine that the subsequent user input is a corrective input for a misrecognized recognition result.
  - 17. The speech recognition system of claim 13 wherein the pronunciation generator is operable to identify possible graphic and phonetic combinations for the misrecognized portions using the corresponding portions of the subsequent user inputs, create a model to score the possible graphic and phonetic combinations using linguistic knowledge, and build a custom grammar from a set of the possible graphic and phonetic combinations selected based on scores obtained using the model.
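
The aggregation and adjudication described in claims 14 and 15 (and in method claims 9 and 10) can be approximated with the short sketch below. It assumes each matching pronunciation is a tuple of phone labels and that pronunciations of equal length can be compared position by position; that positional alignment is a simplification introduced here, and the function name `adjudicate` is hypothetical.

```python
from collections import Counter


def adjudicate(matches):
    """Pick the most frequent pronunciation, then confirm it position by position.

    `matches` is a list of phoneme tuples, one per recognized utterance.
    Returns the accepted pronunciation, or None if the most frequent
    pronunciation is not built from the most popular phone at every position.
    """
    if not matches:
        return None
    best, _ = Counter(matches).most_common(1)[0]
    # Simplifying assumption: only same-length pronunciations are compared.
    same_length = [m for m in matches if len(m) == len(best)]
    for position, phone in enumerate(best):
        top_phone, _ = Counter(m[position] for m in same_length).most_common(1)[0]
        if top_phone != phone:
            return None
    return best
```

Returning `None` when the confirmation fails leaves the word without a new pronunciation until more matching evidence accumulates, which is one plausible reading of the confirmation step.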

18. A computer readable medium containing computer executable instructions which, when executed by a computer, perform a method for assisted dynamic learning of new pronunciations for user input, the method comprising:
- determining that a spoken utterance is likely to have been misrecognized based on implicit indications that a voice input based on the spoken utterance was unsuccessful and a subsequent user input similar to the voice input was successful;
  
  identifying a difference in linguistic unit between the voice input and the subsequent user input;
  
  generating hypothetical pronunciations for a misrecognized word in spoken utterances based on a similar word from corresponding subsequent user inputs;
  
  recognizing misrecognized spoken utterances using a language model containing the hypothetical pronunciations to find matching hypothetical pronunciations; and
  
  accepting a new pronunciation for the similar word from the matching hypothetical pronunciations.
  - 19. The computer readable medium of claim 18 wherein the method further comprises the act of adding the accepted pronunciation to a general language model used to provide real-time speech recognition to multiple users.
  - 20. The computer readable medium of claim 18 wherein the method further comprises the act of adding the accepted pronunciation to a personal language model used to provide real-time speech recognition to a single user.
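
The step in claim 18 of identifying a difference in linguistic unit between the voice input and the subsequent user input could, under the assumption that the unit is a whitespace-delimited word, be sketched with a word-level diff. The function name `differing_words` and the use of `difflib.SequenceMatcher` are illustrative choices, not taken from the source.

```python
from difflib import SequenceMatcher


def differing_words(recognized, corrected):
    """Return (misrecognized, intended) word spans that differ between two inputs.

    Only substitutions are reported; pure insertions and deletions are ignored
    in this simplified sketch.
    """
    a = recognized.lower().split()
    b = corrected.lower().split()
    diffs = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, a, b).get_opcodes():
        if tag == "replace":
            diffs.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return diffs


print(differing_words("call john smyth", "call john smith"))
```

The second element of each pair is the word for which hypothetical pronunciations would then be generated.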

Current Assignee
Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee
Microsoft Corporation
Inventors
Kibre, Nicholas; Parthasarathy, Sarangarajan; Ozertem, Umut; Al Bawab, Ziad

Granted Patent

US 9,589,562 B2
Field of Search
US Class Current

1/1
CPC Class Codes

G10L 15/01   Assessment or evaluation of...

G10L 15/075   supervised, i.e. under mach...

G10L 15/187   Phonemic context, e.g. pron...

G10L 2015/025   Phonemes, fenemes or fenone...

G10L 2015/225   Feedback of the input speech
