Pronunciation learning through correction logs
First Claim
1. A method for unsupervised learning of new pronunciations for speech recognition based on a plurality of recognition events, the method comprising:
extracting a first recognition event from a plurality of recognition events, wherein the first recognition event comprises a failed voice task according to a plurality of definitions for one or more successfully completed tasks;
extracting a second recognition event from the plurality of recognition events, wherein the second recognition event comprises a successfully completed task according to a plurality of definitions for one or more successfully completed tasks;
determining the first recognition event as a misrecognized spoken utterance of at least one predicted intended word derived from the successfully completed task based at least in part on a plurality of features for misrecognition of speech;
associating the misrecognized spoken utterance and the at least one predicted intended word based on at least one feature;
generating at least one pronunciation from the misrecognized spoken utterance and the at least one predicted intended word;
based at least in part on the at least one pronunciation, determining a new pronunciation; and
applying said new pronunciation output for unsupervised learning to train speech recognition for processing subsequent voice recognition events.
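The steps of claim 1 can be read as a pipeline: partition recognition events into failed and successful tasks, predict the intended word for each failure, and derive new pronunciations from the pairing. The sketch below is illustrative only; all names (`RecognitionEvent`, `predict_intended`, `generate_prons`, `adjudicate`) are assumptions, not terms from the patent.

```python
# Hypothetical sketch of the claim-1 flow; every identifier here is illustrative.
from dataclasses import dataclass

@dataclass
class RecognitionEvent:
    audio_id: str
    hypothesis: str       # recognized text
    task_completed: bool  # per the definitions of successfully completed tasks

def learn_new_pronunciations(events, predict_intended, generate_prons, adjudicate):
    failed = [e for e in events if not e.task_completed]   # "first" events
    succeeded = [e for e in events if e.task_completed]    # "second" events
    candidates = []
    for miss in failed:
        # determine the misrecognized utterance's predicted intended word
        intended = predict_intended(miss, succeeded)
        if intended is None:
            continue
        # associate utterance and intended word, then generate pronunciations
        candidates.extend(generate_prons(miss.audio_id, intended))
    # adjudicate the candidates into the new pronunciation(s)
    return adjudicate(candidates)
```

The three callables stand in for the feature-based prediction, pronunciation generation, and adjudication steps the claim leaves abstract.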
Abstract
A pronunciation learning system for dynamically learning new pronunciations, assisted by user correction logs. The user correction logs provide a record of speech recognition events and subsequent user behavior that implicitly confirms or rejects the recognition result and/or reveals the user's intended words via subsequent input. The system analyzes the correction logs and distills them down to a set of words which lack acceptable pronunciations. Hypothetical pronunciations, constrained by spelling and other linguistic knowledge, are generated for each of the words. Offline recognition identifies the hypothetical pronunciations with a good acoustic match to the audio data likely to contain the words. The matching pronunciations are aggregated and adjudicated to select new pronunciations for the words, improving general or personalized recognition models.
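The abstract's generate-match-aggregate-adjudicate stages can be sketched as follows. This is a minimal illustration under assumptions: `hypothesize` stands in for spelling-constrained pronunciation generation, `acoustic_score` for the offline-recognition match, and the vote counting for aggregation and adjudication; none of these names come from the patent.

```python
# Illustrative sketch of the abstract's pipeline; all names are assumptions.
from collections import Counter

def select_new_pronunciations(correction_log, hypothesize, acoustic_score,
                              threshold=0.5, min_votes=2):
    """correction_log: (audio, word) pairs distilled from user correction logs,
    where each word currently lacks an acceptable pronunciation."""
    votes = {}  # word -> Counter of hypothetical pronunciations that matched
    for audio, word in correction_log:
        for pron in hypothesize(word):              # spelling-constrained hypotheses
            if acoustic_score(audio, pron) >= threshold:  # offline acoustic match
                votes.setdefault(word, Counter())[pron] += 1
    # adjudicate: keep each word's most frequent match, if seen often enough
    return {w: c.most_common(1)[0][0]
            for w, c in votes.items()
            if c.most_common(1)[0][1] >= min_votes}
```

Aggregating matches across many log entries before adjudicating is what keeps a single noisy utterance from installing a bad pronunciation.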
20 Claims
1. A method for unsupervised learning of new pronunciations for speech recognition based on a plurality of recognition events, the method comprising:
extracting a first recognition event from a plurality of recognition events, wherein the first recognition event comprises a failed voice task according to a plurality of definitions for one or more successfully completed tasks;
extracting a second recognition event from the plurality of recognition events, wherein the second recognition event comprises a successfully completed task according to the plurality of definitions for one or more successfully completed tasks;
determining the first recognition event as a misrecognized spoken utterance of at least one predicted intended word derived from the successfully completed task based at least in part on a plurality of features for misrecognition of speech;
associating the misrecognized spoken utterance and the at least one predicted intended word based on at least one feature;
generating at least one pronunciation from the misrecognized spoken utterance and the at least one predicted intended word;
based at least in part on the at least one pronunciation, determining a new pronunciation; and
applying said new pronunciation output for unsupervised learning to train speech recognition for processing subsequent voice recognition events.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
13. A system for unsupervised learning of new pronunciations for speech recognition based on a plurality of recognition events, the system comprising:
at least one processing unit; and
at least one memory storing computer executable instructions that, when executed by the at least one processing unit, perform a method for automatically processing misrecognition of voice input, the method comprising:
parsing a series of user interaction events related to voice recognition;
determining a voice misrecognition from the series of user interaction events, wherein the determining the voice misrecognition further comprises:
receiving a first query, wherein the first query is at least in part a speech utterance;
encoding the first query into at least one text data;
providing the at least one text data; and
receiving a second query without selecting the provided at least one text data within a predetermined time period;
determining the first query as a misrecognized spoken utterance of at least one predicted intended word derived from the second query;
associating the misrecognized spoken utterance and the at least one predicted intended word based on at least one feature;
generating at least one pronunciation from the misrecognized spoken utterance and the at least one predicted intended word;
based at least in part on the at least one pronunciation, determining a new pronunciation; and
applying said new pronunciation output for unsupervised learning to train speech recognition in processing subsequent voice recognition events.
- View Dependent Claims (14, 15, 16, 17)
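Claim 13's misrecognition signal is behavioral: a voice query whose result the user neither selects nor accepts, followed by a re-entered query within a time window. A minimal sketch of that detection logic, assuming a hypothetical time-ordered event log (the `voice_query`/`select`/`retype_query` event kinds and the window length are illustrative, not from the patent):

```python
# Sketch of the re-query misrecognition signal in claim 13; the event
# schema (t, kind, text) and the 30-second window are assumptions.
def detect_misrecognitions(events, window_s=30.0):
    """events: time-ordered (t, kind, text) tuples, where kind is one of
    'voice_query' (text = recognized result shown to the user),
    'select' (user accepted a provided result), or
    'retype_query' (user entered a new query, e.g. by typing)."""
    misses = []
    for i, (t, kind, text) in enumerate(events):
        if kind != "voice_query":
            continue
        for t2, kind2, text2 in events[i + 1:]:
            if t2 - t > window_s:
                break                      # window expired: no signal either way
            if kind2 == "select":
                break                      # result accepted: not a misrecognition
            if kind2 == "retype_query":
                # re-entry without selection implies misrecognition; the
                # second query predicts the intended words
                misses.append((text, text2))
                break
    return misses
```

Pairing the misrecognized text with the re-typed query is what supplies the (utterance, predicted intended word) association the later claim steps consume.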
18. A computer-readable storage device with a memory storing computer executable instructions which, when connected to and executed by at least one processor, perform a method for unsupervised learning of new pronunciation based on voice input, the method comprising:
extracting a first recognition event from a plurality of recognition events, wherein the first recognition event comprises a failed voice task according to a plurality of definitions for one or more successfully completed tasks;
extracting a second recognition event from the plurality of recognition events, wherein the second recognition event comprises a successfully completed task according to the plurality of definitions for one or more successfully completed tasks;
determining the first recognition event as a misrecognized spoken utterance of at least one predicted intended word derived from the successfully completed task based at least in part on a plurality of features for misrecognition of speech;
associating the misrecognized spoken utterance and the at least one predicted intended word based on at least one feature;
generating at least one pronunciation from the misrecognized spoken utterance and the at least one predicted intended word;
based at least in part on the at least one pronunciation, determining a new pronunciation; and
applying said new pronunciation output for unsupervised learning to train speech recognition in processing subsequent voice recognition events.
- View Dependent Claims (19, 20)
Specification