Pronunciation learning through correction logs
First Claim
1. A method for unsupervised learning of new pronunciations for speech recognition based on a plurality of recognition events, the method comprising:
extracting a first recognition event from a plurality of recognition events, wherein the first recognition event comprises a failed voice task according to a plurality of definitions for one or more successfully completed tasks;
extracting a second recognition event from the plurality of recognition events, wherein the second recognition event comprises a successfully completed task according to a plurality of definitions for one or more successfully completed tasks;
determining the first recognition event as a misrecognized spoken utterance of at least one predicted intended word derived from the successfully completed task based at least in part on a plurality of features for misrecognition of speech;
associating the misrecognized spoken utterance and the at least one predicted intended word based on at least one feature;
generating at least one pronunciation from the misrecognized spoken utterance and the at least one predicted intended word;
based at least in part on the at least one pronunciation, determining a new pronunciation; and
applying said new pronunciation output for unsupervised learning to train speech recognition for processing subsequent voice recognition events.
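The steps of claim 1 can be read as a pipeline: partition recognition events into failed and successful tasks, predict the intended word for each failure, and derive new pronunciations from the pairing. The sketch below is illustrative only; all names (`RecognitionEvent`, `predict_intended`, `generate_prons`, `adjudicate`) are assumptions, not terms from the patent.

```python
# Hypothetical sketch of the claim-1 flow; every identifier here is illustrative.
from dataclasses import dataclass

@dataclass
class RecognitionEvent:
    audio_id: str
    hypothesis: str       # recognized text
    task_completed: bool  # per the definitions of successfully completed tasks

def learn_new_pronunciations(events, predict_intended, generate_prons, adjudicate):
    failed = [e for e in events if not e.task_completed]   # "first" events
    succeeded = [e for e in events if e.task_completed]    # "second" events
    candidates = []
    for miss in failed:
        # determine the misrecognized utterance's predicted intended word
        intended = predict_intended(miss, succeeded)
        if intended is None:
            continue
        # associate utterance and intended word, then generate pronunciations
        candidates.extend(generate_prons(miss.audio_id, intended))
    # adjudicate the candidates into the new pronunciation(s)
    return adjudicate(candidates)
```

The three callables stand in for the feature-based prediction, pronunciation generation, and adjudication steps the claim leaves abstract.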
Abstract
A pronunciation learning system for dynamically learning new pronunciations, assisted by user correction logs. The user correction logs provide a record of speech recognition events and subsequent user behavior that implicitly confirms or rejects the recognition result and/or reveals the user's intended words via subsequent input. The system analyzes the correction logs and distills them down to a set of words which lack acceptable pronunciations. Hypothetical pronunciations, constrained by spelling and other linguistic knowledge, are generated for each of the words. Offline recognition identifies the hypothetical pronunciations with a good acoustic match to the audio data likely to contain the words. The matching pronunciations are aggregated and adjudicated to select new pronunciations for the words, improving general or personalized recognition models.
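The abstract's generate-match-aggregate-adjudicate stages can be sketched as follows. This is a minimal illustration under assumptions: `hypothesize` stands in for spelling-constrained pronunciation generation, `acoustic_score` for the offline-recognition match, and the vote counting for aggregation and adjudication; none of these names come from the patent.

```python
# Illustrative sketch of the abstract's pipeline; all names are assumptions.
from collections import Counter

def select_new_pronunciations(correction_log, hypothesize, acoustic_score,
                              threshold=0.5, min_votes=2):
    """correction_log: (audio, word) pairs distilled from user correction logs,
    where each word currently lacks an acceptable pronunciation."""
    votes = {}  # word -> Counter of hypothetical pronunciations that matched
    for audio, word in correction_log:
        for pron in hypothesize(word):              # spelling-constrained hypotheses
            if acoustic_score(audio, pron) >= threshold:  # offline acoustic match
                votes.setdefault(word, Counter())[pron] += 1
    # adjudicate: keep each word's most frequent match, if seen often enough
    return {w: c.most_common(1)[0][0]
            for w, c in votes.items()
            if c.most_common(1)[0][1] >= min_votes}
```

Aggregating matches across many log entries before adjudicating is what keeps a single noisy utterance from installing a bad pronunciation.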
20 Claims
1. A method for unsupervised learning of new pronunciations for speech recognition based on a plurality of recognition events, the method comprising:
extracting a first recognition event from a plurality of recognition events, wherein the first recognition event comprises a failed voice task according to a plurality of definitions for one or more successfully completed tasks;
extracting a second recognition event from the plurality of recognition events, wherein the second recognition event comprises a successfully completed task according to the plurality of definitions for one or more successfully completed tasks;
determining the first recognition event as a misrecognized spoken utterance of at least one predicted intended word derived from the successfully completed task based at least in part on a plurality of features for misrecognition of speech;
associating the misrecognized spoken utterance and the at least one predicted intended word based on at least one feature;
generating at least one pronunciation from the misrecognized spoken utterance and the at least one predicted intended word;
based at least in part on the at least one pronunciation, determining a new pronunciation; and
applying said new pronunciation output for unsupervised learning to train speech recognition for processing subsequent voice recognition events.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
13. A system for unsupervised learning of new pronunciations for speech recognition based on a plurality of recognition events, the system comprising:
at least one processing unit; and
at least one memory storing computer executable instructions that, when executed by the at least one processing unit, perform a method for automatically processing misrecognition of voice input, the method comprising:
parsing a series of user interaction events related to voice recognition;
determining a voice misrecognition from the series of user interaction events, wherein the determining the voice misrecognition further comprises:
receiving a first query, wherein the first query is at least in part a speech utterance;
encoding the first query into at least one text data;
providing the at least one text data; and
receiving a second query without selecting the provided at least one text data within a predetermined time period;
determining the first query as a misrecognized spoken utterance of at least one predicted intended word derived from the second query;
associating the misrecognized spoken utterance and the at least one predicted intended word based on at least one feature;
generating at least one pronunciation from the misrecognized spoken utterance and the at least one predicted intended word;
based at least in part on the at least one pronunciation, determining a new pronunciation; and
applying said new pronunciation output for unsupervised learning to train speech recognition in processing subsequent voice recognition events.
- View Dependent Claims (14, 15, 16, 17)
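Claim 13's misrecognition signal is behavioral: a voice query whose result the user neither selects nor accepts, followed by a re-entered query within a time window. A minimal sketch of that detection logic, assuming a hypothetical time-ordered event log (the `voice_query`/`select`/`retype_query` event kinds and the window length are illustrative, not from the patent):

```python
# Sketch of the re-query misrecognition signal in claim 13; the event
# schema (t, kind, text) and the 30-second window are assumptions.
def detect_misrecognitions(events, window_s=30.0):
    """events: time-ordered (t, kind, text) tuples, where kind is one of
    'voice_query' (text = recognized result shown to the user),
    'select' (user accepted a provided result), or
    'retype_query' (user entered a new query, e.g. by typing)."""
    misses = []
    for i, (t, kind, text) in enumerate(events):
        if kind != "voice_query":
            continue
        for t2, kind2, text2 in events[i + 1:]:
            if t2 - t > window_s:
                break                      # window expired: no signal either way
            if kind2 == "select":
                break                      # result accepted: not a misrecognition
            if kind2 == "retype_query":
                # re-entry without selection implies misrecognition; the
                # second query predicts the intended words
                misses.append((text, text2))
                break
    return misses
```

Pairing the misrecognized text with the re-typed query is what supplies the (utterance, predicted intended word) association the later claim steps consume.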
18. A computer-readable storage device with a memory storing computer executable instructions which, when connected to and executed by at least one processor, perform a method for unsupervised learning of new pronunciation based on voice input, the method comprising:
extracting a first recognition event from a plurality of recognition events, wherein the first recognition event comprises a failed voice task according to a plurality of definitions for one or more successfully completed tasks;
extracting a second recognition event from the plurality of recognition events, wherein the second recognition event comprises a successfully completed task according to the plurality of definitions for one or more successfully completed tasks;
determining the first recognition event as a misrecognized spoken utterance of at least one predicted intended word derived from the successfully completed task based at least in part on a plurality of features for misrecognition of speech;
associating the misrecognized spoken utterance and the at least one predicted intended word based on at least one feature;
generating at least one pronunciation from the misrecognized spoken utterance and the at least one predicted intended word;
based at least in part on the at least one pronunciation, determining a new pronunciation; and
applying said new pronunciation output for unsupervised learning to train speech recognition in processing subsequent voice recognition events.
- View Dependent Claims (19, 20)
Specification