Pronunciation learning through correction logs
Abstract
A new pronunciation learning system for dynamically learning new pronunciations assisted by user correction logs. The user correction logs provide a record of speech recognition events and subsequent user behavior that implicitly confirms or rejects the recognition result and/or shows the user's intended words via subsequent input. The system analyzes the correction logs and distills them down to a set of words which lack acceptable pronunciations. Hypothetical pronunciations, constrained by spelling and other linguistic knowledge, are generated for each of the words. Offline recognition determines the hypothetical pronunciations with a good acoustical match to the audio data likely to contain the words. The matching pronunciations are aggregated and adjudicated to select new pronunciations for the words to improve general or personalized recognition models.
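To illustrate the spelling-constrained generation step described in the abstract, the following Python sketch enumerates candidate phone sequences for a word. The letter-to-phone table, the function name hypothetical_pronunciations, and the limit parameter are assumptions made for this example and do not come from the patent; a practical system would likely rely on a much richer letter-to-sound model.

```python
from itertools import product

# Hypothetical letter-to-phone alternatives (illustrative only, not from the patent).
LETTER_PHONES = {
    "a": ["AE", "AA", "EY"],
    "c": ["K", "S"],
    "t": ["T"],
}


def hypothetical_pronunciations(word, table=LETTER_PHONES, limit=50):
    """Enumerate phone sequences that the spelling of `word` allows.

    Each letter contributes one phone from its list of alternatives, so the
    spelling constrains the search space. At most `limit` candidates are kept.
    """
    options = []
    for letter in word.lower():
        if letter not in table:
            raise ValueError(f"no phone alternatives known for letter {letter!r}")
        options.append(table[letter])

    candidates = []
    for phones in product(*options):
        candidates.append(" ".join(phones))
        if len(candidates) >= limit:
            break
    return candidates


if __name__ == "__main__":
    for pronunciation in hypothetical_pronunciations("cat"):
        print(pronunciation)
```

Each candidate would then be scored against the stored audio by offline recognition; the cap on the number of candidates is one simple way to keep that search tractable.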
20 Claims
1. A method for dynamically learning new pronunciations for speech recognition assisted by subsequent user inputs, the method comprising:
determining that a task initiated by a spoken utterance was not completed successfully;
determining that a recognition result for the spoken utterance includes a misrecognized word based on the determination that the task initiated by the recognition result was not completed successfully;
determining that a subsequent task initiated by subsequent user inputs was completed successfully;
associating the spoken utterance with the subsequent user inputs based on similarity between the spoken utterance and the subsequent user inputs;
after associating the spoken utterance with the subsequent user inputs based on similarity between the spoken utterance and the subsequent user inputs, generating hypothetical pronunciations for the misrecognized word in the spoken utterance based on a predicted intended word derived from the associated subsequent user inputs;
recognizing the spoken utterance using a language model containing the hypothetical pronunciations to find matching hypothetical pronunciations; and
accepting a new pronunciation for the predicted intended word from the matching hypothetical pronunciations.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
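The first four steps of claim 1 (failed task, successful subsequent task, association by similarity) can be sketched roughly as a pass over a correction log. The Event record, the find_corrections function, the 0.6 threshold, and the use of text similarity between the recognition result and the later input are all assumptions for illustration; the claim does not specify how similarity is measured.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Event:
    """One hypothetical correction-log entry (field names are assumptions)."""
    user_id: str
    text: str             # recognition result, or the text the user typed
    spoken: bool          # True if the input was a spoken utterance
    task_succeeded: bool  # True if the task the input started was completed


def find_corrections(events, min_similarity=0.6):
    """Pair each failed spoken event with the next event from the same user
    when that next event succeeded and its text is similar enough."""
    pairs = []
    for prev, nxt in zip(events, events[1:]):
        if not prev.spoken or prev.task_succeeded:
            continue  # only failed spoken utterances are candidates
        if prev.user_id != nxt.user_id or not nxt.task_succeeded:
            continue  # the follow-up must be a successful task by the same user
        score = SequenceMatcher(None, prev.text.lower(), nxt.text.lower()).ratio()
        if score >= min_similarity:
            pairs.append((prev, nxt, score))
    return pairs


if __name__ == "__main__":
    log = [
        Event("u1", "call shawn smith", spoken=True, task_succeeded=False),
        Event("u1", "call sean smith", spoken=False, task_succeeded=True),
    ]
    for failed, fixed, score in find_corrections(log):
        print(f"{failed.text!r} -> {fixed.text!r} (similarity {score:.2f})")
```

Only adjacent events are compared here to keep the sketch short; a fuller version might look further ahead within a time window.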
13. A speech recognition system for assisted dynamic learning of new pronunciations for user input, the speech recognition system comprising:
at least one processor;
a memory connected to the at least one processor; and
a recognition event store encoded on the memory storing recognition event data for tasks initiated by spoken utterances, the recognition event data including audio data of the spoken utterances, recognition results obtained by decoding the spoken utterances, subsequent user inputs, and indicators of whether outcomes of tasks initiated based on the recognition results and the subsequent user inputs were accepted or rejected by users;
wherein the memory contains computer executable instructions for:
an event classifier configured to classify recognition results as misrecognized spoken utterances based on determining that tasks initiated by the recognition result were not completed successfully based on an indication that outcomes of the tasks were not accepted, determining that subsequent tasks initiated based on subsequent user inputs were completed successfully based on an indication that outcomes of the subsequent tasks being accepted, and determining that a subsequent user input and recognition result pair from a single source have significant similarity, and configured to identify misrecognized portions of the recognition results based on the subsequent user inputs;
a pronunciation generator configured to generate hypothetical pronunciations for the identified misrecognized portions using the corresponding portions of the subsequent user inputs after the event classifier has classified recognition results as misrecognized spoken utterances based on determining that tasks initiated by the recognition result were not completed successfully based on an indication that outcomes of the tasks were not accepted, determining that subsequent tasks initiated based on subsequent user inputs were completed successfully based on an indication that outcomes of the subsequent tasks being accepted, and determining that a subsequent user input and recognition result pair from a single source have significant similarity;
a speech recognizer configured to match hypothetical pronunciations with the audio data of spoken utterances that produced recognition results classified as misrecognized spoken utterances; and
an aggregation adjudicator configured to select new pronunciations for the misrecognized words from the matching pronunciations.
Dependent claims: 14, 15, 16, 17
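One possible reading of the aggregation adjudicator in claim 13 is a voting scheme over the pronunciations that the speech recognizer matched across many utterances. The sketch below is an assumption-laden illustration: the adjudicate function, the (word, pronunciation) input shape, and the min_votes and min_share thresholds are not taken from the patent, which does not state how adjudication is performed.

```python
from collections import Counter, defaultdict


def adjudicate(matches, min_votes=2, min_share=0.5):
    """Select at most one new pronunciation per word.

    `matches` is an iterable of (word, pronunciation) pairs, one per utterance
    in which offline recognition picked that hypothetical pronunciation.
    A word's top pronunciation is accepted only if it was matched at least
    `min_votes` times and accounts for at least `min_share` of that word's matches.
    """
    votes = defaultdict(Counter)
    for word, pronunciation in matches:
        votes[word][pronunciation] += 1

    accepted = {}
    for word, counter in votes.items():
        pronunciation, count = counter.most_common(1)[0]
        total = sum(counter.values())
        if count >= min_votes and count / total >= min_share:
            accepted[word] = pronunciation
    return accepted


if __name__ == "__main__":
    observed = [
        ("sean", "SH AO N"),
        ("sean", "SH AO N"),
        ("sean", "S IY N"),
        ("siobhan", "SH IH V AO N"),
    ]
    print(adjudicate(observed))
```

Requiring both a minimum count and a minimum share is one way to avoid accepting a pronunciation on the strength of a single noisy utterance.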
18. A system for learning new pronunciations, the system comprising:
at least one processor; and
a memory connected to the at least one processor and containing computer executable instructions which, when executed by the at least one processor, cause the at least one processor to perform a method for assisted dynamic learning of new pronunciations for user input, the method comprising:
determining that a spoken utterance has been misrecognized based on implicit indications that a task initiated based on the spoken utterance was unsuccessful and a task initiated based on a subsequent user input was successful and that the subsequent user input is similar to the spoken utterance;
after determining that a spoken utterance has been misrecognized based on implicit indications that a task initiated based on the spoken utterance was unsuccessful and a task initiated based on a subsequent user input was successful and that the subsequent user input is similar to the spoken utterance, identifying a misrecognized word based on a difference in linguistic unit between the spoken utterance and the subsequent user input;
after identifying a misrecognized word based on a difference in linguistic unit between the spoken utterance and the subsequent user input, generating hypothetical pronunciations for the identified misrecognized word in spoken utterances based on a similar word from corresponding subsequent user inputs;
recognizing misrecognized spoken utterances using a language model containing the hypothetical pronunciations to find matching hypothetical pronunciations; and
accepting a new pronunciation for the similar word from the matching hypothetical pronunciations.
Dependent claims: 19, 20
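The step in claim 18 of identifying a misrecognized word "based on a difference in linguistic unit" can be illustrated with a word-level alignment between the recognition result and the subsequent input. This is a minimal sketch that assumes the linguistic unit is a whitespace-delimited word and that both inputs are available as text; the function name misrecognized_words is hypothetical.

```python
from difflib import SequenceMatcher


def misrecognized_words(recognized, corrected):
    """Return (misrecognized word, intended word) pairs.

    The two strings are aligned word by word. Only aligned spans where the
    same number of words was replaced are reported, so each misrecognized
    word maps to exactly one intended word.
    """
    rec = recognized.lower().split()
    cor = corrected.lower().split()
    pairs = []
    matcher = SequenceMatcher(None, rec, cor)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            pairs.extend(zip(rec[i1:i2], cor[j1:j2]))
    return pairs


if __name__ == "__main__":
    print(misrecognized_words("call shawn smith", "call sean smith"))
```

The intended word from each pair ("sean" in the example) is the one for which hypothetical pronunciations would be generated in the following step.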
Specification