Identifying substitute pronunciations
Abstract
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, including selecting terms; obtaining an expected phonetic transcription of an idealized native speaker of a natural language speaking the terms; receiving audio data corresponding to a particular user speaking the terms in the natural language; obtaining, based on the audio data, an actual phonetic transcription of the particular user speaking the terms in the natural language; aligning the expected phonetic transcription of the idealized native speaker of the natural language with the actual phonetic transcription of the particular user; identifying, based on the aligning, a portion of the expected phonetic transcription that is different than a corresponding portion of the actual phonetic transcription; and based on identifying the portion of the expected phonetic transcription, designating the expected phonetic transcription as a substitute pronunciation for the corresponding portion of the actual phonetic transcription.
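The aligning, identifying, and designating steps summarized above can be illustrated with a minimal Python sketch. The source does not say which alignment algorithm is used; this sketch assumes the standard-library `difflib.SequenceMatcher` as a stand-in. The function name `find_substitutions` and the ARPAbet-style phone symbols are hypothetical and chosen only for illustration.

```python
from difflib import SequenceMatcher


def find_substitutions(expected, actual):
    """Align two phone sequences and return (actual, expected) phone pairs.

    Each returned pair maps the phones the user actually produced to the
    phones the idealized native speaker was expected to produce.
    Assumption: difflib alignment stands in for the unspecified aligner.
    """
    matcher = SequenceMatcher(None, expected, actual, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # "replace" marks a span where expected and actual phones differ.
        if tag == "replace":
            pairs.append((tuple(actual[j1:j2]), tuple(expected[i1:i2])))
    return pairs


# Hypothetical example: the term "think" spoken with "S" in place of "TH".
expected_phones = ["TH", "IH", "NG", "K"]
actual_phones = ["S", "IH", "NG", "K"]
substitutions = find_substitutions(expected_phones, actual_phones)
```

Here `substitutions` holds one pair mapping the actual phone `S` to the expected phone `TH`, which is the kind of pair the abstract describes as being designated a substitute pronunciation.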
22 Claims
1. A computer-implemented method comprising:
selecting, by a system that includes (i) a confusion matrix manager configured to store, for like-pronunciation groups, phones of expected phonetic transcriptions as substitute pronunciations for corresponding phones of actual phonetic transcriptions in a confusion matrix and (ii) an enhanced speech recognizer configured to obtain, from the confusion matrix manager, using output of an acoustic model associated with the enhanced speech recognizer, the substitute pronunciations before inputting the substitute pronunciations to a language model associated with the enhanced speech recognizer, one or more terms;
obtaining, by the system, an expected phonetic transcription of an idealized native speaker of a natural language speaking the one or more terms;
after obtaining the expected phonetic transcription of the idealized native speaker of the natural language speaking the one or more terms, receiving audio data corresponding to a particular user that is not the idealized native speaker of the natural language speaking the one or more terms in the natural language;
receiving data identifying a like-pronunciation group associated with the particular user;
obtaining, by the system, based on the audio data, an actual phonetic transcription of the particular user that is not the idealized native speaker of the natural language speaking the one or more terms in the natural language;
aligning, by the system, the expected phonetic transcription of the idealized native speaker of the natural language with the actual phonetic transcription of the particular user that is not the idealized native speaker of the natural language;
identifying, by the system, based on aligning the expected phonetic transcription of the idealized native speaker with the actual phonetic transcription of the particular user that is not the idealized native speaker of the natural language, one or more phones of the expected phonetic transcription that is different than one or more corresponding phones of the actual phonetic transcription;
in response to identifying, based on aligning the expected phonetic transcription of the idealized native speaker with the actual phonetic transcription of the particular user that is not the idealized native speaker of the natural language, designating, by the system, the one or more phones of the expected phonetic transcription as a substitute pronunciation for the corresponding phones of the actual phonetic transcription for other terms that (i) are spoken by other users that are also associated with the like-pronunciation group, and (ii) have a respective phonetic transcription that includes the one or more corresponding phones;
obtaining, by the enhanced speech recognizer that is configured to obtain, from the confusion matrix manager, using output of the acoustic model associated with the enhanced speech recognizer, the substitute pronunciations before inputting the substitute pronunciations to the language model associated with the enhanced speech recognizer, a transcription of another term that (i) is spoken by another user that is also associated with the like-pronunciation group, and (ii) has a phonetic transcription that includes the one or more corresponding phones based on the one or more phones of the expected phonetic transcription being designated as a substitute pronunciation for the one or more corresponding phones of the actual phonetic transcription; and
inputting, by the enhanced speech recognizer, using the transcription of another term that (i) is spoken by another user that is also associated with the like-pronunciation group, and (ii) has a phonetic transcription that includes the one or more corresponding phones, the one or more phones of the expected phonetic transcription that is designated as the substitute pronunciation for the one or more corresponding phones of the actual phonetic transcription to the language model associated with the enhanced speech recognizer.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20)
21. A system comprising:
one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising:
selecting, by a system that includes (i) a confusion matrix manager configured to store, for like-pronunciation groups, phones of expected phonetic transcriptions as substitute pronunciations for corresponding phones of actual phonetic transcriptions in a confusion matrix and (ii) an enhanced speech recognizer configured to obtain, from the confusion matrix manager, using output of an acoustic model associated with the enhanced speech recognizer, the substitute pronunciations before inputting the substitute pronunciations to a language model associated with the enhanced speech recognizer, one or more terms;
obtaining, by the system, an expected phonetic transcription of an idealized native speaker of a natural language speaking the one or more terms;
after obtaining the expected phonetic transcription of the idealized native speaker of the natural language speaking the one or more terms, receiving audio data corresponding to a particular user that is not the idealized native speaker of the natural language speaking the one or more terms in the natural language;
receiving data identifying a like-pronunciation group associated with the particular user;
obtaining, by the system, based on the audio data, an actual phonetic transcription of the particular user that is not the idealized native speaker of the natural language speaking the one or more terms in the natural language;
aligning, by the system, the expected phonetic transcription of the idealized native speaker of the natural language with the actual phonetic transcription of the particular user that is not the idealized native speaker of the natural language;
identifying, by the system, based on aligning the expected phonetic transcription of the idealized native speaker with the actual phonetic transcription of the particular user that is not the idealized native speaker of the natural language, one or more phones of the expected phonetic transcription that is different than one or more corresponding phones of the actual phonetic transcription;
in response to identifying, based on aligning the expected phonetic transcription of the idealized native speaker with the actual phonetic transcription of the particular user that is not the idealized native speaker of the natural language, designating, by the system, the one or more phones of the expected phonetic transcription as a substitute pronunciation for the corresponding phones of the actual phonetic transcription for other terms that (i) are spoken by other users that are also associated with the like-pronunciation group, and (ii) have a respective phonetic transcription that includes the one or more corresponding phones;
obtaining, by the enhanced speech recognizer that is configured to obtain, from the confusion matrix manager, using output of the acoustic model associated with the enhanced speech recognizer, the substitute pronunciations before inputting the substitute pronunciations to the language model associated with the enhanced speech recognizer, a transcription of another term that (i) is spoken by another user that is also associated with the like-pronunciation group, and (ii) has a phonetic transcription that includes the one or more corresponding phones based on the one or more phones of the expected phonetic transcription being designated as a substitute pronunciation for the one or more corresponding phones of the actual phonetic transcription; and
inputting, by the enhanced speech recognizer, using the transcription of another term that (i) is spoken by another user that is also associated with the like-pronunciation group, and (ii) has a phonetic transcription that includes the one or more corresponding phones, the one or more phones of the expected phonetic transcription that is designated as the substitute pronunciation for the one or more corresponding phones of the actual phonetic transcription to the language model associated with the enhanced speech recognizer.
22. A non-transitory computer-readable medium storing software comprising instructions executable by one or more computers which, upon such execution, cause the one or more computers to perform operations comprising:
selecting, by a system that includes (i) a confusion matrix manager configured to store, for like-pronunciation groups, phones of expected phonetic transcriptions as substitute pronunciations for corresponding phones of actual phonetic transcriptions in a confusion matrix and (ii) an enhanced speech recognizer configured to obtain, from the confusion matrix manager, using output of an acoustic model associated with the enhanced speech recognizer, the substitute pronunciations before inputting the substitute pronunciations to a language model associated with the enhanced speech recognizer, one or more terms;
obtaining, by the system, an expected phonetic transcription of an idealized native speaker of a natural language speaking the one or more terms;
after obtaining the expected phonetic transcription of the idealized native speaker of the natural language speaking the one or more terms, receiving audio data corresponding to a particular user that is not the idealized native speaker of the natural language speaking the one or more terms in the natural language;
receiving data identifying a like-pronunciation group associated with the particular user;
obtaining, by the system, based on the audio data, an actual phonetic transcription of the particular user that is not the idealized native speaker of the natural language speaking the one or more terms in the natural language;
aligning, by the system, the expected phonetic transcription of the idealized native speaker of the natural language with the actual phonetic transcription of the particular user that is not the idealized native speaker of the natural language;
identifying, by the system, based on aligning the expected phonetic transcription of the idealized native speaker with the actual phonetic transcription of the particular user that is not the idealized native speaker of the natural language, one or more phones of the expected phonetic transcription that is different than one or more corresponding phones of the actual phonetic transcription;
in response to identifying, based on aligning the expected phonetic transcription of the idealized native speaker with the actual phonetic transcription of the particular user that is not the idealized native speaker of the natural language, designating, by the system, the one or more phones of the expected phonetic transcription as a substitute pronunciation for the corresponding phones of the actual phonetic transcription for other terms that (i) are spoken by other users that are also associated with the like-pronunciation group, and (ii) have a respective phonetic transcription that includes the one or more corresponding phones;
obtaining, by the enhanced speech recognizer that is configured to obtain, from the confusion matrix manager, using output of the acoustic model associated with the enhanced speech recognizer, the substitute pronunciations before inputting the substitute pronunciations to the language model associated with the enhanced speech recognizer, a transcription of another term that (i) is spoken by another user that is also associated with the like-pronunciation group, and (ii) has a phonetic transcription that includes the one or more corresponding phones based on the one or more phones of the expected phonetic transcription being designated as a substitute pronunciation for the one or more corresponding phones of the actual phonetic transcription; and
inputting, by the enhanced speech recognizer, using the transcription of another term that (i) is spoken by another user that is also associated with the like-pronunciation group, and (ii) has a phonetic transcription that includes the one or more corresponding phones, the one or more phones of the expected phonetic transcription that is designated as the substitute pronunciation for the one or more corresponding phones of the actual phonetic transcription to the language model associated with the enhanced speech recognizer.
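The interaction the claims recite between the confusion matrix manager and the enhanced speech recognizer can be sketched roughly as follows. This is an illustrative assumption, not the patented implementation: the class layout, the method names `designate` and `substitute`, the group label, and the phone symbols are all hypothetical, and the sketch replaces phones unconditionally, whereas a real recognizer would likely weigh substitutes against other hypotheses.

```python
from collections import defaultdict


class ConfusionMatrixManager:
    """Hypothetical store of substitute pronunciations per group."""

    def __init__(self):
        # group id -> {actual phone tuple: expected phone tuple}
        self._matrix = defaultdict(dict)

    def designate(self, group, actual_phones, expected_phones):
        """Record expected phones as the substitute for actual phones."""
        if not actual_phones:
            raise ValueError("actual_phones must not be empty")
        self._matrix[group][tuple(actual_phones)] = tuple(expected_phones)

    def substitute(self, group, phones):
        """Rewrite acoustic-model phone output for a speaker in `group`.

        The result is what this sketch would pass on to a language model.
        """
        table = self._matrix.get(group, {})
        # Try longer stored phone spans before shorter ones.
        entries = sorted(table.items(), key=lambda kv: -len(kv[0]))
        out, i = [], 0
        while i < len(phones):
            for actual, expected in entries:
                if tuple(phones[i:i + len(actual)]) == actual:
                    out.extend(expected)
                    i += len(actual)
                    break
            else:
                # No stored substitute starts here; keep the phone as-is.
                out.append(phones[i])
                i += 1
        return out


manager = ConfusionMatrixManager()
# Learned from one user in the (hypothetical) group "group_a".
manager.designate("group_a", ["S"], ["TH"])
# Applied to a different term spoken by another user in the same group.
corrected = manager.substitute("group_a", ["S", "AE", "NG", "K"])
```

In this example `corrected` is the phone sequence `TH AE NG K`, which would be handed to the language model in place of the raw acoustic-model output. Phones from a speaker in a group with no stored entries pass through unchanged.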
Specification