Pronunciation guided by automatic speech recognition
Abstract
Speech synthesis chooses pronunciations of words with multiple acceptable pronunciations based on an indication of a personal, class-based, or global preference or an intended non-preferred pronunciation. A speaker's words can be parroted back on personal devices using preferred pronunciations for accent training. Degrees of pronunciation error are computed and indicated to the user in a visual transcription or audibly as word emphasis in parroted speech. Systems can use sets of phonemes extended beyond those generally recognized for a language. Speakers are classified in order to choose specific phonetic dictionaries or adapt global ones. User profiles maintain lists of which pronunciations are preferred among ones acceptable for words with multiple recognized pronunciations. Systems use multiple correlations of word preferences across users to predict user preferences for unlisted words. Speaker-preferred pronunciations are used to weight the scores of transcription hypotheses based on phoneme sequence hypotheses in speech engines.
21 Claims
1. At least one non-transitory computer readable medium storing computer code that, if executed by at least one computer processor within a pronunciation training system, would cause the at least one computer processor to:

receive a spoken utterance from a speaker;
detect an end of the spoken utterance;
perform automatic speech recognition on the spoken utterance to produce a transcription of the words in the spoken utterance;
perform speech synthesis from the transcription to produce synthesized speech according to a phonetic dictionary having mappings to single phoneme sequences representing standard pronunciation; and
responsive to detecting the end of the spoken utterance, output the synthesized speech, wherein the system, in response to detecting the end of the spoken utterance, provides a user with an audible output of the transcription, spoken with the standard pronunciation.

(Dependent claims: 2-9)
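The parrot-back loop recited in claim 1 (detect the end of an utterance, transcribe it, then speak the transcription back with the dictionary's single standard pronunciation per word) can be sketched as follows. The recognizer, synthesizer, and the `<eou>` end-of-utterance marker are illustrative stand-ins, not anything specified by the patent.

```python
# Hedged sketch of claim 1's pipeline. All class and method names are
# hypothetical stubs standing in for real ASR/TTS components.

def parrot_back(audio, recognizer, synthesizer, phonetic_dictionary):
    """On detecting end of utterance: transcribe, then synthesize the
    transcription using one standard phoneme sequence per word."""
    if not recognizer.is_end_of_utterance(audio):
        return None
    words = recognizer.transcribe(audio)                 # ASR step
    phonemes = [phonetic_dictionary[w] for w in words]   # single mapping per word
    return synthesizer.synthesize(phonemes)              # standard-pronunciation audio

class StubRecognizer:
    def is_end_of_utterance(self, audio):
        return audio.endswith("<eou>")                   # toy end-of-utterance marker
    def transcribe(self, audio):
        return audio.replace("<eou>", "").split()

class StubSynthesizer:
    def synthesize(self, phoneme_seqs):
        # Stand-in for audio output: render the phoneme sequences as text.
        return " ".join("-".join(p) for p in phoneme_seqs)

dictionary = {"hello": ("HH", "AH", "L", "OW"), "world": ("W", "ER", "L", "D")}
print(parrot_back("hello world<eou>", StubRecognizer(), StubSynthesizer(), dictionary))
```

Note that output is suppressed (`None`) until the end of the utterance is detected, matching the "responsive to detecting the end" limitation.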
10. A method of determining which of a plurality of pronunciations of a word in a digital phonetic dictionary to use in speech synthesis, the method comprising:

receiving a spoken utterance from a speaker;
recognizing from the spoken utterance, using a speech engine, a word that has multiple known pronunciations;
selecting a preferred pronunciation from the multiple known pronunciations of the word that matches the speaker's pronunciation of the word used in the spoken utterance; and
storing, in a user profile associated with the speaker, an indication of the preferred pronunciation that the speaker used, wherein the speech engine uses the stored preferred pronunciation to select between multiple known pronunciations of the word when synthesizing the speaker's future spoken utterances that include the word.

(Dependent claims: 11-15)
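A minimal sketch of the per-speaker preference store described in claim 10, assuming a toy profile format (word mapped to a phoneme tuple); the names `UserProfile` and `select_pronunciation` are hypothetical, not from the patent.

```python
# Illustrative sketch of claim 10: record the pronunciation a speaker used,
# then prefer it when synthesizing that speaker's future utterances.

class UserProfile:
    """Per-speaker store mapping a word to the phoneme sequence the speaker used."""
    def __init__(self):
        self.preferred = {}  # word -> phoneme tuple

    def record(self, word, phonemes):
        self.preferred[word] = tuple(phonemes)

def select_pronunciation(profile, word, dictionary):
    """Use the speaker's stored preference if present, else the dictionary default."""
    options = dictionary[word]  # list of candidate phoneme tuples
    return profile.preferred.get(word, options[0])

dictionary = {
    "tomato": [("T", "AH", "M", "EY", "T", "OW"),
               ("T", "AH", "M", "AA", "T", "OW")],
}
profile = UserProfile()
profile.record("tomato", ("T", "AH", "M", "AA", "T", "OW"))  # speaker's variant
print(select_pronunciation(profile, "tomato", dictionary))
```

A fresh profile with no stored entry falls back to the dictionary's first listed pronunciation.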
16. A method of determining which of a plurality of pronunciations of a word in a phonetic dictionary to use in speech synthesis, the method comprising:

determining a text word to synthesize, the text word having multiple known pronunciations;
looking up, in a user profile, a preferred pronunciation of a profile word that has multiple known pronunciations;
calculating a correlation between the preferred pronunciation of the profile word and each of multiple pronunciations of the text word; and
choosing one of the multiple pronunciations of the text word at least partially based on its correlation to the preferred pronunciation of the profile word.

(Dependent claim: 17)
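One plausible reading of the correlation step in claim 16, using simple positionwise phoneme agreement as the correlation measure; the patent does not specify the metric, so this scoring is an assumption. The idea is that a speaker who prefers the "AA" variant of one word likely prefers the "AA" variant of a similar word.

```python
# Hypothetical correlation between a candidate pronunciation of the word to
# synthesize and the speaker's stored pronunciation of a related profile word.

def correlation(candidate, profile_pron):
    """Fraction of aligned positions where the phonemes agree (assumed metric)."""
    n = min(len(candidate), len(profile_pron))
    if n == 0:
        return 0.0
    return sum(a == b for a, b in zip(candidate, profile_pron)) / n

def choose_pronunciation(candidates, profile_pron):
    """Pick the candidate most correlated with the profile-word preference."""
    return max(candidates, key=lambda c: correlation(c, profile_pron))

profile_pron = ("T", "AH", "M", "AA", "T", "OW")   # speaker prefers "tom-AH-to"
candidates = [("P", "AH", "T", "EY", "T", "OW"),   # "po-TAY-to"
              ("P", "AH", "T", "AA", "T", "OW")]   # "po-TAH-to"
print(choose_pronunciation(candidates, profile_pron))
```

Here the "AA" candidate scores 4/6 against the profile pronunciation versus 3/6 for the "EY" candidate, so the speaker's vowel preference carries over to the unlisted word.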
18. A method of configuring a phonetic dictionary, the method comprising:

analyzing, across a multiplicity of users, the users' profile word lists for a plurality of pronunciations of at least one word recognized with multiple pronunciations;
determining, based on the analysis, that a pronunciation most frequently used across the multiplicity of users is a preferred one of the plurality of pronunciations; and
updating the phonetic dictionary to indicate the preferred one of the plurality of pronunciations for speech recognition.

(Dependent claims: 19, 20)
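The crowd-frequency analysis in claim 18 reduces to a vote count across user profiles. A sketch under an assumed profile format (one dict per user, word mapped to a phoneme tuple):

```python
from collections import Counter

# Illustrative sketch of claim 18: the pronunciation stored most often across
# user profiles becomes the dictionary's preferred one. Profile format assumed.

def crowdsourced_preference(profiles, word):
    """Return the pronunciation of `word` most frequently stored across users,
    or None if no profile lists the word."""
    counts = Counter(p[word] for p in profiles if word in p)
    if not counts:
        return None
    pron, _ = counts.most_common(1)[0]
    return pron

profiles = [
    {"either": ("IY", "DH", "ER")},
    {"either": ("AY", "DH", "ER")},
    {"either": ("IY", "DH", "ER")},
]
print(crowdsourced_preference(profiles, "either"))  # the majority variant
```

The dictionary update step would then mark the returned pronunciation as preferred for speech recognition.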
21. A method of improving the accuracy of automatic speech recognition, the method comprising:

determining, by a speech engine, a multiplicity of phoneme sequence hypotheses from a spoken utterance;
determining, by the speech engine, a multiplicity of transcription hypotheses, each transcription hypothesis being based on: a match between the phoneme sequence hypothesis and a pronunciation of a word in a phonetic dictionary, the word having a plurality of pronunciations; and an indication of at least one pronunciation being preferred, wherein the preferred pronunciation is determined by frequency of usage determined by crowdsourcing; and
calculating a likelihood score for each transcription hypothesis, the likelihood score being positively correlated to the matched pronunciation being the preferred pronunciation of the word.
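The scoring rule in claim 21 can be sketched as a multiplicative boost applied to any transcription hypothesis whose matched pronunciation is the crowd-preferred one. The hypothesis format and the boost factor are assumptions; the claim only requires the score to be positively correlated with matching the preferred pronunciation.

```python
# Hedged sketch of claim 21's likelihood weighting. Each hypothesis is assumed
# to be (transcript, matched_pronunciation, base_likelihood); boost is arbitrary.

def best_hypothesis(hypotheses, preferred, boost=1.2):
    """Rescore hypotheses, boosting those matching the preferred pronunciation,
    and return the (transcript, score) pair with the highest score."""
    scored = []
    for transcript, pron, likelihood in hypotheses:
        weight = boost if pron == preferred else 1.0
        scored.append((transcript, likelihood * weight))
    return max(scored, key=lambda t: t[1])

hypotheses = [
    ("bate her", ("B", "EY", "T", "ER"), 0.50),
    ("beta",     ("B", "EY", "T", "AH"), 0.45),
]
# Crowdsourcing says the speaker population prefers the "beta" pronunciation:
print(best_hypothesis(hypotheses, preferred=("B", "EY", "T", "AH")))
```

With the boost, the lower-likelihood "beta" hypothesis (0.45 x 1.2 = 0.54) overtakes "bate her" (0.50); with no preference indicated, the base likelihood ranking stands.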