Subword-Based Multi-Level Pronunciation Adaptation for Recognizing Accented Speech
Abstract
Techniques are described for training a speech recognition model for accented speech. A subword parse table is employed that models mispronunciations at multiple subword levels, such as the syllable, position-specific cluster, and/or phone levels. Mispronunciation probability data is then generated at each level based on inputted training data, such as phone-level annotated transcripts of accented speech. Data from different levels of the subword parse table may then be combined to determine the accented speech model. Mispronunciation probability data at each subword level is based at least in part on context at that level. In some embodiments, phone-level annotated transcripts are generated using a semi-supervised method.
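The multi-level idea in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration, not the patent's implementation: the example word "stop", the ARPAbet-like phone labels, the three toy observations, the relative-frequency estimates, the independence assumption between subwords, and the linear interpolation weights used to combine levels. The sketch also ignores the per-level context that the abstract mentions.

```python
from collections import Counter, defaultdict
from itertools import product

# Hypothetical parse table for the word "stop" (canonical phones: s t aa p).
# Each level lists its subwords as (start, end) spans over the canonical phones.
PARSE_TABLE = {
    "syllable": [(0, 4)],
    "psc": [(0, 2), (2, 3), (3, 4)],  # onset cluster, nucleus, coda
    "phone": [(0, 1), (1, 2), (2, 3), (3, 4)],
}

# Toy phone-level annotated observations (assumed data). Each entry gives the
# realized phones aligned to one canonical phone; a tuple may hold an insertion.
OBSERVATIONS = [
    [("eh", "s"), ("t",), ("aa",), ("p",)],  # vowel inserted before "s"
    [("s",), ("t",), ("o",), ("p",)],        # vowel substitution
    [("s",), ("t",), ("aa",), ("p",)],       # canonical pronunciation
]

def level_variants(spans, observations):
    """Relative-frequency estimate of the realized variants of each subword."""
    counts = [Counter() for _ in spans]
    for obs in observations:
        for i, (start, end) in enumerate(spans):
            realized = tuple(p for seg in obs[start:end] for p in seg)
            counts[i][realized] += 1
    return [{v: c / sum(cnt.values()) for v, c in cnt.items()} for cnt in counts]

def level_word_distribution(variants):
    """Concatenate subword variants, multiplying their probabilities.

    Treating subwords as independent is a simplifying assumption of this sketch.
    """
    dist = defaultdict(float)
    for combo in product(*(v.items() for v in variants)):
        seq = tuple(p for variant, _ in combo for p in variant)
        prob = 1.0
        for _, q in combo:
            prob *= q
        dist[seq] += prob
    return dict(dist)

def build_model(parse_table, observations, weights):
    """Combine per-level distributions by linear interpolation (assumed scheme)."""
    model = defaultdict(float)
    for level, spans in parse_table.items():
        dist = level_word_distribution(level_variants(spans, observations))
        for seq, prob in dist.items():
            model[seq] += weights[level] * prob
    return dict(model)

# Interpolation weights are arbitrary illustrative values that sum to 1.
WEIGHTS = {"syllable": 0.5, "psc": 0.3, "phone": 0.2}
model = build_model(PARSE_TABLE, OBSERVATIONS, WEIGHTS)

for seq, prob in sorted(model.items(), key=lambda kv: -kv[1]):
    print(" ".join(seq), round(prob, 4))
```

In this toy setup the syllable level only reproduces the three whole-word variants it has seen, while the cluster and phone levels also assign probability to the unseen sequence "eh s t o p" by recombining subword variants. That is one way to read the motivation for modeling mispronunciations at several subword levels at once.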
20 Claims
1. A system comprising:
one or more processors; and
a modeling component executed by the one or more processors to:
receive accented speech data for a word;
generate, for the word, a parse table that includes a plurality of levels each corresponding to a different subword type, wherein each of the plurality of levels includes one or more subwords of the corresponding subword type;
determine a set of one or more possible mispronunciations for each of the one or more subwords, at each level of the parse table, based at least on the accented speech data; and
combine the sets to generate a model for accented speech recognition, wherein the model provides a probability of occurrence for each of one or more phone sequences corresponding to a mispronunciation of the word.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
10. One or more computer-readable storage media, storing instructions that enable a processor to perform actions comprising:
determining, based on accented speech data for a word, pronunciation information for the word at a plurality of levels corresponding to different subword types, each of the plurality of levels including one or more subwords of the corresponding subword type;
determining a set of one or more possible mispronunciations for each of the one or more subwords at each level, based at least on the accented speech data; and
combining the sets to generate a model for accented speech recognition.
Dependent claims: 11, 12, 13, 14, 15
16. A computer-implemented method comprising:
generating, by a server device, accented speech data for a word;
generating for the word a parse table with a plurality of levels each corresponding to a subword type, including levels for a phone type, a position-specific cluster (PSC) type, and a syllable type, wherein each level includes one or more subwords of the corresponding subword type;
for each level of the parse table, determining a lattice of one or more possible mispronunciations for the one or more subwords at the level, based on the accented speech data; and
combining the determined lattices to generate a model for accented speech recognition, wherein the model includes a probability that each of one or more phone sequences will be generated by an accented speaker of the word.
Dependent claims: 17, 18, 19, 20
Specification