Subword-Based Multi-Level Pronunciation Adaptation for Recognizing Accented Speech

US 20130191126A1
Filed: 01/20/2012
Published: 07/25/2013
Est. Priority Date: 01/20/2012
Status: Active Grant



Abstract

Techniques are described for training a speech recognition model for accented speech. A subword parse table is employed that models mispronunciations at multiple subword levels, such as the syllable, position-specific cluster, and/or phone levels. Mispronunciation probability data is then generated at each level based on inputted training data, such as phone-level annotated transcripts of accented speech. Data from different levels of the subword parse table may then be combined to determine the accented speech model. Mispronunciation probability data at each subword level is based at least in part on context at that level. In some embodiments, phone-level annotated transcripts are generated using a semi-supervised method.
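As a rough illustration of the multi-level parse table described in the abstract, the following Python sketch stores one row per subword level and estimates mispronunciation probabilities as relative frequencies. The `Subword` class, the segmentation of "string", and all counts are assumptions made for illustration; they are not taken from the patent, and the sketch ignores the context dependence the abstract mentions.

```python
from dataclasses import dataclass, field


@dataclass
class Subword:
    """One cell of the parse table: a canonical subword and its observed variants."""
    canonical: str  # space-separated phones, e.g. "s t r"
    # variant phone string -> count observed in accented training data (invented here)
    variant_counts: dict = field(default_factory=dict)

    def variant_probs(self):
        """Relative frequency of each observed variant (empty if there is no data)."""
        total = sum(self.variant_counts.values())
        return {v: c / total for v, c in self.variant_counts.items()} if total else {}


# Hypothetical parse table for the word "string"; every level spans the same phones.
parse_table = {
    "syllable": [Subword("s t r ih ng", {"s t r ih ng": 3, "s t r ih n": 1})],
    "psc": [
        Subword("s t r", {"s t r": 2, "s r": 2}),  # onset cluster
        Subword("ih", {"ih": 4}),                  # nucleus
        Subword("ng", {"ng": 2, "n": 2}),          # coda cluster
    ],
    "phone": [Subword(p, {p: 4}) for p in "s t r ih ng".split()],
}

for level, cells in parse_table.items():
    print(level, [cell.variant_probs() for cell in cells])
```

Keeping each level as an ordered list of cells means the levels can later be aligned on phone boundaries, since concatenating the canonical subwords of any level yields the same phone string.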


20 Claims

1. A system comprising:
- one or more processors; and
  
  a modeling component executed by the one or more processors to:
  
  receive accented speech data for a word;
  
  generate, for the word, a parse table that includes a plurality of levels each corresponding to a different subword type, wherein each of the plurality of levels includes one or more subwords of the corresponding subword type;
  
  determine a set of one or more possible mispronunciations for each of the one or more subwords, at each level of the parse table, based at least on the accented speech data; and
  
  combine the sets to generate a model for accented speech recognition, wherein the model provides a probability of occurrence for each of one or more phone sequences corresponding to a mispronunciation of the word.
  - 2. The system of claim 1, wherein the plurality of levels includes levels corresponding to a phone type, a position-specific cluster (PSC) type, and a syllable type.
  - 3. The system of claim 1, further comprising a dictionary component that is executed by the one or more processors and that employs the model to generate an accented dictionary for speech recognition.
  - 4. The system of claim 1, wherein combining the sets includes:
    - aligning the sets based on a phone boundary;
      
      converting each of the sets to correspond to a phone type while preserving one or more probabilities associated with the one or more possible mispronunciations at each set; and
      
      adding the one or more probabilities associated with each set to calculate a posterior probability.
  - 5. The system of claim 1, further comprising a lexicon adaptation component that is executed by the one or more processors and that generates the accented speech data for the word using semi-supervised lexicon adaptation.
  - 6. The system of claim 5, wherein the semi-supervised lexicon adaptation includes:
    - determining a plurality of possible pronunciations for the word;
      
      performing speech recognition on each of the plurality of possible pronunciations; and
      
      selecting one of the plurality of possible pronunciations that best represents an audio recording of the word as spoken by an accented speaker.
  - 7. The system of claim 1, wherein determining the set of one or more possible mispronunciations for each of the one or more subwords at each level is further based on a context of each of the one or more subwords.
  - 8. The system of claim 1, wherein combining the sets employs a linear combination.
  - 9. The system of claim 1, wherein combining the sets employs a weighted linear combination.
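The combination recited in claims 4, 8, and 9 might be sketched as follows. Each level is first converted to whole-word phone sequences, and the per-level probabilities are then added with level weights. The function names, the independence assumption used when concatenating subword variants, and the example probabilities and weights are all assumptions for illustration, not the patented implementation.

```python
from itertools import product


def expand_level(level):
    """Convert one level to whole-word phone sequences.

    `level` is a list of dicts, one per subword in word order, each mapping a
    variant (space-separated phones, "" for a deletion) to its probability.
    A sequence's probability is the product of the chosen variants'
    probabilities (an independence assumption made for this sketch).
    """
    sequences = {}
    for combo in product(*(sub.items() for sub in level)):
        phones = " ".join(variant for variant, _ in combo if variant)
        prob = 1.0
        for _, p in combo:
            prob *= p
        sequences[phones] = sequences.get(phones, 0.0) + prob
    return sequences


def combine_levels(levels, weights):
    """Weighted linear combination of per-level phone-sequence probabilities."""
    combined = {}
    for name, level in levels.items():
        for phones, prob in expand_level(level).items():
            combined[phones] = combined.get(phones, 0.0) + weights[name] * prob
    return combined


# Hypothetical distributions for "string" at two levels.
levels = {
    "syllable": [{"s t r ih ng": 0.75, "s t r ih n": 0.25}],
    "psc": [{"s t r": 0.5, "s r": 0.5}, {"ih": 1.0}, {"ng": 0.5, "n": 0.5}],
}
model = combine_levels(levels, {"syllable": 0.5, "psc": 0.5})
for phones, prob in sorted(model.items(), key=lambda kv: -kv[1]):
    print(f"{prob:.3f}  {phones}")
```

With weights that sum to one, the combined values again form a probability distribution over phone sequences. Equal weights correspond to the plain linear combination of claim 8, and unequal weights to the weighted form of claim 9.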

10. One or more computer-readable storage media, storing instructions that enable a processor to perform actions comprising:
- determining, based on accented speech data for a word, pronunciation information for the word at a plurality of levels corresponding to different subword types, each of the plurality of levels including one or more subwords of the corresponding subword type;
  
  determining a set of one or more possible mispronunciations for each of the one or more subwords at each level, based at least on the accented speech data; and
  
  combining the sets to generate a model for accented speech recognition.
  - 11. The one or more computer-readable storage media of claim 10, wherein the actions further comprise generating the accented speech data for the word using semi-supervised lexicon adaptation.
  - 12. The one or more computer-readable storage media of claim 10, wherein the actions further comprise employing the model to generate an accented dictionary for speech recognition.
  - 13. The one or more computer-readable storage media of claim 10, wherein the plurality of levels includes levels corresponding to at least two of a phone type, a position-specific cluster (PSC) type, and a syllable type.
  - 14. The one or more computer-readable storage media of claim 10, wherein combining the sets employs a linear combination.
  - 15. The one or more computer-readable storage media of claim 10, wherein combining the sets employs a weighted linear combination.
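The selection step of semi-supervised lexicon adaptation (claims 5, 6, and 11) can be pictured as scoring each candidate pronunciation against a recording and keeping the best one. In the sketch below, `select_pronunciation` and `make_toy_scorer` are hypothetical names, and the scorer is a deliberately crude stand-in that compares phone strings. A real system would score candidates with a speech recognizer against the audio, which is not modeled here.

```python
from difflib import SequenceMatcher


def select_pronunciation(candidates, score):
    """Return the candidate pronunciation that `score` rates highest."""
    return max(candidates, key=score)


def make_toy_scorer(heard_phones):
    """Stand-in for a recognizer: similarity to a phone string, NOT an acoustic score."""
    heard = heard_phones.split()

    def score(candidate):
        # Ratio of matching phones between the candidate and the "heard" sequence.
        return SequenceMatcher(None, candidate.split(), heard).ratio()

    return score


# Hypothetical candidate pronunciations for "string".
candidates = ["s t r ih ng", "s t r ih n", "s r ih n"]
best = select_pronunciation(candidates, make_toy_scorer("s r ih n"))
print(best)
```

The selected pronunciation then serves as the phone-level annotation for that recording, which is what makes the procedure semi-supervised: only a word-level transcript is needed up front.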

16. A computer-implemented method comprising:
- generating, by a server device, accented speech data for a word;
  
  generating for the word a parse table with a plurality of levels each corresponding to a subword type, including levels for a phone type, a position-specific cluster (PSC) type, and a syllable type, wherein each level includes one or more subwords of the corresponding subword type;
  
  for each level of the parse table, determining a lattice of one or more possible mispronunciations for the one or more subwords at the level, based on the accented speech data; and
  
  combining the determined lattices to generate a model for accented speech recognition, wherein the model includes a probability that each of one or more phone sequences will be generated by an accented speaker of the word.
  - 17. The computer-implemented method of claim 16, further comprising employing the model to generate an accented dictionary for speech recognition.
  - 18. The computer-implemented method of claim 16, wherein combining the determined lattices to generate the model includes a linear combination of the determined lattices.
  - 19. The computer-implemented method of claim 18, wherein the linear combination includes weighting each of the lattices according to a back-off model.
  - 20. The computer-implemented method of claim 16, wherein generating the accented speech data for the word employs semi-supervised lexicon adaptation.
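Claim 19 weights the lattices according to a back-off model but this listing does not give the formula. One plausible reading, sketched below purely as an assumption, is that the most specific level (syllable) receives more weight when it has more training evidence, and the remaining weight backs off to the PSC level and finally to the phone level. The function name, the smoothing constant `k`, and the counts are all hypothetical.

```python
def backoff_weights(counts, order=("syllable", "psc", "phone"), k=5.0):
    """Assign level weights that sum to 1, backing off from specific to general.

    Each level except the last takes a share n / (n + k) of the weight still
    unassigned, where n is the number of training observations at that level.
    The last level receives whatever remains. This formula is an assumption.
    """
    weights = {}
    remaining = 1.0
    for name in order[:-1]:
        n = counts.get(name, 0)
        share = remaining * n / (n + k)
        weights[name] = share
        remaining -= share
    weights[order[-1]] = remaining
    return weights


# Hypothetical observation counts per level for one word.
weights = backoff_weights({"syllable": 5, "psc": 15, "phone": 100})
print(weights)
```

A level with no observations receives zero weight under this scheme, so a syllable never seen in the accented training data contributes nothing and the model relies on the smaller units instead.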


Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Thambiratnam, Albert Joseph Kishan; Seide, Frank Torsten Bernd; Mertens, Timo Pascal
Granted Patent: US 8,825,481 B2
Field of Search
US Class Current: 704/245
CPC Class Codes:
G10L 15/06 Creation of reference templ...
G10L 15/187 Phonemic context, e.g. pron...
