System and method for discriminative pronunciation modeling for voice search

US 9,484,019 B2
Filed: 10/11/2012
Issued: 11/01/2016
Est. Priority Date: 11/19/2008
Status: Active Grant

Abstract

Disclosed herein is a method for speech recognition. The method includes: receiving speech utterances; assigning a pronunciation weight to each unit of speech in the speech utterances, each respective pronunciation weight being normalized at the unit-of-speech level to sum to 1; for each received speech utterance, optimizing the pronunciation weight by identifying word and phone alignments and corresponding likelihood scores, and discriminatively adapting the pronunciation weight to minimize classification errors; and recognizing additional received speech utterances using the optimized pronunciation weights. A unit of speech can be a sentence, a word, a context-dependent phone, a context-independent phone, or a syllable. The method can further include discriminatively adapting pronunciation weights based on an objective function. The objective function can be maximum mutual information, maximum likelihood training, minimum classification error training, or other functions known to those of skill in the art.
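
The adaptation step described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes the pronunciation weights of one unit of speech are parameterized as a softmax over logits (so they always sum to 1), and that each training utterance supplies the index of the aligned pronunciation plus likelihood scores for the reference and for the best competing hypothesis. The function names `softmax` and `mce_step`, the learning rate, and the sigmoid loss are all illustrative assumptions in the spirit of minimum classification error training.

```python
import math


def softmax(logits):
    """Map logits to pronunciation weights that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mce_step(logits, aligned, ref_score, comp_score, lr=1.0):
    """One MCE-style gradient step (hypothetical helper, not from the patent).

    logits     -- unnormalized weights for the pronunciations of one unit
    aligned    -- index of the pronunciation found by the alignment
    ref_score  -- log-likelihood score of the reference hypothesis
    comp_score -- log-likelihood score of the best competing hypothesis
    """
    weights = softmax(logits)
    # Misclassification measure: positive when the competitor outscores
    # the reference once the pronunciation weight is included.
    d = comp_score - (ref_score + math.log(weights[aligned]))
    s = 1.0 / (1.0 + math.exp(-d))   # smoothed error count (sigmoid loss)
    scale = lr * s * (1.0 - s)       # derivative of the sigmoid at d
    # Gradient descent on the loss raises the aligned pronunciation's logit
    # and lowers the others in proportion to their current weight.
    return [
        theta + scale * ((1.0 if j == aligned else 0.0) - w)
        for j, (theta, w) in enumerate(zip(logits, weights))
    ]


# Usage: three equally weighted pronunciations, the second one was aligned.
logits = mce_step([0.0, 0.0, 0.0], aligned=1, ref_score=-10.0, comp_score=-10.0)
weights = softmax(logits)
```

Because the weights are recomputed through the softmax after every step, the normalization constraint holds without a separate renormalization pass. Swapping the sigmoid loss for another objective (for example maximum mutual information) would change only the computation of `scale`.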

20 Claims

1. A method comprising:
- determining a context associated with an utterance received via a microphone that converts audible signals into electrical signals;
  
  determining, via a processor, phoneme possibilities for a unit of speech in the utterance;
  
  assigning weights to each phoneme possibility in the phoneme possibilities, to yield weighted phonemes, wherein the weights are based on a rate of occurrence of the phoneme possibility in utterances associated with the context and a likelihood of classification errors;
  
  receiving additional utterances via the microphone; and
  
  converting the additional utterances into text via a speech recognizer that uses the weighted phonemes.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9)
- - 2. The method of claim 1, further comprising:
    - comparing the weighted phonemes to stored weighted phonemes, to yield a comparison value; and
      
      determining a recognition response based on the comparison value.
  - 3. The method of claim 1, further comprising:
    - comparing the weighted phonemes to stored weighted phonemes, to yield a comparison value; and
      
      when the comparison value is above a threshold value, modifying the stored weighted phonemes based on the weighted phonemes.
  - 4. The method of claim 1, wherein the unit of speech is one of a syllable, a word, a sentence, a context-dependent phone, and a context-independent phone.
  - 5. The method of claim 1, further comprising:
    - prior to recognizing of the additional utterances, discriminatively adapting the weighted phonemes to minimize classification errors.
  - 6. The method of claim 5, wherein discriminatively adapting the weighted phonemes further comprises stochastically modeling pronunciations.
  - 7. The method of claim 1, wherein the weights assigned to the phoneme possibilities of the unit of speech are normalized to sum to 1.
  - 8. The method of claim 1, wherein the utterance comprises a name.
  - 9. The method of claim 1, wherein the utterance is part of a multimodal input.
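
The weighting in claim 1, the normalization in claim 7, and the threshold-gated update in claim 3 can be sketched together as follows. The claims do not specify how the rate of occurrence and the likelihood of classification errors are combined, how the comparison value is computed, or how stored weights are modified. The choices below (discounting counts by an error likelihood, an L1 distance, and linear interpolation) are assumptions made only for illustration, as are the function names, the threshold, and the phoneme labels.

```python
from collections import Counter


def weight_phonemes(observed, error_likelihood):
    """Weight phoneme possibilities seen in utterances for one context.

    observed         -- phoneme possibilities observed for a unit of speech
    error_likelihood -- assumed per-phoneme probability of a classification
                        error, in [0, 1); missing entries default to 0
    """
    counts = Counter(observed)
    # Assumption: discount each rate of occurrence by its error likelihood.
    raw = {p: c * (1.0 - error_likelihood.get(p, 0.0)) for p, c in counts.items()}
    total = sum(raw.values())
    if total == 0:
        raise ValueError("no usable observations for this context")
    # Normalize so the weights for the unit of speech sum to 1 (claim 7).
    return {p: v / total for p, v in raw.items()}


def comparison_value(new, stored):
    """Assumed comparison value: L1 distance between two weight tables."""
    keys = set(new) | set(stored)
    return sum(abs(new.get(k, 0.0) - stored.get(k, 0.0)) for k in keys)


def maybe_update(new, stored, threshold=0.2, mix=0.5):
    """Modify stored weights only when the comparison value exceeds the threshold."""
    if comparison_value(new, stored) <= threshold:
        return dict(stored)
    keys = set(new) | set(stored)
    # Interpolating two normalized tables keeps the result normalized.
    return {k: (1.0 - mix) * stored.get(k, 0.0) + mix * new.get(k, 0.0) for k in keys}


# Usage with hypothetical phoneme labels for one unit of speech.
new = weight_phonemes(["ah", "ah", "ah", "ey"], {"ey": 0.5})
stored = {"ah": 0.5, "ey": 0.5}
updated = maybe_update(new, stored)
```

A recognizer would keep one such weight table per context and per unit of speech, so the same unit can carry different weights in, say, a name-search context than in general dictation.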

10. A system comprising:
- a processor; and
  
  a computer-readable storage medium having instructions stored which, when executed by the processor, cause the processor to perform operations comprising:
  
  determining a context associated with an utterance received via a microphone that converts audible signals into electrical signals;
  
  determining, via a processor, phoneme possibilities for a unit of speech in the utterance;
  
  assigning weights to each phoneme possibility in the phoneme possibilities, to yield weighted phonemes, wherein the weights are based on a rate of occurrence of the phoneme possibility in utterances associated with the context and a likelihood of classification errors;
  
  receiving additional utterances via the microphone; and
  
  converting the additional utterances into text via a speech recognizer that uses the weighted phonemes.
- View Dependent Claims (11, 12, 13, 14, 15)
- - 11. The system of claim 10, the computer-readable storage medium having additional instructions stored which, when executed by the processor, result in the processor performing operations comprising:
    - comparing the weighted phonemes to stored weighted phonemes, to yield a comparison value; and
      
      determining a recognition response based on the comparison value.
  - 12. The system of claim 10, the computer-readable storage medium having additional instructions stored which, when executed by the processor, result in the processor performing operations comprising:
    - comparing the weighted phonemes to stored weighted phonemes, to yield a comparison value; and
      
      when the comparison value is above a threshold value, modifying the stored weighted phonemes based on the weighted phonemes.
  - 13. The system of claim 10, wherein the unit of speech is one of a syllable, a word, a sentence, a context-dependent phone, and a context-independent phone.
  - 14. The system of claim 10, the computer-readable storage medium having additional instructions stored which, when executed by the processor, result in the processor performing operations comprising:
    - prior to recognizing additional utterances, discriminatively adapting the weighted phonemes to minimize classification errors.
  - 15. The system of claim 14, wherein discriminatively adapting the weighted phonemes further comprises stochastically modeling pronunciations.

16. A non-transitory computer-readable storage device having instructions stored which, when executed by a computing device, cause the computing device to perform operations comprising:
- determining a context associated with an utterance received via a microphone that converts audible signals into electrical signals;
  
  determining, via a processor, phoneme possibilities for a unit of speech in the utterance;
  
  assigning weights to each phoneme possibility in the phoneme possibilities, to yield weighted phonemes, wherein the weights are based on a rate of occurrence of the phoneme possibility in utterances associated with the context and a likelihood of classification errors;
  
  receiving additional utterances via the microphone; and
  
  converting the additional utterances into text via a speech recognizer that uses the weighted phonemes.
- View Dependent Claims (17, 18, 19, 20)
- - 17. A non-transitory computer-readable storage device of claim 16, having additional instructions stored which, when executed by the computing device, result in the computing device performing operations comprising:
    - comparing the weighted phonemes to stored weighted phonemes, to yield a comparison value; and
      
      determining a recognition response based on the comparison value.
  - 18. A non-transitory computer-readable storage device of claim 16, having additional instructions stored which, when executed by the computing device, result in the computing device performing operations comprising:
    - comparing the weighted phonemes to stored weighted phonemes, to yield a comparison value; and
      
      when the comparison value is above a threshold value, modifying the stored weighted phonemes based on the weighted phonemes.
  - 19. A non-transitory computer-readable storage device of claim 16, wherein the unit of speech is one of a syllable, a word, a sentence, a context-dependent phone, and a context-independent phone.
  - 20. A non-transitory computer-readable storage device of claim 16, having additional instructions stored which, when executed by the computing device, result in the computing device performing operations comprising:
    - prior to recognizing additional utterances, discriminatively adapting the weighted phonemes to minimize classification errors.

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: AT&T Intellectual Property I LP (AT&T, Inc.)
Inventors: Gilbert, Mazin; Conkie, Alistair D.; Ljolje, Andrej
Primary Examiner(s): Chawan, Vijay B
Application Number: US13/649,680
Publication Number: US 20130035939A1
Time in Patent Office: 1,482 Days
Field of Search: 704/254, 704/243, 704/255, 704/240, 704/257, 704/231, 704/256, 704/256.2, 704/244, 704/251, 704/236, 704/235, 379/88.01-88.04, 434/185
US Class Current: 1/1
CPC Class Codes:
G10L 15/063   Training
G10L 15/187   Phonemic context, e.g. pron...
G10L 2015/025   Phonemes, fenemes or fenone...
