Word-specific acoustic models in a speech recognition system

US 7,062,436 B1
Filed: 02/11/2003
Issued: 06/13/2006
Est. Priority Date: 02/11/2003
Status: Expired due to Fees

Abstract

An acoustic model includes word-specific models that are specific to candidate words. The candidate words would otherwise be mapped to a series of general phones. A sub-series of the general phones representing the candidate word is modeled by a new phone, and the new phone is dedicated to the candidate word, or to a small group of similar words, but is not shared among all words that otherwise map to the sub-series of general phones.
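The replacement described in the abstract can be illustrated with a toy lexicon. The sketch below is not taken from the patent: the words, the phone symbols, the function name `add_word_specific_phone`, and the `word:phones` naming scheme are all assumptions chosen for illustration. It only shows that one word's transcription is rewritten while other words with the same shared phones are left alone.

```python
# Hypothetical toy lexicon: word -> sequence of shared (general) phones.
SHARED_LEXICON = {
    "one": ["w", "ah", "n"],
    "won": ["w", "ah", "n"],
    "sun": ["s", "ah", "n"],
}


def add_word_specific_phone(lexicon, candidate, start, end):
    """Return a copy of the lexicon in which phones[start:end] of `candidate`
    are replaced by a single new phone dedicated to that word."""
    phones = lexicon[candidate]
    if not (0 <= start < end <= len(phones)):
        raise ValueError("sub-series must be a non-empty slice of the transcription")
    # The naming scheme for the new phone is an assumption of this sketch.
    new_phone = f"{candidate}:{'_'.join(phones[start:end])}"
    updated = dict(lexicon)
    updated[candidate] = phones[:start] + [new_phone] + phones[end:]
    return updated


# Dedicate one new phone to the whole of "one"; "won" keeps its shared phones.
lexicon = add_word_specific_phone(SHARED_LEXICON, "one", 0, 3)
print(lexicon["one"])
print(lexicon["won"])
```

Because "won" still maps to the shared phones, the new phone is shared by fewer than all words that the replaced sub-series could transcribe.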

22 Claims

1. An acoustic model in a speech recognition system having a lexicon in which words map to phones modeled in the acoustic model, the acoustic model comprising:
- a plurality of shared phone models modeling a plurality of shared phones used to transcribe words in the lexicon, the shared phone models and shared phones being shared among the words in the lexicon;
- a candidate word model modeling a word-specific phone representing a transcription of a portion of a candidate word in the lexicon, the word-specific phone replacing in a transcription of the candidate word one or more of the shared phones, the word-specific phone and the candidate word model being shared by fewer than all words in the lexicon that can be transcribed by the shared phones replaced by the word-specific phone.
  - 2. The acoustic model of claim 1 wherein the candidate word is transcribed by a plurality of word-specific phones and wherein the acoustic model includes a candidate word model modeling each of the word-specific phones.
  - 3. The acoustic model of claim 2 wherein a number of the word-specific phones modeled for the candidate word is based on a pronunciation duration associated with the candidate word.
  - 4. The acoustic model of claim 1 wherein the word-specific phone is a monophone.
  - 5. The acoustic model of claim 1 wherein the word-specific phone is a context dependent phone.
  - 6. The acoustic model of claim 5 wherein the context dependent phone comprises a triphone.
  - 7. The acoustic model of claim 1 and further comprising:
    - a plurality of candidate word models each corresponding to one of a plurality of candidate words.
  - 8. The acoustic model of claim 1 wherein the candidate word model is shared only among a subset of other candidate words.
  - 9. The acoustic model of claim 1 wherein the transcription of the candidate word includes a first phone modeled by a shared phone model, a final phone modeled by a shared phone model and wherein the word-specific phone, modeled by the candidate word model, comprises at least one central phone that resides between the first phone and final phone in the transcription of the candidate word.
  - 10. The acoustic model of claim 1 wherein the transcription of the candidate word includes a first context dependent word-specific phone, modeled by a first candidate word model, having a left context corresponding to a shared phone, a final context dependent word-specific phone, modeled by a final candidate word model, having a right context corresponding to a shared phone, and wherein the word specific phone, modeled by the candidate word model, comprises at least one central context dependent phone that resides between the first context dependent word-specific phone and final context dependent word-specific phone in the transcription of the candidate word.
  - 11. The acoustic model of claim 1 wherein the candidate word model comprises a Hidden Markov chain having a topology that is based on a pronunciation of the candidate word.
  - 12. The acoustic model of claim 11 wherein the topology includes a transition from a central portion of the Hidden Markov chain out of the Hidden Markov chain.
  - 13. The acoustic model of claim 11 wherein the topology includes a transition from outside of the Hidden Markov chain into a central portion of the Hidden Markov chain.
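Claims 11 to 13 describe a Hidden Markov chain whose topology allows transitions into and out of its central portion. A minimal sketch of such a topology follows, assuming a left-to-right chain with self-loops. The function name `build_topology`, its parameters, and the uniform placeholder probabilities are assumptions of this sketch, not values from the patent; in a real system the probabilities would be estimated during training.

```python
def build_topology(num_states, entry_states=(0,), exit_states=(), self_loop=0.5):
    """Build a left-to-right chain with optional central entry and exit points.

    Returns (entry, trans, exit_prob):
      entry[s]     probability of entering the chain at state s
      trans[s][t]  probability of moving from state s to state t
      exit_prob[s] probability of leaving the chain from state s
    Probabilities are uniform placeholders (an assumption of this sketch).
    """
    if num_states < 1:
        raise ValueError("need at least one state")
    entries = sorted(set(entry_states))
    if not entries:
        raise ValueError("need at least one entry state")
    # The last state must always be able to leave the chain.
    exits = set(exit_states) | {num_states - 1}

    entry = [0.0] * num_states
    for s in entries:
        entry[s] = 1.0 / len(entries)

    trans = [[0.0] * num_states for _ in range(num_states)]
    exit_prob = [0.0] * num_states
    for s in range(num_states):
        trans[s][s] = self_loop
        can_advance = s + 1 < num_states
        can_exit = s in exits
        # Split the remaining mass evenly over the allowed moves.
        share = (1.0 - self_loop) / (can_advance + can_exit)
        if can_advance:
            trans[s][s + 1] = share
        if can_exit:
            exit_prob[s] = share
    return entry, trans, exit_prob


# Five states; state 2 is a central state that can be entered from outside
# (as in claim 13) and exited early (as in claim 12).
entry, trans, exit_prob = build_topology(5, entry_states=(0, 2), exit_states=(2,))
```

Each row of `trans` plus the matching `exit_prob` entry sums to 1, so the chain remains a valid stochastic model whichever entry and exit states are enabled.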

14. A method of training an acoustic model, comprising:
- receiving a set of shared phone models and corresponding transcriptions with shared phones;
- initializing candidate word models each with data corresponding to one or more of the shared phones; and
- training the candidate word models on fewer than all instances of words that contain the shared phones used to initialize the candidate word models.
  - 15. The method of claim 14 wherein training the candidate word models comprises:
    - training the candidate word models only using instances of corresponding candidate words.
  - 16. The method of claim 14 wherein training the candidate word models further comprises:
    - determining whether occurrences of the instances of the candidate words reached a threshold level; and
    - if not, clustering data from additional words to train the candidate word models.
  - 17. The method of claim 16 wherein clustering comprises:
    - clustering data only from additional candidate words to train the candidate word models.
  - 18. The method of claim 14 and further comprising:
    - performing additional training on the shared phone models as the candidate word models are trained.
  - 19. The method of claim 14 wherein receiving the shared phone models comprises training the shared phone models.
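The training steps in claims 14 to 17 can be sketched as two small helpers: one seeds candidate word models from the shared phone models, and one decides which words' instances supply training data. Everything named here is hypothetical (`init_candidate_models`, `select_training_words`, the `mean` parameter, the toy corpus), and the "clustering" of claim 16 is reduced to simple pooling from a ranked fallback list, which is an assumption of this sketch rather than the patent's method.

```python
import copy
from collections import Counter


def init_candidate_models(shared_models, replaced_phones):
    """Seed one candidate word model per replaced shared phone by copying
    that phone's parameters (the parameter layout is a placeholder)."""
    return [copy.deepcopy(shared_models[p]) for p in replaced_phones]


def select_training_words(candidate, corpus_words, fallback_words, threshold):
    """Pick the words whose instances train the candidate's models.

    Starts with the candidate word alone; if it occurs fewer than `threshold`
    times, pools data from `fallback_words` in order until the threshold is met.
    """
    counts = Counter(corpus_words)
    pool, total = [candidate], counts[candidate]
    for word in fallback_words:
        if total >= threshold:
            break
        if counts[word] > 0:
            pool.append(word)
            total += counts[word]
    return pool, total


# Toy data, invented for illustration only.
shared_models = {"w": {"mean": 0.1}, "ah": {"mean": 0.4}, "n": {"mean": -0.2}}
seeds = init_candidate_models(shared_models, ["w", "ah", "n"])

corpus = ["one"] * 3 + ["won"] * 4 + ["sun"] * 10
pool, total = select_training_words("one", corpus, ["won", "sun"], threshold=5)
print(pool, total)
```

Under claim 17, `fallback_words` would be restricted to other candidate words. In either case the models are trained on fewer than all instances of words containing the shared phones that seeded them.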

20. A computer readable medium, comprising:
- an acoustic model in a speech recognition system having a lexicon in which words are transcribed as phones modeled in the acoustic model, the acoustic model comprising:
  - a plurality of shared phone models modeling a plurality of shared phones used to transcribe words in the lexicon, the shared phone models and shared phones being shared among the words in the lexicon;
  - a candidate word model modeling a word-specific phone representing a transcription of a portion of a candidate word in the lexicon, the word-specific phone replacing in a transcription of the candidate word one or more of the shared phones, the word-specific phone and the candidate word model being shared by fewer than all words in the lexicon that would otherwise be transcribed by the shared phones that are replaced by the word-specific phone.

21. A speech recognition system, comprising:
- an input receiving a signal indicative of speech;
- a lexicon including words transcribed by phones;
- an acoustic model modeling shared phones shared among the words in the lexicon and word-specific phones shared among a selected group of words that would otherwise be lexically transcribed with shared phones;
- a language model modeling word order; and
- a decoder coupled to the input, the acoustic model and the language model, recognizing speech represented by the signal.
  - 22. The speech recognition system of claim 21 wherein the acoustic model comprises:
    - a plurality of shared phone models modeling a plurality of the shared phones that are used to transcribe words in the lexicon, the shared phone models and shared phones being shared among the words in the lexicon;
    - a plurality of candidate word models modeling the word-specific phones that represent a transcription of a portion of a candidate word in the lexicon, the word-specific phones each replacing in a transcription of the candidate word one or more of the shared phones, the word-specific phones and the candidate word models being shared by fewer than all words in the lexicon that would otherwise be transcribed by the shared phones that are replaced by the word-specific phones.

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Odell, Julian J.; Durrani, Shahid
Primary Examiner(s): MCFADDEN, SUSAN IRIS
Application Number: US10/364,084
Time in Patent Office: 1,218 Days
Field of Search: None
US Class Current: 704/255
CPC Class Codes:
- G10L 15/063: Training
- G10L 15/187: Phonemic context, e.g. pron...
- G10L 2015/025: Phonemes, fenemes or fenone...
- G10L 2015/0631: Creating reference template...
