New-word pronunciation learning using a pronunciation graph

US 7,590,533 B2
Filed: 03/10/2004
Issued: 09/15/2009
Est. Priority Date: 03/10/2004
Status: Expired due to Fees

Abstract

A method and computer-readable medium convert the text of a word and a user's pronunciation of the word into a phonetic description to be added to a speech recognition lexicon. Initially, a plurality of at least two possible phonetic descriptions are generated. One phonetic description is formed by decoding a speech signal representing a user's pronunciation of the word. At least one other phonetic description is generated from the text of the word. The plurality of possible sequences comprising speech-based and text-based phonetic descriptions are aligned and scored in a single graph based on their correspondence to the user's pronunciation. The phonetic description with the highest score is then selected for entry in the speech recognition lexicon.
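
The selection step described in the abstract can be illustrated with a small sketch. It assumes the aligned graph is stored as a list of edges between integer node ids, and that each edge already carries a log score. The node numbering, the `best_path` helper, and every score below are hypothetical values chosen for illustration; in the patent the scores would come from comparing the user's speech against acoustic models, which this sketch does not attempt.

```python
def best_path(edges, start, end):
    """Return (score, units) for the highest-scoring path from start to end.

    edges: list of (src, dst, unit, log_score) with src < dst, so that
    visiting edges in sorted order respects the graph's topological order.
    """
    best = {start: (0.0, [])}
    for src, dst, unit, log_score in sorted(edges):
        if src not in best:
            continue  # node not reachable from start
        total = best[src][0] + log_score
        if dst not in best or total > best[dst][0]:
            best[dst] = (total, best[src][1] + [unit])
    return best[end]

# Hypothetical graph: "k" and "t" are shared by both descriptions, while the
# other positions offer a text-based and a speech-based alternative.
edges = [
    (0, 1, "k", -1.0),   # shared unit, single path
    (1, 2, "ae", -3.0),  # text-based alternative
    (1, 2, "ah", -1.5),  # speech-based alternative
    (2, 3, "t", -1.0),   # shared unit, single path
    (3, 4, "s", -0.8),   # text-based alternative
    (3, 4, "z", -2.0),   # speech-based alternative
]
score, units = best_path(edges, start=0, end=4)
```

With these made-up scores the selected sequence takes "ah" from the speech-based side and "s" from the text-based side, which is the kind of mixed result the single graph makes possible.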

23 Claims

1. A computer-readable storage medium having computer-executable instructions stored thereon that when executed by a computer cause the computer to perform steps comprising:
- generating a set of syllable-like units using mutual information before decoding a speech signal to identify a sequence of syllable-like units;
  
  generating a speech-based phonetic description of a word without reference to the text of the word by decoding a speech signal representing the user's pronunciation of the word to generate the speech-based phonetic description of the word, wherein decoding a speech signal comprises identifying a sequence of syllable-like units from the speech signal;
  
  generating a text-based phonetic description of the word based on the text of the word;
  
  aligning the speech-based phonetic description and the text-based phonetic description on a phone-by-phone basis to form a single graph; and
  
  selecting a phonetic description from the single graph.
- Dependent Claims (2, 3, 4)
- - 2. The computer-readable storage medium of claim 1, wherein generating a syllable-like unit using mutual information comprises:
    - calculating mutual information values for pairs of sub-word units in a training dictionary;
      
      selecting a pair of sub-word units based on the mutual information values; and
      
      merging the selected pair of sub-word units into a syllable-like unit.
  - 3. The computer-readable storage medium of claim 1, wherein generating the text-based phonetic description comprises using a letter-to-sound rule.
  - 4. The computer-readable storage medium of claim 1, wherein selecting a phonetic description from the single graph comprises comparing a speech sample to acoustic models of phonetic units in the single graph.
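
The three steps of claim 2 (calculate mutual information for pairs, select a pair, merge it) can be sketched as follows. This is a minimal illustration under stated assumptions: the claim does not give a formula, so the sketch assumes the common weighted form MI(a, b) = P(a, b) log(P(a, b) / (P(a) P(b))) with probabilities estimated from counts, and it assumes the pair with the highest value is the one selected. The toy dictionary, the function names, and the `+` joining convention are all hypothetical.

```python
import math
from collections import Counter

def pair_mutual_information(entries):
    """Return {(a, b): MI} for adjacent sub-word unit pairs in the entries."""
    unit_counts = Counter(u for entry in entries for u in entry)
    pair_counts = Counter(
        (a, b) for entry in entries for a, b in zip(entry, entry[1:])
    )
    total_units = sum(unit_counts.values())
    total_pairs = sum(pair_counts.values())
    scores = {}
    for (a, b), n in pair_counts.items():
        p_ab = n / total_pairs
        p_a = unit_counts[a] / total_units
        p_b = unit_counts[b] / total_units
        # Assumed formula: joint probability times log of the association ratio.
        scores[(a, b)] = p_ab * math.log(p_ab / (p_a * p_b))
    return scores

def merge_pair(entry, pair):
    """Replace each adjacent occurrence of pair in entry with one merged unit."""
    merged, i = [], 0
    while i < len(entry):
        if i + 1 < len(entry) and (entry[i], entry[i + 1]) == pair:
            merged.append(entry[i] + "+" + entry[i + 1])
            i += 2
        else:
            merged.append(entry[i])
            i += 1
    return merged

# Hypothetical training dictionary: each entry is a phone sequence.
toy_dictionary = [
    ["k", "ae", "t"],
    ["b", "ae", "t"],
    ["m", "ae", "t"],
    ["t", "ae", "p"],
]
scores = pair_mutual_information(toy_dictionary)
best_pair = max(scores, key=scores.get)
merged_dictionary = [merge_pair(e, best_pair) for e in toy_dictionary]
```

In this toy data the pair ("ae", "t") co-occurs most strongly, so it becomes the single unit "ae+t", while the entry where "t" precedes "ae" is left unchanged.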

5. A computer-readable storage medium having computer-executable instructions stored thereon that when executed by a computer cause the computer to perform steps comprising:
- receiving text of a word for which a phonetic pronunciation is to be added to a speech recognition lexicon;
  
  receiving a representation of a speech signal produced by a person pronouncing the word;
  
  converting the text of the word into at least one text-based phonetic sequence of phonetic units;
  
  generating a speech-based phonetic sequence of phonetic units from the representation of the speech signal;
  
  placing the phonetic units of the at least one text-based phonetic sequence and the speech-based phonetic sequence in a search structure that allows for transitions between phonetic units in the text-based phonetic sequence and phonetic units in the speech-based phonetic description; and
  
  selecting a phonetic pronunciation from the search structure, wherein the selected phonetic pronunciation comprises phonetic units of the speech-based phonetic sequence that differ from phonetic units of the at least one text-based phonetic sequence and phonetic units other than phonetic units of the speech-based phonetic sequence.
- Dependent Claims (6, 7, 8, 9, 10, 11, 12, 13, 14)
- - 6. The computer-readable storage medium of claim 5, wherein placing the phonetic units in a search structure comprises aligning the speech-based phonetic sequence and the at least one text-based phonetic sequence to identify phonetic units that are alternatives of each other.
  - 7. The computer-readable storage medium of claim 6, wherein the search structure contains a single path for a phonetic unit that is found in both the text-based phonetic sequence and the speech-based phonetic sequence.
  - 8. The computer-readable storage medium of claim 6, wherein aligning the speech-based phonetic sequence and the at least one text-based phonetic sequence comprises calculating a minimum distance between two phonetic sequences.
  - 9. The computer-readable storage medium of claim 6, wherein selecting the phonetic pronunciation is based in part on a comparison between acoustic models of phonetic units and the representation of the speech signal.
  - 10. The computer-readable storage medium of claim 5, wherein generating a speech-based phonetic sequence of phonetic units comprises:
    - generating a plurality of possible phonetic sequences of phonetic units;
      
      using at least one model to generate a probability score for each possible phonetic sequence; and
      
      selecting the possible phonetic sequence with the highest score as the speech-based phonetic sequence of phonetic units.
  - 11. The computer-readable storage medium of claim 10, wherein using at least one model comprises using an acoustic model and a language model.
  - 12. The computer-readable storage medium of claim 11, wherein using a language model comprises using a language model that is based on syllable-like units.
  - 13. The computer-readable storage medium of claim 10, wherein selecting a phonetic pronunciation comprises scoring paths through the search structure based on at least one model.
  - 14. The computer-readable storage medium of claim 13, wherein the at least one model comprises an acoustic model.
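
Claims 6 through 8 describe aligning the two sequences by minimum distance so that alternative units are identified and shared units collapse to a single path. One possible reading is sketched below using a standard Levenshtein (edit distance) alignment with unit costs. The claims do not specify the distance measure or the data layout, so the cost values, the slot-list representation, the use of `None` for a skipped unit, and the function names are assumptions made for illustration.

```python
def align(text_seq, speech_seq):
    """Minimum edit distance alignment; returns (distance, aligned pairs).

    Each aligned pair is (text_unit, speech_unit); None marks a position
    where one sequence has no counterpart.
    """
    n, m = len(text_seq), len(speech_seq)
    # dist[i][j] = edit distance between text_seq[:i] and speech_seq[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if text_seq[i - 1] == speech_seq[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + sub,  # match or substitution
                dist[i - 1][j] + 1,        # unit only in the text sequence
                dist[i][j - 1] + 1,        # unit only in the speech sequence
            )
    # Trace back from the end to recover one optimal alignment.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = 0 if text_seq[i - 1] == speech_seq[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + sub:
                pairs.append((text_seq[i - 1], speech_seq[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            pairs.append((text_seq[i - 1], None))
            i -= 1
        else:
            pairs.append((None, speech_seq[j - 1]))
            j -= 1
    pairs.reverse()
    return dist[n][m], pairs

def build_slots(pairs):
    """Collapse identical units to one path; keep differing units as alternatives."""
    return [[t] if t == s else [t, s] for t, s in pairs]

# Hypothetical phone sequences for one word.
text_based = ["k", "ae", "t", "s"]
speech_based = ["k", "ah", "t"]
distance, pairs = align(text_based, speech_based)
slots = build_slots(pairs)
```

Because every slot is entered from the previous slot regardless of which alternative was taken there, a path through this structure can move freely between text-based and speech-based units, as the search structure of claim 5 requires.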

15. A method for adding an acoustic description of a word to a speech recognition lexicon, the method comprising:
- generating a text-based phonetic description based on the text of a word;
  
  generating a speech-based phonetic description without reference to the text of the word;
  
  aligning the text-based phonetic description and the speech-based phonetic description in a structure, the structure comprising paths representing phonetic units, at least one path for a phonetic unit from the text-based phonetic description being connected to a path for a phonetic unit from the speech-based phonetic description;
  
  selecting a sequence of paths through the structure; and
  
  generating the acoustic description of the word based on the selected sequence of paths wherein the acoustic description comprises a phonetic unit found in the speech-based phonetic description but not in the text-based phonetic description and a second phonetic unit found in the text-based phonetic description but not in the speech-based phonetic description.
- Dependent Claims (16, 17, 18, 19, 20, 21, 22, 23)
- - 16. The method of claim 15, wherein selecting a sequence of paths comprises generating a score for a path in the structure.
  - 17. The method of claim 16, wherein generating a score of a path comprises comparing a user's pronunciation of a word to a model for a phonetic unit in the structure.
  - 18. The method of claim 16, further comprising generating a plurality of text-based phonetic descriptions based on the text of the word.
  - 19. The method of claim 18, wherein generating the speech-based phonetic description comprises decoding a speech signal comprising a user's pronunciation of the word.
  - 20. The method of claim 19, wherein decoding a speech signal comprises using a language model of syllable-like-units.
  - 21. The method of claim 20, further comprising constructing the language model of syllable-like units through steps of:
    - calculating mutual information values for pairs of syllable-like units in a training dictionary;
      
      selecting a pair of syllable-like units based on the mutual information values; and
      
      removing the selected pair and substituting a new syllable-like unit in place of the removed selected pair in the training dictionary.
  - 22. The method of claim 21, further comprising:
    - recalculating mutual information values for remaining pairs of syllable-like units in the training dictionary;
      
      selecting a new pair of syllable-like units based on the recalculated mutual information values; and
      
      removing the new pair of syllable-like units and substituting a second new syllable-like unit in place of the new pair of syllable-like units in the training dictionary.
  - 23. The method of claim 22, further comprising using the training dictionary to generate a language model of syllable-like units.
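
Claims 21 through 23 describe a loop: score pairs, substitute a new unit for the selected pair in the training dictionary, recalculate, repeat, and finally build a language model from the rewritten dictionary. A minimal sketch of that loop follows. Several choices here are assumptions rather than anything the claims state: the weighted mutual information formula, selecting the highest-scoring pair, a fixed number of merges as the stopping rule, an unsmoothed bigram as the language model, and all function names and toy data.

```python
import math
from collections import Counter

def best_pair_by_mi(entries):
    """Return the adjacent pair with the highest (assumed) MI, or None."""
    units = Counter(u for e in entries for u in e)
    pairs = Counter((a, b) for e in entries for a, b in zip(e, e[1:]))
    if not pairs:
        return None
    n_units, n_pairs = sum(units.values()), sum(pairs.values())

    def mi(pair):
        p_ab = pairs[pair] / n_pairs
        # Equivalent to p_ab * log(p_ab / (p_a * p_b)) using raw unit counts.
        ratio = p_ab * n_units * n_units / (units[pair[0]] * units[pair[1]])
        return p_ab * math.log(ratio)

    return max(pairs, key=mi)

def substitute(entry, pair):
    """Remove each occurrence of pair and put one new unit in its place."""
    out, i = [], 0
    while i < len(entry):
        if tuple(entry[i:i + 2]) == pair:
            out.append(pair[0] + "_" + pair[1])
            i += 2
        else:
            out.append(entry[i])
            i += 1
    return out

def learn_units(entries, num_merges):
    """Repeat: recalculate MI, select a pair, rewrite the dictionary."""
    for _ in range(num_merges):
        pair = best_pair_by_mi(entries)
        if pair is None:
            break
        entries = [substitute(e, pair) for e in entries]
    return entries

def bigram_model(entries):
    """Unsmoothed bigram probabilities P(b | a) with sentence boundary markers."""
    context, bigrams = Counter(), Counter()
    for e in entries:
        padded = ["<s>"] + e + ["</s>"]
        for a, b in zip(padded, padded[1:]):
            context[a] += 1
            bigrams[(a, b)] += 1
    return {(a, b): n / context[a] for (a, b), n in bigrams.items()}

# Hypothetical training dictionary of phone sequences.
training = [["b", "ae", "t"], ["k", "ae", "t"], ["s", "ae", "t"], ["s", "ih", "t"]]
rewritten = learn_units(training, num_merges=2)
model = bigram_model(rewritten)
```

A practical system would likely stop merging based on a target inventory size or a score threshold and would smooth the language model; both are left out to keep the sketch short.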

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Hwang, Mei-Yuh
Primary Examiner(s): ABEBE, DANIEL DEMELASH
Application Number: US10/796,921
Publication Number: US 20050203738A1
Time in Patent Office: 2,015 Days
Field of Search: 704/231, 704/235, 704/251, 704/257
US Class Current: 704/231
CPC Class Codes:
G10L 15/063   Training
G10L 15/187   Phonemic context, e.g. pron...
G10L 2015/025   Phonemes, fenemes or fenone...
