Methods and apparatus for automatic generation of multiple pronunciations from acoustic data

US 7,181,395 B1
Filed: 10/27/2000
Issued: 02/20/2007
Est. Priority Date: 10/27/2000
Status: Expired due to Term

2 Assignments

0 Petitions

Abstract

Methods and apparatus for automatically deriving multiple phonetic baseforms of a word from a speech utterance of this word are provided in accordance with the present invention. In one embodiment, a method of automatically generating two or more phonetic baseforms from a spoken utterance representing a word includes the steps of: transforming the spoken utterance into a stream of acoustic observations; generating two or more strings of subphone units, wherein each string of subphone units represents a string of subphone units substantially maximizing a log-likelihood of the stream of acoustic observations, and wherein the log-likelihood is computed as a weighted sum of a transition score associated with a transition model and of an acoustic score associated with an acoustic model; and converting the two or more strings of subphone units into two or more phonetic baseforms.
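
The weighted log-likelihood described above can be written compactly. The notation below is an assumption introduced for illustration and does not appear in the record: o_1..o_T is the stream of acoustic observations, s = s_1..s_T is a candidate string of subphone units (one per observation), and w is the weight, following the w and (1−w) weights recited in the dependent claims. The first-order form of the transition model and the per-frame factorization of the acoustic score are also assumptions, and context dependence of the acoustic model is omitted for brevity.

```latex
\[
\mathcal{L}_w(s) \;=\; w \sum_{t=2}^{T} \log P(s_t \mid s_{t-1})
\;+\; (1-w) \sum_{t=1}^{T} \log p(o_t \mid s_t),
\qquad
\hat{s}_w \;=\; \arg\max_{s} \, \mathcal{L}_w(s),
\qquad 0 \le w \le 1 .
\]
```

Under this reading, each value of w yields its own maximizing string, so sweeping w over several values can produce several distinct strings and hence several candidate baseforms.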

50 Citations

21 Claims

1. A method of automatically generating two or more phonetic baseforms from a spoken utterance representing a word, the method comprising the steps of:
- transforming the spoken utterance into a stream of acoustic observations;
- generating two or more strings of subphone units, wherein each string of subphone units represents a string of subphone units substantially maximizing a log-likelihood of the stream of acoustic observations, and wherein the log-likelihood is computed as a weighted sum of a transition score associated with a transition model between the subphone units and of an acoustic score associated with a separate context-dependent acoustic model of the subphone units; and
- converting the two or more strings of subphone units into two or more phonetic baseforms.
- Dependent claims (2, 3, 4, 5, 6, 7, 8):
  - 2. The method of claim 1, further comprising the step of adding the two or more phonetic baseforms to a recognition lexicon associated with a speech recognition system.
  - 3. The method of claim 2, wherein the word is a word not currently in a vocabulary of the speech recognition system.
  - 4. The method of claim 2, wherein the word is a word currently in a vocabulary of the speech recognition system but for which pronunciation variants are desired to be added to the recognition lexicon.
  - 5. The method of claim 1, wherein the stream of acoustic observations includes a stream of feature vectors.
  - 6. The method of claim 1, wherein the weighted sum includes weights respectively of w and (1−w), wherein each value of w defines a distinct log-likelihood function which reaches its maximum value for possibly distinct strings of subphone units.
  - 7. The method of claim 6, wherein each value of w is chosen between 0 and 1.
  - 8. The method of claim 1, wherein the converting step comprises, for each string of subphone units:
    - replacing the subphone units with corresponding phones; and
    - merging together repeated phones.
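
The generating step of claim 1, together with the weights of claims 6 and 7, can be sketched as a Viterbi search run once per value of w. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the toy two-unit model, and the choice to feed precomputed per-frame acoustic log-scores (standing in for the separate context-dependent acoustic model) are all hypothetical.

```python
import math

def weighted_viterbi(log_acoustic, log_trans, w):
    """Return the subphone index path maximizing
    sum_t [ w * log_trans[prev][cur] + (1 - w) * log_acoustic[t][cur] ].

    log_acoustic[t][u]: assumed precomputed acoustic log-score of unit u at frame t.
    log_trans[p][u]:    assumed log-probability of moving from unit p to unit u.
    """
    n_units = len(log_trans)
    # The first frame has no incoming transition, only an acoustic term.
    score = [(1.0 - w) * log_acoustic[0][u] for u in range(n_units)]
    backptrs = []
    for frame in log_acoustic[1:]:
        new_score, ptrs = [], []
        for cur in range(n_units):
            # Best predecessor for this unit under the weighted transition score.
            best = max(range(n_units),
                       key=lambda p: score[p] + w * log_trans[p][cur])
            ptrs.append(best)
            new_score.append(score[best] + w * log_trans[best][cur]
                             + (1.0 - w) * frame[cur])
        score = new_score
        backptrs.append(ptrs)
    # Trace the best path back from the best final unit.
    path = [max(range(n_units), key=lambda u: score[u])]
    for ptrs in reversed(backptrs):
        path.append(ptrs[path[-1]])
    path.reverse()
    return path

def decode_for_weights(log_acoustic, log_trans, weights):
    """Run one weighted search per value of w; distinct w may give distinct paths."""
    return {w: weighted_viterbi(log_acoustic, log_trans, w) for w in weights}

# Toy data (hypothetical): two subphone units, four frames.
log_trans = [[math.log(0.9), math.log(0.1)],
             [math.log(0.1), math.log(0.9)]]
log_acoustic = [[math.log(0.8), math.log(0.2)],
                [math.log(0.4), math.log(0.6)],
                [math.log(0.9), math.log(0.1)],
                [math.log(0.9), math.log(0.1)]]

paths = decode_for_weights(log_acoustic, log_trans, [0.0, 0.5])
print(paths[0.0])  # [0, 1, 0, 0]  (acoustic score only)
print(paths[0.5])  # [0, 0, 0, 0]  (transition model smooths the path)
```

In this toy example the two weights yield two different unit strings from the same observations, which is the behavior claim 6 describes: each value of w defines its own log-likelihood function, whose maximum may be reached by a different string.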

9. Apparatus for automatically generating two or more phonetic baseforms from a spoken utterance representing a word, the apparatus comprising:
- at least one processor operative to:
  - (i) transform the spoken utterance into a stream of acoustic observations;
  - (ii) generate two or more strings of subphone units, wherein each string of subphone units represents a string of subphone units substantially maximizing a log-likelihood of the stream of acoustic observations, and wherein the log-likelihood is computed as a weighted sum of a transition score associated with a transition model between the subphone units and of an acoustic score associated with a separate context-dependent acoustic model of the subphone units; and
  - (iii) convert the two or more strings of subphone units into two or more phonetic baseforms.
- Dependent claims (10, 11, 12, 13, 14, 15, 16):
  - 10. The apparatus of claim 9, wherein the at least one processor is further operative to add the two or more phonetic baseforms to a recognition lexicon associated with a speech recognition system.
  - 11. The apparatus of claim 10, wherein the word is a word not currently in a vocabulary of the speech recognition system.
  - 12. The apparatus of claim 10, wherein the word is a word currently in a vocabulary of the speech recognition system but for which pronunciation variants are desired to be added to the recognition lexicon.
  - 13. The apparatus of claim 9, wherein the stream of acoustic observations includes a stream of feature vectors.
  - 14. The apparatus of claim 9, wherein the weighted sum includes weights respectively of w and (1−w), wherein each value of w defines a distinct log-likelihood function which reaches its maximum value for possibly distinct strings of subphone units.
  - 15. The apparatus of claim 14, wherein each value of w is chosen between 0 and 1.
  - 16. The apparatus of claim 9, wherein the converting operation comprises, for each string of subphone units, the operations of replacing the subphone units with corresponding phones, and merging together repeated phones.
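
The converting operation recited in claims 8 and 16 (replace each subphone unit with its phone, then merge repeated phones) can be sketched as follows. The subphone naming scheme (`K_b`, `AE_m`, and so on) and the mapping table are hypothetical; the record does not specify how subphone units are labeled.

```python
from itertools import groupby

def subphones_to_baseform(subphones, subphone_to_phone):
    """Map each subphone unit to its phone, then collapse consecutive repeats."""
    phones = [subphone_to_phone[s] for s in subphones]
    # groupby yields one group per run of identical consecutive phones.
    return [phone for phone, _run in groupby(phones)]

# Hypothetical mapping from subphone units to phones.
subphone_to_phone = {
    "K_b": "K", "K_e": "K",
    "AE_b": "AE", "AE_m": "AE", "AE_e": "AE",
    "T_b": "T", "T_e": "T",
}

# One decoded string of subphone units, one unit per acoustic observation.
decoded = ["K_b", "K_e", "AE_b", "AE_m", "AE_m", "AE_e", "T_b", "T_e"]
print(subphones_to_baseform(decoded, subphone_to_phone))  # ['K', 'AE', 'T']
```

Applying this function to each decoded string gives one phonetic baseform per string. Note that merging as written also collapses a phone that is genuinely spoken twice in a row, since only consecutive repeats are visible at this stage.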

17. An article of manufacture for automatically generating two or more phonetic baseforms from a spoken utterance representing a word, comprising a machine readable medium containing one or more programs which when executed implement the steps of:
- transforming the spoken utterance into a stream of acoustic observations;
- generating two or more strings of subphone units, wherein each string of subphone units represents a string of subphone units substantially maximizing a log-likelihood of the stream of acoustic observations, and wherein the log-likelihood is computed as a weighted sum of a transition score associated with a transition model between the subphone units and of an acoustic score associated with a separate context-dependent acoustic model of the subphone units; and
- converting the two or more strings of subphone units into two or more phonetic baseforms.
- Dependent claims (18, 19, 20):
  - 18. The article of claim 17, further comprising the step of adding the two or more phonetic baseforms to a recognition lexicon associated with a speech recognition system.
  - 19. The article of claim 18, wherein the word is a word not currently in a vocabulary of the speech recognition system.
  - 20. The article of claim 18, wherein the word is a word currently in a vocabulary of the speech recognition system but for which pronunciation variants are desired to be added to the recognition lexicon.

21. A computing device having a speech recognition engine comprising apparatus for automatically generating two or more phonetic baseforms from a spoken utterance representing a word, the apparatus operative to:
- (i) transform the spoken utterance into a stream of acoustic observations;
- (ii) generate two or more strings of subphone units, wherein each string of subphone units represents a string of subphone units substantially maximizing a log-likelihood of the stream of acoustic observations, and wherein the log-likelihood is computed as a weighted sum of a transition score associated with a transition model between the subphone units and of an acoustic score associated with a separate context-dependent acoustic model of the subphone units;
- (iii) convert the two or more strings of subphone units into two or more phonetic baseforms; and
- (iv) add the two or more phonetic baseforms to a recognition lexicon associated with the speech recognition engine.
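
The lexicon update recited in step (iv), and in claims 2 to 4, covers two cases: a word not yet in the vocabulary, and a word already present for which pronunciation variants are added. A minimal sketch follows, assuming the recognition lexicon is a plain dictionary from words to lists of baseforms; this data structure and the function name are hypothetical and not taken from the record.

```python
def add_baseforms(lexicon, word, baseforms):
    """Add baseforms for `word` to `lexicon`, skipping pronunciations already listed.

    lexicon:   dict mapping a word to a list of baseforms (tuples of phones).
    baseforms: iterable of phone sequences derived from the spoken utterance.
    Returns the number of new pronunciations added.
    """
    entries = lexicon.setdefault(word, [])  # creates the entry for a new word
    added = 0
    for baseform in baseforms:
        pronunciation = tuple(baseform)
        if pronunciation not in entries:
            entries.append(pronunciation)
            added += 1
    return added

lexicon = {"cat": [("K", "AE", "T")]}

# Existing word: only the new variant is added.
n_variants = add_baseforms(lexicon, "cat", [["K", "AE", "T"], ["K", "AH", "T"]])

# New word: an entry is created with both derived baseforms.
n_new = add_baseforms(lexicon, "data", [["D", "EY", "T", "AH"], ["D", "AE", "T", "AH"]])
```

Skipping duplicates is a design choice of this sketch: two weights w may decode to the same baseform, and storing it once keeps the lexicon free of redundant entries.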

Current Assignee: Nuance Communications, Inc. (Microsoft Corporation)
Original Assignee: International Business Machines Corporation
Inventors: Gopinath, Ramesh Ambat; Deligne, Sabine V.; Maison, Benoit Emmanuel Ghislain
Primary Examiner(s): Dorvil, Richemond
Assistant Examiner(s): Han, Qi
Application Number: US09/698,470
Time in Patent Office: 2,307 Days
Field of Search: 704/249, 704/231, 704/235, 704/243, 704/257, 704/236, 704/240
US Class Current: 704/249
CPC Class Codes: G10L 15/065 Adaptation
