Methods and apparatus for reducing spurious insertions in speech recognition

US 20040199385A1
Filed: 04/04/2003
Published: 10/07/2004
Est. Priority Date: 04/04/2003
Status: Active Grant

Abstract

Techniques for improving an automatic baseform generation system. More particularly, the invention provides techniques for reducing insertion of spurious speech events in a word or phone sequence generated by an automatic baseform generation system. Such automatic baseform generation techniques may be accomplished by enhancing the scores of long-lasting speech events with respect to the scores of short-lasting events. For example, this may be achieved by merging competing candidates that relate to the same speech event (e.g., phone or word) and that overlap in time into a single candidate, the score of which may be equal to the sum of the scores of the merged candidates.
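The merging idea in the abstract can be illustrated with a minimal sketch. It assumes each candidate carries a label, a frame span, and a score; the `Candidate` class, its field names, and the `merge_overlapping` function are hypothetical and are not taken from the patent. Same-label candidates whose spans overlap are collapsed into one candidate whose score is the sum of the merged scores.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    label: str    # speech event, e.g. a phone or a word (hypothetical field)
    start: int    # first frame covered, inclusive
    end: int      # last frame covered, inclusive
    score: float  # e.g. a posterior probability


def merge_overlapping(candidates):
    """Merge same-label candidates that overlap in time, summing their scores."""
    merged = []
    # Sorting by label, then start time, makes overlapping same-label spans adjacent.
    for cand in sorted(candidates, key=lambda c: (c.label, c.start, c.end)):
        last = merged[-1] if merged else None
        if last is not None and last.label == cand.label and cand.start <= last.end:
            # Overlap with the previous same-label span: extend it and add the score.
            last.end = max(last.end, cand.end)
            last.score += cand.score
        else:
            merged.append(Candidate(cand.label, cand.start, cand.end, cand.score))
    # Return the merged candidates in time order.
    return sorted(merged, key=lambda c: (c.start, c.end, c.label))


example = [
    Candidate("AH", 0, 9, 0.30),
    Candidate("AH", 2, 11, 0.25),
    Candidate("T", 5, 6, 0.40),
    Candidate("AH", 20, 25, 0.10),
]
for c in merge_overlapping(example):
    print(c)
```

In this made-up example the two overlapping "AH" candidates individually score below the short "T" candidate, but their merged score exceeds it, which is the effect the abstract describes: long-lasting events gain score relative to short-lasting ones.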

30 Claims

1. A method of automatically generating a phonetic baseform from a spoken utterance, the method comprising the steps of:
- obtaining a stream of acoustic observations representing the spoken utterance;
  
  generating a sequence of subphone units, wherein candidate subphones that relate to the same speech event and that overlap in time are merged into a single candidate subphone, and wherein the sequence of subphone units represents candidate subphone units substantially maximizing a likelihood associated with the stream of acoustic observations; and
  
  converting the sequence of subphone units into a phonetic baseform.
- - 2. The method of claim 1, wherein a score associated with the merged candidate subphones is equal to the sum of scores associated with the merged candidate subphones.
  - 3. The method of claim 1, wherein the generating step further comprises building a lattice from the stream of acoustic observations using acoustic models and a phone graph.
  - 4. The method of claim 3, wherein the lattice is a subphone graph specifying a starting time, an ending time and an acoustic score associated to each subphone in a candidate sequence of subphones.
  - 5. The method of claim 4, wherein the starting time, the ending time and the acoustic score of each subphone in a particular sequence of subphones are determined by finding a time-alignment with the highest likelihood between the stream of acoustic observations and the sequence of subphones.
  - 6. The method of claim 5, wherein the likelihood of a time-alignment is computed by multiplying acoustic scores associated with the stream of acoustic observations.
  - 7. The method of claim 6, wherein an acoustic score of an observation is given by an acoustic model of a subphone unit with which it is aligned.
  - 8. The method of claim 3, wherein the generating step further comprises transforming the lattice to produce the generated sequence of subphones.
  - 9. The method of claim 8, wherein the transforming step further comprises rescoring the lattice by using a transition model between the subphones.
  - 10. The method of claim 9, wherein the lattice comprises arcs and the transforming step further comprises computing a posterior probability for each arc in the lattice as the sum of the posterior probabilities of the paths which go through that particular arc.
  - 11. The method of claim 10, wherein the transforming step further comprises modifying a topology of the lattice by merging the arcs that bear the same subphone label and that overlap in time, while maintaining the arc order of the original lattice.
  - 12. The method of claim 11, wherein the transforming step further comprises assigning a new score to each new arc resulting from the merging of overlapping arcs by summing the posterior probabilities of the merged arcs.
  - 13. The method of claim 12, wherein the transforming step further comprises identifying the generated sequence of subphone units as the sequence with the highest cumulative score in the transformed lattice.
  - 14. The method of claim 1, wherein the converting step further comprises replacing subphone labels of the generated sequence of subphones with their phone counterparts and merging repeated phones.
  - 15. The method of claim 14, wherein the converting step further comprises filtering out beginning and ending silence labels.
  - 16. The method of claim 1, wherein the generating step further comprises use of an acoustic model that is context-dependent.
  - 17. The method of claim 1, wherein the generating step further comprises use of a transition model estimated off-line by aligning a dataset of speech with a known transcription on acoustic models of the subphone units, and by estimating a bigram language model on labels of the subphone units in the alignment.
  - 18. The method of claim 1, wherein the spoken utterance represents a word and further wherein multiple phonetic baseforms are generated for the word.
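The converting step of claims 14 and 15 can be sketched as follows. This is an illustration only: the label convention `<phone>_<position>` (for example `K_1`), the silence label `SIL`, and the function names are assumptions, since the claims do not specify how subphone labels are written.

```python
from itertools import groupby

SILENCE = "SIL"  # hypothetical silence label


def subphone_to_phone(label):
    """Map a subphone label to its phone, assuming a '<phone>_<position>' convention."""
    return label.rsplit("_", 1)[0]


def to_baseform(subphones):
    """Replace subphone labels with phones, merge repeats, strip edge silence."""
    phones = [subphone_to_phone(s) for s in subphones]
    # Merge runs of repeated phones into a single phone.
    collapsed = [phone for phone, _ in groupby(phones)]
    # Filter out beginning and ending silence labels only; inner silence is kept.
    while collapsed and collapsed[0] == SILENCE:
        collapsed.pop(0)
    while collapsed and collapsed[-1] == SILENCE:
        collapsed.pop()
    return collapsed


sequence = ["SIL", "K_1", "K_2", "K_3", "AE_1", "AE_2", "AE_3",
            "T_1", "T_2", "T_3", "SIL"]
print(to_baseform(sequence))  # ['K', 'AE', 'T']
```

Under this simple scheme, two genuinely adjacent occurrences of the same phone would also be collapsed into one; a real system might need extra information to tell them apart.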

19. Apparatus for automatically generating a phonetic baseform from a spoken utterance, the apparatus comprising:
- a memory; and
  
  at least one processor coupled to the memory and operative to:
  
  (i) obtain a stream of acoustic observations representing the spoken utterance;
  
  (ii) generate a sequence of subphone units, wherein candidate subphones that relate to the same speech event and that overlap in time are merged into a single candidate subphone, and wherein the sequence of subphone units represents candidate subphone units substantially maximizing a likelihood associated with the stream of acoustic observations; and
  
  (iii) convert the sequence of subphone units into a phonetic baseform.
- - 20. The apparatus of claim 19, wherein a score associated with the merged candidate subphones is equal to the sum of scores associated with the merged candidate subphones.
  - 21. The apparatus of claim 19, wherein the generating operation further comprises building a lattice from the stream of acoustic observations using acoustic models and a phone graph.
  - 22. The apparatus of claim 21, wherein the lattice is a subphone graph specifying a starting time, an ending time and an acoustic score associated to each subphone in a candidate sequence of subphones.
  - 23. The apparatus of claim 21, wherein the generating operation further comprises transforming the lattice to produce the generated sequence of subphones.
  - 24. The apparatus of claim 23, wherein the transforming operation further comprises rescoring the lattice by using a transition model between the subphones.
  - 25. The apparatus of claim 24, wherein the lattice comprises arcs and the transforming operation further comprises computing a posterior probability for each arc in the lattice as the sum of the posterior probabilities of the paths which go through that particular arc.
  - 26. The apparatus of claim 25, wherein the transforming operation further comprises modifying a topology of the lattice by merging the arcs that bear the same subphone label and that overlap in time, while maintaining the arc order of the original lattice.
  - 27. The apparatus of claim 26, wherein the transforming operation further comprises assigning a new score to each new arc resulting from the merging of overlapping arcs by summing the posterior probabilities of the merged arcs.
  - 28. The apparatus of claim 27, wherein the transforming operation further comprises identifying the generated sequence of subphone units as the sequence with the highest cumulative score in the transformed lattice.
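The arc posterior described in claims 10 and 25 (the sum of the posterior probabilities of the paths through an arc) is commonly computed with a forward-backward pass over the lattice. The sketch below is one such computation under stated assumptions, not the patent's implementation: the lattice is a small acyclic graph whose node ids are topologically ordered (`src < dst` for every arc), arc weights are plain likelihoods rather than log scores, and the tuple layout and function name are hypothetical.

```python
from collections import defaultdict


def arc_posteriors(arcs, start, final):
    """Return one posterior per arc for a lattice given as (src, dst, label, weight).

    Assumes node ids are topologically ordered, i.e. src < dst for every arc.
    """
    # Forward pass: alpha[n] is the total weight of all partial paths start -> n.
    alpha = defaultdict(float)
    alpha[start] = 1.0
    for src, dst, _, weight in sorted(arcs, key=lambda a: a[0]):
        alpha[dst] += alpha[src] * weight

    # Backward pass: beta[n] is the total weight of all partial paths n -> final.
    beta = defaultdict(float)
    beta[final] = 1.0
    for src, dst, _, weight in sorted(arcs, key=lambda a: a[1], reverse=True):
        beta[src] += weight * beta[dst]

    total = alpha[final]  # summed weight of all complete paths
    # Weight of all paths through an arc, normalized by the total path weight.
    return [alpha[src] * weight * beta[dst] / total for src, dst, _, weight in arcs]


# Toy lattice with two competing paths, 0 -> 1 -> 3 and 0 -> 2 -> 3.
lattice = [
    (0, 1, "K_1", 0.6),
    (0, 2, "K_1", 0.4),
    (1, 3, "AE_1", 0.5),
    (2, 3, "AE_1", 1.0),
]
for arc, posterior in zip(lattice, arc_posteriors(lattice, start=0, final=3)):
    print(arc, round(posterior, 3))
```

Practical systems typically work with log scores to avoid numerical underflow on long utterances; plain probabilities are used here only to keep the arithmetic easy to follow.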

29. An article of manufacture for automatically generating a phonetic baseform from a spoken utterance, comprising a machine readable medium containing one or more programs which when executed implement the steps of:
- obtaining a stream of acoustic observations representing the spoken utterance;
  
  generating a sequence of subphone units, wherein candidate subphones that relate to the same speech event and that overlap in time are merged into a single candidate subphone, and wherein the sequence of subphone units represents candidate subphone units substantially maximizing a likelihood associated with the stream of acoustic observations; and
  
  converting the sequence of subphone units into a phonetic baseform.

30. A speech recognition system, comprising:
- a speech recognition engine; and
  
  a recognition lexicon associated with the speech recognition engine, the recognition lexicon including at least one phonetic baseform automatically generated by:
  
  (i) obtaining a stream of acoustic observations representing the spoken utterance;
  
  (ii) generating a sequence of subphone units, wherein candidate subphones that relate to the same speech event and that overlap in time are merged into a single candidate subphone, and wherein the sequence of subphone units represents candidate subphone units substantially maximizing a likelihood associated with the stream of acoustic observations; and
  
  (iii) converting the sequence of subphone units into the at least one phonetic baseform.

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: International Business Machines Corporation
Inventors: Deligne, Sabine V.; Mangu, Lidia L.
Granted Patent: US 7,409,345 B2
US Class Current: 704/235
CPC Class Codes:
G10L 15/063 Training
G10L 2015/025 Phonemes, fenemes or fenone...