Apparatus for Reducing Spurious Insertions in Speech Recognition

US 20080281593A1
Filed: 07/16/2008
Published: 11/13/2008
Est. Priority Date: 04/04/2003
Status: Active Grant

Abstract

Techniques for improving an automatic baseform generation system. More particularly, the invention provides techniques for reducing insertion of spurious speech events in a word or phone sequence generated by an automatic baseform generation system. Such automatic baseform generation techniques may be accomplished by enhancing the scores of long-lasting speech events with respect to the scores of short-lasting events. For example, this may be achieved by merging competing candidates that relate to the same speech event (e.g., phone or word) and that overlap in time into a single candidate, the score of which may be equal to the sum of the scores of the merged candidates.
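
The merging step described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the `Candidate` type, the `merge_overlapping` function, and the rule that two candidates overlap when one starts before the other ends are all assumptions made for illustration. Only the idea of summing the scores of same-label, time-overlapping candidates comes from the text above.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical candidate speech event (phone, subphone, or word)."""
    label: str
    start: float
    end: float
    score: float


def merge_overlapping(candidates):
    """Merge same-label candidates that overlap in time, summing their scores."""
    merged = []
    # Sorting by (label, start) puts candidates for the same event next to
    # each other, so each one only has to be compared with the previous entry.
    for cand in sorted(candidates, key=lambda c: (c.label, c.start)):
        last = merged[-1] if merged else None
        if last is not None and last.label == cand.label and cand.start < last.end:
            # Assumed overlap rule: the new candidate starts before the
            # current merged candidate ends.
            last.end = max(last.end, cand.end)
            last.score += cand.score
        else:
            # Copy the candidate so the caller's objects are never modified.
            merged.append(Candidate(cand.label, cand.start, cand.end, cand.score))
    return sorted(merged, key=lambda c: c.start)
```

For instance, three overlapping "AA" candidates scored 0.3, 0.2 and 0.25 would merge into a single "AA" candidate scored about 0.75, which then outranks a short competing candidate scored 0.4 even though that candidate beat each "AA" hypothesis individually. This is the sense in which long-lasting events are favored over short-lasting ones.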

12 Claims

1. Apparatus for automatically generating a phonetic baseform from a spoken utterance, the apparatus comprising:
  a memory; and
  
  at least one processor coupled to the memory and operative to:
  
  (i) obtain a stream of acoustic observations representing the spoken utterance;
  
  (ii) generate a sequence of subphone units, wherein candidate subphones that relate to the same speech event and that overlap in time are merged into a single candidate subphone, wherein a score associated with the single candidate subphone is equal to the sum of scores associated with the merged candidate subphones such that a score associated with a longer-lasting speech event is enhanced as compared with a score associated with a shorter-lasting speech event, and further wherein the sequence of subphone units represents candidate subphone units substantially maximizing a likelihood associated with the stream of acoustic observations; and
  
  (iii) convert the sequence of subphone units into a phonetic baseform.
  - 2. The apparatus of claim 1, wherein a score associated with the merged candidate subphones is equal to the sum of scores associated with the merged candidate subphones.
  - 3. The apparatus of claim 1, wherein the generating operation further comprises building a lattice from the stream of acoustic observations using acoustic models and a phone graph.
  - 4. The apparatus of claim 3, wherein the lattice is a subphone graph specifying a starting time, an ending time and an acoustic score associated to each subphone in a candidate sequence of subphones.
  - 5. The apparatus of claim 3, wherein the generating operation further comprises transforming the lattice to produce the generated sequence of subphones.
  - 6. The apparatus of claim 5, wherein the transforming operation further comprises rescoring the lattice by using a transition model between the subphones.
  - 7. The apparatus of claim 6, wherein the lattice comprises arcs and the transforming operation further comprises computing a posterior probability for each arc in the lattice as the sum of the posterior probabilities of the paths which go through that particular arc.
  - 8. The apparatus of claim 7, wherein the transforming operation further comprises modifying a topology of the lattice by merging the arcs that bear the same subphone label and that overlap in time, while maintaining the arc order of the original lattice.
  - 9. The apparatus of claim 8, wherein the transforming operation further comprises assigning a new score to each new arc resulting from the merging of overlapping arcs by summing the posterior probabilities of the merged arcs.
  - 10. The apparatus of claim 9, wherein the transforming operation further comprises identifying the generated sequence of subphone units as the sequence with the highest cumulative score in the transformed lattice.
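
The posterior and merging operations recited in claims 7 to 9 can be sketched on a toy lattice as follows. This is a simplified illustration under stated assumptions, not the claimed apparatus: the `ARCS` data, the tuple layout, the function names `arc_posteriors` and `merge_arcs`, and the overlap test are all hypothetical. Nodes are assumed to be numbered in topological order, and the sketch does not model the arc-order constraint of claim 8 or the final best-sequence search of claim 10.

```python
from collections import defaultdict

# Hypothetical toy lattice; each arc is (src, dst, label, start, end, prob).
# Nodes are numbered in topological order: 0 is the start, 4 is the end.
ARCS = [
    (0, 1, "AA", 0, 10, 0.3),
    (0, 2, "AA", 0, 12, 0.3),
    (0, 3, "AE", 0, 11, 0.4),
    (1, 4, "T", 10, 20, 1.0),
    (2, 4, "T", 12, 20, 1.0),
    (3, 4, "T", 11, 20, 1.0),
]


def arc_posteriors(arcs):
    """Return one posterior per arc: the summed posterior of all paths through it."""
    end = max(arc[1] for arc in arcs)
    fwd = defaultdict(float, {0: 1.0})    # probability mass reaching each node
    bwd = defaultdict(float, {end: 1.0})  # probability mass from each node to the end
    # Forward pass: visit arcs in increasing order of source node.
    for src, dst, _, _, _, prob in sorted(arcs, key=lambda a: a[0]):
        fwd[dst] += fwd[src] * prob
    # Backward pass: visit arcs in decreasing order of destination node.
    for src, dst, _, _, _, prob in sorted(arcs, key=lambda a: -a[1]):
        bwd[src] += prob * bwd[dst]
    total = fwd[end]
    return [fwd[src] * prob * bwd[dst] / total for src, dst, _, _, _, prob in arcs]


def merge_arcs(arcs, posteriors):
    """Merge arcs with the same label that overlap in time, summing posteriors."""
    items = sorted(
        ((label, start, end, post)
         for (_, _, label, start, end, _), post in zip(arcs, posteriors)),
        key=lambda item: (item[0], item[1]),
    )
    merged = []
    for label, start, end, post in items:
        if merged and merged[-1][0] == label and start < merged[-1][2]:
            # Same label and overlapping in time: widen the span, add the score.
            _, prev_start, prev_end, prev_post = merged[-1]
            merged[-1] = (label, prev_start, max(prev_end, end), prev_post + post)
        else:
            merged.append((label, start, end, post))
    return merged
```

In this toy lattice the single most probable path is "AE T" with a posterior of about 0.4, while each "AA" path scores only about 0.3. After `merge_arcs(ARCS, arc_posteriors(ARCS))`, the two overlapping "AA" arcs collapse into one arc scored about 0.6, so "AA" overtakes "AE" as the preferred first unit.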

11. An article of manufacture for automatically generating a phonetic baseform from a spoken utterance, comprising a machine readable medium containing one or more programs which when executed implement the steps of:
  obtaining a stream of acoustic observations representing the spoken utterance;
  
  generating a sequence of subphone units, wherein candidate subphones that relate to the same speech event and that overlap in time are merged into a single candidate subphone, wherein a score associated with the single candidate subphone is equal to the sum of scores associated with the merged candidate subphones such that a score associated with a longer-lasting speech event is enhanced as compared with a score associated with a shorter-lasting speech event, and further wherein the sequence of subphone units represents candidate subphone units substantially maximizing a likelihood associated with the stream of acoustic observations; and
  
  converting the sequence of subphone units into a phonetic baseform.

12. A speech recognition system, comprising:
  a speech recognition engine; and
  
  a recognition lexicon associated with the speech recognition engine, the recognition lexicon including at least one phonetic baseform automatically generated by:
  
  (i) obtaining a stream of acoustic observations representing the spoken utterance;
  
  (ii) generating a sequence of subphone units, wherein candidate subphones that relate to the same speech event and that overlap in time are merged into a single candidate subphone, wherein a score associated with the single candidate subphone is equal to the sum of scores associated with the merged candidate subphones such that a score associated with a longer-lasting speech event is enhanced as compared with a score associated with a shorter-lasting speech event, and further wherein the sequence of subphone units represents candidate subphone units substantially maximizing a likelihood associated with the stream of acoustic observations; and
  
  (iii) converting the sequence of subphone units into the at least one phonetic baseform.

Current Assignee
Nuance Communications, Inc. (Microsoft Corporation)
Original Assignee
International Business Machines Corporation
Inventors
Deligne, Sabine V.; Mangu, Lidia L.

Granted Patent

US 7,783,484 B2
US Class Current

704/243
CPC Class Codes

G10L 15/063 Training

G10L 2015/025 Phonemes, fenemes or fenone...
