Automatic segmentation in speech synthesis

US 20030187647A1
Filed: 01/14/2003
Published: 10/02/2003
Est. Priority Date: 03/29/2002
Status: Active Grant

First Claim

1. In a system that concatenates speech units to produce synthetic speech, a method for automatically segmenting unit labels, the method comprising:

training a set of Hidden Markov Models (HMMs) using seed data in a first iteration;

aligning the set of HMMs using a Viterbi alignment to produce segmented unit labels; and

adjusting boundaries of the unit labels using spectral boundary correction.

Abstract

Systems and methods for automatically segmenting speech inventories. A set of Hidden Markov Models (HMMs) is initialized using bootstrap data. The HMMs are then re-estimated and aligned to produce phone labels. The boundaries of the phone labels are then corrected using spectral boundary correction. Optionally, the process is repeated iteratively, with the spectral-boundary-corrected phone labels used as input in place of the bootstrap data, in order to further reduce mismatches between manual labels and the phone labels assigned by the HMM approach.
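
The iterative process described in the abstract can be sketched as a simple loop. This is only an illustrative skeleton, not the patented implementation: the function name `iterative_segmentation`, the `Label` tuple, the three callables (`train`, `align`, `correct`), and the fixed iteration count are all assumptions introduced here.

```python
from typing import Any, Callable, List, Sequence, Tuple

# Hypothetical label format: (phone symbol, start time in s, end time in s).
Label = Tuple[str, float, float]


def iterative_segmentation(
    seed_labels: Sequence[Label],
    train: Callable[[Sequence[Label]], Any],
    align: Callable[[Any], List[Label]],
    correct: Callable[[Sequence[Label]], List[Label]],
    n_iterations: int = 3,
) -> List[Label]:
    """Repeat train -> align -> correct, feeding corrected labels back in."""
    labels = list(seed_labels)
    for _ in range(n_iterations):
        models = train(labels)        # initialize / re-estimate the HMM set
        hmm_labels = align(models)    # e.g. Viterbi alignment -> phone labels
        labels = correct(hmm_labels)  # spectral boundary correction
    return labels
```

The first pass uses the bootstrap (seed) labels; every later pass trains on the labels corrected in the previous pass, which is the feedback the abstract describes.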

24 Claims

1. In a system that concatenates speech units to produce synthetic speech, a method for automatically segmenting unit labels, the method comprising:
- training a set of Hidden Markov Models (HMMs) using seed data in a first iteration;
  
  aligning the set of HMMs using a Viterbi alignment to produce segmented unit labels; and
  
  adjusting boundaries of the unit labels using spectral boundary correction.
  - 2. A method as defined in claim 1, wherein training a set of Hidden Markov Models further comprises:
    - initializing the set of HMMs using at least one of hand-labeled bootstrapped data, speaker-independent HMM bootstrapped data, and flat start data;
      
      re-estimating the set of HMMs; and
      
      performing an embedded re-estimation on the set of HMMs.
  - 3. A method as defined in claim 1, wherein adjusting boundaries of the unit labels using spectral boundary correction further comprises adjusting boundaries of the unit labels within specified time windows.
  - 4. A method as defined in claim 1, wherein adjusting boundaries of the unit labels using spectral boundary correction further comprises:
    - combining HMM-based segmentation with spectral features to reduce misalignments between target unit boundaries and boundaries assigned by the HMM-based segmentation.
  - 5. A method as defined in claim 1, wherein adjusting boundaries of the phone labels using spectral boundary correction further comprises:
    - identifying context-dependent time windows around the unit boundaries, wherein the unit boundaries include one or more of:
      
      a vowel-to-vowel boundary;
      
      a vowel-to-nasal boundary;
      
      a vowel-to-voiced stop boundary;
      
      a vowel-to-liquid boundary;
      
      a vowel-to-unvoiced stop boundary;
      
      a vowel-to-voiced fricative boundary;
      
      an unvoiced stop-to-vowel boundary;
      
      a nasal-to-vowel boundary;
      
      a voiced stop-to-vowel boundary;
      
      a liquid-to-vowel boundary;
      
      an unvoiced fricative-to-vowel boundary; and
      
      a voiced fricative-to-vowel boundary.
  - 6. A method as defined in claim 5, wherein context-dependent time windows are empirically determined by adjacent phones.
  - 7. A method as defined in claim 1, further comprising using the unit labels whose boundaries have been adjusted by spectral boundary correction as input for a next iteration of:
    - training a set of HMMs;
      
      aligning the set of HMMs using a Viterbi alignment to produce phone labels; and
      
      adjusting boundaries of the unit labels using spectral boundary correction.
  - 8. A computer-readable media having computer-executable instructions for implementing the method of claim 1.
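
The Viterbi alignment step in claim 1 can be illustrated with a small forced-alignment routine. This is a minimal sketch under strong simplifying assumptions that are not taken from the claims: each unit is modeled by a single state, the states must be visited in order (stay or advance by one at each frame), transition probabilities are ignored, and per-frame log-likelihoods are supplied directly. The names `forced_align` and `boundaries` are hypothetical.

```python
import math
from typing import List, Sequence


def forced_align(loglik: Sequence[Sequence[float]]) -> List[int]:
    """Best left-to-right state path; loglik[t][s] scores frame t in state s."""
    n_frames, n_states = len(loglik), len(loglik[0])
    if n_frames < n_states:
        raise ValueError("need at least one frame per state")
    neg_inf = -math.inf
    score = [[neg_inf] * n_states for _ in range(n_frames)]
    back = [[0] * n_states for _ in range(n_frames)]
    score[0][0] = loglik[0][0]  # the path must start in the first state
    for t in range(1, n_frames):
        for s in range(n_states):
            stay = score[t - 1][s]
            advance = score[t - 1][s - 1] if s > 0 else neg_inf
            if advance > stay:
                score[t][s], back[t][s] = advance + loglik[t][s], s - 1
            else:
                score[t][s], back[t][s] = stay + loglik[t][s], s
    # Backtrace from the last state at the last frame.
    path = [n_states - 1]
    for t in range(n_frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


def boundaries(path: Sequence[int]) -> List[int]:
    """Frame indices at which a new unit starts."""
    return [t for t in range(1, len(path)) if path[t] != path[t - 1]]
```

For example, with five frames where the first three favor state 0 and the last two favor state 1, `forced_align` returns `[0, 0, 0, 1, 1]` and `boundaries` returns `[3]`. These frame-level boundaries are what a later spectral correction step would adjust.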

9. In a system having a speech inventory that includes phone labels that are concatenated to form synthetic speech, a method for segmenting the phone labels, the method comprising:
- performing a first alignment on a trained set of HMMs to produce phone labels that are segmented, wherein each phone label has a spectral boundary; and
  
  performing spectral boundary correction on the phone labels, wherein spectral boundary correction re-aligns each spectral boundary using bending points of spectral transitions.
  - 10. A method as defined in claim 9, wherein performing a first alignment on a trained set of HMMs to produce phone labels that are segmented further comprises bootstrapping the set of HMMs with at least one of speaker-dependent HMMs and speaker-independent HMMs.
  - 11. A method as defined in claim 9, wherein performing a first alignment on a trained set of HMMs to produce phone labels that are segmented further comprises:
    - initializing the set of HMMs;
      
      re-estimating the set of HMMs; and
      
      performing embedded re-estimation on the set of HMMs.
  - 12. A method as defined in claim 9, wherein performing a first alignment on a trained set of HMMs to produce phone labels that are segmented further comprises performing a Viterbi alignment on the trained set of HMMs to produce phone labels that are segmented.
  - 13. A method as defined in claim 11, wherein performing a first alignment on a trained set of HMMs to produce phone labels that are segmented and performing spectral boundary correction on the phone labels are performed iteratively.
  - 14. A method as defined in claim 13, further comprising training the set of HMMs using phone labels having boundaries that have been re-aligned using spectral boundary correction.
  - 15. A method as defined in claim 9, wherein performing spectral boundary correction on the phone labels further comprises performing spectral boundary correction on the phone labels within a context-dependent time window.
  - 16. A method as defined in claim 15, further comprising empirically determining the context-dependent time window using adjacent phones.
  - 17. A method as defined in claim 15, wherein each spectral boundary is between a first phone class and a second phone class.
  - 18. A computer-readable media having computer-executable instructions for implementing the method of claim 9.
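
Claims 15 to 17 restrict the correction to a context-dependent time window keyed on the phone classes on either side of a boundary. One way to picture this is a lookup table. In the sketch below, the class names follow the claims, but the phone symbols, the window widths in milliseconds, the zero-width default, and the function name `search_window` are all hypothetical placeholders; the claims state only that the windows are determined empirically.

```python
from typing import Dict, Tuple

# Hypothetical phone-to-class map (symbols are placeholders).
PHONE_CLASS: Dict[str, str] = {
    "aa": "vowel", "iy": "vowel",
    "m": "nasal", "n": "nasal",
    "b": "voiced_stop", "p": "unvoiced_stop",
    "l": "liquid",
    "s": "unvoiced_fricative", "z": "voiced_fricative",
}

# Hypothetical half-widths in ms; real values would be tuned empirically.
WINDOW_MS: Dict[Tuple[str, str], float] = {
    ("vowel", "nasal"): 30.0,
    ("vowel", "vowel"): 40.0,
    ("unvoiced_stop", "vowel"): 15.0,
    ("nasal", "vowel"): 25.0,
}
DEFAULT_WINDOW_MS = 0.0  # assumption: unlisted pairs are left uncorrected


def search_window(
    left_phone: str, right_phone: str, boundary_ms: float, utterance_ms: float
) -> Tuple[float, float]:
    """Return the (start, end) time range in which a boundary may be moved."""
    key = (PHONE_CLASS.get(left_phone, ""), PHONE_CLASS.get(right_phone, ""))
    half = WINDOW_MS.get(key, DEFAULT_WINDOW_MS)
    return max(0.0, boundary_ms - half), min(utterance_ms, boundary_ms + half)
```

Keeping the window narrow for abrupt transitions and wider for gradual ones is a plausible design choice, but the specific numbers above are illustrative only.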

19. A method for segmenting phone labels to reduce misalignments in order to improve synthetic speech when the phone labels are concatenated, the method comprising:
- training a set of HMMs using one of a specific speaker's hand-labeled speech data and speaker-independent speech data;
  
  segmenting the trained set of HMMs using a first alignment to produce phone labels, wherein each phone label has a spectral boundary; and
  
  using a weighted slope metric to identify bending points of spectral transitions, wherein each bending point corresponds to a spectral boundary; and
  
  correcting a particular spectral boundary of a particular phone label if the particular spectral boundary does not coincide with a particular bending point.
  - 20. A method as defined in claim 19, wherein using a weighted slope metric to identify bending points of spectral transitions further comprises applying the weighted slope metric within context-dependent time windows such that spurious spectral boundaries are not applied to the phone labels.
  - 21. A method as defined in claim 20, further comprising retraining the set of HMMs using the phone labels that have been corrected using the weighted slope metric.
  - 22. A method as defined in claim 20, wherein each spectral boundary is defined by a first phone class and a second phone class, wherein the first phone class and the second phone class include at least one of a vowel, an unvoiced stop, a voiced stop, an unvoiced fricative, a voiced fricative, a liquid class and a nasal class.
  - 23. A method as defined in claim 20, further comprising determining context-dependent time windows empirically.
  - 24. A computer-readable media having computer-executable instructions for performing the method of claim 19.
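
The correction step in claims 19 and 20 can be sketched as a search, inside a window around the HMM boundary, for the frame where the spectrum changes most. The code below is a loose illustration, not the metric used in the patent: `slope_distance` is a simplified stand-in that compares spectral slopes between adjacent frames with optional per-channel weights, and treating the frame of maximum distance as the "bending point" is an assumption made here. The function names and the frame representation (a list of per-channel energies) are hypothetical.

```python
from typing import Optional, Sequence


def slope_distance(
    a: Sequence[float], b: Sequence[float],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Weighted squared difference between the spectral slopes of two frames."""
    slope_a = [a[i + 1] - a[i] for i in range(len(a) - 1)]
    slope_b = [b[i + 1] - b[i] for i in range(len(b) - 1)]
    if weights is None:
        weights = [1.0] * len(slope_a)
    return sum(w * (x - y) ** 2 for w, x, y in zip(weights, slope_a, slope_b))


def correct_boundary(
    frames: Sequence[Sequence[float]], boundary: int, half_window: int
) -> int:
    """Move `boundary` to the largest spectral change within the window.

    A boundary at index t means the new unit starts at frame t.
    """
    lo = max(1, boundary - half_window)
    hi = min(len(frames) - 1, boundary + half_window)
    best, best_d = boundary, -1.0
    for t in range(lo, hi + 1):
        d = slope_distance(frames[t - 1], frames[t])
        if d > best_d:  # ties keep the earliest candidate
            best, best_d = t, d
    return best
```

With a half-window of zero the boundary is left where the HMM alignment put it, which mirrors the idea that the window limits how far a boundary can move and so keeps spurious spectral changes elsewhere in the utterance from being picked up.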

Current Assignee
Nuance Communications, Inc. (Microsoft Corporation)
Original Assignee
AT&T Corporation (AT&T, Inc.)
Inventors
Kim, Yeon-Jun; Conkie, Alistair D.

Granted Patent

US 7,266,497 B2
US Class Current

704/258
CPC Class Codes

G10L 13/06 Elementary speech units use...
