Method and apparatus for improved duration modeling of phonemes

US 20020138270A1
Filed: 02/22/2002
Published: 09/26/2002
Est. Priority Date: 12/18/1997
Status: Active Grant

First Claim

1. A method for producing synthetic speech comprising the steps of:

receiving text into a processor;

processing the text using a phoneme duration model, the phoneme duration model produced by developing a non-exponential functional transformation form for use with a generalized additive model; and

generating speech signals representative of the received text.

Abstract

A method and an apparatus for improved duration modeling of phonemes in a speech synthesis system are provided. According to one aspect, text is received into a processor of a speech synthesis system. The received text is processed using a sum-of-products phoneme duration model that is used in either the formant method or the concatenative method of speech generation. The phoneme duration model, which is used along with a phoneme pitch model, is produced by developing a non-exponential functional transformation form for use with a generalized additive model. The non-exponential functional transformation form comprises a root sinusoidal transformation that is controlled in response to a minimum phoneme duration and a maximum phoneme duration. The minimum and maximum phoneme durations are observed in training data. The received text is processed by specifying at least one of a number of contextual factors for the generalized additive model. An inverse of the non-exponential functional transformation is applied to duration observations, or training data. Coefficients are generated for use with the generalized additive model. The generalized additive model comprising the coefficients is applied to at least one phoneme of the received text resulting in the generation of at least one phoneme having a duration. An acoustic sequence is generated comprising speech signals that are representative of the received text.
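
The training and prediction flow described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the record does not reproduce the claimed transformation, so `to_additive_domain` and `to_duration` use a hypothetical sinusoid-based pair with exponents `alpha` and `beta`, bounded by the minimum and maximum durations seen in training. The contextual factors (`accent`, `position`), the toy durations, and the backfitting fit in `fit_additive` are likewise illustrative choices.

```python
import math

def to_additive_domain(d, lo, hi, alpha=1.0, beta=1.0):
    """Map a duration in [lo, hi] to [0, 1] (hypothetical form, an assumption)."""
    u = (d - lo) / (hi - lo)
    return math.sin(0.5 * math.pi * u ** alpha) ** beta

def to_duration(z, lo, hi, alpha=1.0, beta=1.0):
    """Map an additive-domain score back to a duration in [lo, hi]."""
    z = min(1.0, max(0.0, z))  # clamp so the result stays between lo and hi
    u = (2.0 / math.pi * math.asin(z ** (1.0 / beta))) ** (1.0 / alpha)
    return lo + (hi - lo) * u

def fit_additive(rows, targets, n_iter=50):
    """Fit an intercept plus one coefficient per factor level by backfitting."""
    intercept = sum(targets) / len(targets)
    effects = {factor: {} for factor in rows[0]}
    for _ in range(n_iter):
        for factor in effects:
            sums, counts = {}, {}
            for row, t in zip(rows, targets):
                # Residual after removing the intercept and all other factors.
                others = sum(effects[g].get(row[g], 0.0)
                             for g in effects if g != factor)
                level = row[factor]
                sums[level] = sums.get(level, 0.0) + (t - intercept - others)
                counts[level] = counts.get(level, 0) + 1
            effects[factor] = {lv: sums[lv] / counts[lv] for lv in sums}
    return intercept, effects

def predict_duration(row, intercept, effects, lo, hi, alpha=1.0, beta=1.0):
    """Sum the coefficients for a phoneme's context, then map to a duration."""
    score = intercept + sum(effects[f].get(row[f], 0.0) for f in effects)
    return to_duration(score, lo, hi, alpha, beta)

# Toy training data (made up for illustration): context and duration in ms.
train_rows = [
    {"accent": "no", "position": "initial"},
    {"accent": "yes", "position": "initial"},
    {"accent": "no", "position": "final"},
    {"accent": "yes", "position": "final"},
]
train_durations = [62.0, 95.0, 80.0, 140.0]

# Minimum and maximum durations are taken from the training data.
lo, hi = min(train_durations), max(train_durations)
train_targets = [to_additive_domain(d, lo, hi) for d in train_durations]
intercept, effects = fit_additive(train_rows, train_targets)
print(predict_duration({"accent": "yes", "position": "final"},
                       intercept, effects, lo, hi))
```

Because the score is clamped before it is mapped back, every predicted duration in this sketch falls between the observed minimum and maximum, which is the practical effect of bounding the transformation by those two values.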

158 Citations

23 Claims

1. A method for producing synthetic speech comprising the steps of:
- receiving text into a processor;
  
  processing the text using a phoneme duration model, the phoneme duration model produced by developing a non-exponential functional transformation form for use with a generalized additive model; and
  
  generating speech signals representative of the received text.
- - 2. The method of claim 1, wherein the non-exponential functional transformation form comprises a root sinusoidal transformation, the root sinusoidal transformation controlled in response to a minimum phoneme duration and a maximum phoneme duration.
  - 3. The method of claim 1, wherein the step of processing the text using a phoneme duration model comprises the steps of:
    - specifying at least one of a plurality of contextual factors for use in a generalized additive model;
      
      applying an inverse of the non-exponential functional transformation to duration training data;
      
      generating coefficients for use in the generalized additive model;
      
      applying the generalized additive model to at least one phoneme of the received text; and
      
      generating at least one phoneme having a duration.
  - 4. The method of claim 3, wherein the plurality of contextual factors comprises an interaction between accent and the identity of a following phoneme, an interaction between accent and the identity of a preceding phoneme, an interaction between accent and a number of phonemes to the end of an utterance, a number of syllables to a nuclear accent of an utterance, a number of syllables to an end of an utterance, an interaction between syllable position and a position of a phoneme with respect to a left edge of the phoneme enclosing word, an onset of an enclosing syllable, and a coda of an enclosing syllable.
  - 5. The method of claim 1, wherein a phoneme duration model is used to process a plurality of phonemes.
  - 6. The method of claim 1, wherein the phoneme duration model is used in a formant method of speech generation.
  - 7. The method of claim 1, wherein the phoneme duration model is used in a concatenative method of speech generation.
  - 8. The method of claim 1, further comprising the step of processing the text using a phoneme pitch model.
  - 9. The method of claim 1, wherein the phoneme duration model is a sum of products model.
  - 10. The method of claim 1, wherein the non-exponential functional transformation is expressed by

11. An apparatus for speech synthesis comprising:
- an input for receiving text signals into a processor;
  
  a processor configured to synthesize an acoustic sequence using a phoneme duration model, the phoneme duration model produced by developing a non-exponential functional transformation form for use with a generalized additive model; and
  
  an output for providing speech signals representative of the received text.
- - 12. The apparatus of claim 11, wherein the non-exponential functional transformation form comprises a root sinusoidal transformation, the root sinusoidal transformation controlled in response to a minimum phoneme duration and a maximum phoneme duration.
  - 13. The apparatus of claim 11, wherein the processor is further configured to:
    - specify at least one of a plurality of contextual factors for use in a generalized additive model;
      
      apply an inverse of the non-exponential functional transformation to duration training data;
      
      generate coefficients for use in the generalized additive model;
      
      apply the generalized additive model to at least one phoneme of the received text; and
      
      generate at least one phoneme having a duration.
  - 14. The apparatus of claim 11, wherein the phoneme duration model is used in a formant method and a concatenative method of speech generation.
  - 15. The apparatus of claim 11, wherein the phoneme duration model is a sum of products model, and wherein the processor is further configured to synthesize the acoustic sequence using a phoneme pitch model.
  - 16. The apparatus of claim 11, wherein the non-exponential functional transformation is expressed by

17. A speech generation process comprising a phoneme duration model, the phoneme duration model produced by developing a non-exponential functional transformation form for use with a generalized additive model.
- - 18. The process of claim 17, wherein the non-exponential functional transformation is expressed by
  - 19. The process of claim 17, wherein the phoneme duration model is a sum of products model, the phoneme duration model used with a pitch model to generate speech signals representative of received text.

20. A computer readable medium containing executable instructions which, when executed in a processing system, causes the system to perform the steps for synthesizing speech comprising:
- receiving text into a processor;
  
  processing the text using a phoneme duration model, the phoneme duration model produced by developing a non-exponential functional transformation form for use with a generalized additive model; and
  
  generating speech signals representative of the received text.
- - 21. The computer readable medium of claim 20, wherein the system is further caused to perform the step comprising processing the text using a phoneme pitch model.
  - 22. The computer readable medium of claim 20, wherein the non-exponential functional transformation form comprises a root sinusoidal transformation expressed by

23. A method for generating a phoneme duration model for use in a speech synthesis system, the method comprising the step of developing a non-exponential functional transformation form for use with a generalized additive model, wherein the non-exponential functional transformation is expressed by

Current Assignee
Apple Inc.
Original Assignee
Apple Computer Incorporated (Apple Inc.)
Inventors
Bellegarda, Jerome R.; Silverman, Kim

Granted Patent

US 6,553,344 B2
US Class Current

704/266
CPC Class Codes

G10L 13/04   Details of speech synthesis...

G10L 13/08   Text analysis or generation...

G10L 13/10   Prosody rules derived from ...
