Method and apparatus for improved duration modeling of phonemes

US 6,064,960 A
Filed: 12/18/1997
Issued: 05/16/2000
Est. Priority Date: 12/18/1997
Status: Expired due to Term

Abstract

A method and an apparatus for improved duration modeling of phonemes in a speech synthesis system are provided. According to one aspect, text is received into a processor of a speech synthesis system. The received text is processed using a sum-of-products phoneme duration model that is used in either the formant method or the concatenative method of speech generation. The phoneme duration model, which is used along with a phoneme pitch model, is produced by developing a non-exponential functional transformation form for use with a generalized additive model. The non-exponential functional transformation form comprises a root sinusoidal transformation that is controlled in response to a minimum phoneme duration and a maximum phoneme duration. The minimum and maximum phoneme durations are observed in training data. The received text is processed by specifying at least one of a number of contextual factors for the generalized additive model. An inverse of the non-exponential functional transformation is applied to duration observations, or training data. Coefficients are generated for use with the generalized additive model. The generalized additive model comprising the coefficients is applied to at least one phoneme of the received text resulting in the generation of at least one phoneme having a duration. An acoustic sequence is generated comprising speech signals that are representative of the received text.
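The abstract describes a root sinusoidal transformation bounded by the minimum and maximum training durations, but the expression itself is elided in this text (the `##EQU…##` placeholders). The following is a minimal sketch under an assumed form, `sin((π/2)·u^α)^β` with `u = (d − A)/(B − A)`, chosen only because it is sinusoidal, uses a power ("root") of the normalized duration, and is invertible on `[A, B]`. The function names, the numeric values, and the formula itself are illustrative assumptions, not the patent's equation. Which direction the claims would call "the transformation" and which "the inverse" is not settled by this sketch.

```python
import math

def duration_to_model_domain(d, A, B, alpha, beta):
    """Map a duration d in [A, B] to [0, 1] (ASSUMED form, see lead-in)."""
    u = (d - A) / (B - A)           # normalise the duration to [0, 1]
    u = min(1.0, max(0.0, u))       # clamp durations outside [A, B]
    return math.sin(0.5 * math.pi * u ** alpha) ** beta

def model_domain_to_duration(y, A, B, alpha, beta):
    """Inverse of the mapping above: take y in [0, 1] back to [A, B]."""
    y = min(1.0, max(0.0, y))       # clamp model outputs outside [0, 1]
    u = (2.0 / math.pi * math.asin(y ** (1.0 / beta))) ** (1.0 / alpha)
    return A + (B - A) * u

if __name__ == "__main__":
    A, B = 20.0, 300.0              # hypothetical min/max durations (ms)
    alpha, beta = 0.7, 0.6          # hypothetical shape parameters
    for d in (20.0, 80.0, 150.0, 300.0):
        y = duration_to_model_domain(d, A, B, alpha, beta)
        back = model_domain_to_duration(y, A, B, alpha, beta)
        print(f"{d:6.1f} ms -> {y:.4f} -> {back:6.1f} ms")
```

Because both mappings are monotone for positive `alpha` and `beta`, a value mapped back from the model domain always lies between `A` and `B`, which is the practical appeal of bounding the transformation by the observed extremes.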

280 Citations

22 Claims

1. A method for producing synthetic speech comprising:
- receiving text into a processor;
  
  processing the text using a phoneme duration model, the phoneme duration model produced by developing a non-exponential functional transformation form for use with a generalized additive model, wherein the non-exponential functional transformation is expressed by ##EQU3## where x comprises one or more of a plurality of contextual factors influencing the duration of a phoneme, A is the minimum phoneme duration observed in training data, B is the maximum phoneme duration observed in training data, α controls the amount of shrinking and expansion on either side of a main inflection point, and β controls the position of the main inflection point; and
  
  generating speech signals representative of the received text.
- Dependent claims (2, 3, 4, 5, 6, 7, 8):
  - 2. The method of claim 1, wherein processing the text using a phoneme duration model comprises:
    - specifying at least one of the plurality of contextual factors for use in the generalized additive model;
      
      applying an inverse of the non-exponential functional transformation to duration training data;
      
      generating coefficients for use in the generalized additive model;
      
      applying the generalized additive model to at least one phoneme of the received text; and
      
      generating at least one phoneme having a duration.
  - 3. The method of claim 2, wherein the plurality of contextual factors comprises an interaction between accent and the identity of a following phoneme, an interaction between accent and the identity of a preceding phoneme, an interaction between accent and a number of phonemes to the end of an utterance, a number of syllables to a nuclear accent of an utterance, a number of syllables to an end of an utterance, an interaction between syllable position and a position of a phoneme with respect to a left edge of the phoneme enclosing word, an onset of an enclosing syllable, and a coda of an enclosing syllable.
  - 4. The method of claim 1, wherein the phoneme duration model is used to process a plurality of phonemes.
  - 5. The method of claim 1, wherein the phoneme duration model is used in a formant method of speech generation.
  - 6. The method of claim 1, wherein the phoneme duration model is used in a concatenative method of speech generation.
  - 7. The method of claim 1, further comprising processing the text using a phoneme pitch model.
  - 8. The method of claim 1, wherein the phoneme duration model is a sum of products model.
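Claim 2 lists "generating coefficients for use in the generalized additive model" without saying how the coefficients are obtained. One common way to fit an additive model over categorical contextual factors is backfitting, sketched below as a stand-in rather than as the patented procedure. The function name `fit_additive`, the two factor names, their coarse levels, and the toy targets are all hypothetical, and the targets are assumed to be durations that have already been mapped into the model domain.

```python
from collections import defaultdict

def fit_additive(rows, targets, n_iter=20):
    """Fit y ~ mu + sum_j coef[j][level_j] by backfitting.

    rows    : list of dicts mapping factor name -> categorical level
    targets : list of floats, assumed already in the model domain
    """
    factors = sorted(rows[0])
    mu = sum(targets) / len(targets)
    coef = {f: defaultdict(float) for f in factors}

    def predict(row):
        # Unseen levels contribute 0.0 through the defaultdict.
        return mu + sum(coef[f][row[f]] for f in factors)

    for _ in range(n_iter):
        for f in factors:
            # Partial residuals: remove every term except factor f's own,
            # then set each level's coefficient to the mean residual.
            sums, counts = defaultdict(float), defaultdict(int)
            for row, y in zip(rows, targets):
                r = y - (predict(row) - coef[f][row[f]])
                sums[row[f]] += r
                counts[row[f]] += 1
            for level in sums:
                coef[f][level] = sums[level] / counts[level]
    return mu, coef, predict

if __name__ == "__main__":
    # Toy, made-up observations for illustration only.
    rows = [
        {"accent": "yes", "syllables_to_end": "0"},
        {"accent": "yes", "syllables_to_end": "1+"},
        {"accent": "no", "syllables_to_end": "0"},
        {"accent": "no", "syllables_to_end": "1+"},
    ]
    targets = [0.55, 0.65, 0.35, 0.45]
    mu, coef, predict = fit_additive(rows, targets)
    for row, y in zip(rows, targets):
        print(row, round(predict(row), 4), y)
```

The interaction factors named in claim 3 (for example, accent combined with the identity of the following phoneme) could be encoded in this scheme as a single factor whose levels are pairs, at the cost of more levels to estimate.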

9. An apparatus for speech synthesis comprising:
- an input for receiving text signals into a processor;
  
  a processor configured to synthesize an acoustic sequence using a phoneme duration model, the phoneme duration model produced by developing a non-exponential functional transformation form for use with a generalized additive model, wherein the non-exponential functional transformation is expressed by ##EQU4## where x comprises one or more of a plurality of contextual factors influencing the duration of a phoneme, A is the minimum phoneme duration observed in training data, B is the maximum phoneme duration observed in training data, α controls the amount of shrinking and expansion on either side of a main inflection point, and β controls the position of the main inflection point; and
  
  an output for providing speech signals representative of the received text.
- Dependent claims (10, 11, 12):
  - 10. The apparatus of claim 9, wherein the processor is further configured to:
    - specify at least one of the plurality of contextual factors for use in the generalized additive model;
      
      apply an inverse of the non-exponential functional transformation to duration training data;
      
      generate coefficients for use in the generalized additive model;
      
      apply the generalized additive model to at least one phoneme of the received text; and
      
      generate at least one phoneme having a duration.
  - 11. The apparatus of claim 9, wherein the phoneme duration model is used in a formant method and a concatenative method of speech generation.
  - 12. The apparatus of claim 9, wherein the phoneme duration model is a sum of products model, and wherein the processor is further configured to synthesize the acoustic sequence using a phoneme pitch model.

13. A speech recognition process comprising:
- generating a speech output in response to a phoneme duration model, the phoneme duration model produced by developing a non-exponential functional transformation form for use with a generalized additive model, wherein the non-exponential functional transformation is expressed by ##EQU5## where x comprises one or more of a plurality of contextual factors influencing the duration of a phoneme, A is the minimum phoneme duration observed in training data, B is the maximum phoneme duration observed in training data, α controls the amount of shrinking and expansion on either side of a main inflection point, and β controls the position of the main inflection point.
- Dependent claims (14):
  - 14. The process of claim 13, wherein the phoneme duration model is a sum of products model, the phoneme duration model used with a pitch model to generate speech signals representative of received text.

15. A computer readable medium containing executable instructions which, when executed in a processing system, causes the system to perform a method for synthesizing speech comprising:
- receiving text into a processor;
  
  processing the text using a phoneme duration model, the phoneme duration model produced by developing a non-exponential functional transformation form for use with a generalized additive model, wherein the non-exponential functional transformation form comprises a root sinusoidal transformation expressed by ##EQU6## where x comprises one or more of a plurality of contextual factors influencing the duration of a phoneme, A is the minimum phoneme duration observed in training data, B is the maximum phoneme duration observed in training data, α controls the amount of shrinking and expansion on either side of a main inflection point, and β controls the position of the main inflection point; and
  
  generating speech signals representative of the received text.
- Dependent claims (16):
  - 16. The computer readable medium of claim 15, wherein the system is further caused to perform processing the text using a phoneme pitch model.

17. A method for generating a phoneme duration model for use in a speech synthesis system, the method comprising:
- developing a non-exponential functional transformation form for use with a generalized additive model, wherein the non-exponential functional transformation is expressed by ##EQU7## where x is the duration of a phoneme, A is the minimum phoneme duration observed in training data, B is the maximum phoneme duration observed in training data, α controls the amount of shrinking and expansion on either side of a main inflection point, and β controls the position of the main inflection point; and
  
  generating a speech output in response to said developing said non-exponential functional transformation.

18. A speech synthesis system comprising:
- a voice generation device for processing an acoustic phoneme sequence representative of a text; and
  
  a duration modeling device coupled to said voice generation device for receiving phonemes from said voice generation device and providing phoneme durations using a phoneme duration model, wherein said phoneme duration model generates model coefficients by developing a non-exponential functional transformation comprising a root sinusoidal transformation that is controlled in response to a minimum phoneme duration and a maximum phoneme duration, wherein said root sinusoidal transformation is expressed by ##EQU8## where x comprises one or more of a plurality of contextual factors influencing the duration of a phoneme, A is the minimum phoneme duration observed in training data, B is the maximum phoneme duration observed in training data, α controls the amount of shrinking and expansion on either side of a main inflection point, and β controls the position of the main inflection point, and wherein said duration modeling device receives said model coefficients from said phoneme duration model and generates at least one phoneme having a duration using a generalized additive model for each phoneme of the received text.
- Dependent claims (19, 20, 21):
  - 19. The speech synthesis of claim 18 further comprising:
    - a pitch modeling device coupled to the duration modeling device that receives at least one phoneme having a duration and, using pitch information, provides an acoustic sequence of synthesized speech signals representative of said text.
  - 20. The speech synthesis of claim 18, wherein said voice generation device processes the text input using a concatenative speech generation model.
  - 21. The speech synthesis of claim 18, wherein said voice generation device processes the text input using a formant synthesis speech generation model.

22. A method for generating a phoneme duration in a speech synthesis system, said method comprising:
- developing a non-exponential functional transformation;
  
  applying an inverse of said non-exponential functional transformation to measured durations of observed training phonemes, wherein said non-exponential functional transformation comprises a root sinusoidal transformation that is controlled in response to a minimum phoneme duration and a maximum phoneme duration, wherein said root sinusoidal transformation is expressed by ##EQU9## where x comprises one or more of a plurality of contextual factors influencing the duration of a phoneme, A is the minimum phoneme duration observed in training data, B is the maximum phoneme duration observed in training data, α controls the amount of shrinking and expansion on either side of a main inflection point, and β controls the position of the main inflection point;
  
  generating model coefficients for use in a generalized additive model;
  
  receiving at least one phoneme representative of a text;
  
  determining at least one of the plurality of contextual factors of said at least one phoneme for use in said generalized additive model;
  
  applying said generalized additive model for at least one phoneme of said text; and
  
  applying the non-exponential functional transformation for generating a phoneme having a duration.
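Read in order, the steps of claim 22 amount to fitting the additive model in a transformed domain and mapping its output back to a duration. The notation below is introduced here for illustration and is not the patent's: F stands for the non-exponential functional transformation, d_n for the measured duration of the n-th training phoneme, f_j(n) for its j-th contextual factor, and a_j for the coefficients attached to factor j.

```latex
\begin{align*}
y_n &= F^{-1}(d_n)
  && \text{(inverse applied to measured training durations)} \\
y_n &\approx \sum_{j} a_j\bigl(f_j(n)\bigr)
  && \text{(coefficients of the generalized additive model)} \\
\hat{d} &= F\Bigl(\sum_{j} a_j(f_j)\Bigr)
  && \text{(duration generated for a phoneme of the text)}
\end{align*}
```

If F maps into the interval between A and B, as the reference to minimum and maximum training durations suggests, every generated duration stays within the range observed in training.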


Current Assignee: Apple Inc.
Original Assignee: Apple Computer Incorporated (Apple Inc.)
Inventors: Bellegarda, Jerome R.; Silverman, Kim
Primary Examiner(s): Zele, Krista
Assistant Examiner(s): Opsasnick, Michael N.
Application Number: US08/993,940
Time in Patent Office: 880 Days
Field of Search: 704/211, 704/260, 704/266, 704/267, 704/269
US Class Current: 704/260
CPC Class Codes:
- G10L 13/04   Details of speech synthesis...
- G10L 13/08   Text analysis or generation...
- G10L 13/10   Prosody rules derived from ...
