Method and apparatus for improved duration modeling of phonemes
Abstract
A method and an apparatus for improved duration modeling of phonemes in a speech synthesis system are provided. According to one aspect, text is received into a processor of a speech synthesis system. The received text is processed using a sum-of-products phoneme duration model that is used in either the formant method or the concatenative method of speech generation. The phoneme duration model, which is used along with a phoneme pitch model, is produced by developing a non-exponential functional transformation form for use with a generalized additive model. The non-exponential functional transformation form comprises a root sinusoidal transformation that is controlled in response to a minimum phoneme duration and a maximum phoneme duration. The minimum and maximum phoneme durations are observed in training data. The received text is processed by specifying at least one of a number of contextual factors for the generalized additive model. An inverse of the non-exponential functional transformation is applied to duration observations, or training data. Coefficients are generated for use with the generalized additive model. The generalized additive model comprising the coefficients is applied to at least one phoneme of the received text resulting in the generation of at least one phoneme having a duration. An acoustic sequence is generated comprising speech signals that are representative of the received text.
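As a rough illustration of the transformation described in the abstract, the following Python sketch shows one S-shaped function built from a sine raised to a power and scaled between a minimum and a maximum duration, together with its inverse. The exact functional form, the parameters `alpha` and `beta`, the function names, and the millisecond bounds are all assumptions made for illustration; the abstract states only that the transformation is root sinusoidal and is controlled by the minimum and maximum phoneme durations observed in training data.

```python
import math

def root_sinusoidal(x, d_min, d_max, alpha=1.0, beta=2.0):
    """Map a model output x in [0, 1] to a duration in [d_min, d_max].

    Assumed form (not taken from the patent text):
        F(x) = d_min + (d_max - d_min) * sin(pi/2 * x**alpha) ** beta
    """
    x = min(max(x, 0.0), 1.0)  # keep x inside the valid domain
    return d_min + (d_max - d_min) * math.sin(0.5 * math.pi * x ** alpha) ** beta

def inverse_root_sinusoidal(d, d_min, d_max, alpha=1.0, beta=2.0):
    """Map an observed duration d back to the model domain [0, 1]."""
    u = (d - d_min) / (d_max - d_min)
    u = min(max(u, 0.0), 1.0)  # durations outside the bounds are clamped
    return (math.asin(u ** (1.0 / beta)) / (0.5 * math.pi)) ** (1.0 / alpha)

# Hypothetical bounds in milliseconds, standing in for the observed
# minimum and maximum durations in the training data.
D_MIN, D_MAX = 20.0, 300.0

x = inverse_root_sinusoidal(160.0, D_MIN, D_MAX)
print(x, root_sinusoidal(x, D_MIN, D_MAX))
```

Because the output is pinned to `d_min` at one end of the domain and `d_max` at the other, a function of this shape cannot produce durations outside the observed range, which is one plausible reading of "controlled in response to a minimum phoneme duration and a maximum phoneme duration".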
45 Claims
-
1. A method for producing synthetic speech comprising:
receiving text into a processor;
processing the text using a phoneme duration model, the phoneme duration model produced by developing a functional transformation form with an inflection point for use with a generalized additive model, wherein the generalized additive model is specifically designed to calculate phoneme durations for speech synthesis; and
generating speech signals representative of the received text.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
specifying at least one of a plurality of contextual factors for use in a generalized additive model;
applying an inverse of the functional transformation form to duration training data;
generating coefficients for use in the generalized additive model;
applying the generalized additive model to at least one phoneme of the received text; and
generating at least one phoneme having a duration.
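The five steps listed above can be sketched end to end in Python. This is a minimal sketch under stated assumptions, not the patented implementation: the squared-sine transformation, the indicator encoding of factor values, the use of ordinary least squares to generate coefficients, the two contextual factors, and the training durations are all hypothetical choices made for illustration.

```python
import numpy as np

def transform(x, d_min, d_max):
    """Assumed S-shaped form: squared sine scaled to [d_min, d_max]."""
    x = np.clip(x, 0.0, 1.0)
    return d_min + (d_max - d_min) * np.sin(0.5 * np.pi * x) ** 2

def inverse_transform(d, d_min, d_max):
    """Inverse of transform(): duration -> value in [0, 1]."""
    u = np.clip((d - d_min) / (d_max - d_min), 0.0, 1.0)
    return np.arcsin(np.sqrt(u)) / (0.5 * np.pi)

def encode(rows, levels):
    """Build one indicator column per (factor, value) pair."""
    X = np.zeros((len(rows), sum(len(lv) for lv in levels)))
    for r, row in enumerate(rows):
        offset = 0
        for value, lv in zip(row, levels):
            X[r, offset + lv.index(value)] = 1.0
            offset += len(lv)
    return X

# Step 1: specify contextual factors (hypothetical: accent, syllable position).
levels = [["accented", "unaccented"], ["onset", "coda"]]
train_rows = [("accented", "onset"), ("accented", "coda"),
              ("unaccented", "onset"), ("unaccented", "coda")]
train_ms = np.array([140.0, 180.0, 70.0, 95.0])  # made-up durations in ms
d_min, d_max = float(train_ms.min()), float(train_ms.max())

# Step 2: apply the inverse transformation to the duration training data.
targets = inverse_transform(train_ms, d_min, d_max)

# Step 3: generate coefficients (least squares is an assumption of this sketch).
X = encode(train_rows, levels)
coef, *_ = np.linalg.lstsq(X, targets, rcond=None)

# Steps 4 and 5: apply the additive model to phoneme contexts, then map the
# sum back through the transformation to obtain a duration for each phoneme.
pred = transform(encode(train_rows, levels) @ coef, d_min, d_max)
print(np.round(pred, 1))
```

Fitting in the inverse-transformed domain keeps the estimation problem linear in the coefficients, while the forward transformation guarantees that every predicted duration stays between the observed minimum and maximum.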
-
4. The method of claim 1, wherein the plurality of contextual factors comprises an interaction between accent and the identity of a following phoneme, an interaction between accent and the identity of a preceding phoneme, an interaction between accent and a number of phonemes to the end of an utterance, a number of syllables to a nuclear accent of an utterance, a number of syllables to an end of an utterance, an interaction between syllable position and a position of a phoneme with respect to a left edge of the phoneme enclosing word, an onset of an enclosing syllable, and a coda of an enclosing syllable.
-
5. The method of claim 1, wherein a phoneme duration model is used to process a plurality of phonemes.
-
6. The method of claim 1, wherein the phoneme duration model is used in a formant method of speech generation.
-
7. The method of claim 1, wherein the phoneme duration model is used in a concatenative method of speech generation.
-
8. The method of claim 1, further comprising processing the text using a phoneme pitch model.
-
9. The method of claim 1, wherein the phoneme duration model is a sum of products model.
-
10. An apparatus for speech synthesis comprising:
an input for receiving text signals into a processor;
a processor configured to synthesize an acoustic sequence using a phoneme duration model, the phoneme duration model produced by developing a functional transformation form with an inflection point for use with a generalized additive model, wherein the generalized additive model is specifically designed to calculate phoneme durations for speech synthesis; and
an output for providing speech signals representative of the received text.
Dependent claims: 11, 12, 13, 14
specify at least one of a plurality of contextual factors for use in a generalized additive model;
apply an inverse of the functional transformation form to duration training data;
generate coefficients for use in the generalized additive model;
apply the generalized additive model to at least one phoneme of the received text; and
generate at least one phoneme having a duration.
-
13. The apparatus of claim 10, wherein the phoneme duration model is used in a formant method and a concatenative method of speech generation.
-
14. The apparatus of claim 10, wherein the phoneme duration model is a sum of products model, and wherein the processor is further configured to synthesize the acoustic sequence using a phoneme pitch model.
-
15. A speech generation process comprising:
generating a speech output in response to a phoneme duration model, the phoneme duration model produced by developing a functional transformation form with an inflection point for use with a generalized additive model, wherein the generalized additive model is specifically designed to calculate phoneme durations for speech synthesis.
Dependent claims: 16
-
17. A computer readable medium containing executable instructions which, when executed in a processing system, cause the system to perform a method for synthesizing speech comprising:
receiving text into a processor;
processing the text using a phoneme duration model, the phoneme duration model produced by developing a functional transformation form with an inflection point for use with a generalized additive model, wherein the generalized additive model is specifically designed to calculate phoneme durations for speech synthesis; and
generating speech signals representative of the received text.
Dependent claims: 18
-
19. A speech synthesis system comprising:
a voice generation device for processing an acoustic phoneme sequence representative of a text; and
a duration modeling device coupled to the voice generation device for receiving phonemes from the voice generation device and providing phoneme durations using a phoneme duration model, wherein the phoneme duration model generates model coefficients by developing a functional transformation with an inflection point, wherein the duration modeling device receives the model coefficients from the phoneme duration model and generates at least one phoneme having a duration using a generalized additive model for each phoneme of the received text, and wherein the generalized additive model is specifically designed to calculate phoneme durations for synthesized speech.
Dependent claims: 20, 21, 22
a pitch modeling device coupled to the duration modeling device that receives at least one phoneme having a duration and, using pitch information, provides an acoustic sequence of synthesized speech signals representative of the text.
-
21. The speech synthesis system of claim 19, wherein the voice generation device processes the text input using a concatenative speech generation model.
-
22. The speech synthesis system of claim 19, wherein the voice generation device processes the text input using a formant synthesis speech generation model.
-
23. A method for generating a phoneme duration in a speech synthesis system, the method comprising:
developing a functional transformation with an inflection point;
applying an inverse of the functional transformation to measured durations of observed training phonemes;
generating model coefficients for use in a generalized additive model, wherein the generalized additive model is specifically designed to calculate phoneme durations for speech synthesis;
receiving at least one phoneme representative of a text;
determining at least one of a plurality of contextual factors of the at least one phoneme for use in the generalized additive model;
applying the generalized additive model for at least one phoneme of the text; and
applying the functional transformation for generating a phoneme having a duration.
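The claim requires only that the functional transformation have an inflection point; it does not give the function itself. As a worked example under an assumed squared-sine form on the normalized interval $0 \le x \le 1$ (an illustrative choice, not the claimed function), the inflection point can be located from the second derivative:

```latex
\begin{aligned}
F(x)   &= \sin^{2}\!\left(\tfrac{\pi}{2}x\right) = \tfrac{1}{2}\bigl(1 - \cos \pi x\bigr), \\
F'(x)  &= \tfrac{\pi}{2}\,\sin \pi x, \\
F''(x) &= \tfrac{\pi^{2}}{2}\,\cos \pi x .
\end{aligned}
```

Here $F''$ is positive for $x < 1/2$ and negative for $x > 1/2$, so the curve changes from convex to concave at $x = 1/2$. An exponential, by contrast, has a second derivative that never changes sign and therefore has no inflection point.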
-
24. A method for producing synthetic speech comprising:
receiving text into a processor;
processing the text using a phoneme duration model, the phoneme duration model produced by developing a functional transformation form with an inflection point for use with a generalized additive model, the generalized additive model expressed by
where $D$ is the duration of a phoneme, $f_i$ ($i = 1, \ldots, N$) represents the ith one of a plurality of contextual factors influencing $D$, $M_i$ is the number of values that $f_i$ can take, $\alpha_{i,j}$ is a factor scale corresponding to the jth value of factor $f_i$, denoted by $f_i(j)$, and $F$ is the functional transformation form; and
generating speech signals representative of the received text.
Dependent claims: 25, 26, 27, 28, 29, 30, 31
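The expression that should follow "expressed by" in claim 24 does not appear in this text. A form consistent with the symbol definitions given in the claim, offered here as an assumption rather than as the claim's actual wording, applies the transformation to a sum of factor scales, with $f_i(j)$ read as an indicator that factor $f_i$ takes its jth value:

```latex
D(f_1, f_2, \ldots, f_N) = F\!\left[\, \sum_{i=1}^{N} \sum_{j=1}^{M_i} \alpha_{i,j}\, f_i(j) \right]
```

Under this reading, applying the inverse transformation $F^{-1}$ to observed durations yields a quantity that is linear in the factor scales $\alpha_{i,j}$, which would explain why the inverse is applied to the training data before the coefficients are generated.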
specifying at least one of a plurality of contextual factors for use in a generalized additive model;
applying an inverse of the functional transformation form to duration training data;
generating coefficients for use in the generalized additive model;
applying the generalized additive model to at least one phoneme of the received text; and
generating at least one phoneme having a duration.
-
27. The method of claim 26, wherein the plurality of contextual factors comprises an interaction between accent and the identity of a following phoneme, an interaction between accent and the identity of a preceding phoneme, an interaction between accent and a number of phonemes to the end of an utterance, a number of syllables to a nuclear accent of an utterance, a number of syllables to an end of an utterance, an interaction between syllable position and a position of a phoneme with respect to a left edge of the phoneme enclosing word, an onset of an enclosing syllable, and a coda of an enclosing syllable.
-
28. The method of claim 24, wherein a phoneme duration model is used to process a plurality of phonemes.
-
29. The method of claim 24, wherein the phoneme duration model is used in a formant method of speech generation.
-
30. The method of claim 24, wherein the phoneme duration model is used in a concatenative method of speech generation.
-
31. The method of claim 24, further comprising processing the text using a phoneme pitch model.
-
32. An apparatus for speech synthesis comprising:
an input for receiving text signals into a processor;
a processor configured to synthesize an acoustic sequence using a phoneme duration model, the phoneme duration model produced by developing a functional transformation form with an inflection point for use with a generalized additive model, wherein the generalized additive model is expressed by
where $D$ is the duration of a phoneme, $f_i$ ($i = 1, \ldots, N$) represents the ith one of a plurality of contextual factors influencing $D$, $M_i$ is the number of values that $f_i$ can take, $\alpha_{i,j}$ is a factor scale corresponding to the jth value of factor $f_i$, denoted by $f_i(j)$, and $F$ is the functional transformation form; and
an output for providing speech signals representative of the received text.
Dependent claims: 33, 34, 35, 36
specify at least one of a plurality of contextual factors for use in a generalized additive model;
apply an inverse of the functional transformation form to duration training data;
generate coefficients for use in the generalized additive model;
apply the generalized additive model to at least one phoneme of the received text; and
generate at least one phoneme having a duration.
-
35. The apparatus of claim 32, wherein the phoneme duration model is used in a formant method and a concatenative method of speech generation.
-
36. The apparatus of claim 32, wherein the processor is further configured to synthesize the acoustic sequence using a phoneme pitch model.
-
37. A speech generation process comprising:
generating a speech output in response to a phoneme duration model, the phoneme duration model produced by developing a functional transformation form with an inflection point for use with a generalized additive model, wherein the generalized additive model is expressed by
where $D$ is the duration of a phoneme, $f_i$ ($i = 1, \ldots, N$) represents the ith one of a plurality of contextual factors influencing $D$, $M_i$ is the number of values that $f_i$ can take, $\alpha_{i,j}$ is a factor scale corresponding to the jth value of factor $f_i$, denoted by $f_i(j)$, and $F$ is the functional transformation form.
Dependent claims: 38
-
39. A computer readable medium containing executable instructions which, when executed in a processing system, cause the system to perform a method for synthesizing speech comprising:
receiving text into a processor;
processing the text using a phoneme duration model, the phoneme duration model produced by developing a functional transformation form with an inflection point for use with a generalized additive model, wherein the generalized additive model is expressed by
where $D$ is the duration of a phoneme, $f_i$ ($i = 1, \ldots, N$) represents the ith one of a plurality of contextual factors influencing $D$, $M_i$ is the number of values that $f_i$ can take, $\alpha_{i,j}$ is a factor scale corresponding to the jth value of factor $f_i$, denoted by $f_i(j)$, and $F$ is the functional transformation form; and
generating speech signals representative of the received text.
Dependent claims: 40
-
41. A speech synthesis system comprising:
a voice generation device for processing an acoustic phoneme sequence representative of a text; and
a duration modeling device coupled to the voice generation device for receiving phonemes from the voice generation device and providing phoneme durations using a phoneme duration model, wherein the phoneme duration model generates model coefficients by developing a functional transformation with an inflection point, wherein the duration modeling device receives the model coefficients from the phoneme duration model and generates at least one phoneme having a duration using a generalized additive model for each phoneme of the received text, and wherein the generalized additive model is expressed by
where $D$ is the duration of a phoneme, $f_i$ ($i = 1, \ldots, N$) represents the ith one of a plurality of contextual factors influencing $D$, $M_i$ is the number of values that $f_i$ can take, $\alpha_{i,j}$ is a factor scale corresponding to the jth value of factor $f_i$, denoted by $f_i(j)$, and $F$ is the functional transformation form.
Dependent claims: 42, 43, 44
a pitch modeling device coupled to the duration modeling device that receives at least one phoneme having a duration and, using pitch information, provides an acoustic sequence of synthesized speech signals representative of the text.
-
43. The speech synthesis system of claim 41, wherein the voice generation device processes the text input using a concatenative speech generation model.
-
44. The speech synthesis system of claim 41, wherein the voice generation device processes the text input using a formant synthesis speech generation model.
-
45. A method for generating a phoneme duration in a speech synthesis system, the method comprising:
developing a functional transformation with an inflection point;
applying an inverse of the functional transformation to measured durations of observed training phonemes;
generating model coefficients for use in a generalized additive model, wherein the generalized additive model is expressed by
where $D$ is the duration of a phoneme, $f_i$ ($i = 1, \ldots, N$) represents the ith one of a plurality of contextual factors influencing $D$, $M_i$ is the number of values that $f_i$ can take, $\alpha_{i,j}$ is a factor scale corresponding to the jth value of factor $f_i$, denoted by $f_i(j)$, and $F$ is the functional transformation form;
receiving at least one phoneme representative of a text;
determining at least one of a plurality of contextual factors of the at least one phoneme for use in the generalized additive model;
applying the generalized additive model for at least one phoneme of the text; and
applying the functional transformation for generating a phoneme having a duration.
Specification