Method and apparatus for improved duration modeling of phonemes
Abstract
A method and an apparatus for improved duration modeling of phonemes in a speech synthesis system are provided. According to one aspect, text is received into a processor of a speech synthesis system. The received text is processed using a sum-of-products phoneme duration model that is used in either the formant method or the concatenative method of speech generation. The phoneme duration model, which is used along with a phoneme pitch model, is produced by developing a non-exponential functional transformation form for use with a generalized additive model. The non-exponential functional transformation form comprises a root sinusoidal transformation that is controlled in response to a minimum phoneme duration and a maximum phoneme duration. The minimum and maximum phoneme durations are observed in training data. The received text is processed by specifying at least one of a number of contextual factors for the generalized additive model. An inverse of the non-exponential functional transformation is applied to duration observations, or training data. Coefficients are generated for use with the generalized additive model. The generalized additive model comprising the coefficients is applied to at least one phoneme of the received text resulting in the generation of at least one phoneme having a duration. An acoustic sequence is generated comprising speech signals that are representative of the received text.
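The training and synthesis steps summarized in the abstract can be illustrated with a minimal sketch. The exact root sinusoidal expression is not reproduced in this text, so the `transform` function below is a hypothetical stand-in (a bounded, sinusoid-based mapping onto [A, B] with shape parameters alpha and beta) and should not be read as the claimed formula. The backfitting routine, the factor names, and the toy durations are likewise assumptions made only for illustration.

```python
import math

def transform(x, A, B, alpha, beta):
    """Hypothetical stand-in transformation: maps x in [0, 1] to a duration in [A, B]."""
    x = min(max(x, 0.0), 1.0)  # clamp so the output never leaves [A, B]
    return A + (B - A) * math.sin(0.5 * math.pi * x ** alpha) ** beta

def inverse_transform(d, A, B, alpha, beta):
    """Inverse of the stand-in: maps a duration in [A, B] back to [0, 1]."""
    u = min(max((d - A) / (B - A), 0.0), 1.0)
    return (2.0 / math.pi * math.asin(u ** (1.0 / beta))) ** (1.0 / alpha)

def fit_additive(samples, n_iter=50):
    """Fit z ~ mu + sum_j coef_j[level_j] by backfitting over categorical factors."""
    n_factors = len(samples[0][0])
    mu = sum(z for _, z in samples) / len(samples)
    coefs = [dict() for _ in range(n_factors)]
    for _ in range(n_iter):
        for j in range(n_factors):
            sums, counts = {}, {}
            for factors, z in samples:
                # Prediction from every term except factor j.
                partial = mu + sum(coefs[k].get(factors[k], 0.0)
                                   for k in range(n_factors) if k != j)
                level = factors[j]
                sums[level] = sums.get(level, 0.0) + (z - partial)
                counts[level] = counts.get(level, 0) + 1
            # Each level's coefficient is the mean residual for that level.
            coefs[j] = {lv: sums[lv] / counts[lv] for lv in sums}
    return mu, coefs

def train(observations, alpha=1.0, beta=1.0):
    """observations: list of (contextual_factors_tuple, duration)."""
    durations = [d for _, d in observations]
    A, B = min(durations), max(durations)  # bounds observed in training data
    # Apply the inverse transformation to the duration observations.
    samples = [(f, inverse_transform(d, A, B, alpha, beta)) for f, d in observations]
    mu, coefs = fit_additive(samples)
    return {"A": A, "B": B, "alpha": alpha, "beta": beta, "mu": mu, "coefs": coefs}

def predict_duration(model, factors):
    """Apply the additive model, then the forward transformation."""
    z = model["mu"] + sum(c.get(f, 0.0) for c, f in zip(model["coefs"], factors))
    return transform(z, model["A"], model["B"], model["alpha"], model["beta"])

# Toy training data: (phoneme identity, stress) -> duration in milliseconds.
observations = [
    (("a", "stressed"), 120.0),
    (("a", "unstressed"), 80.0),
    (("t", "stressed"), 60.0),
    (("t", "unstressed"), 40.0),
]
model = train(observations, alpha=0.8, beta=1.5)
predicted = predict_duration(model, ("a", "stressed"))
```

Because the forward mapping is bounded, every predicted duration in this sketch falls between the minimum and maximum durations seen in the training data, whatever value the additive model produces.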
22 Claims
1. A method for producing synthetic speech comprising:
receiving text into a processor;
processing the text using a phoneme duration model, the phoneme duration model produced by developing a non-exponential functional transformation form for use with a generalized additive model, wherein the non-exponential functional transformation is expressed by ##EQU3## where x comprises one or more of a plurality of contextual factors influencing the duration of a phoneme, A is the minimum phoneme duration observed in training data, B is the maximum phoneme duration observed in training data, α controls the amount of shrinking and expansion on either side of a main inflection point, and β controls the position of the main inflection point; and
generating speech signals representative of the received text.

9. An apparatus for speech synthesis comprising:
an input for receiving text signals into a processor;
a processor configured to synthesize an acoustic sequence using a phoneme duration model, the phoneme duration model produced by developing a non-exponential functional transformation form for use with a generalized additive model, wherein the non-exponential functional transformation is expressed by ##EQU4## where x comprises one or more of a plurality of contextual factors influencing the duration of a phoneme, A is the minimum phoneme duration observed in training data, B is the maximum phoneme duration observed in training data, α controls the amount of shrinking and expansion on either side of a main inflection point, and β controls the position of the main inflection point; and
an output for providing speech signals representative of the received text.

13. A speech recognition process comprising:
generating a speech output in response to a phoneme duration model, the phoneme duration model produced by developing a non-exponential functional transformation form for use with a generalized additive model, wherein the non-exponential functional transformation is expressed by ##EQU5## where x comprises one or more of a plurality of contextual factors influencing the duration of a phoneme, A is the minimum phoneme duration observed in training data, B is the maximum phoneme duration observed in training data, α controls the amount of shrinking and expansion on either side of a main inflection point, and β controls the position of the main inflection point.

15. A computer readable medium containing executable instructions which, when executed in a processing system, causes the system to perform a method for synthesizing speech comprising:
receiving text into a processor;
processing the text using a phoneme duration model, the phoneme duration model produced by developing a non-exponential functional transformation form for use with a generalized additive model, wherein the non-exponential functional transformation form comprises a root sinusoidal transformation expressed by ##EQU6## where x comprises one or more of a plurality of contextual factors influencing the duration of a phoneme, A is the minimum phoneme duration observed in training data, B is the maximum phoneme duration observed in training data, α controls the amount of shrinking and expansion on either side of a main inflection point, and β controls the position of the main inflection point; and
generating speech signals representative of the received text.

17. A method for generating a phoneme duration model for use in a speech synthesis system, the method comprising:
developing a non-exponential functional transformation form for use with a generalized additive model, wherein the non-exponential functional transformation is expressed by ##EQU7## where x is the duration of a phoneme, A is the minimum phoneme duration observed in training data, B is the maximum phoneme duration observed in training data, α controls the amount of shrinking and expansion on either side of a main inflection point, and β controls the position of the main inflection point; and
generating a speech output in response to said developing said non-exponential functional transformation.

18. A speech synthesis system comprising:
a voice generation device for processing an acoustic phoneme sequence representative of a text; and
a duration modeling device coupled to said voice generation device for receiving phonemes from said voice generation device and providing phoneme durations using a phoneme duration model, wherein said phoneme duration model generates model coefficients by developing a non-exponential functional transformation comprising a root sinusoidal transformation that is controlled in response to a minimum phoneme duration and a maximum phoneme duration, wherein said root sinusoidal transformation is expressed by ##EQU8## where x comprises one or more of a plurality of contextual factors influencing the duration of a phoneme, A is the minimum phoneme duration observed in training data, B is the maximum phoneme duration observed in training data, α controls the amount of shrinking and expansion on either side of a main inflection point, and β controls the position of the main inflection point, and wherein said duration modeling device receives said model coefficients from said phoneme duration model and generates at least one phoneme having a duration using a generalized additive model for each phoneme of the received text.

22. A method for generating a phoneme duration in a speech synthesis system, said method comprising:
developing a non-exponential functional transformation;
applying an inverse of said non-exponential functional transformation to measured durations of observed training phonemes, wherein said non-exponential functional transformation comprises a root sinusoidal transformation that is controlled in response to a minimum phoneme duration and a maximum phoneme duration, wherein said root sinusoidal transformation is expressed by ##EQU9## where x comprises one or more of a plurality of contextual factors influencing the duration of a phoneme, A is the minimum phoneme duration observed in training data, B is the maximum phoneme duration observed in training data, α controls the amount of shrinking and expansion on either side of a main inflection point, and β controls the position of the main inflection point;
generating model coefficients for use in a generalized additive model;
receiving at least one phoneme representative of a text;
determining at least one of the plurality of contextual factors of said at least one phoneme for use in said generalized additive model;
applying said generalized additive model for at least one phoneme of said text; and
applying the non-exponential functional transformation for generating a phoneme having a duration.

Specification