Method and apparatus for improved duration modeling of phonemes
Abstract
A method and an apparatus for improved duration modeling of phonemes in a speech synthesis system are provided. According to one aspect, text is received into a processor of a speech synthesis system. The received text is processed using a sum-of-products phoneme duration model that is used in either the formant method or the concatenative method of speech generation. The phoneme duration model, which is used along with a phoneme pitch model, is produced by developing a non-exponential functional transformation form for use with a generalized additive model. The non-exponential functional transformation form comprises a root sinusoidal transformation that is controlled in response to a minimum phoneme duration and a maximum phoneme duration. The minimum and maximum phoneme durations are observed in training data. The received text is processed by specifying at least one of a number of contextual factors for the generalized additive model. An inverse of the non-exponential functional transformation is applied to duration observations, or training data. Coefficients are generated for use with the generalized additive model. The generalized additive model comprising the coefficients is applied to at least one phoneme of the received text resulting in the generation of at least one phoneme having a duration. An acoustic sequence is generated comprising speech signals that are representative of the received text.
41 Claims
-
1. A method comprising:
-
identifying a non-exponential functional transformation that defines a shape containing an inflection point, wherein the functional transformation comprises a root sinusoidal transformation; and
incorporating the functional transformation into a generalized additive model for modeling phoneme durations.
-
2. [Equation not reproduced in the source] wherein x is a duration for a phoneme, A is a minimum duration for the phoneme, B is a maximum duration for the phoneme, α controls a slope of the shape at the inflection point, and β controls a location on the shape of the inflection point.
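The expression for the root sinusoidal transformation is not shown in this listing, so the following Python sketch assumes one plausible form built only from the symbols the claim defines: sin((π/2)·((x − A)/(B − A))^α)^β. The function names, the millisecond values in the usage note, and the form itself are illustrative assumptions, not the claimed expression. The listing also does not settle which direction the claims call the transformation and which the inverse, so the names here are neutral.

```python
import math


def duration_to_unit(x, A, B, alpha, beta):
    """Map a duration x in [A, B] to [0, 1] with an ASSUMED root sinusoidal form."""
    if not A <= x <= B:
        raise ValueError("duration must lie within [A, B]")
    u = (x - A) / (B - A)  # normalized position of x inside the observed range
    return math.sin(0.5 * math.pi * u ** alpha) ** beta


def unit_to_duration(y, A, B, alpha, beta):
    """Invert duration_to_unit: recover the duration from a value in [0, 1]."""
    if not 0.0 <= y <= 1.0:
        raise ValueError("y must lie within [0, 1]")
    u = (2.0 / math.pi * math.asin(y ** (1.0 / beta))) ** (1.0 / alpha)
    return A + (B - A) * u
```

With hypothetical bounds A = 40 and B = 200, the minimum duration maps to 0, the maximum to 1, and `unit_to_duration(duration_to_unit(120.0, 40.0, 200.0, 0.7, 1.5), 40.0, 200.0, 0.7, 1.5)` returns approximately 120.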
-
-
3. The method of claim 1 further comprising:
determining control parameters for the functional transformation by applying an inverse of the functional transformation to phoneme durations in training data, the control parameters defining a location on the shape for the inflection point and a slope of the shape at the inflection point.
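One way to picture this step is a grid search over candidate control parameters. The sketch below is an assumption-laden illustration, not the claimed procedure: it assumes the root sinusoidal form sin((π/2)·u^α)^β, a single contextual factor, and a squared-error selection criterion measured in the duration domain. All function names and the data layout are hypothetical.

```python
import math
from collections import defaultdict


def to_unit(x, A, B, alpha, beta):
    """ASSUMED root sinusoidal map from a duration in [A, B] to [0, 1]."""
    u = (x - A) / (B - A)
    return math.sin(0.5 * math.pi * u ** alpha) ** beta


def to_duration(y, A, B, alpha, beta):
    """Map a value in [0, 1] back to a duration in [A, B]."""
    y = min(1.0, max(0.0, y))  # guard against rounding just outside [0, 1]
    u = (2.0 / math.pi * math.asin(y ** (1.0 / beta))) ** (1.0 / alpha)
    return A + (B - A) * u


def fit_control_parameters(samples, alphas, betas):
    """samples: list of (context_label, duration). Returns (alpha, beta, sse)."""
    durations = [d for _, d in samples]
    A, B = min(durations), max(durations)  # range observed in the training data
    best = None
    for alpha in alphas:
        for beta in betas:
            groups = defaultdict(list)
            for label, d in samples:
                groups[label].append(to_unit(d, A, B, alpha, beta))
            # One-factor additive fit: the least-squares scale per label is the group mean.
            scale = {label: sum(v) / len(v) for label, v in groups.items()}
            sse = sum((to_duration(scale[label], A, B, alpha, beta) - d) ** 2
                      for label, d in samples)
            if best is None or sse < best[2]:
                best = (alpha, beta, sse)
    return best
```

A real system would fit several contextual factors jointly; the group-mean shortcut only holds for the single-factor case assumed here.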
-
4. The method of claim 3 further comprising:
measuring a duration range for each phoneme in the training data.
-
5. The method of claim 3 further comprising:
measuring a duration range for a plurality of phonemes in the training data.
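The two range measurements in claims 4 and 5 can be sketched as follows, assuming the training data is available as a list of (phoneme, duration) pairs. That layout, the function names, and the sample values are hypothetical.

```python
from collections import defaultdict


def duration_ranges(observations):
    """Return {phoneme: (min_duration, max_duration)} for each phoneme seen."""
    values = defaultdict(list)
    for phoneme, duration in observations:
        values[phoneme].append(duration)
    return {p: (min(v), max(v)) for p, v in values.items()}


def pooled_range(observations, phonemes):
    """Return one (min, max) pair over all observations of the given phonemes."""
    pooled = [d for p, d in observations if p in phonemes]
    return min(pooled), max(pooled)


data = [("AA", 90.0), ("AA", 140.0), ("T", 35.0), ("T", 60.0), ("S", 80.0)]
per_phoneme = duration_ranges(data)            # one range per phoneme
stops_and_vowels = pooled_range(data, {"AA", "T"})  # one range for a group
```

A per-phoneme range gives each phoneme its own A and B, while a pooled range shares one pair of bounds across a group of phonemes.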
-
6. The method of claim 1, wherein the shape contains a plurality of inflection points.
-
7. The method of claim 1 further comprising:
selecting a contextual factor that influences phoneme durations.
-
8. The method of claim 7, wherein selecting a contextual factor comprises:
choosing at least one from the group consisting of an interaction between accent and an identity of a following phoneme, an interaction between accent and an identity of a preceding phoneme, an interaction between accent and a number of phonemes to the end of an utterance, a number of syllables to a nuclear accent of an utterance, a number of syllables to an end of an utterance, an interaction between syllable position and a position of a phoneme with respect to a left edge of the phoneme enclosing word, an onset of an enclosing syllable, and a coda of an enclosing syllable.
-
9. The method of claim 1 further comprising:
generating a duration for a phoneme using the generalized additive model.
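As a rough illustration of generating a duration, the sketch below sums one scale per active contextual factor value and maps the result back to a duration between A and B. It assumes a purely additive score, the root sinusoidal form sin((π/2)·u^α)^β, and clamping of the score to [0, 1]. These choices, the function name, and the factor labels are assumptions made for the example only.

```python
import math


def generate_duration(context, scales, A, B, alpha, beta):
    """context: {factor: value}; scales: {(factor, value): scale}."""
    # Additive score: one (hypothetical) scale per active factor value.
    total = sum(scales[(factor, value)] for factor, value in context.items())
    y = min(1.0, max(0.0, total))  # keep the score inside the invertible range
    # Map the score back to a duration with the assumed root sinusoidal form.
    u = (2.0 / math.pi * math.asin(y ** (1.0 / beta))) ** (1.0 / alpha)
    return A + (B - A) * u


scales = {("accent", "yes"): 0.3, ("accent", "no"): 0.0,
          ("position", "final"): 0.2, ("position", "medial"): 0.0}
long_d = generate_duration({"accent": "yes", "position": "final"},
                           scales, 40.0, 200.0, 1.0, 1.0)
short_d = generate_duration({"accent": "no", "position": "medial"},
                            scales, 40.0, 200.0, 1.0, 1.0)
```

Because the mapping is bounded, every generated duration stays within the observed minimum and maximum regardless of the factor scales.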
-
10. A computer-readable medium having executable instructions to cause a processor to perform a method comprising:
-
identifying a non-exponential functional transformation that defines a shape containing an inflection point, wherein the functional transformation comprises a root sinusoidal transformation; and
incorporating the functional transformation into a generalized additive model for modeling phoneme durations.
-
11. [Equation not reproduced in the source] wherein x is a duration for a phoneme, A is a minimum duration for the phoneme, B is a maximum duration for the phoneme, α controls a slope of the shape at the inflection point, and β controls a location on the shape of the inflection point.
-
-
12. The computer-readable medium of claim 10, wherein the method further comprises:
determining control parameters for the functional transformation by applying an inverse of the functional transformation to phoneme durations in training data, the control parameters defining a location on the shape for the inflection point and a slope of the shape at the inflection point.
-
13. The computer-readable medium of claim 12, wherein the method further comprises:
measuring a duration range for each phoneme in the training data.
-
14. The computer-readable medium of claim 12, wherein the method further comprises:
measuring a duration range for a plurality of phonemes in the training data.
-
15. The computer-readable medium of claim 10, wherein the shape contains a plurality of inflection points.
-
16. The computer-readable medium of claim 10, wherein the method further comprises:
selecting a contextual factor that influences phoneme durations.
-
17. The computer-readable medium of claim 16, wherein selecting a contextual factor comprises:
choosing at least one from the group consisting of an interaction between accent and an identity of a following phoneme, an interaction between accent and an identity of a preceding phoneme, an interaction between accent and a number of phonemes to the end of an utterance, a number of syllables to a nuclear accent of an utterance, a number of syllables to an end of an utterance, an interaction between syllable position and a position of a phoneme with respect to a left edge of the phoneme enclosing word, an onset of an enclosing syllable, and a coda of an enclosing syllable.
-
18. The computer-readable medium of claim 10, wherein the method further comprises:
generating a duration for a phoneme using the generalized additive model.
-
19. A system comprising:
-
a processor coupled to a memory through a bus; and
a process executed from the memory by the processor to cause the processor to identify a non-exponential functional transformation that defines a shape containing an inflection point, and incorporate the functional transformation into a generalized additive model for modeling phoneme durations, wherein the functional transformation comprises a root sinusoidal transformation.
-
20. [Equation not reproduced in the source] wherein x is a duration for a phoneme, A is a minimum duration for the phoneme, B is a maximum duration for the phoneme, α controls a slope of the shape at the inflection point, and β controls a location on the shape of the inflection point.
-
-
21. The system of claim 19, wherein the process further causes the processor to determine control parameters for the functional transformation by applying an inverse of the functional transformation to phoneme durations in training data, the control parameters defining a location on the shape for the inflection point and a slope of the shape at the inflection point.
-
22. The system of claim 21, wherein the process further causes the processor to measure a duration range for each phoneme in the training data.
-
23. The system of claim 21, wherein the process further causes the processor to measure a duration range for a plurality of phonemes in the training data.
-
24. The system of claim 19, wherein the shape contains a plurality of inflection points.
-
25. The system of claim 19, wherein the process further causes the processor to select a contextual factor that influences phoneme durations.
-
26. The system of claim 25, wherein the process further causes the processor, when selecting a contextual factor, to choose at least one from the group consisting of an interaction between accent and an identity of a following phoneme, an interaction between accent and an identity of a preceding phoneme, an interaction between accent and a number of phonemes to the end of an utterance, a number of syllables to a nuclear accent of an utterance, a number of syllables to an end of an utterance, an interaction between syllable position and a position of a phoneme with respect to a left edge of the phoneme enclosing word, an onset of an enclosing syllable, and a coda of an enclosing syllable.
-
27. The system of claim 19, wherein the process further causes the processor to generate a duration for a phoneme using the generalized additive model.
-
28. An apparatus comprising:
-
means for identifying a non-exponential functional transformation that defines a shape containing an inflection point, wherein the functional transformation comprises a root sinusoidal transformation; and
means for incorporating the functional transformation into a generalized additive model for modeling phoneme durations.
-
29. [Equation not reproduced in the source] wherein x is a duration for a phoneme, A is a minimum duration for the phoneme, B is a maximum duration for the phoneme, α controls a slope of the shape at the inflection point, and β controls a location on the shape of the inflection point.
-
-
30. The apparatus of claim 28 further comprising:
means for determining control parameters for the functional transformation by applying an inverse of the functional transformation to phoneme durations in training data, the control parameters defining a location on the shape for the inflection point and a slope of the shape at the inflection point.
-
31. The apparatus of claim 30 further comprising:
means for measuring a duration range for each phoneme in the training data.
-
32. The apparatus of claim 30 further comprising means for measuring a duration range for a plurality of phonemes in the training data.
-
33. The apparatus of claim 28, wherein the shape contains a plurality of inflection points.
-
34. The apparatus of claim 28 further comprising:
means for selecting a contextual factor that influences phoneme durations.
-
35. The apparatus of claim 34, wherein the means for selecting a contextual factor chooses at least one from the group consisting of an interaction between accent and an identity of a following phoneme, an interaction between accent and an identity of a preceding phoneme, an interaction between accent and a number of phonemes to the end of an utterance, a number of syllables to a nuclear accent of an utterance, a number of syllables to an end of an utterance, an interaction between syllable position and a position of a phoneme with respect to a left edge of the phoneme enclosing word, an onset of an enclosing syllable, and a coda of an enclosing syllable.
-
36. The apparatus of claim 28 further comprising:
means for generating a duration for a phoneme using the generalized additive model.
-
37. An apparatus comprising:
-
means for receiving text signals;
means for synthesizing an acoustic sequence from the text signals using a phoneme duration model, the phoneme duration model produced by incorporating a functional transformation form with an inflection point into a generalized additive model that calculates phoneme durations, wherein the functional transformation form comprises a root sinusoidal transformation, the root sinusoidal transformation controlled in response to a minimum phoneme duration and a maximum phoneme duration; and
means for providing speech signals representative of the received text.
means for applying an inverse of the functional transformation form to duration training data to generate coefficients for use in the generalized additive model;
means for specifying at least one of a plurality of contextual factors for use in the generalized additive model; and
means for applying the generalized additive model to at least one phoneme of the received text to generate at least one duration.
-
-
39. The apparatus of claim 37, wherein the phoneme duration model is used in a formant method and a concatenative method of speech generation.
-
40. The apparatus of claim 37, wherein the phoneme duration model is a sum of products model, and wherein the means for synthesizing further comprises means for modeling phoneme pitch.
-
41. The apparatus of claim 37, wherein the generalized additive model is expressed by
$$D(f_1, f_2, \ldots, f_N) = F\left[\sum_{i=1}^{N} \prod_{j=1}^{M_i} a_{i,j}\, f_i(j)\right],$$
where D is the duration of a phoneme, $f_i$ ($i = 1, \ldots, N$) represents the ith one of a plurality of contextual factors influencing D, $M_i$ is the number of values that $f_i$ can take, $a_{i,j}$ is a factor scale corresponding to the jth value of factor $f_i$, denoted by $f_i(j)$, and F is the functional transformation form.
-
Specification