Clustered patterns for text-to-speech synthesis
First Claim
1. An apparatus for generating clustered patterns for text-to-speech synthesis, comprising:
- representative pattern memory configured to store a plurality of initial representative patterns, each initial representative pattern being a noise pattern, an attribute being differently affixed to each initial representative pattern, the attribute including at least accent type;
pitch pattern memory configured to store a large number of natural pitch patterns for learning, each natural pitch pattern being an accent phrase in a sentence and including the attribute of the accent phrase;
clustering unit configured to classify each natural pitch pattern to the initial representative pattern, the natural pitch patterns of same attribute being classified to one initial representative pattern of the same attribute;
transformation parameter generation unit configured to respectively generate a transformation parameter for each natural pitch pattern by evaluating an error between a transformed representative pattern and each natural pitch pattern classified to the initial representative pattern from which the transformed representative pattern is generated;
representative pattern generation unit configured to update each initial representative pattern by calculating an evaluation function of the sum of the error between the transformed representative pattern and each natural pitch pattern classified to the initial representative pattern; and
wherein said representative pattern memory stores each updated representative pattern as a clustered pattern of the attribute affixed to the initial representative pattern from which the updated representative pattern is generated.
1 Assignment
0 Petitions
Accused Products
Abstract
A representative pattern memory stores a plurality of initial representative patterns as a noise pattern. Different attribute is affixed to each initial representative pattern. A pitch pattern memory stores a large number of natural pitch patterns as an accent phrase. A clustering unit classifies each natural pitch pattern to the initial representative pattern based on the attribute of the accent phrase. A transformation parameter generation unit calculates an error between a transformed representative pattern and each natural pitch pattern classified to the initial representative pattern. A representative pattern generation unit calculates an evaluation function of the sum of the error between the transformed-representative pattern and each natural pitch pattern classified to the initial representative pattern, and updates each initial representative pattern. The representative pattern memory stores each updated representative pattern as a clustered pattern of the attribute affixed to the corresponding initial representative pattern.
35 Citations
26 Claims
-
1. An apparatus for generating clustered patterns for text-to-speech synthesis, comprising:
-
representative pattern memory configured to store a plurality of initial representative patterns, each initial representative pattern being a noise pattern, an attribute being differently affixed to each initial representative pattern, the attribute including at least accent type;
pitch pattern memory configured to store a large number of natural pitch patterns for learning, each natural pitch pattern being an accent phrase in a sentence and including the attribute of the accent phrase;
clustering unit configured to classify each natural pitch pattern to the initial representative pattern, the natural pitch patterns of same attribute being classified to one initial representative pattern of the same attribute;
transformation parameter generation unit configured to respectively generate a transformation parameter for each natural pitch pattern by evaluating an error between a transformed representative pattern and each natural pitch pattern classified to the initial representative pattern from which the transformed representative pattern is generated;
representative pattern generation unit configured to update each initial representative pattern by calculating an evaluation function of the sum of the error between the transformed representative pattern and each natural pitch pattern classified to the initial representative pattern; and
wherein said representative pattern memory stores each updated representative pattern as a clustered pattern of the attribute affixed to the initial representative pattern from which the updated representative pattern is generated. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
wherein the natural pitch pattern represents a time change of fundamental frequency. -
3. The apparatus according to claim 2,
wherein the transformation parameter represents one of a change of duration along a time axis, and a shift of frequency along a frequency axis. -
4. The apparatus according to claim 1,
wherein the attribute of the accent phrase includes accent type, number of mora, part of speech, and phoneme. -
5. The apparatus according to claim 1,
wherein said representative pattern memory stores a plurality of clustered patterns each corresponding to a different attribute affixed to each initial representative pattern. -
6. The apparatus according to claim 1,
wherein said transformation parameter generation unit repeats generation of the transformation parameter, and said representative pattern generation unit repeats update of the representative pattern, until the evaluation function satisfies a predetermined condition. -
7. The apparatus according to claim 6,
wherein said representative pattern memory stores the updated representative pattern, when the evaluation function satisfies the predetermined condition. -
8. The apparatus according to claim 7, further comprising:
a transformation parameter generation rule memory being configured to store the transformation parameter and the attribute of the natural pitch pattern of which the error is evaluated, when the evaluation function satisfies the predetermined condition.
-
9. The apparatus according to claim 6,
wherein said transformation parameter generation unit generates the transformation parameters for all combinations of each natural pitch pattern and each initial representative pattern. -
10. The apparatus according to claim 9, further comprising:
-
an error evaluation unit being configured to respectively calculate an error between each natural pitch pattern and each transformed representative pattern; and
wherein said clustering unit classifies each natural pitch pattern to one initial representative pattern of which the error between the natural pitch pattern and the one initial representative pattern is the smallest among errors between the natural pitch pattern and all transformed representative patterns.
-
-
11. The apparatus according to claim 10, whenever said transformation parameter generation unit generates the transformation parameters for all combinations of each natural pitch pattern and each updated representative pattern, until the evaluation function satisfies the predetermined condition,
wherein said error evaluation unit repeats calculation of the error, and said clustering unit repeats classification of each natural pitch pattern. -
12. The apparatus according to claim 11, further comprising:
a representative pattern selection rule memory being configured to correspondingly store the attribute of the natural pitch patterns classified to each updated representative pattern and an address of the updated representative pattern in said representative pattern memory, when the evaluation function satisfies the predetermined condition.
-
-
13. A method for generating clustered patterns for text-to-speech synthesis, comprising the steps of:
-
storing the plurality of initial representative patterns, each initial representative pattern being a noise pattern, an attribute being differently affixed to each initial representative pattern, the attribute including at least accent type;
storing a large number of natural pitch patterns for learning, each natural pitch pattern being an accent phrase in a sentence and including the attribute of the accent phrase;
classifying each natural pitch pattern to the initial representative pattern, the natural pitch patterns of same attribute being classified to one initial representative pattern of the same attribute;
respectively generating a transformation parameter for each natural pitch pattern by evaluating an error between a transformed representative pattern and each natural pitch pattern classified to the initial representative pattern from which the transformed representative pattern is generated;
updating each initial representative pattern by calculating an evaluation function of the sum of the error between the transformed representative pattern and each natural pitch pattern classified to the initial representative pattern; and
storing each updated representative pattern as a clustered pattern of the attribute affixed to the initial representative pattern from which the updated representative pattern is generated. - View Dependent Claims (14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24)
wherein the natural pitch pattern represents a time change of fundamental frequency. -
15. The method according to claim 14, wherein the transformation parameter represents one of a change of duration along a time axis, and a shift of frequency along a frequency axis.
-
16. The method of according to claim 13, wherein the attribute of the accent phrase includes accent type, number of mora, part of speech, and phoneme.
-
17. The method according to claim 13, further comprising the step of:
storing a plurality of the clustered patterns each corresponding to a different attribute affixed to each initial representative pattern.
-
18. The method according to claim 13, further comprising the steps of:
repeating generation of the transformation parameter and update of the representative pattern, until the evaluation function satisfies a predetermined condition.
-
19. The method according to claim 18, further comprising the step of:
storing the updated representative pattern, when the evaluation function satisfies the predetermined condition.
-
20. The method according to claim 19, further comprising the step of:
storing the transformation parameter and the attribute of the natural pitch pattern of which the error is evaluated, when the evaluation function satisfies the predetermined condition.
-
21. The method according to claim 18, further comprising the step of:
generating the transformation parameters for all combinations of each natural pitch pattern and each initial representative pattern.
-
22. The method according to claim 21, further comprising the steps of:
-
respectively calculating an error between each natural pitch pattern and each transformed representative pattern; and
classifying each natural pitch pattern to one initial representative pattern of which the error between the natural pitch pattern and the one initial representative pattern is the smallest among errors between the natural pitch pattern and all transformed representative patterns.
-
-
23. The method according to claim 22, further comprising the step of:
-
whenever the transformation parameters for all combinations of each natural pitch pattern and each updated representative pattern are generated, until the evaluation function satisfies the predetermined condition;
repeating calculation of the error and classification of each natural pitch pattern.
-
-
24. The method according to claim 23, further comprising the step of:
correspondingly storing the attribute of the natural pitch patterns classified to each updated representative pattern and an address of the updated representative pattern, when the evaluation function satisfies the predetermined condition.
-
-
25. A computer readable memory containing computer readable instructions to generate clustered patterns for text-to-speech synthesis, comprising:
-
instruction means for causing a computer to store a plurality of initial representative patterns, each initial representative pattern being a noise pattern, an attribute being differently affixed to each initial representative pattern, the attribute including at least accent type;
instruction means for causing a computer to store a large number of natural pitch patterns for learning, each natural pitch pattern being an accent phrase in a sentence and including the attribute of the accent phrase;
instruction means for causing a computer to classify each natural pitch pattern to the initial representative pattern, the natural pitch patterns of same attribute being classified to one initial representative pattern of the same attribute;
instruction means for causing a computer to respectively generate a transformation parameter for each natural pitch pattern by evaluating an error between a transformed representative pattern and each natural pitch pattern classified to the initial representative pattern from which the transformed representative pattern is generated;
instruction means for causing a computer to update each initial representative pattern by calculating an evaluation function of the sum of the error between the transformed representative pattern and each natural pitch pattern classified to the initial representative pattern; and
instruction means for causing a computer to store each updated representative pattern as a clustered pattern of the attribute affixed to the initial representative pattern from which the updated representative pattern is generated.
-
-
26. A learning apparatus for generating a representative pattern as a typical pitch pattern used for text-to-speech synthesis, comprising:
-
representative pattern memory means for storing a plurality of representative patterns and attribute data corresponding to each representative pattern, the representative pattern being variously transformed as a pitch pattern of a prosody unit by a transformation parameter, the attribute data being characteristic of the prosody unit to affect the pitch pattern;
clustering means for classifying each of a plurality of prosody units in a text for learning to one of the plurality of representative patterns in said representative pattern memory means according to attribute data of each prosody unit;
extraction means for extracting a natural pitch pattern corresponding to each prosody unit classified to the representative pattern from a plurality of natural pitch patterns corresponding to the text;
transformation parameter generation means for generating the transformation parameter for evaluating an error between the natural pitch pattern and a transformed representative pattern for each prosody unit classified to the representative pattern; and
representative pattern generation means for recursively generating the representative pattern by calculating an evaluation function of the sum of the error between the natural pitch pattern and the transformed representative pattern for all prosody units classified to the representative pattern.
-
Specification