Method and apparatus for prosody for synthetic speech prosody determination
Abstract
In a synthetic speech system, the intonation of a natural utterance is automatically applied to a synthesized utterance. The present invention applies the desired intonation of the natural utterance to the synthesized utterance by aligning voicing sections of the natural utterance to those of the synthesized utterance. The voicing sections are initially delineated as voiced or unvoiced, based on default voicing specifications for the synthetic utterance and on pitch tracker analysis of the natural utterance, and an attempt is made to align the individual sections on that basis. If no initial alignment occurs, a further attempt is made by varying the default voicing specifications of the synthesized utterance. If alignment is still not achieved, each of the utterances, natural and synthetic, is treated as a single large voicing section, which forces alignment between them. Once alignment occurs, the intonation of the natural utterance is applied to the synthetic utterance, providing the synthetic utterance with the desired, more natural intonation. Further, the synthetic utterance with its intonation specification can be displayed graphically so that the user can view and interactively modify the intonation specification.
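The three-stage fallback described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Section` type, the `align` function, and the idea of passing a list of section indices whose voicing may be flipped (`optional_flips`) are all assumptions made for the example, and the sketch tries only one flip at a time.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Section:
    voiced: bool        # True for a voiced section, False for unvoiced
    duration_ms: float  # section length in milliseconds


def merge_adjacent(sections: List[Section]) -> List[Section]:
    """Merge neighbouring sections that share the same voicing type."""
    merged: List[Section] = []
    for s in sections:
        if merged and merged[-1].voiced == s.voiced:
            merged[-1] = Section(s.voiced, merged[-1].duration_ms + s.duration_ms)
        else:
            merged.append(Section(s.voiced, s.duration_ms))
    return merged


def pattern(sections: List[Section]) -> List[bool]:
    return [s.voiced for s in sections]


def align(natural: List[Section], synthetic: List[Section],
          optional_flips: List[int]) -> Tuple[str, List[Tuple[Section, Section]]]:
    """Return the stage that succeeded and the aligned (natural, synthetic) pairs."""
    nat = merge_adjacent(natural)

    # Stage 1: try the default voicing specification of the synthetic utterance.
    syn = merge_adjacent(synthetic)
    if pattern(nat) == pattern(syn):
        return "default", list(zip(nat, syn))

    # Stage 2: vary the default voicing (hypothetical: flip one section at a time).
    for i in optional_flips:
        varied = [Section((not s.voiced) if j == i else s.voiced, s.duration_ms)
                  for j, s in enumerate(synthetic)]
        syn = merge_adjacent(varied)
        if pattern(nat) == pattern(syn):
            return "varied", list(zip(nat, syn))

    # Stage 3: treat each utterance as one large section, which forces alignment.
    def whole(sections: List[Section]) -> Section:
        # The voicing flag of the merged section is arbitrary in this sketch.
        return Section(True, sum(s.duration_ms for s in sections))

    return "forced", [(whole(natural), whole(synthetic))]
```

In this sketch, alignment means that the alternating voiced/unvoiced patterns of the two utterances match section for section; a real system would also need the pitch tracker output and per-phoneme voicing rules that the abstract mentions.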
74 Citations
24 Claims
1. A method for specifying synthetic speech intonation, comprising the steps of:
(a) obtaining natural pitch and duration values for a natural voicing section of a natural utterance;
(b) obtaining synthetic pitch and duration values for a synthetic voicing section of a synthetic equivalent to the natural utterance;
(c) aligning the natural voicing section to the synthetic voicing section; and
(d) replacing the synthetic pitch and duration values of the synthetic voicing section with the natural pitch and duration values.
Dependent claims: 2, 3, 4, 5, 6, 7
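Steps (a) through (d) of claim 1 can be illustrated with a short sketch. The `VoicingSection` type, its `label` field, and the `apply_natural_prosody` function are hypothetical names introduced here; the claim does not prescribe any data layout, and the sketch assumes the sections have already been aligned one to one.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class VoicingSection:
    label: str                    # hypothetical identifier, e.g. the phonemes covered
    voiced: bool
    duration_ms: float
    pitch_hz: Tuple[float, ...]   # sampled pitch contour; empty when unvoiced


def apply_natural_prosody(natural: List[VoicingSection],
                          synthetic: List[VoicingSection]) -> List[VoicingSection]:
    """Replace synthetic pitch and duration with the aligned natural values."""
    # Step (c) is assumed done: the two lists must match section for section.
    if [s.voiced for s in natural] != [s.voiced for s in synthetic]:
        raise ValueError("voicing sections are not aligned")
    # Step (d): keep the synthetic section's identity, take the natural prosody.
    return [replace(syn, duration_ms=nat.duration_ms, pitch_hz=nat.pitch_hz)
            for nat, syn in zip(natural, synthetic)]
```

Each returned section keeps its synthetic label and voicing type but carries the natural duration and pitch contour, which is the substitution the claim describes.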
8. An apparatus for intonation specification comprising:
(a) means for obtaining natural pitch and duration values for a natural voicing section of a natural utterance;
(b) means for obtaining synthetic pitch and duration values for a synthetic voicing section of a synthetic equivalent to the natural utterance;
(c) means for aligning the natural voicing section to the synthetic voicing section; and
(d) means for substituting the natural pitch and duration values of the natural voicing section for the synthetic pitch and duration values.
Dependent claims: 9, 10, 11, 12, 13, 14
15. A method for intonation specification comprising the following steps:
a) obtaining natural voiced pitch and duration values for a natural voiced portion of a natural utterance;
b) obtaining natural unvoiced pitch and duration values for a natural unvoiced portion of the natural utterance;
c) obtaining synthetic voiced and unvoiced pitch and duration values for synthetic voiced and unvoiced portions of a synthetic equivalent to the natural utterance;
d) aligning the natural voiced and unvoiced portion to the synthetic voiced and unvoiced portions; and
e) substituting the natural voiced and unvoiced pitch and duration values for the synthetic voiced and unvoiced pitch and duration values.
Dependent claims: 16, 17, 18, 19, 20, 21
22. A method for intonation specification in a synthetic speech system comprising the following steps:
a) obtaining a set of pitch and duration values of one or more voicing sections of a natural utterance;
b) obtaining a set of pitch and duration values of one or more voicing sections of a synthetic equivalent to the natural utterance;
c) aligning the one or more voicing sections of the natural utterance to the one or more voicing sections of the synthetic equivalent to the natural utterance, including the steps of
i) varying voicing possibilities of the one or more voicing sections of the synthetic equivalent to the natural utterance until one or more alignments are reached between sequentially voiced and unvoiced types of the one or more voicing sections of the synthetic equivalent to the natural utterance and alternating voiced and unvoiced types of the one or more voicing sections of the natural utterance; and
ii) sequentially aligning alternating voiced and unvoiced types of the one or more voicing sections of the natural utterance to alternating voiced and unvoiced types of the one or more voicing sections of the synthetic equivalent to the natural utterance for the best reached alignment between sequentially voiced and unvoiced types of the one or more voicing sections of the synthetic equivalent to the natural utterance and alternating voiced and unvoiced types of the one or more voicing sections of the natural utterance, the best reached alignment being the alignment with the i) lowest accumulated error between the one or more voicing sections of the natural utterance and the one or more voicing sections of the synthetic equivalent to the natural utterance; ii) fewest voicing possibilities actually varied; and iii) fewest of the one or more voicing sections of the natural utterance which fell outside a predetermined duration range; and
d) substituting the pitch and duration values of the one or more voicing sections of the natural utterance for the pitch and duration values of the one or more voicing sections of the synthetic equivalent to the natural utterance.
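The selection of the "best reached alignment" in claim 22 names three criteria but does not say how they are combined. The sketch below assumes a lexicographic ordering (error first, then number of varied voicing possibilities, then out-of-range sections), assumes the accumulated error is the sum of absolute duration differences, and assumes the predetermined duration range is expressed as a ratio of natural to synthetic duration. All names (`Candidate`, `best_alignment`, `lo`, `hi`) are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Candidate:
    # (natural_ms, synthetic_ms) for each aligned pair of voicing sections
    pairs: List[Tuple[float, float]]
    # number of voicing possibilities actually varied to reach this alignment
    flips: int


def accumulated_error(c: Candidate) -> float:
    """Assumed error measure: summed absolute duration difference."""
    return sum(abs(nat - syn) for nat, syn in c.pairs)


def out_of_range(c: Candidate, lo: float, hi: float) -> int:
    """Count natural sections outside the assumed range [lo * syn, hi * syn]."""
    return sum(1 for nat, syn in c.pairs if not (lo * syn <= nat <= hi * syn))


def best_alignment(candidates: List[Candidate],
                   lo: float = 0.5, hi: float = 2.0) -> Candidate:
    """Pick the candidate that is smallest on the three criteria, in order."""
    return min(
        candidates,
        key=lambda c: (accumulated_error(c), c.flips, out_of_range(c, lo, hi)),
    )
```

A weighted sum of the three criteria would be an equally plausible reading of the claim; the tuple key is used here only because it keeps the ranking easy to inspect.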
23. An apparatus for intonation specification in a synthetic speech system comprising:
a) means for obtaining a set of pitch and duration values of one or more voicing sections of a natural utterance;
b) means for obtaining a set of pitch and duration values of one or more voicing sections of a synthetic equivalent to the natural utterance;
c) means for aligning the one or more voicing sections of the natural utterance to the one or more voicing sections of the synthetic equivalent to the natural utterance, the means for aligning including
i) means for varying voicing possibilities of the one or more voicing sections of the synthetic equivalent to the natural utterance until one or more alignments are reached between sequentially voiced and unvoiced types of the one or more voicing sections of the synthetic equivalent to the natural utterance and alternating voiced and unvoiced types of the one or more voicing sections of the natural utterance; and
ii) means for sequentially aligning alternating voiced and unvoiced types of the one or more voicing sections of the natural utterance to alternating voiced and unvoiced types of the one or more voicing sections of the synthetic equivalent to the natural utterance for the best reached alignment between sequentially voiced and unvoiced types of the one or more voicing sections of the synthetic equivalent to the natural utterance and alternating voiced and unvoiced types of the one or more voicing sections of the natural utterance, wherein the best reached alignment is the alignment with the i) lowest accumulated error between the one or more voicing sections of the natural utterance and the one or more voicing sections of the synthetic equivalent to the natural utterance; ii) fewest voicing possibilities actually varied; and iii) fewest of the one or more voicing sections of the natural utterance which fell outside a predetermined duration range; and
d) means for substituting the pitch and duration values of the one or more voicing sections of the natural utterance for the pitch and duration values of the one or more voicing sections of the synthetic equivalent to the natural utterance.
24. A method for intonation specification in a synthetic speech system comprising the following steps:
a) obtaining a set of pitch and duration values of one or more voiced portions of a natural utterance;
b) obtaining a set of pitch and duration values of one or more unvoiced portions of a natural utterance;
c) obtaining a set of pitch and duration values of one or more voiced and one or more unvoiced portions of a synthetic equivalent to the natural utterance;
d) aligning the one or more voiced portions of the natural utterance to the one or more voiced and unvoiced portions of the synthetic equivalent to the natural utterance, the step of aligning including
i) varying voicing possibilities of the one or more voicing sections of the synthetic equivalent to the natural utterance until one or more alignments are reached between sequentially voiced and unvoiced types of the one or more voicing sections of the synthetic equivalent to the natural utterance and alternating voiced and unvoiced types of the one or more voicing sections of the natural utterance; and
ii) sequentially aligning alternating voiced and unvoiced types of the one or more voicing sections of the natural utterance to alternating voiced and unvoiced types of the one or more voicing sections of the synthetic equivalent to the natural utterance for the best reached alignment between sequentially voiced and unvoiced types of the one or more voicing sections of the synthetic equivalent to the natural utterance and alternating voiced and unvoiced types of the one or more voicing sections of the natural utterance, the best reached alignment being the alignment with the i) lowest accumulated error between the one or more voicing sections of the natural utterance and the one or more voicing sections of the synthetic equivalent to the natural utterance; ii) fewest voicing possibilities actually varied; and iii) fewest of the one or more voicing sections of the natural utterance which fell outside a predetermined duration range; and
e) substituting the pitch and duration values of the one or more voiced portions of the natural utterance for the pitch and duration values of the one or more voiced and unvoiced portions of the synthetic equivalent to the natural utterance.
Specification