Prosody generation for text-to-speech synthesis based on micro-prosodic data
Abstract
A prosody modification system for use in text-to-speech includes an input receiving a sequence of prosodic data vectors Pn, measured at time Tn, which samples a sound waveform. A prosody data warping module directly derives new prosodic data vectors Qn from the original data vectors Pn using a function, which is controlled by warping parameters A0, . . . Ak, which avoids round-off errors in deriving quantized values, which has derivatives with respect to A0, . . . Ak, Pn, and Tn that are continuous, and which has sufficiently high complexity to model intentional prosody of the sound waveform, and sufficiently low complexity to avoid modeling micro-prosody of the sound waveform. The smoothness and simplicity of the function ensure that micro-prosodic perturbations and errors in measurement of Tn are transferred directly to the output Qn. The errors are thus reversed during re-synthesis and therefore eliminated, resulting in micro-prosodic perturbations being preserved during re-synthesis.
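The properties named in the abstract can be illustrated with a minimal sketch. The source does not state the functional form of the warp, so the additive low-order polynomial below, Q_n = P_n + A_0 + A_1*T_n + ... + A_k*T_n^k, is an assumption chosen only because it is smooth in every argument and too simple to follow frame-level jitter. The name warp_prosody and the sample values are hypothetical.

```python
from typing import List, Sequence


def warp_prosody(p: Sequence[float], t: Sequence[float], a: Sequence[float]) -> List[float]:
    """Hypothetical warp: Q_n = P_n + sum_k A_k * T_n**k.

    Assumed form, not taken from the patent text. It is a polynomial in T_n
    and linear in P_n and A_k, so all partial derivatives are continuous.
    Values stay in floating point; nothing is quantized here.
    """
    q = []
    for p_n, t_n in zip(p, t):
        # Evaluate the polynomial offset with Horner's rule.
        offset = 0.0
        for a_k in reversed(a):
            offset = offset * t_n + a_k
        # The offset depends only on T_n, so any perturbation in P_n
        # passes through to Q_n unchanged.
        q.append(p_n + offset)
    return q


if __name__ == "__main__":
    times = [0.0, 0.25, 0.5, 0.75, 1.0]              # normalized measurement times
    intended = [100.0, 102.0, 104.0, 106.0, 108.0]   # smooth pitch contour (Hz)
    jitter = [0.5, -0.25, 0.125, -0.5, 0.25]         # stand-in for micro-prosody
    measured = [s + j for s, j in zip(intended, jitter)]
    params = [10.0, -4.0, 2.0]                       # A0, A1, A2 (illustrative)
    print(warp_prosody(measured, times, params))
```

With a quadratic offset the overall contour changes shape, while the frame-to-frame jitter in the output is the same as in the input, which is the behavior the abstract attributes to a smooth, low-complexity function.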
42 Citations
50 Claims
1. A prosody modification system for use in text-to-speech, comprising:
an input receiving a sequence of prosodic data vectors Pn, measured at time Tn, which samples a sound waveform; and
a prosody data warping module directly deriving new prosodic data vectors Qn from the original data vectors Pn using a function, which is controlled by warping parameters A0, . . . Ak, which avoids round-off errors in deriving quantized values, which has derivatives with respect to A0, . . . Ak, Pn, and Tn that are continuous, and which has sufficiently high complexity to model intentional prosody of the sound waveform, and sufficiently low complexity to avoid modeling micro-prosody of the sound waveform, thereby ensuring that micro-prosodic perturbations and errors in measurement of Tn are transferred directly to the output Qn, causing the errors to be reversed during re-synthesis and therefore eliminated, and resulting in micro-prosodic perturbations being preserved during re-synthesis.
Dependent claims: 2, 3, 4, 5, 6.
7. A prosody generation system for use in text-to-speech synthesis, comprising:
an input receiving a sequence of original sound units {Uj}, which when concatenated yield a desired synthetic phrase or sentence;
a prosody data warping module which directly derives new prosodic data vectors {Qjn} from original prosodic data vectors {Pjn} sampled from an original sound unit Uj, and thus modifies perceived prosody of the sound unit;
a controlling module, which determines an amount of prosodic modification for sound units in the input sequence, and presents this information as warping parameters per sound unit, along with prosodic data of the sound units, to the prosody data warping module; and
a prosody concatenation module, which concatenates prosodic data of the prosodically modified sound units with adjacent sound units, performs a smoothing of prosodic attributes between adjacent sound units, and outputs a single and final sequence of prosodic data vectors, which are synchronized with the entire phrase or sentence.
Dependent claims: 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34.
35. A prosody modification method for use in text-to-speech, comprising:
receiving a sequence of prosodic data vectors Pn, measured at time Tn, which samples a sound waveform; and
directly deriving new prosodic data vectors Qn from the original data vectors Pn using a function, which is controlled by warping parameters A0, . . . Ak, which avoids round-off errors in deriving quantized values, which has derivatives with respect to A0, . . . Ak, Pn, and Tn that are continuous, and which has sufficiently high complexity to model intentional prosody of the sound waveform, and sufficiently low complexity to avoid modeling micro-prosody of the sound waveform, thereby ensuring that micro-prosodic perturbations and errors in measurement of Tn are transferred directly to the output Qn, causing the errors to be reversed during re-synthesis and therefore eliminated, and resulting in micro-prosodic perturbations being preserved during re-synthesis.
Dependent claims: 36, 37, 38, 39, 40.
41. A prosody generation method for use in text-to-speech synthesis, comprising:
receiving a sequence of original sound units {Uj}, which when concatenated yield a desired synthetic phrase or sentence;
directly deriving new prosodic data vectors {Qjn} from original prosodic data vectors {Pjn} sampled from an original sound unit Uj, thus modifying perceived prosody of the sound unit;
determining an amount of prosodic modification for sound units in the input sequence;
presenting the amount of prosodic modification as warping parameters per sound unit, along with prosodic data of the sound units;
concatenating prosodic data of the prosodically modified sound units with adjacent sound units;
performing a smoothing of prosodic attributes between adjacent sound units; and
outputting a single and final sequence of prosodic data vectors, which are synchronized with the entire phrase or sentence.
Dependent claims: 42, 43, 44, 45, 46, 47, 48, 49, 50.
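The steps of the generation method (warp each unit, concatenate, smooth across unit boundaries, output one sequence) can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: each unit's prosodic data is reduced to a list of pitch values, the per-unit warping parameters are a hypothetical (offset, slope) pair, and boundary smoothing is a simple linear ramp that splits the pitch gap between the two neighboring units. None of these choices comes from the source.

```python
from typing import List, Sequence, Tuple


def warp_unit(pitch: Sequence[float], offset: float, slope: float) -> List[float]:
    """Assumed per-unit warp: add offset plus a linear tilt over the unit."""
    n = len(pitch)
    return [p + offset + slope * (i / max(n - 1, 1)) for i, p in enumerate(pitch)]


def concatenate_and_smooth(units: Sequence[Sequence[float]], width: int = 2) -> List[float]:
    """Join units and ramp away the pitch gap at each boundary.

    Half of the gap is added to the tail of the left unit and half is
    subtracted from the head of the right unit, fading out over `width`
    frames. The corrections are smooth additive offsets, so frame-level
    detail inside each unit is left in place.
    """
    work = [list(u) for u in units]
    for left, right in zip(work, work[1:]):
        gap = right[0] - left[-1]
        wl, wr = min(width, len(left)), min(width, len(right))
        for i in range(wl):
            left[len(left) - 1 - i] += 0.5 * gap * (1 - i / wl)
        for i in range(wr):
            right[i] -= 0.5 * gap * (1 - i / wr)
    return [p for u in work for p in u]


def generate_prosody(
    units: Sequence[Sequence[float]],
    params: Sequence[Tuple[float, float]],
    width: int = 2,
) -> List[float]:
    """Warp every unit with its own parameters, then concatenate and smooth."""
    warped = [warp_unit(u, off, slo) for u, (off, slo) in zip(units, params)]
    return concatenate_and_smooth(warped, width)


if __name__ == "__main__":
    units = [[100.0, 101.0, 102.0], [110.0, 111.0, 112.0]]
    params = [(0.0, 0.0), (-4.0, 0.0)]  # hypothetical output of a controlling step
    print(generate_prosody(units, params))
```

In this sketch the controlling step is represented only by the params list; how the amount of modification per unit is actually determined is not described in the claim text shown here.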
Specification