METHOD FOR FORMING THE EXCITATION SIGNAL FOR A GLOTTAL PULSE MODEL BASED PARAMETRIC SPEECH SYNTHESIS SYSTEM
Abstract
A system and method are presented for forming the excitation signal for a glottal pulse model based parametric speech synthesis system. The excitation signal may be formed by using a plurality of sub-band templates instead of a single one. The plurality of sub-band templates may be combined to form the excitation signal wherein the proportion in which the templates are added is dynamically based on determined energy coefficients. These coefficients vary from frame to frame and are learned, along with the spectral parameters, during feature training. The coefficients are appended to the feature vector, which comprises spectral parameters and is modeled using HMMs, and the excitation signal is determined.
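The combination step described in the abstract can be illustrated with a minimal sketch. It assumes the sub-band templates for a frame are equal-length arrays and that "proportion" means a plain linear weighting by the frame's energy coefficients; the function name `form_excitation_frame` and the array layout are hypothetical and do not come from the patent.

```python
import numpy as np


def form_excitation_frame(subband_templates, energy_coeffs):
    """Combine sub-band pulse templates into one excitation pulse for a frame.

    subband_templates: array of shape (num_bands, pulse_len), one template per band.
    energy_coeffs:     array of shape (num_bands,), the frame's band energy coefficients.

    Assumption: the templates are added as a linear weighted sum. The abstract
    only says the proportions are based on the energy coefficients.
    """
    templates = np.asarray(subband_templates, dtype=float)
    coeffs = np.asarray(energy_coeffs, dtype=float)
    if templates.ndim != 2 or coeffs.shape != (templates.shape[0],):
        raise ValueError("expected one energy coefficient per sub-band template")
    # Weighted sum over the band axis gives a single pulse of length pulse_len.
    return coeffs @ templates
```

Because the coefficients vary from frame to frame, a function like this would be called once per frame with that frame's coefficients, while the templates themselves stay fixed.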
19 Claims
1. A method for creating parametric models for use in training a speech synthesis system, wherein the system comprises at least a training text corpus, a speech database, and a model training module, the method comprising:
a. obtaining, by the model training module, speech data for the training text corpus, wherein the speech data comprises recorded speech signals and corresponding transcriptions;
b. converting, by the model training module, the training text corpus into context dependent phone labels;
c. extracting, by the model training module, for each frame of speech in the speech signal from the speech training database, at least one of: spectral features, a plurality of band excitation energy coefficients, and fundamental frequency values;
d. forming, by the model training module, a feature vector stream for each frame of speech using the at least one of: spectral features, a plurality of band excitation energy coefficients, and fundamental frequency values;
e. labeling speech with context dependent phones;
f. extracting durations of each context dependent phone from the labelled speech;
g. performing parameter estimation of the speech signal, wherein the parameter estimation is performed comprising the features, HMM, and decision trees; and
h. identifying a plurality of sub-band Eigen glottal pulses, wherein the sub-band Eigen glottal pulses comprise separate models used to form excitation during synthesis.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
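Steps c and d of claim 1 can be sketched as a per-frame concatenation. The sketch below assumes all three feature types are present (the claim requires only at least one of them) and that each has already been extracted with one row per frame; `build_feature_stream` and its argument names are hypothetical.

```python
import numpy as np


def build_feature_stream(spectral, band_energies, f0):
    """Stack per-frame features into one feature vector per frame.

    spectral:      (num_frames, num_spectral) spectral features
    band_energies: (num_frames, num_bands) band excitation energy coefficients
    f0:            (num_frames,) fundamental frequency values

    Assumption: the feature vector is a simple concatenation in this order.
    """
    spectral = np.asarray(spectral, dtype=float)
    band_energies = np.asarray(band_energies, dtype=float)
    f0 = np.asarray(f0, dtype=float).reshape(-1, 1)
    if not (len(spectral) == len(band_energies) == len(f0)):
        raise ValueError("all feature streams must cover the same frames")
    # One row per frame: [spectral | band energies | f0]
    return np.hstack([spectral, band_energies, f0])
```

The resulting rows are what an HMM training stage, as in step g, would consume; that stage is outside the scope of this sketch.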
13. A method for identification of sub-band Eigen pulses from a glottal pulse database for training a speech synthesis system, wherein the method comprises:
a. receiving pulses from the glottal pulse database;
b. decomposing each pulse into a plurality of sub-band components;
c. dividing the sub-band components into a plurality of databases based on the decomposing;
d. determining a vector representation of each database;
e. determining Eigen pulse values, from the vector representation, for each database; and
f. selecting a best Eigen pulse for each database for use in synthesis.
Dependent claims: 14, 15, 16, 17, 18, 19
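The steps of claim 13 can be illustrated with a rough sketch under several assumptions that the claim itself does not state: pulses are already normalized to equal length, the sub-band decomposition is done by masking FFT bins into equal-width bands, the Eigen pulses are obtained by principal component analysis of each mean-centered sub-band database, and the "best" Eigen pulse is the first principal component. All function names are hypothetical.

```python
import numpy as np


def split_into_subbands(pulse, num_bands):
    """Step b (assumed filter bank): split a pulse by masking FFT bins."""
    pulse = np.asarray(pulse, dtype=float)
    spectrum = np.fft.rfft(pulse)
    edges = np.linspace(0, len(spectrum), num_bands + 1).astype(int)
    components = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        masked = np.zeros_like(spectrum)
        masked[lo:hi] = spectrum[lo:hi]  # keep only this band's bins
        components.append(np.fft.irfft(masked, n=len(pulse)))
    return np.array(components)  # shape (num_bands, pulse_len)


def subband_eigen_pulses(pulses, num_bands):
    """Steps a to f: return one Eigen pulse per sub-band database."""
    pulses = np.asarray(pulses, dtype=float)  # step a: (num_pulses, pulse_len)
    # Steps b and c: decompose every pulse, then group components by band.
    decomposed = np.array([split_into_subbands(p, num_bands) for p in pulses])
    best = []
    for band in range(num_bands):
        database = decomposed[:, band, :]  # step d: one row vector per pulse
        centered = database - database.mean(axis=0)
        # Step e: rows of vt are the principal directions (Eigen pulses).
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        # Step f (assumed criterion): take the direction with the largest
        # singular value as the best Eigen pulse for this band.
        best.append(vt[0])
    return np.array(best)  # shape (num_bands, pulse_len)
```

With this particular decomposition the sub-band components of a pulse sum back to the original pulse, which makes the split easy to sanity-check; a real system might use a different filter bank and a different selection criterion.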
Specification