Method for forming the excitation signal for a glottal pulse model based parametric speech synthesis system
First Claim
1. A method performed by a processing circuit for creating parametric models for use in training a speech synthesis system, wherein the system comprises at least a training text corpus, a speech database, and a model training module, the method comprising:
a. obtaining, by the model training module, speech data from the speech database wherein the speech data comprises recorded speech signals and corresponding portions of the training text corpus;
b. converting, by the model training module, the training text corpus into context dependent phone labels;
c. extracting, by the model training module, for each frame of speech in the speech signal from the speech data, at least one of: spectral features, a plurality of band excitation energy coefficients, and fundamental frequency values using the context dependent phone labels;
d. forming, by the model training module, a feature vector stream for each frame of speech in the speech signal from the speech data using the at least one of: the spectral features, the plurality of band excitation energy coefficients, and the fundamental frequency values;
e. labeling, by the model training module, each frame of speech in the speech signal with the context dependent phone labels;
f. extracting, by the model training module, durations of each of the context dependent phone labels from the labeled speech;
g. forming, by the model training module, context dependent Hidden Markov Models (HMMs) using the feature vector streams and the context dependent phone labels from the labeled speech;
h. performing, by a parameter generation module, parameter estimation of the speech signal, wherein the parameter estimation is performed comprising the feature vector streams, the HMMs, and decision trees;
i. identifying a plurality of sub-band Eigen glottal pulses from the speech signal, wherein the sub-band Eigen glottal pulses comprise separate models used to form excitation during synthesis; and
j. applying the identified plurality of sub-band Eigen glottal pulses from the speech signal to form an excitation signal, wherein the excitation signal is applied in the speech synthesis system to synthesize speech.
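Steps c and d of the claim can be illustrated with a minimal sketch. It assumes all three feature types are extracted for every frame (the claim requires only "at least one of" them) and that the per-frame vector is a plain concatenation. The function name `build_feature_stream`, the argument layout, and the sample values are hypothetical and do not come from the patent.

```python
from typing import List, Sequence


def build_feature_stream(
    spectral: Sequence[Sequence[float]],
    band_energies: Sequence[Sequence[float]],
    f0: Sequence[float],
) -> List[List[float]]:
    """Concatenate per-frame features into one feature vector per frame.

    Assumption: simple concatenation in the order
    [spectral..., band energy coefficients..., f0].
    """
    if not (len(spectral) == len(band_energies) == len(f0)):
        raise ValueError("all feature tracks must cover the same number of frames")
    stream = []
    for spec, bands, pitch in zip(spectral, band_energies, f0):
        stream.append(list(spec) + list(bands) + [pitch])
    return stream


# Two frames of made-up values, two spectral features and two bands each.
spectral = [[0.1, 0.2], [0.3, 0.4]]
band_energies = [[0.75, 0.25], [0.5, 0.5]]
# 0.0 marks an unvoiced frame here (a convention assumed for this sketch).
f0 = [120.0, 0.0]
feature_stream = build_feature_stream(spectral, band_energies, f0)
```

Each entry of `feature_stream` is the vector for one frame. In the claimed method, these vectors, together with the context dependent phone labels, are what the HMMs in step g are formed from.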
Abstract
A system and method are presented for forming the excitation signal for a glottal pulse model based parametric speech synthesis system. The excitation signal may be formed by using a plurality of sub-band templates instead of a single one. The plurality of sub-band templates may be combined to form the excitation signal wherein the proportion in which the templates are added is dynamically based on determined energy coefficients. These coefficients vary from frame to frame and are learned, along with the spectral parameters, during feature training. The coefficients are appended to the feature vector, which comprises spectral parameters and is modeled using HMMs, and the excitation signal is determined.
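The combination described in the abstract can be sketched as a weighted sum of sub-band templates for a single frame. This is an illustration under stated assumptions rather than the patented procedure: it assumes the templates have equal length and that each energy coefficient is used directly as the weight of its template. The abstract says only that the proportion is "dynamically based on" the coefficients. The function name `form_excitation_frame` and the sample templates are hypothetical.

```python
from typing import List, Sequence


def form_excitation_frame(
    templates: Sequence[Sequence[float]],
    energy_coeffs: Sequence[float],
) -> List[float]:
    """Weighted sum of equal-length sub-band pulse templates for one frame.

    Assumption: each energy coefficient scales its template directly.
    """
    if not templates:
        raise ValueError("at least one sub-band template is required")
    if len(templates) != len(energy_coeffs):
        raise ValueError("need exactly one energy coefficient per template")
    length = len(templates[0])
    if any(len(t) != length for t in templates):
        raise ValueError("all templates must have the same length")
    frame = [0.0] * length
    for template, weight in zip(templates, energy_coeffs):
        for i, sample in enumerate(template):
            frame[i] += weight * sample
    return frame


# Made-up low-band and high-band templates, four samples each.
low_band = [0.0, 1.0, 0.0, -1.0]
high_band = [1.0, -1.0, 1.0, -1.0]
# The coefficients would change from frame to frame; here the low band dominates.
excitation = form_excitation_frame([low_band, high_band], [0.75, 0.25])
# excitation == [0.25, 0.5, 0.25, -1.0]
```

Because the coefficients vary from frame to frame, calling the function again with different weights yields a different mix of the same templates, which is the advantage the abstract claims over a single fixed template.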
33 Citations
12 Claims
Specification