Method for forming the excitation signal for a glottal pulse model based parametric speech synthesis system
First Claim
1. A method performed by a generic computer processor for creating a glottal pulse database from a speech signal, in a speech synthesis system, wherein the system comprises at least a speech database, the method comprising the steps of:
a. pre-emphasizing the speech signal to obtain a pre-filtered signal;
b. analyzing the pre-filtered signal, using linear prediction, to obtain inverse filtering parameters;
c. performing inverse filtering of the speech signal using the inverse filtering parameters;
d. determining an integrated linear prediction residual signal using the inversely filtered speech signal;
e. identifying glottal segment boundaries in the speech signal;
f. segmenting the integrated linear prediction residual signal into glottal pulses using the identified glottal segment boundaries from the speech signal;
g. normalizing the glottal pulses;
h. forming the glottal pulse database by collecting all normalized glottal pulses obtained for the speech signal; and
i. applying the formed glottal pulse database to form an excitation signal, wherein the excitation signal is applied in the speech synthesis system to synthesize speech.
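Steps (a) through (h) of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. The pre-emphasis coefficient, the LP order, the autocorrelation method for linear prediction, and unit-energy normalization are all assumptions chosen for illustration, and the glottal segment boundaries of step (e) are assumed to be supplied by the caller as sample indices. The function names are hypothetical.

```python
import numpy as np

def pre_emphasize(x, alpha=0.97):
    # Step (a): y[n] = x[n] - alpha * x[n-1]; alpha = 0.97 is an assumed value.
    x = np.asarray(x, dtype=float)
    return np.append(x[0], x[1:] - alpha * x[:-1])

def lpc(x, order):
    # Step (b): autocorrelation-method linear prediction (an assumed choice).
    x = np.asarray(x, dtype=float)
    r = np.correlate(x, x, mode="full")[len(x) - 1 : len(x) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    R += 1e-9 * r[0] * np.eye(order)  # small ridge for numerical stability
    a = np.linalg.solve(R, r[1 : order + 1])
    # Coefficients of the inverse filter A(z) = 1 - sum_k a_k z^-k.
    return np.concatenate(([1.0], -a))

def integrated_lp_residual(speech, order=16):
    # Steps (c), (d): the filter is estimated on the pre-emphasized signal
    # but applied to the original speech signal.
    speech = np.asarray(speech, dtype=float)
    inverse_filter = lpc(pre_emphasize(speech), order)
    return np.convolve(speech, inverse_filter)[: len(speech)]

def build_pulse_database(speech, boundaries, order=16):
    # Steps (f) to (h): cut the residual at the given boundaries, normalize,
    # and collect the pulses. Unit-energy normalization is an assumption.
    ilpr = integrated_lp_residual(speech, order)
    pulses = []
    for start, end in zip(boundaries[:-1], boundaries[1:]):
        pulse = ilpr[start:end]
        norm = np.sqrt(np.sum(pulse ** 2))
        if norm > 0:
            pulses.append(pulse / norm)
    return pulses
```

In this sketch the database is simply a list of variable-length arrays, one per glottal cycle. A real system would likely also normalize pulse length before any clustering step.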
Abstract
A method is presented for forming the excitation signal for a glottal pulse model based parametric speech synthesis system. In one embodiment, fundamental frequency values are used to form the excitation signal. The excitation is modeled using a voice source pulse selected from a database of a given speaker. The voice source signal is segmented into glottal segments, which are used in vector representation to identify the glottal pulse used for formation of the excitation signal. Using a novel distance metric and preserving the original signals extracted from the speaker's voice samples help capture low-frequency information of the excitation signal. In addition, segment edge artifacts are removed by applying a unique segment joining method, which improves the quality of synthetic speech while creating a true representation of the voice quality of a speaker.
35 Claims
4. A method for creating parametric models for use in training a speech synthesis system performed by a generic computer processor, wherein the system comprises at least a glottal pulse database, the method comprising the steps of:
a. determining a glottal pulse distance metric between a number of glottal pulses;
b. clustering the glottal pulse database into a number of clusters to determine centroid glottal pulses;
c. forming a corresponding vector database by associating a vector with each glottal pulse in the glottal pulse database, wherein the centroid glottal pulses and the distance metric are defined mathematically to determine association;
d. determining Eigenvectors of the vector database;
e. creating parametric models by associating a glottal pulse from the glottal pulse database to each determined Eigenvector; and
f. applying the created parametric models to the speech synthesis system in order to train the speech synthesis system.
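Steps (a) through (e) of claim 4 can be pictured with the following hedged Python sketch. The claimed distance metric is not defined in this excerpt, so Euclidean distance stands in for it. Plain k-means, the distance-to-centroid vector representation, and the projection rule used to pick one pulse per eigenvector are likewise assumptions made for illustration. Pulses are assumed to have been resampled to a common length, and all function names are hypothetical.

```python
import numpy as np

def pulse_distance(p, q):
    # Step (a): placeholder metric (Euclidean), not the claimed metric.
    return float(np.linalg.norm(p - q))

def cluster_pulses(pulses, k, iters=20, seed=0):
    # Step (b): plain k-means over equal-length pulses; returns k centroids.
    rng = np.random.default_rng(seed)
    centroids = pulses[rng.choice(len(pulses), size=k, replace=False)].copy()
    for _ in range(iters):
        d = np.array([[pulse_distance(p, c) for c in centroids] for p in pulses])
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = pulses[labels == j].mean(axis=0)
    return centroids

def vector_database(pulses, centroids):
    # Step (c), assumed form: each pulse is represented by its vector of
    # distances to the centroid pulses.
    return np.array([[pulse_distance(p, c) for c in centroids] for p in pulses])

def eigen_pulses(pulses, vectors):
    # Step (d): eigenvectors of the covariance of the vector database,
    # ordered by decreasing eigenvalue.
    centered = vectors - vectors.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
    order = np.argsort(eigvals)[::-1]
    chosen = []
    for idx in order:
        v = eigvecs[:, idx]
        # Step (e), assumed rule: pick the pulse whose centered vector has
        # the largest absolute projection onto this eigenvector.
        chosen.append(pulses[np.argmax(np.abs(centered @ v))])
    return np.array(chosen)
```

The selected pulses are actual entries of the database rather than averaged shapes, which is in keeping with the abstract's point about preserving the original signals.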
15. A method to synthesize speech performed by a generic computer processor using input text, comprising the steps of:
a. converting the input text into context dependent phone labels;
b. processing the phone labels created in step (a) using trained parametric models to predict fundamental frequency values, duration of the speech synthesized, and spectral features of the phone labels;
c. creating an excitation signal using an Eigen glottal pulse and said predicted one or more of: fundamental frequency values, spectral features of phone labels, and duration of the speech synthesized;
d. dividing signal regions of excitation into categories of segments comprising one or more of: voiced, unvoiced, and/or pause;
e. creating an excitation for each category;
f. combining the excitation signal with the spectral features of the phone labels using a filter to create synthetic speech output.
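Steps (c) through (e) of claim 15 can be sketched in Python as follows, under stated assumptions. The frame hop, the sampling rate, linear-interpolation resampling of the pulse to one pitch period, white noise for unvoiced frames, and silence for pauses are illustrative choices, not details taken from the claims. The function names and the `pause` mask are hypothetical. The filtering of step (f) is not covered.

```python
import numpy as np

def resample_pulse(pulse, length):
    # Stretch or compress the pulse to `length` samples by linear interpolation.
    src = np.linspace(0.0, 1.0, num=len(pulse))
    dst = np.linspace(0.0, 1.0, num=length)
    return np.interp(dst, src, pulse)

def make_excitation(f0, pulse, fs=16000, hop=80, pause=None, seed=0):
    # f0: one value per frame in Hz, with 0 marking an unvoiced frame.
    # pause: optional boolean mask per frame (hypothetical input).
    rng = np.random.default_rng(seed)
    f0 = np.asarray(f0, dtype=float)
    if pause is None:
        pause = np.zeros(len(f0), dtype=bool)
    else:
        pause = np.asarray(pause, dtype=bool)
    out = np.zeros(len(f0) * hop)
    n = 0
    while n < len(out):
        frame = n // hop
        if pause[frame]:
            # Pause: leave silence and jump to the next frame.
            n = (frame + 1) * hop
        elif f0[frame] > 0:
            # Voiced: place one pulse scaled to the current pitch period.
            period = max(int(round(fs / f0[frame])), 2)
            seg = resample_pulse(pulse, period)[: len(out) - n]
            out[n : n + len(seg)] = seg
            n += period
        else:
            # Unvoiced: fill the rest of the frame with low-level white noise.
            end = (frame + 1) * hop
            out[n:end] = 0.1 * rng.standard_normal(end - n)
            n = end
    return out
```

Here pulses are simply placed back to back. The abstract mentions a segment joining method that removes edge artifacts, and this sketch does not attempt to reproduce it.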
Specification