Phonetic Hidden Markov model speech synthesizer
First Claim
1. A method for generating synthesized speech wherein an acoustic ergodic hidden Markov model (AEHMM) reflecting constraints on the acoustic arrangement of speech is correlated to a phonetic ergodic hidden Markov model (PhEHMM), the method comprising the steps of
a) building an AEHMM in which an observations sequence comprises speech features vectors extracted from frames in which the speech uttered during the training of said AEHMM is divided, and in which a hidden sequence comprises a sequence of sources that most probably emitted the speech utterance frames;
b) initializing said AEHMM by a vector quantization clustering scheme having the same size as said AEHMM;
c) training said AEHMM by the Forward-Backward algorithm and Baum-Welch re-estimation formulas;
d) associating with each frame a label representing a most probable source;
e) building a PhEHMM of the same size as said AEHMM in which an observations sequence comprises phoneme sequence obtained from a written text, and in which a hidden sequence comprises a sequence of labels;
f) initializing a PhEHMM transition probability matrix by assigning to state transition probabilities the same values as the transition probabilities of the corresponding states of said AEHMM;
g) initializing PhEHMM observation probability functions by:
(g.1) using a speech corpus aligned with a sequence of phonemes,
(g.2) generating for said speech corpus a sequence of most probable labels, using said AEHMM, and
(g.3) computing the observations probability function for each phoneme, counting the number of occurrences of the phoneme in a state divided by the total number of phonemes emitted by said state;
h) training said PhEHMM by the Baum-Welch algorithm on a proper synthetic observations corpus;
h.1) providing an input text of one or more words to be synthesized;
i) determining for each word to be synthesized a phoneme sequence and through said PhEHMM a sequence of labels corresponding to the word to be synthesized by means of a proper optimality criterion;
j) determining from the input text a set of additional parameters, as energy, prosody contours and voicing, by a prosodic processor;
k) determining, for the sequence of labels corresponding to the word to be synthesized, a set of speech features vectors corresponding to the word to be synthesized through said AEHMM;
l) transforming said speech features vectors corresponding to the word to be synthesized into a set of filter coefficients representing spectral information; and
m) using said set of filter coefficients and said additional parameters in a synthesis filter to produce a synthetic speech output.
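Step (g.3) amounts to a relative-frequency estimate, which can be sketched in a few lines. The following is a minimal illustration only, not the patentee's implementation: the function name `init_observation_probs`, the frame-aligned list inputs, and the toy data are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def init_observation_probs(labels, phonemes):
    """Estimate P(phoneme | state) by counting, as described in step (g.3).

    labels   -- most probable AEHMM state label for each frame (hypothetical input)
    phonemes -- phoneme aligned with each frame (hypothetical input)
    """
    if len(labels) != len(phonemes):
        raise ValueError("labels and phonemes must be frame-aligned")

    # counts[state][phoneme] = number of frames where the state emitted the phoneme
    counts = defaultdict(Counter)
    for state, phoneme in zip(labels, phonemes):
        counts[state][phoneme] += 1

    # Divide each count by the total number of phonemes emitted by that state.
    probs = {}
    for state, counter in counts.items():
        total = sum(counter.values())
        probs[state] = {ph: n / total for ph, n in counter.items()}
    return probs


# Toy data: six frames, two states, two phonemes.
labels = [0, 0, 1, 1, 1, 0]
phonemes = ["a", "a", "t", "t", "a", "t"]
probs = init_observation_probs(labels, phonemes)
```

In this toy data, state 0 emits "a" in two of its three frames, so its estimate for "a" is 2/3. Phonemes never seen in a state simply have no entry here; a practical system would probably smooth these zero counts before Baum-Welch training, which the claim does not address.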
Abstract
A method and a system for synthesizing speech from unrestricted text, based on the principle of associating a written string of text with a sequence of speech features vectors that most probably model the corresponding speech utterance. The synthesizer is based on the interaction between two different Ergodic Hidden Markov Models: an acoustic model reflecting the constraints on the acoustic arrangement of speech, and a phonetic model interfacing phonemic transcription to the speech features representation.
10 Claims
1. A method for generating synthesized speech wherein an acoustic ergodic hidden Markov model (AEHMM) reflecting constraints on the acoustic arrangement of speech is correlated to a phonetic ergodic hidden Markov model (PhEHMM), the method comprising the steps of
a) building an AEHMM in which an observations sequence comprises speech features vectors extracted from frames in which the speech uttered during the training of said AEHMM is divided, and in which a hidden sequence comprises a sequence of sources that most probably emitted the speech utterance frames;
b) initializing said AEHMM by a vector quantization clustering scheme having the same size as said AEHMM;
c) training said AEHMM by the Forward-Backward algorithm and Baum-Welch re-estimation formulas;
d) associating with each frame a label representing a most probable source;
e) building a PhEHMM of the same size as said AEHMM in which an observations sequence comprises phoneme sequence obtained from a written text, and in which a hidden sequence comprises a sequence of labels;
f) initializing a PhEHMM transition probability matrix by assigning to state transition probabilities the same values as the transition probabilities of the corresponding states of said AEHMM;
g) initializing PhEHMM observation probability functions by:
(g.1) using a speech corpus aligned with a sequence of phonemes,
(g.2) generating for said speech corpus a sequence of most probable labels, using said AEHMM, and
(g.3) computing the observations probability function for each phoneme, counting the number of occurrences of the phoneme in a state divided by the total number of phonemes emitted by said state;
h) training said PhEHMM by the Baum-Welch algorithm on a proper synthetic observations corpus;
h.1) providing an input text of one or more words to be synthesized;
i) determining for each word to be synthesized a phoneme sequence and through said PhEHMM a sequence of labels corresponding to the word to be synthesized by means of a proper optimality criterion;
j) determining from the input text a set of additional parameters, as energy, prosody contours and voicing, by a prosodic processor;
k) determining, for the sequence of labels corresponding to the word to be synthesized, a set of speech features vectors corresponding to the word to be synthesized through said AEHMM;
l) transforming said speech features vectors corresponding to the word to be synthesized into a set of filter coefficients representing spectral information; and
m) using said set of filter coefficients and said additional parameters in a synthesis filter to produce a synthetic speech output.
Dependent claims: 2, 3
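Step (i) leaves the "proper optimality criterion" open. One common choice for recovering a single best state sequence from a discrete HMM is Viterbi decoding, sketched below purely as an assumption; the claim does not name it. The two-state model, its probabilities, and the function name `viterbi` are hypothetical.

```python
import math


def viterbi(obs, states, log_pi, log_a, log_b):
    """Return the most probable state (label) sequence for a discrete HMM.

    All probabilities are given as logarithms to avoid numerical underflow.
    """
    # delta[s] = best log-probability of any path ending in state s
    delta = {s: log_pi[s] + log_b[s][obs[0]] for s in states}
    backptrs = []
    for o in obs[1:]:
        prev, delta, ptr = delta, {}, {}
        for s in states:
            # Best predecessor of s at this time step.
            best = max(states, key=lambda r: prev[r] + log_a[r][s])
            delta[s] = prev[best] + log_a[best][s] + log_b[s][o]
            ptr[s] = best
        backptrs.append(ptr)

    # Trace back from the best final state.
    path = [max(states, key=lambda s: delta[s])]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    path.reverse()
    return path


# Hypothetical two-state phonetic model over the phonemes "a" and "t".
states = [0, 1]
log_pi = {0: math.log(0.5), 1: math.log(0.5)}
log_a = {0: {0: math.log(0.9), 1: math.log(0.1)},
         1: {0: math.log(0.1), 1: math.log(0.9)}}
log_b = {0: {"a": math.log(0.9), "t": math.log(0.1)},
         1: {"a": math.log(0.1), "t": math.log(0.9)}}

path = viterbi(["a", "a", "t", "t"], states, log_pi, log_a, log_b)
```

With these toy parameters the decoded label sequence is `[0, 0, 1, 1]`. Claims 4, 7 and 9 instead describe per-state probabilities for each observation, which would correspond to a soft (forward-backward style) criterion rather than a single best path.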
4. A text-to-speech synthesizer system comprising:
a text input device for entering text of speech to be synthesized;
a phonetic processor for converting the text input into a phonetic representation and for determining phonetic duration parameters;
a prosodic processor for generating prosodic and energy contours for the speech to be synthesized; and
a synthesis filter which, using said prosodic and energy contours and filter coefficients, generates the speech to be synthesized;
characterized in that:
said phonetic processor includes a synthetic observations generator which translates said phonetic representation of the input text into a string of phonetic symbols, each phonetic symbol repeated to properly reflect the phoneme duration, and said phonetic processor generates a Phonetic Ergodic Hidden Markov Model (PhEHMM) observation sequence; and
the system further comprises:
a labelling unit associating with each observation of said observations sequence the probability that a state of the PhEHMM has generated said observation by an optimality criterion; and
a spectra sequence production unit computing a speech features vector for each speech frame to be synthesized by a correlation between labels and speech features vectors, computed by an Acoustic Ergodic Hidden Markov Model (AEHMM), built on previously uttered speech corpus, said spectra sequence production unit converting by a back transformation the speech features vectors into filter coefficients to be used by said synthesis filter.
Dependent claims: 5, 6
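The synthetic observations generator of claim 4 repeats each phonetic symbol so that the observation string reflects phoneme duration. A minimal sketch follows, assuming durations are given in milliseconds and one observation is emitted per fixed-length frame; the 10 ms frame period, the rounding rule, and the function name are illustrative assumptions, not values taken from the claim.

```python
def synthetic_observations(phonemes, durations_ms, frame_ms=10.0):
    """Repeat each phonetic symbol once per frame of its duration.

    frame_ms is a hypothetical frame period; the claim does not specify one.
    """
    if len(phonemes) != len(durations_ms):
        raise ValueError("each phoneme needs exactly one duration")

    observations = []
    for phoneme, duration in zip(phonemes, durations_ms):
        # Emit at least one frame so that no phoneme disappears entirely.
        n_frames = max(1, round(duration / frame_ms))
        observations.extend([phoneme] * n_frames)
    return observations


obs = synthetic_observations(["h", "e"], [30, 50])
```

Here `obs` holds three copies of "h" followed by five copies of "e", the kind of string that would then be passed to the labelling unit as a PhEHMM observation sequence.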
7. A method of generating synthesized speech, said method comprising the steps of:
generating a set of acoustic hidden Markov models, each acoustic hidden Markov model comprising a plurality of states, transitions between the states, a set of acoustic features vectors outputs associated with the states or transitions, and probabilities of the transitions and of the outputs;
generating a set of phonetic hidden Markov models, each phonetic hidden Markov model comprising a plurality of states, transitions between the states, a set of phonetic symbol outputs associated with the states or transitions, and probabilities of the transitions and of the outputs, each phonetic hidden Markov model being correlated with exactly one acoustic hidden Markov model;
converting a text of words into a series of phonetic symbols;
estimating, for each phonetic symbol in the series of phonetic symbols and for each phonetic hidden Markov model, the probability that the phonetic hidden Markov model would generate the phonetic symbol;
generating, for each phonetic symbol in the series of phonetic symbols, at least one acoustic features vector comprising a weighted sum of acoustic features vectors expected to be output by the acoustic hidden Markov models, each expected acoustic features vector being weighted by the probability that the phonetic hidden Markov model correlated with the acoustic hidden Markov model would generate the phonetic symbol; and
producing synthetic speech from the generated acoustic features vectors.
Dependent claims: 8
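The weighted-sum step of claim 7 can be illustrated with a short sketch. It assumes each acoustic model is summarized by one expected (mean) features vector and that the weights are normalized to sum to one before use; the normalization, the function name, and the toy numbers are assumptions, since the claim only states that each vector is weighted by the corresponding probability.

```python
def weighted_feature_vector(weights, expected_vectors):
    """Combine expected acoustic features vectors, one per acoustic model.

    weights          -- probability that each correlated phonetic model
                        would generate the current phonetic symbol
    expected_vectors -- expected features vector of each acoustic model
    """
    if len(weights) != len(expected_vectors):
        raise ValueError("one weight is required per acoustic model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")

    dim = len(expected_vectors[0])
    result = [0.0] * dim
    for weight, vector in zip(weights, expected_vectors):
        for k in range(dim):
            # Normalizing by the total is an assumption of this sketch.
            result[k] += (weight / total) * vector[k]
    return result


# Two hypothetical models with two-dimensional features vectors.
vec = weighted_feature_vector([0.75, 0.25], [[1.0, 0.0], [0.0, 2.0]])
```

For these toy inputs the result is `[0.75, 0.5]`. Producing one such vector per frame gives a sequence that varies smoothly between model means instead of jumping from one to the next.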
9. A text-to-speech synthesizer comprising:
means for storing a set of acoustic hidden Markov models, each acoustic hidden Markov model comprising a plurality of states, transitions between the states, a set of acoustic features vectors outputs associated with the states or transitions, and probabilities of the transitions and of the outputs;
means for storing a set of phonetic hidden Markov models, each phonetic hidden Markov model comprising a plurality of states, transitions between the states, a set of phonetic symbol outputs associated with the states or transitions, and probabilities of the transitions and of the outputs, each phonetic hidden Markov model being correlated with exactly one acoustic hidden Markov model;
a text input device for entering a text of words;
a phonetic processor for converting the text of words into a series of phonetic symbols;
a labeling unit for estimating, for each phonetic symbol in the series of phonetic symbols and for each phonetic hidden Markov model, the probability that the phonetic hidden Markov model would generate the phonetic symbol;
a spectra sequence production unit for generating, for each phonetic symbol in the series of phonetic symbols, at least one acoustic features vector comprising a weighted sum of acoustic features vectors expected to be output by the acoustic hidden Markov models, each expected acoustic features vector being weighted by the probability that the phonetic hidden Markov model correlated with the acoustic hidden Markov model would generate the phonetic symbol; and
a synthesis filter for producing synthetic speech from the generated acoustic features vectors.
Dependent claims: 10
Specification