Phonetic Hidden Markov model speech synthesizer

US 5,230,037 A
Filed: 06/07/1991
Issued: 07/20/1993
Est. Priority Date: 10/16/1990
Status: Expired due to Fees

Abstract

A method and a system for synthesizing speech from unrestricted text, based on the principle of associating a written string of text with a sequence of speech features vectors that most probably model the corresponding speech utterance. The synthesizer is based on the interaction between two different Ergodic Hidden Markov Models: an acoustic model reflecting the constraints on the acoustic arrangement of speech, and a phonetic model interfacing phonemic transcription to the speech features representation.

10 Claims

1. A method for generating synthesized speech wherein an acoustic ergodic hidden Markov model (AEHMM) reflecting constraints on the acoustic arrangement of speech is correlated to a phonetic ergodic hidden Markov model (PhEHMM), the method comprising the steps of:
- a) building an AEHMM in which an observations sequence comprises speech features vectors extracted from frames in which the speech uttered during the training of said AEHMM is divided, and in which a hidden sequence comprises a sequence of sources that most probably emitted the speech utterance frames;
  
  b) initializing said AEHMM by a vector quantization clustering scheme having the same size as said AEHMM;
  
  c) training said AEHMM by the Forward-Backward algorithm and Baum-Welch re-estimation formulas;
  
  d) associating with each frame a label representing a most probable source;
  
  e) building a PhEHMM of the same size as said AEHMM in which an observations sequence comprises phoneme sequence obtained from a written text, and in which a hidden sequence comprises a sequence of labels;
  
  f) initializing a PhEHMM transition probability matrix by assigning to state transition probabilities the same values as the transition probabilities of the corresponding states of said AEHMM;
  
  g) initializing PhEHMM observation probability functions by:
  
  (g.1) using a speech corpus aligned with a sequence of phonemes,
  
  (g.2) generating for said speech corpus a sequence of most probable labels, using said AEHMM, and
  
  (g.3) computing the observations probability function for each phoneme, counting the number of occurrences of the phoneme in a state divided by the total number of phonemes emitted by said state;
  
  h) training said PhEHMM by the Baum-Welch algorithm on a proper synthetic observations corpus;
  
  h.1) providing an input text of one or more words to be synthesized;
  
  i) determining for each word to be synthesized a phoneme sequence and through said PhEHMM a sequence of labels corresponding to the word to be synthesized by means of a proper optimality criterion;
  
  j) determining from the input text a set of additional parameters, as energy, prosody contours and voicing, by a prosodic processor;
  
  k) determining, for the sequence of labels corresponding to the word to be synthesized, a set of speech features vectors corresponding to the word to be synthesized through said AEHMM;
  
  l) transforming said speech features vectors corresponding to the word to be synthesized into a set of filter coefficients representing spectral information; and
  
  m) using said set of filter coefficients and said additional parameters in a synthesis filter to produce a synthetic speech output.
- - 2. A method for generating speech from unrestricted written text according to claim 1, wherein the proper optimality criterion of step i) is given by the Baum-Welch algorithm, and wherein the determination of the speech features vectors of step k) is obtained by weighting the features vectors by the probabilities of corresponding labels.
  - 3. A method for generating speech from unrestricted written text according to claim 1, wherein the proper optimality criterion of step i) is given by the Viterbi algorithm, and wherein the determination of the speech features vectors of step k) is obtained by associating with each label, in the sequence of labels corresponding to the word to be synthesized, the corresponding speech features vector of said AEHMM.
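
Step (g.3) of claim 1 reads as a relative-frequency estimate: for each state, the share of its frames that carry a given phoneme. The Python sketch below is one possible reading, assuming the corpus is frame-aligned so that every frame has both an AEHMM label and a phoneme. The function name `init_observation_probs`, the list-of-dictionaries layout, and the uniform fallback for states that never occur are illustrative assumptions and do not come from the patent.

```python
from collections import Counter


def init_observation_probs(labels, phonemes, n_states, phoneme_set):
    """Estimate P(phoneme | state) by counting, as in step (g.3).

    labels[t]   -- most probable AEHMM state for frame t (step g.2)
    phonemes[t] -- phoneme aligned with frame t (step g.1)
    """
    if len(labels) != len(phonemes):
        raise ValueError("labels and phonemes must be frame-aligned")

    # One counter per state: how often each phoneme was seen in that state.
    counts = [Counter() for _ in range(n_states)]
    for state, ph in zip(labels, phonemes):
        counts[state][ph] += 1

    probs = []
    for state in range(n_states):
        total = sum(counts[state].values())
        if total == 0:
            # Assumption (not in the patent): unseen states get a uniform row.
            row = {ph: 1.0 / len(phoneme_set) for ph in phoneme_set}
        else:
            row = {ph: counts[state][ph] / total for ph in phoneme_set}
        probs.append(row)
    return probs


# Toy corpus: six frames, three visited states, two phonemes.
labels = [0, 0, 1, 1, 1, 2]
phonemes = ["a", "a", "a", "t", "t", "t"]
B = init_observation_probs(labels, phonemes, n_states=4, phoneme_set=["a", "t"])
```

Each row of `B` sums to one, so it can serve directly as the starting observation distribution for the Baum-Welch training of step h).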

4. A text-to-speech synthesizer system comprising:
- a text input device for entering text of speech to be synthesized;
  
  a phonetic processor for converting the text input into a phonetic representation and for determining phonetic duration parameters;
  
  a prosodic processor for generating prosodic and energy contours for the speech to be synthesized; and
  
  a synthesis filter which, using said prosodic and energy contours and filter coefficients, generates the speech to be synthesized;
  
  characterized in that:
  
  said phonetic processor includes a synthetic observations generator which translates said phonetic representation of the input text into a string of phonetic symbols, each phonetic symbol repeated to properly reflect the phoneme duration, and said phonetic processor generates a Phonetic Ergodic Hidden Markov Model (PhEHMM) observation sequence; and
  
  the system further comprises:
  
  a labelling unit associating with each observation of said observations sequence the probability that a state of the PhEHMM has generated said observation by an optimality criterion; and
  
  a spectra sequence production unit computing a speech features vector for each speech frame to be synthesized by a correlation between labels and speech features vectors, computed by an Acoustic Ergodic Hidden Markov Model (AEHMM), built on previously uttered speech corpus, said spectra sequence production unit converting by a back transformation the speech features vectors into filter coefficients to be used by said synthesis filter.
- - 5. A text-to-speech synthesizer system of claim 4 in which the optimality criterion used in said labelling unit consists of computing the probability that each state generated a given observation by the Baum-Welch algorithm, and in which each speech features vector is computed by said AEHMM as a sum of the speech features vectors associated with each state of the PhEHMM, weighted by the probability that the state of the PhEHMM generated the observation, computed by said labelling unit.
  - 6. A text-to-speech synthesizer system of claim 4 wherein the optimality criterion used in said labelling unit consists of computing the sequence of the states that most probably have generated the observed synthetic observations sequence as obtained by the Viterbi algorithm, and wherein each speech features vector is obtained by associating with each state of the PhEHMM the corresponding source model of said AEHMM and a speech features vector comprising a mean vector associated with the source model.
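
Claims 4 and 5 can be illustrated with a small Python sketch: a synthetic observations generator that repeats each phonetic symbol once per frame of its duration, followed by a labelling step that computes, for every observation, the probability of each PhEHMM state. The labelling here uses the standard scaled forward-backward recursion over a discrete-observation HMM. All names (`expand_by_duration`, `state_posteriors`), the two-state toy model, and the frame counts are assumptions made for illustration; the patent does not specify them.

```python
def expand_by_duration(phonemes, durations):
    """Repeat each phonetic symbol once per frame of its duration."""
    obs = []
    for ph, n_frames in zip(phonemes, durations):
        obs.extend([ph] * n_frames)
    return obs


def state_posteriors(obs, pi, A, B):
    """Scaled forward-backward: gamma[t][i] = P(state i at time t | obs).

    pi[i]    -- initial probability of state i
    A[i][j]  -- transition probability from state i to state j
    B[i][o]  -- probability that state i emits symbol o
    Assumes every observation has non-zero probability under the model.
    """
    n, T = len(pi), len(obs)

    # Forward pass, normalising each step to avoid underflow.
    alpha, scale = [], []
    row = [pi[i] * B[i][obs[0]] for i in range(n)]
    for t in range(T):
        if t > 0:
            prev = alpha[-1]
            row = [sum(prev[i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                   for j in range(n)]
        c = sum(row)
        scale.append(c)
        alpha.append([a / c for a in row])

    # Backward pass, reusing the forward scaling factors.
    beta = [[1.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(n):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                             for j in range(n)) / scale[t + 1]

    # Combine and renormalise per frame.
    gamma = []
    for t in range(T):
        g = [alpha[t][i] * beta[t][i] for i in range(n)]
        z = sum(g)
        gamma.append([x / z for x in g])
    return gamma


# Hypothetical two-state model over the symbols "a" and "t".
pi = [0.5, 0.5]
A = [[0.9, 0.1], [0.1, 0.9]]
B = [{"a": 0.8, "t": 0.2}, {"a": 0.2, "t": 0.8}]

obs = expand_by_duration(["a", "t"], [2, 3])  # two frames of "a", three of "t"
gamma = state_posteriors(obs, pi, A, B)
```

Each row of `gamma` is a probability distribution over states for one frame, which is the quantity the labelling unit of claim 5 passes on to weight the speech features vectors.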

7. A method of generating synthesized speech, said method comprising the steps of:
- generating a set of acoustic hidden Markov models, each acoustic hidden Markov model comprising a plurality of states, transitions between the states, a set of acoustic features vectors outputs associated with the states or transitions, and probabilities of the transitions and of the outputs;
  
  generating a set of phonetic hidden Markov models, each phonetic hidden Markov model comprising a plurality of states, transitions between the states, a set of phonetic symbol outputs associated with the states or transitions, and probabilities of the transitions and of the outputs, each phonetic hidden Markov model being correlated with exactly one acoustic hidden Markov model;
  
  converting a text of words into a series of phonetic symbols;
  
  estimating, for each phonetic symbol in the series of phonetic symbols and for each phonetic hidden Markov model, the probability that the phonetic hidden Markov model would generate the phonetic symbol;
  
  generating, for each phonetic symbol in the series of phonetic symbols, at least one acoustic features vector comprising a weighted sum of acoustic features vectors expected to be output by the acoustic hidden Markov models, each expected acoustic features vector being weighted by the probability that the phonetic hidden Markov model correlated with the acoustic hidden Markov model would generate the phonetic symbol; and
  
  producing synthetic speech from the generated acoustic features vectors.
- - 8. A method as claimed in claim 7, characterized in that the step of estimating, for each phonetic symbol in the series of phonetic symbols and for each phonetic Markov model, the probability that the phonetic Markov model would generate the phonetic symbol comprises:
    - estimating, for each phonetic symbol in the series of phonetic symbols and for each phonetic Markov model, the phonetic Markov model which would most likely generate the phonetic symbol;
      
      estimating the probability that the most likely phonetic Markov model would generate the phonetic symbol as one; and
      
      estimating the probability that each other phonetic Markov model would generate the phonetic symbol as zero.
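
The difference between claim 7 (probability-weighted sum) and claim 8 (probability one for the most likely model, zero for the rest) can be shown in a few lines of Python. This is only a sketch: it assumes each acoustic model contributes a single expected feature vector, and the names `weighted_feature_vector` and `harden`, along with the numeric values, are hypothetical.

```python
def weighted_feature_vector(probs, expected_vectors):
    """Sum the expected acoustic vectors, each weighted by its probability."""
    dim = len(expected_vectors[0])
    out = [0.0] * dim
    for p, vec in zip(probs, expected_vectors):
        for d in range(dim):
            out[d] += p * vec[d]
    return out


def harden(probs):
    """Claim 8 variant: one for the most likely model, zero for all others."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return [1.0 if i == best else 0.0 for i in range(len(probs))]


# Three hypothetical models with two-dimensional expected feature vectors.
expected_vectors = [[1.0, 0.0], [0.0, 2.0], [4.0, 4.0]]
probs = [0.25, 0.5, 0.25]

soft = weighted_feature_vector(probs, expected_vectors)
hard = weighted_feature_vector(harden(probs), expected_vectors)
```

With the soft weights the result blends all three vectors; with the hardened weights it collapses to the vector of the single most likely model, so claim 8 is a special case of the weighted sum in claim 7.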

9. A text-to-speech synthesizer comprising:
- means for storing a set of acoustic hidden Markov models, each acoustic hidden Markov model comprising a plurality of states, transitions between the states, a set of acoustic features vectors outputs associated with the states or transitions, and probabilities of the transitions and of the outputs;
  
  means for storing a set of phonetic hidden Markov models, each phonetic hidden Markov model comprising a plurality of states, transitions between the states, a set of phonetic symbol outputs associated with the states or transitions, and probabilities of the transitions and of the outputs, each phonetic hidden Markov model being correlated with exactly one acoustic hidden Markov model;
  
  a text input device for entering a text of words;
  
  a phonetic processor for converting the text of words into a series of phonetic symbols;
  
  a labeling unit for estimating, for each phonetic symbol in the series of phonetic symbols and for each phonetic hidden Markov model, the probability that the phonetic hidden Markov model would generate the phonetic symbol;
  
  a spectra sequence production unit for generating, for each phonetic symbol in the series of phonetic symbols, at least one acoustic features vector comprising a weighted sum of acoustic features vectors expected to be output by the acoustic hidden Markov models, each expected acoustic features vector being weighted by the probability that the phonetic hidden Markov model correlated with the acoustic hidden Markov model would generate the phonetic symbol; and
  
  a synthesis filter for producing synthetic speech from the generated acoustic features vectors.
- - 10. A system as claimed in claim 9, characterized in that the labeling unit comprises a Viterbi processor for estimating, for each phonetic symbol in the series of phonetic symbols and for each phonetic Markov model, the phonetic Markov model which would most likely generate the phonetic symbol, for estimating the probability that the most likely phonetic Markov model would generate the phonetic symbol as one, and for estimating the probability that each other phonetic Markov model would generate the phonetic symbol as zero.
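
The Viterbi processor of claim 10 can be sketched as the standard log-domain Viterbi recursion over a discrete-observation HMM, followed by conversion of the best path into the one/zero probabilities the claim describes. The function name `viterbi`, the parameter layout, and the two-state toy model are illustrative assumptions; the patent does not give an implementation.

```python
import math


def viterbi(obs, pi, A, B):
    """Return the most likely state sequence for obs (log-domain Viterbi).

    pi[i]    -- initial probability of state i
    A[i][j]  -- transition probability from state i to state j
    B[i][o]  -- probability that state i emits symbol o
    """
    n = len(pi)

    def log(x):
        # Treat zero probabilities as impossible rather than raising.
        return math.log(x) if x > 0 else float("-inf")

    delta = [log(pi[i]) + log(B[i][obs[0]]) for i in range(n)]
    back = []  # back[t-1][j] = best predecessor of state j at time t
    for o in obs[1:]:
        prev = delta
        ptr, delta = [], []
        for j in range(n):
            best = max(range(n), key=lambda i: prev[i] + log(A[i][j]))
            ptr.append(best)
            delta.append(prev[best] + log(A[best][j]) + log(B[j][o]))
        back.append(ptr)

    # Trace back from the best final state.
    state = max(range(n), key=delta.__getitem__)
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


# Hypothetical two-state model over the symbols "a" and "t".
pi = [0.5, 0.5]
A = [[0.9, 0.1], [0.1, 0.9]]
B = [{"a": 0.8, "t": 0.2}, {"a": 0.2, "t": 0.8}]

path = viterbi(["a", "a", "t", "t", "t"], pi, A, B)

# One for the most likely state at each step, zero for every other state.
one_hot = [[1.0 if i == s else 0.0 for i in range(len(pi))] for s in path]
```

Working in log probabilities keeps long observation sequences from underflowing; the rows of `one_hot` can then be used wherever a per-symbol probability vector is expected.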

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Giustiniani, Massimo; Pierucci, Piero
Primary Examiner(s): Fleming, Michael R.
Assistant Examiner(s): Doerrler, Michelle
Application Number: US07/716,022
Time in Patent Office: 774 Days
Field of Search: 381/41-53, 395/2
US Class Current: 704/200
CPC Class Codes: G10L 15/14 using statistical models, e...
