Text to speech synthesis system and method using context dependent vowel allophones
Abstract
A text-to-speech conversion system converts specified text strings into corresponding strings of consonant and vowel phonemes. A parameter generator converts the phonemes into formant parameters, and a formant synthesizer uses the formant parameters to generate a synthetic speech waveform. A library of vowel allophones is stored, each stored vowel allophone being represented by formant parameters for four formants. The vowel allophone library includes a context index for associating each said vowel allophone with one or more pairs of phonemes preceding and following the corresponding vowel phoneme in a phoneme string. When synthesizing speech, a vowel allophone generator uses the vowel allophone library to provide formant parameters representative of a specified vowel phoneme. The vowel allophone generator coacts with the context index to select the proper vowel allophone, as determined by the phonemes preceding and following the specified vowel phoneme. As a result, the synthesized pronunciation of vowel phonemes is improved by using vowel allophone formant parameters which correspond to the context of the vowel phonemes. The formant data for large sets of vowel allophones is efficiently stored using code books of formant parameters selected using vector quantization methods. The formant parameters for each vowel allophone are specified, in part, by indices pointing to formant parameters in the code books.
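The code-book storage described at the end of the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the `CODE_BOOK` contents, the formant values, the squared-Euclidean distance, and the function names `nearest_code` and `decode` do not come from the patent. The sketch only shows the general idea of replacing a stored formant vector with an index into a shared code book.

```python
from typing import List, Sequence, Tuple

FormantVector = Tuple[float, float, float, float]

# Hypothetical code book: each entry holds frequencies (Hz) for four formants.
# The values are illustrative only and are not taken from the patent.
CODE_BOOK: List[FormantVector] = [
    (270.0, 2290.0, 3010.0, 3500.0),
    (660.0, 1720.0, 2410.0, 3500.0),
    (730.0, 1090.0, 2440.0, 3500.0),
    (300.0, 870.0, 2240.0, 3500.0),
]


def nearest_code(vector: Sequence[float],
                 code_book: Sequence[FormantVector] = CODE_BOOK) -> int:
    """Return the index of the code-book entry closest to `vector`.

    Closeness is measured by squared Euclidean distance, which is an
    assumption of this sketch rather than a detail from the source.
    """
    def distance(entry: FormantVector) -> float:
        return sum((a - b) ** 2 for a, b in zip(entry, vector))

    return min(range(len(code_book)), key=lambda i: distance(code_book[i]))


def decode(index: int,
           code_book: Sequence[FormantVector] = CODE_BOOK) -> FormantVector:
    """Recover the formant parameters that a stored index points to."""
    return code_book[index]


# An allophone is then stored as a small index instead of four numbers.
measured = (700.0, 1100.0, 2400.0, 3500.0)
stored_index = nearest_code(measured)
restored = decode(stored_index)
```

Under these assumptions, many allophones can share a single code-book entry, so the per-allophone cost is one index rather than a full set of formant values.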
23 Claims
1. In a text-to-speech conversion system having means for converting a specified text string into a corresponding string of consonant and vowel phonemes, each said phoneme being selected from a predefined set of phonemes including a multiplicity of consonant phonemes and a multiplicity of vowel phonemes;
parameter generating means for generating speech parameters corresponding to said string of phonemes; and
speech synthesizing means for generating a speech waveform corresponding to the speech parameters generated by said parameter generating means;
the improvement comprising;
vowel allophone storage means for storing a multiplicity of vowel allophones, each said stored vowel allophone comprising a set of speech parameters;
said vowel allophones including allophones for a multiplicity of vowel phonemes;
context table means for assigning one of said vowel allophones to every vowel phoneme context LVR, where V represents any vowel phoneme selected from said multiplicity of vowel phonemes, L represents any consonant phoneme immediately preceding said vowel phoneme V selected from said predefined set of phonemes, and R represents any consonant phoneme immediately following said vowel phoneme V selected from said predefined set of phonemes;
said context table means including a distinct entry for every phoneme context LVR denoting which of said vowel allophones is assigned to each said phoneme context LVR; and
vowel allophone generating means, coupled to said vowel allophone storage means, for providing speech parameters representative of a specified vowel phoneme to said parameter generating means, including allophone selection means coupled to said context table means for selecting one of said multiplicity of vowel allophones for each of at least a subset of said vowel phonemes in said string of phonemes, said allophone selection means including context indexing means for determining the phonemes in said string which immediately precede and follow said vowel phonemes in said string of phonemes, said allophone selection means including context indexing means for determining the phonemes in said string which immediately precede and follow said vowel phoneme in said string of phonemes, and table lookup means for assigning to said vowel phoneme the vowel allophone denoted in said context table means for said vowel phoneme in the context of said preceding and following phonemes;
whereby the speech parameters used to synthesize vowel phonemes represent vowel allophones corresponding to the contexts of said vowel phonemes.
Dependent claims: 2, 3, 4.

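The context table of claim 1, with a distinct entry for every L-V-R context, can be pictured with a small sketch. The phoneme symbols, the allophone names such as `AE_1`, and the function `assign_allophones` are all hypothetical, and a real phoneme set would be much larger. Leaving a vowel unchanged when it is not flanked by two consonants is also an assumption of this sketch, not something the claim states.

```python
from typing import Dict, List, Tuple

# Hypothetical, deliberately tiny phoneme set (not from the patent).
CONSONANTS = ["b", "d", "t", "s"]
VOWELS = ["AE", "IY"]

# Context table: one distinct entry for every (L, V, R) combination.
# Each entry names the allophone assigned to that context.
CONTEXT_TABLE: Dict[Tuple[str, str, str], str] = {
    (left, vowel, right): f"{vowel}_default"
    for left in CONSONANTS
    for vowel in VOWELS
    for right in CONSONANTS
}
# Illustrative override: a specific allophone for "AE" between "b" and "t".
CONTEXT_TABLE[("b", "AE", "t")] = "AE_1"


def assign_allophones(phonemes: List[str]) -> List[str]:
    """Replace each vowel flanked by two consonants with its allophone.

    Neighbours are always read from the original phoneme string, so an
    earlier substitution never affects a later lookup.
    """
    result = list(phonemes)
    for i in range(1, len(phonemes) - 1):
        key = (phonemes[i - 1], phonemes[i], phonemes[i + 1])
        if key in CONTEXT_TABLE:
            result[i] = CONTEXT_TABLE[key]
    return result


example = assign_allophones(["b", "AE", "t", "s", "IY", "d"])
```

Because the table is fully populated, a lookup for any consonant-vowel-consonant triple from this toy phoneme set always succeeds; the size of the table is the product of the consonant count, the vowel count, and the consonant count again.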
5. In a text-to-speech conversion system having means for converting a specified text string into a corresponding string of consonant and vowel phonemes, each said phoneme being selected from a predefined set of phonemes including a multiplicity of consonant phonemes and a multiplicity of vowel phonemes;
parameter generating means for generating formant parameters corresponding to said string of phonemes; and
formant synthesizing means for generating a speech waveform corresponding to the formant parameters generated by said parameter generating means;
the improvement comprising;
vowel allophone storage means for storing a multiplicity of vowel allophones, each said stored vowel allophone comprising a set of formant parameters;
said vowel allophones including allophones for a multiplicity of vowel phonemes;
said vowel allophone storage means including context indexing means for associating each said vowel allophone with one or more pairs of phonemes preceding and following the corresponding vowel phoneme in a phoneme string;
context table means for assigning one of said vowel allophones to every vowel phoneme context LVR, where V represents any vowel phoneme selected from at least a subset of said multiplicity of vowel phonemes, L represents any consonant phoneme immediately preceding said vowel phoneme V selected from said predefined set of phonemes, and R represents any consonant phoneme immediately following said vowel phoneme V selected from said predefined set of phonemes;
said context table means including a distinct entry for every phoneme context LVR denoting which of said vowel allophones is assigned to each said phoneme context LVR; and
vowel allophone generating means, coupled to said vowel allophone storage means, for providing formant parameters representative of a specified vowel phoneme to said parameter generating means, including allophone selection means coupled to said context table means for selecting one of said multiplicity of vowel allophones for each of at least a subset of said vowel phonemes in said string of phonemes, said allophone selection means including means for determining the phonemes in said string which immediately precede and follow said vowel phoneme in said string of phonemes, and means for assigning to said vowel phoneme the vowel allophone detected in said context table means for said vowel phoneme in the context of said preceding and following phonemes;
whereby the formant parameters used to synthesize vowel phonemes represent vowel allophones corresponding to the contexts of said vowel phonemes.
Dependent claims: 6, 7, 8, 9, 10, 11, 12, 13.

14. In a method of converting text strings into synthetic speech, the steps comprising:
defining a set of phonemes, including a multiplicity of consonant phonemes and a multiplicity of vowel phonemes;
storing a multiplicity of predefined vowel allophones, each vowel allophone being represented by a set of speech parameters;
denoting in a data structure an assigned one of said vowel allophones for every phoneme context LVR, where V represents any vowel phoneme selected from at least a subset of said multiplicity of vowel phonemes, L represents any consonant phoneme immediately preceding said vowel phoneme V selected from said predefined set of phonemes, and R represents any consonant phoneme immediately following said vowel phoneme V selected from said predefined set of phonemes;
said data structure containing a distinct allophone assignment entry for each said phoneme context LVR;
converting a specified text string into a corresponding string of phonemes, said string of phonemes including consonant and vowel phonemes, each said phoneme being selected from said defined set of phonemes; and
for each vowel phoneme in at least a subset of said vowel phonemes in said string of phonemes, determining the phonemes in said string which immediately precede and follow said vowel phoneme in said string of phonemes, and then assigning said vowel phoneme the vowel allophone denoted in said data structure for said vowel phoneme in the context of said preceding and following phonemes.
Dependent claims: 15, 16, 17.

18. In a method of converting text strings into synthetic speech, the steps comprising:
defining a set of phonemes, including a multiplicity of consonant phonemes and a multiplicity of vowel phonemes;
storing a multiplicity of predefined vowel allophones, each vowel allophone being represented by a set of formant parameters;
denoting in a data structure an assigned one of said vowel allophones for every phoneme context LVR, where V represents any vowel phoneme selected from at least a subset of said multiplicity of vowel phonemes, L represents any consonant phoneme immediately preceding said vowel phoneme V selected from said predefined set of phonemes, and R represents any consonant phoneme immediately following said vowel phoneme V selected from said predefined set of phonemes;
said data structure containing a distinct allophone assignment entry for each said phoneme context LVR; and
converting a specified text string into a corresponding string of phonemes, said string of phonemes including consonant and vowel phonemes, each said phoneme being selected from said defined set of phonemes;
for each vowel phoneme in at least a subset of said vowel phonemes in said string of phonemes, determining the phonemes in said string which immediately precede and follow said vowel phoneme in said string of phonemes, and then assigning said vowel phoneme the vowel allophone denoted in said data structure for said vowel phoneme in the context of said preceding and following phonemes.
Dependent claims: 19, 20, 21.

22. In a method of converting text strings into synthetic speech, the steps comprising:
defining a set of phonemes, including a multiplicity of consonant phonemes and a multiplicity of vowel phonemes;
storing a multiplicity of predefined vowel allophones, each vowel allophone being represented by a set of speech parameters;
converting a specified text string into a corresponding string of phonemes, said string of phonemes including consonant and vowel phonemes, each said phoneme being selected from said defined set of phonemes; and
for each of at least a subset of said vowel phonemes in said string of phonemes, computing a phoneme context value for said vowel phoneme as a function of the phonemes in said string of phonemes which precede and follow said vowel phoneme, and then assigning to said vowel phoneme a selected one of said predefined vowel allophones corresponding to said computed phoneme context value; and
converting said string of phonemes, including said assigned vowel allophones, into speech parameters and then generating an audio waveform corresponding to said speech parameters.

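Claim 22 speaks of computing a phoneme context value as a function of the phonemes that precede and follow the vowel, without fixing what that function is. One possible reading, offered purely as an illustration, is to combine the positions of the two neighbouring phonemes into a single integer that indexes a per-vowel list of allophone numbers. The phoneme set, the row-major formula, and the names `context_value` and `select_allophone` are assumptions of this sketch and are not taken from the claim.

```python
from typing import Dict, List

# Hypothetical consonant inventory; each consonant gets a position number.
CONSONANTS = ["b", "d", "t", "s"]
POSITION: Dict[str, int] = {p: i for i, p in enumerate(CONSONANTS)}


def context_value(left: str, right: str) -> int:
    """Combine the preceding and following phonemes into one integer.

    Row-major packing: the left neighbour selects the row and the right
    neighbour selects the column. This particular formula is an assumption.
    """
    return POSITION[left] * len(CONSONANTS) + POSITION[right]


# Per-vowel list of allophone numbers, indexed by context value.
# Allophone 0 stands in for a default; the numbers are illustrative.
ALLOPHONE_BY_CONTEXT: Dict[str, List[int]] = {
    "AE": [0] * (len(CONSONANTS) ** 2),
}
ALLOPHONE_BY_CONTEXT["AE"][context_value("b", "t")] = 3


def select_allophone(vowel: str, left: str, right: str) -> int:
    """Return the allophone number assigned to `vowel` in this context."""
    return ALLOPHONE_BY_CONTEXT[vowel][context_value(left, right)]
```

With four consonants the context values run from 0 to 15, so each vowel needs a list of 16 allophone numbers under this scheme.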
23. A text-to-speech synthesis system, comprising:
vowel allophone storage means storing a multiplicity of predefined vowel allophones, each vowel allophone being represented by a set of speech parameters;
text conversion means for converting a specified text string into a corresponding string of consonant and vowel phonemes, each said phoneme being selected from a predefined set of phonemes including a multiplicity of consonant phonemes and a multiplicity of vowel phonemes;
vowel phoneme to allophone conversion means, coupled to said text conversion means and said vowel allophone storage means, for computing a phoneme context value for each of at least a subset of said vowel phonemes in said string of phonemes, said phoneme context value comprising a function of the phonemes in said string of phonemes which precede and follow said vowel phoneme, and for then assigning to said vowel phoneme a selected one of said predefined vowel allophones corresponding to said computed phoneme context value;
parameter generating means for generating speech parameters corresponding to said string of phonemes, including said speech parameters for said assigned vowel allophones; and
speech synthesizing means for generating a speech waveform corresponding to the speech parameters generated by said parameter generating means.

Specification