Text to speech synthesis system and method using context dependent vowel allophones

US 4,979,216 A
Filed: 02/17/1989
Issued: 12/18/1990
Est. Priority Date: 02/17/1989
Status: Expired due to Term

Abstract

A text-to-speech conversion system converts specified text strings into corresponding strings of consonant and vowel phonemes. A parameter generator converts the phonemes into formant parameters, and a formant synthesizer uses the formant parameters to generate a synthetic speech waveform. A library of vowel allophones is stored, each stored vowel allophone being represented by formant parameters for four formants. The vowel allophone library includes a context index for associating each said vowel allophone with one or more pairs of phonemes preceding and following the corresponding vowel phoneme in a phoneme string. When synthesizing speech, a vowel allophone generator uses the vowel allophone library to provide formant parameters representative of a specified vowel phoneme. The vowel allophone generator coacts with the context index to select the proper vowel allophone, as determined by the phonemes preceding and following the specified vowel phoneme. As a result, the synthesized pronunciation of vowel phonemes is improved by using vowel allophone formant parameters which correspond to the context of the vowel phonemes. The formant data for large sets of vowel allophones is efficiently stored using code books of formant parameters selected using vector quantization methods. The formant parameters for each vowel allophone are specified, in part, by indices pointing to formant parameters in the code books.
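
As a rough illustration of the context-indexed lookup described in the abstract, the following Python sketch maps each (preceding phoneme, vowel, following phoneme) context to an allophone identifier whose formant parameters sit in a small library. The phoneme symbols, allophone names, and formant values are invented placeholders and are not taken from the patent; treating unknown contexts as "leave the phoneme unchanged" is likewise an assumption of this sketch.

```python
from typing import Dict, List, Tuple

# Hypothetical allophone library: allophone id -> four formant values in Hz.
# The numbers are placeholders, not measured data.
ALLOPHONE_LIBRARY: Dict[str, Tuple[int, int, int, int]] = {
    "AE_1": (660, 1720, 2410, 3300),
    "AE_2": (640, 1800, 2500, 3350),
}

# Hypothetical context index: (left, vowel, right) -> allophone id.
# Several contexts may share one allophone.
CONTEXT_INDEX: Dict[Tuple[str, str, str], str] = {
    ("K", "AE", "T"): "AE_1",
    ("B", "AE", "T"): "AE_1",
    ("M", "AE", "N"): "AE_2",
}

VOWELS = {"AE"}


def select_allophones(phonemes: List[str]) -> List[str]:
    """Replace each interior vowel with the allophone id chosen for its context."""
    out = list(phonemes)
    for i in range(1, len(phonemes) - 1):
        if phonemes[i] in VOWELS:
            key = (phonemes[i - 1], phonemes[i], phonemes[i + 1])
            # Contexts missing from the index keep the plain phoneme in this sketch.
            out[i] = CONTEXT_INDEX.get(key, phonemes[i])
    return out


def formants_for(allophone_id: str) -> Tuple[int, int, int, int]:
    """Fetch the stored formant parameters for an allophone id."""
    return ALLOPHONE_LIBRARY[allophone_id]
```

Calling `select_allophones(["K", "AE", "T"])` yields the allophone id for that context, and `formants_for` then retrieves the parameters a formant synthesizer would consume.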

23 Claims

1. In a text-to-speech conversion system having means for converting a specified text string into a corresponding string of consonant and vowel phonemes, each said phoneme being selected from a predefined set of phonemes including a multiplicity of consonant phonemes and a multiplicity of vowel phonemes;
- parameter generating means for generating speech parameters corresponding to said string of phonemes; and
  
  speech synthesizing means for generating a speech waveform corresponding to the speech parameters generated by said parameter generating means;
  
  the improvement comprising:
  
  vowel allophone storage means for storing a multiplicity of vowel allophones, each said stored vowel allophone comprising a set of speech parameters;
  
  said vowel allophones including allophones for a multiplicity of vowel phonemes;
  
  context table means for assigning one of said vowel allophones to every vowel phoneme context LVR, where V represents any vowel phoneme selected from said multiplicity of vowel phonemes, L represents any consonant phoneme immediately preceding said vowel phoneme V selected from said predefined set of phonemes, and R represents any consonant phoneme immediately following said vowel phoneme V selected from said predefined set of phonemes;
  
  said context table means including a distinct entry for every phoneme context LVR denoting which of said vowel allophones is assigned to each said phoneme context LVR; and
  
  vowel allophone generating means, coupled to said vowel allophone storage means, for providing speech parameters representative of a specified vowel phoneme to said parameter generating means, including allophone selection means coupled to said context table means for selecting one of said multiplicity of vowel allophones for each of at least a subset of said vowel phonemes in said string of phonemes, said allophone selection means including context indexing means for determining the phonemes in said string which immediately precede and follow said vowel phoneme in said string of phonemes, and table lookup means for assigning to said vowel phoneme the vowel allophone denoted in said context table means for said vowel phoneme in the context of said preceding and following phonemes;
  
  whereby the speech parameters used to synthesize vowel phonemes represent vowel allophones corresponding to the contexts of said vowel phonemes.
- View Dependent Claims (2, 3, 4)
- - 2. The text-to-speech conversion system set forth in claim 1, said vowel allophone storage means including:
    - speech storage means for storing the speech parameters for each said vowel allophone;
      
      said speech storage means including code book means for storing a multiplicity of sets of speech parameters; and
      
      allophone means for denoting, for each said vowel allophone, one of said multiplicity of sets of speech parameters in said code book means.
  - 3. The text-to-speech conversion system set forth in claim 1, said context indexing means including vowel substitution means for use when a vowel phoneme V₁ in said string of phonemes is immediately preceded or followed by a vowel phoneme, said vowel substitution means including means for selecting an entry in said context table means to use for assigning one of said vowel allophones to said vowel phoneme V₁.
  - 4. The text-to-speech conversion system as set forth in claim 1, said context indexing means including vowel substitution means for use when a vowel phoneme V₁ in said string of phonemes occurs in a phoneme context CV₁ V₂ or V₂ V₁ C, where C is a consonant phoneme and V₂ is a vowel phoneme neighboring said vowel phoneme V₁, said vowel substitution means including means for selecting one of said phoneme contexts LVR which is phonetically equivalent to said phoneme context CV₁ V₂ or V₂ V₁ C;
    - said table lookup means including means for assigning to said vowel phoneme V₁ the vowel allophone denoted in said context table means for said phonetically equivalent phoneme context LVR.
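
The vowel substitution of claims 3 and 4 can be sketched as follows. When a neighbor of the target vowel is itself a vowel, the neighbor is swapped for a consonant assumed to be phonetically equivalent, so that the ordinary LVR table can still be used. The specific vowel-to-consonant mapping and the table entries below are hypothetical; this section of the patent does not list them.

```python
from typing import Dict, Tuple

# Hypothetical stand-ins: each neighboring vowel is replaced by a consonant
# assumed to influence the target vowel in a similar way.
VOWEL_TO_CONSONANT: Dict[str, str] = {"IY": "Y", "UW": "W"}

# Hypothetical fragment of the context table: (L, V, R) -> allophone id.
CONTEXT_TABLE: Dict[Tuple[str, str, str], str] = {
    ("K", "AA", "Y"): "AA_7",
    ("W", "AA", "T"): "AA_3",
    ("K", "AA", "T"): "AA_1",
}


def equivalent_context(left: str, vowel: str, right: str) -> Tuple[str, str, str]:
    """Return an LVR context in which any vowel neighbor is replaced by its stand-in."""
    left = VOWEL_TO_CONSONANT.get(left, left)
    right = VOWEL_TO_CONSONANT.get(right, right)
    return (left, vowel, right)


def lookup(left: str, vowel: str, right: str) -> str:
    """Look up the allophone for a vowel, substituting for vowel neighbors first."""
    return CONTEXT_TABLE[equivalent_context(left, vowel, right)]
```

With this arrangement the context table itself never needs entries for vowel-vowel sequences; only the small substitution map grows.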

5. In a text-to-speech conversion system having means for converting a specified text string into a corresponding string of consonant and vowel phonemes, each said phoneme being selected from a predefined set of phonemes including a multiplicity of consonant phonemes and a multiplicity of vowel phonemes;
- parameter generating means for generating formant parameters corresponding to said string of phonemes; and
  
  formant synthesizing means for generating a speech waveform corresponding to the formant parameters generated by said parameter generating means;
  
  the improvement comprising:
  
  vowel allophone storage means for storing a multiplicity of vowel allophones, each said stored vowel allophone comprising a set of formant parameters;
  
  said vowel allophones including allophones for a multiplicity of vowel phonemes;
  
  said vowel allophone storage means including context indexing means for associating each said vowel allophone with one or more pairs of phonemes preceding and following the corresponding vowel phoneme in a phoneme string;
  
  context table means for assigning one of said vowel allophones to every vowel phoneme context LVR, where V represents any vowel phoneme selected from at least a subset of said multiplicity of vowel phonemes, L represents any consonant phoneme immediately preceding said vowel phoneme V selected from said predefined set of phonemes, and R represents any consonant phoneme immediately following said vowel phoneme V selected from said predefined set of phonemes;
  
  said context table means including a distinct entry for every phoneme context LVR denoting which of said vowel allophones is assigned to each said phoneme context LVR; and
  
  vowel allophone generating means, coupled to said vowel allophone storage means, for providing formant parameters representative of a specified vowel phoneme to said parameter generating means, including allophone selection means coupled to said context table means for selecting one of said multiplicity of vowel allophones for each of at least a subset of said vowel phonemes in said string of phonemes, said allophone selection means including means for determining the phonemes in said string which immediately precede and follow said vowel phoneme in said string of phonemes, and means for assigning to said vowel phoneme the vowel allophone denoted in said context table means for said vowel phoneme in the context of said preceding and following phonemes;
  
  whereby the formant parameters used to synthesize vowel phonemes represent vowel allophones corresponding to the contexts of said vowel phonemes.
- View Dependent Claims (6, 7, 8, 9, 10, 11, 12, 13)
- - 6. The text-to-speech conversion system set forth in claim 5, said vowel allophone storage means including:
    - formant storage means for storing parameters for a multiplicity of formants for each said vowel allophone;
      
      said formant storage means including code book means for storing a multiplicity of sets of formant parameters; and
      
      allophone means for denoting, for each said vowel allophone, one of said multiplicity of sets of formant parameters in said code book means.
  - 7. The text-to-speech conversion system set forth in claim 6, wherein the number of sets of formant parameters stored in said code book means is much less than the number of vowel allophones stored by said vowel allophone storage means;
    - the sets of formant parameters stored in said code book means being selected from sets of formant parameters representing substantially all of said vowel allophones using a minimax distortion vector quantization process.
  - 8. The text-to-speech conversion system set forth in claim 5, each vowel allophone in said vowel allophone storage means including a set of back and forward boundary parameters representative of speech formants at the boundaries of the allophone, and a set of intermediate parameters representative of speech formants between the back and forward boundaries of the allophone;
    - said vowel allophone storage means including:
      
      formant storage means for storing parameters for a multiplicity of formants for each said vowel allophone;
      
      said formant storage means including code book means for storing a multiplicity of sets of intermediate formant parameters; and
      
      allophone means for denoting, for each said vowel allophone, boundary values for said vowel allophone and one of said multiplicity of sets of intermediate formant parameters in said code book means.
  - 9. The text-to-speech conversion system set forth in claim 8, each said set of intermediate formant parameters in said code book means representing the intermediate trajectory of one formant for a vowel allophone;
    - said allophone means including means for denoting at least three of said sets of intermediate formant parameters;
      
      whereby said vowel allophones comprise the formant parameters for at least three formants.
  - 10. The text-to-speech conversion system set forth in claim 5, said vowel allophone storage means including means for storing vowel allophones as pronounced by a selected individual so that said text-to-speech conversion system produces synthetic speech which mimics said selected individual speaking an unlimited vocabulary.
  - 11. The text-to-speech conversion system set forth in claim 5, said vowel allophone storage means including means for storing vowel allophones as pronounced by an individual speaking a selected dialect so that said text-to-speech conversion system produces synthetic speech which mimics said selected dialect.
  - 12. The text-to-speech conversion system set forth in claim 5, said vowel allophone storage means including means for storing vowel allophones as pronounced by a specified cartoon character so that said text-to-speech conversion system produces synthetic speech which mimics said specified cartoon character.
  - 13. The text-to-speech conversion system set forth in claim 5, said vowel allophone storage means including means for storing vowel allophones as pronounced by a plurality of selected individuals so that said text-to-speech conversion system produces synthetic speech which mimics a plurality of selected individuals.
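
One possible reading of the storage layout in claims 8 and 9 is sketched below: each allophone keeps its own boundary values for every formant and, for the stretch between the boundaries, only an index into a shared code book of single-formant trajectories. The class names, the code book contents, and the assumption that the back boundary is the start of the vowel and the forward boundary its end are all illustrative choices, not details confirmed by this section.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical code book: each entry is the intermediate trajectory of ONE formant (Hz).
CODE_BOOK: List[Tuple[int, ...]] = [
    (500, 550, 600),      # entry 0
    (1500, 1600, 1700),   # entry 1
    (2400, 2450, 2500),   # entry 2
]


@dataclass(frozen=True)
class FormantRef:
    back: int        # formant value at the back boundary (assumed: start of vowel)
    forward: int     # formant value at the forward boundary (assumed: end of vowel)
    code_index: int  # which code book trajectory fills the middle


@dataclass(frozen=True)
class VowelAllophone:
    formants: Tuple[FormantRef, ...]  # at least three formants per claim 9


def formant_tracks(allophone: VowelAllophone) -> List[List[int]]:
    """Rebuild one full track per formant: back boundary, code book middle, forward boundary."""
    return [
        [ref.back, *CODE_BOOK[ref.code_index], ref.forward]
        for ref in allophone.formants
    ]
```

Because many allophones can point at the same code book entry, the per-allophone record shrinks to a few boundary values plus small indices.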

14. In a method of converting text strings into synthetic speech, the steps comprising:
- defining a set of phonemes, including a multiplicity of consonant phonemes and a multiplicity of vowel phonemes;
  
  storing a multiplicity of predefined vowel allophones, each vowel allophone being represented by a set of speech parameters;
  
  denoting in a data structure an assigned one of said vowel allophones for every phoneme context LVR, where V represents any vowel phoneme selected from at least a subset of said multiplicity of vowel phonemes, L represents any consonant phoneme immediately preceding said vowel phoneme V selected from said predefined set of phonemes, and R represents any consonant phoneme immediately following said vowel phoneme V selected from said predefined set of phonemes;
  
  said data structure containing a distinct allophone assignment entry for each said phoneme context LVR;
  
  converting a specified text string into a corresponding string of phonemes, said string of phonemes including consonant and vowel phonemes, each said phoneme being selected from said defined set of phonemes; and
  
  for each vowel phoneme in at least a subset of said vowel phonemes in said string of phonemes, determining the phonemes in said string which immediately precede and follow said vowel phoneme in said string of phonemes, and then assigning said vowel phoneme the vowel allophone denoted in said data structure for said vowel phoneme in the context of said preceding and following phonemes.
- View Dependent Claims (15, 16, 17)
- - 15. The method of converting text strings into synthetic speech as set forth in claim 14, said storing step including the step of providing code book means for storing a multiplicity of sets of speech parameters, and allophone means for denoting, for each said vowel allophone, one of said multiplicity of sets of speech parameters in said code book means.
  - 16. The method of converting text strings into synthetic speech as set forth in claim 15, wherein the number of sets of speech parameters stored in said code book means is much less than said predefined multiplicity of vowel allophones;
    - the sets of speech parameters stored in said code book means being selected from sets of speech parameters representing substantially all of said vowel allophones using a minimax distortion vector quantization process.
  - 17. The method of converting text strings into synthetic speech as set forth in claim 14, said storing step storing vowel allophones as pronounced by a selected individual so that said method produces synthetic speech which mimics said selected individual speaking.
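
Claim 16 calls for a code book chosen by a minimax distortion vector quantization process, but this section does not give the procedure. As a stand-in, the sketch below uses the well-known greedy farthest-point heuristic, which repeatedly adds the vector that is currently worst served and so drives down the maximum distance from any vector to its nearest code book entry. It is not claimed to be the patent's algorithm, and the Euclidean distance measure and function names are assumptions.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def distance(a: Vector, b: Vector) -> float:
    """Euclidean distance, used here as the distortion measure (an assumption)."""
    return math.dist(a, b)


def farthest_point_code_book(vectors: List[Vector], size: int) -> List[int]:
    """Greedily pick `size` indices so the worst-case distortion keeps shrinking."""
    if not 1 <= size <= len(vectors):
        raise ValueError("size must be between 1 and len(vectors)")
    chosen = [0]  # arbitrary starting entry
    nearest = [distance(v, vectors[0]) for v in vectors]
    while len(chosen) < size:
        # The vector farthest from every chosen entry becomes the next entry.
        nxt = max(range(len(vectors)), key=nearest.__getitem__)
        chosen.append(nxt)
        nearest = [min(d, distance(v, vectors[nxt])) for d, v in zip(nearest, vectors)]
    return chosen


def max_distortion(vectors: List[Vector], chosen: List[int]) -> float:
    """Largest distance from any vector to its nearest code book entry."""
    return max(min(distance(v, vectors[c]) for c in chosen) for v in vectors)


def assign(vectors: List[Vector], chosen: List[int]) -> List[int]:
    """For each vector, the position in `chosen` of its nearest code book entry."""
    return [
        min(range(len(chosen)), key=lambda k: distance(v, vectors[chosen[k]]))
        for v in vectors
    ]
```

The list returned by `assign` plays the role of the per-allophone code book index: many allophones map to a much smaller set of stored parameter sets.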

18. In a method of converting text strings into synthetic speech, the steps comprising:
- defining a set of phonemes, including a multiplicity of consonant phonemes and a multiplicity of vowel phonemes;
  
  storing a multiplicity of predefined vowel allophones, each vowel allophone being represented by a set of formant parameters;
  
  denoting in a data structure an assigned one of said vowel allophones for every phoneme context LVR, where V represents any vowel phoneme selected from at least a subset of said multiplicity of vowel phonemes, L represents any consonant phoneme immediately preceding said vowel phoneme V selected from said predefined set of phonemes, and R represents any consonant phoneme immediately following said vowel phoneme V selected from said predefined set of phonemes;
  
  said data structure containing a distinct allophone assignment entry for each said phoneme context LVR;
  
  converting a specified text string into a corresponding string of phonemes, said string of phonemes including consonant and vowel phonemes, each said phoneme being selected from said defined set of phonemes; and
  
  for each vowel phoneme in at least a subset of said vowel phonemes in said string of phonemes, determining the phonemes in said string which immediately precede and follow said vowel phoneme in said string of phonemes, and then assigning said vowel phoneme the vowel allophone denoted in said data structure for said vowel phoneme in the context of said preceding and following phonemes.
- View Dependent Claims (19, 20, 21)
- - 19. The method of converting text strings into synthetic speech as set forth in claim 18, said storing step including the step of providing code book means for storing a multiplicity of sets of formant parameters, and allophone means for denoting, for each said vowel allophone, one of said multiplicity of sets of formant parameters in said code book means.
  - 20. The method of converting text strings into synthetic speech as set forth in claim 19, wherein the number of sets of formant parameters stored in said code book means is much less than said predefined multiplicity of vowel allophones;
    - the sets of formant parameters stored in said code book means being selected from sets of formant parameters representing substantially all of said vowel allophones using a minimax distortion vector quantization process.
  - 21. The method of converting text strings into synthetic speech as set forth in claim 18, said storing step storing vowel allophones as pronounced by a selected individual so that said method produces synthetic speech which mimics said selected individual speaking.

22. In a method of converting text strings into synthetic speech, the steps comprising:
- defining a set of phonemes, including a multiplicity of consonant phonemes and a multiplicity of vowel phonemes;
  
  storing a multiplicity of predefined vowel allophones, each vowel allophone being represented by a set of speech parameters;
  
  converting a specified text string into a corresponding string of phonemes, said string of phonemes including consonant and vowel phonemes, each said phoneme being selected from said defined set of phonemes; and
  
  for each of at least a subset of said vowel phonemes in said string of phonemes, computing a phoneme context value for said vowel phoneme as a function of the phonemes in said string of phonemes which precede and follow said vowel phoneme, and then assigning to said vowel phoneme a selected one of said predefined vowel allophones corresponding to said computed phoneme context value; and
  
  converting said string of phonemes, including said assigned vowel allophones, into speech parameters and then generating an audio waveform corresponding to said speech parameters.
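
Claim 22 speaks of computing a phoneme context value as a function of the neighboring phonemes. One simple function of that kind, shown purely as an assumption, packs the vowel and its two consonant neighbors into a single integer that indexes a flat table with one slot per context. The tiny phoneme inventory and the allophone number stored below are hypothetical.

```python
from typing import Dict, List

# Hypothetical, deliberately tiny phoneme inventory.
CONSONANTS: List[str] = ["B", "K", "T"]
VOWELS: List[str] = ["AE", "IY"]

C_INDEX: Dict[str, int] = {p: i for i, p in enumerate(CONSONANTS)}
V_INDEX: Dict[str, int] = {p: i for i, p in enumerate(VOWELS)}


def context_value(left: str, vowel: str, right: str) -> int:
    """Pack (L, V, R) into one integer; every context gets a distinct value."""
    n_c = len(CONSONANTS)
    return (V_INDEX[vowel] * n_c + C_INDEX[left]) * n_c + C_INDEX[right]


# Flat table with one slot per LVR context; 0 stands for a default allophone here.
ALLOPHONE_TABLE: List[int] = [0] * (len(VOWELS) * len(CONSONANTS) ** 2)
ALLOPHONE_TABLE[context_value("K", "AE", "T")] = 12  # hypothetical allophone number


def allophone_for(left: str, vowel: str, right: str) -> int:
    """Select the allophone number stored for the computed context value."""
    return ALLOPHONE_TABLE[context_value(left, vowel, right)]
```

A packed index like this keeps the lookup to a single array access, at the cost of reserving a slot for every possible context.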

23. A text-to-speech synthesis system, comprising:
- vowel allophone storage means storing a multiplicity of predefined vowel allophones, each vowel allophone being represented by a set of speech parameters;
  
  text conversion means for converting a specified text string into a corresponding string of consonant and vowel phonemes, each said phoneme being selected from a predefined set of phonemes including a multiplicity of consonant phonemes and a multiplicity of vowel phonemes;
  
  vowel phoneme to allophone conversion means, coupled to said text conversion means and said vowel allophone storage means, for computing a phoneme context value for each of at least a subset of said vowel phonemes in said string of phonemes, said phoneme context value comprising a function of the phonemes in said string of phonemes which precede and follow said vowel phoneme, and for then assigning to said vowel phoneme a selected one of said predefined vowel allophones corresponding to said computed phoneme context value;
  
  parameter generating means for generating speech parameters corresponding to said string of phonemes, including said speech parameters for said assigned vowel allophones; and
  
  speech synthesizing means for generating a speech waveform corresponding to the speech parameters generated by said parameter generating means.

Current Assignee: Nuance Communications, Inc. (Microsoft Corporation)
Original Assignee: Centigram Communications Corporation (TE Connectivity Limited)
Inventors: Williams, Linda D.; Groner, Gabriel F.; Malsheen, Bathsheba J.
Primary Examiner(s): KEMENY, EMANUEL
Application Number: US07/312,692
Time in Patent Office: 669 Days
Field of Search: 381/51, 381/52
US Class Current: 704/260
CPC Class Codes: G10L 13/08 Text analysis or generation...
