Method for forming the excitation signal for a glottal pulse model based parametric speech synthesis system

US 10,255,903 B2
Filed: 10/06/2015
Issued: 04/09/2019
Est. Priority Date: 05/28/2014
Status: Active Grant


Abstract

A system and method are presented for forming the excitation signal for a glottal pulse model based parametric speech synthesis system. The excitation signal may be formed by using a plurality of sub-band templates instead of a single one. The plurality of sub-band templates may be combined to form the excitation signal wherein the proportion in which the templates are added is dynamically based on determined energy coefficients. These coefficients vary from frame to frame and are learned, along with the spectral parameters, during feature training. The coefficients are appended to the feature vector, which comprises spectral parameters and is modeled using HMMs, and the excitation signal is determined.
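The weighted combination described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the two-band setup, the template shapes, the unit-energy normalization, and the square-root gain convention are all assumptions introduced here, and `combine_templates` is a hypothetical helper name.

```python
import numpy as np

def unit_energy(pulse):
    """Scale a pulse so that its sum of squares equals 1."""
    pulse = np.asarray(pulse, dtype=float)
    norm = np.sqrt(np.sum(pulse ** 2))
    return pulse / norm if norm > 0 else pulse

def combine_templates(templates, energies):
    """Weighted sum of equal-length sub-band templates for one frame.

    Assumed convention: each template is normalized to unit energy and
    scaled by sqrt(energy), so the scaled template for band k has
    energy energies[k].
    """
    normalized = [unit_energy(t) for t in templates]
    gains = np.sqrt(np.asarray(energies, dtype=float))
    return sum(g * t for g, t in zip(gains, normalized))

# Hypothetical templates: a slow low-band shape and a fast high-band shape.
n = np.arange(64)
low_template = np.hanning(64)
high_template = np.hanning(64) * np.cos(np.pi * n)  # sign alternates per sample

# Made-up energy coefficients for a single frame (low band, high band).
frame_energies = [0.8, 0.2]
pulse = combine_templates([low_template, high_template], frame_energies)
```

Because the coefficients vary from frame to frame, a synthesizer built along these lines would call `combine_templates` once per frame with that frame's generated coefficients, then place the resulting pulses according to the fundamental frequency.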

12 Claims

1. A method performed by a processing circuit for creating parametric models for use in training a speech synthesis system, wherein the system comprises at least a training text corpus, a speech database, and a model training module, the method comprising:
- a. obtaining, by the model training module, speech data from the speech database wherein the speech data comprises recorded speech signals and corresponding portions of the training text corpus;
  
  b. converting, by the model training module, the training text corpus into context dependent phone labels;
  
  c. extracting, by the model training module, for each frame of speech in the speech signal from the speech data, at least one of: spectral features, a plurality of band excitation energy coefficients, and fundamental frequency values using the context dependent phone labels;
  
  d. forming, by the model training module, a feature vector stream for each frame of speech in the speech signal from the speech data using the at least one of: the spectral features, the plurality of band excitation energy coefficients, and the fundamental frequency values;
  
  e. labeling, by the model training module, each frame of speech in the speech signal with the context dependent phone labels;
  
  f. extracting, by the model training module, durations of each of the context dependent phone labels from the labeled speech;
  
  g. forming, by the model training module, context dependent Hidden Markov Models (HMMs) using the feature vector streams and the context dependent phone labels from the labeled speech;
  
  h. performing, by a parameter generation module, parameter estimation of the speech signal, wherein the parameter estimation is performed comprising the feature vector streams, the HMMs, and decision trees;
  
  i. identifying a plurality of sub-band Eigen glottal pulses from the speech signal, wherein the sub-band Eigen glottal pulses comprise separate models used to form excitation during synthesis; and
  
  j. applying the identified plurality of sub-band Eigen glottal pulses from the speech signal to form an excitation signal, wherein the excitation signal is applied in the speech synthesis system to synthesize speech.
  - 2. The method of claim 1, wherein the spectral features are determined comprising the steps of:
    - a. determining an energy coefficient from the speech signal;
      
      b. pre-emphasizing the speech signal and determining mel-generalized cepstral (MGC) coefficients for each frame of the pre-emphasized speech signal;
      
      c. appending the energy coefficient and the MGC coefficients to form a MGC coefficient for each frame of the signal; and
      
      d. extracting spectral vectors for each frame.
  - 3. The method of claim 1, wherein the plurality of band excitation energy coefficients are determined comprising the steps of:
    - a. determining, from the speech signal, fundamental frequency values;
      
      b. performing pre-emphasis on the speech signal;
      
      c. performing linear predictive coding (LPC) Analysis on the pre-emphasized speech signal;
      
      d. performing inverse filtering on the speech signal and the LPC analyzed signal;
      
      e. segmenting glottal cycles using the fundamental frequency values and the inversely filtered speech signal;
      
      f. decomposing corresponding glottal cycles for each frame into sub-band components;
      
      g. computing energies of each sub-band component to form a plurality of energy coefficients for each frame; and
      
      h. using the energy coefficients to extract excitation vectors for each frame.
  - 4. The method of claim 3, wherein the sub-band components comprise at least 2 bands.
  - 5. The method of claim 4, wherein the sub-band components comprise at least a high band component and a low band component.
  - 6. The method of claim 1, wherein the identifying a plurality of sub-band Eigen glottal pulses further comprises the steps of:
    - a. creating a glottal pulse database using the speech data;
      
      b. decomposing each pulse into a plurality of sub-band components;
      
      c. dividing the sub-band components into a plurality of databases based on the decomposing;
      
      d. determining a vector representation of each database;
      
      e. determining Eigen pulse values, from the vector representation, for each database; and
      
      f. selecting a best Eigen pulse for each database for use in synthesis.
  - 7. The method of claim 6, wherein the plurality of sub-band components comprises low band and high band.
  - 8. The method of claim 6, wherein the glottal database is created by:
    - a. performing linear prediction analysis on a speech signal;
      
      b. performing inverse filtering of the signal to obtain an integrated linear prediction residual; and
      
      c. segmenting the integrated linear prediction residual into glottal cycles to obtain a number of glottal pulses.
  - 9. The method of claim 6, wherein the decomposing further comprises:
    - a. determining a cut off frequency, wherein said cut off frequency separates the sub-band components into groupings;
      
      b. obtaining a zero crossing at the edge of the low frequency bulge;
      
      c. placing zeros in the higher band region of the spectrum and obtaining the time domain version of the low frequency component of glottal pulse, wherein the obtaining comprises performing inverse FFT; and
      
      d. placing zeros in the lower band region of the spectrum prior to obtaining the time domain version of the high frequency component of the glottal pulse, wherein the obtaining comprises performing inverse FFT.
  - 10. The method of claim 9, wherein the groupings comprise a lower band grouping and a higher band grouping.
  - 11. The method of claim 9, wherein the separating of sub-band components into groupings is performed using a ZFR method and applied on the spectral magnitude.
  - 12. The method of claim 6, wherein the determining a vector representation of each database further comprises a set of distances from a set of fixed number of points of a metric space, obtained as centroids after a metric based clustering of a large set of signals from the metric space.
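The band splitting of claim 9, the sub-band energies of claim 3, and the Eigen pulse step of claim 6 can be sketched together as below. This is a rough illustration under several assumptions that are not in the claims: the cut-off is passed in as a fixed FFT bin rather than located from the zero crossing at the edge of the low frequency bulge, plain principal component analysis on the raw pulses stands in for the distance-to-centroid vector representation of claim 12, the "best" Eigen pulse is taken to be the first principal component, and the pulse database is synthetic. The function names are hypothetical.

```python
import numpy as np

def split_bands(pulse, cutoff_bin):
    """Split a pulse into low and high band parts by zeroing FFT bins."""
    spectrum = np.fft.rfft(pulse)
    low_spec = spectrum.copy()
    low_spec[cutoff_bin:] = 0.0   # zeros in the higher band region
    high_spec = spectrum.copy()
    high_spec[:cutoff_bin] = 0.0  # zeros in the lower band region
    n = len(pulse)
    # Inverse FFT gives the time domain version of each component.
    return np.fft.irfft(low_spec, n), np.fft.irfft(high_spec, n)

def band_energies(pulse, cutoff_bin):
    """Energy (sum of squares) of the low and high band components."""
    low, high = split_bands(pulse, cutoff_bin)
    return np.array([np.sum(low ** 2), np.sum(high ** 2)])

def first_eigen_pulse(pulses):
    """First principal direction of a (num_pulses, length) collection."""
    pulses = np.asarray(pulses, dtype=float)
    centered = pulses - pulses.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]  # unit-norm vector of the same length as a pulse

# Synthetic stand-in for a glottal pulse database (assumption).
rng = np.random.default_rng(0)
length, cutoff_bin = 64, 8
base = np.hanning(length)
database = np.array([
    amp * base + 0.05 * rng.standard_normal(length)
    for amp in rng.uniform(0.5, 1.5, 200)
])

# One database per sub-band, then one Eigen pulse per database.
low_db, high_db = zip(*(split_bands(p, cutoff_bin) for p in database))
low_eigen = first_eigen_pulse(low_db)
high_eigen = first_eigen_pulse(high_db)
energies = band_energies(database[0], cutoff_bin)
```

Zeroing complementary bin ranges means the low and high components always sum back to the original pulse, which makes the split easy to check before any modeling is done on the per-band databases.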

Current Assignee: Genesys Telecommunications Laboratories Incorporated (Genesys Cloud Services Incorporated)
Original Assignee: Interactive Intelligence Group Incorporated (Genesys Cloud Services Incorporated)
Inventors: Dachiraju, Rajesh; Raghavendra, E. Veera; Ganapathiraju, Aravind
Primary Examiner(s): Dorvil, Richemond
Assistant Examiner(s): Chavez, Rodrigo A
Application Number: US14/875,778
Publication Number: US 20160027430A1
Time in Patent Office: 1,281 Days
Field of Search: 704/7, 704/10, 704/201, 704/208, 704/210, 704/215, 704/224, 704/234, 704/248
CPC Class Codes:

G10L 13/02   Methods for producing synth...

G10L 13/027   Concept to speech synthesis...

G10L 13/06   Elementary speech units use...
