METHOD FOR FORMING THE EXCITATION SIGNAL FOR A GLOTTAL PULSE MODEL BASED PARAMETRIC SPEECH SYNTHESIS SYSTEM

US 20160027430A1
Filed: 10/06/2015
Published: 01/28/2016
Est. Priority Date: 05/28/2014
Status: Active Grant

Abstract

A system and method are presented for forming the excitation signal for a glottal pulse model based parametric speech synthesis system. The excitation signal may be formed by using a plurality of sub-band templates instead of a single one. The plurality of sub-band templates may be combined to form the excitation signal wherein the proportion in which the templates are added is dynamically based on determined energy coefficients. These coefficients vary from frame to frame and are learned, along with the spectral parameters, during feature training. The coefficients are appended to the feature vector, which comprises spectral parameters and is modeled using HMMs, and the excitation signal is determined.
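
The combination step described in the abstract can be sketched as follows. This is a minimal illustration, not the patented procedure: it assumes two sub-band templates of equal length and assumes each template is rescaled so that its energy equals the frame's energy coefficient for that band before the templates are added. The function names, the toy template values, and the square-root gain rule are all hypothetical.

```python
import math

def energy(x):
    """Sum of squared samples."""
    return sum(v * v for v in x)

def scale_to_energy(template, target_energy):
    """Rescale a template so its energy equals target_energy (assumed rule)."""
    e = energy(template)
    if e == 0.0:
        return [0.0] * len(template)
    gain = math.sqrt(target_energy / e)
    return [gain * v for v in template]

def form_excitation_pulse(low_template, high_template, low_energy, high_energy):
    """Add the two rescaled sub-band templates sample by sample."""
    low = scale_to_energy(low_template, low_energy)
    high = scale_to_energy(high_template, high_energy)
    return [a + b for a, b in zip(low, high)]

# Toy templates (illustrative values only, not taken from the source).
low_template = [0.0, 1.0, 2.0, 1.0, 0.0, -1.0, -2.0, -1.0]
high_template = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]

# Per-frame (low, high) energy coefficients, standing in for model output.
frame_coefficients = [(4.0, 1.0), (1.0, 4.0)]
pulses = [form_excitation_pulse(low_template, high_template, lo, hi)
          for lo, hi in frame_coefficients]
```

Because the coefficients change from frame to frame, the same pair of templates yields a differently balanced pulse for each frame.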

19 Claims

1. A method for creating parametric models for use in training a speech synthesis system, wherein the system comprises at least a training text corpus, a speech database, and a model training module, the method comprising:
- a. obtaining, by the model training module, speech data for the training text corpus, wherein the speech data comprises recorded speech signals and corresponding transcriptions;
  
  b. converting, by the model training module, the training text corpus into context dependent phone labels;
  
  c. extracting, by the model training module, for each frame of speech in the speech signal from the speech training database, at least one of: spectral features, a plurality of band excitation energy coefficients, and fundamental frequency values;
  
  d. forming, by the model training module, a feature vector stream for each frame of speech using the at least one of: spectral features, a plurality of band excitation energy coefficients, and fundamental frequency values;
  
  e. labeling speech with context dependent phones;
  
  f. extracting durations of each context dependent phone from the labelled speech;
  
  g. performing parameter estimation of the speech signal, wherein the parameter estimation is performed comprising the features, HMM, and decision trees; and
  
  h. identifying a plurality of sub-band Eigen glottal pulses, wherein the sub-band Eigen glottal pulses comprise separate models used to form excitation during synthesis.
  - 2. The method of claim 1, wherein the spectral features are determined comprising the steps of:
    - a. determining an energy coefficient from the speech signal;
      
      b. pre-emphasizing the speech signal and determining MGC coefficients for each frame of the pre-emphasized speech signal;
      
      c. appending the energy coefficient and the MGC coefficients to form an MGC for each frame of the signal; and
      
      d. extracting spectral vectors for each frame.
  - 3. The method of claim 1, wherein the plurality of band excitation energy coefficients are determined comprising the steps of:
    - a. determining, from the speech signal, fundamental frequency values;
      
      b. performing pre-emphasis on the speech signal;
      
      c. performing LPC analysis on the pre-emphasized speech signal;
      
      d. performing inverse filtering on the speech signal and the LPC analyzed signal;
      
      e. segmenting glottal cycles using the fundamental frequency values and the inversely filtered speech signal;
      
      f. decomposing corresponding glottal cycles for each frame into sub-band components;
      
      g. computing energies of each sub-band component to form a plurality of energy coefficients for each frame; and
      
      h. using the energy coefficients to extract excitation vectors for each frame.
  - 4. The method of claim 3, wherein the sub-band components comprise at least 2 bands.
  - 5. The method of claim 4, wherein the sub-band components comprise at least a high band component and a low band component.
  - 6. The method of claim 1, wherein the identifying a plurality of sub-band Eigen glottal pulses further comprises the steps of:
    - a. creating a glottal pulse database using the speech data;
      
      b. decomposing each pulse into a plurality of sub-band components;
      
      c. dividing the sub-band components into a plurality of databases based on the decomposing;
      
      d. determining a vector representation of each database;
      
      e. determining Eigen pulse values, from the vector representation, for each database; and
      
      f. selecting a best Eigen pulse for each database for use in synthesis.
  - 7. The method of claim 6, wherein the plurality of sub-band components comprises low band and high band.
  - 8. The method of claim 6, wherein the glottal pulse database is created by:
    - a. performing linear prediction analysis on a speech signal;
      
      b. performing inverse filtering of the signal to obtain an integrated linear prediction residual; and
      
      c. segmenting the integrated linear prediction residual into glottal cycles to obtain a number of glottal pulses.
  - 9. The method of claim 6, wherein the decomposing further comprises:
    - a. determining a cut off frequency, wherein said cut off frequency separates the sub-band components into groupings;
      
      b. obtaining a zero crossing at the edge of the low frequency bulge;
      
      c. placing zeros in the higher band region of the spectrum and obtaining the time domain version of the low frequency component of glottal pulse, wherein the obtaining comprises performing inverse FFT; and
      
      d. placing zeros in the lower band region of the spectrum prior to obtaining the time domain version of the high frequency component of the glottal pulse, wherein the obtaining comprises performing inverse FFT.
  - 10. The method of claim 9, wherein the groupings comprise a lower band grouping and a higher band grouping.
  - 11. The method of claim 9, wherein the separating of sub-band components into groupings is performed using a ZFR method with suitable window size and applied on the spectral magnitude.
  - 12. The method of claim 6, wherein the determining a vector representation of each database further comprises a set of distances from a set of fixed number of points of a metric space, obtained as centroids after a metric based clustering of a large set of signals from the metric space.
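
The pre-emphasis, linear prediction, and inverse filtering steps named in claims 3 and 8 can be sketched as below. This is one possible reading under stated assumptions: the prediction coefficients are estimated from the pre-emphasized signal with the autocorrelation method and Levinson-Durbin recursion, then applied to the original signal to obtain a residual. The pre-emphasis factor 0.97, the prediction order 4, and the synthetic test signal are assumptions; the claims do not specify them.

```python
import math

def pre_emphasize(signal, alpha=0.97):
    """y[n] = x[n] - alpha * x[n-1]; alpha is an assumed, commonly used value."""
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]

def autocorrelation(x, max_lag):
    """Autocorrelation values r[0..max_lag] of a finite signal."""
    return [sum(x[n] * x[n - lag] for n in range(lag, len(x)))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve for A(z) = 1 + a1 z^-1 + ... + ap z^-p from autocorrelation r."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)                # remaining prediction error
    return a, err

def inverse_filter(signal, a):
    """Residual e[n] = sum_k a[k] * x[n-k], treating samples before n=0 as zero."""
    return [sum(a[k] * signal[n - k] for k in range(len(a)) if n - k >= 0)
            for n in range(len(signal))]

# Synthetic damped oscillation standing in for a frame of voiced speech.
speech = [math.exp(-0.05 * n) * math.cos(0.3 * n) for n in range(200)]
emphasized = pre_emphasize(speech)
coeffs, _ = levinson_durbin(autocorrelation(emphasized, 4), 4)
residual = inverse_filter(speech, coeffs)
```

Segmenting the residual into glottal cycles would additionally need the fundamental frequency values, which this sketch leaves out.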

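The decomposition in claims 9 and 16 (zeroing one region of the spectrum and taking an inverse transform) and the per-band energies of claim 3 can be illustrated as follows. This sketch assumes the cut-off is already known and is given as a bin index (`cutoff_bin`, a hypothetical parameter); it does not attempt the ZFR-based search for the edge of the low frequency bulge. A naive DFT from the standard library is used in place of an FFT to keep the example self-contained.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform (stand-in for an FFT)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft_real(spectrum):
    """Inverse DFT, keeping the real part."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]

def split_bands(pulse, cutoff_bin):
    """Return (low, high) time-domain components of one glottal pulse."""
    n = len(pulse)
    spectrum = dft(pulse)
    low = [0j] * n
    high = [0j] * n
    for k in range(n):
        # Bins k and n-k carry the same frequency for a real signal.
        freq_index = min(k, n - k)
        if freq_index <= cutoff_bin:
            low[k] = spectrum[k]    # high band region stays zero
        else:
            high[k] = spectrum[k]   # low band region stays zero
    return idft_real(low), idft_real(high)

def band_energy(x):
    """Energy coefficient of one sub-band component."""
    return sum(v * v for v in x)

# Toy pulse: a slow component at bin 1 plus a weaker fast one at bin 6.
n = 16
pulse = [math.cos(2 * math.pi * t / n) + 0.5 * math.cos(2 * math.pi * 6 * t / n)
         for t in range(n)]
low, high = split_bands(pulse, cutoff_bin=3)
energies = (band_energy(low), band_energy(high))
```
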
13. A method for identification of sub-band Eigen pulses from a glottal pulse database for training a speech synthesis system, wherein the method comprises:
- a. receiving pulses from the glottal pulse database;
  
  b. decomposing each pulse into a plurality of sub-band components;
  
  c. dividing the sub-band components into a plurality of databases based on the decomposing;
  
  d. determining a vector representation of each database;
  
  e. determining Eigen pulse values, from the vector representation, for each database; and
  
  f. selecting a best Eigen pulse for each database for use in synthesis.
  - 14. The method of claim 13, wherein the plurality of sub-band components comprises a low band and a high band.
  - 15. The method of claim 13, wherein the glottal pulse database is created by:
    - a. performing linear prediction analysis on a speech signal;
      
      b. performing inverse filtering of the signal to obtain an integrated linear prediction residual; and
      
      c. segmenting the integrated linear prediction residual into glottal cycles to obtain a number of glottal pulses.
  - 16. The method of claim 13, wherein the decomposing further comprises:
    - a. determining a cut off frequency, wherein said cut off frequency separates the sub-band components into groupings;
      
      b. obtaining a zero crossing at the edge of the low frequency bulge;
      
      c. placing zeros in the high band region of the spectrum prior to obtaining the time domain version of the low frequency component of glottal pulse, wherein the obtaining comprises performing inverse FFT; and
      
      d. placing zeros in the lower band region of the spectrum prior to obtaining the time domain version of the high frequency component of the glottal pulse, wherein the obtaining comprises performing inverse FFT.
  - 17. The method of claim 16, wherein the groupings comprise a lower band grouping and a higher band grouping.
  - 18. The method of claim 16, wherein the separating of sub-band components into groupings is performed using a ZFR method with suitable window size and applied on the spectral magnitude.
  - 19. The method of claim 13, wherein the determining a vector representation of each database further comprises a set of distances from a set of fixed number of points of a metric space, obtained as centroids after a metric based clustering of a large set of signals from the metric space.
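
The vector representation of claim 19 and the Eigen pulse steps of claim 13 can be sketched roughly as below. Several parts are assumptions made for illustration: the metric is taken to be Euclidean, the centroids are supplied directly rather than obtained by clustering, the principal direction is found by power iteration on the covariance of the distance vectors, and the "best" pulse is taken to be the one with the largest absolute projection on that direction. The claims do not state the selection criterion, so `select_pulse` is hypothetical.

```python
import math

def euclidean(a, b):
    """Assumed metric between two equal-length pulses."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def distance_vectors(pulses, centroids):
    """Represent each pulse by its distances to a fixed set of centroids."""
    return [[euclidean(p, c) for c in centroids] for p in pulses]

def principal_direction(vectors, iterations=200):
    """First eigenvector of the covariance matrix, via power iteration."""
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    centered = [[v[i] - mean[i] for i in range(dim)] for v in vectors]
    cov = [[sum(c[i] * c[j] for c in centered) / len(centered)
            for j in range(dim)] for i in range(dim)]
    direction = [1.0] * dim
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * direction[j] for j in range(dim))
               for i in range(dim)]
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm == 0.0:
            break
        direction = [v / norm for v in nxt]
    return direction, centered

def select_pulse(pulses, centroids):
    """Hypothetical criterion: largest absolute projection on the direction."""
    vectors = distance_vectors(pulses, centroids)
    direction, centered = principal_direction(vectors)
    scores = [abs(sum(c[i] * direction[i] for i in range(len(direction))))
              for c in centered]
    best = max(range(len(pulses)), key=scores.__getitem__)
    return best, direction

# Toy sub-band database and centroids (illustrative values only).
pulses = [[0.0] * 4, [1.0] * 4, [0.5] * 4, [2.0] * 4]
centroids = [[0.0] * 4, [1.0] * 4]
best_index, direction = select_pulse(pulses, centroids)
```

In the claimed method this would be run once per sub-band database, giving one selected pulse per band for use in synthesis.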

Current Assignee: Genesys Cloud Services Incorporated
Original Assignee: Interactive Intelligence Group Incorporated (Genesys Cloud Services Incorporated)
Inventors: Dachiraju, Rajesh; Ganapathiraju, Aravind; Raghavendra, E. Veera
Granted Patent: US 10,255,903 B2
US Class Current: 1/1
CPC Class Codes:
G10L 13/02   Methods for producing synth...
G10L 13/027   Concept to speech synthesis...
G10L 13/06   Elementary speech units use...
