Method for forming the excitation signal for a glottal pulse model based parametric speech synthesis system

US 10,014,007 B2
Filed: 05/28/2014
Issued: 07/03/2018
Est. Priority Date: 05/28/2014
Status: Active Grant

6 Assignments

0 Petitions

Abstract

A method is presented for forming the excitation signal for a glottal pulse model based parametric speech synthesis system. In one embodiment, fundamental frequency values are used to form the excitation signal. The excitation is modeled using a voice source pulse selected from a database of a given speaker. The voice source signal is segmented into glottal segments, which are used in vector representation to identify the glottal pulse used for formation of the excitation signal. Use of a novel distance metric and preserving the original signals extracted from the speaker's voice samples helps capture low frequency information of the excitation signal. In addition, segment edge artifacts are removed by applying a unique segment joining method to improve the quality of synthetic speech while creating a true representation of the voice quality of a speaker.

35 Claims

1. A method performed by a generic computer processor for creating a glottal pulse database from a speech signal, in a speech synthesis system, wherein the system comprises at least a speech database, the method comprising the steps of:
- a. pre-emphasizing the speech signal to obtain a pre-filtered signal;
  
  b. analyzing the pre-filtered signal, using linear prediction, to obtain inverse filtering parameters;
  
  c. performing inverse filtering of the speech signal using the inverse filtering parameters;
  
  d. determining an integrated linear prediction residual signal using the inversely filtered speech signal;
  
  e. identifying glottal segment boundaries in the speech signal;
  
  f. segmenting the integrated linear prediction residual signal into glottal pulses using the identified glottal segment boundaries from the speech signal;
  
  g. normalizing the glottal pulses;
  
  h. forming the glottal pulse database by collecting all normalized glottal pulses obtained for the speech signal; and
  
  i. applying the formed glottal pulse database to form an excitation signal, wherein the excitation signal is applied in the speech synthesis system to synthesize speech.
- View Dependent Claims (2, 3)
- - 2. The method of claim 1, wherein the inverse filtering parameters in step (b) comprise linear prediction coefficients.
  - 3. The method of claim 1, wherein the identifying of step (e) is performed using Zero Frequency Filtering technique.
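
The database-building steps of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. The function names, the pre-emphasis coefficient `alpha=0.97`, the LP order, the use of a single analysis window over the whole signal, and unit-energy normalization are all assumptions made for illustration. The glottal segment boundaries are taken as an input (for example from Zero Frequency Filtering, which is not implemented here), and the inversely filtered original speech is treated as the integrated linear prediction residual.

```python
import numpy as np

def pre_emphasize(x, alpha=0.97):
    # Step (a): y[n] = x[n] - alpha * x[n-1]; alpha is an assumed value.
    return np.append(x[0], x[1:] - alpha * x[:-1])

def lpc(x, order):
    # Step (b): linear prediction by the autocorrelation method,
    # solved with the Levinson-Durbin recursion.
    # Returns a[0..order] with a[0] == 1 (inverse filter coefficients).
    r = np.correlate(x, x, mode="full")[len(x) - 1 : len(x) + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1 : 0 : -1])
        k = -acc / err
        a_prev = a.copy()
        a[1:i] = a_prev[1:i] + k * a_prev[i - 1 : 0 : -1]
        a[i] = k
        err *= 1.0 - k * k
    return a

def build_pulse_database(speech, boundaries, order=16, alpha=0.97):
    # boundaries: sample indices of glottal segment boundaries (assumed given).
    pre = pre_emphasize(speech, alpha)
    a = lpc(pre, order)
    # Steps (c)-(d): inverse filter the original (not pre-emphasized) speech
    # with coefficients estimated from the pre-emphasized signal.
    ilpr = np.convolve(speech, a)[: len(speech)]
    pulses = []
    # Steps (f)-(h): cut at the boundaries, normalize, collect.
    for start, end in zip(boundaries[:-1], boundaries[1:]):
        seg = ilpr[start:end]
        norm = np.linalg.norm(seg)
        if norm > 0:
            pulses.append(seg / norm)  # unit-energy normalization (assumed)
    return pulses
```

A practical system would likely estimate the linear prediction coefficients frame by frame rather than once over the whole recording; the single-window version is used here only to keep the sketch short.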

4. A method for creating parametric models for use in training a speech synthesis system performed by a generic computer processor, wherein the system comprises at least a glottal pulse database, the method comprising the steps of:
- a. determining a glottal pulse distance metric between a number of glottal pulses;
  
  b. clustering the glottal pulse database into a number of clusters to determine centroid glottal pulses;
  
  c. forming a corresponding vector database by associating a vector with each glottal pulse in the glottal pulse database, wherein the centroid glottal pulses and the distance metric are defined mathematically to determine association;
  
  d. determining Eigenvectors of the vector database;
  
  e. creating parametric models by associating a glottal pulse from the glottal pulse database to each determined Eigenvector; and
  
  f. applying the created parametric models to the speech synthesis system in order to train the speech synthesis system.
- View Dependent Claims (5, 6, 7, 8, 9, 10, 11, 12, 13, 14)
- - 5. The method of claim 4, wherein the number of glottal pulses is two.
  - 6. The method of claim 4, wherein step (a) further comprises the steps of:
    - a. de-composing the number of glottal pulses into corresponding sub-band components;
      
      b. computing a sub-band distance metric between the corresponding sub-band components of each glottal pulse; and
      
      c. computing the glottal pulse distance metric mathematically using the sub-band distance metrics.
  - 7. The method of claim 6, wherein the computing of step (c) is performed using the mathematical equation:
  - 8. The method of claim 4, wherein the number of clusters is 256.
  - 9. The method of claim 4, wherein the clustering of step (b) is performed using a modified k-means calculation that utilizes the glottal pulse distance metric.
  - 10. The method of claim 9, wherein the modified k-means calculation further comprises updating a centroid of a cluster with an element of the cluster whose sum of squares of distances from all other elements of that cluster is minimum.
  - 11. The method of claim 10, further comprising terminating the clustering iterations when there is no shift in any of the centroids from the clusters.
  - 12. The method of claim 4, wherein the determining of Eigenvectors of step (d) is performed using Principal Component Analysis.
  - 13. The method of claim 4, wherein step (e) further comprises the steps of:
    - a. determining the Eigenvector;
      
      b. determining the closest matching vector from the vector database to the Eigenvector;
      
      c. determining the closest matching glottal pulse from the glottal pulse database; and
      
      d. naming the glottal pulse from the glottal pulse database that is the closest match to the Eigenvector as the Eigen glottal pulse associated with the Eigenvector.
  - 14. The method of claim 4, wherein the training further comprises the steps of:
    - a. defining a training text corpus;
      
      b. obtaining speech data by recording a voice talent speaking the training text;
      
      c. converting the training text into context dependent phone labels;
      
      d. determining the spectral features of the speech data using the phone labels;
      
      e. estimating fundamental frequency of the speech data; and
      
      f. performing parameter estimation on an audio stream using the spectral features, the fundamental frequency, and the duration of the audio stream.
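
The clustering and Eigenvector steps of claims 4 and 9 to 13 can be sketched as follows. This is a hedged illustration, not the claimed method in full: the glottal pulse distance metric of claims 6 and 7 is not reproduced in this record, so the sketch takes a precomputed N x N distance matrix `D` as input. Representing each pulse by its vector of distances to the centroid pulses, and matching an Eigenvector to the database by Euclidean distance, are assumptions; the function names are hypothetical.

```python
import numpy as np

def cluster_medoids(D, k, max_iter=100, seed=0):
    # Modified k-means on a precomputed distance matrix D (N x N).
    # Each centroid is always an actual element of the database.
    rng = np.random.default_rng(seed)
    centroids = rng.choice(len(D), size=k, replace=False)
    for _ in range(max_iter):
        labels = np.argmin(D[:, centroids], axis=1)
        new = centroids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size == 0:
                continue  # keep the old centroid for an empty cluster
            # Claim 10: pick the member whose sum of squared distances
            # to all other members of the cluster is minimum.
            cost = (D[np.ix_(members, members)] ** 2).sum(axis=1)
            new[c] = members[np.argmin(cost)]
        if np.array_equal(new, centroids):
            break  # claim 11: stop when no centroid shifts
        centroids = new
    labels = np.argmin(D[:, centroids], axis=1)
    return centroids, labels

def eigen_pulse_indices(D, centroids, n_components):
    # Vector database (assumed form): row i holds the distances from
    # pulse i to every centroid pulse. Requires at least two centroids.
    V = D[:, centroids]
    cov = np.cov(V - V.mean(axis=0), rowvar=False)
    w, E = np.linalg.eigh(cov)            # PCA via eigendecomposition
    top = np.argsort(w)[::-1][:n_components]
    picks = []
    for j in top:
        # Closest vector in the database to this Eigenvector; the pulse
        # with the same index is taken as the Eigen glottal pulse.
        picks.append(int(np.argmin(np.linalg.norm(V - E[:, j], axis=1))))
    return picks
```

With the 256 clusters of claim 8, `V` would have 256 columns. Because the centroid update only ever selects existing pulses, the approach works with any distance matrix and never needs to average pulses of different lengths.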

15. A method to synthesize speech performed by a generic computer processor using input text, comprising the steps of:
- a. converting the input text into context dependent phone labels;
  
  b. processing the phone labels created in step (a) using trained parametric models to predict fundamental frequency values, duration of the speech synthesized, and spectral features of the phone labels;
  
  c. creating an excitation signal using an Eigen glottal pulse and said predicted one or more of: fundamental frequency values, spectral features of phone labels, and duration of the speech synthesized;
  
  d. dividing signal regions of excitation into categories of segments comprising one or more of: voiced, unvoiced, and/or pause;
  
  e. creating an excitation for each category;
  
  f. combining the excitation signal with the spectral features of the phone labels using a filter to create synthetic speech output.
- View Dependent Claims (16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35)
- - 16. The method of claim 15, wherein the dividing is performed based on the fundamental frequency value.
  - 17. The method of claim 15, wherein the filter of step (d) comprises a Mel Log Spectrum Approximation filter.
  - 18. The method of claim 15, wherein the step of creating an excitation signal comprises placing white noise in the unvoiced segments.
  - 19. The method of claim 15, wherein the step of creating an excitation signal for pause segments comprises placing a zero in the segment.
  - 20. The method of claim 15, wherein the excitation signal is created for voiced segments comprising the steps of:
    - a. creating glottal boundaries, using the predicted fundamental frequency value from a model, wherein the glottal boundaries mark pitch boundaries of the excitation signal;
      
      b. adding a glottal pulse beginning at each glottal boundary using an overlap add method;
      
      c. avoiding boundary effects in the excitation signal wherein the avoiding further comprises the steps of;
      
      i. creating a number of different excitations formed through the overlap add method with a constantly increasing amount of shifts in the glottal boundaries and an equal amount of circular left shift for the glottal pulse, wherein if the glottal pulse is of a length less than the corresponding pitch period, then the glottal pulse is zero extended to the length of pitch period prior to the left shift,
      
      ii. determining the arithmetic mean of the number of different excitation signals, and
      
      iii. declaring the arithmetic mean the final excitation signal for the voiced segment.
  - 21. The method of claim 15, wherein the Eigen glottal pulse is identified from a glottal pulse database, the identification comprising the steps of:
    - a. computing a glottal pulse distance metric between a number of glottal pulses;
      
      b. clustering the glottal pulse database into a number of clusters to determine centroid glottal pulses;
      
      c. forming a corresponding vector database by associating a vector with each glottal pulse in the glottal pulse database, wherein the centroid glottal pulses and the distance metric is defined mathematically to determine association;
      
      d. determining Eigenvectors of the vector database; and
      
      e. forming parametric models by associating a glottal pulse from the glottal pulse database to each determined Eigenvector to form parametric models.
  - 22. The method of claim 21, wherein the number of glottal pulses is two.
  - 23. The method of claim 21, wherein step (a) further comprises the steps of:
    - a. de-composing the number of glottal pulses into corresponding sub-band components;
      
      b. computing a sub-band distance metric between the corresponding sub-band components of each glottal pulse; and
      
      c. computing the distance metric mathematically using the sub-band distance metrics.
  - 24. The method of claim 23, wherein the computing of step (c) is performed using the mathematical equation:
  - 25. The method of claim 21, wherein the number of clusters is 256.
  - 26. The method of claim 21, wherein the clustering of step (b) is performed using a modified k-means calculation that utilizes the glottal pulse distance metric.
  - 27. The method of claim 26, wherein the modified k-means calculation further comprises updating a centroid of a cluster with an element of the cluster whose sum of squares of distances from all other elements of that cluster is minimum.
  - 28. The method of claim 27, further comprising terminating the clustering iterations when there is no shift in any of the centroids from the clusters.
  - 29. The method of claim 21, wherein the determining of Eigenvectors of step (d) is performed using Principal Component Analysis.
  - 30. The method of claim 21, wherein step (e) further comprises the steps of:
    - a. determining the Eigenvector;
      
      b. determining the closest matching vector from the vector database to the Eigenvector;
      
      c. determining the closest matching glottal pulse from the glottal pulse database; and
      
      d. naming the glottal pulse from the glottal pulse database that is the closest match to the Eigenvector as the Eigen glottal pulse associated with the Eigenvector.
  - 31. The method of claim 21, further comprising building the glottal pulse database from a speech signal, the building comprising the steps of:
    - a. performing pre-filtering of the speech signal to obtain a pre-filtered signal;
      
      b. analyzing the pre-filtered signal to obtain inverse filtering parameters;
      
      c. performing inverse filtering of the speech signal using the inverse filtering parameters;
      
      d. computing an integrated linear prediction residual signal using the inversely filtered speech signal;
      
      e. identifying glottal segment boundaries in the speech signal;
      
      f. segmenting the integrated linear prediction residual signal into glottal pulses using the identified glottal segment boundaries from the speech signal;
      
      g. performing normalization of the glottal pulses; and
      
      h. forming the glottal pulse database by collecting all normalized glottal pulses obtained for the speech signal.
  - 32. The method of claim 31, wherein the analysis of step (b) is performed using linear prediction.
  - 33. The method of claim 31, wherein the inverse filtering parameters in step (b) comprise linear prediction coefficients.
  - 34. The method of claim 31, wherein the identifying of step (e) is performed using Zero Frequency Filtering technique.
  - 35. The method of claim 31, wherein the pre-filtering of step (a) comprises pre-emphasis.
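
The excitation construction of claims 15 and 18 to 20 can be sketched in Python as below. The sketch rests on several assumptions that the claims do not fix: segments arrive already labelled as voiced, unvoiced, or pause; F0 is given per frame in Hz; shifts increase one sample at a time; and the number of shifted versions (`n_shifts=8`) is arbitrary. All function names are hypothetical, and the final filtering step (for example a Mel Log Spectrum Approximation filter) is not shown.

```python
import numpy as np

def glottal_boundaries(f0, fs, frame_shift, n_samples):
    # Claim 20(a): step one pitch period at a time along the F0 contour.
    marks, t = [], 0
    while t < n_samples:
        marks.append(t)
        frame = min(t // frame_shift, len(f0) - 1)
        t += max(1, int(round(fs / f0[frame])))
    return marks

def overlap_add(pulse, marks, n_samples, shift=0):
    # Claim 20(b): add one glottal pulse at each (shifted) boundary.
    out = np.zeros(n_samples)
    for i, m in enumerate(marks):
        nxt = marks[i + 1] if i + 1 < len(marks) else n_samples
        period = nxt - m
        p = pulse
        if len(p) < period:
            p = np.pad(p, (0, period - len(p)))  # zero-extend to the period
        p = np.roll(p, -shift)                   # circular left shift
        start = m + shift
        end = min(start + len(p), n_samples)
        if start < n_samples:
            out[start:end] += p[: end - start]
    return out

def voiced_excitation(pulse, f0, fs, frame_shift, n_samples, n_shifts=8):
    # Claim 20(c): average several excitations built with increasing shifts.
    marks = glottal_boundaries(f0, fs, frame_shift, n_samples)
    versions = [overlap_add(pulse, marks, n_samples, s) for s in range(n_shifts)]
    return np.mean(versions, axis=0)

def make_excitation(segments, pulse, fs, frame_shift, seed=0):
    # segments: list of (kind, n_samples, f0_frames); f0_frames is used
    # only for voiced segments.
    rng = np.random.default_rng(seed)
    parts = []
    for kind, n, f0 in segments:
        if kind == "voiced":
            parts.append(voiced_excitation(pulse, f0, fs, frame_shift, n))
        elif kind == "unvoiced":
            parts.append(rng.standard_normal(n))  # claim 18: white noise
        else:
            parts.append(np.zeros(n))             # claim 19: zeros for pauses
    return np.concatenate(parts)
```

In the interior of a steady voiced segment every shifted version is identical, because the part of the pulse that wraps around lands at the start of the next pitch period. The versions differ only near the segment edges, which is where the averaging smooths the excitation.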

Current Assignee
Genesys Cloud Services Incorporated
Original Assignee
Interactive Intelligence Incorporated (Genesys Cloud Services Incorporated)
Inventors
Dachiraju, Rajesh; Ganapathiraju, Aravind
Primary Examiner(s)
Shah, Paras D
Assistant Examiner(s)
Chavez, Rodrigo A

Application Number

US14/288,745
Publication Number

US 20150348535A1
Time in Patent Office

1,497 Days
Field of Search

704/7, 704/10, 704/201, 704/208, 704/210, 704/215, 704/224, 704/234, 704/248
US Class Current
CPC Class Codes

G10L 13/02 Methods for producing synth...

G10L 25/90 Pitch determination of spee...
