Speech processing system

US 9,466,285 B2
Filed: 11/26/2013
Issued: 10/11/2016
Est. Priority Date: 11/30/2012
Status: Expired due to Fees

Abstract

A method of deriving speech synthesis parameters from an input speech audio signal, wherein the audio signal is segmented on the basis of estimated positions of glottal closure incidents and the resulting segments are processed to obtain the complex cepstrum used to derive a synthesis filter. A reconstructed speech signal is produced by passing a pulsed excitation signal derived from the position of the glottal closure incidents through the synthesis filter, and compared with the input speech audio signal. The pulsed excitation signal and the complex cepstrum are then iteratively modified to minimize the difference between the reconstructed speech signal and the input speech audio signal, by optimizing the position of the pulses in the excitation signal to reduce the mean squared error between the reconstructed speech signal and the input speech audio signal, and recalculating the complex cepstrum using the optimized pulse positions.
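
As a rough illustration of the analysis step described in the abstract, the following Python sketch computes a complex cepstrum for one segment and converts it back into a synthesis-filter impulse response. It is not taken from the patent: the function names (`complex_cepstrum`, `synthesis_filter`, `reconstruct`), the even FFT size, and the handling of the linear-phase term are assumptions made for this example, and a single fixed filter is used where the patent describes segment-by-segment processing.

```python
import numpy as np

def complex_cepstrum(segment, nfft):
    """Return (cepstrum, lin) for one real-valued segment; nfft is assumed even.

    `lin` is the integer linear-phase term removed before the inverse FFT.
    The spectrum must have no zero-magnitude bins, or the log is undefined.
    """
    spectrum = np.fft.fft(segment, nfft)
    phase = np.unwrap(np.angle(spectrum))
    half = nfft // 2
    lin = int(round(phase[half] / np.pi))
    # Subtract the linear-phase ramp so the unwrapped phase ends at zero.
    phase = phase - np.pi * lin * np.arange(nfft) / half
    log_spectrum = np.log(np.abs(spectrum)) + 1j * phase
    return np.fft.ifft(log_spectrum).real, lin

def synthesis_filter(cepstrum, lin=0):
    """Impulse response whose log spectrum is FFT(cepstrum) plus the ramp."""
    nfft = len(cepstrum)
    ramp = 1j * np.pi * lin * np.arange(nfft) / (nfft // 2)
    return np.fft.ifft(np.exp(np.fft.fft(cepstrum) + ramp)).real

def reconstruct(pulse_positions, response, length):
    """Pass a unit-pulse excitation through a fixed filter (toy version)."""
    excitation = np.zeros(length)
    excitation[np.asarray(pulse_positions, dtype=int)] = 1.0
    return np.convolve(excitation, response)[:length]

if __name__ == "__main__":
    segment = 0.5 ** np.arange(32)           # hypothetical test segment
    cepstrum, lin = complex_cepstrum(segment, 256)
    response = synthesis_filter(cepstrum, lin)
    speech = reconstruct([0, 40], response[:32], 80)
    print(cepstrum[:4], lin, speech.shape)
```

For the decaying exponential `0.5 ** n` used in the demo, the cepstral coefficients at positive quefrencies follow the textbook values `0.5 ** n / n`, which gives a convenient sanity check for this kind of routine.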

14 Claims

1. A method of deriving speech synthesis parameters from an audio signal, the method performed in a device comprising a processor, the method comprising:
- receiving an input speech audio signal;
  
  estimating a position of glottal closure incidents from said input speech audio signal;
  
  deriving a pulsed excitation signal from the position of the glottal closure incidents;
  
  segmenting said audio signal on the basis of said glottal closure incidents, to obtain segments of said input speech audio signal;
  
  processing the segments of the input speech audio to obtain a complex cepstrum and deriving a synthesis filter from said complex cepstrum;
  
  producing a reconstructed speech signal based on the input speech audio signal by passing the pulsed excitation signal derived from the position of the glottal closure incidents through said synthesis filter derived from said complex cepstrum;
  
  comparing said reconstructed speech signal with said input speech audio signal;
  
  calculating a difference between the reconstructed speech signal and the input speech audio signal and modifying the pulsed excitation signal and the complex cepstrum to reduce the difference between the reconstructed speech signal and the input speech audio signal, wherein modifying the pulsed excitation signal and the complex cepstrum comprises the process of:
  
  optimizing the position of the pulses in said excitation signal to reduce a mean squared error between the reconstructed speech signal and the input speech audio signal;
  
  recalculating the complex cepstrum by optimizing the complex cepstrum by minimizing the difference between the reconstructed speech signal and the input speech audio signal using the optimized pulse positions; and
  
  repeating the process to derive as said speech synthesis parameters the position of the pulses and the complex cepstrum resulting in a minimum difference between the reconstructed speech signal and the input speech audio signal.
- Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 14)
  - 2. A method according to claim 1, wherein the difference between the reconstructed speech signal and the input speech audio signal is calculated using the mean squared error.
  - 3. A method according to claim 1, wherein the pulse height a_z is set such that a_z=0 if a_z<0 and a_z=1 if a_z>0 before recalculation of the complex cepstrum.
  - 4. A method according to claim 1, wherein optimizing the complex cepstrum is performed using a gradient method.
  - 5. A method according to claim 1, further comprising decomposing the complex cepstrum into phase and minimum phase cepstral components.
  - 6. A method of vocal analysis, the method comprising extracting speech synthesis parameters from an input signal in a method according to claim 1, and comparing the complex cepstrum with threshold parameters.
  - 7. A method of training a speech synthesiser, the synthesiser comprising a source filter model for modelling speech using an excitation signal and a synthesis filter, the method comprising training the synthesis parameters by deriving speech synthesis parameters from an input signal using a method according to claim 1, the method further comprising storing the position of the pulses and the complex cepstrum resulting in said minimum difference in a memory as the speech synthesis parameters derived from the input signal.
  - 8. A method according to claim 7, the method further comprising training the synthesiser by receiving input text and speech, the method comprising extracting labels from the input text, and relating derived speech parameters to said labels via probability density functions.
  - 9. A text to speech method, the method comprising:
    - receiving input text;
      
      extracting labels from said input text;
      
      using said labels to extract speech parameters which have been stored in a memory; and
      
      generating a speech signal from said extracted speech parameters wherein said speech signal is generated using a source filter model which produces speech using an excitation signal and a synthesis filter, said speech parameters comprising complex cepstrum parameters, wherein said complex cepstrum parameters which are stored in said memory have been derived using the method of claim 1.
  - 10. A text to speech method according to claim 9, wherein said complex cepstrum parameters are stored in said memory as minimum phase cepstrum parameters and phase parameters, the method being configured to produce said excitation signal using said phase parameters and said synthesis filter using said minimum phase cepstrum parameters.
  - 12. A text to speech system, the system comprising a memory and a processor adapted to:
    - receive input text;
      
      extract labels from said input text;
      
      use said labels to extract speech parameters which have been stored in the memory; and
      
      generate a speech signal from said extracted speech parameters wherein said speech signal is generated using a source filter model which produces speech using an excitation signal and a synthesis filter, said speech parameters comprising complex cepstrum parameters, wherein said complex cepstrum parameters which are stored in said memory have been derived using the method of claim 1.
  - 13. A non-transitory computer readable medium comprising computer readable code configured to cause a computer to perform the method of claim 1.
  - 14. A non-transitory computer readable medium comprising computer readable code configured to cause a computer to perform the method of claim 9.
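
Claims 3, 5 and 10 refer to clamping pulse heights and to splitting the complex cepstrum into a minimum phase part and a phase part. The Python sketch below shows one common way such a split can be done, by folding the negative-quefrency coefficients onto the positive side and keeping an odd-symmetric remainder as the phase (all-pass) part. The storage layout (negative quefrencies at the end of the array, as in an FFT buffer), the function names, and the choice to leave the middle coefficient of an even-length buffer in the minimum phase part are assumptions of this example; the claims listed here do not specify them.

```python
import numpy as np

def clamp_pulse_heights(heights):
    """Claim 3 as read here: heights below 0 become 0, heights above 0 become 1."""
    return (np.asarray(heights, dtype=float) > 0).astype(float)

def split_cepstrum(cepstrum):
    """Split a complex cepstrum into minimum phase and all-pass cepstra.

    cepstrum[n] holds quefrency n and cepstrum[-n] holds quefrency -n.
    The two returned arrays sum to the input.
    """
    c = np.asarray(cepstrum, dtype=float)
    n = len(c)
    pos = np.arange(1, (n + 1) // 2)   # positive quefrencies
    neg = n - pos                      # matching negative quefrencies

    c_min = np.zeros(n)
    c_min[0] = c[0]
    c_min[pos] = c[pos] + c[neg]       # fold the anti-causal part forward
    if n % 2 == 0:
        c_min[n // 2] = c[n // 2]      # unpaired middle coefficient (assumption)

    c_allpass = np.zeros(n)
    c_allpass[pos] = -c[neg]           # odd-symmetric sequence, so its
    c_allpass[neg] = c[neg]            # spectrum has unit magnitude
    return c_min, c_allpass

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cepstrum = np.zeros(64)            # hypothetical order-10 cepstrum
    cepstrum[:11] = rng.normal(scale=0.1, size=11)
    cepstrum[-10:] = rng.normal(scale=0.1, size=10)
    c_min, c_allpass = split_cepstrum(cepstrum)
    print(np.abs(np.exp(np.fft.fft(c_allpass))).round(6)[:4])
```

In this sketch the minimum phase cepstrum carries the whole magnitude response, while the all-pass cepstrum only changes phase, which lines up with claim 10's use of the phase parameters for the excitation and the minimum phase parameters for the synthesis filter.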

11. A system for extracting speech synthesis parameters from an audio signal, the system comprising a processor adapted to:
- receive an input speech audio signal;
  
  estimate a position of glottal closure incidents from said input speech audio signal;
  
  derive a pulsed excitation signal from the position of the glottal closure incidents;
  
  segment said input speech audio signal on the basis of said glottal closure incidents, to obtain segments of said input speech audio signal;
  
  process the segments of the input speech audio signal to obtain a complex cepstrum and derive a synthesis filter from said complex cepstrum;
  
  produce a reconstructed speech signal by passing the pulsed excitation signal derived from the position of the glottal closure incidents through said synthesis filter derived from said complex cepstrum;
  
  compare said reconstructed speech signal with said input speech audio signal;
  
  calculate a difference between the reconstructed speech signal and the input speech audio signal; and
  
  modify the pulsed excitation signal and the complex cepstrum to reduce the difference between the reconstructed speech signal and the input speech audio signal by executing a process comprising:
  
  optimizing the position of the pulses in said excitation signal to reduce a mean squared error between the reconstructed speech signal and the input speech audio signal;
  
  recalculating the complex cepstrum by optimizing the complex cepstrum by minimizing the difference between the reconstructed speech signal and the input speech audio signal using the optimized pulse positions; and
  
  repeating the process to derive as said speech synthesis parameters the position of the pulses and the complex cepstrum resulting in a minimum difference between the reconstructed speech signal and the input speech audio signal.
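
The alternating procedure recited in claims 1 and 11 (optimize pulse positions, recalculate the complex cepstrum, repeat) can be illustrated with a deliberately small Python toy. This is a sketch under strong simplifying assumptions and not the patented implementation: it uses a single unit pulse, circular filtering, an exhaustive local search for the pulse position, and plain fixed-step gradient descent as one possible "gradient method" (claim 4). All names, the cepstral order, the search radius and the step size are hypothetical.

```python
import numpy as np

def reconstruct(cepstrum, pulse_pos):
    """One unit pulse at pulse_pos, circularly filtered by exp(FFT(cepstrum))."""
    response = np.fft.ifft(np.exp(np.fft.fft(cepstrum))).real
    return np.roll(response, pulse_pos)

def mse(target, cepstrum, pulse_pos):
    return float(np.mean((target - reconstruct(cepstrum, pulse_pos)) ** 2))

def best_pulse_position(target, cepstrum, pulse_pos, radius):
    """Try every position within `radius` samples and keep the lowest MSE."""
    n = len(target)
    candidates = [(pulse_pos + d) % n for d in range(-radius, radius + 1)]
    return min(candidates, key=lambda q: mse(target, cepstrum, q))

def squared_error_gradient(target, cepstrum, pulse_pos):
    """Gradient of the summed squared error (N times the MSE gradient).

    Shifting the output by m samples is its derivative with respect to
    cepstrum[m], so the gradient is a circular cross-correlation.
    """
    y = reconstruct(cepstrum, pulse_pos)
    err = target - y
    corr = np.fft.ifft(np.fft.fft(err) * np.conj(np.fft.fft(y))).real
    return -2.0 * corr

def fit(target, order, start_pos, radius=4, outer=4, inner=50, step=0.05):
    n = len(target)
    mask = np.zeros(n)                 # only low quefrencies are updated
    mask[: order + 1] = 1.0
    mask[n - order:] = 1.0
    cepstrum = np.zeros(n)
    pulse_pos = start_pos
    history = [mse(target, cepstrum, pulse_pos)]
    for _ in range(outer):
        # Step 1: re-optimize the pulse position with the cepstrum fixed.
        pulse_pos = best_pulse_position(target, cepstrum, pulse_pos, radius)
        # Step 2: update the cepstrum with the pulse position fixed.
        for _ in range(inner):
            grad = squared_error_gradient(target, cepstrum, pulse_pos)
            cepstrum = cepstrum - step * mask * grad
        history.append(mse(target, cepstrum, pulse_pos))
    return pulse_pos, cepstrum, history

if __name__ == "__main__":
    true_cepstrum = np.zeros(64)       # hypothetical mixed-phase target
    true_cepstrum[[1, 2, -1]] = [0.5, 0.1, 0.2]
    target = reconstruct(true_cepstrum, 20)
    pos, cep, history = fit(target, order=4, start_pos=22)
    print(pos, history[0], history[-1])
```

The fixed step size is only reasonable for small, well-scaled toy signals like the one in the demo; a practical system would need a line search or a more robust optimizer, and would handle many pulses and a time-varying filter.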

Current Assignee
Kabushiki Kaisha Toshiba (Toshiba Corporation)
Original Assignee
Kabushiki Kaisha Toshiba (Toshiba Corporation)
Inventors
Maia, Ranniery
Primary Examiner(s)
Dorvil, Richemond
Assistant Examiner(s)
Le, Thuykhanh

Application Number
US14/090,379
Publication Number
US 20140156280A1
Time in Patent Office
1,050 Days
Field of Search
None
US Class Current
1/1
CPC Class Codes
G10L 13/02 Methods for producing synth...
G10L 13/08 Text analysis or generation...
