SPEECH PROCESSING SYSTEM

US 20140156280A1
Filed: 11/26/2013
Published: 06/05/2014
Est. Priority Date: 11/30/2012
Status: Active Grant

First Claim

1. A method of deriving speech synthesis parameters from an audio signal, the method comprising:

receiving an input speech signal;

estimating the position of glottal closure incidents from said audio signal;

deriving a pulsed excitation signal from the position of the glottal closure incidents;

segmenting said audio signal on the basis of said glottal closure incidents, to obtain segments of said audio signal;

processing the segments of the audio signal to obtain the complex cepstrum and deriving a synthesis filter from said complex cepstrum;

reconstructing said speech audio signal to produce a reconstructed speech signal using an excitation model where the pulsed excitation signal is passed through said synthesis filter;

comparing said reconstructed speech signal with said input speech signal; and

calculating the difference between the reconstructed speech signal and the input speech signal and modifying either the pulsed excitation signal or the complex cepstrum to reduce the difference between the reconstructed speech signal and the input speech.

Abstract

A method of deriving speech synthesis parameters from an audio signal, the method comprising:

- receiving an input speech signal;
- estimating the position of glottal closure incidents from said audio signal;
- deriving a pulsed excitation signal from the position of the glottal closure incidents;
- segmenting said audio signal on the basis of said glottal closure incidents, to obtain segments of said audio signal;
- processing the segments of the audio signal to obtain the complex cepstrum and deriving a synthesis filter from said complex cepstrum;
- reconstructing said speech audio signal to produce a reconstructed speech signal using an excitation model where the pulsed excitation signal is passed through said synthesis filter;
- comparing said reconstructed speech signal with said input speech signal; and
- calculating the difference between the reconstructed speech signal and the input speech signal and modifying either the pulsed excitation signal or the complex cepstrum to reduce the difference between the reconstructed speech signal and the input speech.
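
The steps listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the method as implemented by the applicant: the glottal closure positions are assumed to be already known, a single segment and a single time-invariant filter are used, and the function names (`complex_cepstrum`, `synthesis_filter`, `pulse_train`, `reconstruct`, `mean_squared_error`) are hypothetical. The cepstrum is computed with the usual FFT, log, phase-unwrap, inverse-FFT route.

```python
import numpy as np

def complex_cepstrum(segment, n_fft=512):
    """Complex cepstrum of one segment; negative quefrencies wrap to the end."""
    spectrum = np.fft.rfft(segment, n_fft)
    log_magnitude = np.log(np.abs(spectrum) + 1e-12)  # small floor avoids log(0)
    phase = np.unwrap(np.angle(spectrum))
    # Remove the linear-phase (pure delay) term so the phase ends near zero.
    delay = int(np.round(phase[-1] / np.pi))
    phase = phase - np.pi * delay * np.arange(phase.size) / (phase.size - 1)
    return np.fft.irfft(log_magnitude + 1j * phase, n_fft)

def synthesis_filter(cepstrum):
    """Impulse response whose log spectrum is the DFT of the cepstrum."""
    return np.fft.irfft(np.exp(np.fft.rfft(cepstrum)), cepstrum.size)

def pulse_train(pulse_positions, length):
    """Unit pulses at the given sample indices (assumed GCI positions)."""
    excitation = np.zeros(length)
    excitation[np.asarray(pulse_positions, dtype=int)] = 1.0
    return excitation

def reconstruct(excitation, impulse_response):
    """Pass the pulsed excitation through the synthesis filter."""
    return np.convolve(excitation, impulse_response)[: excitation.size]

def mean_squared_error(a, b):
    return float(np.mean((a - b) ** 2))

# Toy input: three pulses filtered by a short decaying response (made-up data).
gci_positions = [10, 50, 90]              # assumed to be already estimated
true_response = 0.5 ** np.arange(16)
excitation = pulse_train(gci_positions, 128)
speech = reconstruct(excitation, true_response)

# One GCI-delimited segment is enough for this time-invariant toy case.
segment = speech[gci_positions[0]:gci_positions[1]]
cepstrum = complex_cepstrum(segment)
reconstructed = reconstruct(excitation, synthesis_filter(cepstrum))
error = mean_squared_error(speech, reconstructed)
print(error)
```

The printed error is the quantity that the final step would try to reduce by adjusting the pulses or the cepstrum. In this toy case the segment already matches the filter response, so there is little left to reduce; real speech would need per-segment filters and the iterative refinement described in the claims.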

18 Claims

1. A method of deriving speech synthesis parameters from an audio signal, the method comprising:
- receiving an input speech signal;
  
  estimating the position of glottal closure incidents from said audio signal;
  
  deriving a pulsed excitation signal from the position of the glottal closure incidents;
  
  segmenting said audio signal on the basis of said glottal closure incidents, to obtain segments of said audio signal;
  
  processing the segments of the audio signal to obtain the complex cepstrum and deriving a synthesis filter from said complex cepstrum;
  
  reconstructing said speech audio signal to produce a reconstructed speech signal using an excitation model where the pulsed excitation signal is passed through said synthesis filter;
  
  comparing said reconstructed speech signal with said input speech signal; and
  
  calculating the difference between the reconstructed speech signal and the input speech signal and modifying either the pulsed excitation signal or the complex cepstrum to reduce the difference between the reconstructed speech signal and the input speech.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 13, 16, 17)
- - 2. A method according to claim 1, comprising modifying both the pulsed excitation signal and the complex cepstrum to reduce the difference between the reconstructed speech and the input speech.
  - 3. A method according to claim 2, wherein modifying the pulsed excitation signal and the complex cepstrum comprises the process of:
    - optimising the position of the pulses in said excitation signal to reduce the mean squared error between reconstructed speech and the input speech; and
      
      recalculating the complex cepstrum using the optimised pulse positions, wherein the process is repeated until the position of the pulses and the complex cepstrum results in a minimum difference between the reconstructed speech and the input speech.
  - 4. A method according to claim 2, wherein the difference between the reconstructed speech and the input speech is calculated using the mean squared error.
  - 5. A method according to claim 3, wherein the pulse height a_z is set such that a_z = 0 if a_z < 0 and a_z = 1 if a_z > 0 before recalculation of the complex cepstrum.
  - 6. A method according to claim 3, wherein re-calculating the complex cepstrum comprises optimising the complex cepstrum by minimising the difference between the reconstructed speech and the input speech, wherein the optimising is performed using a gradient method.
  - 7. A method according to claim 1, further comprising decomposing the complex cepstrum into phase and minimum phase cepstral components.
  - 8. A method of vocal analysis, the method comprising extracting speech synthesis parameters from an input signal in a method according to claim 1, and comparing the complex cepstrum with threshold parameters.
  - 9. A method of training a speech synthesiser, the synthesiser comprising a source filter model for modeling speech using an excitation signal and a synthesis filter, the method comprising training the synthesis parameters by deriving speech synthesis parameters from an input signal using a method according to claim 1, the method further comprising storing said extracted parameters.
  - 10. A method according to claim 9, the method further comprising training the synthesiser by receiving input text and speech, the method comprising extracting labels from the input text, and relating derived speech parameters to said labels via probability density functions.
  - 13. A text to speech method according to claim 11, wherein said complex cepstrum parameters which are stored in said memory have been derived using the method of claim 1.
  - 16. A text to speech system according to claim 15, wherein said memory comprises parameters which have been derived using the method of claim 1.
  - 17. A carrier medium comprising computer readable code configured to cause a computer to perform the method of claim 1.
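
Claims 3 to 5 describe refining the pulse positions to lower the mean squared error and clamping the pulse heights before the cepstrum is recalculated. The sketch below illustrates only the position search and the clamp, with the synthesis filter held fixed. The exhaustive local search over a few samples is an assumed strategy chosen for simplicity (the claims do not say how the positions are optimised), and all names (`refine_pulse_positions`, `clamp_pulse_heights`, `max_shift`) are hypothetical.

```python
import numpy as np

def reconstruct(positions, amplitudes, impulse_response, length):
    """Build the pulsed excitation and pass it through the synthesis filter."""
    excitation = np.zeros(length)
    np.add.at(excitation, positions, amplitudes)
    return np.convolve(excitation, impulse_response)[:length]

def mean_squared_error(a, b):
    return float(np.mean((a - b) ** 2))

def refine_pulse_positions(speech, positions, impulse_response, max_shift=3):
    """Move one pulse at a time to the nearby position with the lowest error."""
    positions = list(positions)
    amplitudes = np.ones(len(positions))
    length = speech.size
    best_error = mean_squared_error(
        speech, reconstruct(positions, amplitudes, impulse_response, length))
    for z in range(len(positions)):
        original = positions[z]
        best_position = original
        for shift in range(-max_shift, max_shift + 1):
            candidate = original + shift
            if not 0 <= candidate < length:
                continue  # skip positions outside the signal
            positions[z] = candidate
            error = mean_squared_error(
                speech, reconstruct(positions, amplitudes, impulse_response, length))
            if error < best_error:
                best_error, best_position = error, candidate
        positions[z] = best_position
    return positions, best_error

def clamp_pulse_heights(heights):
    """Claim 5 clamp: negative heights become 0, positive heights become 1.

    A height of exactly 0 is left at 0 here; the claim does not cover that case.
    """
    return np.where(np.asarray(heights) > 0, 1.0, 0.0)

# Toy data: the initial pulse positions are each off by a few samples.
impulse_response = 0.6 ** np.arange(12)
true_positions = [20, 60, 100]
speech = reconstruct(true_positions, np.ones(3), impulse_response, 128)
refined, final_error = refine_pulse_positions(speech, [22, 58, 101], impulse_response)
clamped = clamp_pulse_heights([-0.3, 0.7, 2.0])
```

In the full procedure of claim 3, a step like this would alternate with recalculating the complex cepstrum from the refined positions until the error stops decreasing. That outer loop is omitted here.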

11. A text to speech method, the method comprising:
- receiving input text;
  
  extracting labels from said input text;
  
  using said labels to extract speech parameters which have been stored in a memory,
  
  generating a speech signal from said extracted speech parameters wherein said speech signal is generated using a source filter model which produces speech using an excitation signal and a synthesis filter, said speech parameters comprising complex cepstrum parameters.
- Dependent claims (12, 18)
- - 12. A text to speech method according to claim 11, wherein said complex cepstrum parameters are stored in said memory as minimum phase cepstrum parameters and phase parameters, the method being configured to produce said excitation signal using said phase parameters and said synthesis filter using said minimum phase cepstrum parameters.
  - 18. A carrier medium comprising computer readable code configured to cause a computer to perform the method of claim 11.
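
Claims 7 and 12 refer to splitting the complex cepstrum into minimum phase cepstrum parameters and phase parameters. The claims give no formulas, so the sketch below uses one standard decomposition as an assumption: a causal minimum-phase part that carries the log-magnitude spectrum, and an odd-symmetric all-pass part that carries the remaining phase. The function names and the two-sided array layout are hypothetical.

```python
import numpy as np

def split_complex_cepstrum(cepstrum):
    """Split a two-sided cepstrum c[-C..C] into minimum-phase and all-pass parts.

    The input has odd length 2*C + 1, with c[0] stored at index C.
    """
    cepstrum = np.asarray(cepstrum, dtype=float)
    order = (cepstrum.size - 1) // 2
    negative = cepstrum[:order][::-1]      # c[-1], c[-2], ..., c[-C]
    positive = cepstrum[order + 1:]        # c[1], c[2], ..., c[C]
    # Minimum-phase part: causal, keeps the whole log-magnitude spectrum.
    minimum_phase = np.concatenate(([cepstrum[order]], positive + negative))
    # All-pass part: odd sequence, so its log-magnitude spectrum is zero.
    all_pass = np.concatenate((cepstrum[:order], [0.0], -negative))
    return minimum_phase, all_pass

def merge_components(minimum_phase, all_pass):
    """Rebuild the two-sided complex cepstrum from the two stored parts."""
    order = minimum_phase.size - 1
    two_sided_minimum_phase = np.concatenate((np.zeros(order), minimum_phase))
    return two_sided_minimum_phase + all_pass

# Random stand-in for a stored cepstrum of order 8 (made-up data).
rng = np.random.default_rng(0)
cepstrum = rng.standard_normal(2 * 8 + 1)
minimum_phase, all_pass = split_complex_cepstrum(cepstrum)
restored = merge_components(minimum_phase, all_pass)
```

Under this layout, a synthesiser along the lines of claim 12 could build the synthesis filter from `minimum_phase` and shape the excitation with `all_pass`, and the sum of the two recovers the original complex cepstrum.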

14. A system for extracting speech synthesis parameters from an audio signal, the system comprising a processor adapted to:
- receive an input speech signal;
  
  estimate the position of glottal closure incidents from said audio signal;
  
  derive a pulsed excitation signal from the position of the glottal closure incidents;
  
  segment said audio signal on the basis of said glottal closure incidents, to obtain segments of said audio signal;
  
  process the segments of the audio signal to obtain the complex cepstrum and derive a synthesis filter from said complex cepstrum;
  
  reconstruct said speech audio signal to produce a reconstructed speech signal using an excitation model where the pulsed excitation signal is passed through said synthesis filter;
  
  compare said reconstructed speech signal with said input speech signal; and
  
  calculate the difference between the reconstructed speech signal and the input speech signal and modify either the pulsed excitation signal or the complex cepstrum to reduce the difference between the reconstructed speech signal and the input speech.

15. A text to speech system, the system comprising a memory and a processor adapted to:
- receive input text;
  
  extract labels from said input text;
  
  use said labels to extract speech parameters which have been stored in the memory; and
  
  generate a speech signal from said extracted speech parameters wherein said speech signal is generated using a source filter model which produces speech using an excitation signal and a synthesis filter, said speech parameters comprising complex cepstrum parameters.

Current Assignee
Kabushiki Kaisha Toshiba (Toshiba Corporation)
Original Assignee
Kabushiki Kaisha Toshiba (Toshiba Corporation)
Inventors
Ranniery, Maia

Granted Patent

US 9,466,285 B2
US Class Current

704/260
CPC Class Codes

G10L 13/02 Methods for producing synth...

G10L 13/08 Text analysis or generation...
