Speech processing system
Abstract
A method of deriving speech synthesis parameters from an input speech audio signal, wherein the audio signal is segmented on the basis of estimated positions of glottal closure incidents and the resulting segments are processed to obtain the complex cepstrum used to derive a synthesis filter. A reconstructed speech signal is produced by passing a pulsed excitation signal derived from the position of the glottal closure incidents through the synthesis filter, and is compared with the input speech audio signal. The pulsed excitation signal and the complex cepstrum are then iteratively modified to minimize the difference between the reconstructed speech signal and the input speech audio signal, by optimizing the position of the pulses in the excitation signal to reduce the mean squared error between the reconstructed speech signal and the input speech audio signal, and recalculating the complex cepstrum using the optimized pulse positions.
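As an illustration of the cepstral analysis step named in the abstract, the following Python sketch computes a complex cepstrum for one segment and converts it back into an impulse response that could serve as a synthesis filter. It is a minimal sketch and not the patented implementation: the function names, the FFT size, the removal of a linear-phase (pure delay) term, and the sign handling are all assumptions introduced here, and NumPy is assumed to be available.

```python
import numpy as np


def complex_cepstrum(segment, nfft=1024):
    """Return (cepstrum, delay, sign) for one real-valued segment.

    Assumptions of this sketch: nfft is even, nfft >= len(segment), and the
    spectrum has no zeros exactly on the sampled frequency bins.
    """
    spectrum = np.fft.rfft(segment, nfft)
    # Factor out a negative DC gain so the phase starts at zero.
    sign = -1.0 if spectrum[0].real < 0 else 1.0
    spectrum = sign * spectrum
    log_magnitude = np.log(np.abs(spectrum) + 1e-12)
    phase = np.unwrap(np.angle(spectrum))
    # A delay of d samples contributes -pi * d of phase at the Nyquist bin.
    delay = int(np.round(-phase[-1] / np.pi))
    bins = np.arange(phase.size)
    phase = phase + np.pi * delay * bins / (phase.size - 1)
    cepstrum = np.fft.irfft(log_magnitude + 1j * phase, nfft)
    return cepstrum, delay, sign


def impulse_response(cepstrum, delay=0, sign=1.0):
    """Invert complex_cepstrum: exponentiate the log spectrum, restore delay."""
    nfft = cepstrum.size
    spectrum = np.exp(np.fft.rfft(cepstrum))
    bins = np.arange(spectrum.size)
    spectrum = sign * spectrum * np.exp(-1j * np.pi * delay * bins / (spectrum.size - 1))
    return np.fft.irfft(spectrum, nfft)
```

In this sketch the cepstrum is the inverse transform of the log magnitude plus the unwrapped phase, so exponentiating its transform recovers the original spectrum. The delay and sign are returned separately because neither can be represented in a real-valued cepstrum.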
14 Claims
1. A method of deriving speech synthesis parameters from an audio signal, the method performed in a device comprising a processor, the method comprising:
receiving an input speech audio signal;
estimating a position of glottal closure incidents from said input speech audio signal;
deriving a pulsed excitation signal from the position of the glottal closure incidents;
segmenting said audio signal on the basis of said glottal closure incidents, to obtain segments of said input speech audio signal;
processing the segments of the input speech audio to obtain a complex cepstrum and deriving a synthesis filter from said complex cepstrum;
producing a reconstructed speech signal based on the input speech audio signal by passing the pulsed excitation signal derived from the position of the glottal closure incidents through said synthesis filter derived from said complex cepstrum;
comparing said reconstructed speech signal with said input speech audio signal;
calculating a difference between the reconstructed speech signal and the input speech audio signal and modifying the pulsed excitation signal and the complex cepstrum to reduce the difference between the reconstructed speech signal and the input speech audio signal, wherein modifying the pulsed excitation signal and the complex cepstrum comprises the process of:
optimizing the position of the pulses in said excitation signal to reduce a mean squared error between the reconstructed speech signal and the input speech audio signal;
recalculating the complex cepstrum by optimizing the complex cepstrum by minimizing the difference between the reconstructed speech signal and the input speech audio signal using the optimized pulse positions; and
repeating the process to derive as said speech synthesis parameters the position of the pulses and the complex cepstrum resulting in a minimum difference between the reconstructed speech signal and the input speech audio signal.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 14
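The pulse-position optimization recited in claim 1 can be pictured with a small Python sketch. It is only one possible reading, under stated assumptions: unit-amplitude pulses, a fixed impulse response standing in for the synthesis filter, and an exhaustive search over a few samples on either side of each pulse. The function names and the `max_shift` parameter are hypothetical and do not come from the patent.

```python
import numpy as np


def synthesize(pulse_positions, impulse_response, length):
    """Pass a unit-pulse excitation through a filter given by its impulse response."""
    excitation = np.zeros(length)
    excitation[np.asarray(pulse_positions, dtype=int)] = 1.0
    return np.convolve(excitation, impulse_response)[:length]


def mean_squared_error(a, b):
    return float(np.mean((a - b) ** 2))


def optimize_pulse_positions(target, pulse_positions, impulse_response, max_shift=5):
    """Move each pulse, one at a time, to the nearby position with the lowest MSE."""
    positions = [int(p) for p in pulse_positions]
    length = target.size
    for i in range(len(positions)):
        start = positions[i]
        others = positions[:i] + positions[i + 1:]
        best_position = start
        best_error = mean_squared_error(
            target, synthesize(positions, impulse_response, length))
        for shift in range(-max_shift, max_shift + 1):
            candidate = start + shift
            # Skip positions outside the signal or already used by another pulse.
            if not 0 <= candidate < length or candidate in others:
                continue
            trial = positions[:i] + [candidate] + positions[i + 1:]
            error = mean_squared_error(
                target, synthesize(trial, impulse_response, length))
            if error < best_error:  # accept only strict improvements
                best_error, best_position = error, candidate
        positions[i] = best_position
    return positions
```

Because a move is accepted only when it lowers the error, the mean squared error after the search is never larger than before it. Initial positions would typically come from the estimated glottal closure incidents.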
11. A system for extracting speech synthesis parameters from an audio signal, the system comprising a processor adapted to:
receive an input speech audio signal;
estimate a position of glottal closure incidents from said input speech audio signal;
derive a pulsed excitation signal from the position of the glottal closure incidents;
segment said input speech audio signal on the basis of said glottal closure incidents, to obtain segments of said input speech audio signal;
process the segments of the input speech audio signal to obtain a complex cepstrum and derive a synthesis filter from said complex cepstrum;
produce a reconstructed speech signal by passing the pulsed excitation signal derived from the position of the glottal closure incidents through said synthesis filter derived from said complex cepstrum;
compare said reconstructed speech signal with said input speech audio signal;
calculate a difference between the reconstructed speech signal and the input speech audio signal; and
modify the pulsed excitation signal and the complex cepstrum to reduce the difference between the reconstructed speech signal and the input speech audio signal by executing a process comprising:
optimizing the position of the pulses in said excitation signal to reduce a mean squared error between the reconstructed speech signal and the input speech audio signal;
recalculating the complex cepstrum by optimizing the complex cepstrum by minimizing the difference between the reconstructed speech signal and the input speech audio signal using the optimized pulse positions; and
repeating the process to derive as said speech synthesis parameters the position of the pulses and the complex cepstrum resulting in a minimum difference between the reconstructed speech signal and the input speech audio signal.
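The filter recalculation step, which the claims perform on the complex cepstrum, can be approximated for illustration by a simpler stand-in. The sketch below does not optimize a cepstrum at all: it substitutes an ordinary least-squares fit of a finite impulse response given fixed pulse positions, which minimizes the same squared difference between the reconstructed and input signals. The function names, the unit-amplitude pulses, and the `filter_length` parameter are assumptions made for this example.

```python
import numpy as np


def excitation_matrix(pulse_positions, length, filter_length):
    """Build the matrix E such that E @ h equals the excitation convolved with h."""
    excitation = np.zeros(length)
    excitation[np.asarray(pulse_positions, dtype=int)] = 1.0
    matrix = np.zeros((length, filter_length))
    for k in range(filter_length):
        # Column k is the excitation delayed by k samples.
        matrix[k:, k] = excitation[:length - k]
    return matrix


def fit_impulse_response(target, pulse_positions, filter_length):
    """Least-squares impulse response for fixed pulse positions (stand-in step)."""
    matrix = excitation_matrix(pulse_positions, target.size, filter_length)
    solution, *_ = np.linalg.lstsq(matrix, target, rcond=None)
    return solution
```

For fixed pulse positions this fit gives the lowest squared error achievable by any impulse response of the chosen length. If it is alternated with a pulse-position search that only accepts error-reducing moves, the squared error is therefore non-increasing from one pass to the next.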
Specification