Speech processing and speech synthesis using a linear combination of bases at peak frequencies for spectral envelope information
Abstract
An information extraction unit extracts L-dimensional spectral envelope information from each frame of speech data by discrete Fourier transform. The spectral envelope information is represented by L points. A basis storage unit stores N bases (L > N > 1). Each basis has a different frequency band with a maximum at a peak frequency in the L-dimensional spectral domain. The value of a basis at any frequency outside its frequency band along the frequency axis of the spectral domain is zero. Two frequency bands whose peak frequencies are adjacent along the frequency axis partially overlap. A parameter calculation unit minimizes a distortion between the spectral envelope information and a linear combination of the bases with coefficients, over the L points of the spectral envelope information, by changing the coefficients, and sets the coefficients for which the distortion is minimized as the spectral envelope parameter of the spectral envelope information.
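The abstract's core idea, local overlapping bases plus a coefficient fit, can be illustrated with a minimal sketch. Several things here are assumptions made for illustration and are not taken from the patent: the triangular basis shape, the evenly spaced peak positions, the squared-error distortion, and the coordinate-descent solver. The function names `make_bases`, `fit_coefficients`, and `reconstruct` are hypothetical.

```python
def make_bases(L, peaks):
    """Build one basis per peak over L spectral points.

    Assumed shape: triangular. Each basis is 1 at its own peak, falls
    linearly to 0 at the neighbouring peaks, and is 0 outside that band,
    so adjacent bands partially overlap.
    """
    bases = []
    for n, p in enumerate(peaks):
        lo = peaks[n - 1] if n > 0 else p               # left band edge
        hi = peaks[n + 1] if n < len(peaks) - 1 else p  # right band edge
        b = [0.0] * L
        for k in range(L):
            if k == p:
                b[k] = 1.0
            elif lo < k < p:
                b[k] = (k - lo) / (p - lo)
            elif p < k < hi:
                b[k] = (hi - k) / (hi - p)
        bases.append(b)
    return bases


def reconstruct(coeffs, bases):
    """Linear combination of the bases: one envelope value per spectral point."""
    L = len(bases[0])
    return [sum(c * b[k] for c, b in zip(coeffs, bases)) for k in range(L)]


def fit_coefficients(envelope, bases, sweeps=200):
    """Minimise the squared error between the envelope and the linear
    combination by cyclic coordinate descent (assumed distortion and solver)."""
    coeffs = [0.0] * len(bases)
    resid = list(envelope)  # envelope minus the current reconstruction
    for _ in range(sweeps):
        for n, b in enumerate(bases):
            energy = sum(v * v for v in b)
            step = sum(r * v for r, v in zip(resid, b)) / energy
            coeffs[n] += step
            resid = [r - step * v for r, v in zip(resid, b)]
    return coeffs


# Usage: L = 17 points, N = 5 bases (L > N > 1).
bases = make_bases(17, [0, 4, 8, 12, 16])
envelope = reconstruct([1.0, 2.0, 3.0, 2.0, 1.0], bases)
coeffs = fit_coefficients(envelope, bases)
```

Because each basis is nonzero only near its peak, each coefficient describes the envelope in one frequency region, which is what makes the N coefficients usable as a compact spectral envelope parameter.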
14 Claims
1. An apparatus for speech processing, the apparatus being implemented by a computer programmed to execute computer-readable instructions stored in a memory, the apparatus comprising:
a frame extraction unit configured to extract, using the computer, a speech signal in each frame;
an information extraction unit configured to extract, using the computer, spectral envelope information of L-dimension from each frame by discrete Fourier transform, the spectral envelope information being represented by L points;
a basis generation unit configured to extract, using the computer, the spectral envelope information from the speech signal to generate a basis, to minimize a first evaluation function by changing the basis and a corresponding coefficient, the first evaluation being a sum of an error term and a first regularization term, the error term being a distortion between the spectral envelope information and a linear combination of the basis with the coefficient, the first regularization term being a sparseness of the coefficient, the sparseness being a smaller value when the coefficient is closer to zero, and to select the basis for which the first evaluation function is minimized;
a basis storage unit configured to store N bases (L > N > 1), each basis having a different frequency band having a maximum as a peak frequency in a spectral domain having L-dimension, a value corresponding to a frequency outside the frequency band along a frequency axis of the spectral domain being zero, and two frequency bands of which two peak frequencies are adjacent along the frequency axis partially overlapping; and
a parameter calculation unit configured to minimize, using the computer, a distortion between the spectral envelope information and a linear combination of each basis with the coefficient for each of L points of the spectral envelope information by changing the coefficient, and to set the coefficient of each basis for which the distortion is minimized as a spectral envelope parameter of the spectral envelope information.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9.

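One way to write the first evaluation function of claim 1 is sketched below. The claim only states that it is the sum of an error term (a distortion) and a sparseness regularization term that is smaller when a coefficient is closer to zero. The squared-error distortion, the absolute-value sparseness measure, the weight \lambda, and the frame index t = 1, ..., T are assumptions chosen for illustration.

```latex
% x_t : L-point spectral envelope of frame t
% \Phi = [\phi_1, \dots, \phi_N] : L x N matrix whose columns are the bases
% c_t : N coefficients for frame t, with entries c_{t,n}
E(\Phi, c_1, \dots, c_T)
  = \underbrace{\sum_{t=1}^{T} \bigl\lVert x_t - \Phi c_t \bigr\rVert^2}_{\text{error term}}
  + \lambda \underbrace{\sum_{t=1}^{T} \sum_{n=1}^{N} \lvert c_{t,n} \rvert}_{\text{sparseness term}}
```

Minimizing E over both \Phi and the coefficients favors bases for which each frame needs only a few nonzero coefficients.
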
10. An apparatus for a speech synthesis, the apparatus being implemented by a computer programmed to execute computer-readable instructions stored in a memory, the apparatus comprising:
a parameter storage unit configured to store the spectral envelope parameter corresponding to a pitch-cycle waveform of each speech unit;
an attribute storage unit configured to store an attribute information of each speech unit;
a division unit configured to divide, using the computer, a phoneme sequence of input text into each synthesis unit;
a selection unit configured to select, using the computer, at least one speech unit corresponding to each synthesis unit by using the attribute information;
an acquisition unit configured to acquire the spectral envelope parameter corresponding to the pitch-cycle waveform of each speech unit selected by the selection unit, the spectral envelope parameter having L-dimension;
a fusion unit configured to fuse, using the computer, a plurality of spectral envelope parameters to one spectral envelope parameter, when the acquisition unit acquires the plurality of spectral envelope parameters corresponding to pitch-cycle waveforms of a plurality of selected speech units by the selection unit;
a basis storage unit configured to store N bases (L > N > 1), each basis having a different frequency band having a maximum as a peak frequency in a spectral domain having L-dimension, a value corresponding to a frequency outside the frequency band along a frequency axis of the spectral domain being zero, and two frequency bands of which two peak frequencies are adjacent along the frequency axis partially overlapping;
an envelope generation unit configured to generate spectral envelope information by linearly combining the bases with the spectral envelope parameter, the spectral envelope information being represented by L points;
a pitch-cycle waveform generation unit configured to generate a plurality of pitch-cycle waveforms by inverse-Fourier transform with a spectrum of the spectral envelope information; and
a speech generation unit configured to generate a plurality of speech units by overlapping and adding the plurality of pitch-cycle waveforms, and to generate a speech waveform by concatenating the plurality of speech units,
wherein the fusion unit is configured to correspond the spectral envelope parameter of each speech unit along a temporal direction, to average corresponded spectral envelope parameters to generate an averaged spectral envelope parameter, to select one representative speech unit from the plurality of speech units, and to set the spectral envelope parameter of the one representative speech unit as a representative spectral envelope parameter, to determine a boundary order from the representative spectral envelope parameter or the averaged spectral envelope parameter, and to mix the plurality of spectral envelope parameters by using the averaged spectral envelope parameter for a spectral envelope parameter having lower order than the boundary order and by using the representative spectral envelope parameter for a spectral envelope parameter having higher order than the boundary order.

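The mixing step of the fusion unit in claim 10 can be sketched as follows. This is a simplified illustration under stated assumptions: the parameter vectors are taken to be already aligned along the temporal direction (one frame per unit), the boundary order is passed in rather than derived from the parameters, and the entry at the boundary order itself is taken from the representative unit, a case the claim does not specify. The name `fuse_parameters` is hypothetical.

```python
def fuse_parameters(units, representative_index, boundary):
    """Mix aligned spectral envelope parameter vectors.

    units: list of equal-length parameter vectors, one per selected speech unit.
    representative_index: which unit supplies the higher-order entries.
    boundary: boundary order (assumed given; the claim derives it from the
        representative or averaged parameter).
    """
    dim = len(units[0])
    # Average every order across the selected units.
    averaged = [sum(u[i] for u in units) / len(units) for i in range(dim)]
    representative = units[representative_index]
    # Orders below the boundary use the average; the rest use the representative.
    return [averaged[i] if i < boundary else representative[i] for i in range(dim)]


# Usage: two selected units, four parameter orders, boundary order 2.
fused = fuse_parameters([[1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0]], 1, 2)
```

Averaging the lower orders smooths the coarse spectral shape across units, while copying the higher orders from a single unit avoids the blurring that averaging would cause there.
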
11. A method for speech processing, the method using a computer to execute computer-readable instructions stored in a memory, the method comprising:
dividing a speech signal into each frame;
extracting spectral envelope information of L-dimension from each frame by discrete Fourier transform, the spectral envelope information being represented by L points;
extracting the spectral envelope information from the speech signal to generate a basis;
minimizing a first evaluation function by changing the basis and a corresponding coefficient, the first evaluation being a sum of an error term and a first regularization term, the error term being a distortion between the spectral envelope information and a linear combination of the basis with the coefficient, the first regularization term being a sparseness of the coefficient, the sparseness being a smaller value when the coefficient is closer to zero;
selecting the basis for which the first evaluation function is minimized;
storing N bases (L > N > 1) in a memory, each basis having a different frequency band having a maximum as a peak frequency in a spectral domain having L-dimension, a value corresponding to a frequency outside the frequency band along a frequency axis of the spectral domain being zero, and two frequency bands of which two peak frequencies are adjacent along the frequency axis partially overlapping;
minimizing, by the computer, a distortion between the spectral envelope information and a linear combination of each basis with the coefficient for each of L points of the spectral envelope information by changing the coefficient; and
setting the coefficient of each basis for which the distortion is minimized as a spectral envelope parameter of the spectral envelope information.

12. A method for speech synthesis, the method using a computer to execute computer-readable instructions stored in a memory, the method comprising:
storing a spectral envelope parameter corresponding to a pitch-cycle waveform of each speech unit;
storing an attribute information of each speech unit;
dividing a phoneme sequence of input text into each synthesis unit;
selecting at least one speech unit corresponding to each synthesis unit by using the attribute information;
acquiring the spectral envelope parameter corresponding to the pitch-cycle waveform of each speech unit selected, the spectral envelope parameter having L-dimension;
fusing a plurality of spectral envelope parameters to one spectral envelope parameter, when the plurality of spectral envelope parameters corresponding to pitch-cycle waveforms of a plurality of selected speech units is acquired;
storing N bases (L > N > 1) in a memory, each basis having a different frequency band having a maximum as a peak frequency in a spectral domain having L-dimension, a value corresponding to a frequency outside the frequency band along a frequency axis of the spectral domain being zero, and two frequency bands of which two peak frequencies are adjacent along the frequency axis partially overlapping;
generating spectral envelope information by linearly combining the bases with the spectral envelope parameter, the spectral envelope information being represented by L points;
generating, by the computer, a plurality of pitch-cycle waveforms by inverse-Fourier transform with a spectrum of the spectral envelope information;
generating a plurality of speech units by overlapping and adding the plurality of pitch-cycle waveforms; and
generating a speech waveform by concatenating the plurality of speech units,
wherein the fusing step further comprises
corresponding the spectral envelope parameter of each speech unit along a temporal direction;
averaging corresponded spectral envelope parameters to generate an averaged spectral envelope parameter;
selecting one representative speech unit from the plurality of speech units;
setting the spectral envelope parameter of the one representative speech unit as a representative spectral envelope parameter;
determining a boundary order from the representative spectral envelope parameter or the averaged spectral envelope parameter; and
mixing the plurality of spectral envelope parameters by using the averaged spectral envelope parameter for a spectral envelope parameter having lower order than the boundary order and by using the representative spectral envelope parameter for a spectral envelope parameter having higher order than the boundary order.

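The "overlapping and adding the plurality of pitch-cycle waveforms" step of claim 12 can be illustrated with a minimal overlap-add sketch. The claim does not say how waveforms are positioned; centring each waveform on a pitch mark given as a sample index is an assumption, and `overlap_add` is a hypothetical helper name.

```python
def overlap_add(waveforms, centers, length):
    """Sum pitch-cycle waveforms into one speech-unit buffer.

    waveforms: list of sample lists, one per pitch cycle.
    centers: assumed pitch-mark positions (sample indices), one per waveform.
    length: number of samples in the output buffer.
    """
    out = [0.0] * length
    for wave, center in zip(waveforms, centers):
        start = center - len(wave) // 2  # centre the waveform on its pitch mark
        for i, value in enumerate(wave):
            j = start + i
            if 0 <= j < length:          # drop samples that fall outside the buffer
                out[j] += value
    return out


# Usage: two 3-sample waveforms whose edges overlap at sample 2.
unit = overlap_add([[1.0, 2.0, 1.0], [1.0, 2.0, 1.0]], [1, 3], 5)
```

Spacing the pitch marks more closely or more widely changes the pitch of the resulting speech unit without altering the spectral envelope carried by each waveform.
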
13. A non-transitory computer-readable medium storing a computer program for causing a computer to perform a method for a speech processing, the method comprising:
dividing a speech signal into each frame;
extracting a spectral envelope information of L-dimension from each frame by discrete Fourier transform, the spectral envelope information being represented by L points;
extracting the spectral envelope information from the speech signal to generate a basis;
minimizing a first evaluation function by changing the basis and a corresponding coefficient, the first evaluation being a sum of an error term and a first regularization term, the error term being a distortion between the spectral envelope information and a linear combination of the basis with the coefficient, the first regularization term being a sparseness of the coefficient, the sparseness being a smaller value when the coefficient is closer to zero;
selecting the basis for which the first evaluation function is minimized;
storing N bases (L > N > 1) in a memory, each basis having a different frequency band having a maximum as a peak frequency in a spectral domain having L-dimension, a value corresponding to a frequency outside the frequency band along a frequency axis of the spectral domain being zero, and two frequency bands of which two peak frequencies are adjacent along the frequency axis partially overlapping;
minimizing a distortion between the spectral envelope information and a linear combination of each basis with the coefficient for each of L points of the spectral envelope information by changing the coefficient; and
setting the coefficient of each basis for which the distortion is minimized as a spectral envelope parameter of the spectral envelope information.

14. A non-transitory computer-readable medium storing a computer program for causing a computer to perform a method for speech synthesis, the method comprising:
storing a spectral envelope parameter corresponding to a pitch-cycle waveform of each speech unit;
storing an attribute information of each speech unit;
dividing a phoneme sequence of input text into each synthesis unit;
selecting at least one speech unit corresponding to each synthesis unit by using the attribute information;
acquiring the spectral envelope parameter corresponding to the pitch-cycle waveform of each speech unit selected, the spectral envelope parameter having L-dimension;
fusing a plurality of spectral envelope parameters to one spectral envelope parameter, when the plurality of spectral envelope parameters corresponding to pitch-cycle waveforms of a plurality of selected speech units is acquired;
storing N bases (L > N > 1) in a memory, each basis having a different frequency band having a maximum as a peak frequency in a spectral domain having L-dimension, a value corresponding to a frequency outside the frequency band along a frequency axis of the spectral domain being zero, and two frequency bands of which two peak frequencies are adjacent along the frequency axis partially overlapping;
generating spectral envelope information by linearly combining the bases with the spectral envelope parameter, the spectral envelope information being represented by L points;
generating a plurality of pitch-cycle waveforms by inverse-Fourier transform with a spectrum of the spectral envelope information;
generating a plurality of speech units by overlapping and adding the plurality of pitch-cycle waveforms; and
generating a speech waveform by concatenating the plurality of speech units,
wherein the fusing step further comprises
corresponding the spectral envelope parameter of each speech unit along a temporal direction;
averaging corresponded spectral envelope parameters to generate an averaged spectral envelope parameter;
selecting one representative speech unit from the plurality of speech units;
setting the spectral envelope parameter of the one representative speech unit as a representative spectral envelope parameter;
determining a boundary order from the representative spectral envelope parameter or the averaged spectral envelope parameter; and
mixing the plurality of spectral envelope parameters by using the averaged spectral envelope parameter for a spectral envelope parameter having lower order than the boundary order and by using the representative spectral envelope parameter for a spectral envelope parameter having higher order than the boundary order.

Specification