Speech processing and speech synthesis using a linear combination of bases at peak frequencies for spectral envelope information

US 8,321,208 B2
Filed: 12/03/2008
Issued: 11/27/2012
Est. Priority Date: 12/03/2007
Status: Active Grant

Abstract

An information extraction unit extracts L-dimensional spectral envelope information from each frame of speech data by discrete Fourier transform. The spectral envelope information is represented by L points. A basis storage unit stores N bases (L>N>1). Each basis has a different frequency band whose maximum lies at a peak frequency in the L-dimensional spectral domain. A value corresponding to a frequency outside the frequency band along the frequency axis of the spectral domain is zero. Two frequency bands whose peak frequencies are adjacent along the frequency axis partially overlap. A parameter calculation unit minimizes a distortion between the spectral envelope information and a linear combination of the bases with their coefficients, over the L points of the spectral envelope information, by changing the coefficients, and sets the coefficient of each basis for which the distortion is minimized as a spectral envelope parameter of the spectral envelope information.
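
As a rough illustration of the representation described in the abstract, the following Python sketch builds N overlapping, locally supported bases and fits their coefficients by ordinary least squares. The triangular basis shape, the function names (`make_bases`, `fit_coefficients`, `reconstruct`), and the use of squared error as the distortion are assumptions made for this example; the abstract does not fix any of them.

```python
def make_bases(peaks, L):
    """Build len(peaks) triangular bases of length L (shape is an assumption).

    Basis n equals 1 at peaks[n], falls linearly to 0 at the neighbouring
    peaks, and is 0 everywhere else, so adjacent bands partially overlap.
    """
    bases = []
    for n, p in enumerate(peaks):
        left = peaks[n - 1] if n > 0 else p
        right = peaks[n + 1] if n + 1 < len(peaks) else p
        row = [0.0] * L
        for k in range(L):
            if left < k < p:
                row[k] = (k - left) / (p - left)
            elif k == p:
                row[k] = 1.0
            elif p < k < right:
                row[k] = (right - k) / (right - p)
        bases.append(row)
    return bases


def reconstruct(bases, coef):
    """Linear combination of the bases, evaluated at each of the L points."""
    return [sum(c * b[k] for c, b in zip(coef, bases))
            for k in range(len(bases[0]))]


def fit_coefficients(bases, x):
    """Least-squares coefficients: solve (B B^T) c = B x by Gaussian elimination."""
    n, L = len(bases), len(x)
    A = [[sum(bi[k] * bj[k] for k in range(L)) for bj in bases] for bi in bases]
    b = [sum(bi[k] * x[k] for k in range(L)) for bi in bases]
    for col in range(n):
        # Partial pivoting keeps the elimination numerically stable.
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for j in range(col, n):
                A[r][j] -= f * A[col][j]
            b[r] -= f * b[col]
    coef = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(A[r][j] * coef[j] for j in range(r + 1, n))
        coef[r] = (b[r] - s) / A[r][r]
    return coef


# Example: L = 9 points, N = 4 bases, peak spacing widening with frequency.
bases = make_bases([0, 2, 4, 8], 9)
envelope = reconstruct(bases, [1.0, 0.5, 2.0, 0.25])
params = fit_coefficients(bases, envelope)
```

In this toy case the envelope lies exactly in the span of the bases, so the fitted `params` match the coefficients used to build it; a real envelope would only be approximated.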

14 Claims

1. An apparatus for speech processing, the apparatus being implemented by a computer programmed to execute computer-readable instructions stored in a memory, the apparatus comprising:
- a frame extraction unit configured to extract, using the computer, a speech signal in each frame;
  
  an information extraction unit configured to extract, using the computer, spectral envelope information of L-dimension from each frame by discrete Fourier transform, the spectral envelope information being represented by L points;
  
  a basis generation unit configured to extract, using the computer, the spectral envelope information from the speech signal to generate a basis, to minimize a first evaluation function by changing the basis and a corresponding coefficient, the first evaluation being a sum of an error term and a first regularization term, the error term being a distortion between the spectral envelope information and a linear combination of the basis with the coefficient, the first regularization term being a sparseness of the coefficient, the sparseness being a smaller value when the coefficient is closer to zero, and to select the basis for which the first evaluation function is minimized;
  
  a basis storage unit configured to store N bases (L>N>1), each basis having a different frequency band having a maximum as a peak frequency in a spectral domain having L-dimension, a value corresponding to a frequency outside the frequency band along a frequency axis of the spectral domain being zero, and two frequency bands of which two peak frequencies are adjacent along the frequency axis partially overlapping; and
  
  a parameter calculation unit configured to minimize, using the computer, a distortion between the spectral envelope information and a linear combination of each basis with the coefficient for each of L points of the spectral envelope information by changing the coefficient, and to set the coefficient of each basis for which the distortion is minimized as a spectral envelope parameter of the spectral envelope information.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9):
  - 2. The apparatus according to claim 1, further comprising:
    - a basis generation unit configured to determine a plurality of peak frequencies in the spectral domain, to create a unimodal window function having a length as an interval between two adjacent peak frequencies and having all zero frequency outside three adjacent peak frequencies along the frequency axis, and to set a shape of the window function to the basis.
  - 3. The apparatus according to claim 2, wherein the basis generation unit is configured to determine the peak frequency having a wider interval than an adjacent peak frequency when the frequency is higher along the frequency axis.
  - 4. The apparatus according to claim 2, wherein the basis generation unit is configured to determine the peak frequency having a wider interval than an adjacent peak frequency when the frequency is higher along the frequency axis as for a frequency band lower than a boundary frequency on the frequency axis, and to determine the peak frequency having an equal interval from the adjacent peak frequency as for a frequency band higher than the boundary frequency.
  - 5. The apparatus according to claim 1, wherein the basis generation unit is configured to minimize a second evaluation function by changing the basis and the coefficient, the second evaluation function being the sum of the error term, the first regularization term, and a second regularization term, the second regularization term being a concentration degree at a position to a center of the basis, the concentration degree being a larger value when a value at the position distant from the center of the basis is larger, and to select the basis for which the second evaluation function is minimized.
  - 6. The apparatus according to claim 1, wherein the parameter calculation unit is configured to minimize the distortion, wherein the distortion is a squared error between the spectral envelope information and a linear combination of each basis with the coefficient corresponding to each basis.
  - 7. The apparatus according to claim 1, wherein the parameter calculation unit is configured to minimize the distortion under a constraint that the coefficient is non-negative.
  - 8. The apparatus according to claim 1, wherein the parameter calculation unit is configured to assign a number of quantized bits to each dimension of the spectral envelope parameter, to determine a number of quantization bits to each dimension of the spectral envelope parameter, and to quantize the spectral envelope parameter based on the number of quantized bits and the number of quantization bits.
  - 9. The apparatus according to claim 1, wherein the spectral envelope information is one of a logarithm spectral envelope, a phase spectrum, an amplitude spectral envelope, and a power spectral envelope.
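
Claims 1 and 5 describe the two evaluation functions only in words. One possible way to write them down is shown here, assuming squared error as the distortion, an L1 norm as the sparseness measure, and a squared-distance weighting as the concentration degree. The weights \lambda and \mu, the basis center m_n, and all other symbols are notation introduced for this illustration, not taken from the claims.

```latex
% x_t \in \mathbb{R}^L : spectral envelope information of frame t
% \phi_n \in \mathbb{R}^L : n-th basis,  c_{t,n} : its coefficient in frame t
% m_n : center position of basis n (hypothetical notation)
E_1(\Phi, C) = \sum_{t} \Bigl\| x_t - \sum_{n=1}^{N} c_{t,n}\,\phi_n \Bigr\|^2
             + \lambda \sum_{t} \sum_{n=1}^{N} \lvert c_{t,n} \rvert

E_2(\Phi, C) = E_1(\Phi, C)
             + \mu \sum_{n=1}^{N} \sum_{k=0}^{L-1} (k - m_n)^2 \, \phi_n(k)^2
```

Under these assumptions the first regularization term shrinks as coefficients approach zero, and the second grows when a basis has large values far from its center, matching the behavior the claims require. The parameter calculation of claims 6 and 7 then corresponds to the inner problem with the bases held fixed and \lambda = 0: minimize the squared error over the coefficients, optionally subject to c_{t,n} \ge 0.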

10. An apparatus for a speech synthesis, the apparatus being implemented by a computer programmed to execute computer-readable instructions stored in a memory, the apparatus comprising:
- a parameter storage unit configured to store the spectral envelope parameter corresponding to a pitch-cycle waveform of each speech unit;
  
  an attribute storage unit configured to store an attribute information of each speech unit;
  
  a division unit configured to divide, using the computer, a phoneme sequence of input text into each synthesis unit;
  
  a selection unit configured to select, using the computer, at least one speech unit corresponding to each synthesis unit by using the attribute information;
  
  an acquisition unit configured to acquire the spectral envelope parameter corresponding to the pitch-cycle waveform of each speech unit selected by the selection unit, the spectral envelope parameter having L-dimension;
  
  a fusion unit configured to fuse, using the computer, a plurality of spectral envelope parameters to one spectral envelope parameter, when the acquisition unit acquires the plurality of spectral envelope parameters corresponding to pitch-cycle waveforms of a plurality of selected speech units by the selection unit;
  
  a basis storage unit configured to store N bases (L>N>1), each basis having a different frequency band having a maximum as a peak frequency in a spectral domain having L-dimension, a value corresponding to a frequency outside the frequency band along a frequency axis of the spectral domain being zero, and two frequency bands of which two peak frequencies are adjacent along the frequency axis partially overlapping;
  
  an envelope generation unit configured to generate spectral envelope information by linearly combining the bases with the spectral envelope parameter, the spectral envelope information being represented by L points;
  
  a pitch-cycle waveform generation unit configured to generate a plurality of pitch-cycle waveforms by inverse-Fourier transform with a spectrum of the spectral envelope information; and
  
  a speech generation unit configured to generate a plurality of speech units by overlapping and adding the plurality of pitch-cycle waveforms, and to generate a speech waveform by concatenating the plurality of speech units, wherein the fusion unit is configured to correspond the spectral envelope parameter of each speech unit along a temporal direction, to average corresponded spectral envelope parameters to generate an averaged spectral envelope parameter, to select one representative speech unit from the plurality of speech units, and to set the spectral envelope parameter of the one representative speech unit as a representative spectral envelope parameter, to determine a boundary order from the representative spectral envelope parameter or the averaged spectral envelope parameter, and to mix the plurality of spectral envelope parameters by using the averaged spectral envelope parameter for a spectral envelope parameter having lower order than the boundary order and by using the representative spectral envelope parameter for a spectral envelope parameter having higher order than the boundary order.
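
The mixing rule at the end of claim 10 can be sketched for a single time-aligned frame as follows. This is a minimal illustration, not the patented implementation: the function name `fuse_parameters` is hypothetical, the temporal alignment is assumed to have been done already, the boundary order is passed in because the claim does not say how it is derived, and the order equal to the boundary is arbitrarily assigned to the representative side.

```python
def fuse_parameters(unit_params, representative_index, boundary_order):
    """Fuse time-aligned parameter vectors from several selected speech units.

    unit_params: list of equal-length coefficient vectors, one per unit.
    Orders below boundary_order take the average over all units; orders at
    or above it are copied from the representative unit (the tie at the
    boundary itself is an assumption of this sketch).
    """
    n_dims = len(unit_params[0])
    averaged = [sum(p[d] for p in unit_params) / len(unit_params)
                for d in range(n_dims)]
    representative = unit_params[representative_index]
    return [averaged[d] if d < boundary_order else representative[d]
            for d in range(n_dims)]


# Two selected units with four coefficients each; unit 1 is the representative.
fused = fuse_parameters([[1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0]],
                        representative_index=1, boundary_order=2)
# fused == [2.0, 3.0, 5.0, 6.0]
```

Averaging only the lower orders smooths the coarse, low-frequency part of the envelope across units, while the higher orders keep the detail of one real unit instead of a blurred mean.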

11. A method for speech processing, the method using a computer to execute computer-readable instructions stored in a memory, the method comprising:
- dividing a speech signal into each frame;
  
  extracting spectral envelope information of L-dimension from each frame by discrete Fourier transform, the spectral envelope information being represented by L points;
  
  extracting the spectral envelope information from the speech signal to generate a basis;
  
  minimizing a first evaluation function by changing the basis and a corresponding coefficient, the first evaluation being a sum of an error term and a first regularization term, the error term being a distortion between the spectral envelope information and a linear combination of the basis with the coefficient, the first regularization term being a sparseness of the coefficient, the sparseness being a smaller value when the coefficient is closer to zero;
  
  selecting the basis for which the first evaluation function is minimized;
  
  storing N bases (L>N>1) in a memory, each basis having a different frequency band having a maximum as a peak frequency in a spectral domain having L-dimension, a value corresponding to a frequency outside the frequency band along a frequency axis of the spectral domain being zero, and two frequency bands of which two peak frequencies are adjacent along the frequency axis partially overlapping;
  
  minimizing, by the computer, a distortion between the spectral envelope information and a linear combination of each basis with the coefficient for each of L points of the spectral envelope information by changing the coefficient; and
  
  setting the coefficient of each basis for which the distortion is minimized as a spectral envelope parameter of the spectral envelope information.
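
The first two steps of claim 11 (framing, then extracting L points by discrete Fourier transform) might look like the sketch below. It is only a stand-in: a plain log-amplitude DFT spectrum is used where the claim says "spectral envelope information", and the fixed frame length, the hop size, and the function names `split_frames` and `log_amplitude_spectrum` are assumptions. The claim does not specify how frames are positioned or how the envelope is smoothed.

```python
import cmath
import math


def split_frames(signal, frame_len, hop):
    """Cut the signal into fixed-length frames (fixed hop is an assumption)."""
    return [signal[s:s + frame_len]
            for s in range(0, len(signal) - frame_len + 1, hop)]


def log_amplitude_spectrum(frame, L):
    """Return L log-amplitude points (bins 0..L-1) of a DFT of size 2*(L-1)."""
    size = 2 * (L - 1)
    # Truncate or zero-pad the frame to the DFT size.
    padded = list(frame[:size]) + [0.0] * max(0, size - len(frame))
    spectrum = []
    for k in range(L):
        acc = sum(padded[n] * cmath.exp(-2j * math.pi * k * n / size)
                  for n in range(size))
        # Small floor avoids log(0) on empty bins.
        spectrum.append(math.log(abs(acc) + 1e-12))
    return spectrum


# A cosine sitting exactly on bin 4 of a 16-point DFT, so L = 9.
signal = [math.cos(2 * math.pi * 4 * n / 16) for n in range(32)]
frames = split_frames(signal, frame_len=16, hop=8)
spectra = [log_amplitude_spectrum(f, L=9) for f in frames]
```

Each entry of `spectra` is an L-point vector that could then be approximated by a linear combination of the stored bases.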

12. A method for speech synthesis, the method using a computer to execute computer-readable instructions stored in a memory, the method comprising:
- storing a spectral envelope parameter corresponding to a pitch-cycle waveform of each speech unit;
  
  storing an attribute information of each speech unit;
  
  dividing a phoneme sequence of input text into each synthesis unit;
  
  selecting at least one speech unit corresponding to each synthesis unit by using the attribute information;
  
  acquiring the spectral envelope parameter corresponding to the pitch-cycle waveform of each speech unit selected, the spectral envelope parameter having L-dimension;
  
  fusing a plurality of spectral envelope parameters to one spectral envelope parameter, when the plurality of spectral envelope parameters corresponding to pitch-cycle waveforms of a plurality of selected speech units is acquired;
  
  storing N bases (L>N>1) in a memory, each basis having a different frequency band having a maximum as a peak frequency in a spectral domain having L-dimension, a value corresponding to a frequency outside the frequency band along a frequency axis of the spectral domain being zero, and two frequency bands of which two peak frequencies are adjacent along the frequency axis partially overlapping;
  
  generating spectral envelope information by linearly combining the bases with the spectral envelope parameter, the spectral envelope information being represented by L points;
  
  generating, by the computer, a plurality of pitch-cycle waveforms by inverse-Fourier transform with a spectrum of the spectral envelope information;
  
  generating a plurality of speech units by overlapping and adding the plurality of pitch-cycle waveforms; and
  
  generating a speech waveform by concatenating the plurality of speech units, wherein the fusing step further comprises:
  
  corresponding the spectral envelope parameter of each speech unit along a temporal direction;
  
  averaging corresponded spectral envelope parameters to generate an averaged spectral envelope parameter;
  
  selecting one representative speech unit from the plurality of speech units;
  
  setting the spectral envelope parameter of the one representative speech unit as a representative spectral envelope parameter;
  
  determining a boundary order from the representative spectral envelope parameter or the averaged spectral envelope parameter; and
  
  mixing the plurality of spectral envelope parameters by using the averaged spectral envelope parameter for a spectral envelope parameter having lower order than the boundary order and by using the representative spectral envelope parameter for a spectral envelope parameter having higher order than the boundary order.
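
The synthesis steps of claim 12 (envelope generation, inverse Fourier transform, overlap-add) can be sketched as below. Several simplifications are assumed and are not stated in the claim: the parameters describe an amplitude envelope directly, the phase is set to zero, and pitch marks are given as sample indices. All function names are hypothetical.

```python
import cmath
import math


def envelope_from_parameters(bases, params):
    """L-point envelope as a linear combination of the stored bases."""
    return [sum(c * b[k] for c, b in zip(params, bases))
            for k in range(len(bases[0]))]


def pitch_cycle_waveform(amplitude):
    """Inverse DFT of a zero-phase spectrum built from L amplitude points."""
    L = len(amplitude)
    size = 2 * (L - 1)
    # Mirror bins 1..L-2 so the full spectrum is symmetric and the result real.
    full = list(amplitude) + list(amplitude[-2:0:-1])
    wave = []
    for n in range(size):
        acc = sum(full[k] * cmath.exp(2j * math.pi * k * n / size)
                  for k in range(size))
        wave.append(acc.real / size)
    # Rotate so the zero-phase peak sits in the middle of the waveform.
    half = size // 2
    return wave[half:] + wave[:half]


def overlap_add(waveforms, pitch_marks, length):
    """Sum each waveform into the output, centered on its pitch mark."""
    out = [0.0] * length
    for wave, mark in zip(waveforms, pitch_marks):
        start = mark - len(wave) // 2
        for i, v in enumerate(wave):
            if 0 <= start + i < length:
                out[start + i] += v
    return out


# Two toy bases over L = 5 points and one parameter vector per pitch mark.
bases = [[1.0, 1.0, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 1.0, 1.0]]
waves = [pitch_cycle_waveform(envelope_from_parameters(bases, p))
         for p in ([1.0, 0.5], [0.8, 0.6])]
unit = overlap_add(waves, pitch_marks=[4, 10], length=20)
```

Concatenating several such units, one per synthesis unit, would give the final speech waveform; that last step is omitted here.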

13. A non-transitory computer-readable medium storing a computer program for causing a computer to perform a method for a speech processing, the method comprising:
- dividing a speech signal into each frame;
  
  extracting a spectral envelope information of L-dimension from each frame by discrete Fourier transform, the spectral envelope information being represented by L points;
  
  extracting the spectral envelope information from the speech signal to generate a basis;
  
  minimizing a first evaluation function by changing the basis and a corresponding coefficient, the first evaluation being a sum of an error term and a first regularization term, the error term being a distortion between the spectral envelope information and a linear combination of the basis with the coefficient, the first regularization term being a sparseness of the coefficient, the sparseness being a smaller value when the coefficient is closer to zero;
  
  selecting the basis for which the first evaluation function is minimized;
  
  storing N bases (L>N>1) in a memory, each basis having a different frequency band having a maximum as a peak frequency in a spectral domain having L-dimension, a value corresponding to a frequency outside the frequency band along a frequency axis of the spectral domain being zero, and two frequency bands of which two peak frequencies are adjacent along the frequency axis partially overlapping;
  
  minimizing a distortion between the spectral envelope information and a linear combination of each basis with the coefficient for each of L points of the spectral envelope information by changing the coefficient; and
  
  setting the coefficient of each basis for which the distortion is minimized as a spectral envelope parameter of the spectral envelope information.

14. A non-transitory computer-readable medium storing a computer program for causing a computer to perform a method for speech synthesis, the method comprising:
- storing a spectral envelope parameter corresponding to a pitch-cycle waveform of each speech unit;
  
  storing an attribute information of each speech unit;
  
  dividing a phoneme sequence of input text into each synthesis unit;
  
  selecting at least one speech unit corresponding to each synthesis unit by using the attribute information;
  
  acquiring the spectral envelope parameter corresponding to the pitch-cycle waveform of each speech unit selected, the spectral envelope parameter having L-dimension;
  
  fusing a plurality of spectral envelope parameters to one spectral envelope parameter, when the plurality of spectral envelope parameters corresponding to pitch-cycle waveforms of a plurality of selected speech units is acquired;
  
  storing N bases (L>N>1) in a memory, each basis having a different frequency band having a maximum as a peak frequency in a spectral domain having L-dimension, a value corresponding to a frequency outside the frequency band along a frequency axis of the spectral domain being zero, and two frequency bands of which two peak frequencies are adjacent along the frequency axis partially overlapping;
  
  generating spectral envelope information by linearly combining the bases with the spectral envelope parameter, the spectral envelope information being represented by L points;
  
  generating a plurality of pitch-cycle waveforms by inverse-Fourier transform with a spectrum of the spectral envelope information;
  
  generating a plurality of speech units by overlapping and adding the plurality of pitch-cycle waveforms; and
  
  generating a speech waveform by concatenating the plurality of speech units, wherein the fusing step further comprises:
  
  corresponding the spectral envelope parameter of each speech unit along a temporal direction;
  
  averaging corresponded spectral envelope parameters to generate an averaged spectral envelope parameter;
  
  selecting one representative speech unit from the plurality of speech units;
  
  setting the spectral envelope parameter of the one representative speech unit as a representative spectral envelope parameter;
  
  determining a boundary order from the representative spectral envelope parameter or the averaged spectral envelope parameter; and
  
  mixing the plurality of spectral envelope parameters by using the averaged spectral envelope parameter for a spectral envelope parameter having lower order than the boundary order and by using the representative spectral envelope parameter for a spectral envelope parameter having higher order than the boundary order.

Current Assignee: Kabushiki Kaisha Toshiba (Toshiba Corporation), Toshiba Digital Solutions Corporation (Toshiba Corporation)
Original Assignee: Kabushiki Kaisha Toshiba (Toshiba Corporation)
Inventors: Tamura, Masatsune; Kagoshima, Takehiko; Tsuchiya, Katsumi
Primary Examiner(s): Lerner, Martin
Application Number: US12/327,399
Publication Number: US 20090144053A1
Time in Patent Office: 1,455 Days
Field of Search: 704/200.1, 704/203, 704/204, 704/205, 704/207, 704/229, 704/230, 704/258, 704/260, 704/268, 704/220
US Class Current: 704/205
CPC Class Codes: G10L 13/06 Elementary speech units use...
