Method and system for speech reconstruction from speech recognition features, pitch and voicing with resampled basis functions providing reconstruction of the spectral envelope

US 6,725,190 B1
Filed: 11/02/1999
Issued: 04/20/2004
Est. Priority Date: 11/02/1999
Status: Expired due to Term



Abstract

A speech reconstruction method and system for converting a series of binned spectra, or functions thereof such as the Mel Frequency Cepstral Coefficients (MFCC), of an original digitized speech signal into a reconstructed speech signal, where each binned spectrum has a respective pitch value and voicing decision. The binned spectra are derived from the original digitized speech signal at successive time instances by multiplying each estimate of the spectral envelope by a predetermined set of frequency domain window functions and computing the integrals thereof. At each time instance, harmonic frequencies and weights are generated according to the respective pitch value and voicing decision. Basis functions having bounded supports on the frequency axis are each sampled at all harmonic frequencies that lie within their support and multiplied by the respective harmonic weights. The sampled basis functions are combined with respective phases, generated according to the pitch value, voicing decision and possibly the binned spectrum, resulting in a complex line spectrum for each basis function. Gain coefficients of the basis functions are generated, and each point of the respective complex line spectrum is multiplied by the respective basis function coefficient. The complex line spectra are summed to generate, for each time instance, a single complex line spectrum with values for all harmonic frequencies. A time signal is generated from the complex line spectra computed at successive instances of time.
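
The binning step described in the abstract (multiplying the spectral envelope by frequency domain windows and summing) can be illustrated with a minimal Python sketch. The triangular window shape, the linear spacing of the window centers, and the names `hat_window` and `binned_spectrum` are assumptions made for illustration only; they are not taken from the patent, which leaves the window set open (and uses Mel spacing in one dependent claim).

```python
def hat_window(center, half_width, n_points):
    """Triangular ("hat") window over frequency indices 0..n_points-1.

    Assumed shape: 1 at `center`, falling linearly to 0 at center +/- half_width.
    """
    return [max(0.0, 1.0 - abs(i - center) / half_width) for i in range(n_points)]


def binned_spectrum(se, windows):
    """Compute BI(k) = sum_i SE(i) * BW(i, k) for every window k."""
    return [sum(s * w for s, w in zip(se, bw)) for bw in windows]


# Toy input: a flat spectral envelope and three linearly spaced windows.
se = [1.0] * 9
windows = [hat_window(c, 2.0, len(se)) for c in (2, 4, 6)]
bi = binned_spectrum(se, windows)
print(bi)
```

Each window here is non-zero over only three frequency indices, which mirrors the "narrow range of frequencies" wording of the claims.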


24 Claims

1. A speech reconstruction method for converting a series of feature vectors and a series of respective pitch values and voicing decisions of an original input speech signal into a speech signal, the feature vectors being obtained as follows:
- i) deriving at successive instances of time an estimate of a spectral envelope SE(i), i being a frequency index, of the digitized original speech signal,

  ii) multiplying each estimate of the spectral envelope by a predetermined set of frequency domain window functions, BW(i,k), i being a frequency index and k being the window function index, wherein each window is non-zero over a narrow range of frequencies, and computing the integrals thereof, according to the expression;

  $BI (k) = \sum_{i} SE (i) \cdot BW (i, k),$

  where BI(k) is defined as the k-th component or “bin” of a “binned spectrum”, and

  iii) assigning said integrals or a set of pre-determined functions thereof to respective components of a corresponding feature vector in a series of feature vectors;

  said speech reconstruction method comprising;

  (a) converting each feature vector into a binned spectrum,

  (b) generating harmonic frequencies and weights according to the corresponding pitch and voicing decision,

  (c) generating for each harmonic frequency a respective phase, depending on the corresponding pitch value and voicing decision and possibly on the binned spectrum,

  (d) sampling a predetermined set of basis functions each being a function in a set of frequency domain functions with bounded supports at all harmonic frequencies which are within its support, and multiplying by the respective harmonic weight, so as to produce for each sampled basis function a respective line spectrum having multiple components,

  (e) combining each component of each respective line spectrum with the respective phase thereof so as to produce a complex line spectrum for each basis function,

  (f) generating gain coefficients of the basis functions,

  (g) multiplying the complex line spectrum of each basis function by the respective basis function gain coefficient, and summing up all resulting complex line spectra to generate a single complex line spectrum having a respective component for each of the harmonic frequencies, and

  (h) generating a time signal from complex line spectra computed at successive instances of time.
2. The method according to claim 1, wherein the step of generating the gain coefficients of the basis functions includes:
3. The method according to claim 2, wherein:
- the frequency domain window functions BW(·,k) used for computing the binned spectrum are hat functions of the Mel Frequency spaced evenly on the Mel frequency axis, the feature vectors contain Mel frequency cepstral coefficients (MFCC) which are determined by computing the discrete cosine transform (DCT) of the log of the binned spectrum, and step (a) of converting the feature vector into a binned spectrum includes the step of computing the inverse DCT of the Mel Cepstral coefficients followed by antilog to obtain the binned spectrum.
4. The method according to claim 2, wherein the estimate of the spectral envelope of the signal SE(i), i being a frequency index corresponding to the i-th discrete Fourier transform (DFT) index, is computed by taking the absolute value of the windowed Fourier transform of the signal, said method further including:
- (k) computing the spectral envelope of each basis function, denoted by SEB(i,l), i being a frequency index corresponding to the i-th discrete Fourier transform index and l being the index of the l-th harmonic frequency, in accordance with;

  $SEB (i, l) = \left| \sum_{j} BF (j, l) \cdot W (i \cdot f_{0} - f_{j}) \right|,$

  where W(f) is the Fourier transform of the window, f₀ is the DFT resolution and BF(j,l) is the l-th basis function sampled at the j-th harmonic frequency f_j, multiplied by the corresponding harmonic weight and combined with the corresponding phase, and

  (l) computing the binned basis functions, denoted by BB(k,l), k being the bin index and l being the basis function index, by integrating the spectral envelopes SEB(i,l) over the bin windows in accordance with;

  $BB (k, l) = \sum_{i} SEB (i, l) \cdot BW (i, k),$

  where BW(i,k) is the bin window function, i being a frequency index and k being the bin index,

  (m) generating the basis function coefficients x(l) by performing the following minimization;

  $\min_{x} \sum_{k} {\left(\sum_{l} x (l) \cdot BB (k, l) - BI (k)\right)}^{2}$ subject to x(l) ≥ 0,

  where x(l) is the l-th solution coefficients and BI(k) is the k-th component of the binned spectrum of the original speech signal.
5. The method according to claim 1, wherein the basis functions have bounded supports, and the union of the supports covers the same frequency range covered by the union of the supports of the frequency domain bin windows, used for computing the binned spectrum.
6. The method according to claim 5, wherein the l-th basis function BF(·,l) is a convex function of the l-th frequency domain bin window BW(·,l), used for computing the binned spectrum.
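
The MFCC relationship recited in claim 3 (DCT of the log of the binned spectrum, inverted by an inverse DCT followed by antilog) can be sketched as follows. This is a minimal illustration under stated assumptions: an unnormalised DCT-II is used, all cepstral coefficients are kept, and the function names are hypothetical. Practical front ends often truncate the cepstrum, in which case the inverse yields only a smoothed approximation of the binned spectrum.

```python
import math


def dct2(x):
    """Unnormalised DCT-II: c[n] = sum_k x[k] * cos(pi * n * (k + 0.5) / K)."""
    K = len(x)
    return [sum(x[k] * math.cos(math.pi * n * (k + 0.5) / K) for k in range(K))
            for n in range(K)]


def idct2(c):
    """Exact inverse of dct2 above (a scaled DCT-III)."""
    K = len(c)
    return [c[0] / K
            + (2.0 / K) * sum(c[n] * math.cos(math.pi * n * (k + 0.5) / K)
                              for n in range(1, K))
            for k in range(K)]


def bins_to_mfcc(bi):
    """Feature extraction direction: DCT of the log of the binned spectrum."""
    return dct2([math.log(b) for b in bi])


def mfcc_to_bins(mfcc):
    """Step (a) direction: inverse DCT followed by antilog."""
    return [math.exp(v) for v in idct2(mfcc)]


# Round trip on a toy binned spectrum (all bins must be positive for the log).
bi = [1.0, 2.0, 4.0, 3.0]
recovered = mfcc_to_bins(bins_to_mfcc(bi))
print(recovered)
```

With no truncation the round trip recovers the bins up to floating-point error, which is why the sketch keeps every coefficient.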

7. A method for accepting a series of indices of speech frames in a speech database, a series of respective pitch values and voicing decisions and a series of respective energy values, and generating speech therefrom, the method comprising:
- (a) creating a database containing coded or uncoded feature vectors, the feature vectors being obtained as follows;
  
  i) deriving at successive instances of time an estimate of the spectral envelope of the digitized original speech signal, ii) multiplying each estimate of the spectral envelope by a predetermined set of frequency domain window functions, wherein each window is non-zero over a narrow range of frequencies, and computing the integrals thereof, and iii) assigning said integrals or a set of pre-determined functions thereof to respective components of a corresponding feature vector in a series of feature vectors;
  
  (b) producing a series of feature vectors from frames selected from the database according to the series of indices and the series of respective energy values, and (c) reconstructing speech from the series of feature vectors and the series of respective pitch values and voicing decisions by;
  
  i) converting each feature vector into a binned spectrum, ii) generating harmonic frequencies and weights according to the corresponding pitch and voicing decision, iii) generating for each harmonic frequency a respective phase, depending on the corresponding pitch value and voicing decision and possibly on the binned spectrum, iv) sampling a predetermined set of basis functions each being a function in a set of frequency domain functions with bounded supports at all harmonic frequencies which are within its support, and multiplying by the respective harmonic weight, so as to produce for each sampled basis function a respective line spectrum having multiple components, v) combining each component of each respective line spectrum with the respective phase thereof so as to produce a complex line spectrum for each basis function, vi) generating gain coefficients of the basis functions, vii) multiplying each complex line spectrum of each basis function by the respective basis function gain coefficient, and summing up all resulting complex line spectra to generate a single complex line spectrum having a respective component for each of the harmonic frequencies, and viii) generating a time signal from complex line spectra computed at successive instances of time.
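
The reconstruction steps recited above (sampling bounded-support basis functions at the harmonic frequencies, applying weights and phases, scaling by gain coefficients, summing, and producing a time signal) can be sketched for a single voiced frame. This is an illustrative sketch only: the triangular basis shape, the unit default weights, the zero default phases, the handling of voiced frames only, and all function names are assumptions rather than details taken from the claims.

```python
import cmath
import math


def hat(f, lo, mid, hi):
    """Triangular basis function with bounded support (lo, hi); assumed shape."""
    if lo < f <= mid:
        return (f - lo) / (mid - lo)
    if mid < f < hi:
        return (hi - f) / (hi - mid)
    return 0.0


def complex_line_spectrum(pitch_hz, max_hz, bases, gains, weights=None, phases=None):
    """Sum the gain-scaled, phase-combined sampled basis functions per harmonic."""
    n_harm = int(max_hz // pitch_hz)
    freqs = [pitch_hz * (j + 1) for j in range(n_harm)]   # harmonic frequencies
    weights = weights or [1.0] * n_harm                    # assumed unit weights
    phases = phases or [0.0] * n_harm                      # assumed zero phases
    total = [0j] * n_harm
    for (lo, mid, hi), gain in zip(bases, gains):
        for j, f in enumerate(freqs):
            sample = hat(f, lo, mid, hi) * weights[j]      # sample and weight
            total[j] += gain * sample * cmath.exp(1j * phases[j])  # phase, gain, sum
    return freqs, total


def synthesize_frame(freqs, spectrum, sample_rate, n_samples):
    """Single-frame time signal obtained by adding up the sine waves."""
    return [sum(abs(c) * math.cos(2 * math.pi * f * t / sample_rate + cmath.phase(c))
                for f, c in zip(freqs, spectrum))
            for t in range(n_samples)]


bases = [(0.0, 100.0, 200.0), (100.0, 200.0, 300.0),
         (200.0, 300.0, 400.0), (300.0, 400.0, 500.0)]
freqs, spectrum = complex_line_spectrum(100.0, 450.0, bases, [1.0, 2.0, 3.0, 4.0])
frame = synthesize_frame(freqs, spectrum, 8000, 80)
```

A full decoder would also need to join successive frames (for example by overlap-add) and treat unvoiced frames separately; neither is shown here.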

8. A speech reconstruction device for converting a series of feature vectors and a series of respective pitch values and voicing decisions of an original input speech signal into a reconstructed speech signal, the feature vectors being obtained as follows:
- (i) deriving at successive instances of time an estimate of a spectral envelope SE(i), i being a frequency index, of the digitized original speech signal, (ii) multiplying each estimate of the spectral envelope by a predetermined set of frequency domain window functions, BW(i,k), i being a frequency index and k being the window function index, wherein each window is non-zero over a narrow range of frequencies, and computing the integrals thereof, according to the expression;
  
  $BI (k) = \sum_{i} SE (i) \cdot BW (i, k)$ where BI(k) is the k-th component or “bin” of a “binned spectrum”, and (iii) assigning said integrals or a set of pre-determined functions thereof to respective components of a corresponding feature vector in a series of feature vectors;
  
  said device comprising;
  
  an input stage for inputting said series of feature vectors and a respective series of pitch values and voicing decisions, and converting the feature vectors into binned spectra, a frequency and weight generator coupled to the input stage for generating harmonic frequencies and weights, a phase generator coupled to the input stage for generating phases for each harmonic frequency, a basis function sampler for sampling a predetermined set of basis functions each being a function in a set of frequency domain functions with bounded supports at all harmonic frequencies which are within its support, and multiplying by the respective harmonic weights, so as to produce for each sampled basis function a respective line spectrum having multiple components, a phase combiner coupled to the basis function sampler and the phase generator for combining each component of the respective line spectrum with the respective phase thereof so as to produce a complex line spectrum for each basis function, a coefficient generator for generating gain coefficients of the basis functions, a linear combination unit for multiplying each complex line spectrum of each basis function by the respective basis function gain coefficient and summing up all the resulting complex line spectra to generate a complex line spectrum with respective components for all harmonic frequencies, and a line spectrum to signal converter coupled to the linear combination unit for generating a time signal from a series of complex line spectra.
9. The device according to claim 8, wherein:
10. The device according to claim 8, wherein the basis functions have bounded supports, and the union of the supports covers the same frequency range covered by the union of the supports of the frequency domain bin windows, used for computing the binned spectrum.
11. The device according to claim 10, wherein the l-th basis function BF(·,l) is a convex function of the l-th frequency domain bin window BW(·,l), used for computing the binned spectrum.
12. The device according to claim 8, including:
- an equation coefficient generator coupled to the phase combiner for computing the bins of the basis functions by the following two step procedure or any other equivalent procedure;
  
  i) converting each basis function into a single time frame signal by adding up the sine waves corresponding to its respective complex line spectrum, and ii) calculating the bins on the single time frame signal corresponding to each basis function in an identical manner as was done for the original signal; and
  
  an equation solver coupled to the equation coefficient generator for deriving and solving equations which express the condition that the coefficients of the basis functions are all non negative, and that the sum of the binned basis functions, weighted by their coefficients, is as close as possible in some norm to the bins of the original speech signal.
13. The device according to claim 12, wherein:
- the estimate of the spectral envelope of the signal SE(i), i being a frequency index corresponding to the i-th discrete Fourier transform (DFT) index, is computed by taking the absolute value of the windowed Fourier transform of the signal, and the equation coefficient generator for computing the binned basis functions includes;

  a spectral envelope generator for generating a spectral envelope for each basis function, said spectral envelope denoted by SEB(i,l), i being a frequency index corresponding to the i-th discrete Fourier transform index and l being the basis function index, according to the following expression;

  $SEB (i, l) = \left| \sum_{j} BF (j, l) \cdot W (i \cdot f_{0} - f_{j}) \right|,$

  where W(f) is the Fourier transform of the window, f₀ is the DFT resolution and BF(j,l) is the l-th basis function sampled at the j-th harmonic frequency f_j, multiplied by the corresponding harmonic weight and combined with the corresponding phase, and an integrator for computing the bins of the basis functions, said bins denoted by BB(k,l), k being the bin index and l being the basis function index, by integrating the spectral envelopes SEB(i,l) over the bin windows in accordance with;

  $BB (k, l) = \sum_{i} SEB (i, l) \cdot BW (i, k),$

  where BW(i,k) is the bin window function, i being a frequency index and k being the bin index, and wherein the equation solver is adapted to perform the minimization;

  $\min_{x} \sum_{k} {(BI (k) - \sum_{l} x (l) \cdot BB (k, l))}^{2}$ subject to x(l) ≥ 0;

  where x(l) is the l-th solution coefficients and BI(k) is the k-th component of the binned spectrum of the original speech signal.
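
The constrained minimization recited above is a non-negative least squares problem. As a rough illustration, it can be solved with projected gradient descent, as in the sketch below. The solver choice is an assumption made for this example: the claim does not name an algorithm, and the function name, iteration count and step-size rule are hypothetical.

```python
def nnls_projected_gradient(bb, bi, iters=2000):
    """Minimise sum_k (BI(k) - sum_l x(l) * BB(k, l))^2 subject to x(l) >= 0.

    bb[k][l] holds the binned basis functions BB(k, l); bi[k] holds the target bins.
    """
    n_k, n_l = len(bb), len(bb[0])
    # The gradient's Lipschitz constant is at most 2 * ||BB||_F^2, so stepping by
    # its reciprocal keeps the iteration stable.
    step = 1.0 / (2.0 * sum(v * v for row in bb for v in row))
    x = [0.0] * n_l
    for _ in range(iters):
        resid = [sum(bb[k][l] * x[l] for l in range(n_l)) - bi[k] for k in range(n_k)]
        grad = [2.0 * sum(bb[k][l] * resid[k] for k in range(n_k)) for l in range(n_l)]
        # Gradient step followed by projection onto the non-negative orthant.
        x = [max(0.0, x[l] - step * grad[l]) for l in range(n_l)]
    return x


# Toy system whose unconstrained least-squares solution has a negative component,
# so the non-negativity constraint is active.
bb = [[1.0, 0.5],
      [0.5, 1.0]]
bi = [2.0, 0.5]
x = nnls_projected_gradient(bb, bi)
print(x)
```

For this toy system the unconstrained solution would make the second coefficient negative, so the constrained solution pins it at zero and refits the first coefficient alone.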

14. A decoder for decoding speech, said decoder being responsive to a received bit stream representing an encoded series of feature vectors, pitch values and voicing decisions, the decoder including:
- a decompression module for decompressing the series of respective feature vectors, pitch values and voicing decisions, a conversion unit for converting the feature vectors into binned spectra, a frequency and weight generator responsive to the pitch values and voicing decisions for generating harmonic frequencies and weights, a phase generator responsive to the pitch values, voicing decisions and possibly to the binned spectra for generating phases for each harmonic frequency, a basis function sampler for sampling a predetermined set of basis functions each being a function in a set of frequency domain functions with bounded supports at all harmonic frequencies which are within its support, and multiplying by the respective harmonic weights, so as to produce for each sampled basis function a respective line spectrum having multiple components, a phase combining device coupled to the basis function sampler and the phase generator for combining each component of the respective line spectrum with the respective phase thereof so as to produce a complex line spectrum for each basis function, a coefficient generator for generating gain coefficients of the basis functions, a linear combination unit for multiplying each complex line spectrum of each basis function by the respective basis function gain coefficient and summing up all the resulting complex line spectra to generate a complex line spectrum with respective components for all harmonic frequencies, and a line spectrum to signal converter coupled to the linear combination unit for generating a time signal from a series of complex line spectra.

15. A speech coding/decoding system comprising:
- an encoder for coding speech, said encoder being responsive to an input speech signal and including;
  
  a feature extraction module for computing feature vectors from the input speech signal at successive instances of time, the feature extraction module including;
  
  a spectrum estimator for deriving at each said instances of time an estimate of the spectral envelope of the input speech signal, an integrator coupled to the spectrum estimator for multiplying the spectral envelope by a predetermined set of frequency domain window functions, wherein each window occupies a narrow range of frequencies, and computing the integral thereof, and an assignment unit coupled to the integrator for deriving a set of predetermined functions of said integrals and assigning to respective components of a corresponding feature vector in said series of feature vectors;
  
  a pitch detector for computing respective pitch values and voicing decisions at said successive instances of time, and a compression module for compressing the series of respective feature vectors, pitch values and voicing decisions into a bit-stream;
  
  a decoder for decoding speech, said decoder being responsive to a received bit stream representing an encoded series of respective feature vectors, pitch values and voicing decisions, the decoder including;
  
  a decompression module for decompressing the series of respective feature vectors, pitch values and voicing decisions, a conversion unit for converting the feature vectors into binned spectra, a frequency and weight generator responsive to the pitch values and voicing decisions for generating harmonic frequencies and weights, a phase generator responsive to the pitch values, voicing decisions and possibly to the binned spectra for generating phases for each harmonic frequency, a basis function sampler for sampling a predetermined set of basis functions each being a function in a set of frequency domain functions with bounded supports at all harmonic frequencies which are within its support, and multiplying by the respective harmonic weights, so as to produce for each sampled basis function a respective line spectrum having multiple components, a phase combining device coupled to the basis function sampler and the phase generator for combining each component of the respective line spectrum with the respective phase thereof so as to produce a complex line spectrum for each basis function, a coefficient generator for generating gain coefficients of the basis functions, a linear combination unit for multiplying each complex line spectrum of each basis function by the respective basis function gain coefficient and summing up all the resulting complex line spectra to generate a complex line spectrum with respective components for all harmonic frequencies, and a line spectrum to signal converter coupled to the linear combination unit for generating a time signal from a series of complex line spectra.

16. A dual purpose speech recognition/playback system, for continuous speech recognition and reproduction of an encoded speech signal, said system comprising a decoder and a recognition unit:
- the decoder for decoding and playback of encoded speech being responsive to a received bit stream representing an encoded series of respective feature vectors, pitch values and voicing decisions, the decoder including;
  
  a decompression module for decompressing the series of respective feature vectors, pitch values and voicing decisions, a conversion unit for converting the feature vectors into binned spectra, a frequency and weight generator responsive to the pitch values and voicing decisions for generating harmonic frequencies and weights, a phase generator responsive to the pitch values, voicing decisions and possibly to the binned spectra for generating phases for each harmonic frequency, a basis function sampler for sampling a predetermined set of basis functions each being a function in a set of frequency domain functions with bounded supports at all harmonic frequencies which are within its support, and multiplying by the respective harmonic weights, so as to produce for each sampled basis function a respective line spectrum having multiple components, a phase combining device coupled to the basis function sampler and the phase generator for combining each component of the respective line spectrum with the respective phase thereof so as to produce a complex line spectrum for each basis function, a coefficient generator for generating gain coefficients of the basis functions, a linear combination unit for multiplying each complex line spectrum of each basis function by the respective basis function gain coefficient and summing up all the resulting complex line spectra to generate a complex line spectrum with respective components for all harmonic frequencies, and a line spectrum to signal converter coupled to the linear combination unit for generating a time signal from a series of complex line spectra; and
  
  the recognition unit being responsive to the decompressed feature vectors for continuous speech recognition.
17. The dual purpose recognition/playback system of claim 16, wherein the recognition unit is further responsive to the decompressed pitch values and voicing decisions for continuous speech recognition.

18. A speech recognition system comprising:
- an encoder for coding speech so as to derive low bit rate bit stream, said encoder being responsive to an input speech signal and including;
  
  a feature extraction module for computing feature vectors from the input speech signal at successive instances of time, the feature extraction module including;
  
  a spectrum estimator for deriving at each said instances of time an estimate of the spectral envelope of the input speech signal, an integrator coupled to the spectrum estimator for multiplying the spectral envelope by a predetermined set of frequency domain window functions, wherein each window occupies a narrow range of frequencies, and computing the integral thereof, and an assignment unit coupled to the integrator for deriving a set of predetermined functions of said integrals and assigning to respective components of a corresponding feature vector in said series of feature vectors;
  
  a pitch detector for computing respective pitch values and voicing decisions at said successive instances of time, a compression module for compressing the series of respective feature vectors, pitch values and voicing decisions into a bit-stream, a transmitter coupled to the encoder for transmitting the low bit rate bit stream, a recognition unit responsive to the low bit rate bit stream for decompressing the feature vectors and performing continuous speech recognition on the feature vectors, and a transmitter within the speech recognition unit for retransmitting the results of the recognition and the low bit rate bit stream to a remote device for displaying the results of the recognition;
  
  said remote device including a speech decoder, comprising;
  
  a decompression module for decompressing the series of respective feature vectors, pitch values and voicing decisions, a conversion unit for converting the feature vectors into binned spectra, a frequency and weight generator responsive to the pitch values and voicing decisions for generating harmonic frequencies and weights, a phase generator responsive to the pitch values, voicing decisions and possibly to the binned spectra for generating phases for each harmonic frequency, a basis function sampler for sampling a predetermined set of basis functions each being a function in a set of frequency domain functions with bounded supports at all harmonic frequencies which are within its support, and multiplying by the respective harmonic weights, so as to produce for each sampled basis function a respective line spectrum having multiple components, a phase combiner coupled to the basis function sampler and the phase generator for combining each component of the respective line spectrum with the respective phase thereof so as to produce a complex line spectrum for each basis function, a coefficient generator for generating gain coefficients of the basis functions, a linear combination unit for multiplying each complex line spectrum of each basis function by the respective basis function gain coefficient and summing up all the resulting complex line spectra to generate a complex line spectrum with respective components for all harmonic frequencies, and a line spectrum to signal converter coupled to the linear combination unit for generating a time signal from a series of complex line spectra.
19. The recognition system of claim 18, wherein:

20. A speech generator for accepting a series of indices of speech frames in a speech database, a series of respective pitch values and voicing decisions and a series of respective energy values and generating speech, the device comprising:
- a database containing coded or uncoded feature vectors, the feature vectors being obtained as follows;
  
  i) deriving at successive instances of time an estimate of the spectral envelope of the digitized original speech signal, ii) multiplying each estimate of the spectral envelope by a predetermined set of frequency domain window functions, wherein each window is non-zero over a narrow range of frequencies, and computing the integrals thereof, and iii) assigning said integrals or a set of predetermined functions thereof to respective components of a corresponding feature vector in a series of feature vectors;
  
  a features generator responsive to the series of indices and the series of respective energy values for producing a series of feature vectors using frames selected from the database, and a speech reconstruction unit for reconstructing speech from a series of feature vectors and the series of respective pitch values and voicing decisions, said reconstruction unit comprising;
  
  a conversion unit for converting the feature vectors into binned spectra, a frequency and weight generator responsive to the pitch values and voicing decisions for generating harmonic frequencies and weights, a phase generator responsive to the pitch values, voicing decisions and possibly to the binned spectra for generating phases for each harmonic frequency, a basis function sampler for sampling a predetermined set of basis functions each being a function in a set of frequency domain functions with bounded supports at all harmonic frequencies which are within its support, and multiplying by the respective harmonic weights, so as to produce for each sampled basis function a respective line spectrum having multiple components, a phase combiner coupled to the basis function sampler and the phase generator for combining each component of the respective line spectrum with the respective phase thereof so as to produce a complex line spectrum for each basis function, a coefficient generator for generating gain coefficients of the basis functions, a linear combination unit for multiplying each complex line spectrum of each basis function by the respective basis function gain coefficient and summing up all the resulting complex line spectra to generate a complex line spectrum with respective components for all harmonic frequencies, and a line spectrum to signal converter coupled to the linear combination unit for generating a time signal from a series of complex line spectra.
21. The speech generator according to claim 20, being an output block of a speech synthesis system.

22. A computer program product comprising a computer useable medium having computer readable program code embodied therein for converting a series of feature vectors and a series of respective pitch values and voicing decisions of an original input speech signal into a reconstructed speech signal, the feature vectors being obtained as follows:
- i) deriving at successive instances of time an estimate of the spectral envelope of the digitized original speech signal, ii) multiplying each estimate of the spectral envelope by a predetermined set of frequency domain window functions, wherein each window is non-zero over a narrow range of frequencies, and computing the integrals thereof, and iii) assigning said integrals or a set of predetermined functions thereof to respective components of a corresponding feature vector in a series of feature vectors;
  
  said computer program product comprising:
  
  computer readable program code for inputting said series of feature vectors and a respective series of pitch values and voicing decisions, and converting the feature vectors into binned spectra,
  
  computer readable program code for causing the computer to generate harmonic frequencies and weights according to the pitch value and voicing decision,
  
  computer readable program code for causing the computer to generate phases for each harmonic frequency depending on the pitch value, voicing decision and possibly on the binned spectrum,
  
  computer readable program code for causing the computer to sample a predetermined set of basis functions each being a function in a set of frequency domain functions with bounded supports at all harmonic frequencies which are within its support, and multiply by the respective harmonic weights, so as to produce for each sampled basis function a respective line spectrum having multiple components,
  
  computer readable program code for causing the computer to combine each component of the respective line spectrum with the respective phase thereof so as to produce a complex line spectrum for each basis function,
  
  computer readable program code for causing the computer to generate coefficients of the basis functions,
  
  computer readable program code for causing the computer to multiply each complex line spectrum of each basis function by the respective basis function coefficient and sum up all the resulting complex line spectra to generate a complex line spectrum with respective components for all harmonic frequencies, and
  
  computer readable program code for causing the computer to generate a time signal from a series of complex line spectra.
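
The feature extraction recited in the preamble of claim 22 (steps i to iii) amounts to computing the binned spectrum BI(k) = Σ_i SE(i)·BW(i,k) defined in claim 1. The sketch below illustrates that sum only. The uniformly spaced triangular windows and the names `triangular_windows` and `binned_spectrum` are assumptions chosen for illustration; the claims require only that each window be non-zero over a narrow range of frequencies.

```python
def triangular_windows(num_bins, num_freqs):
    """Build BW as a list of windows, indexed windows[k][i].

    Overlapping triangles on a uniform frequency grid are assumed here;
    the window shape and spacing are hypothetical choices.
    """
    step = (num_freqs - 1) / (num_bins + 1)
    windows = []
    for k in range(num_bins):
        center = step * (k + 1)
        windows.append(
            [max(0.0, 1.0 - abs(i - center) / step) for i in range(num_freqs)]
        )
    return windows


def binned_spectrum(spectral_envelope, windows):
    """Compute BI(k) = sum over i of SE(i) * BW(i, k) for every bin k."""
    return [
        sum(se * w for se, w in zip(spectral_envelope, window))
        for window in windows
    ]


# A flat envelope over 9 frequency points, binned into 3 windows.
envelope = [1.0] * 9
bins = binned_spectrum(envelope, triangular_windows(3, 9))
```

Step iii) then assigns these values, or predetermined functions of them, to the components of the feature vector; which functions are applied is not fixed by this sketch.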

23. A program storage device readable by machine, tangibly embodying a program of instructions executable by the machine to perform method steps for converting a series of feature vectors and a series of respective pitch values and voicing decisions of an original input speech signal into a reconstructed speech signal, the feature vectors being obtained as follows:
- i) deriving at successive instances of time an estimate of the spectral envelope of the digitized original speech signal,
  
  ii) multiplying each estimate of the spectral envelope by a predetermined set of frequency domain window functions, wherein each window is non-zero over a narrow range of frequencies, and computing the integrals thereof, and
  
  iii) assigning said integrals or a set of pre-determined functions thereof to respective components of a corresponding feature vector in a series of feature vectors,
  
  said method steps comprising:
  
  (a) converting each feature vector into a binned spectrum,
  
  (b) generating harmonic frequencies and weights according to the corresponding pitch and voicing decision,
  
  (c) generating for each harmonic frequency a respective phase, depending on the corresponding pitch value and voicing decision and possibly on the binned spectrum,
  
  (d) sampling a predetermined set of basis functions each being a function in a set of frequency domain functions with bounded supports at all harmonic frequencies which are within its support, and multiplying by the respective harmonic weight, so as to produce for each sampled basis function a respective line spectrum having multiple components,
  
  (e) combining each component of each respective line spectrum with the respective phase thereof so as to produce a complex line spectrum for each basis function,
  
  (f) generating gain coefficients of the basis functions,
  
  (g) multiplying each complex line spectrum of each basis function by the respective basis function gain coefficient, and summing up all resulting complex line spectra to generate a single complex line spectrum having a respective component for each of the harmonic frequencies, and
  
  (h) generating a time signal from complex line spectra computed at successive instances of time.
- - 24. The program storage device according to claim 23, wherein the method steps executable by the machine for generating the gain coefficients of the basis functions include:

Current Assignee: Nuance Communications, Inc. (Microsoft Corporation)

Original Assignee: International Business Machines Corporation

Inventors: Cohen, Gilad; Hoory, Ron; Chazan, Dan

Primary Examiner(s): Banks-Harold, Marsha D.

Assistant Examiner(s): Storm, Donald L.

Application Number: US09/432,081

Time in Patent Office: 1,631 Days

Field of Search: 704/208, 704/203, 704/214, 704/205, 704/207

US Class Current: 704/205

CPC Class Codes:

G10L 13/07 Concatenation rules

G10L 25/18 the extracted parameters be...
