Multiband harmonic transform coder
Abstract
A speech signal is encoded into a set of encoded bits by digitizing the speech signal to produce a sequence of digital speech samples that are divided into a sequence of frames, each of which spans multiple digital speech samples. A set of speech model parameters is estimated for a frame. The speech model parameters include voicing parameters dividing the frame into voiced and unvoiced regions, at least one pitch parameter representing pitch for at least the voiced regions of the frame, and spectral parameters representing spectral information for at least the voiced regions of the frame. The speech model parameters are quantized to produce parameter bits. The frame is also divided into one or more subframes for which transform coefficients are computed. The transform coefficients for unvoiced regions of the frame are quantized to produce transform bits. The parameter bits and the transform bits are included in the set of encoded bits.
128 Citations
42 Claims
1. A method of encoding a speech signal into a set of encoded bits, the method comprising:
digitizing the speech signal to produce a sequence of digital speech samples;
dividing the digital speech samples into a sequence of frames, each of the frames spanning multiple digital speech samples;
estimating a set of speech model parameters for a frame, wherein the speech model parameters include voicing parameters dividing the frame into voiced and unvoiced regions, at least one pitch parameter representing pitch for at least the voiced regions of the frame, and spectral parameters representing spectral information for at least the voiced regions of the frame;
quantizing the speech model parameters to produce parameter bits;
dividing the frame into one or more subframes and computing transform coefficients for the digital speech samples representing the subframes;
quantizing the transform coefficients in unvoiced regions of the frame to produce transform bits; and
including the parameter bits and the transform bits in the set of encoded bits.
companding all sets of spectral magnitudes in the frame, using a companding operation such as the logarithm, to produce sets of companded spectral magnitudes;
quantizing the last set of the companded spectral magnitudes in the frame;
interpolating between the quantized last set of companded spectral magnitudes in the frame and a quantized set of companded spectral magnitudes from a prior frame to form interpolated spectral magnitudes;
determining a difference between a set of companded spectral magnitudes and the interpolated spectral magnitudes; and
quantizing the determined difference between the spectral magnitudes.
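The companding, interpolation, and difference-quantization steps recited above can be sketched in Python. The `quantize` callable and the linear interpolation weights are illustrative assumptions; the claims do not specify a particular quantizer or interpolation rule.

```python
import numpy as np

def quantize_frame_magnitudes(mag_sets, prior_quantized_log, quantize):
    # Compand every set of spectral magnitudes (the logarithm, as named in the claim).
    log_sets = [np.log(np.maximum(m, 1e-9)) for m in mag_sets]
    # Quantize the last companded set in the frame directly.
    q_last = quantize(log_sets[-1])
    # For the earlier sets, interpolate between the prior frame's quantized set
    # and this frame's quantized last set, then quantize only the difference.
    n = len(log_sets)
    diffs = []
    for i, log_m in enumerate(log_sets[:-1]):
        w = (i + 1) / n  # assumed linear interpolation weight
        interp = (1.0 - w) * prior_quantized_log + w * q_last
        diffs.append(quantize(log_m - interp))
    return q_last, diffs
```

Quantizing only the difference from the interpolated magnitudes exploits frame-to-frame redundancy, so the residual typically needs fewer bits than the magnitudes themselves.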
5. The method of claim 4, further comprising computing the spectral magnitudes by:
windowing the digital speech samples to produce windowed speech samples;
computing an FFT of the windowed speech samples to produce FFT coefficients;
summing energy in the FFT coefficients around multiples of a fundamental frequency corresponding to the pitch parameter; and
computing the spectral magnitudes as square roots of the summed energies.
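The windowing, FFT, harmonic energy summation, and square-root steps of claims 5 and 6 can be sketched as follows. The Hann window, FFT size, and half-fundamental band edges around each harmonic are illustrative assumptions.

```python
import numpy as np

def spectral_magnitudes(samples, fundamental_hz, fs, n_fft=256):
    # Window the digital speech samples (Hann window is an assumed choice).
    win = np.hanning(len(samples))
    spec = np.fft.rfft(samples * win, n_fft)
    energy = np.abs(spec) ** 2
    bin_hz = fs / n_fft
    num_harmonics = int((fs / 2) // fundamental_hz)
    half = fundamental_hz / 2  # assumed band half-width around each harmonic
    mags = []
    for k in range(1, num_harmonics + 1):
        # Sum energy in the FFT bins around the k-th multiple of the fundamental.
        lo = max(int(round((k * fundamental_hz - half) / bin_hz)), 0)
        hi = min(int(round((k * fundamental_hz + half) / bin_hz)), len(energy) - 1)
        mags.append(np.sqrt(energy[lo:hi + 1].sum()))
    return np.array(mags)
```

For a pure tone at the fundamental, nearly all energy falls in the first harmonic band, so the first magnitude dominates.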
6. The method of claim 3, further comprising computing the spectral magnitudes by:
windowing the digital speech samples to produce windowed speech samples;
computing an FFT of the windowed speech samples to produce FFT coefficients;
summing energy in the FFT coefficients around multiples of a fundamental frequency corresponding to the pitch parameter; and
computing the spectral magnitudes as square roots of the summed energies.
7. The method of claim 1, wherein the transform coefficients are computed using a transform possessing critical sampling and perfect reconstruction properties.
8. The method of claim 1, 2, 3, 4, 5, 6 or 7, wherein the transform coefficients are computed using an overlapped transform that computes transform coefficients for neighboring subframes using overlapping windows of the digital speech samples.
9. The method of claim 1, 2, 3, 4, 5, 6 or 7, wherein the quantizing of the transform coefficients to produce transform bits includes the steps of:
computing a spectral envelope for the subframe from the model parameters;
forming multiple sets of candidate coefficients, with each set of candidate coefficients being formed by combining one or more candidate vectors and multiplying the combined candidate vectors by the spectral envelope;
selecting from the multiple sets of candidate coefficients the set of candidate coefficients which is closest to the transform coefficients; and
including the index of the selected set of candidate coefficients in the transform bits.
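The envelope-shaped candidate search of claims 9 and 11 can be sketched as follows; the normalized-correlation selection matches claim 11, while the candidate vectors themselves are placeholders for whatever codebook the encoder uses.

```python
import numpy as np

def quantize_transform_coeffs(coeffs, envelope, candidate_vectors):
    # Shape each candidate by the spectral envelope, then pick the candidate
    # with the highest (normalized) correlation with the transform coefficients.
    best_idx, best_corr = 0, -np.inf
    for idx, cand in enumerate(candidate_vectors):
        shaped = np.asarray(cand, dtype=float) * envelope
        denom = np.linalg.norm(shaped)
        corr = np.dot(coeffs, shaped) / denom if denom > 0 else -np.inf
        if corr > best_corr:
            best_idx, best_corr = idx, corr
    return best_idx  # this index is what goes into the transform bits
```

Multiplying by the spectral envelope lets a small, generic codebook cover coefficients whose overall shape is already predicted by the model parameters.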
10. The method of claim 9, wherein each candidate vector is formed from an offset into a known prototype vector and a number of sign bits, wherein each sign bit changes the sign of one or more elements of the candidate vector.
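The prototype-offset candidate construction of claim 10 can be sketched as follows. The slice length and the size of the element group flipped by each sign bit are illustrative assumptions; the claim only requires an offset into a known prototype vector plus sign bits.

```python
import numpy as np

def candidate_from_bits(prototype, offset, length, sign_bits, group):
    # A candidate vector is a slice of a known prototype vector starting at
    # `offset`; each sign bit flips the sign of one group of elements.
    cand = np.array(prototype[offset:offset + length], dtype=float)
    for b, bit in enumerate(sign_bits):
        if bit:
            cand[b * group:(b + 1) * group] *= -1.0
    return cand
```

Because only an offset and a few sign bits are transmitted, the codebook is effectively large while the storage is a single prototype vector.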
11. The method of claim 9, wherein the selected set of candidate coefficients is the set from the multiple sets of candidate coefficients with the highest correlation with the transform coefficients.
12. The method of claim 9, wherein the quantizing of the transform coefficients to produce transform bits includes the further steps of:
computing a best scale factor for the selected candidate vectors of the subframe;
quantizing the scale factors for the subframes in the frame to produce scale factor bits; and
including the scale factor bits in the transform bits.
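Claim 12 does not name an optimality criterion for the "best scale factor"; a common choice, assumed here, is the least-squares gain for the envelope-shaped candidate:

```python
import numpy as np

def best_scale_factor(coeffs, shaped_candidate):
    # Least-squares gain minimizing ||coeffs - g * shaped_candidate||^2
    # (an assumed criterion; the claim does not specify one).
    denom = float(np.dot(shaped_candidate, shaped_candidate))
    return float(np.dot(coeffs, shaped_candidate)) / denom if denom > 0 else 0.0
```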
13. The method of claim 12, wherein scale factors for different subframes in the frame are jointly quantized to produce the scale factor bits.
14. The method of claim 13, wherein the joint quantization uses a vector quantizer.
15. The method of claim 1, 2, 3, 4, 5, 6 or 7, wherein the number of bits in the set of encoded bits for one frame in the sequence of frames is different than the number of bits in the set of encoded bits for a second frame in the sequence of frames.
16. The method of claim 1, 2, 3, 4, 5, 6 or 7, further comprising:
selecting the number of bits in the set of encoded bits, wherein the number may vary from frame to frame; and
allocating the selected number of bits between the parameter bits and the transform bits.
17. The method of claim 16, wherein selecting the number of bits in the set of encoded bits for a frame is based at least in part on the degree of change between the spectral magnitude parameters representing the spectral information in the frame and the previous spectral magnitude parameters representing the spectral information in the previous frame, and wherein a greater number of bits is favored when the degree of change is larger while a fewer number of bits is favored when the degree of change is smaller.
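The change-driven bit selection of claim 17 can be sketched as follows. The bit budgets, the mean-absolute-difference change measure, and the normalization constant are illustrative assumptions; the claim requires only that more change favors more bits.

```python
import numpy as np

def select_frame_bits(curr_mags, prev_mags, min_bits=80, max_bits=160):
    # Degree of change between current and previous spectral magnitude parameters
    # (mean absolute difference is an assumed measure).
    curr = np.asarray(curr_mags, dtype=float)
    prev = np.asarray(prev_mags, dtype=float)
    n = min(len(curr), len(prev))
    change = float(np.mean(np.abs(curr[:n] - prev[:n])))
    # Map the degree of change onto the allowed bit range: more change -> more bits.
    frac = min(change / 2.0, 1.0)  # 2.0 is an assumed normalization constant
    return int(round(min_bits + frac * (max_bits - min_bits)))
```

Spending more bits on rapidly changing frames and fewer on steady ones keeps the average rate down while concentrating capacity where the signal is hardest to model.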
18. An encoder for encoding a digitized speech signal including a sequence of digital speech samples into a set of encoded bits, the encoder comprising:
a dividing element that divides the digital speech samples into a sequence of frames, each of the frames including multiple digital speech samples;
a speech model parameter estimator that estimates a set of speech model parameters for a frame, the speech model parameters including voicing parameters dividing the frame into voiced and unvoiced regions, at least one pitch parameter representing pitch for at least the voiced regions of the frame, and spectral parameters representing spectral information for at least the voiced regions of the frame;
a parameter quantizer that quantizes the model parameters to produce parameter bits;
a transform coefficient generator that divides the frame into one or more subframes and computes transform coefficients for the digital speech samples representing the subframes;
a transform coefficient quantizer that quantizes the transform coefficients in unvoiced regions of the frame to produce transform bits; and
a combiner that combines the parameter bits and the transform bits to produce the set of encoded bits.
companding all sets of spectral magnitudes in the frame, using a companding operation such as the logarithm, to produce sets of companded spectral magnitudes;
quantizing the last set of the companded spectral magnitudes in the frame;
interpolating between the quantized last set of companded spectral magnitudes in the frame and a quantized set of companded spectral magnitudes from a prior frame to form interpolated spectral magnitudes;
determining a difference between a set of companded spectral magnitudes and the interpolated spectral magnitudes; and
quantizing the determined difference between the spectral magnitudes.
22. The encoder of claim 18, wherein the speech model parameter estimator computes the spectral magnitudes by:
windowing the digital speech samples to produce windowed speech samples;
computing an FFT of the windowed speech samples to produce FFT coefficients;
summing energy in the FFT coefficients around multiples of a fundamental frequency corresponding to the pitch parameter; and
computing the spectral magnitudes as square roots of the summed energies.
23. The encoder of claim 18, wherein the transform coefficient generator generates the transform coefficients using an overlapped transform that computes transform coefficients for neighboring subframes using overlapping windows of the digital speech samples.
24. The encoder of claim 18, wherein the transform coefficient quantizer quantizes the transform coefficients to produce the transform bits by:
computing a spectral envelope for the subframe from the model parameters;
forming multiple sets of candidate coefficients, with each set of candidate coefficients being formed by combining one or more candidate vectors and multiplying the combined candidate vectors by the spectral envelope;
selecting from the multiple sets of candidate coefficients the set of candidate coefficients which is closest to the transform coefficients; and
including the index of the selected set of candidate coefficients in the transform bits.
25. The encoder of claim 24, wherein the transform coefficient quantizer forms each candidate vector from an offset into a known prototype vector and a number of sign bits, wherein each sign bit changes the sign of one or more elements of the candidate vector.
26. A method of decoding a frame of digital speech samples from a set of encoded bits, the method comprising:
extracting model parameter bits from the set of encoded bits;
reconstructing model parameters representing the frame of digital speech samples from the extracted model parameter bits, wherein the model parameters include voicing parameters dividing the frame into voiced and unvoiced regions, at least one pitch parameter representing the pitch information for at least the voiced regions of the frame, and spectral parameters representing spectral information for at least the voiced regions of the frame;
producing voiced speech samples for the frame from the reconstructed model parameters;
extracting transform coefficient bits from the set of encoded bits;
reconstructing transform coefficients representing unvoiced regions of the frame from the extracted transform coefficient bits;
inverse transforming the reconstructed transform coefficients to produce inverse transform samples;
producing unvoiced speech for the frame from the inverse transform samples; and
combining the voiced speech for the frame and the unvoiced speech for the frame to produce the decoded frame of digital speech samples.
dividing the frame into subframes;
separating the reconstructed transform coefficients into groups, each group of reconstructed transform coefficients being associated with a different subframe in the frame;
inverse transforming the reconstructed transform coefficients in a group to produce inverse transform samples associated with the corresponding subframe; and
overlapping and adding the inverse transform samples associated with consecutive subframes to produce unvoiced speech for the frame.
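The overlap-and-add synthesis recited in the steps above can be sketched as follows; the hop size (subframe advance) is an illustrative parameter, and any windowing applied by the inverse transform itself is assumed to have happened already.

```python
import numpy as np

def overlap_add(subframe_samples, hop):
    # Inverse-transform samples for consecutive subframes are overlapped by
    # (len - hop) samples and added to produce the unvoiced speech signal.
    n = len(subframe_samples[0])
    out = np.zeros(hop * (len(subframe_samples) - 1) + n)
    for i, s in enumerate(subframe_samples):
        out[i * hop:i * hop + n] += s
    return out
```

With a transform whose analysis/synthesis windows satisfy the perfect-reconstruction condition (claim 33), the overlapped regions sum back to the original samples.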
33. The method of claim 32, wherein the inverse transform samples are computed using the inverse of an overlapped transform possessing both critical sampling and perfect reconstruction properties.
34. The method of claim 26, wherein the reconstructed transform coefficients are produced from the transform coefficient bits by:
computing a spectral envelope from the reconstructed model parameters;
reconstructing one or more candidate vectors from the transform coefficient bits; and
forming reconstructed transform coefficients by combining the candidate vectors and multiplying the combined candidate vectors by the spectral envelope.
35. The method of claim 34, wherein a candidate vector is reconstructed from the transform coefficient bits by use of an offset into a known prototype vector and a number of sign bits, wherein each sign bit changes the sign of one or more elements of the candidate vector.
36. A decoder for decoding a frame of digital speech samples from a set of encoded bits, the decoder comprising:
a model parameter extractor that extracts model parameter bits from the set of encoded bits;
a model parameter reconstructor that reconstructs model parameters representing the frame of digital speech samples from the extracted model parameter bits, wherein the model parameters include voicing parameters dividing the frame into voiced and unvoiced regions, at least one pitch parameter representing the pitch information for at least the voiced regions of the frame, and spectral parameters representing spectral information for at least the voiced regions of the frame;
a voiced speech synthesizer that produces voiced speech samples for the frame from the reconstructed model parameters;
a transform coefficient extractor that extracts transform coefficient bits from the set of encoded bits;
a transform coefficient reconstructor that reconstructs transform coefficients representing unvoiced regions of the frame from the extracted transform coefficient bits;
an inverse transformer that inverse transforms the reconstructed transform coefficients to produce inverse transform samples;
an unvoiced speech synthesizer that synthesizes unvoiced speech for the frame from the inverse transform samples; and
a combiner that combines the voiced speech for the frame and the unvoiced speech for the frame to produce the decoded frame of digital speech samples.
39. A method of encoding a speech signal into a set of encoded bits, the method comprising:
digitizing the speech signal to produce a sequence of digital speech samples;
dividing the digital speech samples into a sequence of frames, each of the frames spanning multiple digital speech samples;
estimating a set of speech model parameters for a frame, wherein the speech model parameters include a voicing parameter, at least one pitch parameter representing pitch for the frame, and spectral parameters representing spectral information for the frame;
quantizing the model parameters to produce parameter bits;
dividing the frame into one or more subframes and computing transform coefficients for the digital speech samples representing the subframes, wherein computing the transform coefficients comprises using a transform possessing critical sampling and perfect reconstruction properties;
quantizing at least some of the transform coefficients to produce transform bits; and
including the parameter bits and the transform bits in the set of encoded bits.
40. A method of decoding a frame of digital speech samples from a set of encoded bits, the method comprising:
extracting model parameter bits from the set of encoded bits;
reconstructing model parameters representing the frame of digital speech samples from the extracted model parameter bits, wherein the model parameters include a voicing parameter, at least one pitch parameter representing pitch information for the frame, and spectral parameters representing spectral information for the frame;
producing voiced speech samples for the frame using the reconstructed model parameters;
extracting transform coefficient bits from the set of encoded bits;
reconstructing transform coefficients from the extracted transform coefficient bits;
inverse transforming the reconstructed transform coefficients to produce inverse transform samples, wherein the inverse transform samples are produced using the inverse of an overlapped transform possessing both critical sampling and perfect reconstruction properties;
producing unvoiced speech for the frame from the inverse transform samples; and
combining the voiced speech for the frame and the unvoiced speech for the frame to produce the decoded frame of digital speech samples.
41. A method of encoding a speech signal into a set of encoded bits, the method comprising:
digitizing the speech signal to produce a sequence of digital speech samples;
dividing the digital speech samples into a sequence of frames, each of the frames spanning multiple digital speech samples;
estimating a set of speech model parameters for a frame, wherein the speech model parameters include a voicing parameter, at least one pitch parameter representing pitch for the frame, and spectral parameters representing spectral information for the frame, the spectral parameters including one or more sets of spectral magnitudes estimated in a manner which is independent of the voicing parameter for the frame;
quantizing the model parameters to produce parameter bits;
dividing the frame into one or more subframes and computing transform coefficients for the digital speech samples representing the subframes;
quantizing at least some of the transform coefficients to produce transform bits; and
including the parameter bits and the transform bits in the set of encoded bits.
42. A method of decoding a frame of digital speech samples from a set of encoded bits, the method comprising:
extracting model parameter bits from the set of encoded bits;
reconstructing model parameters representing the frame of digital speech samples from the extracted model parameter bits, wherein the model parameters include a voicing parameter, at least one pitch parameter representing pitch information for the frame, and spectral parameters representing spectral information for the frame;
producing voiced speech samples for the frame using the reconstructed model parameters and synthetic phase information computed from the spectral magnitudes;
extracting transform coefficient bits from the set of encoded bits;
reconstructing transform coefficients from the extracted transform coefficient bits;
inverse transforming the reconstructed transform coefficients to produce inverse transform samples;
producing unvoiced speech for the frame from the inverse transform samples; and
combining the voiced speech for the frame and the unvoiced speech for the frame to produce the decoded frame of digital speech samples.
Specification