Low-complexity, low-delay, scalable and embedded speech and audio coding with adaptive frame loss concealment
First Claim
1. A system for processing audio signals comprising:
- (a) a frame extractor for dividing an input audio signal into a plurality of signal frames corresponding to successive time intervals;
(b) a transform processor for performing transform computation of the input audio signal in at least one signal frame, said transform processor generating a transform signal having one or more (NB) bands;
(c) a quantizer providing quantized values associated with the transform signal in said NB bands;
(d) an output processor for forming an output bit stream corresponding to an encoded version of the input audio signal; and
(e) a decoder capable of reconstructing from the output bit stream at least two replicas of the input audio signal, each replica having a different sampling rate, without using downsampling.
Abstract
High-quality, low-complexity and low-delay scalable and embedded system and method are disclosed for coding speech and general audio signals. The invention is particularly suitable in Internet Protocol (IP)-based multimedia communications. Adaptive transform coding, such as a Modified Discrete Cosine Transform, is used, with multiple small-size transforms in a given signal frame to reduce the coding delay and computational complexity. In a preferred embodiment, for a chosen sampling rate of the input signal, one or more output sampling rates may be decoded with varying degrees of complexity. Multiple sampling rates and bit rates are supported due to the scalable and embedded coding approach underlying the present invention. Further, a novel adaptive frame loss concealment approach is used to reduce the distortion caused by packet loss in communications using IP networks.
174 Citations
43 Claims
-
1. A system for processing audio signals comprising:
-
(a) a frame extractor for dividing an input audio signal into a plurality of signal frames corresponding to successive time intervals;
(b) a transform processor for performing transform computation of the input audio signal in at least one signal frame, said transform processor generating a transform signal having one or more (NB) bands;
(c) a quantizer providing quantized values associated with the transform signal in said NB bands;
(d) an output processor for forming an output bit stream corresponding to an encoded version of the input audio signal; and
(e) a decoder capable of reconstructing from the output bit stream at least two replicas of the input audio signal, each replica having a different sampling rate, without using downsampling.
-
5. The system of claim 3 wherein said bit allocator warps possibly quantized log-gain values to target signal-to-noise ratio (TSNR) values in the base-2 log domain using a predefined warping function.
-
6. The system of claim 5, wherein said bit allocator allocates to the band with the largest TSNR value one bit for each transform coefficient in that band, and reduces the TSNR correspondingly, and repeats the operation until all available bits are exhausted.
-
7. The system of claim 3 wherein the output bit stream formed by the output processor further comprises quantized log-gain values for at least some of the NB bands of the transform signal.
-
8. The system of claim 1 wherein the decoder (e) is capable of identifying missing frames in the input signal.
-
9. The system of claim 8 wherein the decoder comprises an adaptive frame loss concealment processor operating to reduce the effect of missing frames on the quality of the output signal.
-
10. The system of claim 9 wherein the adaptive frame loss concealment processor computes an optimum time lag for waveform signal interpolation.
-
-
11. A method for processing audio signals, comprising:
-
dividing an input audio signal into frames corresponding to successive time intervals;
for each frame performing at least two relatively short-size transform computations;
extracting one set of side information about the frame from said at least two relatively short-size transform computations;
encoding information about the frame, said encoded information comprising the side information and transform coefficients from said at least two transform computations; and
reconstructing the audio signal based on the encoded information.
-
-
16. The method of claim 15 wherein each transform includes a DCT type IV transform computation, given by the expression:
-
where x_n is the time domain signal, X_k is the DCT type IV transform of x_n, and M is the transform size.
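For orientation, a minimal sketch of a DCT type IV in plain Python. The claimed expression is not reproduced in this text, so the orthonormal scaling sqrt(2/M) used here is an assumption (the common textbook form), and the function name `dct4` is hypothetical.

```python
import math

def dct4(x):
    """DCT type IV of the sequence x (length M).

    Assumes the orthonormal textbook form
        X_k = sqrt(2/M) * sum_n x_n * cos((n + 1/2)(k + 1/2) * pi / M),
    which may differ from the claimed expression by a scale factor.
    """
    M = len(x)
    scale = math.sqrt(2.0 / M)
    return [
        scale * sum(x[n] * math.cos((n + 0.5) * (k + 0.5) * math.pi / M)
                    for n in range(M))
        for k in range(M)
    ]

# With this scaling the transform matrix is symmetric and orthogonal,
# so applying it twice returns the original sequence.
x = [1.0, -2.0, 0.5, 3.0]
roundtrip = dct4(dct4(x))
```

This direct form costs O(M^2) operations per block and is meant only to make the definition concrete; it says nothing about how the claimed system computes the transform.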
-
-
17. The method of claim 11 wherein the size of the frame is selected relatively short to enable low algorithmic delay processing.
-
18. The method of claim 15 wherein transform coefficients T(k,m) obtained by each of said at least two transform computations are divided into NB frequency bands, and encoding information about each frame is done using the base-2 logarithm of the average power of the coefficients in the NB bands, said base-2 logarithm of the average power being defined as the log-gain.
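The log-gain of claim 18 can be illustrated with a short sketch. It assumes that the average power of a band is pooled over every transform in the frame, that bands are described by a hypothetical `band_edges` list (the bandwidth expression of claim 19 is not reproduced in this text), and that a small floor guards against taking the logarithm of zero.

```python
import math

def log_gains(T, band_edges, floor=1e-12):
    """Base-2 log of the average coefficient power in each band.

    T[m][k] is coefficient k of the m-th transform in the frame.
    band_edges holds NB + 1 increasing indices (hypothetical layout):
    band i covers coefficients band_edges[i] .. band_edges[i + 1] - 1.
    """
    lg = []
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        power = sum(t[k] ** 2 for t in T for k in range(lo, hi))
        count = len(T) * (hi - lo)
        # floor is an added safeguard, not part of the claim
        lg.append(math.log2(max(power / count, floor)))
    return lg

# Two transforms of four coefficients, split into two bands of two.
T = [[2.0, 2.0, 1.0, 1.0], [2.0, 2.0, 1.0, 1.0]]
LG = log_gains(T, [0, 2, 4])
```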
-
19. The method of claim 18 wherein the bandwidth BW(i) of the i-th transform domain band is given by the expression
-
20. The method of claim 19 wherein bit allocation for the encoding of transform coefficients is performed based on the log-gains LG(i) in the NB bands.
-
21. The method of claim 20 wherein prior to bit allocation, the NB log-gains are mapped to a Target Signal to Noise Ratio (TSNR) scale using a warping curve.
-
22. The method of claim 21 wherein the warping curve is a piece-wise linear function.
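A piece-wise linear warping curve of the kind named in claim 22 could look like the sketch below. The knot positions in `KNOTS` are invented for illustration only; the claims do not give the actual curve, and clamping outside the knot range is likewise an assumption.

```python
def warp_to_tsnr(lg, knots):
    """Map a log-gain to a TSNR value by linear interpolation between knots.

    knots is a list of (log_gain, tsnr) pairs sorted by log_gain.
    Values outside the knot range are clamped to the end points.
    """
    if lg <= knots[0][0]:
        return knots[0][1]
    if lg >= knots[-1][0]:
        return knots[-1][1]
    for (x0, y0), (x1, y1) in zip(knots[:-1], knots[1:]):
        if x0 <= lg <= x1:
            return y0 + (y1 - y0) * (lg - x0) / (x1 - x0)

# Hypothetical curve: gentle slope for weak bands, steeper for strong ones.
KNOTS = [(0.0, 0.0), (10.0, 5.0), (20.0, 15.0)]
tsnr = [warp_to_tsnr(v, KNOTS) for v in (-3.0, 5.0, 15.0, 25.0)]
```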
-
23. The method of claim 21 wherein the band with the largest TSNR value is given one bit for each transform coefficient in that band and the TSNR is reduced correspondingly, and the bit allocation is repeated cyclically, until all available bits are exhausted.
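The cyclic allocation of claim 23 can be sketched as a greedy loop. Two details are assumptions not stated in the claim: the TSNR is lowered by `step = 2.0` per allocated bit (one extra bit roughly quarters the quantization noise power, which is 2 units in a base-2 log power domain), and a band is skipped once its size no longer fits in the remaining budget.

```python
def allocate_bits(tsnr, band_sizes, total_bits, step=2.0):
    """Return the number of bits per coefficient assigned to each band.

    tsnr       : target SNR per band (base-2 log domain)
    band_sizes : number of transform coefficients in each band
    total_bits : bit budget for the transform coefficients
    step       : assumed TSNR reduction per allocated bit
    """
    tsnr = list(tsnr)
    bits = [0] * len(tsnr)
    remaining = total_bits
    while True:
        # Only bands whose whole size still fits can receive another bit.
        fits = [i for i in range(len(tsnr)) if band_sizes[i] <= remaining]
        if not fits:
            break
        best = max(fits, key=lambda i: tsnr[i])
        bits[best] += 1                 # one more bit per coefficient
        tsnr[best] -= step              # reduce the TSNR correspondingly
        remaining -= band_sizes[best]
    return bits

bits = allocate_bits([6.0, 3.0, 1.0], band_sizes=[2, 2, 4], total_bits=10)
```

In this example the two strongest bands absorb the whole budget and the weakest band receives no bits.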
-
24. The method of claim 21 wherein the number of bits assigned to each of the transform coefficients is based on the formula:
-
where R is the average bit rate, N is the number of transform coefficients, R_k is the bit rate for the k-th transform coefficient, and σ_k^2 is the square of the standard deviation of the k-th transform coefficient.
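The formula of claim 24 is not reproduced in this text. The symbols it defines match the classical high-resolution rate allocation rule, R_k = R + (1/2) log2(σ_k^2 / geometric mean of all σ_j^2), so the sketch below implements that rule as an assumption rather than as the claimed expression. The function name `optimal_bits` is hypothetical.

```python
import math

def optimal_bits(variances, avg_rate):
    """Classical rate allocation (assumed form, see lead-in).

    R_k = R + 0.5 * (log2(var_k) - mean_j log2(var_j))
    """
    n = len(variances)
    log_geo_mean = sum(math.log2(v) for v in variances) / n
    return [avg_rate + 0.5 * (math.log2(v) - log_geo_mean) for v in variances]

rates = optimal_bits([16.0, 4.0, 1.0, 0.25], avg_rate=2.0)
```

The rates average to R by construction, but individual values can be fractional or negative, so a practical allocator has to round and clip them.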
-
-
25. The method of claim 24 wherein the bit allocation formula is modified to:
-
or
-
-
26. A method for adaptive frame loss concealment in processing of audio signals divided into frames corresponding to successive time intervals, where for each input frame one or more transform domain computations are performed over partially overlapping windows covering the audio signal, and output synthesis is performed using an overlap-and-add method, the method comprising:
-
in a sequence of received frames identifying a frame as missing;
analyzing the immediately preceding frame to determine an optimum time lag for waveform signal extrapolation;
based on the determined optimum time lag performing waveform signal extrapolation to synthesize a first portion of the missing frame, said synthesis using information already available as part of the preceding frame to minimize discontinuities at the frame boundary; and
performing waveform signal extrapolation in the remaining portion of the missing frame.
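The steps of claim 26 can be sketched as follows. The claim does not say how the optimum time lag is found; this sketch assumes a normalized cross-correlation search over the most recent decoded samples and then repeats the waveform at that lag. The overlap-and-add blending at the frame boundary is omitted, and the names `best_lag` and `extrapolate` are hypothetical.

```python
import math

def best_lag(history, min_lag, max_lag, win):
    """Lag whose delayed segment best matches the last `win` samples."""
    end = len(history)
    if end < win + max_lag:
        raise ValueError("history too short for the requested search range")
    target = history[end - win:]
    best, best_score = min_lag, -math.inf
    for lag in range(min_lag, max_lag + 1):
        cand = history[end - win - lag:end - lag]
        num = sum(a * b for a, b in zip(target, cand))
        den = math.sqrt(sum(b * b for b in cand)) or 1.0  # avoid divide by zero
        if num / den > best_score:
            best, best_score = lag, num / den
    return best

def extrapolate(history, lag, count):
    """Continue the waveform by copying samples from `lag` samples back."""
    out = list(history)
    for _ in range(count):
        out.append(out[-lag])
    return out[len(history):]

# A sinusoid with a period of 8 samples stands in for the preceding frame.
history = [math.sin(2 * math.pi * n / 8) for n in range(64)]
lag = best_lag(history, min_lag=4, max_lag=12, win=16)
filled = extrapolate(history, lag, count=8)
```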
-
-
34. A method for scalable processing of audio signals sampled at a first sampling rate and divided into frames corresponding to successive time intervals, where for each input frame one or more relatively short-size transform domain computations are performed over windows covering portions of the audio signal, comprising:
-
receiving transform domain coefficients corresponding to said one or more transform domain computations; and
directly reconstructing the audio signal at a second sampling rate lower than the first sampling rate using an inverse transform operating only on a portion of the received transform domain coefficients, without downsampling.
where x_n is the time domain signal, X_k is the DCT type IV transform of x_n, and M is the transform size, and the inverse DCT type IV is given by the expression;
-
-
36. The method of claim 35, wherein the step of directly synthesizing at a ¼ sampling rate without downsampling comprises computing a (M/4)-point DCT type IV for the first quarter of the received DCT coefficients, as follows;
where so that
-
37. The method of claim 35, wherein the step of directly synthesizing at a ½ sampling rate without downsampling comprises computing a (M/2)-point DCT type IV for the first half of the received DCT coefficients, as follows;
where so that where;
and using the above quantities in a DCT type IV inverse computation to obtain the reconstructed output signal having a ½ sampling rate.
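The idea of reconstructing at half the sampling rate from only the first half of the coefficients can be illustrated with a sketch. The expressions of claims 35 to 37 are not reproduced in this text, so the orthonormal DCT type IV scaling and the 1/sqrt(2) gain correction used here are assumptions, and `dct4` and `reconstruct_half_rate` are hypothetical names.

```python
import math

def dct4(x):
    """Orthonormal DCT type IV (assumed scaling); it is its own inverse."""
    M = len(x)
    scale = math.sqrt(2.0 / M)
    return [
        scale * sum(x[n] * math.cos((n + 0.5) * (k + 0.5) * math.pi / M)
                    for n in range(M))
        for k in range(M)
    ]

def reconstruct_half_rate(X):
    """Half-rate block from the first M/2 of M DCT-IV coefficients.

    No full-rate signal is synthesized and no downsampling filter is run.
    The division by sqrt(2) compensates for the larger scale factor of
    the shorter transform under the assumed orthonormal scaling.
    """
    half = X[:len(X) // 2]
    return [v / math.sqrt(2.0) for v in dct4(half)]

# A block containing a single low-frequency DCT-IV basis function.
M, k0 = 16, 1
x = [math.cos((n + 0.5) * (k0 + 0.5) * math.pi / M) for n in range(M)]
y = reconstruct_half_rate(dct4(x))
```

Under these assumptions each output sample falls midway between a pair of full-rate samples, so `y[m]` follows the same cosine evaluated at the time instant between `x[2m]` and `x[2m + 1]`.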
-
38. A coding method for use in processing of audio signals divided into frames corresponding to successive time intervals, where for each input frame at least one transform domain computation is performed, and the transform coefficients are divided into NB bands, the method comprising:
-
computing a base-2 logarithm of the average power of the transform coefficients in the NB bands to obtain a log-gain array LG(i), i=0, . . . , NB−1;
encoding information about each frame based on the log-gain array LG(i), said encoded information comprising the transform coefficients, where the encoding step comprises;
computing a quantized log-gain array LGQ(i), i=0, . . . , NB−1; and
converting the quantized log-gain coefficients of the array LGQ(i) into a linear-gain domain using the following steps;
(1) providing a table containing all possible values of the linear gain g(0) corresponding to the number of bits allocated to LGQ(0);
(2) finding the value of g(0) using table lookup;
(3) from the second band onward, applying the formula;
-
-
40. An embedded coding method for use in processing of an audio signal divided into frames corresponding to successive time intervals, where for each input frame at least one transform domain computation is performed and the resulting transform coefficients are divided into NB bands, each band having at least one transform coefficient, the method comprising:
-
for a pre-specified first bit rate providing a first output bit stream which comprises information about transform coefficients in M1≦NB bands and information about the average power in the M1 bands, and wherein bit allocation is determined based on a target signal-to-noise ratio (TSNR) in the NB bands, said first output bit stream being sufficient to reconstruct a representation of the audio signal;
for at least a second pre-specified bit rate higher than the first bit rate, providing an output bit stream embedding said first output bit stream and further comprising information about transform coefficients in M2 bands, where M1≦M2≦NB, and information about the average power in the M2 bands, and wherein bit allocation is determined based on the difference between the TSNR in the NB bands and a value determined by the number of bits allocated to each band at the next-lower bit rate; and
reconstructing a representation of the input signal using an embedded bit stream corresponding to the desired bit rate.
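The layered allocation of claim 40 can be sketched by running an allocator once per bit-rate layer, each time on the TSNR that remains after the lower layers. Several details are assumptions: the greedy allocator, the reduction of 2.0 TSNR units per bit, and the per-layer budgets are illustrative, and the sketch does not model the restriction of the lower layers to M1 or M2 bands.

```python
def greedy(tsnr, band_sizes, budget, step=2.0):
    """Greedy allocation: repeatedly give one bit per coefficient to the
    band with the largest TSNR that still fits in the budget."""
    tsnr = list(tsnr)
    bits = [0] * len(tsnr)
    while True:
        fits = [i for i in range(len(tsnr)) if band_sizes[i] <= budget]
        if not fits:
            return bits
        best = max(fits, key=lambda i: tsnr[i])
        bits[best] += 1
        tsnr[best] -= step
        budget -= band_sizes[best]

def embedded_allocation(tsnr, band_sizes, layer_budgets, step=2.0):
    """Per-layer bit allocations for an embedded bit stream.

    Each layer allocates against the TSNR minus a value determined by
    the bits already granted at the lower rates (assumed: step * bits).
    """
    total = [0] * len(tsnr)
    layers = []
    for budget in layer_budgets:
        residual = [t - step * b for t, b in zip(tsnr, total)]
        extra = greedy(residual, band_sizes, budget, step)
        layers.append(extra)
        total = [a + b for a, b in zip(total, extra)]
    return layers

# Base layer of 4 bits, enhancement layer adding 6 more.
layers = embedded_allocation([6.0, 3.0, 1.0], [2, 2, 4], layer_budgets=[4, 6])
```

Because the enhancement layer only ever adds bits on top of the base layer, a decoder that stops after the first layer can still reconstruct a coarser version of the signal.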
for a given first bit rate, providing a bit allocation algorithm that takes into account band encoding information about each frame, said information comprising the transform coefficients, based on the gain array G(i); and
decoding said encoded information about each frame to reconstruct the input audio signal.
-
-
42. A system for embedded coding of audio signals comprising:
-
a frame extractor for dividing an input audio signal into a plurality of signal frames corresponding to successive time intervals;
means for performing transform computation to provide transform-domain representation of the input audio signal in each frame, said transform-domain representation having n NB bands, where n>1;
means for providing a first encoded data stream corresponding to a user-specified portion of the transform-domain representation having m NB bands, where m<n, which first encoded data stream contains information sufficient to reconstruct a representation of the input audio signal;
means for providing one or more secondary encoded data streams comprising additional information to the user-specified portion of the transform-domain representation of the input audio signal; and
means for providing an embedded output signal based at least on said first encoded data stream and said one or more secondary encoded data streams.
-
-
43. A method for processing audio signals, comprising:
-
dividing an input audio signal into frames corresponding to successive time intervals;
for each frame performing at least two relatively short-size transform computations to obtain a two-dimensional output transform coefficient array T(k,m) defined as;
-
Specification