LOW-COMPLEXITY, LOW-DELAY, SCALABLE AND EMBEDDED SPEECH AND AUDIO CODING WITH ADAPTIVE FRAME LOSS CONCEALMENT

US 20020007273A1
Filed: 03/30/1999
Published: 01/17/2002
Est. Priority Date: 03/30/1998
Status: Active Grant

Abstract

A high-quality, low-complexity, low-delay, scalable and embedded system and method are disclosed for coding speech and general audio signals. The invention is particularly suitable for Internet Protocol (IP)-based multimedia communications. Adaptive transform coding, such as a Modified Discrete Cosine Transform, is used, with multiple small-size transforms in a given signal frame to reduce the coding delay and computational complexity. In a preferred embodiment, for a chosen sampling rate of the input signal, one or more output sampling rates may be decoded with varying degrees of complexity. Multiple sampling rates and bit rates are supported by the scalable and embedded coding approach underlying the present invention. Further, a novel adaptive frame loss concealment approach is used to reduce the distortion caused by packet loss in communications over IP networks.

131 Citations

43 Claims

1. A system for processing audio signals comprising:
- (a) a frame extractor for dividing an input audio signal into a plurality of signal frames corresponding to successive time intervals;
  
  (b) a transform processor for performing transform computation of a signal in at least one signal frame, said transform processor generating a transform signal having one or more (NB) bands;
  
  (c) a quantizer providing quantized values associated with the transform signal in said NB bands;
  
  (d) an output processor for forming an output bit stream corresponding to an encoded version of the input signal; and
  
  (e) a decoder capable of reconstructing from the output bit stream at least two replicas of the input signal, each replica having a different sampling rate, without using downsampling.
- - 2. The system of claim 1, further comprising an adaptive bit allocator for determining an optimum bit-allocation for encoding at least one of said NB bands of the transform signal.
  - 3. The system of claim 2 further comprising a log-gain calculator for computing log-gain values corresponding to the base-2 logarithm of the average power of the coefficients in the NB bands of the transform signal.
  - 4. The system of claim 3 wherein the bandwidth BW(i) of the i-th transform domain band is given by the expression BW(i)=BI(i+1)−BI(i), where BI(i) is an array containing the indices corresponding to the transform domain boundaries between bands, and the log-gains are calculated as:
  - 5. The system of claim 3 wherein said bit allocator warps possibly quantized log-gain values to target signal-to-noise ratio (TSNR) values in the base-2 log domain using a predefined warping function.
  - 6. The system of claim 5, wherein said bit allocator allocates to the band with the largest TSNR value one bit for each transform coefficient in that band, and reduces the TSNR correspondingly, and repeats the operation until all available bits are exhausted.
  - 7. The system of claim 3 wherein the output bit stream formed by the output processor further comprises quantized log-gain values for at least some of the NB bands of the transform signal.
  - 8. The system of claim 1 wherein the decoder (e) is capable of identifying missing frames in the input signal.
  - 9. The system of claim 8 wherein the decoder comprises an adaptive frame loss concealment processor operating to reduce the effect of missing frames on the quality of the output signal.
  - 10. The system of claim 9 wherein the adaptive frame loss concealment processor computes an optimum time lag for waveform signal interpolation.
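
The log-gain of claims 3 and 4 can be illustrated with a minimal sketch. The claims state only that the log-gain is the base-2 logarithm of the average coefficient power in a band of width BW(i)=BI(i+1)−BI(i); the function name `log_gains`, the `T[m][k]` list layout, the averaging over all transforms in the frame, and the small floor that avoids `log2(0)` are assumptions made here for illustration.

```python
import math

def log_gains(T, BI):
    """Base-2 log of the average coefficient power in each band.

    T[m][k] is coefficient k of transform m within one frame (assumed layout).
    BI holds the band boundary indices, so band i covers BI[i] .. BI[i+1]-1.
    """
    ntpf = len(T)  # number of transforms in the frame
    gains = []
    for i in range(len(BI) - 1):
        bw = BI[i + 1] - BI[i]  # BW(i) = BI(i+1) - BI(i)
        power = sum(T[m][k] ** 2
                    for m in range(ntpf)
                    for k in range(BI[i], BI[i + 1]))
        avg = power / (ntpf * bw)
        # Floor is an assumption to keep log2 defined for silent bands.
        gains.append(math.log2(max(avg, 1e-12)))
    return gains

# Two transforms of four coefficients each, split into two bands.
frame = [[2.0, 2.0, 1.0, 1.0],
         [2.0, 2.0, 1.0, 1.0]]
print(log_gains(frame, [0, 2, 4]))
```

In this example the first band has average power 4 and the second has average power 1, so the log-gains are 2 and 0.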

11. A method for processing audio signals, comprising:
- dividing an input audio signal into frames corresponding to successive time intervals;
  
  for each frame performing at least two relatively short-size transform computations;
  
  extracting one set of side information about the frame from said at least two relatively short-size transform computations;
  
  encoding information about the frame, said encoded information comprising the side information and transform coefficients from said at least two transform computations; and
  
  reconstructing the audio signal based on the encoded information.
- - 12. The method of claim 11 using M transforms for each signal frame, said transforms performed over partially overlapping windows which cover the audio signal in a current frame and at least one adjacent frame, wherein the overlapping portion is equal to 1/M of the frame size.
  - 13. The method of claim 11 wherein a short-size transform is performed about every 4 ms.
  - 14. The method of claim 11 wherein said at least two relatively short-size transforms are Modified Discrete Cosine Transforms (MDCTs).
  - 15. The method of claim 11 wherein for each frame is computed a two-dimensional output transform coefficient array T(k,m) defined as:
    - T(k,m), k=0, 1, 2, . . . , M−1, and m=0, 1, . . . , NTPF−1, where M is the number of transform coefficients in each transform, and NTPF is the number of transforms per frame.
  - 16. The method of claim 15 wherein each transform includes a DCT type IV transform computation, given by the expression:
  - 17. The method of claim 11 wherein the size of the frame is selected relatively short to enable low algorithmic delay processing.
  - 18. The method of claim 15 wherein transform coefficients T(k,m) obtained by each of said at least two transform computations are divided into NB frequency bands, and encoding information about each frame is done using the base-2 logarithm of the average power of the coefficients in the NB bands, said base-2 logarithm of the average power being defined as the log-gain.
  - 19. The method of claim 18 wherein the bandwidth BW(i) of the i-th transform domain band is given by the expression BW(i)=BI(i+1)−BI(i), where BI(i) is an array containing the indices corresponding to the transform domain boundaries between bands, and the log-gains are calculated as:
  - 20. The method of claim 19 wherein bit allocation for the encoding of transform coefficients is performed based on the log-gains LG(i) in the NB bands.
  - 21. The method of claim 20 wherein prior to bit allocation, the NB log-gains are mapped to a Target Signal to Noise Ratio (TSNR) scale using a warping curve.
  - 22. The method of claim 21 wherein the warping curve is a piece-wise linear function.
  - 23. The method of claim 21 wherein the band with the largest TSNR value is given one bit for each transform coefficient in that band and the TSNR is reduced correspondingly, and the bit allocation is repeated cyclically, until all available bits are exhausted.
  - 24. The method of claim 21 wherein the number of bits assigned to each of the transform coefficients is based on the formula:
  - 25. The method of claim 24 wherein the bit allocation formula is modified to:
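
The warping and cyclic bit allocation of claims 21 to 23 can be sketched as follows. This is an illustration under stated assumptions, not the patented procedure: the knot values of the piecewise-linear curve are hypothetical, the TSNR reduction of 2 per allocated bit (one bit roughly quadruples SNR, which is 2 in the base-2 log power domain) is assumed, and skipping bands that no longer fit in the remaining budget is a simplification chosen here.

```python
def warp_to_tsnr(lg, knots):
    """Piecewise-linear map from a log-gain to a TSNR value.

    knots is a sorted list of (log_gain, tsnr) points; values are hypothetical.
    """
    if lg <= knots[0][0]:
        return knots[0][1]
    for (x0, y0), (x1, y1) in zip(knots, knots[1:]):
        if lg <= x1:
            return y0 + (y1 - y0) * (lg - x0) / (x1 - x0)
    return knots[-1][1]

def allocate_bits(tsnr, band_widths, total_bits, step=2.0):
    """Greedy allocation: repeatedly give the band with the largest TSNR
    one bit per coefficient, then lower that band's TSNR by `step`."""
    tsnr = list(tsnr)
    bits = [0] * len(tsnr)
    remaining = total_bits
    while True:
        # Only bands whose coefficients all fit in the remaining budget.
        candidates = [i for i, bw in enumerate(band_widths) if bw <= remaining]
        if not candidates:
            break
        best = max(candidates, key=lambda i: tsnr[i])
        bits[best] += 1                  # one more bit per coefficient
        tsnr[best] -= step               # assumed reduction per bit
        remaining -= band_widths[best]   # bw coefficients, one bit each
    return bits

print(warp_to_tsnr(4.0, [(0.0, 0.0), (10.0, 5.0)]))
print(allocate_bits([6.0, 3.0, 1.0], [2, 2, 4], total_bits=10))
```

With the example inputs, the first band receives three bits per coefficient, the second two, and the widest, quietest band none, which uses the full budget of 10 bits.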

26. A method for adaptive frame loss concealment in processing of audio signals divided into frames corresponding to successive time intervals, where for each input frame one or more transform domain computations are performed over partially overlapping windows covering the audio signal, and output synthesis is performed using an overlap-and-add method, the method comprising:
- in a sequence of received frames identifying a frame as missing;
  
  analyzing the immediately preceding frame to determine an optimum time lag for waveform signal extrapolation;
  
  based on the determined optimum time lag performing waveform signal extrapolation to synthesize a first portion of the missing frame, said synthesis using information already available as part of the preceding frame to minimize discontinuities at the frame boundary; and
  
  performing waveform signal extrapolation in the remaining portion of the missing frame.
- - 27. The method of claim 26 wherein the step of analyzing is performed at least in part using a filtered and decimated version of the synthesis signal for the immediately preceding frame.
  - 28. The method of claim 27 wherein the optimum time lag in the step of analyzing is identified using a peak of the cross-correlation function of the decimated version of the synthesis signal.
  - 29. The method of claim 28 wherein the optimum time lag is further refined using the full version of the synthesis signal.
  - 30. The method of claim 27 wherein the optimum time lag in the step of analyzing is identified as the time lag that minimizes discontinuities in the waveform sample from the preceding frame to the extrapolated current frame.
  - 31. The method of claim 30 wherein a measure of discontinuities is computed in terms of both waveform sample values and waveform slope.
  - 32. The method of claim 31 wherein the measure of discontinuities is computed using the decimated version of the synthesis signal for the immediately preceding frame and the extrapolated version of the decimated signal.
  - 33. The method of claim 26 wherein the waveform extrapolation extends to the first portion of the frame immediately following the missing frame and further comprises windowing and overlap-and-add buffer update in preparation for the synthesis of the frame immediately following the missing frame.
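
The lag search of claims 27 to 29 (coarse search on a decimated signal, then refinement at full rate) can be illustrated with a small sketch. Everything below is an assumption for illustration: block averaging stands in for the unspecified filter, a normalized cross-correlation serves as the peak criterion, and the names `best_lag`, `two_stage_lag` and `extrapolate` are hypothetical. The windowing and overlap-add handling of claims 26 and 33 is not modeled.

```python
def best_lag(signal, min_lag, max_lag, window):
    """Lag whose earlier segment best matches the last `window` samples,
    scored by cross-correlation normalized by the candidate's energy."""
    n = len(signal)
    target = signal[n - window:]
    best, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        cand = signal[n - window - lag:n - lag]
        num = sum(a * b for a, b in zip(target, cand))
        energy = sum(b * b for b in cand)
        score = num / energy ** 0.5 if energy > 0 else float("-inf")
        if score > best_score:
            best, best_score = lag, score
    return best

def decimate(signal, factor):
    """Crude low-pass and decimate: average each block of `factor` samples."""
    return [sum(signal[i:i + factor]) / factor
            for i in range(0, len(signal) - factor + 1, factor)]

def two_stage_lag(signal, min_lag, max_lag, window, factor=2):
    """Coarse lag on the decimated signal, refined on the full signal."""
    coarse = best_lag(decimate(signal, factor),
                      max(1, min_lag // factor), max_lag // factor,
                      window // factor)
    lo = max(min_lag, coarse * factor - factor)
    hi = min(max_lag, coarse * factor + factor)
    return best_lag(signal, lo, hi, window)

def extrapolate(signal, lag, count):
    """Extend the signal by `count` samples by repeating it with period `lag`."""
    out = list(signal)
    for _ in range(count):
        out.append(out[-lag])
    return out[len(signal):]

pattern = [0, 1, 3, 6, 4, 2, -1, -5]
history = pattern * 6                       # perfectly periodic, period 8
lag = two_stage_lag(history, min_lag=4, max_lag=12, window=16)
print(lag, extrapolate(history, lag, 8))
```

On this periodic test signal the search recovers the period, and the extrapolated samples continue the pattern. The coarse stage evaluates far fewer lags on far fewer samples, which is the complexity saving that motivates searching the decimated signal first.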

34. A method for scalable processing of audio signals sampled at a first sampling rate and divided into frames corresponding to successive time intervals, where for each input frame one or more relatively short-size transform domain computations are performed over windows covering portions of the audio signal, comprising:
- receiving transform domain coefficients corresponding to said one or more transform domain computations; and
  
  directly reconstructing the audio signal at a second sampling rate lower than the first sampling rate using an inverse transform operating only on a portion of the received transform domain coefficients, without downsampling.
- - 35. The method of claim 34 wherein the one or more relatively short-size transform computations include Discrete Cosine transform (DCT) type IV computations, defined as:
  - 36. The method of claim 35, wherein the step of directly synthesizing at a ¼ sampling rate without downsampling comprises computing a (M/4)-point DCT type IV for the first quarter of the received DCT coefficients, as follows:
  - 37. The method of claim 35, wherein the step of directly synthesizing at a ½ sampling rate without downsampling comprises computing a (M/2)-point DCT type IV for the first half of the received DCT coefficients, as follows:
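
The idea behind claims 34 to 37, decoding at a lower sampling rate by inverse-transforming only the low-frequency coefficients, can be sketched as below. The equations of these claims are not reproduced in this record, so the textbook DCT type IV with orthonormal scaling is assumed here (with that scaling the transform is its own inverse). The function names are hypothetical, and the windowing, overlap-add and any amplitude rescaling between rates are left out.

```python
import math

def dct4(x):
    """Direct O(N^2) DCT type IV with orthonormal scaling (assumed).
    With this scaling, applying it twice returns the input."""
    n = len(x)
    scale = math.sqrt(2.0 / n)
    return [scale * sum(x[j] * math.cos(math.pi / n * (j + 0.5) * (k + 0.5))
                        for j in range(n))
            for k in range(n)]

def decode_reduced_rate(coeffs, divisor):
    """Inverse-transform only the first len(coeffs)/divisor coefficients,
    yielding len(coeffs)/divisor time samples with no downsampling filter."""
    keep = len(coeffs) // divisor
    return dct4(coeffs[:keep])

samples = [1.0, -2.0, 3.0, 0.5, -1.5, 2.5, 0.0, 4.0]
coeffs = dct4(samples)                          # M = 8 coefficients
half = decode_reduced_rate(coeffs, 2)           # (M/2)-point transform
quarter = decode_reduced_rate(coeffs, 4)        # (M/4)-point transform
print(len(half), len(quarter))
```

Because the smaller transform costs less than the full-size one and no separate decimation filter is run, the lower-rate output is also cheaper to produce, which matches the varying decoder complexity described in the abstract.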

38. A coding method for use in processing of audio signals divided into frames corresponding to successive time intervals, where for each input frame at least one transform domain computation is performed, and the transform coefficients are divided into NB bands, the method comprising:
- computing a base-2 logarithm of the average power of the transform coefficients in the NB bands to obtain a log-gain array LG(i), i=0, . . . , NB−1;
  
  encoding information about each frame based on the log-gain array LG(i), said encoded information comprising the transform coefficients, where the encoding step comprises:
  
  computing a quantized log-gain array LGQ(i), i=0, . . . , NB−1; and
  
  converting the quantized log-gain coefficients of the array LGQ(i) into a linear-gain domain using the following steps:
  
  (1) providing a table containing all possible values of the linear gain g(0) corresponding to the number of bits allocated to LGQ(0);
  
  (2) finding the value of g(0) using table lookup;
  
  (3) from the second band onward, applying the formula:
  
  $\begin{aligned} g(i) &= 2^{LGQ(i)/2} \\ &= 2^{\frac{1}{2}[DLGQ(i) + LGQ(i-1)]} \\ &= 2^{LGQ(i-1)/2} \times 2^{DLGQ(i)/2} \\ &= g(i-1) \times 2^{DLGQ(i)/2} \end{aligned}$ to compute recursively all linear gains using a single multiplication per linear gain, where each of the quantities $2^{DLGQ(i)/2}$ is found using table lookup; and
  
  decoding said encoded information about each frame to reconstruct the input audio signal.
- - 39. The method of claim 38 wherein the step of encoding information further comprises encoding the values of the log-gain array LG(i).
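
The recursion in step (3) of claim 38 can be checked with a short sketch. It assumes integer-valued quantized log-gains so they can index lookup tables, and it derives DLGQ(i) as LGQ(i) − LGQ(i−1), which is what the second line of the formula implies. The table ranges and the names `linear_gains`, `g0_table` and `step_table` are hypothetical.

```python
def linear_gains(lgq, g0_table, step_table):
    """Convert quantized log-gains LGQ(i) to linear gains g(i) = 2^(LGQ(i)/2)
    using one table lookup and one multiplication per band."""
    gains = [g0_table[lgq[0]]]               # steps (1) and (2): g(0) by lookup
    for i in range(1, len(lgq)):
        dlgq = lgq[i] - lgq[i - 1]           # DLGQ(i) = LGQ(i) - LGQ(i-1)
        gains.append(gains[-1] * step_table[dlgq])  # g(i) = g(i-1) * 2^(DLGQ/2)
    return gains

# Hypothetical tables covering the value ranges used in this example.
g0_table = {v: 2.0 ** (v / 2) for v in range(-8, 9)}
step_table = {d: 2.0 ** (d / 2) for d in range(-16, 17)}

lgq = [4, 6, 2, 2]
recursive = linear_gains(lgq, g0_table, step_table)
direct = [2.0 ** (v / 2) for v in lgq]
print(recursive, direct)
```

Both lists agree, while the recursive form avoids evaluating an exponential per band at run time.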

40. An embedded coding method for use in processing of an audio signal divided into frames corresponding to successive time intervals, where for each input frame at least one transform domain computation is performed and the resulting transform coefficients are divided into NB bands, each band having at least one transform coefficient, the method comprising:
- for a pre-specified first bit rate providing a first output bit stream which comprises information about transform coefficients in M₁≦NB bands and information about the average power in the M₁ bands, and wherein bit allocation is determined based on a target signal-to-noise ratio (TSNR) in the NB bands, said first output bit stream being sufficient to reconstruct a representation of the audio signal;
  
  for at least a second pre-specified bit rate higher than the first bit rate, providing an output bit stream embedding said first output bit stream and further comprising information about transform coefficients in M₂ bands, where M₁≦M₂≦NB, and information about the average power in the M₂ bands, and wherein bit allocation is determined based on the difference between the TSNR in the NB bands and a value determined by the number of bits allocated to each band at the next-lower bit rate; and
  
  reconstructing a representation of the input signal using an embedded bit stream corresponding to the desired bit rate.
- - 41. The method of claim 40 wherein the first output bit stream corresponds to a at a first bit rate;
    - for a given first bit rate, providing a bit allocation algorithm that takes into account band encoding information about each frame, said information comprising the transform coefficients, based on the gain array G(i); and
      
      decoding said encoded information about each frame to reconstruct the input audio signal.
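
The layered allocation of claim 40 can be sketched by running a greedy allocator once per layer on a residual TSNR. This is an illustration under assumptions: the residual is taken as the TSNR minus 2 per bit already allocated at lower rates, each layer has its own bit budget, and the band limits M₁ and M₂ are not modeled. The function names are hypothetical.

```python
def greedy(tsnr, widths, budget, step=2.0):
    """Give the band with the largest TSNR one bit per coefficient,
    lower its TSNR by `step`, and repeat while some band still fits."""
    tsnr = list(tsnr)
    bits = [0] * len(tsnr)
    while True:
        fits = [i for i, w in enumerate(widths) if w <= budget]
        if not fits:
            return bits
        best = max(fits, key=lambda i: tsnr[i])
        bits[best] += 1
        tsnr[best] -= step
        budget -= widths[best]

def embedded_allocation(tsnr, widths, layer_budgets, step=2.0):
    """Per-layer allocations; each layer works on the TSNR that remains
    after the bits assigned by all lower-rate layers."""
    total = [0] * len(tsnr)
    layers = []
    for budget in layer_budgets:
        residual = [t - step * b for t, b in zip(tsnr, total)]
        extra = greedy(residual, widths, budget, step)
        total = [a + b for a, b in zip(total, extra)]
        layers.append(extra)
    return layers

layers = embedded_allocation([6.0, 3.0, 1.0], [2, 2, 4], layer_budgets=[4, 6])
print(layers)
```

In this example the two layers together assign the same bits as a single allocation pass with the combined budget, and a decoder that receives only the first layer can still reconstruct a coarser signal from it.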

42. A system for embedded coding of audio signals comprising:
- (a) a frame extractor for dividing an input signal into a plurality of signal frames corresponding to successive time intervals;
  
  (b) means for providing transform-domain representations of the signal in each frame;
  
  (c) means for providing a first encoded data stream corresponding to a user-specified transform-domain representation, which first encoded data stream contains information sufficient to reconstruct a representation of the input signal;
  
  (d) means for providing one or more secondary encoded data streams comprising additional information in the transform-domain representation of the signal; and
  
  (e) means for providing an embedded output signal based at least on said first encoded data portion and said one or more secondary encoded data portions of the user-selected transform representation.

43. A method for processing audio signals, comprising:
- dividing an input audio signal into frames corresponding to successive time intervals;
  
  for each frame performing at least two relatively short-size transform computations to obtain a two-dimensional output transform coefficient array T(k,m) defined as:
  
  T(k,m), k=0, 1, 2, . . . , M−1, and m=0, 1, . . . , NTPF−1, where M is the number of transform coefficients in each transform, and NTPF is the number of transforms per frame;
  
  extracting one set of side information about the frame from said at least two relatively short-size transform computations;
  
  encoding information about the frame, said encoded information comprising the side information and transform coefficients T(k,m) from said at least two transform computations, wherein said transform coefficients are divided into NB frequency bands, and further wherein bit allocation is done by:
  
  (a) constructing an approximation of the signal spectrum envelope using the log-gains of the coefficients in the NB bands;
  
  (b) estimating a noise masking threshold function on the basis of the constructed approximation;
  
  (c) mapping the signal-to-masking threshold ratio to target signal-to-noise (TSNR) values; and
  
  (d) performing bit allocation based on the mapping in (c); and
  
  reconstructing the audio signal based on the encoded information.


Current Assignee: Alcatel-Lucent USA, Inc. (Nokia Corporation)
Original Assignee: Lucent Technologies, Inc. (Nokia Corporation)
Inventors: CHEN, JUIN-HWEY
Granted Patent: US 6,351,730 B2
US Class Current: 704/229
CPC Class Codes:

G10L 19/02   using spectral analysis, e....

G10L 19/0212   using orthogonal transforma...

G10L 19/022   Blocking, i.e. grouping of ...
