Digital signal processor having inverse discrete cosine transform engine for video decoding and partitioned distributed arithmetic multiply/accumulate unit therefor

0Associated
Cases 
0Associated
Defendants 
0Accused
Products 
7Forward
Citations 
0
Petitions 
6
Assignments
First Claim
1. A distributed arithmetic multiply/accumulate (MAC) unit for computing inverse discrete cosine transformations, comprising:
 a first pipeline stage of said distributed arithmetic MAC configured to perform dot products on received sequential input data, wherein row transformations are performed initially and column transformations are performed subsequent to said row transformations; and
a second pipeline stage of said distributed arithmetic MAC coupled to said first pipeline stage and configured to compute additions and subtractions of said dot products to yield sequential output data,wherein said distributed arithmetic MAC unit has a 12bit precision.
6 Assignments
0 Petitions
Accused Products
Abstract
A distributed arithmetic multiply/accumulate (MAC) unit for computing inverse discrete cosine transformations (IDCTs). In one embodiment, the distributed arithmetic MAC unit includes: (1) a first pipeline stage configured to perform dot products on received sequential input data and (2) a second pipeline stage coupled to the first pipeline stage and configured to compute additions and subtractions of the dot products to yield sequential output data.
12 Citations
View as Search Results
Digital Signal Processing Systems  
Patent #
US 20110055445A1
Filed 03/15/2010

Current Assignee
Sunpower Corporation

Sponsoring Entity
Sunpower Corporation

Function Generator  
Patent #
US 20110055303A1
Filed 03/15/2010

Current Assignee
Sunpower Corporation

Sponsoring Entity
Sunpower Corporation

Transform and Quantization Architecture for Video Coding and Decoding  
Patent #
US 20120082212A1
Filed 09/30/2011

Current Assignee
Texas Instruments Inc.

Sponsoring Entity
Texas Instruments Inc.

Transform and quantization architecture for video coding and decoding  
Patent #
US 9,378,185 B2
Filed 09/30/2011

Current Assignee
Texas Instruments Inc.

Sponsoring Entity
Texas Instruments Inc.

Transform and quantization architecture for video coding and decoding  
Patent #
US 10,326,991 B2
Filed 06/24/2016

Current Assignee
Texas Instruments Inc.

Sponsoring Entity
Texas Instruments Inc.

Processor for realizing at least two categories of functions  
Patent #
US 10,372,359 B2
Filed 05/10/2017

Current Assignee
ChengDu HaiCun IP Technology LLC

Sponsoring Entity
ChengDu HaiCun IP Technology LLC

Configurable processor with inpackage lookup table  
Patent #
US 10,445,067 B2
Filed 11/28/2018

Current Assignee
Hangzhou Haicun Information Technology Co. Ltd.

Sponsoring Entity
Hangzhou Haicun Information Technology Co. Ltd.

Rapid and low cost of inverse discrete cosine transform system and method thereof  
Patent #
US 20050050128A1
Filed 07/22/2004

Current Assignee
MediaTek Inc.

Sponsoring Entity
MediaTek Inc.

Dct arithmetic device  
Patent #
US 6,574,648 B1
Filed 10/02/2000

Current Assignee
Matsushita Electric Industrial Company Limited

Sponsoring Entity
Matsushita Electric Industrial Company Limited

Inverse discrete cosine transformer in an MPEG decoder  
Patent #
US 6,327,602 B1
Filed 04/07/1999

Current Assignee
LG Electronics Inc.

Sponsoring Entity
LG Electronics Inc.

Method and apparatus for performing fast discrete cosine transforms and fast inverse discrete cosine transforms using lookup tables  
Patent #
US 6,112,219 A
Filed 09/23/1993

Current Assignee
Intel Corporation

Sponsoring Entity
RealNetworks Inc.

Matrix multiplier circuit  
Patent #
US 5,226,002 A
Filed 06/28/1991

Current Assignee
Industrial Technology Research Institute

Sponsoring Entity
Industrial Technology Research Institute

16 Claims
 1. A distributed arithmetic multiply/accumulate (MAC) unit for computing inverse discrete cosine transformations, comprising:
a first pipeline stage of said distributed arithmetic MAC configured to perform dot products on received sequential input data, wherein row transformations are performed initially and column transformations are performed subsequent to said row transformations; and a second pipeline stage of said distributed arithmetic MAC coupled to said first pipeline stage and configured to compute additions and subtractions of said dot products to yield sequential output data, wherein said distributed arithmetic MAC unit has a 12bit precision.  View Dependent Claims (2, 3, 4, 5, 6, 7, 8)
 9. A digital signal processor (DSP), comprising:
a core containing a pipeline control unit having a coprocessor interface; and a coprocessor coupled to said core and containing an distributed arithmetic multiply/accumulate (MAC) unit for computing inverse discrete cosine transformations, including; a first pipeline stage of said distributed arithmetic MAC configured to perform dot products based on sequential input data received from said coprocessor interface, wherein row transformations are performed initially and column transformations are performed subsequent to said row transformations, and a second pipeline stage of said distributed arithmetic MAC coupled to said first pipeline stage and configured to compute additions and subtractions of said dot products to yield sequential output data for transmission toward said coprocessor interface, wherein said distributed arithmetic MAC unit has a 12bit precision.  View Dependent Claims (10, 11, 12, 13, 14, 15, 16)
1 Specification
The present application is related to U.S. patent application Ser. No. 11/083,646, filed concurrently herewith by Rayala, entitled “Digital Signal Processor having Inverse Discrete Cosine Transform Engine for Video Decoding and Partitioned Distributed Arithmetic Multiply/Accumulate Unit Therefor,” commonly assigned with the present invention and incorporated herein by reference in its entirety.
The present invention is directed, in general, to digital signal processors and, more specifically, to a digital signal processor having an inverse discrete cosine transformation (IDCT) IDCT engine for video decoding and a partitioned distributed arithmetic multiply/accumulate (MAC) unit for such IDCT engine.
Twodimensional (2D) discrete cosine transformation (DCT) operations are commonly used in Motion Picture Experts Group (MPEG) video encoding to decorrelate the image samples which allows compression of the input data. DCT is a close approximation of the optimal KarhunenLoeve transform for a wide class of images. In the IDCT engine, an inverse discrete cosine transformation (IDCT) is used to recorrelate the image samples. In MPEG encoding or decoding, a 2D 8×8 IDCT is normally used.
Since IDCT is an orthogonal transformation, the 2D transform can be computed by first performing onedimensional (1D) transformations on the eight rows and then performing 1D transformations on the columns of the result from the row stage. Those skilled in the art usually refer to this process as “rowcolumn decomposition.” Various fast algorithms have been developed for computing the 1D IDCT, but are outside the scope of the present discussion.
To perform MPEG decoding, a software implementation of the IDCT has been developed for the ZSP500 Digital Signal Processor (see, ZSP500 Digital Signal Processor Technical Reference Manual, LSI Logic Corp., Milpitas, Calif., USA, 2003, incorporated herein by reference in its entirety). To take advantage of the architectural features of the ZSP500 processor, an EvenOdd decomposition IDCT algorithm (see, e.g., Hung, et al., “A Compact IDCT design for MPEG Video Decoding,” in Proceedings 1997 IEEE Workshop on Signal Processing Systems (SiPS 1997), November 1997) incorporated herein by reference in its entirety. This algorithm has uniform butterfly structure that makes it amenable for implementing on processors such as ZSP500. Unfortunately, software implementations are computationally burdensome, reducing the speed at which a DSP can perform IDCT and reducing processor bandwidth perhaps useful for purposes.
Accordingly, what is needed in the art is a hardwarebased IDCT circuit. More specifically, what is needed in the art is a hardwarebased IDCT circuit that is fast and efficient and can be used as a coprocessor for a DSP, for example.
To address the abovediscussed deficiencies of the prior art, the present invention provides, in one aspect, a distributed arithmetic multiply/accumulate (MAC) unit for computing IDCTs. In one embodiment, the distributed arithmetic MAC unit includes: (1) a first pipeline stage configured to perform dot products on received sequential input data and (2) a second pipeline stage coupled to the first pipeline stage and configured to compute additions and subtractions of the dot products to yield sequential output data.
In another aspect, the present invention provides a DSP. In one embodiment, the DSP includes: (1) a core containing a pipeline control unit having a coprocessor interface and (2) a coprocessor coupled to the core and containing an distributed arithmetic multiply/accumulate (MAC) unit for computing inverse discrete cosine transformations, including: (2a) a first pipeline stage configured to perform dot products based on sequential input data received from the coprocessor interface and (2b) a second pipeline stage coupled to the first pipeline stage and configured to compute additions and subtractions of the dot products to yield sequential output data for transmission toward the coprocessor interface.
The foregoing has outlined preferred and alternative features of the present invention so that those skilled in the art may better understand the detailed description of the invention that follows. Additional features of the invention will be described hereinafter that form the subject of the claims of the invention. Those skilled in the art should appreciate that they can readily use the disclosed conception and specific embodiment as a basis for designing or modifying other structures for carrying out the same purposes of the present invention. Those skilled in the art should also realize that such equivalent constructions do not depart from the spirit and scope of the invention.
For a more complete understanding of the invention, reference is now made to the following descriptions taken in conjunction with the accompanying drawing, in which:
Before describing an embodiment of an IDCT engine constructed in accordance with the broad principles of the present invention, some preferred design goals will first be outlined. More specifically, the illustrated embodiment of the IDCT engine has two broad design goals. First, the IDCT engine should be able to support multiple frame sizes and frame rates in MPEG2 and MPEG4 standards. These include the frame rates and sizes of the MPEG2 profiles from MP@LL to MP@HL and MPEG4 profiles from SP@L3 to Main@L4. The profiles MP@HL and Main@L4 specify the maximum digital High Definition Television (HDTV) frame size and rate requirements. These are given in Table 1, below.
The IDCT engine can certainly support other profiles with lower performance requirements. To support the above profiles, the IDCT engine has to perform 8×8 2D IDCTs with frame sizes and rates shown in Table 1. The number of IDCTs that need to be calculated at these frame sizes and rates in one second is shown in Table 2, below. Assuming that one 2D 8×8 IDCT can be computed in 100 clock cycles, the clock at which the IDCT engine has to be run is also shown in Table 2.
Second, the IDCT engine should, in some embodiments, be employable as a coprocessor to the ZSP programmable DSP core (see, ZSP500 Digital Signal Processor Technical Reference Manual, supra, incorporated herein by reference in its entirety. Accordingly, as an example,
The exemplary embodiment of the DSP core 100 is a ZSP500, commercially available from LSI Logic of Cupertino, Calif. The DSP core 100 includes a prefetch unit (PFU) 110 configured to prefetch instructions from memory (not shown) and store them temporarily in an instruction cache (shown but not separately referenced). An instruction sequence unit (ISU) 120 is coupled to the PFU and is configured to retrieve instructions from the instruction cache of the PFU 110 in proper order for efficient and orderly execution.
A pipeline control unit (PCU) 130 is coupled to the ISU 120 and is configured to control the execution of the instructions as ordered by the ISU 120. Accordingly, the PCU 130 is illustrated as including an interrupt control 132, a coprocessor interface 134, a debug interface 136 (advantageously a Joint Test Action Group, or JTAG, interface) and a control register file 138 that contains control registers. A register file 150 is coupled to the ISU 120 and the PCU 130 and contains operands upon which the instructions work.
A load/store unit (LSU) 140 is coupled to the register file 150 and is configured to load (retrieve) operands from memory and store (write back) operands to memory. Bypass logic 160 is coupled to the register file 150 and the PCU 130 and is configured to route operands to various processing resources in a multiply/arithmetic logic unit (MAC/ALU) 170 and an ALU unit 180. The MAC/ALU 170 includes a 40bit ALU 172 and first and second 16×16 MACs 174, 176. The ALU unit 180 includes first and second 16bit ALUs 182, 184.
When used as a coprocessor, the IDCT engine (not shown in
The input to the exemplary IDCT engine has 12bit precision. During decoding, the internal precision of the IDCT algorithm affects the quality of the reconstructed image. Therefore, to ensure that the encoded stream gets decoded with minimal degradation, it is preferable that the implementation of the IDCT algorithm in the IDCT engine hold bit errors in the reconstructed output to within certain tolerable levels. These are specified as part of the IEEE 1180 specification (see, IEEE Standard Specifications for the Implementation of 8×8 Inverse Discrete Cosine Transform, IEEE 11801990 (R1997), IEEE, 1990, incorporated herein by reference in its entirety). The MPEG2 and MPEG4 standards specify a few more tests in addition to those IEEE 1180 sets. Any MPEG2 and MPEG4 IDCT implementations must also to satisfy these requirements.
An 8point, onedimensional IDCT is defined as:
where X_{m }are the IDCT inputs and x_{n }are the pixels from the inverse transformation. The coefficients α_{m }are defined as:
Using matrix symmetry properties, the 1D IDCT can be decomposed in to even and odd parts known as “evenodd decomposition.” Evenodd decomposition can be represented in matrixvector form as two matrices:
where the coefficients c_{k }are defined as:
Then the IDCT output can be obtained from the above as two matrices:
Turning now to
As mentioned above, a 2D IDCT can be performed in two stages. First, a 1D IDCT of the rows is computed. Then, a 1D IDCT is computed on the columns of the intermediate result. Using this technique, a computation of 2D IDCT can be pipelined into two stages. In the first stage, eight 1D IDCTs are performed along the rows are performed, and the intermediate results are stored. Then, using the same 1D IDCT engine, the next eight 1D IDCTs are performed on the columns of the intermediate results.
The IDCT engine may be used with different video system architectures. For example, the IDCT engine may be used as a looselycoupled coprocessor (LCC) to ZSP secondgeneration DSPs or as a memorymapped coprocessor (MMC) to ZSP firstgeneration DSPs (as these DSPs do not have a coprocessor interface). When used as a coprocessor, the input to the IDCT engine may come from the DSP. Alternatively, if the coprocessor interface bandwidth is not sufficient, a dedicated memory interface may be used to get the input directly from the DSP'"'"'s memory. In both cases, a straightforward controller will manage the data and the operation of the IDCT engine, allowing the IDCT engine to be adapted to a particular application.
In view of the above, the IDCT engine has been designed to perform only one stage (row or column) of the 2D IDCT at a time. The IDCT engine reads data sequentially, performs either the row or the column transformation and provides the results sequentially. When enabled, the IDCT engine computes eight 1D IDCTs in a pipelined fashion.
As mentioned above, an external controller may need to be designed depending on the application. The controller advantageously provides the data to the IDCT engine sequentially and manages the IDCT engine'"'"'s input/output and temporary buffers. In the first stage (processing of rows), the controller advantageously reads the data either from the ZSP processor or from a dedicated memory interface and stores the intermediate results in the temporary buffer. In the second stage (processing of columns), the controller advantageously reads data from the temporary buffer and sends the final results to either a ZSP processor or to the dedicated memory interface.
Two primary stages of computation are evident from Equations (3) and (5), above, and the signal flow diagram of the IDCT algorithm in
Consider the first row of the first matrix of Equation (3):
r_{0}=X_{0}c_{4}+X_{4}c_{4}+X_{2}c_{2}+X_{6}c_{6} (6)
Equation (6) is a sumofproduct of the elements of two vectors of length four. Similarly, the other rows in the first matrix of Equation (3) can also be described in the same way. The computation of the values r_{0 }. . . r_{3 }and s_{0 }. . . s_{3 }involves multiplyaccumulate operations. If these eight values are computed by eight independent arithmetic units, the multiplier and the adder structures needed for implementing the multiplyaccumulate operation determine the area and speed of these units.
However, recall that in the two matrices of Equation (3), the coefficients c_{1 }. . . c_{7 }are constants. Since c_{1 }. . . c_{7 }are constants, full multipliers are not needed in implementing the multiplyaccumulate units. Stanley A. White, in “Applications of Distributed Arithmetic to Digital Signal Processing: A Tutorial Review,” IEEE ASSP Magazine, pp. 419, July 1989 (incorporated herein by reference), presents one of the ways to implement such multiplyaccumulate operations efficiently when the coefficients are constants.
Since the coefficient values are known a priori, the multiplyaccumulate operations can be distributed in lookup table (LUT) operations and additions/subtractions (see, e.g., White, supra). Consider the computation of dot product of two vectors of length K:
where y is the result of the dot product, X_{k }are two'"'"'s complement N+1bit input samples and c_{k }are the Mbit fixed coefficients. Since the inputs X_{k }to the IDCT algorithm are signed integer values, they can be expressed as:
where b_{kn }are either 0 or 1, and b_{kN }is the sign bit. Substituting Equation (8) in Equation (7) yields:
Equation (9) represents the basic distributed arithmetic operation. Note that the above formulation is slightly different from the one in White (supra) where the inputs are assumed to be fractions with the property that X_{k}<1. Since the bits b_{kn }take only values 0 and 1 and the values c_{n }are fixed coefficients, the term in the brackets of Equation (9) can only take 2K values. Therefore, these can be precomputed and stored as one or more ROM LUTs. The bits b_{kn }can be extracted from the input data and used as index in to these LUTs. The values read can then be accumulated in accordance with Equation (9). Various techniques may be employed to reduce the LUT sizes. White, supra, and Parhi, VLSI Digital Signal Processing Systems Design and Implementation, New York, John Wiley, 1999, incorporated herein by reference, provide more details on how distributed arithmetic can be implemented and various schemes to reduce the LUT sizes.
In the present IDCT case, the values for K, N and M are 4, 15 and 12 respectively. Equation (9) operates on one bit at a time. Accordingly, computing the dot product of K=4 length vectors, given a 16bit input, requires 16 clock cycles. The number of clock cycles required can be reduced by operating on multiple bits every clock cycle. White and Parhi, supra, describe a scheme that uses consecutive bits in the input samples and accumulates multiple LUT results into a single multiinput accumulator. For example, if four input bits are operated upon at a time, four LUT results have to be accumulated into a multiinput accumulator. Uramoto, et al., “A 100MHz 2D Discrete Cosine Transform Core Processor,” IEEE Journal of Solid State Circuits, vol. 27, no. 4, pp. 491499, April 1992 (incorporated herein by reference) discloses a scheme operating on two bits at a time. However, Uramoto, et al.'"'"'s scheme is limited to operating on only two bits at a time. More than two bits should be operated upon at a time to reduce latency.
Katayama, et al., “A Single Chip MPEG1 audio/video decoder using macrocore and cell based implementation,” IEEE VLSI Signal Processing, VIII (Osaka, Japan), pp. 431440, October 1995 (incorporated herein by reference) discloses a scheme operating on three bits at a time and uses multiple multiinput adders. Masaki, et al., “VLSI Implementation of Inverse Discrete Cosine Transformer and Motion Compensator for MPEG2 HDTV Video Decoding,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 5, pp. 387395, 1995 (incorporated herein by reference) discloses a scheme that operates on four bits and uses a WallaceTree implemented with a multiinput accumulator. Unfortunately, the critical path in both of these schemes is through the multiinput accumulator after the LUT operation. Since the IDCT engine of the present invention advantageously operates at high clock rates, a more efficient pipelining of the distributed arithmetic operation is much desired.
To pipeline more efficiently, one embodiment of the present invention calls for the distributed arithmetic algorithm to be partitioned into two stages. In the first stage, the distributed arithmetic algorithm is partitioned into parallel and smaller distributed arithmetic operations. In the second stage, the results from the first stage are added to get the final results. This partitioning ensures that the critical path involves only the delay of a twoinput adder (not that of a multiinput adder as in other schemes) after the table lookup operation just like the original DA algorithm. This enables achieving very high operating clock speeds and lower silicon area. Also, if more than four bits need to be processed at a time in order to reduce the latency more, only the partitioning of the DA algorithm as described herein allows it. The schemes based on multiinput accumulators do not scale well when more than four bits need to be processed as the number of inputs to the accumulator increases. As the number of inputs to the accumulator increases, achieving higher clock speeds and lower silicon area becomes much more difficult. The resulting algorithm, which will hereinafter be referred to as the partitioned distributed arithmetic (PDA) algorithm, and a concomitant arithmetic MAC architecture will now be discussed. The PDA algorithm enables very efficient pipelining.
The value of each of the terms in the brackets depends only on the index n. Therefore, Equation (9) can be represented as:
where F(c,n) is defined as:
The functions F(c,n) represent a LUT operation in which a group of n^{th }bits of the input values are used as index into a LUT. Since, in the case of IDCT engine, the value of N is 15, Equation (10) can be written as:
y=2^{0}[F(c,0)+2^{1}F(c,1)+2^{2}F(c,2)+2^{3}F(c,3)]
2^{4}[F(c,4)+2^{1}F(c,5)+2^{2}F(c,6)+2^{3}F(c,7)]
2^{8}[F(c,8)+2^{1}F(c,9)+2^{2}F(c,10)+2^{3}F(c,11)]
2^{12}[F(c,12)+2^{1}F(c,13)+2^{2}F(c,14)+2^{3}F(c,15)] (12)
The terms in the brackets in Equation (11) can be accumulated independent of each other to generate partial sums which can then be accumulated again to get the final result. Rewriting Equation (12) yields:
y=2^{0}Q_{0}+2^{4}Q_{1}+2^{8}Q_{2}+2^{12}Q_{3} (13)
where the partial sums Q_{0}, Q_{1}, Q_{2 }and Q_{3 }are defined as:
Q_{0}=F(c,0)+2^{1}F(c,1)+2^{2}F(c,2)+2^{3}F(c,3)
Q_{1}=F(c,4)+2^{1}F(c,5)+2^{2}F(c,6)+2^{3}F(c,7)
Q_{2}=F(c,8)+2^{1}F(c,9)+2^{2}F(c,10)+2^{3}F(c,11)
Q_{3}=F(c,12)+2^{1}F(c,13)+2^{2}F(c,14)+2^{3}F(c,15) (14)
From Equations (13) and (14), it can be seen that the distributed arithmetic operation has been partitioned into two stages. The first stage corresponds to the independent accumulations defined in the Equation (14). The second stage then is the final accumulation defined in Equation (13).
In Equation (14), the distributed arithmetic operation has been partitioned into four independent accumulations, so four input bits are operated upon at a time. Since in the present IDCT case, the number of bits in the input samples is (N+1)=16, the four independent accumulations in Equation (14) can be completed in four clock cycles.
Equation (13) involves four additions which can be completed in four clock cycles by using a single adder. Therefore, the two stages of the PDA can be block pipelined so that after every four clock cycles, the second stage operation can be performed on the results from the first stage. As stated above, eight values (namely r_{0 }. . . r_{3 }and s_{0 }. . . s_{3}) need to be computed in the first stage of the IDCT algorithm. Since the computation of these values involves similar operations, the architecture of an exemplary PDA MAC unit for computing r_{0 }in these two pipeline stages will now be discussed.
To compute the sumofproducts using distributed arithmetic, all the eight IDCT values are needed before the computation can be started. Also, since the sumofproducts stage of the 1D IDCT has been partitioned into two stages, it can be seen that there are four main stages in the computation of IDCT. Using this property, an efficient architecture using a pipelining scheme at the block and clock cycle level is developed.
Turning now to
The 1D IDCT engine 300 has four input and two output ports. An input signal idct1d_enable is a control signal that enables or disables the IDCT engine. After the IDCT engine 300 is enabled, from next clock cycle onwards, the IDCT engine 300 accepts two 16bit IDCT samples every clock cycle represented by the input signals idct1d_in0 and idct1d_in1. Since eight 1D IDCTs contain 64 data samples, the data samples can be fed to the IDCT engine 300 in 32 clock cycles. The signal idct2d_stage informs the IDCT engine 300 which stage of the 2D IDCT is being performed. The signal idct2d_stage is used (a) in selecting an appropriate rounding factor that is needed to make the IDCT engine 300 IEEE 1180compliant and (b) to select appropriate bits for the output.
The IDCT engine 300 has two 16bit outputs, namely idct1d_out0 and idct1d_out1, and can sustain two outputs per clock cycle after a fixed latency from the time the IDCT engine 300 is enabled. One stage of a 2D IDCT can potentially be performed in 46 clock cycles. As a result, a 2D IDCT can potentially be completed in around 92 clock cycles.
Turning now to
As stated above, the PDAbased 1D IDCT algorithm primarily has four stages of operation. By pipelining these four stages at a block level, the IDCT engine can be made to operate on multiple 1D IDCTs sequentially. The idea of block pipelining will now be shown. Accordingly, turning now to
It is apparent from
Pipeline Stage B_{0 }
Since, from
Since the first stage of the PDA needs all the eight inputs for the next four clock cycles, these samples need to be preserved when the next eight samples are being gathered in B_{0 }pipeline stage. This can be done by capturing the inputs into a simple register file in the pipeline stage B_{0 }and then transferring to another register file at the end of every four clock cycles. The first stage of PDA in the pipeline stage B_{1 }then uses the second register file as input.
Pipeline Stage B_{1 }
In the pipeline stage B_{1}, the first stage of the PDA is performed. Consider the partial sum Q_{0 }in Equation (6). It involves four lookup table operations and a shifting and accumulation of the results. Since the precision of the coefficients is 12 bits, the functions F(c,n) defined in the matrices of Equation (3) need 14 bits to represent the maximum value. In Equation (6), since the lookup table values are shifted by a maximum shift of four, the final result of the partial sums can be represented by 18 bits. As a result, 20bit adders may be advantageously employed to implement the first stage of the PDA algorithm. An exemplary architecture of the pipeline stage B_{1 }will now be described with reference to
Four identical subunits 610, 620, 630, 640 correspond to the four partial sums in Equation (6). The input signals B1_in0, B1_in2, B1_in4 and B1_in6 represent the input samples gathered in the pipeline stage B_{0}. The registers B2_Acc0 . . . B2_Acc3 hold the partial sums Q_{0 }. . . Q_{3 }in Equation (6).
As can be seen in
If the shift and accumulation of the table lookup is implemented as defined in Equation (6), a costly barrel shifter would be needed. However, to avoid the barrel shifter, the shiftaccumulation may be advantageously performed by shifting the table lookup output value by a fixed amount to the left and then shifting the accumulator right to obtain an operation equivalent to a barrel shifter.
Since the second stage B_{2 }of the PDA needs to keep the accumulated values from the first stage B_{1 }until it completes the final result, two separate registers are used as accumulators. In the first three clock cycles represented by the states S_{0}, S_{1 }and S_{2}, the values are accumulated into an internal accumulator. For example, in the first subunit 610, the internal accumulator is represented by a register B1_Acc0. In the fourth clock cycle represented by the state S_{3}, the internal accumulator is cleared for the next accumulation, and the final result is routed to a second register B2_Acc0 which acts as input to the B_{2 }pipeline stage.
The path defined by the LUT operation, followed by the accumulation into the register is the critical path in the illustrated embodiment of the IDCT engine, since it determines the speed at which the IDCT engine can be clocked.
Pipeline Stage B_{2 }
In pipeline stage B_{2}, the second stage of the PDA is performed; partial sums are accumulated to yield the final result r_{0 }as defined in Equation (5). At the end of every four clock cycles, four registered outputs represented by accumulators B2_Acc0, B2_Acc1, B2_Acc2 and B2_Acc3 from pipeline stage B_{1 }are available. These are to be accumulated over four clock cycles in pipeline stage B_{2}. This can implemented by the architecture shown in
In the first clock cycle corresponding to the state S0, the accumulator is initialized by the result from the addition of the B2_Acc0 accumulator shifted by 12 with the initialization value as shown in
In the second, third and fourth clock cycles corresponding to the states S_{1}, S_{2 }and S_{3}, the accumulator is shifted right by four and the shifted values from the first stage of PDA are accumulated to the register. At the end of four clock cycles, the new value of the accumulator is available for the next pipeline stage B_{3 }for performing the addition/subtraction butterfly operation. Since the accumulators r_{0 }. . . r_{3 }and s_{0 }. . . s_{3 }are initialized at the end of the first clock cycle, they need to be preserved in the pipeline stage B_{3}. This is done by copying all of the accumulator values to another register file so that they remain available during the pipeline stage B_{3}.
The amount of rounding required to be performed on the final results from the addition/subtraction butterfly operation in the pipeline stage B_{3 }determines the initialization value. The value of rounding itself is decided based on (a) whether row or column stage is being processed and (b) where the most significant bits of the output are present in the 32bit accumulator.
In the row processing stage of the IDCT engine, the input and the constant coefficients are of 12bit precision. Taking into consideration the maximum possible bit growth, bits 9 to 24 are the most significant 16bits of the output. It is advantageous to add a half bit to the final result to achieve the desired rounding. Therefore, since the least significant bit starts at bit 9, the rounding value is 28=256.
In the column stage of the IDCT, the input is of 16bit and the constant coefficients are of 12bit precision. Taking into consideration, the maximum possible bit growth, bits 16 to 31 are the most significant 16bit outputs. Here also, a half bit should be added to the final result to achieve the desired rounding. Therefore, since the least significant bit starts at bit 16, the rounding value is 215=16384.
Pipeline Stage B_{3 }
In the pipeline stage B_{3}, three operations are performed. First, the outputs from the PDA MAC units are added or subtracted as shown in the Equation (5) and per the signal flow graph of
The architecture of the B_{3 }pipeline stage is built around two 32bit adder/subtractors 810, 820. In every clock cycle, either two additions or subtractions are performed. Appropriate inputs to the adder/subtractors 810, 820 are routed depending which of the IDCT outputs are being.
Depending on whether the row or column IDCTs are being performed, appropriate bits are extracted from the outputs of the adder/subtractors 810, 820 of
Turning now to
From the second clock cycle onwards, the IDCT engine accepts two 16bit input samples every clock cycle for the next 32 clock cycles. Therefore, the pipeline stage B_{0 }is active during this time. At the beginning of the fifth clock cycle, the PDA MAC units are enabled to begin the processing of the pipeline stage B_{1 }of the PDA. At the end of the fifth clock cycle, the eight IDCT inputs are transferred to the pipeline stage B_{1}. At that time, the pipeline stage B_{1 }becomes active and stays active for the next 32 clock cycles.
At the end of the ninth clock cycle, the pipeline stage B_{2 }of the PDA becomes active and stays active for the next 32 clock cycles. After 13 clock cycles, the pipeline stage B_{3 }becomes active and stays active for the next 32 clock cycles. The first pair of outputs appear at the end of the 15^{th }clock cycle. Then the IDCT engine continues to accept input and sustain the output every clock cycle. At the end of the 33^{rd }clock cycle and assuming no further IDCT work to be done, the pipeline stage B_{0 }becomes inactive as all the 64 inputs samples have been fed to the IDCT engine. After this, the pipeline stages B_{1}, B_{2 }and B_{3 }becomes inactive at the end of 37^{th}, 41^{st }and 45^{th }clock cycles, respectively.
At the beginning of the 45^{th }clock cycle, the controller may disable the IDCT engine. Also, at the end of 45^{th }clock cycle, the last pair of IDCT outputs are available after which the controller may change the idct2d_stage signal. From
Recall from
Accordingly, turning now to
From the above, it is apparent that a novel architecture and exemplary implementation details of an IDCT engine for MPEG decoding applications has been introduced. The IDCT engine is based on an efficient algorithm called “evenodd decomposition.” The computational and precision requirements and the ways in which the IDCT engine can be used either in a standalone mode or as a coprocessor are discussed. To achieve high clock rates and small area, a technique for efficient pipelining according to a distributed arithmetic method is introduced.
Although the present invention has been described in detail, those skilled in the art should understand that they can make various changes, substitutions and alterations herein without departing from the spirit and scope of the invention in its broadest form.