EFFICIENT COMPLEX MULTIPLICATION AND FAST FOURIER TRANSFORM (FFT) IMPLEMENTATION ON THE MANARRAY ARCHITECTURE

US 20030088601A1
Filed: 06/22/1999
Published: 05/08/2003
Est. Priority Date: 10/09/1998
Status: Active Grant

Abstract

Efficient computation of complex multiplication results and very efficient fast Fourier transforms (FFTs) are provided. A parallel array VLIW digital signal processor is employed along with specialized complex multiplication instructions and communication operations between the processing elements which are overlapped with computation to provide very high performance operation. Successive iterations of a loop of tightly packed VLIWs are used allowing the complex multiplication pipeline hardware to be efficiently used. In addition, efficient techniques for supporting combined multiply accumulate operations are described.

39 Claims

1. An apparatus for the efficient processing of complex multiplication computations, the apparatus comprising:
- at least one controller sequence processor (SP);
  
  a memory for storing process control instructions;
  
  a first multiply complex numbers instruction stored in the memory and operative to control the PEs to carry out a multiplication operation involving a pair of complex numbers; and
  
  hardware for implementing the first multiply complex numbers instruction.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14)
- - 2. The apparatus of claim 1 further comprising a plurality of processing elements (PEs) interconnected with said SP and arranged in an N×N array interconnected in a manifold array interconnection network.
  - 3. The apparatus of claim 1 wherein the first multiply complex instruction completes execution in 2 cycles.
  - 4. The apparatus of claim 1 wherein the first multiply complex instruction is tightly pipelineable.
  - 5. The apparatus of claim 1 wherein each complex number is stored as a word, each word comprising a first half word and a second half word, with a real component of each complex number being stored as the first half word and an imaginary component of each complex number being stored as the second half word.
  - 6. The apparatus of claim 1 wherein the first multiply complex instruction includes a plurality of rounding modes, the rounding modes including:
    - rounding toward a nearest integer;
      
      rounding toward zero;
      
      rounding toward infinity; and
      
      rounding toward negative infinity.
  - 7. The apparatus of claim 1 wherein the first multiply complex numbers instruction is one of the following group of instructions:
    - a multiply complex numbers instruction (MPYCX), a multiply complex numbers instruction (MPYCXJ) operative to carry out the multiplication of a pair of complex numbers where an argument is conjugated, a multiply complex numbers instruction (MPYCXD2) operative to carry out the multiplication of a pair of complex numbers with a result divided by two, and a multiply complex numbers instruction (MPYCXJD2) operative to carry out the multiplication of a pair of complex numbers where an argument is conjugated with a result divided by two.
  - 8. The apparatus of claim 1 further comprising a multiply accumulate unit including the memory for storing the first multiply complex numbers instruction.
  - 9. The apparatus of claim 8 wherein the multiply accumulate unit operates in response to a multiply accumulate instruction (MPYA) to extend a multiplication operation with an accumulate operation.
  - 10. The apparatus of claim 8 wherein the multiply accumulate unit operates in response to a sum two product accumulate instruction (SUM2PA) to extend two multiplication operations with an accumulate operation.
  - 11. The apparatus of claim 9 wherein the multiply accumulate unit operates in response to a multiply complex with accumulate instruction (MPYCXA) to carry out the multiplication of a pair of complex numbers with accumulation of a third complex number.
  - 12. The apparatus of claim 11 wherein the MPYCXA instruction completes execution in 2 cycles.
  - 13. The apparatus of claim 12 wherein the MPYCXA instruction is tightly pipelineable.
  - 14. The apparatus of claim 1 further comprising one or more of the following additional instructions (MPYCXA, MPYCXAD2, MPYCXJA or MPYCXJAD2) stored in the memory to carry out complex multiplication operations pipelined in 2 cycles.
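
The packed operand layout of claim 5 and the instruction variants of claim 7 can be illustrated with a small software model. This is only a sketch under stated assumptions, not the patented hardware: the helper names (`pack`, `unpack`, `mpycx`), the Q15 fixed-point scaling, the choice of the upper halfword for the real part, the choice of the second operand as the conjugated one, and the single round-to-nearest mode are all illustrative. Saturation on overflow is not modeled.

```python
def to_s16(v):
    """Interpret the low 16 bits of v as a signed halfword."""
    v &= 0xFFFF
    return v - 0x10000 if v & 0x8000 else v


def pack(re, im):
    """Pack a complex number into one 32-bit word.

    Assumption: the real part goes in the upper halfword and the
    imaginary part in the lower halfword.
    """
    return ((re & 0xFFFF) << 16) | (im & 0xFFFF)


def unpack(word):
    """Return (real, imaginary) as signed 16-bit integers."""
    return to_s16(word >> 16), to_s16(word)


def mpycx(a, b, conjugate=False, div2=False, frac_bits=15):
    """Multiply two packed complex words (hypothetical software model).

    conjugate=True models the MPYCXJ form (second operand conjugated,
    which is an assumption); div2=True models the D2 forms by shifting
    one extra bit before rounding.
    """
    ar, ai = unpack(a)
    br, bi = unpack(b)
    if conjugate:
        bi = -bi
    re = ar * br - ai * bi          # real part: difference of two products
    im = ar * bi + ai * br          # imaginary part: sum of two products
    shift = frac_bits + (1 if div2 else 0)
    half = 1 << (shift - 1)         # add half an LSB, then shift: round to nearest
    return pack((re + half) >> shift, (im + half) >> shift)


# (0.5 + 0.5i) * (0.5 - 0.25i) in Q15
a = pack(16384, 16384)
b = pack(16384, -8192)
print(unpack(mpycx(a, b)))
print(unpack(mpycx(a, b, div2=True)))
print(unpack(mpycx(a, b, conjugate=True)))
```

The four variants differ only in a sign flip on one imaginary operand and one extra bit of right shift, which is consistent with the claims describing them as forms of a single instruction.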

15. A method for the computation of an FFT by a plurality of processing elements (PEs), the method comprising the steps of:
- loading input data from a memory into each PE in a cyclic manner;
  
  calculating a local FFT by each PE;
  
  multiplying by the twiddle factors and calculating a FFT by the cluster of PEs; and
  
  loading the FFTs into the memory.
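
The four steps of claim 15 match the shape of a standard Cooley-Tukey decomposition, sketched below in plain Python. This is one possible reading, not the claimed implementation: the claim does not fix the factorization, so the decimation-in-time split, the function names `dft` and `distributed_fft`, and the use of a direct DFT for the local transforms are assumptions made for illustration. Each "PE" is simulated as a list entry.

```python
import cmath


def dft(x):
    """Direct O(n^2) DFT, used here both as reference and as the local transform."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]


def distributed_fft(x, num_pes):
    """Simulate an n-point FFT spread over num_pes processing elements."""
    n = len(x)
    assert n % num_pes == 0, "length must be a multiple of the PE count"
    m = n // num_pes

    # Step 1: cyclic load, PE p receives x[p], x[p + P], x[p + 2P], ...
    local = [x[p::num_pes] for p in range(num_pes)]

    # Step 2: each PE computes a local m-point transform.
    local = [dft(block) for block in local]

    # Step 3a: multiply by the twiddle factors W_n^(p*k).
    for p in range(num_pes):
        for k in range(m):
            local[p][k] *= cmath.exp(-2j * cmath.pi * p * k / n)

    # Step 3b: a num_pes-point transform across the PEs for each k.
    out = [0j] * n
    for k in range(m):
        column = dft([local[p][k] for p in range(num_pes)])
        for q in range(num_pes):
            # Step 4: store the result at its natural-order index.
            out[k + m * q] = column[q]
    return out


x = [complex(i, -0.5 * i) for i in range(16)]
err = max(abs(a - b) for a, b in zip(dft(x), distributed_fft(x, 4)))
print(err < 1e-9)
```

In this reading, only step 3b needs data from other PEs, which is where inter-PE communication would overlap with computation.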

16. A method for the computation of a distributed FFT by an N×N processing element (PE) array, the method comprising the steps of:
- loading a complex number x and a corresponding twiddle factor w from a memory into each of the PEs;
  
  calculating a first product by the multiplication of the complex numbers x and w;
  
  transmitting the first product from each of the PEs to another PE in the N×N array;
  
  receiving the first product and treating it as a second product in each of the PEs;
  
  selectively adding or subtracting the first product and the second product to form a first result;
  
  calculating a third product in selected PEs;
  
  transmitting the first result or third product in selected PEs to another PE in the N×N array;
  
  selectively adding or subtracting the received values to form a second result; and
  
  storing the second results in the memory.

17. A method for efficient computation by a 2×2 processing element (PE) array interconnected in a manifold array interconnection network, the array comprising four PEs (PE0, PE1, PE2 and PE3), the method comprising the steps of:
- loading a complex number x and a corresponding twiddle factor w from a memory into each of the four PEs, complex number x including subparts x0, x1, x2 and x3, twiddle factor w including subparts w0, w1, w2 and w3;
  
  multiplying the complex numbers x and w, such that PE0 multiplies x0 and w0 to produce a product0, PE1 multiplies x1 and w1 to produce a product1, PE2 multiplies x2 and w2 to produce a product2, and PE3 multiplies x3 and w3 to produce a product3;
  
  transmitting the product0, the product1, the product2 and the product3, such that PE0 transmits the product0 to PE2, PE1 transmits the product1 to PE3, PE2 transmits the product2 to PE0, and PE3 transmits the product3 to PE1; and
  
  performing arithmetic logic operations, such that PE0 adds the product0 and the product2 to produce a sum t0, PE1 adds the product1 and the product3 to produce a sum t2, PE2 subtracts the product2 from the product0 to produce a sum t1, and PE3 subtracts the product3 from the product1 to produce a result which is multiplied by −i to produce a sum t3.
- View Dependent Claims (18)
- - 18. The method of claim 17 further comprising the steps of:
    - transmitting the sums t0, t1, t2 and t3, such that PE0 transmits t0 to PE1, PE1 transmits t2 to PE0, PE2 transmits t1 to PE3, and PE3 transmits t3 to PE2;
      
      performing the arithmetic logic operations, such that PE0 adds t0 and t2 to produce a y0, PE1 subtracts t2 from t0 to produce a y1, PE2 adds t1 and t3 to produce a y2, and PE3 subtracts t3 from t1 to produce a y3; and
      
      storing y0, y1, y2 and y3 in a memory.
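
The data flow of claims 17 and 18 can be traced in a few lines of Python. The sketch below simulates the four PEs with ordinary variables; the function name `fft4_on_2x2` and the list-based exchange are illustrative assumptions, and floating-point `complex` values stand in for the packed fixed-point operands. Following the claimed additions and subtractions, the outputs (y0, y1, y2, y3) correspond to DFT bins (0, 2, 1, 3) of the four products, that is, bit-reversed order.

```python
def fft4_on_2x2(x, w):
    """Simulate the two exchange-and-combine stages on PE0..PE3."""
    # Each PE multiplies its own sample by its own twiddle factor.
    p = [x[k] * w[k] for k in range(4)]

    # First exchange: PE0 <-> PE2 and PE1 <-> PE3.
    recv = [p[2], p[3], p[0], p[1]]

    t0 = p[0] + recv[0]              # PE0: product0 + product2
    t2 = p[1] + recv[1]              # PE1: product1 + product3
    t1 = recv[2] - p[2]              # PE2: product0 - product2
    t3 = -1j * (recv[3] - p[3])      # PE3: (product1 - product3) * (-i)

    # Second exchange: PE0 <-> PE1 and PE2 <-> PE3.
    y0 = t0 + t2                     # PE0
    y1 = t0 - t2                     # PE1
    y2 = t1 + t3                     # PE2
    y3 = t1 - t3                     # PE3
    return [y0, y1, y2, y3]


x = [1 + 2j, -1 + 0.5j, 3 - 1j, 0.25 + 4j]
print(fft4_on_2x2(x, [1, 1, 1, 1]))
```

With all twiddle factors set to 1 the function returns the 4-point DFT of x in the order bins 0, 2, 1, 3.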

19. A special hardware instruction for handling the multiplication with accumulate for two complex numbers from a source register whereby utilizing said instruction an accumulated complex product of two source operands is rounded according to a rounding mode specified in the instruction and loaded into a target register with the complex numbers organized in the source such that a halfword (H1) contains the real component and a halfword (H0) contains the imaginary component.
- View Dependent Claims (20)
- - 20. The special hardware instruction of claim 19 wherein the accumulated complex product is divided by two before it is rounded.

21. An apparatus to efficiently fetch instructions including complex multiplication instructions and an accumulate form of multiplication instructions from a memory element and dispatch the fetched instruction to at least one of a plurality of multiply complex and multiply with accumulate execution units to carry out the instruction specified operation, the apparatus comprising:
- a memory element;
  
  means for fetching said instructions from the memory element;
  
  a plurality of multiply complex and multiply with accumulate execution units; and
  
  means to dispatch the fetched instruction to at least one of said plurality of execution units to carry out the instruction specified operation.
- View Dependent Claims (22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 36, 37, 38, 39)
- - 22. The apparatus of claim 21 further comprising:
    - an instruction register to hold a dispatched multiply complex instruction (MPYCX);
      
      means to decode the MPYCX instruction and control the execution of the MPYCX instruction;
      
      two source registers each holding a complex number as operand inputs to the multiply complex execution hardware;
      
      four multiplication units to generate terms of the complex multiplication;
      
      four pipeline registers to hold the multiplication results;
      
      an add function which adds two of the multiplication results from the pipeline registers for the imaginary component of the result;
      
      a subtract function which subtracts two of the multiplication results from the pipeline registers for the real component of the result;
      
      a round and select unit to format the real and imaginary results; and
      
      a result storage location for saving the final multiply complex result, whereby the apparatus is operative for the efficient processing of multiply complex computations.
  - 23. The apparatus of claim 21 wherein the means for fetching said instructions is a sequence processor (SP) controller.
  - 24. The apparatus of claim 22 wherein the round and select unit provides a shift right as a divide by 2 operation for a multiply complex divide by 2 instruction (MPYCXD2).
  - 25. The apparatus of claim 21 further comprising:
    - an instruction register to hold a dispatched multiply complex instruction (MPYCXJ);
      
      means to decode the MPYCXJ instruction and control the execution of the MPYCXJ instruction;
      
      two source registers each holding a complex number as operand inputs to the multiply complex execution hardware;
      
      four multiplication units to generate terms of the complex multiplication;
      
      four pipeline registers to hold the multiplication results;
      
      an add function which adds two of the multiplication results from the pipeline registers for the real component of the result;
      
      a subtract function which subtracts two of the multiplication results from the pipeline registers for the imaginary component of the result;
      
      a round and select unit to format the real and imaginary results; and
      
      a result storage location for saving the final multiply complex conjugate result, whereby the apparatus is operative for the efficient processing of multiply complex conjugate computations.
  - 26. The apparatus of claim 25 wherein the round and select unit provides a shift right as a divide by 2 operation for a multiply complex conjugate divide by 2 instruction (MPYCXJD2).
  - 27. The apparatus of claim 21 further comprising:
    - an instruction register to hold the dispatched multiply accumulate instruction (MPYA);
      
      means to decode the MPYA instruction and control the execution of the MPYA instruction;
      
      two source registers each holding a source operand as inputs to the multiply accumulate execution hardware;
      
      at least two multiplication units to generate two products of the multiplication;
      
      at least two pipeline registers to hold the multiplication results;
      
      at least two accumulate operand inputs to the second pipeline stage accumulate hardware;
      
      at least two add functions which each adds the results from the pipeline registers with the third accumulate operand creating two multiply accumulate results;
      
      a round and select unit to format the results if required by the MPYA instruction; and
      
      a result storage location for saving the final multiply accumulate result, whereby the apparatus is operative for the efficient processing of multiply accumulate computations.
  - 28. The apparatus of claim 21 further comprising:
    - an instruction register to hold a dispatched multiply accumulate instruction (SUM2PA);
      
      means to decode the SUM2PA instruction and control the execution of the SUM2PA instruction;
      
      at least two source registers each holding a source operand as inputs to the SUM2PA execution hardware;
      
      at least two multiplication units to generate two products of the multiplication;
      
      at least two pipeline registers to hold the multiplication results;
      
      at least one accumulate operand input to the second pipeline stage accumulate hardware;
      
      at least one add function which adds the results from the pipeline registers with the third accumulate operand creating a SUM2PA result;
      
      a round and select unit to format the results if required by the SUM2PA instruction; and
      
      a result storage location for saving the final result, whereby the apparatus is operative for the efficient processing of sum of 2 products accumulate computations.
  - 29. The apparatus of claim 21 further comprising:
    - an instruction register to hold the dispatched multiply complex accumulate instruction (MPYCXA);
      
      means to decode the MPYCXA instruction and control the execution of the MPYCXA instruction;
      
      two source registers each holding a complex number as operand inputs to the multiply complex accumulate execution hardware;
      
      four multiplication units to generate terms of the complex multiplication;
      
      four pipeline registers to hold the multiplication results;
      
      at least two accumulate operand inputs to the second pipeline stage accumulate hardware;
      
      an add function which adds two of the multiplication results from the pipeline registers and also adds one of the accumulate operand input for the imaginary component of the result;
      
      a subtract function which subtracts two of the multiplication results from the pipeline registers and also adds the other accumulate operand input for the real component of the result;
      
      a round and select unit to format the real and imaginary results; and
      
      a result storage location for saving the final multiply complex accumulate result, whereby the apparatus is operative for the efficient processing of multiply complex accumulate computations.
  - 30. The apparatus of claim 29 wherein the round and select unit provides a shift right as a divide by 2 operation for a multiply complex accumulate divide by 2 instruction (MPYCXAD2).
  - 31. The apparatus of claim 21 further comprising:
    - an instruction register to hold the dispatched multiply complex conjugate accumulate instruction (MPYCXJA);
      
      means to decode the MPYCXJA instruction and control the execution of the MPYCXJA instruction;
      
      two source registers each holding a complex number as operand inputs to the multiply complex accumulate execution hardware;
      
      four multiplication units to generate terms of the complex multiplication;
      
      four pipeline registers to hold the multiplication results;
      
      at least two accumulate operand inputs to the second pipeline stage accumulate hardware;
      
      an add function which adds two of the multiplication results from the pipeline registers and also adds one of the accumulate operand input for the real component of the result;
      
      a subtract function which subtracts two of the multiplication results from the pipeline registers and also adds the other accumulate operand input for the imaginary component of the result;
      
      a round and select unit to format the real and imaginary results; and
      
      a result storage location for saving the final multiply complex conjugate accumulate result, whereby the apparatus is operative for the efficient processing of multiply complex conjugate accumulate computations.
  - 32. The apparatus of claim 31 wherein the round and select unit provides a shift right as a divide by 2 operation for a multiply complex conjugate accumulate divide by 2 instruction (MPYCXJAD2).
  - 33. The apparatus of claim 21 wherein the complex multiplication instructions and accumulate form of multiplication instructions include MPYCX, MPYCXD2, MPYCXJ, MPYCXJD2, MPYCXA, MPYCXAD2, MPYCXJA, MPYCXJAD2 instructions, and all of these instructions complete execution in 2 cycles.
  - 34. The apparatus of claim 21 wherein the complex multiplication instructions and accumulate form of multiplication instructions include MPYCX, MPYCXD2, MPYCXJ, MPYCXJD2, MPYCXA, MPYCXAD2, MPYCXJA, MPYCXJAD2 instructions, and all of these instructions are tightly pipelineable.
  - 36. The apparatus of claim 22 wherein the add function and subtract function are selectively controlled functions allowing either addition or subtraction operations as specified by the instruction.
  - 37. The apparatus of claim 25 wherein the add function and subtract function are selectively controlled functions allowing either addition or subtraction operations as specified by the instruction.
  - 38. The apparatus of claim 29 wherein the add function and subtract function are selectively controlled functions allowing either addition or subtraction operations as specified by the instruction.
  - 39. The apparatus of claim 31 wherein the add function and subtract function are selectively controlled functions allowing either addition or subtraction operations as specified by the instruction.
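
The two-stage structure recited in claims 22 and 29 (four multipliers feeding pipeline registers, followed by an add, a subtract and the accumulate inputs) can be modeled as a small cycle-stepped simulation. This is a hedged sketch, not the actual execution unit: the class name `ComplexMacPipeline`, the tuple representation of complex integers, and the omission of the round-and-select stage are all assumptions made to keep the example short.

```python
class ComplexMacPipeline:
    """Two-stage model of a multiply complex accumulate unit (hypothetical).

    Stage 1 forms the four partial products; stage 2 combines them with
    the accumulate operand. Rounding and the divide-by-2 shift are omitted.
    """

    def __init__(self):
        self.stage1 = None  # pipeline registers: four products + accumulate operand

    def clock(self, op=None):
        """Advance one cycle.

        op is ((ar, ai), (br, bi), (cr, ci)) or None for a bubble.
        Returns the (re, im) result leaving stage 2 this cycle, or None.
        """
        result = None
        if self.stage1 is not None:
            rr, ii, ri, ir, (cr, ci) = self.stage1
            # Subtract path gives the real part, add path the imaginary part.
            result = (rr - ii + cr, ri + ir + ci)
        if op is not None:
            (ar, ai), (br, bi), acc = op
            self.stage1 = (ar * br, ai * bi, ar * bi, ai * br, acc)
        else:
            self.stage1 = None
        return result


pipe = ComplexMacPipeline()
ops = [
    ((1, 2), (3, 4), (0, 0)),
    ((2, -1), (1, 1), (5, 5)),
    ((0, 1), (0, 1), (1, 0)),
    None,  # bubble to drain the last result
]
print([pipe.clock(op) for op in ops])
```

A new operation enters on every cycle while the previous one finishes, so after the first result appears the unit delivers one result per cycle. That is one way to read "tightly pipelineable" alongside a two-cycle completion time.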

35. An apparatus for the efficient processing of an FFT, the apparatus comprising:
- at least one controller sequence processor (SP);
  
  a plurality of processing elements (PEs) arranged in an N×N array interconnected in a manifold (ManArray) interconnection network; and
  
  a memory for storing instructions to be processed by the SP and by the array of PEs.

Current Assignee
Altera Corporation (Intel Corporation)
Original Assignee
PTS Corporation
Inventors
PITSIANIS, NIKOS P., PECHANEK, GERALD G., RODRIGUEZ, RICARDO E.

Granted Patent

US 6,839,728 B2
US Class Current

708/622
CPC Class Codes

G06F 15/8023   Two dimensional arrays, e.g...

G06F 15/8038   Associative processors

G06F 15/82   data or demand driven

G06F 17/142   Fast Fourier transforms, e....

G06F 9/30014   with variable precision

G06F 9/30032   Movement instructions, e.g....

G06F 9/30036   Instructions to perform ope...

G06F 9/3853   of compound instructions

G06F 9/3885   using a plurality of indepe...
