Pipelined linear array of processor elements for performing matrix computations
First Claim
1. An apparatus comprising at least one digital processor element, the digital processor element comprising:
- arithmetic circuitry operative to multiply real and imaginary parts of at least two complex numbers by real and imaginary parts of at least another two complex numbers thereby forming at least sixteen partial products, and to form one or more additive combinations of the partial products, each of the additive combinations representing a real or imaginary number; and
a register file having at least a first port and a second port, each of the first and second ports being capable of reading two complex words or writing two complex words to or from the register file, the ports being coupled to the arithmetic circuitry for supplying the complex numbers thereto and receiving the real or imaginary numbers therefrom.
9 Assignments
0 Petitions
Accused Products
Abstract
A pipelined linear array of processor elements (PEs) for performing matrix computations in an efficient manner. The linear array generally includes a head PE and a set of regular PEs, the head PE being a functional superset of the regular PE, with interconnections between nearest neighbor PEs in the array and a feedback path from a non-neighbor regular PE back to the head PE. Each PE includes arithmetic circuitry for performing multiply, combine and accumulate operations, and a register file for storing inputs and outputs of the arithmetic circuitry. The head PE further includes a non-linear function generator. Each PE is pipelined such that the latency for an arithmetic operation to complete is a multiple of the period with which new operations can be initiated. A Very Large Instruction Word (VLIW) program or other type of program may be used to control the array. The array is particularly efficient at performing complex matrix operations, such as, e.g., the solution of a set of linear equations, matrix inversion, matrix-matrix multiplication, and computation of covariance and cross correlation.
125 Citations
36 Claims
-
1. An apparatus comprising at least one digital processor element, the digital processor element comprising:
-
arithmetic circuitry operative to multiply real and imaginary parts of at least two complex numbers by real and imaginary parts of at least another two complex numbers thereby forming at least sixteen partial products, and to form one or more additive combinations of the partial products, each of the additive combinations representing a real or imaginary number; and
a register file having at least a first port and a second port, each of the first and second ports being capable of reading two complex words or writing two complex words to or from the register file, the ports being coupled to the arithmetic circuitry for supplying the complex numbers thereto and receiving the real or imaginary numbers therefrom. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33)
bi-directional interconnections between each processor element and its neighbor allowing complex c and s values of the Givens rotation to be passed from the head element to the second processor element, the c and s values then to be passed from one processor element to the next processor element through the array;
a program including instruction words containing individual sub-instructions directed to each of the processor elements, each of the sub-instructions being applied to its associated processor element; and
a sequencer for successively applying the instruction words in time to control the functioning of the array, such that the program controls the array to perform the QR decomposition of the matrix.
-
-
14. The apparatus of claim 13 wherein elements of the matrix upon which QR decomposition is to be performed are input to the processor elements via external data input ports associated therewith.
-
15. The apparatus of claim 13 wherein elements of the matrix upon which QR decomposition is to be performed reside in the register files associated with the processor elements at the beginning of the computation.
-
16. The apparatus of claim 13 being further operative to solve a set of linear equations represented as a matrix equation, by first performing the QR decomposition of the matrix equation, where the resulting matrix equation in triangular form is stored in the register files of the processor elements of the array and where the apparatus then solves the matrix equation in triangular form by utilizing back substitution, the back substitution being accomplished by storing additional program instructions in an instruction memory and sequencing the additional instructions to control the functioning of the array.
-
17. The apparatus of claim 13 being further operative to find the inverse of a known matrix, beginning with an identity that the known matrix times its unknown inverse equals the identity matrix and representing this identity by a matrix representation comprising the known matrix with the identity matrix concatenated to it and performing QR decomposition on the matrix representation, wherein the resulting matrix equation in triangular form is stored in the register files of the processor elements of the array and wherein the apparatus then solves for the unknown inverse matrix by utilizing multiple back substitution operations, each of the back substitution operations being accomplished by storing additional program instructions in an instruction memory and sequencing the additional instructions to control the functioning of the array.
-
18. The apparatus of claim 11 being operative to multiply two input matrices, the apparatus storing the two input matrices to be multiplied together in the register file of a given processor element of the array, the register file of the given processor element being organized such that two by two complex sub-matrices of each of the two matrices to be multiplied are stored in pairs of word lines so that a pair of complex words representing either a row or a column of a the sub-matrix may be directed to a register file port, the apparatus further comprising:
-
a program including instruction words containing instructions to control the given processor element, the instructions being applied to the given processor element;
a sequencer for successively applying the instructions in time to control the functioning of the given processor element, such that the program sequentially places pairs of values from two rows of the first input matrix and pairs of values from two columns of the second input matrix at the inputs to the arithmetic circuitry of the processor element, the program controlling the given processor element to perform matrix-matrix multiplication by breaking the resulting product matrix into two by two complex sub-matrices and accumulating four complex elements of the resulting sub-matrix as all the cross products of a full two rows of the first input matrix and a full two columns of the second input matrix, the resulting product sub-matrix being stored in the register file.
-
-
19. The apparatus of claim 18 wherein the linear array of the processor elements are interconnected in a circular manner such that each processor element can pass complex values to its nearest neighbor in one direction.
-
20. The apparatus of claim 19 wherein storage of the two matrices to be multiplied together in the register file is organized such that each entire column of each of the input matrices is initially stored in the register file of a single processor element, the storage of all of the columns of both the input matrices being distributed among the processor elements of the array.
-
21. The apparatus of claim 11 wherein a subset of the processor elements in the linear array of the processor elements are interconnected in a circular manner such that each processor element in the subset can pass complex values to its nearest neighbor in one direction, the array including at least one other processor element not a member of the subset and connected to the rest of the array at only one point in the circle of the subset of processor elements, such that the other processor element can receive the same complex values that are being passed around the circle formed by the subset.
-
22. The apparatus of claim 21 wherein the linear array is configured to compute a covariance and a cross correlation of an input vector and input scalar, wherein the input vector is a sequence of complex numbers and the input scalar is a complex number that either already exists in the register files of the processor elements or is input through external inputs of the processor elements, each of the processor elements of the subset holding two complex elements of the input vector, the apparatus further comprising:
-
a program including instruction words containing sub-instructions to control each the processor element, the sub-instructions being applied to the corresponding processor elements; and
a sequencer for successively applying the instructions in time to control the functioning of the array in accordance with the program so as to leave a pair of the input vector elements stationary at one pair of inputs of the arithmetic circuitry of each processor element in the subset and leaving the input scalar stationary at an input of the arithmetic circuitry of the processor element not in the subset but attached to the circle, shift the pairs of input vector elements around the circle of the subset of processor elements, control the processor elements in the subset to multiply each the two stationary complex numbers with the pairs of complex numbers being shifted around the circle, and control the processor element not in the subset but attached to the circle to multiply the stationary complex number with the pairs of complex numbers being shifted around the circle, the resulting products being stored in the register file.
-
-
23. The apparatus of claim 11 further comprising:
-
a bus that can hold one or more complex numbers; and
a connection from the bus to an input port of the head processor element;
at least one of the regular processor elements containing a tri-state driver of the bus, the input of each the tri-state driver being driven by its processor element with complex words to pass to the head element, such that the array has a configurable circular connection of a subset of the processor elements, allowing different numbers of the processor elements to be members of the circular interconnection together with the head processor element.
-
-
24. The apparatus of claim 11 further comprising:
-
a bus that can hold one or more complex numbers; and
a connection from an output port of the head processor element to the bus;
at least one of the regular processor elements containing a multiplexer, one input of the multiplexer being driven by the bus and the other input of the multiplexer being driven by an output port of a neighboring processor element, the output of the multiplexer being selectable as values on either one of its inputs, at most one multiplexer being selectable for the input being driven by the bus and the output of the multiplexer being available to the processor element, such that the array effectively has a configurable circular connection of a subset of the processor elements, allowing different numbers of the processor elements to be members of the circular interconnection together with the head processor element.
-
-
25. The apparatus of claim 9 wherein at least one of the processor elements has an external port for inputting a complex number from other logic circuitry for subsequent use in a computation performed by the array.
-
26. The apparatus of claim 9 wherein at least one of the processor elements has an external port for outputting a resulting complex number so that it can be used by other logic circuitry.
-
27. The apparatus of claim 1 wherein the processing element further includes a non-linear function generator.
-
28. The apparatus of claim 27 wherein the non-linear function generator comprises:
-
a normalization circuit operative to scale a variable by shifting the variable into a fixed domain, the normalization logic outputting an exponent value indicating how much the variable was shifted;
a memory for storing tables of values of interpolation coefficients for at least one non-linear function; and
logic circuitry for taking only a selected number of the most significant bits of the variable and using the selected number of the most significant bits as an address to look up interpolation coefficients for the function in the table, such that the value of the function at the variable can be approximated by computing an interpolation using the interpolation coefficients.
-
-
29. The apparatus of claim 28 wherein the at least one non-linear function comprises at least one of an inverse square root function and a reciprocal function.
-
30. The apparatus of claim 28 wherein the logic circuitry is further operative to compute a square magnitude of up to two numbers, each of which may be complex, and for forming the sum of the square magnitudes.
-
31. The apparatus of claim 28 wherein the non-linear function generator further comprises one or more multipliers for forming a product of the non-linear function with real or complex numbers, and denormalization logic that scales the product based on the exponent by shifting.
-
32. The apparatus of claim 1 wherein the processor element further comprises:
-
a pipelined delay line operative to output the same sequence of digital words that appeared at its input at an earlier time, multiple members of the sequence traversing the delay line at once in various stages; and
a multiplexer that can select its output as either the output of the delay line or the undelayed input sequence of the delay line, the selection being controlled by an input to the processor element.
-
-
33. The apparatus of claim 32 wherein external control of the processor elements allows the selection of the output of the multiplexers to facilitate the performance of a plurality of computational algorithms on the linear array, a first subset of the algorithms requiring a delay in passing data from one processor element to the next in the array, and a second subset of the algorithms not requiring a delay in passing data from one processor element to the next in the array.
-
34. An apparatus comprising:
-
a plurality of digital processor elements arranged in a linear array, each of the processor elements in the linear array being coupled to at least one other processor element in the array, wherein at least one of the digital processor elements comprises;
arithmetic circuitry operative to multiply real and imaginary parts of at least two complex numbers by real and imaginary parts of at least another two complex numbers thereby forming at least sixteen partial products, and to form one or more additive combinations of the partial products, each of the additive combinations representing a real or imaginary number; and
a register file having at least a first port and a second port, each of the first and second ports being capable of reading two complex words or writing two complex words to or from the register file, the ports being coupled to the arithmetic circuitry for supplying the complex numbers thereto and receiving the real or imaginary numbers therefrom. - View Dependent Claims (35, 36)
-
Specification