Programmatic synthesis of processor element arrays
First Claim
Patent Images
1. A programmatic method for transforming a nested loop into a set of parallel processes for synthesis into a parallel array of processor elements, the method comprising:
- obtaining the nested loop and a performance requirement for executing the nested loop as parallel processes on the array of processor elements, where the nested loop has a loop body with one or more operations;
based on a specified performance requirement, programmatically transforming the nested loop into parallel processes for synthesis into an array of processor elements, where each of the parallel processes corresponds to a set of iterations of the loop body, expressed as a single loop mapped to a processor element, and where each iteration is assigned a start time to initiate execution on the processor element; and
generating code representing the transformed nested loop as a two dimensional loop in which a first dimension iterates over time and a second dimension iterates over processor elements.
2 Assignments
0 Petitions
Accused Products
Abstract
A programmatic method transforms a nested loop in a high level programming language into a set of parallel processes, each a single time loop, such that the parallel processes satisfy a specified design constraint. Another programmatic method synthesizes a processor array from the set of parallel processes and a specified design constraint.
403 Citations
27 Claims
-
1. A programmatic method for transforming a nested loop into a set of parallel processes for synthesis into a parallel array of processor elements, the method comprising:
-
obtaining the nested loop and a performance requirement for executing the nested loop as parallel processes on the array of processor elements, where the nested loop has a loop body with one or more operations;
based on a specified performance requirement, programmatically transforming the nested loop into parallel processes for synthesis into an array of processor elements, where each of the parallel processes corresponds to a set of iterations of the loop body, expressed as a single loop mapped to a processor element, and where each iteration is assigned a start time to initiate execution on the processor element; and
generating code representing the transformed nested loop as a two dimensional loop in which a first dimension iterates over time and a second dimension iterates over processor elements. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14)
performing a data dependence analysis on operations in the loop body to determine data dependence constraints between iterations, the data dependence constraints including a delay constraint of a recurrence cycle; and
based on the data dependency constraints, determining the iteration schedule such that the iteration schedule satisfies the dependence constraints.
-
-
4. The method of claim 2 further including:
-
transforming the nested loop from iteration space coordinates to time—
virtual processor coordinates, where the nested loop is expressed in a two dimensional loop in which a first dimension iterates over time and a second dimension iterates over virtual processors; and
transforming the nested loop from time—
virtual processor coordinates to time—
physical processor element coordinates, where the nested loop is expressed in a two dimensional loop in which the first dimension iterates over time and the second dimension iterates over physical processor elements, in transforming the nested loop to time—
physical processor coordinates, assigning clusters of virtual processors to physical processors.
-
-
5. The method of claim 1 further including:
transforming the loop body into a dynamic single assignment form by converting an array in the loop body into a uniformized array.
-
6. The method of claim 2 including:
programmatically partitioning the nested loop into sets of iterations to be initiated sequentially so as to minimize local storage of the processor elements while satisfying the memory bandwidth constraint.
-
7. The method of claim 2 wherein programmatically transforming the nested loop includes:
determining an iteration schedule for iterations of the nested loop such that the schedule provides start times for initiating execution of the iterations on a processor element such that no more than one iteration is started on a processor element for each initiation interval.
-
8. The method of claim 5 further including:
generating code to represent the array in the dynamic single assignment form, along with a specification of registers that implement the array in local storage of each processor element.
-
9. The method of claim 5 including:
programmatically transforming the loop nest so as to eliminate accesses to or from global memory from the processor elements, except as necessary at loop boundaries.
-
10. The method of claim 1 including:
generating code to test a loop index against a loop bound index vector of the nested loop locally on a processor element using a previous value of the test computed in a previous iteration on the same processor element.
-
11. The method of claim 1 generating code of a conditional statement determining whether an iteration is at a loop boundary, where the loop boundary is a tile, cluster, or iteration space boundary.
-
12. The method of claim 11 including:
generating code to compute a boolean variable of the conditional statement from a previous value of the boolean variable at a previous iteration in the nested loop.
-
13. The method of claim 11 including:
-
converting operations in the loop body that are dependent on the conditional statement into predicated code; and
synthesizing functional units for the processor elements that support predicated execution of the predicated code.
-
-
14. A computer readable medium on which is stored software, which, when executed by a computer, performs the method of claim 1.
-
15. A programmatic method for synthesizing a set of parallel processes, each comprising a single loop in parallel form, into a parallel array of processor elements, the method comprising:
-
synthesizing a structural representation of a data path for a processor element that executes the single loop based on operations in the single loop;
scheduling the operations in the single loop for execution in the data path so as to satisfy a specified processor cost or performance constraint;
programmatically synthesizing an interconnect between functional units and local storage in the data path for each processor element; and
programmatically replicating and interconnecting the processor elements into a parallel array that satisfies the processor cost or performance constraint. - View Dependent Claims (16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26)
transforming code of the loop such that when synthesized into a structural description of a processor element, the structural processor element has a minimized cost measured in chip area, complexity of interconnect, number of hardware elements or power consumption.
-
-
17. The method of claim 15 including:
transforming code of the loop such that when synthesized into a structural description of a processor element, the structural processor element has an optimized performance measured in execution time.
-
18. The method of claim 15 wherein the processor elements are programmable processor elements.
-
19. The method of claim 15 wherein the processor elements are non-programmable processor elements.
-
20. The method of claim 15 including:
converting conditional statements in the code to predicated code.
-
21. The method of claim 20 including:
synthesizing the predicated code into the structural representation of the data path and the interconnect of a processor element.
-
22. The method of claim 15 including:
-
performing one or more of the following code transformations before synthesizing the code into a processor element;
dead code elimination, copy propagation/elimination, loop invariant removal, partial redundancy elimination, and common sub-expression elimination, strength reduction, constant propagation and folding, critical path reduction and speculative code motion;
wherein the code transformation transforms the code so as to optimize the synthesized processor element.
-
-
23. The method of claim 22 wherein one or more of the code transformations are applied to predicated code in which conditional statements are converted to predicates.
-
24. The method of claim 22 wherein one or more of the code transformations are applied to code including expanded virtual registers representing uniformized arrays.
-
25. A computer readable media on which is stored software, which when executed by a computer, performs the method of claim 15.
-
26. A computer readable media on which is stored software, which when executed by a computer, performs the method of claim 15.
-
27. A processor element array comprising:
-
an array of processor elements, each having one or more functional units for executing operations, one or more registers for storing inputs and outputs of the operations, and an interconnect for connecting registers to functional units;
the functional units supporting predicated execution of the operations and producing output for the operations as predicate-data pairs; and
the interconnect using the predicates to control transfer of data from the functional units to corresponding registers.
-
Specification