Programmatic method for reducing cost of control in parallel processes
First Claim
1. In a process of transforming a nested loop having an iteration space defined by loop indices into a single loop for execution on each processor element in an array of parallel processors, a method for optimizing code in the single loop comprising:
- obtaining a mapping of iterations of the nested loop to processor elements in the array and a schedule of start times for initiating execution of the iterations on corresponding processor elements; and
from the mapping of iterations and the schedule of start times, generating code to compute iteration coordinates on a processor element for an iteration of the single loop based on values of the iteration coordinates for a previous iteration of the single loop on the same processor element.
3 Assignments
0 Petitions
Accused Products
Abstract
A parallel compiler exploits temporal recursion to reduce the cost of control code generated in transforming a sequential nested loop program into a set of parallel processes mapped to an array of processors. A parallel compiler process transforms a nested loop program into a set of single loops, where each single loop is assigned to execute on a processor element in a parallel processor array. The parallel compiler obtains a mapping of iterations of the nested loop to processor elements in the array and a schedule of start times for initiating execution of the iterations on corresponding processor elements in the array. Based on this mapping and iteration schedule, the parallel compiler generates code to compute iteration coordinates on a processor element for an iteration of the single loop from iteration coordinates computed on the same processor element for a previous iteration of the single loop. The parallel compiler uses this method to generate code to compute loop indices, memory addresses, and tests of loop bounds efficiently based on values from a previous iteration.
-
Citations
20 Claims
-
1. In a process of transforming a nested loop having an iteration space defined by loop indices into a single loop for execution on each processor element in an array of parallel processors, a method for optimizing code in the single loop comprising:
-
obtaining a mapping of iterations of the nested loop to processor elements in the array and a schedule of start times for initiating execution of the iterations on corresponding processor elements; and
from the mapping of iterations and the schedule of start times, generating code to compute iteration coordinates on a processor element for an iteration of the single loop based on values of the iteration coordinates for a previous iteration of the single loop on the same processor element. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19)
iterations are mapped to a virtual processor-null direction space, where virtual processors in the virtual processor space each have a corresponding set of iterations, virtual processors are mapped to processor elements such that a cluster of virtual processors is assigned to each processor element, and the iteration coordinates comprise local coordinates of a virtual processor in a specified cluster.
-
-
3. The method of claim 1 including:
generating code to compute a quantity that is linearly dependent on the iteration coordinates using a corresponding quantity computed on the same processor element for a previous iteration of the single loop.
-
4. The method of claim 3 wherein iterations in the iteration space are mapped to a virtual processor, where each virtual processor is assigned to a set of iterations, and each iteration in the set is assigned a start time of execution, and the quantity comprises a virtual processor coordinate in the virtual processor space.
-
5. The method of claim 1 including:
generating code to compute a value of a predicate used to evaluate a loop boundary condition on a processor element for an iteration of the single loop from a previous value of the predicate computed on the same processor element for a previous iteration of the single loop.
-
6. The method of claim 5 wherein the loop boundary condition includes a test indicating whether the iteration coordinates are within the iteration space to determine whether there is an iteration scheduled for a processor element at a specified time.
-
7. The method of claim 5 wherein the iteration space is partitioned into tiles of iterations that are initiated sequentially and the loop boundary condition includes a test indicating whether an iteration is at a tile boundary.
-
8. The method of claim 5 wherein:
-
iterations are mapped to a virtual processor-null direction space, where virtual processors in the virtual processor space each have a corresponding set of iterations, virtual processors are mapped to processor elements such that a cluster of virtual processors is assigned to each processor element, and the loop condition includes a test indicating whether an iteration is at a cluster boundary, the cluster boundary being defined as iterations at an edge of the cluster shape, and the cluster shape being defined by the mapping of virtual processors to the processor elements.
-
-
9. The method of claim 1 including:
generating code to compute a memory address for an array element in an operation within the loop from a previous value of the memory address computed on the same processor element for a previous iteration of the single loop.
-
10. The method of claim 1 wherein the schedule provides start times for initiating execution of the iterations on a processor element such that no more than one iteration is started on a processor element for each initiation interval.
-
11. The method of claim 1 wherein:
-
iterations in the iteration space are mapped to a virtual processor space, where each virtual processor is assigned to a set of iterations, and each iteration associated with a virtual processor is assigned a time of initiation, a cluster of virtual processors is mapped to each processor element, and the frequency with which an iteration associated with a virtual processor in the cluster is initiated on the physical processor of the cluster is periodic;
the method further including;
generating code to buffer data computed for an iteration so that the data is propagated to a subsequent iteration on the processor for use in calculating the iteration coordinates or values that are linearly dependent on the iteration coordinates.
-
-
12. The method of claim 1 including:
-
generating code representing a decision tree that implements the computation of the iteration coordinates from values of the iteration coordinates on the same processor at an earlier time;
wherein the decision tree is a binary tree, the binary tree has a depth equal to a number of dimensions of a cluster having a dimension more than one, a test at each internal node of the decision tree compares one cluster coordinate to a constant, and the leaves of the tree specify for a current iteration a change in iteration coordinates relative to previous iteration coordinates calculated in the same processor.
-
-
13. The method of claim 12 wherein the changes in iteration coordinates specified in the leaves of the tree are used to compute linearly related quantities to the iteration coordinates using data propagated from a previous iteration on the processor element.
-
14. The method of claim 13 wherein the linearly related quantities include array indices of a variable in the loop body.
-
15. The method of claim 13 wherein the linearly related quantities include memory addresses of a variable stored external to local memory of the processor element.
-
16. The method of claim 1 wherein:
-
iterations in the iteration space are mapped to a virtual processor space, where each virtual processor is assigned to a set of iterations, and each iteration associated with a virtual processor is assigned a time of initiation, a cluster of virtual processors is mapped to each processor element, and the frequency with which an iteration associated with a virtual processor in the cluster is initiated on the physical processor of the cluster is periodic;
the method further including;
generating code to buffer data that is periodic so that periodic a periodic quantity is propagated to a subsequent iteration for re-use in the subsequent iteration.
-
-
17. The method of claim 16 wherein the periodic quantity is boolean value representing a test of a loop boundary condition.
-
18. The method of claim 16 wherein the periodic quantity is an iteration coordinate.
-
19. A computer readable medium on which is stored software for performing the method of claim 1.
-
20. In a parallel compiler for transforming a nested loop having an iteration space defined by loop indices into a single loop for execution on each processor element in an array of parallel processors, a compiler system for optimizing code in the single loop comprising:
-
means for accessing a data structure representing the mapping of iterations of the nested loop to processor elements in the array and a data structure representing a schedule of start times for initiating execution of the iterations on corresponding processor elements; and
means for generating code to compute iteration coordinates on a processor element for an iteration of the single loop from iteration coordinates computed on the same processor element for a previous iteration of the single loop from the mapping of iterations and the schedule of start times.
-
Specification