OPTIMIZING DATA PARTITIONING FOR DATA-PARALLEL COMPUTING
First Claim
1. A system for optimizing data partitioning for a distributed execution engine, the system comprising:
- a code/EPG analysis module for deriving properties of a data-parallel program code in each vertex in a corresponding execution plan graph (EPG) compiled from the data-parallel program code;
a complexity module for at least deriving the computational complexity of each vertex in the EPG;
a data analysis module for generating a plurality of compact data representations corresponding to an input data for processing by the data-parallel program code;
a statistics and samples module for determining the relationship between input data size versus computational and input-output (I/O) costs;
a cost modeling and estimation module for estimating the runtime cost of each vertex in the EPG and the overall runtime cost represented by the EPG; and
a cost optimization module for determining a data partitioning plan.
2 Assignments
0 Petitions
Accused Products
Abstract
A data partitioning plan is automatically generated that—given a data-parallel program and a large input dataset, and without having to first run the program on the input dataset—substantially optimizes performance of the distributed execution system that explicitly measures and infers various properties of both data and computation to perform cost estimation and optimization. Estimation may comprise inferring the cost of a candidate data partitioning plan, and optimization may comprise generating an optimal partitioning plan based on the estimated costs of computation and input/output.
-
Citations
20 Claims
-
1. A system for optimizing data partitioning for a distributed execution engine, the system comprising:
-
a code/EPG analysis module for deriving properties of a data-parallel program code in each vertex in a corresponding execution plan graph (EPG) compiled from the data-parallel program code; a complexity module for at least deriving the computational complexity of each vertex in the EPG; a data analysis module for generating a plurality of compact data representations corresponding to an input data for processing by the data-parallel program code; a statistics and samples module for determining the relationship between input data size versus computational and input-output (I/O) costs; a cost modeling and estimation module for estimating the runtime cost of each vertex in the EPG and the overall runtime cost represented by the EPG; and a cost optimization module for determining a data partitioning plan. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
-
-
12. A method for optimizing data partitioning for a distributed execution engine, the method comprising:
-
determining a plurality of parts of a data-parallel program code corresponding to each vertex in a corresponding execution plan graph (EPG), the EPG comprising a plurality of vertices corresponding to a plurality of initial data partitions; deriving a computational complexity for each vertex from among the plurality of vertices in the EPG; determining a plurality of relationships between a plurality of input data and a plurality of execution costs; estimating a runtime cost for each vertex from among the plurality of vertices in the EPG; and estimating the overall runtime cost represented by the EPG. - View Dependent Claims (13, 14, 15, 16, 17)
-
-
18. A computer-readable medium comprising computer-readable instructions for optimizing data partitioning for a distributed execution engine, the computer-readable instructions comprising instructions that cause a processor to:
-
analyze a data-parallel program code and its corresponding execution plan graph (EPG); analyze the input data and a plurality of corresponding initial data partitions in view of the results of analyzing the data-parallel program code and the EPG; estimate a runtime cost for each vertex from among a plurality of vertices comprising the EPG; determine an improved data partitioning plan and update the EPG accordingly; and repeat the estimate and the determine until an optimized EPG is found. - View Dependent Claims (19, 20)
-
Specification