Stage-aware performance modeling for computer cluster sizing
Abstract
A method, apparatus, and computer program product for configuring a computer cluster. Job information identifying a data processing job to be performed is received by a processor unit. The data processing job to be performed comprises a plurality of stages. Cluster information identifying a candidate computer cluster is also received by the processor unit. The processor unit identifies stage performance models for modeled stages that are similar to the plurality of stages. The processor unit predicts stage performance times for performing the plurality of stages on the candidate computer cluster using the stage performance models, and combines the predicted stage performance times for the plurality of stages to determine a predicted job performance time. The predicted job performance time may be used to configure the computer cluster.
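The flow described in the abstract (predict a time per stage, combine the stage times into a job time, then use the job time to choose a cluster configuration) can be sketched as follows. This is only an illustrative sketch, not the patented implementation: the names `Cluster`, `wave_model`, `predict_job_time`, and `smallest_cluster_meeting` are hypothetical, the "tasks run in waves of one task per core" model is an assumed stand-in for a real stage performance model, and combining by summation assumes the stages run one after another.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Cluster:
    """Hypothetical description of a candidate cluster."""
    nodes: int
    cores_per_node: int


# A stage performance model maps (number of tasks, cluster) to predicted seconds.
StageModel = Callable[[int, Cluster], float]


def wave_model(seconds_per_task: float) -> StageModel:
    """Assumed toy model: tasks execute in waves, one task per core."""
    def predict(num_tasks: int, cluster: Cluster) -> float:
        slots = cluster.nodes * cluster.cores_per_node
        waves = -(-num_tasks // slots)  # ceiling division
        return waves * seconds_per_task
    return predict


def predict_job_time(stages: List[Tuple[int, StageModel]], cluster: Cluster) -> float:
    """Combine per-stage predictions; summation assumes sequential stages."""
    return sum(model(num_tasks, cluster) for num_tasks, model in stages)


def smallest_cluster_meeting(
    stages: List[Tuple[int, StageModel]],
    candidates: List[Cluster],
    deadline_seconds: float,
) -> Optional[Cluster]:
    """Return the candidate with the fewest cores whose predicted time meets the deadline."""
    for cluster in sorted(candidates, key=lambda c: c.nodes * c.cores_per_node):
        if predict_job_time(stages, cluster) <= deadline_seconds:
            return cluster
    return None


stages = [(100, wave_model(2.0)), (10, wave_model(5.0))]
small, large = Cluster(nodes=2, cores_per_node=4), Cluster(nodes=4, cores_per_node=8)
chosen = smallest_cluster_meeting(stages, [small, large], deadline_seconds=20.0)
```

Here `chosen` is the larger candidate, because the smaller one is predicted to miss the assumed 20-second deadline. A job whose stages overlap or form a DAG would need a different combining rule, such as a critical-path computation.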
20 Claims
1. A computer-implemented method of configuring a computer cluster, the computer-implemented method comprising:
receiving, by a processor unit, job information identifying a data processing job to be performed, wherein the data processing job to be performed comprises a plurality of stages, and wherein the job information defines characteristics of the plurality of stages that include number of tasks, resource profile, data access pattern, output selectivity, amount of shuffle, resource consumption dynamicity, and data set content sensitivity of respective stages in the plurality of stages of the data processing job;
receiving, by the processor unit, cluster information identifying a candidate computer cluster;
identifying, by the processor unit, stage performance models corresponding to modeled stages having similar characteristics to the characteristics of the plurality of stages that include the number of tasks, resource profile, data access pattern, output selectivity, amount of shuffle, resource consumption dynamicity, and data set content sensitivity of the respective stages in the plurality of stages of the data processing job;
predicting, by the processor unit, stage performance times for performing the plurality of stages on the candidate computer cluster using the stage performance models;
combining, by the processor unit, the predicted stage performance times to determine a predicted job performance time;
using, by the processor unit, the predicted job performance time to configure the candidate computer cluster to perform the data processing job; and
performing, by the candidate computer cluster, the data processing job.
Dependent claims: 2, 3, 4, 5, 6, 7
8. A computer system for configuring a computer cluster, the computer system comprising:
a bus system;
a storage device connected to the bus system, wherein the storage device stores computer readable program code; and
a processor unit connected to the bus system, wherein the processor unit executes the computer readable program code to:
receive job information identifying a data processing job to be performed and cluster information identifying a candidate computer cluster, wherein the data processing job to be performed comprises a plurality of stages, and wherein the job information defines characteristics of the plurality of stages that include the number of tasks, resource profile, data access pattern, output selectivity, amount of shuffle, resource consumption dynamicity, and data set content sensitivity of respective stages in the plurality of stages of the data processing job;
identify stage performance models corresponding to modeled stages having similar characteristics to the characteristics of the plurality of stages that include the number of tasks, resource profile, data access pattern, output selectivity, amount of shuffle, resource consumption dynamicity, and data set content sensitivity of the respective stages in the plurality of stages of the data processing job;
predict stage performance times for performing the plurality of stages on the candidate computer cluster using the stage performance models and combine the predicted stage performance times for the plurality of stages to determine a predicted job performance time;
use the predicted job performance time to configure the candidate computer cluster to perform the data processing job; and
perform the data processing job.
Dependent claims: 9, 10, 11, 12, 13, 14
15. A computer program product for configuring a computer cluster, the computer program product comprising a computer readable storage medium having program instructions embodied therewith, the program instructions executable by a computer to cause the computer to perform a method comprising:
receiving job information identifying a data processing job to be performed, wherein the data processing job to be performed comprises a plurality of stages, and wherein the job information defines characteristics of the plurality of stages that include number of tasks, resource profile, data access pattern, output selectivity, amount of shuffle, resource consumption dynamicity, and data set content sensitivity of respective stages in the plurality of stages of the data processing job;
receiving cluster information identifying a candidate computer cluster;
identifying stage performance models corresponding to modeled stages having similar characteristics to the characteristics of the plurality of stages that include the number of tasks, resource profile, data access pattern, output selectivity, amount of shuffle, resource consumption dynamicity, and data set content sensitivity of the respective stages in the plurality of stages of the data processing job;
predicting stage performance times for performing the plurality of stages on the candidate computer cluster using the stage performance models;
combining the predicted stage performance times for the plurality of stages to determine a predicted job performance time;
using the predicted job performance time to configure the candidate computer cluster to perform the data processing job; and
performing the data processing job.
Dependent claims: 16, 17, 18, 19, 20
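As an informal illustration of the "similar characteristics" step recited in the independent claims, the seven stage characteristics can be pictured as a record, and the identification of a stage performance model as a nearest-match lookup against previously modeled stages. The claims do not specify a similarity measure. The `StageCharacteristics` record, its field encodings, the equal-weight `distance` function, and the `most_similar` helper below are all assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class StageCharacteristics:
    """The seven characteristics named in the claims; encodings are assumed."""
    num_tasks: int
    resource_profile: str        # e.g. "cpu", "io", "memory" (assumed labels)
    data_access_pattern: str     # e.g. "sequential", "random" (assumed labels)
    output_selectivity: float    # assumed: output size / input size
    shuffle_bytes: float         # assumed: amount of shuffle in bytes
    dynamic_resource_use: bool   # resource consumption dynamicity
    content_sensitive: bool      # data set content sensitivity


def _relative_gap(x: float, y: float) -> float:
    """Scale-free difference in [0, 1] for non-negative numeric features."""
    largest = max(abs(x), abs(y))
    return 0.0 if largest == 0 else abs(x - y) / largest


def distance(a: StageCharacteristics, b: StageCharacteristics) -> float:
    """Assumed equal-weight distance: numeric gaps plus categorical mismatches."""
    return (
        _relative_gap(a.num_tasks, b.num_tasks)
        + _relative_gap(a.output_selectivity, b.output_selectivity)
        + _relative_gap(a.shuffle_bytes, b.shuffle_bytes)
        + float(a.resource_profile != b.resource_profile)
        + float(a.data_access_pattern != b.data_access_pattern)
        + float(a.dynamic_resource_use != b.dynamic_resource_use)
        + float(a.content_sensitive != b.content_sensitive)
    )


def most_similar(stage: StageCharacteristics,
                 modeled: Dict[str, StageCharacteristics]) -> str:
    """Return the name of the modeled stage closest to the given stage."""
    return min(modeled, key=lambda name: distance(stage, modeled[name]))


modeled_stages = {
    "scan": StageCharacteristics(1000, "io", "sequential", 0.1, 0.0, False, False),
    "join": StageCharacteristics(200, "cpu", "random", 1.5, 5e9, True, True),
}
new_stage = StageCharacteristics(180, "cpu", "random", 1.2, 4e9, True, False)
match = most_similar(new_stage, modeled_stages)
```

In this toy library, `match` is `"join"`, so the performance model stored for the modeled join stage would be the one used to predict the new stage. A practical system would likely weight the characteristics differently or learn the similarity measure from measured runs.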
Specification