Computing cluster with latency control
First Claim
1. A method implemented by one or more processing units, the method comprising:
- analyzing a structure of a job to determine;
dependencies among tasks of the job, the dependencies including one task using output from another task, anda critical path of the job;
building a model of execution time of the job by a computing cluster, the model based on the dependencies among the tasks and a length of the critical path, the model relating aggregate time spent to complete the tasks of the job to resources allocated to the job; and
at a plurality of times;
assessing processing remaining on the job,updating the model based on the processing remaining, the updating including considering the dependencies, the critical path, and execution failures of individual tasks of the job, andadjusting the resources allocated to the job based on the updated model and a utility function.
2 Assignments
0 Petitions
Accused Products
Abstract
A computing cluster operated according to a resource allocation policy based on a predictive model of completion time. The predictive model may be applied in a resource control loop that iteratively updates resources assigned to an executing job. At each iteration, the amount of resources allocated to the job may be updated based on of the predictive model so that the job will be scheduled to complete execution at a target completion time. The target completion time may be derived from a utility function determined for the job. The utility function, in turn, may be derived from a service level agreement with service guarantees and penalties for late completion of a job. Allocating resources in this way may maximize utility for an operator of the computing cluster while minimizing disruption to other jobs that may be concurrently executing.
-
Citations
20 Claims
-
1. A method implemented by one or more processing units, the method comprising:
-
analyzing a structure of a job to determine; dependencies among tasks of the job, the dependencies including one task using output from another task, and a critical path of the job; building a model of execution time of the job by a computing cluster, the model based on the dependencies among the tasks and a length of the critical path, the model relating aggregate time spent to complete the tasks of the job to resources allocated to the job; and at a plurality of times; assessing processing remaining on the job, updating the model based on the processing remaining, the updating including considering the dependencies, the critical path, and execution failures of individual tasks of the job, and adjusting the resources allocated to the job based on the updated model and a utility function. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8)
-
-
9. At least one computer-readable memory device or storage device storing computer-executable instructions that, when executed by one or more processing devices, cause the one or more processing devices to perform acts comprising:
-
building a model of execution of a job by a computing cluster, the building comprising; analyzing a structure of the job to determine dependencies among tasks of the job, a completion time of the job, and a length of a critical path of the job, simulating task failures in execution of the job, and relating aggregate time spent to complete the tasks of the job to an allocation of resources of the computing cluster for the execution of the job; updating the model based on progress in an actual execution of the job, the updating including considering the dependencies, the critical path, and actual execution failures of the tasks of the job; and dynamically allocating the resources of the computing cluster to the job based on the updated model. - View Dependent Claims (10, 11, 12, 13, 14, 15)
-
-
16. A system comprising:
-
a plurality of processing units; and a scheduler configured to execute on the plurality of processing units and cause the plurality of processing units to; receive a job for execution, the job comprising tasks, construct a model of the execution of the job based on a directed graph representing dependencies among the tasks of the job and a length of a critical path of the job, the model relating aggregate time spent to complete the tasks to computing resources allocated to the job; updating the model based on assessed progress of the execution of the job, the updating considering processing time associated with failures during the execution of some of the tasks, and dynamically allocate the computing resources based on the updated model. - View Dependent Claims (17, 18, 19, 20)
-
Specification