System and method for proactive task scheduling of a copy of outlier task in a computing environment
First Claim
1. A system comprising:
- one or more processing devices; and
one or more computer-readable storage media storing instructions which, when executed by the one or more processing devices, configure the one or more processing devices to;
identify an outlier task from a plurality of tasks of a phase of a job based on corresponding runtimes of the plurality of tasks, the outlier task being identified while the outlier task is executing and taking longer to complete than other tasks from the phase of the job, wherein the plurality of tasks share the same code;
make a determination whether the outlier task has more input data to process than the other tasks of the phase of the job;
in a first instance when the determination is that the outlier task has more input data to process than the other tasks of the phase of the job, continue to let the outlier task execute without scheduling a copy of the outlier task responsive to the determination; and
in a second instance when the determination is that the outlier task does not have more input data to process than the other tasks of the phase of the jobcompare an estimated remaining time for the outlier task to complete to an estimated time for the copy of the outlier task to complete, andwhen the estimated time for the copy of the outlier task to complete is less than the estimated remaining time for the outlier task to complete, schedule the copy of the outlier task.
2 Assignments
0 Petitions
Accused Products
Abstract
The described implementations relate to distributed computing. One implementation provides a system that can include an outlier detection component that is configured to identify an outlier task from a plurality of tasks based on runtimes of the plurality of tasks. The system can also include a cause evaluation component that is configured to evaluate a cause of the outlier task. For example, the cause of the outlier task can be an amount of data processed by the outlier task, contention for resources used to execute the outlier task, or a communication link with congested bandwidth that is used by the outlier task to input or output data. The system can also include one or more processing devices configured to execute one or more of the components.
33 Citations
20 Claims
-
1. A system comprising:
-
one or more processing devices; and one or more computer-readable storage media storing instructions which, when executed by the one or more processing devices, configure the one or more processing devices to; identify an outlier task from a plurality of tasks of a phase of a job based on corresponding runtimes of the plurality of tasks, the outlier task being identified while the outlier task is executing and taking longer to complete than other tasks from the phase of the job, wherein the plurality of tasks share the same code; make a determination whether the outlier task has more input data to process than the other tasks of the phase of the job; in a first instance when the determination is that the outlier task has more input data to process than the other tasks of the phase of the job, continue to let the outlier task execute without scheduling a copy of the outlier task responsive to the determination; and in a second instance when the determination is that the outlier task does not have more input data to process than the other tasks of the phase of the job compare an estimated remaining time for the outlier task to complete to an estimated time for the copy of the outlier task to complete, and when the estimated time for the copy of the outlier task to complete is less than the estimated remaining time for the outlier task to complete, schedule the copy of the outlier task. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
-
-
13. A system comprising:
-
one or more processing devices; and one or more computer-readable storage devices comprising instructions which, when executed by one or more processing devices, cause the one or more processing devices to; monitor execution of a plurality of tasks associated with a job, the plurality of tasks comprising an individual task that is processing input data; after the individual task has already processed some of the input data and is continuing to process remaining input data; determine a rate at which the individual task is processing the input data and an amount of the remaining input data that the individual task has yet to process, determine an estimated remaining time for the individual task to complete based on the rate at which the individual task is processing the input data and the amount of the remaining input data, determine a predicted completion time for a new copy of the individual task, determine an estimated probability that the new copy of the individual task will complete sooner than the individual task based on the estimated remaining time for the individual task to complete and the predicted completion time for the new copy of the task, and while the individual task continues executing, schedule the new copy of the individual task that is currently executing when the estimated probability that the new copy of the individual task will complete sooner than the individual task exceeds a threshold. - View Dependent Claims (14, 15, 16)
-
-
17. A method performed by at least one computing device, the method comprising:
-
determining an estimated probability that output data of a completed task will be lost due to a fault on a server that executed the completed task, wherein the completed task has processed input data to obtain the output data; determining an estimated time to repeat the completed task based on an amount of the input data that was processed by the completed task; determining a cost to recompute the completed task based on both the estimated probability that the output data of the completed task will be lost and the estimated time to repeat the completed task; determining another estimated time to replicate the output data of the completed task by transferring the output data to another server; comparing the another estimated time to replicate the output data to the cost to recompute the completed task to determine whether to replicate the output data on the another server; and replicating the output data on the another server when the another estimated time to replicate the output data is less than the cost to recompute the completed task. - View Dependent Claims (18, 19, 20)
-
Specification