System and method for proactive task scheduling of a copy of outlier task in a computing environment

US 9,307,048 B2
Filed: 12/28/2010
Issued: 04/05/2016
Est. Priority Date: 12/28/2010
Status: Active Grant

First Claim

Patent Images

1. A system comprising:

one or more processing devices; and

one or more computer-readable storage media storing instructions which, when executed by the one or more processing devices, configure the one or more processing devices to;

identify an outlier task from a plurality of tasks of a phase of a job based on corresponding runtimes of the plurality of tasks, the outlier task being identified while the outlier task is executing and taking longer to complete than other tasks from the phase of the job, wherein the plurality of tasks share the same code;

make a determination whether the outlier task has more input data to process than the other tasks of the phase of the job;

in a first instance when the determination is that the outlier task has more input data to process than the other tasks of the phase of the job, continue to let the outlier task execute without scheduling a copy of the outlier task responsive to the determination; and

in a second instance when the determination is that the outlier task does not have more input data to process than the other tasks of the phase of the jobcompare an estimated remaining time for the outlier task to complete to an estimated time for the copy of the outlier task to complete, andwhen the estimated time for the copy of the outlier task to complete is less than the estimated remaining time for the outlier task to complete, schedule the copy of the outlier task.

View all claims

2 Assignments

Timeline View

Assignment View

0 Petitions

Accused Products

Abstract

The described implementations relate to distributed computing. One implementation provides a system that can include an outlier detection component that is configured to identify an outlier task from a plurality of tasks based on runtimes of the plurality of tasks. The system can also include a cause evaluation component that is configured to evaluate a cause of the outlier task. For example, the cause of the outlier task can be an amount of data processed by the outlier task, contention for resources used to execute the outlier task, or a communication link with congested bandwidth that is used by the outlier task to input or output data. The system can also include one or more processing devices configured to execute one or more of the components.

33 Citations

View as Search Results

20 Claims

1. A system comprising:
- one or more processing devices; and
  
  one or more computer-readable storage media storing instructions which, when executed by the one or more processing devices, configure the one or more processing devices to;
  
  identify an outlier task from a plurality of tasks of a phase of a job based on corresponding runtimes of the plurality of tasks, the outlier task being identified while the outlier task is executing and taking longer to complete than other tasks from the phase of the job, wherein the plurality of tasks share the same code;
  
  make a determination whether the outlier task has more input data to process than the other tasks of the phase of the job;
  
  in a first instance when the determination is that the outlier task has more input data to process than the other tasks of the phase of the job, continue to let the outlier task execute without scheduling a copy of the outlier task responsive to the determination; and
  
  in a second instance when the determination is that the outlier task does not have more input data to process than the other tasks of the phase of the jobcompare an estimated remaining time for the outlier task to complete to an estimated time for the copy of the outlier task to complete, andwhen the estimated time for the copy of the outlier task to complete is less than the estimated remaining time for the outlier task to complete, schedule the copy of the outlier task.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
- - 2. The system according to claim 1, wherein the instructions further configure the one or more processing devices to:
    - in the second instance, determine whether a cause of the outlier task is contention for resources by evaluating progress reports reflecting processor utilization, memory utilization, or both on a server that is currently executing the outlier task.
  - 3. The system according to claim 1, wherein the outlier task executes concurrently with an individual other task from the plurality of tasks.
  - 4. The system according to claim 2, wherein the instructions configure the one or more processing devices to:
    - responsive to determining that the cause of the outlier task is contention for resources, schedule the copy of the outlier task on a different server than the outlier task.
  - 5. The system according to claim 4, wherein the instructions further configure the one or more processing devices to:
    - schedule at least one subsequent task from a subsequent processing phase of the job after completion of the outlier task.
  - 6. The system according to claim 1, wherein the instructions further configure the one or more processing devices to:
    - schedule the plurality of tasks in a plurality of processing slots, the outlier task executing within an individual processing slot.
  - 7. The system according to claim 6, wherein the instructions further configure the one or more processing devices to:
    - determine whether there is an available processing slot; and
      
      in the second instance, begin executing the copy of the outlier task in the available processing slot while the outlier task continues executing within the individual processing slot.
  - 8. The system according to claim 1, wherein thephase is a reducing phase of the job and the reduce phase is preceded by a mapping phase.
  - 9. The system according to claim 1, embodied as a scheduling server that initiates the plurality of tasks on other servers.
  - 10. The system according to claim 1, wherein the instructions further configure the one or more processing devices to, in the second instance:
    - determine the estimated remaining time for the outlier task to complete based on an amount of the input data that the outlier task has to process and a rate at which the outlier task processes the input data.
  - 11. The system according to claim 10, wherein the instructions further configure the one or more processing devices to, in the second instance:
    - determine the estimated time for the copy of the outlier task to complete based on processing rates at which the other tasks in the phase are processing the input data.
  - 12. The system according to claim 10, wherein the instructions further configure the one or more processing devices to:
    - determine the estimated time for the copy of the outlier task to complete based on the amount of the input data that the outlier task has to process.

13. A system comprising:
- one or more processing devices; and
  
  one or more computer-readable storage devices comprising instructions which, when executed by one or more processing devices, cause the one or more processing devices to;
  
  monitor execution of a plurality of tasks associated with a job, the plurality of tasks comprising an individual task that is processing input data;
  
  after the individual task has already processed some of the input data and is continuing to process remaining input data;
  
  determine a rate at which the individual task is processing the input data and an amount of the remaining input data that the individual task has yet to process,determine an estimated remaining time for the individual task to complete based on the rate at which the individual task is processing the input data and the amount of the remaining input data,determine a predicted completion time for a new copy of the individual task,determine an estimated probability that the new copy of the individual task will complete sooner than the individual task based on the estimated remaining time for the individual task to complete and the predicted completion time for the new copy of the task, andwhile the individual task continues executing, schedule the new copy of the individual task that is currently executing when the estimated probability that the new copy of the individual task will complete sooner than the individual task exceeds a threshold.
- View Dependent Claims (14, 15, 16)
- - 14. The system according to claim 13, wherein the instructions further cause the one or more processing devices to:
    - schedule the new copy of the individual task to run concurrently with the individual task without killing and restarting the individual task.
  - 15. The system according to claim 13, wherein the instructions further cause the one or more processing devices to:
    - determine whether an available processing slot is available for scheduling the new copy of the individual task; and
      
      schedule the new copy of the individual task when there is an available processing slot.
  - 16. The system according to claim 14, wherein the instructions further cause the one or more processing devices to:
    - limit a number of copies of the individual task to a fixed number.

17. A method performed by at least one computing device, the method comprising:
- determining an estimated probability that output data of a completed task will be lost due to a fault on a server that executed the completed task, wherein the completed task has processed input data to obtain the output data;
  
  determining an estimated time to repeat the completed task based on an amount of the input data that was processed by the completed task;
  
  determining a cost to recompute the completed task based on both the estimated probability that the output data of the completed task will be lost and the estimated time to repeat the completed task;
  
  determining another estimated time to replicate the output data of the completed task by transferring the output data to another server;
  
  comparing the another estimated time to replicate the output data to the cost to recompute the completed task to determine whether to replicate the output data on the another server; and
  
  replicating the output data on the another server when the another estimated time to replicate the output data is less than the cost to recompute the completed task.
- View Dependent Claims (18, 19, 20)
- - 18. The method of claim 17, further comprising:
    - determining the cost to recompute the completed task as a product of the estimated probability that the output data of the completed task will be lost and the estimated time to repeat the completed task.
  - 19. The method of claim 17, wherein the server and the another server both physically occupy the same server rack.
  - 20. The method of claim 17, further comprising:
    - examining prior tasks that ran on the server to determine the estimated probability that the output data will be lost.

Specification

Resources

Litigation Campaign Assessment

Current Assignee
Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee
Microsoft Technology Licensing LLC (Microsoft Corporation)
Inventors
Kandula, Srikanth, Ananthanarayanan, Ganesh, Greenberg, Albert, Lu, Yi, Saha, Bikas, Harris, Edward
Primary Examiner(s)
Ghaffari, Abu Z

Application Number

US12/979,338
Publication Number

US 20120167101A1
Time in Patent Office

1,925 Days
Field of Search

718/102, 718/106, 718/107, 709/226, 710/100
US Class Current

1/1
CPC Class Codes

G06F 2209/506   Constraint

G06F 9/5038   considering the execution o...

G06F 9/5088   involving task migration

H04L 67/62   Establishing a time schedul...

System and method for proactive task scheduling of a copy of outlier task in a computing environment

First Claim

2 Assignments

0 Petitions

Accused Products

Abstract

33 Citations

20 Claims

Specification

Use Cases

Quick Links

Others

System and method for proactive task scheduling of a copy of outlier task in a computing environment

First Claim

2 Assignments

Subscription Required

Subscription Required

0 Petitions

Subscription Required

Accused Products

Subscription Required

Abstract

33 Citations

20 Claims

Specification

Subscription Required

Use Cases

Quick Links

Others