COMPUTING ON TRANSIENT RESOURCES

US 20190310885A1
Filed: 06/24/2019
Published: 10/10/2019
Est. Priority Date: 01/13/2017
Status: Active Grant


Abstract

Aspects of the technology described herein can facilitate computing on transient resources. An exemplary computing device may use a task scheduler to access information of a computational task and instability information of a transient resource. Moreover, the task scheduler can schedule the computational task to use the transient resource based at least in part on the rate of data size reduction of the computational task. Further, a checkpointing scheduler in the exemplary computing device can determine a checkpointing plan for the computational task based at least in part on a recomputation cost associated with the instability information of the transient resource. As a result, the overall utilization rate of computing resources is improved by making effective use of transient resources.

20 Claims

1. A computing system, the computing system comprising:
- a task scheduler configured for:
  - accessing instability information of a transient resource and information of a stage of a computational job, the instability information associated with an estimation of availability of the transient resource, and the stage having a plurality of parallel tasks; and
  - scheduling a task of the plurality of parallel tasks to use the transient resource based at least in part on a rate of data size reduction of the task; and
- a checkpointing scheduler, coupled to the task scheduler, configured for:
  - determining a checkpointing plan for the task based at least in part on a recomputation cost associated with the instability information of the transient resource.
- Dependent claims (2-12):
  - 2. The computing system of claim 1, further comprising:
    - a task executor, coupled to the task scheduler, configured for:
      - receiving the task from the task scheduler; and
      - determining an output data block associated with the task; and
    - a checkpoint manager, coupled to the checkpointing scheduler and the task executor, configured for:
      - receiving the checkpointing plan from the checkpointing scheduler, wherein the checkpointing plan associates a first identification for the checkpointing plan with a second identification for the task and a third identification for the output data block; and
      - executing the checkpointing plan based on the first, the second, and the third identifications.
  - 3. The computing system of claim 2, wherein the checkpoint manager is further configured for copying the output data block of the task to another transient resource that has a longer expected remaining time compared with the transient resource.
  - 4. The computing system of claim 2, wherein the checkpoint manager is further configured for inserting the checkpointing plan into a data structure featured with an order of first-in-last-out; and sequentially executing a plurality of checkpointing plans based on the order of first-in-last-out.
  - 5. The computing system of claim 4, wherein the checkpoint manager is further configured for communicating checkpointing status information of the plurality of checkpointing plans to the checkpointing scheduler, and wherein the checkpointing scheduler is further configured to adjust at least one checkpointing plan of the plurality of checkpointing plans based on the checkpointing status information.
  - 6. The computing system of claim 1, wherein the task scheduler is further configured for determining the rate of data size reduction of the task based on an estimated execution time of the task, an input data size of the task, and an output data size of the task.
  - 7. The computing system of claim 1, wherein the task scheduler is further configured for determining a rate of data size reduction of the stage based on respective rates of data size reduction of all tasks in the stage; and determining the stage has a maximum rate of data size reduction among a plurality of stages of the computational job.
  - 8. The computing system of claim 1, wherein the task scheduler is further configured for determining a ratio of an expected execution time of the task and an expected lifetime of the transient resource is less than a predetermined threshold, and scheduling the task to use the transient resource is performed only when the ratio is less than the predetermined threshold.
  - 9. The computing system of claim 1, wherein the checkpointing scheduler is further configured for determining the recomputation cost based at least in part on a first cost to recompute the task and a second cost to recompute one or more tasks associated with the task, wherein respective input data of the one or more tasks would become unavailable due to a failure of the transient resource.
  - 10. The computing system of claim 1, wherein the checkpointing scheduler is further configured for determining the recomputation cost recursively with a predetermined recursion depth limitation.
  - 11. The computing system of claim 1, wherein the checkpointing scheduler is further configured for determining a cost of backing up an output data block associated with the task based on a first cost of backing up the output data block when the transient resource fails before the backup is finished, a second cost of recomputing the task when the transient resource fails before the backup is finished, and a third cost of backing up the output data block when the transient resource fails after the backup is finished.
  - 12. The computing system of claim 1, wherein the transient resource is a virtual machine in a virtual machine cluster.
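
Claims 9 and 10 describe a recomputation cost that combines the cost of the task itself with the cost of related tasks, computed recursively up to a fixed depth. The claims do not give a formula, so the following is only a minimal sketch under stated assumptions: the related tasks are taken to be upstream tasks in the job graph whose output is no longer available, costs are plain numbers, and the names `Task`, `compute_cost`, `output_available`, and `recomputation_cost` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Task:
    """Hypothetical task record; field names are illustrative assumptions."""
    name: str
    compute_cost: float              # cost to rerun this task alone
    output_available: bool = True    # False if its output was lost with the resource
    parents: List["Task"] = field(default_factory=list)


def recomputation_cost(task: Task, depth_limit: int) -> float:
    """Cost to recompute `task`, following lost upstream outputs.

    Recursion stops once `depth_limit` levels of parents have been visited.
    Assumption: shared ancestors are counted once per path, which can
    overestimate the cost in diamond-shaped graphs.
    """
    cost = task.compute_cost
    if depth_limit <= 0:
        return cost
    for parent in task.parents:
        if not parent.output_available:
            cost += recomputation_cost(parent, depth_limit - 1)
    return cost


# A three-task chain where both upstream outputs have been lost.
extract = Task("extract", 5.0, output_available=False)
filter_ = Task("filter", 3.0, output_available=False, parents=[extract])
join = Task("join", 2.0, parents=[filter_])

for limit in (0, 1, 2):
    print(limit, recomputation_cost(join, limit))
```

A small depth limit keeps the estimate cheap to compute at the price of underestimating long chains of lost outputs.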

13. A computer-implemented method for transient resource computing, the method comprising:
- accessing information of a plurality of parallel tasks;
- determining a rate of data size reduction of a task of the plurality of parallel tasks based on an estimated execution time of the task, an input data size of the task, and an output data size of the task; and
- scheduling the task to use a transient resource based at least in part on the rate of data size reduction of the task being greater than rates of data-size reduction of other tasks in the plurality of parallel tasks.
- Dependent claims (14, 15):
  - 14. The method of claim 13, wherein the plurality of parallel tasks belong to a computing stage of a computing job, the method further comprising:
    - determining a rate of data size reduction of the computing stage based on respective rates of data size reduction of the plurality of parallel tasks; and
    - determining the computing stage has a maximum rate of data size reduction among a plurality of computing stages of the computing job.
  - 15. The method of claim 13, further comprising:
    - determining a ratio of an expected execution time of the task over an expected lifetime of the transient resource; and
    - scheduling the task to use the transient resource only when the ratio is less than a predetermined threshold.
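
The selection logic of claims 13 and 15 can be sketched as follows. The claims state only that the rate of data size reduction is based on the estimated execution time, the input size, and the output size; the specific formula used here, (input size - output size) / execution time, is an assumption, as are the names `TaskInfo`, `reduction_rate`, and `pick_task`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TaskInfo:
    """Hypothetical per-task estimates; field names are assumptions."""
    name: str
    est_exec_time: float   # estimated execution time, in seconds
    input_size: float      # input data size, in MB
    output_size: float     # output data size, in MB


def reduction_rate(t: TaskInfo) -> float:
    # Assumed formula: data removed per second of execution.
    return (t.input_size - t.output_size) / t.est_exec_time


def pick_task(tasks: List[TaskInfo], expected_lifetime: float,
              threshold: float) -> Optional[TaskInfo]:
    """Pick the task with the highest reduction rate among the tasks whose
    execution-time / resource-lifetime ratio is below `threshold`."""
    eligible = [t for t in tasks
                if t.est_exec_time / expected_lifetime < threshold]
    if not eligible:
        return None
    return max(eligible, key=reduction_rate)


tasks = [
    TaskInfo("aggregate", est_exec_time=10, input_size=100, output_size=20),
    TaskInfo("reformat", est_exec_time=5, input_size=100, output_size=90),
    TaskInfo("big_scan", est_exec_time=100, input_size=10000, output_size=0),
]

# On a resource expected to live 60 s, "big_scan" is too long to be eligible.
chosen = pick_task(tasks, expected_lifetime=60, threshold=0.5)
print(chosen.name if chosen else None)
```

Favoring tasks that shrink their data quickly means less output has to be backed up or recomputed if the transient resource disappears, which appears to be the motivation behind the claimed criterion.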

16. One or more non-transient computer storage media comprising computer-implemented instructions that, when used by one or more computing devices, cause the one or more computing devices to:
- access a task running on a transient resource and an output data block of the task;
- determine to checkpoint the task based on (a) a residual lifetime of the transient resource is shorter than a required remaining time to complete the task, and (b) a recomputation cost to recompute the task is greater than a backup cost to back up the output data block of the task; and
- checkpoint the task.
- Dependent claims (17-20):
  - 17. The one or more computer storage media of claim 16, wherein the instructions further cause the one or more computing devices to:
    - calculate a recomputation cost for the task based on computing a plurality of preceding tasks in a directed acyclic graph associated with the task, wherein the preceding tasks are limited by a predetermined recursion depth limitation based on the directed acyclic graph.
  - 18. The one or more computer storage media of claim 16, wherein the instructions further cause the one or more computing devices to:
    - determine a cost of backing up the output data block of the task based on a first cost of backing up the output data block when the transient resource fails before the backup is finished, a second cost of recomputing the task when the transient resource fails before the backup is finished, and a third cost of backing up the data block when the transient resource fails after the backup is finished.
  - 19. The one or more computer storage media of claim 16, wherein the instructions further cause the one or more computing devices to:
    - determine, in response to a new computing event, a local backup cost to back up the output data block of the task to a local storage and a remote backup cost to back up the output data block of the task to a remote storage; and
    - determine to back up the output data block to the local storage or the remote storage based at least in part on a first comparison between the recomputation cost and the local backup cost, and a second comparison between the recomputation cost and the remote backup cost.
  - 20. The one or more computer storage media of claim 16, wherein the instructions further cause the one or more computing devices to:
    - build a checkpointing plan for the task, wherein the checkpointing plan comprises a first identification of the checkpointing plan, a second identification of the task, and a third identification of the output data block;
    - insert the checkpointing plan into a data structure featured with an order of first-in-last-out; and
    - sequentially execute a plurality of checkpointing plans in the data structure based on the order of first-in-last-out.
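
The checkpointing decision of claim 16, the local-versus-remote choice of claim 19, and the first-in-last-out plan handling of claim 20 can be illustrated together. This is a sketch under assumptions, not the patented implementation: costs are plain numbers, the tie-breaking rule in `choose_backup_target` (cheapest target that beats recomputation) is an assumption, and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class CheckpointPlan:
    plan_id: str    # first identification: the plan itself
    task_id: str    # second identification: the task
    block_id: str   # third identification: the output data block


def should_checkpoint(residual_lifetime: float, remaining_task_time: float,
                      recompute_cost: float, backup_cost: float) -> bool:
    """Conditions (a) and (b) of claim 16, both required."""
    return (residual_lifetime < remaining_task_time
            and recompute_cost > backup_cost)


def choose_backup_target(recompute_cost: float, local_cost: float,
                         remote_cost: float) -> Optional[str]:
    """Assumed rule: compare each backup cost with the recomputation cost
    and keep the cheapest target that is still cheaper than recomputing."""
    candidates = [(cost, name)
                  for cost, name in ((local_cost, "local"), (remote_cost, "remote"))
                  if cost < recompute_cost]
    return min(candidates)[1] if candidates else None


def drain_plans(stack: List[CheckpointPlan]) -> List[str]:
    """Execute plans in first-in-last-out order; 'executing' a plan is
    reduced here to recording its identifier."""
    executed = []
    while stack:
        executed.append(stack.pop().plan_id)   # most recently inserted first
    return executed


stack: List[CheckpointPlan] = []
for i in (1, 2, 3):
    stack.append(CheckpointPlan(f"plan{i}", f"task{i}", f"block{i}"))

print(should_checkpoint(10, 30, 50, 20))
print(choose_backup_target(50, 20, 35))
print(drain_plans(stack))
```

A plain Python list serves as the first-in-last-out structure, so the most recently inserted plan is always executed first.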


Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Inventors: YAN, YING; GAO, YANJIE; CHEN, YANG; MOSCIBRODA, THOMAS; GANAPATHY, NARAYANAN; CHEN, BOLE; GUO, ZHONGXIN

Granted Patent: US 11,416,286 B2
CPC Class Codes

G06F 11/0757   by exceeding a time limit, ...

G06F 11/14   Error detection or correcti...

G06F 11/1451   by selection of backup cont...

G06F 11/1461   Backup scheduling policy

G06F 2201/81   Threshold

G06F 9/4881   Scheduling strategies for d...
