
System and method for topology-aware job scheduling and backfilling in an HPC environment

  • US 9,189,278 B2
  • Filed: 10/11/2013
  • Issued: 11/17/2015
  • Est. Priority Date: 04/15/2004
  • Status: Active Grant
First Claim

1. A method comprising:

    determining, using one or more computers, a failure of a node included in a virtual cluster of a plurality of communicatively coupled nodes of a computing environment, the virtual cluster comprising a logical grouping of nodes configured to process a job having a plurality of tasks;

    removing the failed node from the virtual cluster;

    determining whether the job is associated with the failed node;

    responsive to a determination that the job is associated with the failed node, determining other operational nodes in the virtual cluster associated with the job;

    terminating the plurality of tasks of the job on each of the other operational nodes in the virtual cluster currently executing the plurality of tasks of the job;

    deallocating the other operational nodes associated with the job;

    determining an optimum subset of nodes other than the deallocated nodes and the failed node from the virtual cluster to re-execute the job, wherein the optimum subset of nodes is determined according to a specific topology that allows cooperating tasks of the job to communicate with any other tasks by minimizing distance between any two nodes;

    allocating the optimum subset of nodes; and

    re-executing the job on the optimum subset of nodes.
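
The recovery sequence recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. Every name in it (`VirtualCluster`, `handle_failure`, `distance`, `diameter`) is hypothetical, nodes are assumed to sit at integer 3D grid coordinates, "distance between any two nodes" is assumed to be a Manhattan hop count, and the "optimum subset" is found by brute force as the same-sized subset with the smallest maximum pairwise distance. Task termination and re-execution are modeled only as changes to the allocation table.

```python
from dataclasses import dataclass, field
from itertools import combinations
from typing import Dict, Iterable, Set, Tuple

Coord = Tuple[int, int, int]  # hypothetical grid position identifying a node


def distance(a: Coord, b: Coord) -> int:
    """Manhattan hop count between two grid positions (an assumed metric)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def diameter(nodes: Iterable[Coord]) -> int:
    """Largest pairwise distance within a candidate subset of nodes."""
    return max((distance(a, b) for a, b in combinations(nodes, 2)), default=0)


@dataclass
class VirtualCluster:
    free: Set[Coord]                                   # operational, unallocated nodes
    jobs: Dict[str, Set[Coord]] = field(default_factory=dict)  # job id -> its nodes

    def handle_failure(self, failed: Coord) -> Dict[str, Set[Coord]]:
        """Walk through the claimed steps for one failed node.

        Returns the new allocation for each job that was restarted.
        """
        self.free.discard(failed)                      # remove the failed node
        restarted: Dict[str, Set[Coord]] = {}
        for job_id, nodes in list(self.jobs.items()):
            if failed not in nodes:                    # job not associated with the failure
                continue
            survivors = nodes - {failed}               # other operational nodes of the job
            del self.jobs[job_id]                      # terminate tasks, deallocate nodes
            # Candidates exclude the failed node and the just-deallocated nodes.
            candidates = sorted(self.free - survivors)
            self.free |= survivors                     # survivors rejoin the pool afterwards
            if len(candidates) < len(nodes):
                continue                               # not enough nodes; job stays unscheduled
            # Brute-force search for the most compact subset (assumed optimality criterion).
            best = set(min(combinations(candidates, len(nodes)), key=diameter))
            self.free -= best                          # allocate the chosen subset
            self.jobs[job_id] = best                   # re-execute the job there
            restarted[job_id] = best
        return restarted


if __name__ == "__main__":
    vc = VirtualCluster(
        free={(0, 0, 0), (0, 0, 1), (0, 1, 0), (5, 5, 5)},
        jobs={"job-a": {(2, 2, 2), (2, 2, 3)}},
    )
    print(vc.handle_failure((2, 2, 2)))
```

The brute-force search is exponential in the job size and is used here only to keep the sketch short; a real scheduler would need a topology-specific search.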
