System and method for topology-aware job scheduling and backfilling in an HPC environment

  • US 10,621,009 B2
  • Filed: 12/27/2017
  • Issued: 04/14/2020
  • Est. Priority Date: 04/15/2004
  • Status: Active Grant
First Claim

1. A method comprising:

  • determining, using one or more processors, a node of a first job space of a plurality of communicatively coupled nodes has failed, the first job space including a logical grouping of nodes configured to process a job and communicatively coupled to accommodate a logical shape specified by at least one of a job policy and a job parameter;

    determining whether the failed node is allocated to execute the job;

    responsive to a determination that the failed node is allocated to execute the job, identifying operational nodes, other than the failed node, in the first job space;

    terminating a task of the job on each of the operational nodes executing the job;

    deallocating the operational nodes from the job;

    identifying a subset of nodes of the plurality of communicatively coupled nodes, other than the failed node, communicatively coupled in the logical shape;

    allocating the identified subset of nodes to execute the job in a second job space; and

    executing the job on the allocated second job space.
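
The steps recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: it assumes the nodes sit on a simple 2D grid without wraparound, that a "logical shape" is a width-by-height rectangle, and that terminating or executing a task amounts to updating a set. The names `Cluster`, `Job`, `find_block`, and `recover` are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass, field
from itertools import product


@dataclass
class Job:
    name: str
    shape: tuple                                # assumed logical shape: (width, height)
    nodes: set = field(default_factory=set)     # current job space (allocated nodes)
    running: set = field(default_factory=set)   # nodes with a running task


@dataclass
class Cluster:
    width: int
    height: int
    failed: set = field(default_factory=set)
    allocated: set = field(default_factory=set)

    def find_block(self, shape):
        """Return the first free, fully operational rectangle of the given shape, or None."""
        w, h = shape
        for x0, y0 in product(range(self.width - w + 1), range(self.height - h + 1)):
            block = {(x0 + dx, y0 + dy) for dx, dy in product(range(w), range(h))}
            if not (block & self.failed) and not (block & self.allocated):
                return block
        return None


def recover(cluster, job, failed_node):
    """Walk through the claimed steps for one failed node. Returns True if the job was restarted."""
    # Determine that a node has failed and record it.
    cluster.failed.add(failed_node)
    # Determine whether the failed node is allocated to execute the job.
    if failed_node not in job.nodes:
        return False
    # Identify the operational nodes in the first job space.
    operational = job.nodes - {failed_node}
    # Terminate the job's task on each operational node (the failed node runs nothing).
    job.running -= operational
    job.running.discard(failed_node)
    # Deallocate the job's nodes; the failed node stays excluded via cluster.failed.
    cluster.allocated -= job.nodes
    job.nodes = set()
    # Identify a subset of nodes, other than the failed node, coupled in the logical shape.
    block = cluster.find_block(job.shape)
    if block is None:
        return False  # no suitable second job space in this sketch; a real scheduler might queue the job
    # Allocate the second job space and execute the job on it.
    cluster.allocated |= block
    job.nodes = block
    job.running = set(block)
    return True
```

Because the operational nodes are released before the search, they remain eligible for the second job space; only the failed node is excluded. A real scheduler would also need to handle torus or switch-based topologies and competing jobs, which this sketch leaves out.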
