System and method for topology-aware job scheduling and backfilling in an HPC environment
First Claim
1. A method comprising:
determining, using one or more processors, a node of a first job space of a plurality of communicatively coupled nodes has failed, the first job space including a logical grouping of nodes configured to process a job and communicatively coupled to accommodate a logical shape specified by at least one of a job policy and a job parameter;
determining whether the failed node is allocated to execute the job;
responsive to a determination that the failed node is allocated to execute the job, identifying operational nodes, other than the failed node, in the first job space;
terminating a task of the job on each of the operational nodes executing the job;
deallocating the operational nodes from the job;
identifying a subset of nodes of the plurality of communicatively coupled nodes, other than the failed node, communicatively coupled in the logical shape;
allocating the identified subset of nodes to execute the job in a second job space; and
executing the job on the allocated second job space.
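The steps of claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: it assumes the nodes sit on a regular grid, that a "logical shape" is an axis-aligned box of that grid, and that the first matching box is good enough. The names `Cluster`, `find_shape`, `submit`, and `handle_failure` are hypothetical and do not come from the patent.

```python
from itertools import product


def find_shape(dims, shape, unavailable):
    """Return the coordinates of the first axis-aligned box of size `shape`
    inside a grid of size `dims` that avoids every node in `unavailable`,
    or None if no such box exists. (Assumption: shapes are simple boxes.)"""
    ranges = [range(d - s + 1) for d, s in zip(dims, shape)]
    for origin in product(*ranges):
        box = {
            tuple(o + i for o, i in zip(origin, offset))
            for offset in product(*(range(s) for s in shape))
        }
        if box.isdisjoint(unavailable):
            return box
    return None


class Cluster:
    """Hypothetical management-node bookkeeping for a grid of nodes."""

    def __init__(self, dims):
        self.dims = dims
        self.failed = set()
        self.allocated = {}  # job id -> set of node coordinates (job space)
        self.shapes = {}     # job id -> requested logical shape
        self.log = []        # record of terminate/execute actions

    def _busy(self):
        # Nodes that cannot be handed out: failed or held by another job.
        busy = set(self.failed)
        for nodes in self.allocated.values():
            busy |= nodes
        return busy

    def submit(self, job, shape):
        space = find_shape(self.dims, shape, self._busy())
        if space is None:
            return None
        self.allocated[job] = space
        self.shapes[job] = shape
        self.log.append(("execute", job))
        return space

    def handle_failure(self, node):
        # Step 1: record that the node has failed.
        self.failed.add(node)
        # Step 2: determine whether the failed node is allocated to a job.
        job = next((j for j, ns in self.allocated.items() if node in ns), None)
        if job is None:
            return None
        # Step 3: identify the operational nodes in the first job space.
        operational = self.allocated[job] - {node}
        # Step 4: terminate the job's task on each operational node.
        for n in sorted(operational):
            self.log.append(("terminate", job, n))
        # Step 5: deallocate the operational nodes from the job.
        del self.allocated[job]
        # Step 6: identify a subset with the same logical shape that
        # excludes the failed node.
        space = find_shape(self.dims, self.shapes[job], self._busy())
        if space is None:
            return None  # Assumption: the sketch simply gives up here.
        # Steps 7 and 8: allocate the second job space and execute the job.
        self.allocated[job] = space
        self.log.append(("execute", job))
        return space


cluster = Cluster((4, 4))
cluster.submit("A", (2, 2))
new_space = cluster.handle_failure((0, 0))
print(sorted(new_space))
```

Because the operational nodes are deallocated before the search, the second job space may reuse some of them, which is consistent with the claim's requirement that only the failed node be excluded.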
Abstract
A method for job management in an HPC environment includes determining an unallocated subset from a plurality of HPC nodes, with each of the unallocated HPC nodes comprising an integrated fabric. An HPC job is selected from a job queue and executed using at least a portion of the unallocated subset of nodes.
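The flow in the abstract (find the unallocated nodes, pick a job from the queue, run it on part of that subset) might be sketched as follows. This is an illustrative assumption, not the method disclosed in the specification: jobs are reduced to a name and a node count, topology is ignored, and the selection rule is "first queued job that fits". The function names `unallocated`, `select_job`, and `schedule_once` are hypothetical.

```python
def unallocated(all_nodes, allocated):
    """Return the nodes that are not currently allocated to any job."""
    return [n for n in all_nodes if n not in allocated]


def select_job(queue, free_count):
    """Remove and return the first queued (name, nodes_needed) job that
    fits in `free_count` nodes, skipping larger jobs ahead of it.
    Returns None if nothing fits."""
    for i, (name, needed) in enumerate(queue):
        if needed <= free_count:
            del queue[i]
            return name, needed
    return None


def schedule_once(all_nodes, allocated, queue):
    """One scheduling pass: determine the unallocated subset, select a job,
    and allocate a portion of the subset to it."""
    free = unallocated(all_nodes, allocated)
    picked = select_job(queue, len(free))
    if picked is None:
        return None
    name, needed = picked
    chosen = free[:needed]    # use only a portion of the free nodes
    allocated.update(chosen)  # mark those nodes as allocated
    return name, chosen


nodes = list(range(8))
in_use = {0, 1, 2, 3, 4}
job_queue = [("big", 6), ("small", 2)]
print(schedule_once(nodes, in_use, job_queue))
```

In this sketch the job named "big" stays queued because it needs more nodes than are free, while a smaller job behind it is started, which is the basic idea behind backfilling.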
20 Claims
8. A system comprising:
a plurality of communicatively coupled nodes; and
a management node, comprising at least one processing device, configured to:
determine, using one or more processors, a node of a first job space of a plurality of communicatively coupled nodes has failed, the first job space including a logical grouping of nodes configured to process a job and communicatively coupled to accommodate a logical shape specified by at least one of a job policy and a job parameter;
determine whether the failed node is allocated to execute the job;
responsive to a determination that the failed node is allocated to execute the job, identify operational nodes, other than the failed node, in the first job space;
terminate a task of the job on each of the operational nodes executing the job;
deallocate the operational nodes from the job;
identify a subset of nodes of the plurality of communicatively coupled nodes, other than the failed node, communicatively coupled in the logical shape;
allocate the identified subset of nodes to execute the job in a second job space; and
execute the job on the allocated second job space.
14. A non-transitory computer readable storage medium having computer readable instructions stored thereon that, when executed by a computer, implement a method, the method comprising:
determining a node of a first job space of a plurality of communicatively coupled nodes has failed, the first job space including a logical grouping of nodes configured to process a job and communicatively coupled to accommodate a logical shape specified by at least one of a job policy and a job parameter;
determining whether the failed node is allocated to execute the job;
responsive to a determination that the failed node is allocated to execute the job, identifying operational nodes, other than the failed node, in the first job space;
terminating a task of the job on each of the operational nodes executing the job;
deallocating the operational nodes from the job;
identifying a subset of nodes of the plurality of communicatively coupled nodes, other than the failed node, communicatively coupled in the logical shape;
allocating the identified subset of nodes to execute the job in a second job space; and
executing the job on the allocated second job space.
Specification