System and method for topology-aware job scheduling and backfilling in an HPC environment
Abstract
A method for job management in an HPC environment includes determining an unallocated subset from a plurality of HPC nodes, with each of the unallocated HPC nodes comprising an integrated fabric. An HPC job is selected from a job queue and executed using at least a portion of the unallocated subset of nodes.
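The flow described in the abstract can be illustrated with a minimal sketch. Everything named here (`Job`, `Scheduler`, `dispatch_next`, and the "first queued job that fits" selection policy) is a hypothetical illustration, not the patented method; the abstract does not say how a job is selected or which portion of the unallocated subset is used.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    nodes_needed: int


@dataclass
class Scheduler:
    all_nodes: set
    allocated: dict = field(default_factory=dict)   # node id -> job name
    queue: deque = field(default_factory=deque)     # pending jobs, oldest first

    def unallocated(self):
        """Determine the unallocated subset of nodes."""
        return sorted(self.all_nodes - set(self.allocated))

    def dispatch_next(self):
        """Select a queued job and give it a portion of the unallocated nodes.

        Assumed policy: take the first queued job that fits in the free nodes.
        Returns (job, nodes) or None if no queued job fits.
        """
        free = self.unallocated()
        for job in list(self.queue):
            if job.nodes_needed <= len(free):
                self.queue.remove(job)
                chosen = free[:job.nodes_needed]
                for node in chosen:
                    self.allocated[node] = job.name
                return job, chosen
        return None


sched = Scheduler(all_nodes={0, 1, 2, 3})
sched.queue.extend([Job("a", 3), Job("b", 2)])
first = sched.dispatch_next()    # job "a" takes three of the four free nodes
second = sched.dispatch_next()   # job "b" needs two nodes, only one is free
```

Because the loop scans the whole queue rather than only its head, a small job queued behind a large one can start first when the large one does not fit, which is a simple form of backfilling.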
18 Claims
1. A method comprising:
determining, using one or more computers, a failure of a node included in a virtual cluster of a plurality of communicatively coupled nodes of a computing environment, the virtual cluster comprising a logical grouping of nodes configured to process a job having a plurality of tasks;
removing the failed node from the virtual cluster;
determining whether the job is associated with the failed node;
responsive to a determination that the job is associated with the failed node, determining other operational nodes in the virtual cluster associated with the job;
terminating the plurality of tasks of the job on each of the other operational nodes in the virtual cluster currently executing the plurality of tasks of the job;
deallocating the other operational nodes associated with the job;
determining an optimum subset of nodes other than the deallocated nodes and the failed node from the virtual cluster to re-execute the job, wherein the optimum subset of nodes is determined according to a specific topology that allows cooperating tasks of the job to communicate with any other tasks by minimizing distance between any two nodes;
allocating the optimum subset of nodes; and
re-executing the job on the optimum subset of nodes.
Dependent claims: 2, 3, 4, 5, 6
7. A system comprising:
a plurality of communicatively coupled nodes of a computing environment; and
a management node, comprising at least one processing device, configured to:
determine a failure of a node included in a virtual cluster of the plurality of communicatively coupled nodes of the computing environment, the virtual cluster comprising a logical grouping of nodes configured to process a job having a plurality of tasks;
remove the failed node from the virtual cluster;
determine whether the job is associated with the failed node;
responsive to a determination that the job is associated with the failed node, determine other operational nodes in the virtual cluster associated with the job;
terminate the plurality of tasks of the job on each of the other operational nodes in the virtual cluster currently executing the plurality of tasks of the job;
deallocate the other operational nodes associated with the job;
determine an optimum subset of nodes other than the deallocated nodes and the failed node from the virtual cluster to re-execute the job, wherein the optimum subset of nodes is determined according to a specific topology that allows cooperating tasks of the job to communicate with any other tasks by minimizing distance between any two nodes;
allocate the optimum subset of nodes; and
re-execute the job on the optimum subset of nodes.
Dependent claims: 8, 9, 10, 11, 12
13. A non-transitory computer readable storage medium having computer readable instructions stored thereon that, when executed by a computer, implement a method, the method comprising:
determining, using one or more computers, a failure of a node included in a virtual cluster of a plurality of communicatively coupled nodes of a computing environment, the virtual cluster comprising a logical grouping of nodes configured to process a job having a plurality of tasks;
removing the failed node from the virtual cluster;
determining whether the job is associated with the failed node;
responsive to a determination that the job is associated with the failed node, determining other operational nodes in the virtual cluster associated with the job;
terminating the plurality of tasks of the job on each of the other operational nodes in the virtual cluster currently executing the plurality of tasks of the job;
deallocating the other operational nodes associated with the job;
determining an optimum subset of nodes other than the deallocated nodes and the failed node from the virtual cluster to re-execute the job, wherein the optimum subset of nodes is determined according to a specific topology that allows cooperating tasks of the job to communicate with any other tasks by minimizing distance between any two nodes;
allocating the optimum subset of nodes; and
re-executing the job on the optimum subset of nodes.
Dependent claims: 14, 15, 16, 17, 18
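The recovery sequence recited in claims 1, 7, and 13 can be sketched in code. This is only an illustration under stated assumptions: nodes are identified by grid coordinates, "distance" is taken to be Manhattan distance, "optimum" is read as the subset whose largest pairwise distance is smallest, and the search is brute force. The claims do not fix any of these choices, and the names `distance`, `optimum_subset`, and `recover_from_failure` are hypothetical.

```python
from itertools import combinations


def distance(a, b):
    """Manhattan distance between two node coordinates (assumed metric)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def optimum_subset(candidates, size):
    """Brute-force search for the subset with the smallest maximum pairwise distance.

    Returns None when fewer than `size` candidates are available.
    """
    best, best_span = None, None
    for subset in combinations(sorted(candidates), size):
        span = max((distance(a, b) for a, b in combinations(subset, 2)), default=0)
        if best_span is None or span < best_span:
            best, best_span = set(subset), span
    return best


def recover_from_failure(cluster, jobs, failed):
    """Apply the claimed steps to every job that was using the failed node.

    cluster: set of node coordinates in the virtual cluster (mutated in place)
    jobs:    dict mapping job name -> set of nodes allocated to that job
    Returns a dict mapping each affected job to its new allocation (or None).
    """
    cluster.discard(failed)                      # remove the failed node
    restarted = {}
    for name, nodes in list(jobs.items()):
        if failed not in nodes:                  # job not associated with the failure
            continue
        size = len(nodes)
        survivors = nodes - {failed}             # other operational nodes of the job
        # A real system would terminate the job's tasks on each survivor here.
        jobs[name] = set()                       # deallocate the survivors
        # Assumption: nodes held by other jobs are also unavailable.
        busy = set().union(*(n for j, n in jobs.items() if j != name))
        candidates = cluster - survivors - busy  # excludes deallocated and failed nodes
        new_nodes = optimum_subset(candidates, size)
        if new_nodes is not None:
            jobs[name] = new_nodes               # allocate, then re-execute the job
        restarted[name] = new_nodes
    return restarted


cluster = {(x, y) for x in range(4) for y in range(2)}
jobs = {"A": {(0, 0), (1, 0)}}
result = recover_from_failure(cluster, jobs, failed=(0, 0))
```

The brute-force search is exponential in the subset size and is only meant to make the selection criterion concrete; a practical scheduler would exploit the structure of the topology instead.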
Specification