System and method for topology-aware job scheduling and backfilling in an HPC environment
First Claim
1. A method comprising:
identifying, using one or more processors, one or more nodes of a first virtual cluster of nodes that are available for processing a job, the first virtual cluster of nodes being one of a plurality of virtual clusters of nodes;
selecting a job from a job queue;
determining whether the identified nodes are sufficient to process the selected job, wherein the determining includes determining whether a topology of the identified nodes accommodates a shape of the selected job;
in response to determining the identified nodes are not sufficient to process the selected job, selecting one or more nodes from other nodes of the plurality of virtual clusters of nodes that are not currently available for processing the selected job;
allocating the selected one or more nodes and the identified nodes for processing the selected job; and
processing the selected job using the allocated nodes after the selected one or more nodes becoming available to process the selected job.
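The steps of the claim can be illustrated with a minimal sketch. It is not the patented implementation: it assumes nodes are identified by integer (x, y, z) coordinates in a grid, that a job "shape" is an axis-aligned box of dimensions (dx, dy, dz), and that wraparound links and rotated shapes are ignored. The names `box`, `find_fit`, and `plan_allocation` are hypothetical.

```python
from itertools import product

def box(origin, shape):
    """Coordinates of an axis-aligned box of `shape` anchored at `origin`."""
    return {tuple(o + d for o, d in zip(origin, offsets))
            for offsets in product(*(range(s) for s in shape))}

def find_fit(nodes, shape):
    """Return a box of `shape` lying entirely inside `nodes`, or None.

    Assumption: the topology "accommodates" the shape when such a box exists.
    """
    for origin in sorted(nodes):
        candidate = box(origin, shape)
        if candidate <= nodes:  # every coordinate of the box is in `nodes`
            return candidate
    return None

def plan_allocation(available, busy_elsewhere, shape):
    """Return (allocated_nodes, nodes_to_wait_for) for a job of `shape`.

    available      -- free nodes identified in the first virtual cluster
    busy_elsewhere -- nodes of other virtual clusters that are not yet free
    """
    fit = find_fit(available, shape)
    if fit is not None:
        return fit, set()        # identified nodes are sufficient
    fit = find_fit(available | busy_elsewhere, shape)
    if fit is None:
        return None, set()       # no placement exists even with busy nodes
    # The job runs only after the busy nodes in the allocation become free.
    return fit, fit & busy_elsewhere

available = {(0, 0, 0), (1, 0, 0)}
busy = {(0, 1, 0), (1, 1, 0)}
allocated, waiting_on = plan_allocation(available, busy, (2, 2, 1))
```

In this example the two free nodes cannot hold a 2x2x1 job, so the plan allocates all four nodes and records the two busy ones as the nodes the job must wait for. The sketch takes the first placement it finds; a real scheduler would presumably weigh alternatives, for instance by preferring the placement with the fewest busy nodes.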
1 Assignment
0 Petitions
Abstract
A method for job management in an HPC environment includes determining an unallocated subset from a plurality of HPC nodes, with each of the unallocated HPC nodes comprising an integrated fabric. An HPC job is selected from a job queue and executed using at least a portion of the unallocated subset of nodes.
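The flow described in the abstract might look roughly like the following sketch. The abstract does not say how the job is chosen, so the skip-ahead policy here (take the first queued job that fits in the free nodes) is an assumption made for illustration, as are the names `unallocated` and `select_job` and the representation of a job as a dictionary with a `nodes` count.

```python
def unallocated(all_nodes, allocated):
    """Return the nodes that are not currently allocated, in sorted order."""
    return sorted(set(all_nodes) - set(allocated))

def select_job(queue, free_nodes):
    """Remove and return the first queued job that fits in `free_nodes`.

    Assumption: each job is a dict whose "nodes" entry is the node count
    it needs. Returns None when no queued job fits.
    """
    for i, job in enumerate(queue):
        if job["nodes"] <= len(free_nodes):
            del queue[i]  # safe: we return immediately after mutating
            return job
    return None

all_nodes = ["n0", "n1", "n2", "n3"]
free = unallocated(all_nodes, allocated={"n1"})
queue = [{"name": "big", "nodes": 4}, {"name": "small", "nodes": 2}]

job = select_job(queue, free)
# Execute on a portion of the unallocated subset.
assigned = free[:job["nodes"]] if job else []
```

Here the four-node job at the head of the queue does not fit in the three free nodes, so the two-node job is selected and runs on part of the unallocated subset while the larger job stays queued.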
208 Citations
26 Claims
1. A method (recited in full under First Claim). Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9)
10. A non-transitory machine-readable storage device including instructions stored thereon that, when executed by a machine, cause the machine to perform operations comprising:
identifying one or more nodes of a first virtual cluster of nodes that are available for processing a job, the first virtual cluster of nodes being one of a plurality of virtual clusters of nodes;
selecting a job from a job queue;
determining whether the identified nodes are sufficient to process the selected job, wherein the determining includes determining whether a topology of the identified nodes accommodates a shape of the selected job;
in response to determining the identified nodes are not sufficient to process the selected job, selecting one or more nodes from other nodes of the plurality of virtual clusters of nodes that are not currently available for processing the selected job;
allocating the selected one or more nodes and the identified nodes for processing the selected job; and
processing the selected job using the allocated nodes after the selected one or more nodes becoming available to process the selected job.
Dependent Claims (11, 12, 13, 14, 15, 16, 17, 18)
19. A system comprising:
a cluster management engine;
virtual clusters of nodes, each node of the virtual clusters comprising a processing device; and
the cluster management engine configured to:
identify one or more nodes of a first virtual cluster of nodes that are available for processing a job, the first virtual cluster of nodes is one of the virtual clusters of nodes;
selecting a job from a job queue;
determine whether the identified nodes are sufficient to process the selected job, wherein the determining includes determining whether a topology of the identified nodes accommodates a shape of the selected job;
in response to a determination that the identified nodes are not sufficient to process the selected job, select one or more nodes from other nodes of the plurality of virtual clusters of nodes that are not currently available for processing the selected job; and
allocate the selected one or more nodes and the identified nodes for processing the selected job; and
the allocated nodes to process the selected job, using the processing device, after the selected one or more nodes becoming available to process the selected job.
Dependent Claims (20, 21, 22, 23, 24, 25, 26)
Specification