System and method for topology-aware job scheduling and backfilling in an HPC environment
First Claim
1. A method comprising:
determining, using one or more computers, available space in a virtual cluster of a plurality of communicatively coupled nodes included in a computing environment, the plurality of communicatively coupled nodes arranged in a three dimensional node structure, the virtual cluster comprising a logical grouping of nodes configured to process related jobs;
determining a job of a plurality of jobs in a job queue fits the determined available space in the virtual cluster of nodes including;
identifying, using the one or more computers, a shape of the job, the shape of the job indicating dimensions of nodes suitable to execute the selected job;
identifying, using the one or more computers, one or more shapes of the available space, the one or more shapes of the available space indicating one or more dimensions of the available space; and
determining, using the one or more computers, whether respective identified dimensions of the available space is greater than or equal to the respective identified dimensions of nodes suitable to execute the job; and
executing the job in the available space in the virtual cluster of nodes, in response to determining the all respective dimensions of the available space are greater than or equal to the respective identified dimensions of one or more nodes suitable to execute the job.
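The dimension comparison recited in claim 1 can be illustrated with a minimal Python sketch. The names `Shape` and `fits` are hypothetical and do not come from the patent. The sketch assumes that a job shape and a region of available space are each described by three integer extents (x, y, z) and that they are compared axis by axis, without rotating the job.

```python
from typing import NamedTuple


class Shape(NamedTuple):
    """Extents, in nodes, along each axis of a 3D node structure (illustrative only)."""
    x: int
    y: int
    z: int


def fits(job: Shape, space: Shape) -> bool:
    """Return True when every dimension of the space is >= the matching job dimension."""
    return all(s >= j for s, j in zip(space, job))


job = Shape(2, 2, 4)
print(fits(job, Shape(4, 2, 4)))  # True: all three extents are large enough
print(fits(job, Shape(8, 8, 3)))  # False: the z extent is too small
```

Under these assumptions a job would be executed only when `fits` returns True for some identified shape of the available space, mirroring the final "executing" step of the claim.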
Abstract
A method for job management in an HPC environment includes determining an unallocated subset from a plurality of HPC nodes, with each of the unallocated HPC nodes comprising an integrated fabric. An HPC job is selected from a job queue and executed using at least a portion of the unallocated subset of nodes.
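The flow summarized in the abstract (find the unallocated nodes, select a job from the queue, run it on part of that subset) might be sketched as follows. This is an illustration under stated assumptions, not the patented method: the function `select_and_allocate`, the `(name, node_count)` job representation, and the policy of taking the first queued job that fits are all hypothetical, and topology is ignored for brevity.

```python
from collections import deque
from typing import Deque, List, Optional, Set, Tuple

Job = Tuple[str, int]  # (job name, number of nodes requested); hypothetical representation


def select_and_allocate(
    queue: Deque[Job], all_nodes: List[int], allocated: Set[int]
) -> Optional[Tuple[str, List[int]]]:
    """Pick the first queued job that fits in the unallocated nodes and reserve nodes for it."""
    # Determine the unallocated subset of nodes.
    free = [n for n in all_nodes if n not in allocated]
    # Scan a snapshot of the queue so the real queue can be modified safely.
    for job in list(queue):
        name, needed = job
        if needed <= len(free):
            queue.remove(job)
            chosen = free[:needed]  # use only a portion of the unallocated subset
            allocated.update(chosen)
            return name, chosen
    return None  # no queued job fits the currently free nodes


queue = deque([("big", 6), ("small", 2)])
allocated = {0, 1, 2, 3}
result = select_and_allocate(queue, list(range(8)), allocated)
print(result)  # ('small', [4, 5])
```

In this example the job at the head of the queue needs more nodes than are free, so the smaller job behind it is started instead, which is the basic idea behind backfilling named in the title.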
205 Citations
27 Claims
10. A system comprising:
a plurality of communicatively coupled nodes of a computing environment, each node comprising at least one hardware processing unit; and
a management node configured to;
determine available space in a virtual cluster of a plurality of the communicatively coupled nodes, the plurality of communicatively coupled arranged in a three dimensional node structure wherein the three dimensional node structure reduces the number of hops between nodes, the virtual cluster comprising a logical grouping of nodes configured to process related jobs;
determine a job of a plurality of jobs in a job queue fits the determined available space in the virtual cluster of nodes including;
identify a shape of the job, the shape of the job indicating dimensions of nodes suitable to execute the selected job;
identify one or more shapes of the available space, the one or more shapes of the available space indicating one or more dimensions of the available space; and
determine whether a respective identified dimension of the one or more dimensions of the available space is greater than or equal to a respective identified dimension of the dimensions of the nodes suitable to execute the job; and
execute the job in the available space in the virtual cluster of nodes, in response to determining all respective dimensions of the available space are greater than or equal to each of the one or more identified dimensions of one or more nodes suitable to execute the job.
19. One or more computer-readable hardware storage device having computer readable instructions stored thereon that, when executed by a computer, implement a method, the method comprising:
determining available space in a virtual cluster of a plurality of communicatively coupled nodes included in a computing environment, the plurality of communicatively coupled nodes arranged in a three dimensional node structure wherein the three dimensional node structure reduces the number of hops between nodes, the virtual cluster comprising a logical grouping of nodes configured to process related jobs;
determining a job of a plurality of jobs in a job queue fits the determined available space in the virtual cluster of nodes including;
identifying a shape of the job, the shape of the job indicating dimensions of nodes suitable to execute the selected job;
identifying one or more shapes of the available nodes, the one or more shapes of the available nodes indicating one or more dimensions of the available nodes; and
determining whether respective identified dimensions of the available nodes is greater than or equal to the respective identified dimensions of nodes suitable to execute the job; and
executing the job in the available space in the virtual cluster of nodes, in response to determining the all respective dimensions of the available nodes are greater than or equal to the respective identified dimensions of one or more nodes suitable to execute the job.
Specification