System and method for topology-aware job scheduling and backfilling in an HPC environment
First Claim
1. A method comprising:
determining, using one or more computers, available space in a virtual cluster of a plurality of communicatively coupled nodes included in a computing environment, the virtual cluster associated with a group of users that submit similar jobs, and comprising a logical grouping of nodes configured to process related jobs;
determining an optimum job that is compatible with the available space in the virtual cluster of nodes by:
determining a number of available nodes in the virtual cluster;
selecting a first job from a job queue;
dynamically determining an optimum shape of the first job;
determining whether the number of available nodes is enough to execute the first job, based on the optimum shape thereof; and
dynamically allocating one or more of the available nodes for the first job, in the event that the determined number of available nodes is enough to execute the first job;
wherein the optimum shape comprises one or more of:
a best fit cube in which the one or more available nodes are allocated in a cubic volume so as to allow cooperating tasks to exchange data with any other tasks by minimizing the distance between any two nodes; and
a best fit sphere in which the one or more available nodes are allocated in a spherical volume such that a first task is placed in a center node of the sphere with remaining tasks placed on nodes surrounding the center node so as to minimize the distance between the first task and the remaining tasks, wherein the remaining tasks communicate with the first task, but not with each other; and
executing the optimum job in the available space in the virtual cluster of nodes.
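The shape-selection steps recited in claim 1 can be illustrated with a minimal sketch. It assumes the nodes form a 3D grid addressed by integer (x, y, z) coordinates and that distance is measured in Manhattan hops; the claim itself does not fix a topology or a distance metric. The function names `best_fit_cube_dims`, `find_cube`, and `best_fit_sphere` are hypothetical and are not taken from the patent.

```python
from itertools import product


def best_fit_cube_dims(num_tasks):
    """Return the edge lengths of the smallest cube that holds num_tasks nodes."""
    edge = 1
    while edge ** 3 < num_tasks:
        edge += 1
    return (edge, edge, edge)


def find_cube(free_nodes, dims):
    """Return the nodes of the first fully free block of size dims, or None."""
    free = set(free_nodes)
    dx, dy, dz = dims
    for ox, oy, oz in sorted(free):
        block = [
            (ox + i, oy + j, oz + k)
            for i, j, k in product(range(dx), range(dy), range(dz))
        ]
        if all(node in free for node in block):
            return block
    return None


def manhattan(a, b):
    """Hop count between two grid nodes (assumed metric)."""
    return sum(abs(p - q) for p, q in zip(a, b))


def best_fit_sphere(center, free_nodes, num_tasks):
    """Place the first task on center and the rest on the nearest free nodes."""
    free = set(free_nodes)
    if center not in free or len(free) < num_tasks:
        return None  # not enough available nodes for this shape
    others = sorted(
        (n for n in free if n != center),
        key=lambda n: (manhattan(center, n), n),
    )
    return [center] + others[: num_tasks - 1]
```

In this sketch the cube keeps every pair of cooperating tasks close together, while the sphere only minimizes the distance from each remaining task to the first task, which matches the two communication patterns the claim distinguishes. A `None` result corresponds to the case where the available nodes are not enough to execute the job in the chosen shape.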
Abstract
A method for job management in an HPC environment includes determining an unallocated subset from a plurality of HPC nodes, with each of the unallocated HPC nodes comprising an integrated fabric. An HPC job is selected from a job queue and executed using at least a portion of the unallocated subset of nodes.
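The flow summarized in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: nodes are plain integer identifiers, each job is a dictionary with hypothetical `name` and `nodes` fields, and the policy of skipping a job that does not fit in favor of a later one that does (a simple form of backfilling, as named in the title) is an assumption. The integrated fabric of each node is not modeled.

```python
def schedule_next(queue, all_nodes, allocated):
    """Pick the first queued job that fits the unallocated nodes and allocate it.

    queue     -- list of job dicts with "name" and "nodes" (hypothetical fields)
    all_nodes -- list of node identifiers in the environment
    allocated -- set of node identifiers already in use (updated in place)
    """
    # Determine the unallocated subset of nodes.
    free = [n for n in all_nodes if n not in allocated]

    for index, job in enumerate(queue):
        if job["nodes"] <= len(free):
            # Use only a portion of the unallocated subset.
            chosen = free[: job["nodes"]]
            allocated.update(chosen)
            del queue[index]
            return job["name"], chosen

    return None  # no queued job fits the currently free nodes
```

For example, with eight nodes of which four are busy, a six-node job at the head of the queue is skipped and a two-node job behind it is started on two of the free nodes.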
188 Citations
27 Claims
1. A method, as recited in full under the first claim above.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
10. A system comprising:
a plurality of communicatively coupled nodes of a computing environment; and
a management node configured to:
determine available space in a virtual cluster of a plurality of the communicatively coupled nodes, the virtual cluster associated with a group of users that submit similar jobs, and comprising a logical grouping of nodes configured to process related jobs;
determine an optimum job that is compatible with the available space in the virtual cluster of nodes, the optimum job determined by the management node further configured to:
determine a number of available nodes in the virtual cluster;
select a first job from a job queue;
dynamically determine an optimum shape of the first job;
determine whether the number of available nodes is enough to execute the first job, based on the optimum shape thereof; and
dynamically allocate one or more of the available nodes for the first job, in the event that the determined number of available nodes is enough to execute the first job,
wherein the optimum shape comprises one or more of:
a best fit cube in which the one or more available nodes are allocated in a cubic volume so as to allow cooperating tasks to exchange data with any other tasks by minimizing the distance between any two nodes; and
a best fit sphere in which the one or more available nodes are allocated in a spherical volume such that a first task is placed in a center node of the sphere with remaining tasks placed on nodes surrounding the center node so as to minimize the distance between the first task and the remaining tasks, wherein the remaining tasks communicate with the first task, but not with each other; and
execute the optimum job in the available space in the virtual cluster of nodes.
Dependent claims: 11, 12, 13, 14, 15, 16, 17, 18
19. A non-transitory, computer readable storage medium having computer readable instructions stored thereon that, when executed by a computer, implement a method, the method comprising:
determining available space in a virtual cluster of a plurality of communicatively coupled nodes included in a computing environment, the virtual cluster associated with a group of users that submit similar jobs, and comprising a logical grouping of nodes configured to process related jobs;
determining an optimum job that is compatible with the available space in the virtual cluster of nodes by:
determining a number of available nodes in the virtual cluster;
selecting a first job from a job queue;
dynamically determining an optimum shape of the first job;
determining whether the number of available nodes is enough to execute the first job, based on the optimum shape thereof; and
dynamically allocating one or more of the available nodes for the first job, in the event that the determined number of available nodes is enough to execute the first job;
wherein the optimum shape comprises one or more of:
a best fit cube in which the one or more available nodes are allocated in a cubic volume so as to allow cooperating tasks to exchange data with any other tasks by minimizing the distance between any two nodes; and
a best fit sphere in which the one or more available nodes are allocated in a spherical volume such that a first task is placed in a center node of the sphere with remaining tasks placed on nodes surrounding the center node so as to minimize the distance between the first task and the remaining tasks, wherein the remaining tasks communicate with the first task, but not with each other; and
executing the optimum job in the available space in the virtual cluster of nodes.
Dependent claims: 20, 21, 22, 23, 24, 25, 26, 27
Specification