System and method for topology-aware job scheduling and backfilling in an HPC environment
Abstract
A method for job management in an HPC environment includes determining an unallocated subset from a plurality of HPC nodes, with each of the unallocated HPC nodes comprising an integrated fabric. An HPC job is selected from a job queue and executed using at least a portion of the unallocated subset of nodes.
16 Claims
1. A method comprising:
determining, using one or more computers, an original subset of a plurality of communicatively coupled nodes of a computing environment, the original subset comprising nodes currently unallocated to a job;
selecting a job from a job queue; and
determining that dimensions of the selected job are greater than a topology of the original subset;
selecting one or more nodes from a second plurality of nodes, the second plurality being distinct from the original subset, wherein the selected one or more nodes from the second plurality are unavailable at the time of selecting; and
adding the nodes selected from the second plurality to the original subset to satisfy the dimensions of the selected job after the nodes selected from the second plurality become available; and
executing the selected job.
Dependent claims: 2, 3, 4, 5, 6
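The steps recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: the names `Node`, `Job`, `plan_job`, and the `free_at` release time are hypothetical, and the comparison between job dimensions and topology is reduced to a node count, whereas a topology-aware scheduler would compare the shape of the job against the shape of the free region.

```python
from dataclasses import dataclass


@dataclass
class Node:
    coord: tuple       # (x, y, z) position in a hypothetical 3D grid
    free_at: int = 0   # assumed release time; 0 means unallocated now


@dataclass
class Job:
    name: str
    dims: tuple        # requested (x, y, z) extent, in nodes


def job_size(job):
    """Number of nodes implied by the job's dimensions."""
    x, y, z = job.dims
    return x * y * z


def plan_job(nodes, job, now=0):
    """Return (start_time, chosen_nodes), or None if the job cannot fit at all."""
    # Original subset: nodes currently unallocated to any job.
    original = [n for n in nodes if n.free_at <= now]
    needed = job_size(job) - len(original)
    if needed <= 0:
        # The job fits in the original subset and can start immediately.
        return now, original[:job_size(job)]
    # The job is larger than the original subset: select nodes that are
    # unavailable right now, preferring those released soonest.
    busy = sorted((n for n in nodes if n.free_at > now), key=lambda n: n.free_at)
    if len(busy) < needed:
        return None
    reserved = busy[:needed]
    # The reserved nodes join the subset only once all of them are available.
    start = max(n.free_at for n in reserved)
    return start, original + reserved


if __name__ == "__main__":
    cluster = [
        Node((0, 0, 0)),
        Node((1, 0, 0)),
        Node((2, 0, 0), free_at=5),
        Node((3, 0, 0), free_at=9),
    ]
    print(plan_job(cluster, Job("demo", (3, 1, 1))))
```

In this sketch the job is not executed until the latest release time among the reserved nodes, which mirrors the claim's condition that the selected nodes are added "after the nodes selected from the second plurality become available".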
7. Software in one or more non-transitory, tangible computer-readable media and when executed operable to:
determine an original subset of a plurality of communicatively coupled nodes of a computing environment, the original subset comprising nodes currently unallocated to a job;
select a job from a job queue;
determine that dimensions of the selected job are greater than a topology of the original subset;
select one or more nodes from a second plurality of nodes, the second plurality being distinct from the original subset, wherein the selected one or more nodes from the second plurality are unavailable at the time of selecting,
add the selected second nodes to the original subset to satisfy the dimensions of the selected job after the nodes selected from the second plurality become available; and
execute the selected job.
Dependent claims: 8, 9, 10
11. A system comprising:
a plurality of communicatively coupled nodes of a computing environment; and
a management node operable to;
determine an original subset of the plurality of nodes, the original subset comprising nodes currently unallocated to a job;
select a job from a job queue;
determine that dimensions of the selected job are greater than a topology of the original subset;
select one or more nodes from a second plurality of nodes, the second plurality being distinct from the original subset, wherein the selected one or more nodes from the second plurality are unavailable at the time of selecting,
add the selected one or more nodes from the second plurality to the original subset to satisfy the dimensions of the selected job after the nodes selected from the second plurality become available; and
execute the selected job using one or more processors of one or more nodes of the modified original subset.
Dependent claims: 12, 13, 14, 15, 16
Specification