System and method for topology-aware job scheduling and backfilling in an HPC environment
Abstract
A method for job management in an HPC environment includes determining an unallocated subset from a plurality of HPC nodes, with each of the unallocated HPC nodes comprising an integrated fabric. An HPC job is selected from a job queue and executed using at least a portion of the unallocated subset of nodes.
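The flow in the abstract can be illustrated with a minimal sketch. It is not taken from the patent: the `Node` and `Job` classes, the `nodes_needed` field, and the function names are hypothetical, and the job is assumed to be characterized only by a node count (topology is ignored here).

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    node_id: int
    allocated_to: Optional[str] = None  # job name, or None when the node is free


@dataclass
class Job:
    name: str
    nodes_needed: int  # hypothetical: job size expressed as a node count only


def unallocated_subset(nodes):
    """Return the nodes not currently allocated to any job."""
    return [n for n in nodes if n.allocated_to is None]


def dispatch_next(nodes, queue):
    """Select the job at the head of the queue and run it on part of the free subset.

    Returns (job, chosen_nodes), or None when the queue is empty or the job
    does not fit, in which case the job stays queued.
    """
    if not queue:
        return None
    free = unallocated_subset(nodes)
    job = queue[0]
    if job.nodes_needed > len(free):
        return None
    queue.popleft()
    chosen = free[:job.nodes_needed]  # "at least a portion" of the free subset
    for n in chosen:
        n.allocated_to = job.name
    return job, chosen
```

For example, with four nodes of which one is busy, a two-node job at the head of the queue is placed on the first two free nodes and leaves one node unallocated.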
21 Claims
1. A method comprising:
determining, using one or more computers, an original subset of a plurality of nodes, the original subset comprising nodes currently unallocated to a job, each node in the plurality of nodes comprising a switching fabric comprising a switch integrated on the card and allowing node to node communication during execution of a job;
selecting a job from a job queue;
executing the selected job using one or more processors of one or more nodes of the original subset; and
determining that dimensions of the selected job are greater than a topology of the original subset;
selecting one or more nodes from a second plurality of nodes, the second plurality being distinct from the original subset, each of the nodes in the second plurality of nodes comprising a switching fabric integrated to a card and at least two processors integrated to the card, wherein the selected one or more nodes from the second plurality are unavailable at the time of selecting; and
adding the nodes selected from the second plurality to the original subset to satisfy the dimensions of the selected job after the nodes selected from the second plurality become available.
Dependent claims: 2, 3, 4, 5, 6, 7
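The expansion steps of claim 1 (job dimensions exceed the free topology, so currently unavailable nodes are selected and added once they become available) might be sketched as follows. This is an illustrative reading, not the claimed implementation: it assumes nodes are addressed by 3D grid coordinates, that "dimensions" means an axis-aligned box, and that the preferred placement is the one needing the fewest busy nodes. All function names are hypothetical.

```python
from itertools import product


def box(origin, dims):
    """All coordinates of an axis-aligned box with the given origin and dimensions."""
    return {tuple(o + d for o, d in zip(origin, offs))
            for offs in product(*(range(n) for n in dims))}


def find_placement(job_dims, grid_dims, free, busy):
    """Return (coords, pending) for the placement that needs the fewest busy nodes.

    `free` is the original (unallocated) subset and `busy` the second, distinct
    set of nodes that are unavailable at selection time. `pending` is the part
    of the placement that must still become available. Returns None when the
    job cannot fit in the grid at all.
    """
    best = None
    origins = product(*(range(g - j + 1) for g, j in zip(grid_dims, job_dims)))
    for origin in origins:
        coords = box(origin, job_dims)
        if not coords <= (free | busy):
            continue  # placement touches a node outside both sets
        pending = coords & busy
        if best is None or len(pending) < len(best[1]):
            best = (coords, pending)
    return best


def backfill_step(subset, pending, newly_freed):
    """Add reserved nodes to the subset once they become available."""
    arrived = pending & newly_freed
    return subset | arrived, pending - arrived
```

In this sketch the job may start once `pending` is empty, that is, when the grown subset covers every coordinate returned by `find_placement`.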
8. Software embodied in one or more non-transitory, tangible computer-readable media and when executed operable to:
determine an original subset of a plurality of nodes, the original subset comprising nodes currently unallocated to a job, each node in the plurality of nodes comprising a switching fabric integrated to a card and at least two processors integrated to the card, the switching fabric comprising a switch integrated on the card and allowing node to node communication during execution of a job;
select a job from a job queue;
execute the selected job using one or more processors of one or more nodes of the original subset; and
determine that dimensions of the selected job are greater than a topology of the original subset;
select one or more nodes from a second plurality of nodes, the second plurality being distinct from the original subset, each of the second nodes comprising an integrated fabric, wherein the selected one or more nodes from the second plurality are unavailable at the time of selecting; and
add the selected second nodes to the original subset to satisfy the dimensions of the selected job after the nodes selected from the second plurality become available.
Dependent claims: 9, 10, 11, 12, 13, 14
15. A system comprising:
a plurality of nodes, each node comprising a switching fabric integrated to a card and at least two processors integrated to the card, the switching fabric comprising a switch integrated on the card and allowing node to node communication during execution of a job; and
a management node operable to;
determine an original subset of the plurality of nodes, the original subset comprising nodes currently unallocated to a job;
select a job from a job queue;
execute the selected job using one or more processors of one or more nodes of the original subset; and
determine that dimensions of the selected job are greater than topology of the original subset;
select one or more nodes from a second plurality of nodes, the second plurality being distinct from the original subset, each of the nodes in the second plurality of nodes comprising a switching fabric integrated to a card and at least two processors integrated to the card, wherein the selected one or more nodes from the second plurality are unavailable at the time of selecting; and
add the selected one or more nodes from the second plurality to the original subset to satisfy the dimensions of the selected job after the nodes selected from the second plurality become available.
Dependent claims: 16, 17, 18, 19, 20, 21
Specification