System and method for topology-aware job scheduling and backfilling in an HPC environment
Abstract
A method for job management in an HPC environment includes determining an unallocated subset from a plurality of HPC nodes, with each of the unallocated HPC nodes comprising an integrated fabric. An HPC job is selected from a job queue and executed using at least a portion of the unallocated subset of nodes.
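The flow summarized in the abstract (find the unallocated nodes, take a job from the queue, run it on part of that subset) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `Job` and `Cluster` classes, the `dispatch_next` function, and the use of a plain node count as the job size are all hypothetical, and the integrated fabric of each node is not modeled.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    nodes_needed: int  # assumption: job size expressed as a plain node count


@dataclass
class Cluster:
    nodes: set                                      # all node ids
    allocated: dict = field(default_factory=dict)   # node id -> job name

    def unallocated(self):
        """Return the sorted subset of nodes not assigned to any job."""
        return sorted(self.nodes - set(self.allocated))


def dispatch_next(cluster, queue):
    """Start the job at the head of the queue if enough nodes are free.

    Returns (job, chosen_nodes) on success, or None if nothing was started.
    """
    if not queue:
        return None
    free = cluster.unallocated()
    job = queue[0]
    if job.nodes_needed > len(free):
        return None                      # not enough nodes: job stays queued
    queue.popleft()
    chosen = free[:job.nodes_needed]     # a portion of the unallocated subset
    for node in chosen:
        cluster.allocated[node] = job.name
    return job, chosen


cluster = Cluster(nodes={0, 1, 2, 3})
queue = deque([Job("a", 3), Job("b", 2)])
first = dispatch_next(cluster, queue)    # job "a" starts on three nodes
second = dispatch_next(cluster, queue)   # job "b" must wait: one node is free
```

Taking the lowest-numbered free nodes is only a placeholder; a topology-aware scheduler would choose nodes according to their position in the fabric.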
18 Claims
1. A method, comprising:
receiving, by one or more computers, submission of a job from a user;
selecting a virtual cluster of a plurality of communicatively coupled nodes included in a computing environment, the virtual cluster associated with a group of users that submit similar jobs, and comprising a logical grouping of nodes configured to process related jobs, wherein the computing environment is configured to accommodate multiple virtual clusters therein;
retrieving a policy with one or more of the job and the user, and determining dimensions of the job to determine a job space within the selected virtual cluster, the job space comprising a set of nodes within the virtual cluster dynamically allocated to complete the job, wherein the virtual cluster is configured to accommodate multiple job spaces, with each job space being configured to concurrently execute a separate job;
determining whether there are a sufficient number of nodes available within the virtual cluster to allocate to the job space, and in the event sufficient nodes are not available within the virtual cluster, determining an earliest available subset of nodes in the virtual cluster on which to execute the job and adding the job to a job queue until the earliest available subset is available within the virtual cluster; and
upon a determination that a sufficient number of nodes are available within the virtual cluster, dynamically determining an optimum subset of nodes of the virtual cluster, allocating the subset for the job, and executing the job.
(Dependent claims: 2, 3, 4, 5, 6)
7. A system, comprising:
a plurality of communicatively coupled nodes of a computing environment; and
a management node configured to:
receive, by one or more computers, submission of a job from a user;
select a virtual cluster of the plurality of communicatively coupled nodes, the virtual cluster associated with a group of users that submit similar jobs, and comprising a logical grouping of nodes configured to process related jobs, wherein the computing environment is configured to accommodate multiple virtual clusters therein;
retrieve a policy with one or more of the job and the user, and determine dimensions of the job to determine a job space within the selected virtual cluster, the job space comprising a set of nodes within the virtual cluster dynamically allocated to complete the job, wherein the virtual cluster is configured to accommodate multiple job spaces, with each job space being configured to concurrently execute a separate job;
determine whether there are a sufficient number of nodes available within the virtual cluster to allocate to the job space, and in the event sufficient nodes are not available within the virtual cluster, determining an earliest available subset of nodes in the virtual cluster on which to execute the job and adding the job to a job queue until the earliest available subset is available within the virtual cluster; and
upon a determination that a sufficient number of nodes are available within the virtual cluster, dynamically determine an optimum subset of nodes of the virtual cluster, allocating the subset for the job, and executing the job.
(Dependent claims: 8, 9, 10, 11, 12)
13. A non-transitory, computer readable storage medium having computer readable instructions stored thereon that, when executed by a computer, implement a method, the method comprising:
receiving submission of a job from a user;
selecting a virtual cluster of a plurality of communicatively coupled nodes included in a computing environment, the virtual cluster associated with a group of users that submit similar jobs, and comprising a logical grouping of nodes configured to process related jobs, wherein the computing environment is configured to accommodate multiple virtual clusters therein;
retrieving a policy with one or more of the job and the user, and determining dimensions of the job to determine a job space within the selected virtual cluster, the job space comprising a set of nodes within the virtual cluster dynamically allocated to complete the job, wherein the virtual cluster is configured to accommodate multiple job spaces, with each job space being configured to concurrently execute a separate job;
determining whether there are a sufficient number of nodes available within the virtual cluster to allocate to the job space, and in the event sufficient nodes are not available within the virtual cluster, determining an earliest available subset of nodes in the virtual cluster on which to execute the job and adding the job to a job queue until the earliest available subset is available within the virtual cluster; and
upon a determination that a sufficient number of nodes are available within the virtual cluster, dynamically determining an optimum subset of nodes of the virtual cluster, allocating the subset for the job, and executing the job.
(Dependent claims: 14, 15, 16, 17, 18)
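The decision common to the three independent claims (size the job space from the job dimensions, check availability within the virtual cluster, then either allocate and run or queue the job until the earliest available subset frees up) can be sketched as follows. This is an illustrative sketch only, with several loud assumptions: `VirtualCluster`, `job_size`, `earliest_start`, and `schedule` are hypothetical names, each node carries an expected release time, the policy lookup and the selection of the virtual cluster by user group are omitted, and the "optimum subset" is replaced by simply taking the first free nodes because the claims do not state how the optimum is computed.

```python
from dataclasses import dataclass, field


@dataclass
class VirtualCluster:
    name: str
    nodes: list                                      # node ids in this logical grouping
    busy_until: dict = field(default_factory=dict)   # node id -> expected release time

    def free_nodes(self, now):
        """Nodes whose expected release time has passed (or that never ran a job)."""
        return [n for n in self.nodes if self.busy_until.get(n, 0) <= now]


def job_size(dims):
    """Number of nodes implied by job dimensions such as (x, y, z)."""
    size = 1
    for d in dims:
        size *= d
    return size


def earliest_start(vc, needed, now):
    """Earliest time at which `needed` nodes of the virtual cluster are free."""
    release_times = sorted(max(vc.busy_until.get(n, 0), now) for n in vc.nodes)
    return release_times[needed - 1]


def schedule(vc, job_name, dims, runtime, now, queue):
    """Allocate nodes for the job now, or queue it with its earliest start time."""
    needed = job_size(dims)
    if needed > len(vc.nodes):
        raise ValueError("job is larger than the virtual cluster")
    free = vc.free_nodes(now)
    if len(free) < needed:
        # Not enough nodes: record when the earliest subset becomes available.
        queue.append((earliest_start(vc, needed, now), job_name))
        return None
    chosen = free[:needed]        # placeholder for the "optimum subset" step
    for n in chosen:
        vc.busy_until[n] = now + runtime
    return chosen


vc = VirtualCluster("vc0", nodes=list(range(8)))
queue = []
space1 = schedule(vc, "j1", (2, 2, 1), runtime=10, now=0, queue=queue)  # 4 nodes
space2 = schedule(vc, "j2", (2, 3, 1), runtime=5, now=0, queue=queue)   # 6 nodes: queued
space3 = schedule(vc, "j3", (2, 1, 1), runtime=5, now=0, queue=queue)   # 2 nodes: fits now
```

In this example the third job starts on idle nodes while the second waits in the queue, which is the backfilling behavior named in the title. Two job spaces (`space1` and `space3`) coexist in the same virtual cluster, matching the claim language that a virtual cluster accommodates multiple job spaces.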
Specification