System and method for multi-level preemption scheduling in high performance processing
First Claim
1. A method for managing preemption events in a backfill enabled computing system, the method comprising:
suspending a first low priority job running on one or more nodes of a node cluster upon receipt of a first high priority job until the nodes the first low priority job was running on become available;
running the first high priority job on the one or more nodes of the node cluster;
selecting a second low priority job from a job queue, the second low priority job having a position in the job queue;
running the second low priority job on available nodes of the node cluster while the first high priority job is running;
receiving a request for a second high priority job after the second low priority job has started running;
determining a processing status for the second low priority job;
determining that the processing status of the second low priority job exceeds a predetermined checkpoint threshold;
saving processing performed on the second low priority job in the event the processing status exceeds the predetermined checkpoint threshold;
returning, after receiving the request for the second high priority job, the second low priority job to a job queue in the position in the job queue; and
running the first low priority job and the second low priority job after the first high priority job and the second high priority job are complete.
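The checkpoint and re-queue steps of the claim can be illustrated with a minimal Python sketch. Everything named here (`Job`, `preempt_backfilled_job`, the `progress` fraction used as the processing status, and the choice to discard work at or below the threshold) is a hypothetical assumption for illustration, not the patent's implementation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Job:
    name: str
    priority: str                       # "high" or "low"
    queue_position: int = 0             # position held in the queue when selected
    progress: float = 0.0               # assumed status metric: fraction done, 0.0 to 1.0
    checkpoint: Optional[float] = None  # progress saved at preemption, if any


def preempt_backfilled_job(job: Job, queue: List[Job],
                           checkpoint_threshold: float) -> bool:
    """Return a running backfilled low priority job to the queue.

    Progress is saved only when it exceeds the threshold; the job is then
    reinserted at the position it held when it was selected.
    """
    saved = job.progress > checkpoint_threshold
    if saved:
        # Stand-in for "saving processing performed" on the job.
        job.checkpoint = job.progress
    else:
        # Assumption: work at or below the threshold is not worth saving.
        job.progress = 0.0
    # Clamp in case the queue shrank while the job was running.
    queue.insert(min(job.queue_position, len(queue)), job)
    return saved


queue = [Job("a", "low"), Job("b", "low")]
backfilled = Job("bf", "low", queue_position=1, progress=0.6)
preempt_backfilled_job(backfilled, queue, checkpoint_threshold=0.5)
print([j.name for j in queue])
```

Keeping the original queue position on the job itself means the scheduler does not have to reorder the queue when the second high priority job arrives.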
Abstract
A computing system configured to handle preemption events in an environment having jobs with high and low priorities. The system includes a job queue configured to receive job requests from users, the job queue storing the jobs in an order based on the priority of the jobs, and indicating whether a job is a high priority job or a low priority job. The system also includes a plurality of node clusters, each node cluster including a plurality of nodes and a scheduler coupled to the job queue and to the plurality of node clusters and configured to assign jobs from the job queue to the plurality of node clusters. The scheduler is configured to preempt a first low priority job running in a first node cluster with a high priority job that appears in the job queue after the low priority job has started and, in the event that a second low priority job from the job queue may run on a portion of the plurality of nodes in the first node cluster during a remaining processing time for the high priority job, backfill the second low priority job into the portion of the plurality of nodes and, in the event a second high priority job is received in the job queue and may run on the portion of the plurality of nodes, return the second low priority job to the job queue.
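The backfill condition in the abstract (a second low priority job may run on a portion of the nodes during the remaining processing time of the high priority job) can be sketched as a simple fit test. The function names, the tuple layout of queue entries, and the use of estimated runtimes are assumptions made for this illustration only.

```python
from typing import List, Optional, Tuple

# Hypothetical queue entry: (job name, nodes needed, estimated runtime)
QueueEntry = Tuple[str, int, float]


def can_backfill(free_nodes: int, high_priority_remaining: float,
                 job_nodes: int, job_runtime: float) -> bool:
    """A job fits if it needs no more than the free nodes and is expected
    to finish within the high priority job's remaining processing time."""
    return job_nodes <= free_nodes and job_runtime <= high_priority_remaining


def pick_backfill(queue: List[QueueEntry], free_nodes: int,
                  high_priority_remaining: float) -> Optional[Tuple[int, str]]:
    """Return (queue position, name) of the first queued job that fits,
    or None if no queued job can be backfilled."""
    for position, (name, nodes, runtime) in enumerate(queue):
        if can_backfill(free_nodes, high_priority_remaining, nodes, runtime):
            return position, name
    return None


queue = [("big", 8, 10.0), ("long", 2, 50.0), ("fit", 2, 5.0)]
print(pick_backfill(queue, free_nodes=4, high_priority_remaining=20.0))
```

Returning the queue position alongside the job lets the scheduler put the job back in the same place if it is later preempted.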
7 Claims
1. A method for managing preemption events in a backfill enabled computing system, the method comprising:
suspending a first low priority job running on one or more nodes of a node cluster upon receipt of a first high priority job until the nodes the first low priority job was running on become available;
running the first high priority job on the one or more nodes of the node cluster;
selecting a second low priority job from a job queue, the second low priority job having a position in the job queue;
running the second low priority job on available nodes of the node cluster while the first high priority job is running;
receiving a request for a second high priority job after the second low priority job has started running;
determining a processing status for the second low priority job;
determining that the processing status of the second low priority job exceeds a predetermined checkpoint threshold;
saving processing performed on the second low priority job in the event the processing status exceeds the predetermined checkpoint threshold;
returning, after receiving the request for the second high priority job, the second low priority job to a job queue in the position in the job queue; and
running the first low priority job and the second low priority job after the first high priority job and the second high priority job are complete.
Dependent claims: 2, 3
4. A method of managing the operation of a computing system including a plurality of node clusters, each node cluster including a plurality of nodes, the method comprising:
allocating a first low priority job to run on a first set of the nodes in a first node cluster;
running the first low priority job on the first set of nodes;
receiving, at a job queue, a first high priority job;
suspending the first low priority job until the first set of nodes becomes available;
running the first high priority job on a second set of nodes that includes at least one of the nodes in the first set of nodes in the first node cluster for a predetermined amount of time;
selecting a second low priority job from the job queue;
running the second low priority job on a third set of nodes in the first node cluster;
receiving a second high priority job on the job queue after the second low priority job has started running;
determining a processing status for the second low priority job;
determining that the processing status of the second low priority job exceeds a predetermined checkpoint threshold;
saving processing performed on the second low priority job in the event the processing status exceeds the predetermined checkpoint threshold;
returning the second low priority job to the job queue; and
running the first low priority job and the second low priority job after the first high priority job and the second high priority job are complete.
Dependent claims: 5, 6, 7
Specification