Methods and apparatus for resource management in cluster computing
First Claim
1. A method for tracking jobs performed by computing nodes of a cluster computing system, the method comprising:
monitoring, by a management computer, a plurality of computing nodes and an availability of resources provided by the plurality of computing nodes in the cluster computing system;
identifying, by the management computer, a first computing node of the plurality of computing nodes that is available for performing a first job submitted to a job queue;
identifying, by the management computer, a second computing node of the plurality of computing nodes that is available for performing a second job submitted to the job queue;
generating a first job state object specific to the first job for tracking a first job status of the first job and a second job state object specific to the second job for tracking a second job status of the second job;
providing the first job state object to the first computing node and the second job state object to the second computing node;
updating, after completion of a task of the first job, the first job state object independently of any updates to the second job state object after completion of a task of the second job; and
updating, after completion of the task of the second job, the second job state object independently of any updates to the first job state object after completion of the task of the first job.
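The independent per-job tracking recited above can be illustrated with a minimal sketch. It is not the patented implementation: the `JobState` class, its fields, and the `assign` helper are hypothetical names introduced only for illustration, and the pairing of jobs to nodes is simplified to a one-to-one zip.

```python
from dataclasses import dataclass, field


@dataclass
class JobState:
    """Hypothetical per-job state object; field names are illustrative only."""
    job_id: str
    node: str
    completed_tasks: list = field(default_factory=list)
    status: str = "running"

    def task_done(self, task: str, total_tasks: int) -> None:
        # Only this job's object is touched; no shared status table is updated.
        self.completed_tasks.append(task)
        if len(self.completed_tasks) == total_tasks:
            self.status = "done"


def assign(jobs, available_nodes):
    """Pair each queued job with an available node and create its state object."""
    return {job: JobState(job_id=job, node=node)
            for job, node in zip(jobs, available_nodes)}


# Two jobs, two available nodes, two separate state objects.
states = assign(["job-1", "job-2"], ["node-a", "node-b"])

# Completing a task of job-1 updates only job-1's state object.
states["job-1"].task_done("t1", total_tasks=1)
```

Because each job owns its state object, completing a task of the first job leaves the second job's object unchanged, which is the property the last two claim elements describe.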
Abstract
Embodiments of an event-driven resource management technique may enable the management of cluster resources at a sub-computer level (e.g., at the thread level) and the decomposition of jobs at an atomic (task) level. A job queue may request a resource for a job from a resource manager, which may locate a resource in a resource list and grant the resource to the job queue. After the resource is granted, the job queue sends the job to the resource, on which the job may be partitioned into tasks and from which additional resources may be requested from the resource manager. The resource manager may locate additional resources in the list and grant the resources to the resource. The resource sends the tasks to the granted resources for execution. As resources complete their tasks, the resource manager is informed so that the status of the resources in the list can be updated.
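The request, grant, partition, and release flow described in the abstract can be sketched as follows. This is a simplified, single-process illustration under stated assumptions: `ResourceManager`, `run_job`, and the resource names such as `n1.t0` are hypothetical, the resource list is a plain dictionary, and task execution is replaced by appending to a log.

```python
class ResourceManager:
    """Hypothetical resource list keyed by resource name (e.g. a thread slot)."""

    def __init__(self, resources):
        self.free = {name: True for name in resources}

    def request(self, count=1):
        # Locate free resources in the list and grant up to `count` of them.
        granted = [r for r, is_free in self.free.items() if is_free][:count]
        for r in granted:
            self.free[r] = False
        return granted

    def release(self, resource):
        # Called when a resource reports that its task has completed.
        self.free[resource] = True


def run_job(manager, job_items, log):
    # The job queue asks the resource manager for one resource to host the job.
    host = manager.request(1)
    if not host:
        return False
    # On the host, the job is partitioned into tasks (one task per item here).
    tasks = list(job_items)
    # The host requests additional resources for those tasks.
    workers = manager.request(len(tasks))
    # Tasks go to the granted resources; the host runs them if none were granted.
    pool = workers or host
    for i, task in enumerate(tasks):
        log.append((pool[i % len(pool)], task))  # stand-in for task execution
    # As resources finish, the manager is informed so the list can be updated.
    for resource in workers + host:
        manager.release(resource)
    return True


manager = ResourceManager(["n1.t0", "n1.t1", "n2.t0"])
log = []
ok = run_job(manager, ["a", "b", "c"], log)
```

In this sketch the manager grants fewer resources than requested when the list runs short, so tasks are distributed round-robin over whatever was granted. That is a design choice made for the example, not something the abstract specifies.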
20 Claims
1. A method for tracking jobs performed by computing nodes of a cluster computing system, the method comprising:
monitoring, by a management computer, a plurality of computing nodes and an availability of resources provided by the plurality of computing nodes in the cluster computing system;
identifying, by the management computer, a first computing node of the plurality of computing nodes that is available for performing a first job submitted to a job queue;
identifying, by the management computer, a second computing node of the plurality of computing nodes that is available for performing a second job submitted to the job queue;
generating a first job state object specific to the first job for tracking a first job status of the first job and a second job state object specific to the second job for tracking a second job status of the second job;
providing the first job state object to the first computing node and the second job state object to the second computing node;
updating, after completion of a task of the first job, the first job state object independently of any updates to the second job state object after completion of a task of the second job; and
updating, after completion of the task of the second job, the second job state object independently of any updates to the first job state object after completion of the task of the first job.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
10. A system for tracking jobs performed by computing nodes of a cluster computing system, the system comprising:
a management computer configured for:
monitoring a plurality of computing nodes and an availability of resources provided by the plurality of computing nodes in the cluster computing system,
identifying a first computing node of the plurality of computing nodes that is available for performing a first job submitted to a job queue, and
identifying a second computing node of the plurality of computing nodes that is available for performing a second job submitted to the job queue; and
at least one computer in communication with the management computer and independent of the management computer, the at least one computer configured for:
generating a first job state object specific to the first job for tracking a first job status of the first job and a second job state object specific to the second job for tracking a second job status of the second job,
providing the first job state object to the first computing node and the second job state object to the second computing node,
updating, after completion of a task of the first job, the first job state object independently of any updates to the second job state object after completion of a task of the second job, and
updating, after completion of the task of the second job, the second job state object independently of any updates to the first job state object after completion of the task of the first job.
Dependent claims: 11, 12, 13, 14, 15, 16
17. A non-transitory computer-readable medium having program code stored thereon that is executable by a processor for tracking jobs performed by computing nodes of a cluster computing system, the program code comprising:
program code for monitoring, by a management computer, a plurality of computing nodes and an availability of resources provided by the plurality of computing nodes in the cluster computing system;
program code for identifying, by the management computer, a first computing node of the plurality of computing nodes that is available for performing a first job submitted to a job queue;
program code for identifying, by the management computer, a second computing node of the plurality of computing nodes that is available for performing a second job submitted to the job queue;
program code for generating a first job state object specific to the first job for tracking a first job status of the first job and a second job state object specific to the second job for tracking a second job status of the second job;
program code for providing the first job state object to the first computing node and the second job state object to the second computing node;
program code for updating, after completion of a task of the first job, the first job state object independently of any updates to the second job state object after completion of a task of the second job;
program code for updating, after completion of the task of the second job, the second job state object independently of any updates to the first job state object after completion of the task of the first job.
Dependent claims: 18, 19, 20
Specification