Methods and apparatus for resource management in cluster computing
First Claim
1. A method for tracking jobs performed by computing nodes of a cluster computing system, the method comprising:
transmitting, by a job scheduling system, a job description via a data network to a resource tracking system for the cluster computing system, wherein the job description specifies a number of resources for performing a job comprising a plurality of tasks, wherein the resource tracking system maintains a resource list comprising, for each resource of a plurality of resources, a respective availability of the resource and a respective network location identifying a respective computing node in the cluster computing system at which the resource is located;
determining, by the resource tracking system, that a subset of resources from the plurality of resources is available in response to receiving the job description, wherein an availability of the subset of resources is determined by reference to the resource list;
transmitting, by the resource tracking system, at least one network identifier via the data network to the job scheduling system, wherein the at least one network identifier identifies at least one computing node at which the subset of resources are located;
generating, by the job scheduling system, a job state object for tracking a job status for the job;
transmitting, by the job scheduling system, the job state object and the job to the at least one computing node;
updating, by the at least one computing node, the job state object to describe an updated job status subsequent to performing at least one task of the plurality of tasks;
transmitting, by the at least one computing node, job metadata extracted from the job state object in response to a job status query from the job scheduling system, wherein the job metadata indicates the updated job status; and
transmitting, by the at least one computing node, the updated job state object to at least one additional computing node for performing at least one additional task associated with the job;
wherein the at least one additional computing node updates the job state object subsequent to performing the at least one additional task and notifies the job scheduling system of the update to the job state object by the at least one additional computing node.
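The job state flow recited in this claim can be illustrated with a minimal Python sketch. It is not the patented implementation: the names `JobState`, `Scheduler`, `ComputeNode`, `metadata`, `run_task`, and `hand_off` are hypothetical, and transmission over the data network is replaced here by in-process method calls.

```python
from dataclasses import dataclass, field


@dataclass
class JobState:
    """Hypothetical job state object that travels with the job."""
    job_id: str
    pending: list                       # tasks not yet performed
    completed: list = field(default_factory=list)

    def metadata(self):
        # Job metadata extracted from the state object for a status query.
        return {"job_id": self.job_id,
                "done": len(self.completed),
                "remaining": len(self.pending)}


class Scheduler:
    """Stands in for the job scheduling system (assumed interface)."""
    def __init__(self):
        self.notifications = []

    def notify(self, node_name, state):
        # Record a notification that a node updated the job state object.
        self.notifications.append((node_name, state.metadata()))


class ComputeNode:
    """Stands in for a computing node; 'transmission' is a method call."""
    def __init__(self, name):
        self.name = name
        self.state = None

    def receive(self, state):
        self.state = state

    def run_task(self):
        # Perform one task, then update the job state object.
        task = self.state.pending.pop(0)
        self.state.completed.append((task, self.name))

    def status(self):
        # Answer a job status query from the scheduler.
        return self.state.metadata()

    def hand_off(self, other):
        # Forward the updated job state object to an additional node.
        other.receive(self.state)


scheduler = Scheduler()
state = JobState("job-1", pending=["task-a", "task-b"])
node1, node2 = ComputeNode("node-1"), ComputeNode("node-2")

node1.receive(state)                       # scheduler sends job state object and job
node1.run_task()                           # node-1 performs a task and updates the state
first_status = node1.status()              # scheduler queries node-1
node1.hand_off(node2)                      # node-1 forwards the updated object
node2.run_task()                           # node-2 performs an additional task
scheduler.notify(node2.name, node2.state)  # node-2 notifies the scheduler
```

In this sketch the scheduler never owns the authoritative status: the state object moves with the work, and the scheduler sees only metadata, either by querying a node or by being notified.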
Abstract
Embodiments of an event-driven resource management technique may enable the management of cluster resources at a sub-computer level (e.g., at the thread level) and the decomposition of jobs at an atomic (task) level. A job queue may request a resource for a job from a resource manager, which may locate a resource in a resource list and grant the resource to the job queue. After the resource is granted, the job queue sends the job to the resource, on which the job may be partitioned into tasks and from which additional resources may be requested from the resource manager. The resource manager may locate additional resources in the list and grant the resources to the resource. The resource sends the tasks to the granted resources for execution. As resources complete their tasks, the resource manager is informed so that the status of the resources in the list can be updated.
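The grant-and-release cycle described in the abstract can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the patent's implementation: `ResourceManager`, `request`, `release`, the thread-style resource identifiers, and the example addresses are all hypothetical. They are chosen only to show sub-computer resources tracked in a list with an availability flag and a network location.

```python
class ResourceManager:
    """Hypothetical resource manager holding a resource list."""

    def __init__(self, resources):
        # Resource list: resource id -> {"node": network location, "free": bool}
        self.resources = {rid: {"node": node, "free": True}
                          for rid, node in resources}

    def request(self, count):
        """Grant exactly `count` free resources, or none if too few are free.

        Returns a list of (resource id, network location) pairs.
        """
        free = [rid for rid, r in self.resources.items() if r["free"]]
        if len(free) < count:
            return []
        granted = free[:count]
        for rid in granted:
            self.resources[rid]["free"] = False
        return [(rid, self.resources[rid]["node"]) for rid in granted]

    def release(self, rid):
        # Called when a resource reports that its task is complete.
        self.resources[rid]["free"] = True


# Three thread-level resources on two nodes (made-up identifiers).
manager = ResourceManager([
    ("n1-t0", "10.0.0.1"),
    ("n1-t1", "10.0.0.1"),
    ("n2-t0", "10.0.0.2"),
])

job_grant = manager.request(1)     # the job queue asks for one resource
task_grants = manager.request(2)   # the granted resource asks for two more
denied = manager.request(1)        # nothing is free, so nothing is granted
for rid, _ in task_grants:
    manager.release(rid)           # tasks finish; the list is updated
regrant = manager.request(1)       # a freed resource can be granted again
```

The all-or-nothing grant is a simplification made for this sketch; the abstract does not say how a partially satisfiable request is handled.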
15 Claims
1. A method for tracking jobs performed by computing nodes of a cluster computing system (recited in full under First Claim above).
Dependent claims: 2, 3, 4, 5, 6
7. A non-transitory computer-readable medium having program code stored thereon that is executable by a processor to track jobs performed by computing nodes of a cluster computing system, the non-transitory computer-readable medium comprising:
program code for transmitting, by a job scheduling system, a job description via a data network to a resource tracking system for the cluster computing system, wherein the job description specifies a number of resources for performing a job comprising a plurality of tasks, wherein the resource tracking system maintains a resource list comprising, for each resource of a plurality of resources, a respective availability of the resource and a respective network location identifying a respective computing node in the cluster computing system at which the resource is located;
program code for determining, by the resource tracking system, that a subset of resources from the plurality of resources is available in response to receiving the job description, wherein an availability of the subset of resources is determined by reference to the resource list;
program code for transmitting, by the resource tracking system, at least one network identifier via the data network to the job scheduling system, wherein the at least one network identifier identifies at least one computing node at which the subset of resources are located;
program code for generating, by the job scheduling system, a job state object for tracking a job status for the job;
program code for transmitting, by the job scheduling system, the job state object and the job to the at least one computing node;
program code for updating, by the at least one computing node, the job state object to describe an updated job status subsequent to performing at least one task of the plurality of tasks;
program code for transmitting, by the at least one computing node, job metadata extracted from the job state object in response to a job status query from the job scheduling system, wherein the job metadata indicates the updated job status;
program code for transmitting, by the at least one computing node, the updated job state object to at least one additional computing node for performing at least one additional task associated with the job;
program code for updating the job state object by the at least one additional node computing node subsequent to performing the at least one additional task; and
program code for notifying, by the at least one additional computing node, the job scheduling system of the update to the job state object by the at least one additional computing node.
Dependent claims: 8, 9, 10, 11, 12
13. A cluster computing system comprising:
a job scheduling system comprising a first processor, the first processor configured for;
transmitting a job description via a data network to a resource tracking system for the cluster computing system, wherein the job description specifies a number of resources for performing a job comprising a plurality of tasks, wherein the resource tracking system maintains a resource list comprising, for each resource of a plurality of resources, a respective availability of the resource and a respective network location identifying a respective computing node in the cluster computing system at which the resource is located;
generating a job state object for tracking a job status for the job, and
transmitting the job state object and the job to at least one computing node;
a resource tracking system in communication with the job scheduling system via the data network, wherein the resource tracking system comprises a second processor configured for;
determining that a subset of resources from the plurality of resources is available in response to receiving the job description, wherein an availability of the subset of resources is determined by reference to the resource list;
transmitting at least one network identifier via the data network to the job scheduling system, wherein the at least one network identifier identifies the at least one computing node at which the subset of resources are located;
the at least one computing node in communication with the job scheduling system via the data network, wherein the at least one computing node comprises a third processor configured for;
updating the job state object to describe an updated job status subsequent to performing at least one task of the plurality of tasks,
transmitting job metadata extracted from the job state object in response to a job status query from the job scheduling system, wherein the job metadata indicates the updated job status, and
transmitting, by the at least one computing node, the updated job state object to at least one additional computing node for performing at least one additional task associated with the job; and
the at least one additional computing node in communication with the at least one computing node via the data network, wherein the at least one additional computing node comprises a fourth processor configured for;
updating the job state object subsequent to performing the at least one additional task, and
notifying the job scheduling system of the update to the job state object by the at least one additional computing node.
Dependent claims: 14, 15
Specification