Concurrent data processing in a distributed system
First Claim
1. One or more computer storage device having computer-useable instructions embodied thereon for performing a method for scheduling vertices in a cluster, the method comprising:
- receiving a data job;
dividing the data job into a plurality of vertices;
assigning the plurality of vertices to one or more process nodes that comprise the cluster;
receiving resource usage information for one or more vertices, wherein the vertices have run to completion, and wherein resource usage information for the vertices has been determined; and
for each of the plurality of vertices for which resource usage information has not been received;
estimating resource usage of the vertex from the received resource usage information for the completed vertices, wherein estimating resource usage comprises;
(A) estimating an input data size range;
(B) dividing the input data size range into data size buckets,wherein the data size buckets are subsets of the data size range;
(C) storing resource usage information for each completed vertex in the corresponding data size bucket; and
(D) for each data size bucket, calculating estimated resource usage information for uncompleted vertices with an input data size within the data size bucket'"'"'s range; and
transmitting the estimated resource usage of the vertex to the process node in the cluster to which the vertex is assigned, wherein the process node allocates computing resources to the vertex based on the estimated resource usage.
2 Assignments
0 Petitions
Accused Products
Abstract
Systems, methods, and computer media for scheduling vertices in a distributed data processing network and allocating computing resources on a processing node in a distributed data processing network are provided. Vertices, subparts of a data job including both data and computer code that runs on the data, are assigned by a job manager to a distributed cluster of process nodes for processing. The process nodes run the vertices and transmit computing resource usage information, including memory and processing core usage, back to the job manager. The job manager uses this information to estimate computing resource usage information for other vertices in the data job that are either still running or waiting to be run. Using the estimated computing resource usage information, each process node can run multiple vertices concurrently.
59 Citations
18 Claims
-
1. One or more computer storage device having computer-useable instructions embodied thereon for performing a method for scheduling vertices in a cluster, the method comprising:
-
receiving a data job; dividing the data job into a plurality of vertices; assigning the plurality of vertices to one or more process nodes that comprise the cluster; receiving resource usage information for one or more vertices, wherein the vertices have run to completion, and wherein resource usage information for the vertices has been determined; and for each of the plurality of vertices for which resource usage information has not been received; estimating resource usage of the vertex from the received resource usage information for the completed vertices, wherein estimating resource usage comprises; (A) estimating an input data size range; (B) dividing the input data size range into data size buckets, wherein the data size buckets are subsets of the data size range; (C) storing resource usage information for each completed vertex in the corresponding data size bucket; and (D) for each data size bucket, calculating estimated resource usage information for uncompleted vertices with an input data size within the data size bucket'"'"'s range; and transmitting the estimated resource usage of the vertex to the process node in the cluster to which the vertex is assigned, wherein the process node allocates computing resources to the vertex based on the estimated resource usage. - View Dependent Claims (2, 3, 4, 5, 6)
-
-
7. A method for allocating computing resources to vertices on a process node, the method comprising:
-
on a process node, receiving information about a plurality of vertices assigned to the process node; receiving estimated resource usage information for at least one of the plurality of vertices assigned to the process node; allocating computing resources to a first assigned vertex; running the first assigned vertex; allocating computing resources to a second assigned vertex based on the received estimated resource usage information; running the second assigned vertex concurrently with the first assigned vertex; allocating computing resources to additional assigned vertices until either all computing resources have been allocated or computing resources are allocated to all additional assigned vertices; reserving an amount of memory for running vertices; reserving an amount of memory for running processes that are not vertices; and reserving an amount of free memory, wherein free memory is not available to be allocated to assigned vertices or available for running processes that are not vertices. - View Dependent Claims (8, 9, 10, 11, 12)
-
-
13. A vertex assignment scheduling system for scheduling vertices in a cluster, the system comprising:
-
a computing device associated with a job manager having one or more process nodes and one or more computer-readable storage media; and a data store coupled with the job manager, wherein the job manager; assigns a plurality of vertices to the one or more process nodes that comprise a cluster, receives resource usage information for assigned vertices that have completed, and calculates, based on the received resource usage information for the completed vertices, estimated resource usage for the assigned vertices for which resource usage information has not been received; and wherein at least one of the one or more process nodes; transmits resource usage information to the job manager for each vertex assigned to the process node that completes, allocates computing resources based upon estimated resource usage transmitted to the process node by the job manager such that multiple vertices run concurrently on the process node, and allocates computing resources according to the position of assigned vertices in a queue on the process node, and wherein if the next vertex awaiting allocation of computing resources in the queue requires computing resources in excess of the available computing resources of the process node, the next vertex is flagged and is not allocated computing resources by the process node. - View Dependent Claims (14, 15, 16, 17, 18)
-
Specification