Automated load-balancing of partitions in arbitrarily imbalanced distributed mapreduce computations
First Claim
1. A computer-implemented method of load-balancing in an arbitrarily imbalanced MapReduce job in a distributed computing system, the method comprising:
- identifying K data keys with the highest frequency among received data, the received data comprising pairings of data keys and data values to be processed in the MapReduce job, wherein K is determined according to a data key frequency distribution of the received data and a threshold level of acceptable imbalance in reduce phase worker loads across reduce phase workers, and wherein K is a number selected to keep a maximum imbalance ratio under a threshold;
assigning data for each of the K data keys to a single-key bucket and other data keys to multiple-key buckets;
assigning one respective reduce phase worker to process data values corresponding to data keys of each multiple-key bucket, each multiple-key bucket comprising queued data items having several different keys;
assigning multiple reduce phase workers to process data values corresponding to the data key of each single-key bucket; and
stitching together output of the assigned multiple reduce phase workers on each respective single-key bucket.
9 Assignments
0 Petitions
Accused Products
Abstract
A distributed computing system executes a MapReduce job on streamed data that includes an arbitrary amount of imbalance with respect to the frequency distribution of the data keys in the dataset. A map task module maps the dataset to a coarse partitioning, and generates a list of the top K keys with the highest frequency among the dataset. A sort task module employs a plurality of sorters to read the coarse partitioning and sort the data into buckets by data key. The values for the top K most frequent keys are separated into single-key buckets. The other less frequently occurring keys are assigned to buckets that each have multiple keys assigned to it. Then, more than one worker is assigned to each single-key bucket. The output of the multiple workers assigned to each respective single-key bucket is stitched together.
-
Citations
20 Claims
-
1. A computer-implemented method of load-balancing in an arbitrarily imbalanced MapReduce job in a distributed computing system, the method comprising:
-
identifying K data keys with the highest frequency among received data, the received data comprising pairings of data keys and data values to be processed in the MapReduce job, wherein K is determined according to a data key frequency distribution of the received data and a threshold level of acceptable imbalance in reduce phase worker loads across reduce phase workers, and wherein K is a number selected to keep a maximum imbalance ratio under a threshold; assigning data for each of the K data keys to a single-key bucket and other data keys to multiple-key buckets; assigning one respective reduce phase worker to process data values corresponding to data keys of each multiple-key bucket, each multiple-key bucket comprising queued data items having several different keys; assigning multiple reduce phase workers to process data values corresponding to the data key of each single-key bucket; and stitching together output of the assigned multiple reduce phase workers on each respective single-key bucket. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9)
-
-
10. A nontransitory computer readable storage medium including computer program instructions that, when executed, cause a computer processor to perform operations comprising:
-
identifying K data keys with the highest frequency among received data, the received data comprising pairings of data keys and data values to be processed in the MapReduce job, wherein K is determined according to a data key frequency distribution of the received data and a threshold level of acceptable imbalance in reduce phase worker loads across reduce phase workers, and wherein K is a number selected to keep a maximum imbalance ratio under a threshold; assigning data for each of the K data keys to a single-key bucket and other data keys to multiple-key buckets; assigning one respective reduce phase worker to process data values corresponding to data keys of each multiple-key bucket, each multiple-key bucket comprising queued data items having several different keys; assigning multiple reduce phase workers to process data values corresponding to the data key of each single-key bucket; and stitching together output of the assigned multiple reduce phase workers on each respective single-key bucket. - View Dependent Claims (11, 12, 13, 14, 15, 16, 17, 18)
-
-
19. A system comprising:
-
a computer processor; and a computer readable storage medium storing processor-executable computer program instructions, the computer program instructions comprising instructions for; identifying K data keys with the highest frequency among received data, the received data comprising pairings of data keys and data values to be processed in the MapReduce job, wherein K is determined according to a data key frequency distribution of the received data and a threshold level of acceptable imbalance in reduce phase worker loads across reduce phase workers, and wherein K is a number selected to keep a maximum imbalance ratio under a threshold; assigning data for each of the K data keys to a single-key bucket and other data keys to multiple-key buckets; assigning one respective reduce phase worker to process data values corresponding to data keys of each multiple-key bucket, each multiple-key bucket comprising queued data items having several different keys; assigning multiple reduce phase workers to process data values corresponding to the data key of each single-key bucket; and stitching together output of the assigned multiple reduce phase workers on each respective single-key bucket. - View Dependent Claims (20)
-
Specification