Automated load-balancing of partitions in arbitrarily imbalanced distributed mapreduce computations

US 10,642,866 B1
Filed: 02/02/2017
Issued: 05/05/2020
Est. Priority Date: 06/30/2014
Status: Active Grant

Abstract

A distributed computing system executes a MapReduce job on streamed data that includes an arbitrary amount of imbalance with respect to the frequency distribution of the data keys in the dataset. A map task module maps the dataset to a coarse partitioning, and generates a list of the top K keys with the highest frequency among the dataset. A sort task module employs a plurality of sorters to read the coarse partitioning and sort the data into buckets by data key. The values for the top K most frequent keys are separated into single-key buckets. The other, less frequently occurring keys are assigned to buckets that each have multiple keys assigned to them. Then, more than one worker is assigned to each single-key bucket. The output of the multiple workers assigned to each respective single-key bucket is stitched together.
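The bucket assignment described in the abstract can be sketched roughly as follows. This is an illustrative sketch only, not the patented implementation: the function name `assign_buckets`, the use of a CRC32 hash to spread the remaining keys, and the in-memory lists are all assumptions made for the example.

```python
from collections import Counter, defaultdict
import zlib

def assign_buckets(pairs, k, num_multi_buckets):
    """Route the k most frequent keys to single-key buckets; hash the rest.

    Hypothetical helper: `pairs` is a list of (key, value) tuples.
    """
    freq = Counter(key for key, _ in pairs)
    top_keys = {key for key, _ in freq.most_common(k)}
    single = defaultdict(list)  # key -> values for that key only
    multi = defaultdict(list)   # bucket index -> (key, value) items, several keys
    for key, value in pairs:
        if key in top_keys:
            single[key].append(value)
        else:
            # Assumption: a stable hash decides which multiple-key bucket is used.
            idx = zlib.crc32(str(key).encode("utf-8")) % num_multi_buckets
            multi[idx].append((key, value))
    return dict(single), dict(multi)

pairs = [("a", 1)] * 6 + [("b", 2)] * 3 + [("c", 3), ("d", 4)]
single, multi = assign_buckets(pairs, k=1, num_multi_buckets=2)
```

Here the heavily repeated key `"a"` gets a bucket of its own, while `"b"`, `"c"`, and `"d"` share the multiple-key buckets.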


20 Claims

1. A computer-implemented method of load-balancing in an arbitrarily imbalanced MapReduce job in a distributed computing system, the method comprising:
- identifying K data keys with the highest frequency among received data, the received data comprising pairings of data keys and data values to be processed in the MapReduce job, wherein K is determined according to a data key frequency distribution of the received data and a threshold level of acceptable imbalance in reduce phase worker loads across reduce phase workers, and wherein K is a number selected to keep a maximum imbalance ratio under a threshold;
- assigning data for each of the K data keys to a single-key bucket and other data keys to multiple-key buckets;
- assigning one respective reduce phase worker to process data values corresponding to data keys of each multiple-key bucket, each multiple-key bucket comprising queued data items having several different keys;
- assigning multiple reduce phase workers to process data values corresponding to the data key of each single-key bucket; and
- stitching together output of the assigned multiple reduce phase workers on each respective single-key bucket.
- Dependent claims (2-9):
  - 2. The method of claim 1, wherein the received data is a data stream.
  - 3. The method of claim 1, wherein the identified K keys with the highest frequency among the received data fluctuates as additional data are received.
  - 4. The method of claim 1, wherein K is a number selected so a frequency of a data key assigned to a single-key bucket with a lowest frequency among data keys assigned to single-key buckets exceeds a threshold.
  - 5. The method of claim 1, wherein a number of multiple reduce phase workers to assign to a single-key bucket is determined based on a respective frequency of the data key assigned to the single-key bucket and a threshold level of acceptable imbalance in reduce phase worker loads across reduce phase workers.
  - 6. The method of claim 1, further comprising reporting a frequency distribution of the K highest frequency data keys.
  - 7. The method of claim 1, further comprising combining output across all reduce phase workers to obtain a result of a MapReduce computation.
  - 8. The method of claim 1, wherein assigning each of the K data keys to a single-key bucket and other data keys to multiple-key buckets comprises a single sort step.
  - 9. The method of claim 1, wherein the single-key bucket for a first data key of the identified K data keys is arbitrarily large compared to other buckets.
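One possible reading of how K could follow from the key frequency distribution and an imbalance threshold (claims 1 and 4) is sketched below. The claims do not define the imbalance ratio, so the rule used here is an assumption: a key is treated as "hot" when its record count alone exceeds `max_imbalance` times the ideal per-worker load. The names `choose_hot_keys`, `num_workers`, and `max_imbalance` are hypothetical.

```python
from collections import Counter

def choose_hot_keys(freq, num_workers, max_imbalance):
    """Return keys too frequent to share a multiple-key bucket; K is the list length.

    Assumed rule (not taken from the claims): a key is hot when its count
    exceeds max_imbalance * (total records / num_workers).
    """
    total = sum(freq.values())
    ideal_load = total / num_workers
    limit = max_imbalance * ideal_load
    # most_common() with no argument lists every key, most frequent first.
    return [key for key, count in freq.most_common() if count > limit]

freq = Counter({"a": 700, "b": 200, "c": 60, "d": 40})
hot_keys = choose_hot_keys(freq, num_workers=4, max_imbalance=1.5)
k = len(hot_keys)
```

With 1,000 records and 4 workers the ideal load is 250 records, so a limit of 1.5 times that (375) singles out only key `"a"`. Tightening `max_imbalance` raises K, and because the frequencies are recomputed from whatever data has arrived, the set can change as a stream progresses.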

10. A nontransitory computer readable storage medium including computer program instructions that, when executed, cause a computer processor to perform operations comprising:
- identifying K data keys with the highest frequency among received data, the received data comprising pairings of data keys and data values to be processed in the MapReduce job, wherein K is determined according to a data key frequency distribution of the received data and a threshold level of acceptable imbalance in reduce phase worker loads across reduce phase workers, and wherein K is a number selected to keep a maximum imbalance ratio under a threshold;
- assigning data for each of the K data keys to a single-key bucket and other data keys to multiple-key buckets;
- assigning one respective reduce phase worker to process data values corresponding to data keys of each multiple-key bucket, each multiple-key bucket comprising queued data items having several different keys;
- assigning multiple reduce phase workers to process data values corresponding to the data key of each single-key bucket; and
- stitching together output of the assigned multiple reduce phase workers on each respective single-key bucket.
- Dependent claims (11-18):
  - 11. The medium of claim 10, wherein the received data is a data stream.
  - 12. The medium of claim 10, wherein the identified K keys with the highest frequency among the received data fluctuates as additional data are received.
  - 13. The medium of claim 10, wherein a number of multiple reduce phase workers to assign to a single-key bucket is determined based on a respective frequency of the data key assigned to the single-key bucket and a threshold level of acceptable imbalance in reduce phase worker loads across reduce phase workers.
  - 14. The medium of claim 10, wherein the operations further comprise reporting a frequency distribution of the K highest frequency data keys.
  - 15. The medium of claim 10, wherein the operations further comprise combining output across all reduce phase workers to obtain a result of a MapReduce computation.
  - 16. The medium of claim 10, wherein assigning each of the K data keys to a single-key bucket and other data keys to multiple-key buckets comprises a single sort step.
  - 17. The medium of claim 10, wherein the single-key bucket for a first data key of the identified K data keys is arbitrarily large compared to other buckets.
  - 18. The medium of claim 10, wherein K is a number selected so a frequency of a data key assigned to a single-key bucket with a lowest frequency among data keys assigned to single-key buckets exceeds a threshold.

19. A system comprising:
- a computer processor; and
- a computer readable storage medium storing processor-executable computer program instructions, the computer program instructions comprising instructions for:
  - identifying K data keys with the highest frequency among received data, the received data comprising pairings of data keys and data values to be processed in the MapReduce job, wherein K is determined according to a data key frequency distribution of the received data and a threshold level of acceptable imbalance in reduce phase worker loads across reduce phase workers, and wherein K is a number selected to keep a maximum imbalance ratio under a threshold;
  - assigning data for each of the K data keys to a single-key bucket and other data keys to multiple-key buckets;
  - assigning one respective reduce phase worker to process data values corresponding to data keys of each multiple-key bucket, each multiple-key bucket comprising queued data items having several different keys;
  - assigning multiple reduce phase workers to process data values corresponding to the data key of each single-key bucket; and
  - stitching together output of the assigned multiple reduce phase workers on each respective single-key bucket.
- Dependent claim (20):
  - 20. The system of claim 19, wherein a number of multiple reduce phase workers to assign to a single-key bucket is determined based on a respective frequency of the data key assigned to the single-key bucket and a threshold level of acceptable imbalance in reduce phase worker loads across reduce phase workers.
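The worker-count and stitching steps might be sketched as below. This is a sequential illustration under stated assumptions, not the claimed system: the sizing formula, the helper names `workers_for_key` and `reduce_single_key_bucket`, and the requirement that the reduce function be decomposable (so partial results can be combined) are all assumptions for the example.

```python
import math

def workers_for_key(count, ideal_load, max_imbalance):
    """Assumed sizing rule: each worker takes at most max_imbalance * ideal_load records."""
    return max(1, math.ceil(count / (max_imbalance * ideal_load)))

def reduce_single_key_bucket(values, num_workers, reduce_fn, stitch_fn):
    """Split one key's values into slices, reduce each slice, then stitch the partials.

    Each reduce_fn call stands in for one reduce phase worker; a real system
    would run them in parallel on separate machines.
    """
    chunk = max(1, math.ceil(len(values) / num_workers))
    slices = [values[i:i + chunk] for i in range(0, len(values), chunk)]
    partials = [reduce_fn(part) for part in slices]
    return stitch_fn(partials)

# 700 records for one key, ideal load of 250 per worker, 1.5x imbalance allowed.
n = workers_for_key(count=700, ideal_load=250, max_imbalance=1.5)
total = reduce_single_key_bucket(list(range(700)), n, reduce_fn=sum, stitch_fn=sum)
```

For a summing reduction, stitching is just a sum of the partial sums. For a reduction that emits sorted or ordered output, `stitch_fn` would instead concatenate or merge the per-worker outputs.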


Current Assignee: Quantcast
Original Assignee: Quantcast
Inventors: Jiang, Wei; Rus, Silvius V.
Primary Examiner(s): Ly, Anh
Application Number: US15/422,756
Time in Patent Office: 1,188 Days

CPC Class Codes:
G06F 16/215   Improving data quality; Dat...
G06F 16/278   Data partitioning, e.g. hor...
G06F 16/285   Clustering or classification
