Method and system for optimizing reduce-side join operation in a map-reduce framework
First Claim
1. A computer system for optimizing reduce-side join operation in a Map-reduce framework between a first data structure and a second data structure, the first data structure being sorted and divided into one or more regions, the system comprising:
one or more processors; and
a non-transitory memory that includes modules that are executable by said one or more processors, wherein the modules include;
an executing module to execute one or more map operations by one or more processors, wherein to execute one or more map operations by one or more processors comprises to;
fetch input data of the second data structure;
partition the data of the second data structure according to key-value pair;
project the key-value pairs of the second data structure to a partitioner;
maintain one or more region key counters;
wherein the region key counter being used for registering key count value of one or more regions of the second data structure; and
emit the key count value of one or more regions and corresponding data, wherein the key count values are emitted prior to the corresponding data;
a grouping module to group mapped data corresponding to a single region of the second data structure;
an accumulating module to provide the grouped data to a reducer;
a fetching module to retrieve descriptive metadata of one or more regions of the first data structure; and
a selecting module to select one of a look-up approach and a scan approach to perform the join operation by one or more reducers based on associated key count value and predefined criteria by the reducer, to perform the join operation.
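The map-side steps of claim 1 can be illustrated with a minimal sketch. It is not the patented implementation. Everything named here is an assumption made for illustration: regions of the first data structure are identified by a sorted list of region start keys, a "key count" is taken to mean the number of distinct join keys a map task saw for a region, and the `COUNT_TAG`/`DATA_TAG` composite key is one hypothetical way to make counts sort ahead of the data they describe.

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical tags: 0 sorts before 1, so counts arrive before data.
COUNT_TAG, DATA_TAG = 0, 1


def region_of(key, region_starts):
    """Return the index of the region whose key range contains `key`.

    `region_starts` is the sorted list of region start keys of the first
    data structure (an assumption about how regions are described).
    """
    return max(bisect_right(region_starts, key) - 1, 0)


def map_split(records, region_starts):
    """Run one map task over (key, value) records of the second data structure."""
    keys_per_region = defaultdict(set)  # the per-region key counters
    buffered = []
    for key, value in records:
        region = region_of(key, region_starts)
        keys_per_region[region].add(key)
        buffered.append(((region, DATA_TAG, key), value))
    # Emit the key count of each region first, then the corresponding data.
    out = [((region, COUNT_TAG, ""), len(keys))
           for region, keys in keys_per_region.items()]
    out.extend(buffered)
    return out


def shuffle(map_outputs):
    """Group mapped data by region; within a region, counts precede data."""
    merged = [pair for out in map_outputs for pair in out]
    merged.sort(key=lambda kv: kv[0])  # sort on the composite key only
    groups = defaultdict(list)
    for (region, tag, key), value in merged:
        groups[region].append((tag, key, value))
    return dict(groups)


if __name__ == "__main__":
    starts = ["a", "m"]  # two regions: [a, m) and [m, ...)
    recs = [("apple", 1), ("mango", 2), ("avocado", 3), ("apple", 4)]
    for region, items in shuffle([map_split(recs, starts)]).items():
        print(region, items)
```

Because the count records share the region part of the key and carry the lower tag, a reducer reading one region's group sees the key counts before any data tuple, which is what lets it pick a join strategy before it starts joining.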
Abstract
The present invention provides a system and method for optimizing a reduce-side join operation in a map-reduce framework. The system and method execute one or more map operations on the second data structure, group the data tuples corresponding to a single region of the second data structure, provide the grouped data to a single reducer, and select, by one or more reducers, one of a scan approach and a look-up approach based on the region key count value and predetermined conditions set by the user.
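The reducer-side choice between the two approaches can be sketched as follows. This is an illustration under stated assumptions, not the criteria used in the patent: the region metadata is assumed to include a row count, the "predefined criteria" are modeled as a hypothetical `lookup_ratio` threshold on key count divided by region row count, the region of the first data structure is modeled as a list of `(key, row)` pairs sorted by unique key, and the incoming tuples are assumed to arrive sorted by key.

```python
from bisect import bisect_left


def choose_strategy(key_count, region_row_count, lookup_ratio=0.1):
    """Pick "lookup" when few keys hit a large region, else "scan".

    `lookup_ratio` is a hypothetical user-set threshold.
    """
    if region_row_count == 0:
        return "scan"
    return "lookup" if key_count / region_row_count <= lookup_ratio else "scan"


def lookup_join(region_rows, tuples):
    """Look-up approach: one point query (binary search here) per tuple."""
    keys = [k for k, _ in region_rows]
    joined = []
    for key, value in tuples:
        i = bisect_left(keys, key)
        if i < len(keys) and keys[i] == key:
            joined.append((key, region_rows[i][1], value))
    return joined


def scan_join(region_rows, tuples):
    """Scan approach: a single forward pass over the sorted region."""
    joined = []
    i = 0
    for key, value in tuples:
        # Advance the scan cursor until it reaches or passes the tuple's key.
        while i < len(region_rows) and region_rows[i][0] < key:
            i += 1
        if i < len(region_rows) and region_rows[i][0] == key:
            joined.append((key, region_rows[i][1], value))
    return joined


def reduce_region(key_count, tuples, region_rows, lookup_ratio=0.1):
    """Join one region's grouped tuples using the selected approach."""
    strategy = choose_strategy(key_count, len(region_rows), lookup_ratio)
    join = lookup_join if strategy == "lookup" else scan_join
    return strategy, join(region_rows, tuples)


if __name__ == "__main__":
    region = [(f"k{i:02d}", i) for i in range(20)]
    grouped = [("k03", "x"), ("k03", "y"), ("k17", "z")]
    print(reduce_region(1, grouped, region))
    print(reduce_region(15, grouped, region))
```

Both functions return the same joined rows; only the access pattern differs. The intuition behind a threshold of this kind is that a handful of keys against a large region favors point look-ups, while many keys make one sequential pass cheaper than repeated seeks.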
20 Claims
1. Independent claim 1, as recited under First Claim above. Dependent claims: 10, 11, 12, 13, 14, 15, 16, 17
2. A method for optimizing reduce-side join operation in a Map-reduce framework between a first data structure and a second data structure, the first data structure being sorted and divided into one or more regions, the method comprising:
executing instructions, stored in a memory, by one or more processors to perform;
executing one or more map operations, wherein executing one or more map operations comprises;
fetching input data of the second data structure;
partitioning the data of the second data structure according to key-value pair;
projecting the key-value pairs of the second data structure to a partitioner;
maintaining one or more region key counters;
wherein the region key counter being used for registering key count value of one or more regions of the second data structure; and
emitting the key count value of one or more regions and corresponding data, wherein the key count values are emitted prior to the corresponding data; and
grouping mapped data corresponding to a single region of the second data structure;
providing the grouped data to a reducer;
retrieving descriptive metadata of one or more regions of the first data structure; and
selecting one of a look-up approach and a scan approach to perform the join operation by one or more reducers based on associated key count value and predefined criteria by the reducer, for performing the join operation.
Dependent claims: 3, 4, 5, 6, 7, 8, 9
18. A non-transitory computer program product comprising instructions stored in the computer program product for optimizing reduce-side join operation in a Map-reduce framework between a first data structure and a second data structure, the first data structure being sorted and divided into one or more regions, wherein the instructions are executable by a processor to perform:
executing one or more map operations, wherein executing one or more map operations comprises;
fetching input data of the second data structure;
partitioning the data of the second data structure according to key-value pair;
projecting the key-value pairs of the second data structure to a partitioner;
maintaining one or more region key counters;
wherein the region key counter being used for registering key count value of one or more regions of the second data structure; and
emitting the key count value of one or more regions and corresponding data, wherein the key count values are emitted prior to the corresponding data; and
grouping mapped data corresponding to a single region of the second data structure;
providing the grouped data to a reducer;
retrieving descriptive metadata of one or more regions of the first data structure; and
selecting one of a look-up approach and a scan approach to perform the join operation by one or more reducers based on associated key count value and predefined criteria by the reducer, for performing the join operation.
Dependent claims: 19, 20
Specification