Method and system for optimizing reduce-side join operation in a map-reduce framework

US 10,185,743 B2
Filed: 11/25/2014
Issued: 01/22/2019
Est. Priority Date: 11/26/2013
Status: Active Grant

Abstract

The present invention provides a system and method for optimizing a reduce-side join operation in a map-reduce framework. The system and method execute one or more map operations on the second data structure, group the data tuples belonging to a single region of the second data structure, provide the grouped data to a single reducer, and select, by one or more reducers, one of a scan approach and a look-up approach based on the region key count value and pre-determined conditions of the user.
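The selection step described in the abstract can be illustrated with a minimal Python sketch. It is not taken from the patent: the `RegionMetadata` class, the `choose_join_approach` function, and the `lookup_ratio` threshold are hypothetical names, and the ratio test merely stands in for the unspecified "pre-determined conditions".

```python
from dataclasses import dataclass


@dataclass
class RegionMetadata:
    """Hypothetical descriptive metadata for one region of the first data structure."""
    region_id: int
    key_count: int  # number of keys stored in this region


def choose_join_approach(incoming_key_count: int,
                         region: RegionMetadata,
                         lookup_ratio: float = 0.1) -> str:
    """Return "lookup" or "scan" for a single region.

    Assumption: point look-ups pay off only when the keys arriving from the
    second data structure are few relative to the keys held in the region;
    otherwise one sequential scan of the region is cheaper.
    """
    if region.key_count == 0:
        return "scan"  # nothing to probe, so an empty scan is the trivial choice
    if incoming_key_count <= lookup_ratio * region.key_count:
        return "lookup"
    return "scan"
```

Under these assumptions, a reducer that receives 5 keys for a region holding 1,000 keys would choose the look-up approach, while one that receives 500 keys would choose the scan approach.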

20 Claims

1. A computer system for optimizing reduce-side join operation in a Map-reduce framework between a first data structure and a second data structure, the first data structure being sorted and divided into one or more regions, the system comprising:
- one or more processors; and
  
  a non-transitory memory that includes modules that are executable by said one or more processors, wherein the modules include;
  
  an executing module to execute one or more map operations by one or more processors, wherein to execute one or more map operations by one or more processors comprises to;
  
  fetch input data of the second data structure;
  
  partition the data of the second data structure according to key-value pair;
  
  project the key-value pairs of the second data structure to a partitioner;
  
  maintain one or more region key counters;
  
  wherein the region key counter being used for registering key count value of one or more regions of the second data structure; and
  
  emit the key count value of one or more regions and corresponding data, wherein the key count values are emitted prior to the corresponding data;
  
  a grouping module to group mapped data corresponding to a single region of the second data structure;
  
  an accumulating module to provide the grouped data to a reducer;
  
  a fetching module to retrieve descriptive metadata of one or more regions of the first data structure; and
  
  a selecting module to select one of a look-up approach and a scan approach to perform the join operation by one or more reducers based on associated key count value and predefined criteria by the reducer, to perform the join operation.
- View Dependent Claims (10, 11, 12, 13, 14, 15, 16, 17)
- - 10. The system as claimed in claim 1, wherein the descriptive metadata comprises region key count value of one or more regions of the first data structure.
  - 11. The system as claimed in claim 1, wherein each set of mapped data includes a set of tuples, each tuple characterized by key/value pair, wherein the keys and values are sets of attributes.
  - 12. The system as claimed in claim 1, wherein:
    - the join operation is carried out by a plurality of reducers; and
      
      the data that is not intermediate data, for a particular reducer, includes data that is associated with another reducer.
  - 13. The system as claimed in claim 1, wherein the join operation includes relating the data among the plurality of the data structures.
  - 14. The system as claimed in claim 1, wherein the one or more processors comprises multiple processors included in a cluster of computers.
  - 15. The system as claimed in claim 1, wherein the first data structure being sorted and divided into data tuples is stored in key count format.
  - 16. The system as claimed in claim 1, wherein the one or more processors comprises multiple processors included in a cluster of computers.
  - 17. The system as claimed in claim 16 wherein the cluster of computers include a persisting memory cache across the cluster of computers, the system further comprising instructions in one or more of the cluster of computers that executable to:
    - replace a failed of one of the computers with a different computer as a single transaction; and
      
      obtain a redundant copy of the data from one or more remaining computers in the cluster.
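The map-side steps recited in claim 1 (assigning records to regions, maintaining region key counters, and emitting counts ahead of the data) might be sketched as follows. This is an illustrative reading, not the patented implementation: the composite key `(region, tag, key)`, the `COUNT`/`DATA` tags, and the helpers `region_of`, `map_records`, and `shuffle` are all assumptions, and the counter here counts records per region rather than distinct keys.

```python
from bisect import bisect_right
from collections import Counter

COUNT, DATA = 0, 1  # hypothetical tags; COUNT sorts before DATA within a region


def region_of(key, region_start_keys):
    """Index of the region of the sorted first data structure that would hold `key`."""
    return max(bisect_right(region_start_keys, key) - 1, 0)


def map_records(records, region_start_keys):
    """Map (key, value) records of the second data structure.

    Maintains one counter per region and emits the per-region counts
    before any of the data records.
    """
    counts = Counter()
    data = []
    for key, value in records:
        region = region_of(key, region_start_keys)
        counts[region] += 1
        data.append(((region, DATA, key), value))
    emitted = [((region, COUNT, ""), n) for region, n in sorted(counts.items())]
    emitted.extend(data)
    return emitted


def shuffle(emitted):
    """Stand-in for the framework's sort: groups by region, counts first."""
    return sorted(emitted, key=lambda kv: kv[0])
```

Partitioning and grouping on the first element of the composite key would deliver all records of one region to a single reducer, with the count records arriving before the data.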

2. A method for optimizing reduce-side join operation in a Map-reduce framework between a first data structure and a second data structure, the first data structure being sorted and divided into one or more regions, the method comprising:
- executing instructions, stored in a memory, by one or more processors to perform;
  
  executing one or more map operations, wherein executing one or more map operations comprises;
  
  fetching input data of the second data structure;
  
  partitioning the data of the second data structure according to key-value pair;
  
  projecting the key-value pairs of the second data structure to a partitioner;
  
  maintaining one or more region key counters;
  
  wherein the region key counter being used for registering key count value of one or more regions of the second data structure; and
  
  emitting the key count value of one or more regions and corresponding data, wherein the key count values are emitted prior to the corresponding data; and
  
  grouping mapped data corresponding to a single region of the second data structure;
  
  providing the grouped data to a reducer;
  
  retrieving descriptive metadata of one or more regions of the first data structure; and
  
  selecting one of a look-up approach and a scan approach to perform the join operation by one or more reducers based on associated key count value and predefined criteria by the reducer, for performing the join operation.
- View Dependent Claims (3, 4, 5, 6, 7, 8, 9)
- - 3. The method as claimed in claim 2, wherein the descriptive metadata comprises region key count value of one or more regions of the first data structure.
  - 4. The method as claimed in claim 2, wherein each set of mapped data includes a set of tuples, each tuple characterized by key/value pair, wherein the keys and values are sets of attributes.
  - 5. The method as claimed in claim 2, wherein:
    - the join operation is carried out by a plurality of reducers; and
      
      the data that is not intermediate data, for a particular reducer, includes data that is associated with another reducer.
  - 6. The method as claimed in claim 2, wherein the join operation includes relating the data among the plurality of the data structures.
  - 7. The method as claimed in claim 2, wherein executing the instructions is implemented using a cluster of machines.
  - 8. The method as claimed in claim 2, wherein the first data structure being sorted and divided into data tuples is stored in key count format.
  - 9. The method as claimed in claim 2, wherein executing instructions comprises executing instructions by processors of a cluster of computers, the method further comprising persisting memory cache across the cluster of computers wherein:
    - a failure of one of the computers results in replacing the failed computer with a different computer;
      
      the replacing is performed as a single transaction; and
      
      a redundant copy of the data is obtained from one or more remaining computers in the cluster.
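The reducer-side steps of the method (reading the key count first, selecting an approach, then joining) might look like the sketch below. It rests on several assumptions that the claims do not state: the grouped input is a sequence of `(tag, key, value)` triples with count records first, the region of the first data structure is a sorted in-memory list of `(key, value)` rows, `len(region_rows)` stands in for the region key count from the descriptive metadata, and `lookup_ratio` is a hypothetical stand-in for the predefined criteria.

```python
from bisect import bisect_left
from collections import defaultdict

COUNT, DATA = 0, 1  # hypothetical record tags; count records arrive first


def join_by_lookup(region_rows, incoming):
    """Probe the sorted region once per incoming key (binary search)."""
    keys = [k for k, _ in region_rows]
    out = []
    for key, value in incoming:
        i = bisect_left(keys, key)
        while i < len(keys) and keys[i] == key:
            out.append((key, region_rows[i][1], value))
            i += 1
    return out


def join_by_scan(region_rows, incoming):
    """Read the whole region sequentially and match against the incoming keys."""
    by_key = defaultdict(list)
    for key, value in incoming:
        by_key[key].append(value)
    return [(k, rv, v) for k, rv in region_rows for v in by_key.get(k, ())]


def reduce_region(grouped, region_rows, lookup_ratio=0.1):
    """Sum the leading count records, pick an approach, then join the data."""
    it = iter(grouped)
    incoming_count = 0
    first_data = None
    for tag, key, value in it:
        if tag != COUNT:
            first_data = (key, value)  # counts are exhausted; data begins here
            break
        incoming_count += value
    approach = "lookup" if incoming_count <= lookup_ratio * len(region_rows) else "scan"
    incoming = [] if first_data is None else [first_data]
    incoming.extend((k, v) for _, k, v in it)
    join = join_by_lookup if approach == "lookup" else join_by_scan
    return approach, join(region_rows, incoming)
```

Because the counts precede the data, the reducer in this sketch can commit to an approach before it touches any data record. Both join functions return the same matches, though possibly in a different order.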

18. A non-transitory computer program product comprising instructions stored in the computer program product for optimizing reduce-side join operation in a Map-reduce framework between a first data structure and a second data structure, the first data structure being sorted and divided into one or more regions, wherein the instructions are executable by a processor to perform:
- executing one or more map operations, wherein executing one or more map operations comprises;
  
  fetching input data of the second data structure;
  
  partitioning the data of the second data structure according to key-value pair;
  
  projecting the key-value pairs of the second data structure to a partitioner;
  
  maintaining one or more region key counters;
  
  wherein the region key counter being used for registering key count value of one or more regions of the second data structure; and
  
  emitting the key count value of one or more regions and corresponding data, wherein the key count values are emitted prior to the corresponding data; and
  
  grouping mapped data corresponding to a single region of the second data structure;
  
  providing the grouped data to a reducer;
  
  retrieving descriptive metadata of one or more regions of the first data structure; and
  
  selecting one of a look-up approach and a scan approach to perform the join operation by one or more reducers based on associated key count value and predefined criteria by the reducer, for performing the join operation.
- View Dependent Claims (19, 20)
- - 19. The non-transitory computer program product in claim 18, wherein the descriptive metadata comprises region key count value of one or more regions of the first data structure.
  - 20. The non-transitory computer program product in claim 18, wherein each set of mapped data includes a set of tuples, each tuple characterized by key/value pair, wherein the keys and values are sets of attributes.

Current Assignee: InMobi Pte. Ltd
Original Assignee: InMobi Pte. Ltd
Inventors: Sundarrajan, Srikanth; Shivalingamurthy, Shwetha G.
Primary Examiner(s): Trujillo, James
Assistant Examiner(s): Tessema, Aida
Application Number: US14/553,786
Publication Number: US 20150149437A1
Time in Patent Office: 1,519 Days
CPC Class Codes: G06F 16/2453 Query optimisation
