COMPILER-GUIDED SOFTWARE ACCELERATOR FOR ITERATIVE HADOOP JOBS
Abstract
Methods are provided for a compiler-guided software accelerator for iterative HADOOP jobs. A method includes identifying intermediate data, generated by an iterative HADOOP application, that is below a predetermined threshold size and used less than a predetermined threshold time period. The intermediate data is stored in a memory device. The method further includes minimizing input, output, and synchronization overhead for the intermediate data by selectively using at any given time any one of a Message Passing Interface and a HADOOP Distributed File System as a communication layer. The Message Passing Interface is co-located with the HADOOP Distributed File System.
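The abstract's selection rule routes intermediate data over MPI only when it is both below a size threshold and reused within a time threshold; everything else goes through HDFS. A minimal sketch of that decision follows. The threshold values, class name, and method name are illustrative assumptions, not values taken from the patent:

```java
// Sketch of the threshold test described in the abstract and claim 1:
// small, short-lived intermediate data goes over MPI; large or long-lived
// data stays on the HADOOP Distributed File System. Thresholds are assumed.
public class CommLayerSelector {
    public enum CommLayer { MPI, HDFS }

    // Hypothetical thresholds: 64 MB and a 10-second reuse window.
    static final long SIZE_THRESHOLD_BYTES = 64L * 1024 * 1024;
    static final long LIFETIME_THRESHOLD_MS = 10_000L;

    /** Pick the communication layer for one piece of intermediate data. */
    public static CommLayer select(long sizeBytes, long lifetimeMs) {
        if (sizeBytes < SIZE_THRESHOLD_BYTES && lifetimeMs < LIFETIME_THRESHOLD_MS) {
            return CommLayer.MPI;    // small and short-lived: avoid HDFS I/O
        }
        return CommLayer.HDFS;       // large or long-lived: durable storage
    }

    public static void main(String[] args) {
        System.out.println(select(1024, 500));                 // MPI
        System.out.println(select(512L * 1024 * 1024, 500));   // HDFS
    }
}
```

Because both conditions must hold, data that is small but reused over a long period still goes to HDFS under this sketch.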
Claims (18)
1. A method, comprising:

identifying intermediate data, generated by an iterative HADOOP application, below a predetermined threshold size and used less than a predetermined threshold time period, the intermediate data being stored in a memory device; and

minimizing input, output, and synchronization overhead for the intermediate data by selectively using at any given time any one of a Message Passing Interface and a HADOOP Distributed File System as a communication layer, the Message Passing Interface being co-located with the HADOOP Distributed File System.

(Dependent claims 2-7 not shown.)
8. A method, comprising:

identifying a set of map tasks and reduce tasks capable of being reused across multiple iterations of an iterative HADOOP application; and

reducing a system load imparted on a computer system executing the iterative HADOOP application by transforming a source code of the iterative HADOOP application to launch the map tasks in the set only once and keep the map tasks in the set alive for an entirety of the execution of the iterative HADOOP application.

(Dependent claims 9-16 not shown.)
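Claim 8's transformation amounts to starting each map task once and keeping it alive across iterations, instead of paying task-launch overhead every iteration. The sketch below models that with a long-lived worker thread that pulls each iteration's input from a queue; it is an illustrative model under assumed names, not the patent's actual Hadoop source-to-source transformation:

```java
// Sketch of a persistent map task: launched once, kept alive for all
// iterations, fed per-iteration input through a queue. All names are assumed.
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

public class PersistentMapper implements Runnable {
    static final AtomicInteger launches = new AtomicInteger();
    static final String POISON = "__DONE__";   // shutdown marker
    final BlockingQueue<String> input = new LinkedBlockingQueue<>();
    final BlockingQueue<String> output = new LinkedBlockingQueue<>();

    @Override
    public void run() {
        launches.incrementAndGet();            // counted once per task, not once per iteration
        try {
            String record;
            while (!(record = input.take()).equals(POISON)) {
                output.put(record.toUpperCase());   // stand-in for the real map function
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Drive one persistent mapper through the given number of iterations. */
    static int runIterations(int iterations) throws InterruptedException {
        launches.set(0);
        PersistentMapper m = new PersistentMapper();
        Thread t = new Thread(m);
        t.start();
        for (int i = 0; i < iterations; i++) {
            m.input.put("iteration-" + i);
            m.output.take();                   // wait for this iteration's map output
        }
        m.input.put(POISON);
        t.join();
        return launches.get();                 // 1, regardless of iteration count
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("launches=" + runIterations(3));   // launches=1
    }
}
```

The launch counter makes the claimed saving visible: three iterations, one task launch.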
17. A method, comprising:

automatically transforming an iterative HADOOP application to selectively use at any given time any one of a Message Passing Interface and a HADOOP Distributed File System depending on parameters of a data transfer in the iterative HADOOP application, the Message Passing Interface being co-located with the HADOOP Distributed File System; and

enabling concurrent execution by at least one processor of a reduce task from an iteration n and map tasks from an iteration n+1 in the iterative HADOOP application, n being an integer, wherein said enabling step comprises:

replacing an invocation to a runJob( ) function in the iterative HADOOP application by an invocation to a submitJob( ) function; and

inserting a function call into the iterative HADOOP application for blocking and reading model data from a Message Passing Interface based data distribution library connected to the Message Passing Interface.

(Dependent claim 18 not shown.)
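In Hadoop's old org.apache.hadoop.mapred.JobClient API, runJob( ) blocks until the job completes, while submitJob( ) returns immediately; that difference is what lets iteration n's reduce overlap with iteration n+1's map tasks. The sketch below models the overlap with a plain thread pool standing in for JobClient, and a blocking queue standing in for the patent's MPI-based data distribution library; the task bodies and all names other than runJob/submitJob are illustrative assumptions:

```java
// Sketch of claim 17's overlap: a non-blocking submitJob()-style call lets the
// reduce of iteration n run concurrently with the maps of iteration n+1.
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;

public class OverlappedIterations {
    /**
     * pool.submit() returns a Future immediately, playing the role of
     * submitJob(); a blocking runJob()-style call here would have forced the
     * two stages to run back to back instead of concurrently.
     */
    static List<String> runIterationPair(int n) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        BlockingQueue<String> modelData = new LinkedBlockingQueue<>();

        Future<?> reduceN = pool.submit(() -> modelData.add("reduce-done-" + n));
        Future<?> mapsNext = pool.submit(() -> modelData.add("maps-done-" + (n + 1)));
        reduceN.get();
        mapsNext.get();
        pool.shutdown();

        // The inserted blocking read of model data (claim 17's second step).
        List<String> results = new ArrayList<>();
        results.add(modelData.take());
        results.add(modelData.take());
        Collections.sort(results);   // completion order is nondeterministic
        return results;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runIterationPair(0));   // [maps-done-1, reduce-done-0]
    }
}
```

The blocking take( ) calls mirror the function call claim 17 inserts for reading model data: downstream work cannot proceed until the data actually arrives, even though job submission itself did not block.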
Specification