DISTRIBUTED DATA REORGANIZATION FOR PARALLEL EXECUTION ENGINES

US 20100281078A1
Filed: 04/30/2009
Published: 11/04/2010
Est. Priority Date: 04/30/2009
Status: Abandoned Application

Abstract

A distributed data reorganization system and method for mapping and reducing raw data containing a plurality of data records. Embodiments of the distributed data reorganization system and method operate in a general-purpose parallel execution environment that uses an arbitrary communication directed acyclic graph. The vertices of the graph accept multiple data inputs and generate multiple data outputs, and may be of different types. Embodiments of the distributed data reorganization system and method include a plurality of distributed mappers that use a mapping criteria supplied by a developer to map the plurality of data records to data buckets. The mapped data record and data bucket identifications are input for a plurality of distributed reducers. Each distributed reducer groups together data records having the same data bucket identification and then uses a merge logic supplied by the developer to reduce the grouped data records to obtain reorganized data.
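
The flow described in the abstract can be illustrated with a minimal single-process sketch in Python. It is not the patented implementation and does not use any real parallel execution engine; the function names `run_mappers` and `run_reducers`, the callbacks `mapping_criteria` and `merge_logic`, and the sample records are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def run_mappers(records, mapping_criteria):
    """Tag every record with the bucket ID chosen by the developer's criteria."""
    return [(mapping_criteria(record), record) for record in records]

def run_reducers(mapped_records, merge_logic):
    """Group records by bucket ID, then reduce each group with the merge logic."""
    groups = defaultdict(list)
    for bucket_id, record in mapped_records:
        groups[bucket_id].append(record)
    return {bucket_id: merge_logic(group) for bucket_id, group in groups.items()}

# Hypothetical developer-supplied pieces (assumptions, not from the patent).
def mapping_criteria(record):
    name, _count = record
    return name  # the bucket ID is the record's name field

def merge_logic(group):
    return sum(count for _name, count in group)  # collapse a group to one total

records = [("apple", 2), ("pear", 1), ("apple", 5), ("pear", 4)]
reorganized = run_reducers(run_mappers(records, mapping_criteria), merge_logic)
print(reorganized)
```

In a distributed setting the mapped `(bucket_id, record)` pairs would be shipped to separate reducer processes; here both stages run in one process to keep the sketch short.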

20 Claims

1. A method implemented on a general-purpose computing device for processing data containing a plurality of data records, comprising:
- using the general-purpose computing device to perform the following;
  
  providing a general-purpose parallel execution environment that uses an arbitrary communication acyclic graph having vertices that have multiple inputs and generate multiple outputs;
  
  receiving a mapping criteria;
  
  assigning each of the plurality of data records to one of a plurality of data buckets based on the mapping criteria; and
  
  reducing data in each of the data buckets to generate reorganized data.
- Dependent claims:
  - 2. The method of claim 1, further comprising receiving the mapping criteria from an application written by a developer and running in the general-purpose parallel execution environment.
  - 3. The method of claim 2, further comprising displaying a mapper user interface to a developer such that the developer can use the mapper user interface to push the plurality of data records into a distributed mapper.
  - 4. The method of claim 3, further comprising defining the plurality of data buckets such that each of the plurality of data buckets has a unique data bucket identification.
  - 5. The method of claim 4, further comprising receiving a merge logic from an application written by a developer and running in the general-purpose parallel execution environment.
  - 6. The method of claim 5, further comprising performing data record selection in order to group together data records having a same data bucket identification to generate sets of reducible data records.
  - 7. The method of claim 6, further comprising defining a reducer user interface that allows the developer to input the merge logic.
  - 8. The method of claim 7, further comprising reducing a number of data records in each of the sets of reducible records based on the merge logic.

9. A distributed data reorganization system for mapping and reducing raw data containing a plurality of data records, comprising:
- a general-purpose parallel execution environment that uses an arbitrary communication acyclic graph;
  
  vertices of the arbitrary acyclic graph having multiple data inputs and that generate multiple data outputs;
  
  a plurality of distributed mappers in the general-purpose execution environment that take as input the plurality of data records and where each distributed mapper is represented by a vertex of the vertices;
  
  a plurality of data buckets assigned to each of the distributed mappers, where each of the data buckets corresponds to a certain type of data record;
  
  a plurality of distributed reducers in the general-purpose execution environment, where each distributed reducer takes as input data buckets having a same type of data record and where each distributed reducer is represented by a vertex of the vertices; and
  
  reorganized data that is output from the plurality of distributed reducers such that the same type of data records are grouped together and a number of the plurality of data records is reduced.
- Dependent claims:
  - 10. The distributed data reorganization system of claim 9, further comprising an application running in the general-purpose parallel execution environment that provides instructions to the plurality of distributed mappers and the plurality of distributed reducers.
  - 11. The distributed data reorganization system of claim 10, further comprising a mapping criteria that is contained in the application that provides mapping instructions to the plurality of distributed mappers.
  - 12. The distributed data reorganization system of claim 11, further comprising a merge logic that is contained in the application that provides reducing and merging instructions to the plurality of distributed reducers.
  - 13. The distributed data reorganization system of claim 12, further comprising a reducer user interface that allows a developer to input the merge logic.
  - 14. The distributed data reorganization system of claim 9, further comprising a mapper user interface that allows a developer to push the plurality of data records into each of the plurality of distributed mappers.
  - 15. The distributed data reorganization system of claim 9, wherein the general-purpose parallel execution environment is DryadNebula.
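
The system of claims 9 to 15 places mappers and reducers on vertices of an acyclic graph whose vertices take multiple inputs and produce multiple outputs. The following is a rough single-process sketch of that arrangement, not DryadNebula and not the claimed system. The names `make_mapper`, `reducer`, and `run_graph`, the edge format `(source, output_port, destination)`, and the even/odd bucketing rule are assumptions made for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def make_mapper(num_buckets):
    """Mapper vertex: reads every input channel, emits one channel per bucket."""
    def mapper(inputs):
        outputs = [[] for _ in range(num_buckets)]
        for channel in inputs:
            for record in channel:
                outputs[record % num_buckets].append(record)  # assumed criteria
        return outputs
    return mapper

def reducer(inputs):
    """Reducer vertex: merges the same bucket from every mapper into one total."""
    return [[sum(record for channel in inputs for record in channel)]]

def run_graph(vertices, edges, sources):
    """Run vertices in topological order, routing each output port along its edges."""
    predecessors = {name: set() for name in vertices}
    for src, _port, dst in edges:
        predecessors[dst].add(src)
    inputs = {name: list(sources.get(name, [])) for name in vertices}
    results = {}
    for name in TopologicalSorter(predecessors).static_order():
        results[name] = vertices[name](inputs[name])
        for src, port, dst in edges:
            if src == name:
                inputs[dst].append(results[name][port])
    return results

vertices = {"m0": make_mapper(2), "m1": make_mapper(2), "r0": reducer, "r1": reducer}
# Output port k of each mapper (bucket k) feeds reducer rk.
edges = [("m0", 0, "r0"), ("m0", 1, "r1"), ("m1", 0, "r0"), ("m1", 1, "r1")]
sources = {"m0": [[1, 2, 3]], "m1": [[4, 5, 6]]}

results = run_graph(vertices, edges, sources)
print(results["r0"], results["r1"])
```

Each reducer receives bucket channels of one type only (even or odd records here), which mirrors the claim language that a reducer takes as input data buckets having a same type of data record.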

16. A computer-implemented method for reorganizing raw data containing a plurality of data records, comprising:
- providing a DryadNebula general-purpose parallel execution environment having an arbitrary communication directed acyclic graph that contains vertices that receive multiple inputs and generate multiple outputs;
  
  displaying a mapper user interface to a developer so that the developer can use the interface to push the plurality of data records to a plurality of distributed mappers;
  
  defining data buckets each having a unique data bucket identification;
  
  selecting a data record from the plurality of data records;
  
  assigning the selected data record to a data bucket based on a mapping criteria;
  
  repeating the selecting and assigning until each of the plurality of data records has been mapped to generate mapped data records;
  
  inputting the mapped data records and their associated data bucket identifications to a plurality of distributed reducers;
  
  grouping together those mapped data records having a same data bucket identification to obtain sets of reducible data records; and
  
  processing the sets of reducible data records to generate a reorganized plurality of data records.
- Dependent claims:
  - 17. The computer-implemented method of claim 16, further comprising:
    - defining a reducer user interface that allows the developer to input merge logic; and
      
      reducing a number of data records in each of the sets of reducible records based on the merge logic.
  - 18. The computer-implemented method of claim 16, further comprising:
    - determining whether an assigned data bucket is at or near its memory capacity; and
      
      if so, then writing data records in the assigned data bucket to a disk and then purging the data records from the assigned data bucket.
  - 19. The computer-implemented method of claim 18, further comprising:
    - determining whether a subsequent process requires sorted data; and
      
      if so, then sorting data records in the assigned data bucket.
  - 20. The computer-implemented method of claim 16, further comprising receiving the mapping criteria from an application written by the developer and running in the DryadNebula general-purpose parallel execution environment.
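
Claims 18 and 19 describe writing a bucket to disk when it is at or near its memory capacity, optionally sorting it first when a later stage needs sorted data. One way this might look is sketched below. The class `SpillingBucket`, its `capacity` and `sort_needed` parameters, the use of a record count as a stand-in for memory capacity, and the JSON-lines spill format are all assumptions for illustration and do not come from the application.

```python
import json
import os
import tempfile

class SpillingBucket:
    """In-memory bucket that spills to a temporary file when it fills up."""

    def __init__(self, bucket_id, capacity, sort_needed=False):
        self.bucket_id = bucket_id
        self.capacity = capacity        # record count, standing in for memory size
        self.sort_needed = sort_needed  # whether a later stage needs sorted data
        self.records = []
        self.spill_files = []

    def add(self, record):
        self.records.append(record)
        if len(self.records) >= self.capacity:
            self.spill()

    def spill(self):
        """Write the buffered records to disk, then purge them from memory."""
        if not self.records:
            return
        if self.sort_needed:
            self.records.sort()
        fd, path = tempfile.mkstemp(prefix=f"bucket-{self.bucket_id}-", suffix=".jsonl")
        with os.fdopen(fd, "w") as spill_file:
            for record in self.records:
                spill_file.write(json.dumps(record) + "\n")
        self.spill_files.append(path)
        self.records.clear()

bucket = SpillingBucket("b0", capacity=3, sort_needed=True)
for value in [5, 1, 4, 2]:
    bucket.add(value)

# Read the first spilled run back, then remove the temporary file.
with open(bucket.spill_files[0]) as spill_file:
    first_run = [json.loads(line) for line in spill_file]
os.remove(bucket.spill_files[0])
print(first_run, bucket.records)
```

Sorting each run before it is written means a later stage could combine the spilled runs with a streaming merge instead of re-sorting everything in memory; that design choice is an assumption of this sketch rather than something the claims state.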

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Liu, Tie-Yan; Wang, Taifeng
Application Number: US12/433,880
Publication Number: US 20100281078A1
US Class Current: 707/812
CPC Class Codes:
G06F 16/217 Database tuning G06F16/2282...
G06F 16/24532 of parallel queries
