DISTRIBUTED DATA REORGANIZATION FOR PARALLEL EXECUTION ENGINES
First Claim
1. A method implemented on a general-purpose computing device for processing data containing a plurality of data records, comprising:
- using the general-purpose computing device to perform the following;
providing a general-purpose parallel execution environment that uses an arbitrary communication acyclic graph having vertices that have multiple inputs and generate multiple outputs;
receiving a mapping criteria;
assigning each of the plurality of data records to one of a plurality of data buckets based on the mapping criteria; and
reducing data in each of the data buckets to generate reorganized data.
2 Assignments
0 Petitions
Accused Products
Abstract
A distributed data reorganization system and method for mapping and reducing raw data containing a plurality of data records. Embodiments of the distributed data reorganization system and method operate in a general-purpose parallel execution environment that use an arbitrary communication directed acyclic graph. The vertices of the graph accept multiple data inputs and generate multiple data inputs, and may be of different types. Embodiments of the distributed data reorganization system and method include a plurality of distributed mappers that use a mapping criteria supplied by a developer to map the plurality of data records to data buckets. The mapped data record and data bucket identifications are input for a plurality of distributed reducers. Each distributed reducer groups together data records having the same data bucket identification and then uses a merge logic supplied by the developer to reduce the grouped data records to obtain reorganized data.
-
Citations
20 Claims
-
1. A method implemented on a general-purpose computing device for processing data containing a plurality of data records, comprising:
using the general-purpose computing device to perform the following; providing a general-purpose parallel execution environment that uses an arbitrary communication acyclic graph having vertices that have multiple inputs and generate multiple outputs; receiving a mapping criteria; assigning each of the plurality of data records to one of a plurality of data buckets based on the mapping criteria; and reducing data in each of the data buckets to generate reorganized data. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8)
-
9. A distributed data reorganization system for mapping and reducing raw data containing a plurality of data records, comprising:
-
a general-purpose parallel execution environment that uses an arbitrary communication acyclic graph; vertices of the arbitrary acyclic graph having multiple data inputs and that generate multiple data outputs; a plurality of distributed mappers in the general-purpose execution environment that take as input the plurality of data records and where each distributed mapper is represented by a vertex of the vertices; a plurality of data buckets assigned to each of the distributed mappers, where each of the data buckets corresponds to a certain type of data record; a plurality of distributed reducers in the general-purpose execution environment, where each distributed reducer takes as input data buckets having a same type of data record and where each distributed reducer is represented by a vertex of the vertices; and reorganized data that is output from the plurality of distributed reducers such that the same type of data records are grouped together and a number of the plurality of data records is reduced. - View Dependent Claims (10, 11, 12, 13, 14, 15)
-
-
16. A computer-implemented method for reorganizing raw data containing a plurality of data records, comprising:
-
providing a DryadNebula general-purpose parallel execution environment having an arbitrary communication directed acyclic graph that contains vertices that receive multiple inputs and generate multiple outputs; displaying a mapper user interface to a developer so that the developer can use the interface to push the plurality of data records to a plurality of distributed mappers; defining a data buckets each having a unique data bucket identification; selecting a data record from the plurality of data records; assigning the selected data record to a data bucket based on a mapping criteria; repeating the selecting and assigning until each of the plurality of data records have been mapped to generate mapped data records; inputting the mapped data records and their associated data bucket identifications to a plurality of distributed reducers; grouping together those mapped data records having a same data bucket identification to obtain sets of reducable data records; and processing the sets of reducable data records to generate a reorganized plurality of data records. - View Dependent Claims (17, 18, 19, 20)
-
Specification