Large-scale data processing in a distributed and parallel processing enviornment
First Claim
1. A system for large-scale processing of data in a distributed and parallel processing environment including a set of interconnected computing systems, comprising:
- a plurality of processes executing on a plurality of interconnected processors;
the plurality of processes including a supervisory process for coordinating a data processing job for processing a set of input data, and a plurality of map processes and a plurality of reduce processes;
wherein the supervisory process is for assigning input data blocks of the set of input data to respective map processes of the plurality of map processes;
wherein each of the plurality of map processes includes an application-independent map module for retrieving an input data block assigned thereto by the supervisory process, reading portions of the input data block, and applying an application-specific map operation to the input data block to produce intermediate key-value pairs, wherein at least two of the plurality of map processes operate simultaneously so as to perform the map operation in parallel on multiple input data blocks;
a plurality of intermediate data structures, the intermediate data structures adapted for storing the intermediate key-value pairs; and
wherein each of the plurality of reduce processes includes an application-independent reduce modules for retrieving a respective subset of the intermediate key-value pairs from a subset of the intermediate data structures and applying an application-specific reduce operation to the respective subset of intermediate key-value pairs, including combining respective intermediate values sharing the same key, to provide output, wherein at least two of the plurality of reduce processes operate simultaneously so as to perform the reduce operation in parallel on multiple respective subsets of the intermediate key-value pairs.
2 Assignments
0 Petitions
Accused Products
Abstract
A large-scale data processing system and method includes one or more application-independent map modules configured to read input data and to apply at least one application-specific map operation to the input data to produce intermediate data values, wherein the map operation is automatically parallelized across multiple processors in the parallel processing environment. A plurality of intermediate data structures are used to store the intermediate data values. One or more application-independent reduce modules are configured to retrieve the intermediate data values and to apply at least one application-specific reduce operation to the intermediate data values to provide output data.
68 Citations
30 Claims
-
1. A system for large-scale processing of data in a distributed and parallel processing environment including a set of interconnected computing systems, comprising:
-
a plurality of processes executing on a plurality of interconnected processors; the plurality of processes including a supervisory process for coordinating a data processing job for processing a set of input data, and a plurality of map processes and a plurality of reduce processes; wherein the supervisory process is for assigning input data blocks of the set of input data to respective map processes of the plurality of map processes; wherein each of the plurality of map processes includes an application-independent map module for retrieving an input data block assigned thereto by the supervisory process, reading portions of the input data block, and applying an application-specific map operation to the input data block to produce intermediate key-value pairs, wherein at least two of the plurality of map processes operate simultaneously so as to perform the map operation in parallel on multiple input data blocks; a plurality of intermediate data structures, the intermediate data structures adapted for storing the intermediate key-value pairs; and wherein each of the plurality of reduce processes includes an application-independent reduce modules for retrieving a respective subset of the intermediate key-value pairs from a subset of the intermediate data structures and applying an application-specific reduce operation to the respective subset of intermediate key-value pairs, including combining respective intermediate values sharing the same key, to provide output, wherein at least two of the plurality of reduce processes operate simultaneously so as to perform the reduce operation in parallel on multiple respective subsets of the intermediate key-value pairs. - View Dependent Claims (2, 3, 4)
-
-
5. A method of performing a large-scale data processing job, comprising, in a distributed and parallel processing environment including a set of interconnected computer systems.
executing a plurality of processes on a plurality of interconnected processors, the plurality of processes including a supervisory process for coordinating a data processing job for processing a set of input data, and a plurality of map processes and a plurality of reduce processes; -
in the supervisory process, assigning input data blocks of a set of input data to respective map processes of the plurality of map processes; in each of the plurality of map processes, executing an application-independent map module to retrieve an input data block assigned thereto by the supervisory process, to read portions of the input data block and to apply an application-specific map operation to the input data block to produce intermediate key-value pairs;
including operating at least two of the plurality of the map processes simultaneously so as to perform the map operation in parallel on multiple input data blocks;storing the intermediate key-value pairs in a plurality of intermediate data structures; and in each of the plurality of reduce processes, executing an application-independent reduce module to retrieve a respective subset of the intermediate key-value pairs from a subset of the intermediate data structures and to apply an application-specific reduce operation to the respective subset of intermediate key-value pairs, including combining respective intermediate values sharing the same key, to provide output data;
including operating at least two of the plurality of reduce processes simultaneously so as to perform the reduce operation in parallel on multiple respective subsets of the intermediate key-value pairs. - View Dependent Claims (6, 7, 8)
-
-
9. A system for large-scale processing of data in a distributed and parallel processing environment comprising:
-
a set of interconnected computing systems, each having memory and one or more processors; one or more application-independent map modules, stored in the memory of one or more of the computing systems for execution by one or more processors of one or more of the computing systems, for reading input data and applying at least one application-specific map operation to the input data to produce intermediate key-value pairs, wherein the one or more application-independent map modules automatically parallelize the map operation across multiple processors in the parallel processing environment; a plurality of intermediate data structures, the intermediate data structures adapted for storing the intermediate key-value pairs; and one or more application-independent reduce modules, stored in the memory of one or more of the computing systems for execution by one or more processors of one or more of the computing systems, for retrieving the intermediate key-value pairs from the intermediate data structures and applying at least one application-specific reduce operation to the intermediate key-value pairs, including combining respective intermediate values sharing the same key to provide output data. - View Dependent Claims (10, 11, 12, 13, 14, 15, 16)
-
-
17. A system for large-scale processing of data in a distributed and parallel processing environment including a set of interconnected computing systems, comprising:
-
a set of interconnected computing systems, each having memory and one or more processors; a set of application independent map modules, stored in the memory of one or more of the computing systems for execution by one or more processors of one or more of the computing systems, for reading portions of input files containing data, and for applying at least one user-specified, application-specific map operation to the data to produce intermediate key-value pairs; a set of intermediate data structures distributed throughout the distributed and parallel processing environment for storing the intermediate key-value pairs; and a set of application independent reduce modules, stored inthe memory of one or more of the computing systems for execution by one or more processors of one or more of the computing systems, for applying at least one user-specified, application-specific reduce operation to the intermediate key-value pairs so as to combine intermediate values sharing the same key. - View Dependent Claims (18, 19, 20, 21)
-
-
22. A large-scale data processing method in a distributed and parallel processing environment including a set of interconnected computing systems, comprising:
-
in one or more application-independent map modules, reading input data and applying at least one application-specific map operation to the input data to produce intermediate key-value pairs, wherein the one or more application independent map modules automatically parallelize the map operation across multiple processors in the parallel processing environment; storing the intermediate key-value pairs produced by the map operation in a plurality of intermediate data structures adapted for storing the intermediate key-value pairs; and in one or more application-indpenedent reduce modules, retrieving the intermediate key-value pairs from the intermediate data structures and applying at least one application-specific reduce operation to the intermediate key-value pairs, including combining respective intermediate values sharing the same key to provide output data. - View Dependent Claims (23, 24, 25, 26, 27)
-
-
28. A computer-readable storage medium having stored thereon instructions, which, when executed by a plurality of processors, cause the plurality of processors to perform the operations of:
-
in one or more application-independent map modules, reading input data and applying at least one application-specific map operation to the input data to produce intermediate key-value pairs, wherein the one or more application-independent map modules automatically parallelize the map operation across multiple processors in a parallel processing environment; storing the intermediate key-value pairs produced by the map operation in a plurality of intermediate data structures adapted for storing the intermediate key-value pairs; and
6p1 in one or more application-independent reduce modules, retrieving the intermediate key-value pairs from the intermediate data structures and applying at least one application-specific reduce operation to the intermediate key-value pairs, including combining respective intermediate values sharing the same key to provide output data. - View Dependent Claims (29)
-
-
30. A system for large-scale processing of data in a distributed and parallel processing environment including a set of interconnected computing systems, comprising:
-
means for applying at least one application-specific map operation to input data to produce intermediate key-value pairs; means for automatically parallelizing the map operation across multiple processors in the parallel processing environment using an application independent methodology; means for storing the intermediate key-value pairs produced by the map operation; means for retrieving a respective subset of the intermediate data values key-value pairs and applying at least one application-specific reduce operation to the intermediate key-value pairs including combining respective intermediate values sharing the same key; means for automatically parallelizing the reduce operation across multiple processors in the parallel processing environment using an application independent methodology; and means for storing output data values produced by the reduce operation.
-
Specification