System and Method for Large-Scale Data Processing Using an Application-Independent Framework
First Claim
1. A system for large-scale processing of data in a distributed and parallel processing environment including a set of interconnected computing systems, comprising:
an application-independent framework for processing data, including:
a plurality of application-independent map modules; and
a plurality of application-independent reduce modules, wherein the application-independent map modules and application-independent reduce modules use application-independent operators to automatically handle parallelization of computations across the distributed and parallel processing environment when performing user-specified data processing operations; and
a plurality of user-specified, application-specific operators, for use with the application-independent framework to perform a user-specified data processing operation on a user-specified set of input files, the application-specific operators including:
a map operator, wherein the map operator is applied by the application-independent map modules to input data in the user-specified set of input files to produce intermediate data values; and
a reduce operator, wherein the reduce operator is applied by the application-independent reduce modules to process the intermediate data values to produce final output data for the user-specified data processing operation.
Abstract
A large-scale data processing system and method for processing data in a distributed and parallel processing environment. The system includes an application-independent framework for processing data having a plurality of application-independent map modules and reduce modules. These application-independent modules use application-independent operators to automatically handle parallelization of computations across the distributed and parallel processing environment when performing user-specified data processing operations. The system also includes a plurality of user-specified, application-specific operators, for use with the application-independent framework to perform a user-specified data processing operation on a user-specified set of input files. The application-specific operators include a map operator and a reduce operator. The map operator is applied by the application-independent map modules to input data in the user-specified set of input files to produce intermediate data values. The reduce operator is applied by the application-independent reduce modules to process the intermediate data values to produce final output data.
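The division of labor described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: `run_job`, `word_count_map`, and `word_count_reduce` are hypothetical names, a thread pool in one process stands in for the distributed environment, and in-memory strings stand in for the user-specified input files. The point is only that the driver contains no application logic, while the two operators contain no parallelization logic.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def run_job(inputs, map_op, reduce_op, workers=4):
    """Hypothetical application-independent driver.

    It knows nothing about what map_op and reduce_op compute, only how to
    schedule them and route intermediate values between them.
    """
    # Map phase: apply the user-specified map operator to each input in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(lambda item: list(map_op(item)), inputs))

    # Group the intermediate values by key.
    grouped = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            grouped[key].append(value)

    # Reduce phase: apply the user-specified reduce operator once per key.
    keys = sorted(grouped)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda k: reduce_op(k, grouped[k]), keys)
        return dict(zip(keys, results))


# Application-specific operators for one example job (word count).
def word_count_map(text):
    for word in text.split():
        yield word.lower(), 1


def word_count_reduce(word, counts):
    return sum(counts)


documents = ["the quick brown fox", "the lazy dog", "The fox"]
counts = run_job(documents, word_count_map, word_count_reduce)
```

Swapping in a different pair of operators changes what the job computes without touching `run_job`, which is the sense in which the framework is application-independent.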
22 Claims
1. A system for large-scale processing of data in a distributed and parallel processing environment including a set of interconnected computing systems, comprising:
an application-independent framework for processing data, including:
a plurality of application-independent map modules; and
a plurality of application-independent reduce modules, wherein the application-independent map modules and application-independent reduce modules use application-independent operators to automatically handle parallelization of computations across the distributed and parallel processing environment when performing user-specified data processing operations; and
a plurality of user-specified, application-specific operators, for use with the application-independent framework to perform a user-specified data processing operation on a user-specified set of input files, the application-specific operators including:
a map operator, wherein the map operator is applied by the application-independent map modules to input data in the user-specified set of input files to produce intermediate data values; and
a reduce operator, wherein the reduce operator is applied by the application-independent reduce modules to process the intermediate data values to produce final output data for the user-specified data processing operation.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
10. A method of performing large-scale processing of data in a distributed and parallel processing environment comprising:
receiving, from a user, a plurality of user-specified, application-specific operators, for use with an application-independent framework to perform a user-specified data processing operation on a user-specified set of input files, the application-specific operators including a map operator and a reduce operator; and
processing data on a set of interconnected computer systems in the application-independent framework, each of the computer systems comprising one or more processors and memory, the application-independent framework including:
a plurality of application-independent map modules; and
a plurality of application-independent reduce modules, wherein the application-independent map modules and application-independent reduce modules use application-independent operators to automatically handle parallelization of computations across the distributed and parallel processing environment when performing user-specified data processing operations;
the data processing including:
using the application-independent map modules to apply the map operator to input data in the user-specified set of input files to produce intermediate data values; and
using the application-independent reduce modules to apply the reduce operator to process the intermediate data values to produce final output data for the user-specified data processing operation.
11. A computer-readable storage medium storing one or more programs configured for execution by a plurality of processors of a set of interconnected computer systems, the one or more programs comprising instructions to be executed by one or more of the plurality of processors so as to:
receive, from a user, a plurality of user-specified, application-specific operators, for use with an application-independent framework to perform a user-specified data processing operation on a user-specified set of input files, the application-specific operators including a map operator and a reduce operator; and
process data on a set of interconnected computer systems in the application-independent framework, each of the computer systems comprising one or more processors and memory, the application-independent framework including:
a plurality of application-independent map modules; and
a plurality of application-independent reduce modules, wherein the application-independent map modules and application-independent reduce modules use application-independent operators to automatically handle parallelization of computations across the distributed and parallel processing environment when performing user-specified data processing operations;
wherein:
the application-independent map modules include instructions to be executed by one or more of the plurality of processors so as to apply the map operator to input data in the user-specified set of input files to produce intermediate data values; and
the application-independent reduce modules include instructions to be executed by one or more of the plurality of processors so as to apply the reduce operator to process the intermediate data values to produce final output data for the user-specified data processing operation.
12. A system for large-scale processing of data in a distributed and parallel processing environment, comprising:
a set of interconnected computing systems, each having one or more processors and memory, the set of interconnected computing systems including:
a set of application-independent map modules for reading portions of input files containing data, and for applying at least one user-specified, application-specific map operation to the data to produce intermediate data values;
a set of intermediate data structures distributed among a plurality of the interconnected computing systems for storing the intermediate data values; and
a set of application-independent reduce modules for applying at least one user-specified, application-specific reduce operation to the intermediate data values to produce final output data.
Dependent claims: 13, 14, 15, 16, 17, 18, 19, 20
21. A method of performing large-scale processing of data in a distributed and parallel processing environment, comprising:
at a set of interconnected computing systems, each having one or more processors and memory:
using a set of application-independent map modules to read portions of input files containing data and apply at least one user-specified, application-specific map operation to the data to produce intermediate data values;
storing the intermediate data values in a set of intermediate data structures distributed among a plurality of the interconnected computing systems; and
using a set of application-independent reduce modules to apply at least one user-specified, application-specific reduce operation to the intermediate data values to produce final output data.
22. A computer readable storage medium storing one or more programs configured for execution by a plurality of processors of a set of interconnected computing systems, comprising instructions to be executed by the plurality of processors so as to:
use a set of application-independent map modules to read portions of input files containing data and apply at least one user-specified, application-specific map operation to the data to produce intermediate data values;
store the intermediate data values in a set of intermediate data structures distributed among a plurality of the interconnected computing systems; and
use a set of application-independent reduce modules to apply at least one user-specified, application-specific reduce operation to the intermediate data values to produce final output data.
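The three steps recited in claims 12, 21, and 22 (map modules reading portions of the input, intermediate data structures holding the intermediate values, and reduce modules consuming them) can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the claimed system: all function names are hypothetical, the CRC-based key partitioning is one assumed way to distribute intermediate values, the tasks run sequentially in one process, and a list of strings stands in for the input files.

```python
import zlib
from collections import defaultdict


def split_records(records, split_size):
    """Yield consecutive portions of the input, one per map task."""
    for start in range(0, len(records), split_size):
        yield records[start:start + split_size]


def partition_for(key, num_partitions):
    """Deterministic key-to-partition assignment (an assumed scheme)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def run_partitioned_job(records, map_op, reduce_op, split_size=2, num_partitions=3):
    # One intermediate structure per reduce task. In a distributed deployment
    # these would live on different machines; here they are in-process dicts.
    partitions = [defaultdict(list) for _ in range(num_partitions)]

    # Map tasks: each reads one portion of the input and stores its output
    # in the partition that owns each key.
    for portion in split_records(records, split_size):
        for record in portion:
            for key, value in map_op(record):
                partitions[partition_for(key, num_partitions)][key].append(value)

    # Reduce tasks: each processes exactly one partition.
    output = {}
    for partition in partitions:
        for key in sorted(partition):
            output[key] = reduce_op(key, partition[key])
    return output


# Hypothetical application-specific operators: maximum temperature per city.
def max_temp_map(record):
    city, temp = record.split(",")
    yield city, int(temp)


def max_temp_reduce(city, temps):
    return max(temps)


readings = ["oslo,3", "cairo,31", "oslo,7", "lima,19", "cairo,28"]
result = run_partitioned_job(readings, max_temp_map, max_temp_reduce)
```

Because every value for a given key lands in the same partition, each reduce task can run independently of the others, which is what would allow the partitions to sit on separate machines.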
Specification