×

System and method for efficient large-scale data processing

  • US 7,650,331 B1
  • Filed: 06/18/2004
  • Issued: 01/19/2010
  • Est. Priority Date: 06/18/2004
  • Status: Active Grant
First Claim
Patent Images

1. A system for large-scale processing of data, comprising:

  • a plurality of processes executing on a plurality of interconnected processors;

    the plurality of processes including a master process, for coordinating a data processing job for processing a set of input data, and worker processes;

    the master process, in response to a request to perform the data processing job, assigning input data blocks of the set of input data to respective ones of the worker processes;

    each of a first plurality of the worker processes including an application-independent map module for retrieving a respective input data block assigned to the worker process by the master process and applying an application-specific map operation to the respective input data block to produce intermediate data values, wherein at least a subset of the intermediate data values each comprises a key/value pair, and wherein at least two of the first plurality of the worker processes operate simultaneously so as to perform the application-specific map operation in parallel on distinct, respective input data blocks;

    a partition operator for processing the produced intermediate data values to produce a plurality of intermediate data sets, wherein each respective intermediate data set includes all key/value pairs for a distinct set of respective keys, and wherein at least one of the respective intermediate data sets includes respective ones of the key/value pairs produced by a plurality of the first plurality of the worker processes; and

    each of a second plurality of the worker processes including an application-independent reduce module for retrieving data, the retrieved data comprising at least a subset of the key/value pairs from a respective intermediate data set of the plurality of intermediate data sets and applying an application-specific reduce operation to the retrieved data to produce final output data corresponding to the distinct set of respective keys in the respective intermediate data set of the plurality of intermediate data sets, and wherein at least two of the second plurality of the worker processes operate simultaneously so as to perform the application-specific reduce operation in parallel on multiple respective subsets of the produced intermediate data values.

View all claims
  • 2 Assignments
Timeline View
Assignment View
    ×
    ×