Large-scale data processing in a distributed and parallel processing environment

  • US 7,756,919 B1
  • Filed: 06/18/2004
  • Issued: 07/13/2010
  • Est. Priority Date: 06/18/2004
  • Status: Active Grant
First Claim

1. A system for large-scale processing of data in a distributed and parallel processing environment including a set of interconnected computing systems, comprising:

  • a plurality of processes executing on a plurality of interconnected processors;

    the plurality of processes including a supervisory process for coordinating a data processing job for processing a set of input data, and a plurality of map processes and a plurality of reduce processes;

    wherein the supervisory process is for assigning input data blocks of the set of input data to respective map processes of the plurality of map processes;

    wherein each of the plurality of map processes includes an application-independent map module for retrieving an input data block assigned thereto by the supervisory process, reading portions of the input data block, and applying an application-specific map operation to the input data block to produce intermediate key-value pairs, wherein at least two of the plurality of map processes operate simultaneously so as to perform the map operation in parallel on multiple input data blocks;

    a plurality of intermediate data structures, the intermediate data structures adapted for storing the intermediate key-value pairs; and

    wherein each of the plurality of reduce processes includes an application-independent reduce module for retrieving a respective subset of the intermediate key-value pairs from a subset of the intermediate data structures and applying an application-specific reduce operation to the respective subset of intermediate key-value pairs, including combining respective intermediate values sharing the same key, to provide output, wherein at least two of the plurality of reduce processes operate simultaneously so as to perform the reduce operation in parallel on multiple respective subsets of the intermediate key-value pairs.
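
The arrangement recited in claim 1 can be illustrated with a minimal single-machine sketch. It is not the patented implementation. All names here (`supervise`, `run_map_task`, `run_reduce_task`, `word_count_map`, `sum_reduce`) are hypothetical, threads stand in for the claimed processes on interconnected processors, in-memory dictionaries stand in for the intermediate data structures, and hash partitioning by key is an assumed way of giving each reduce task its subset of intermediate pairs.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def word_count_map(block):
    """Application-specific map operation (example): emit (word, 1)."""
    for line in block:
        for word in line.split():
            yield word, 1


def sum_reduce(key, values):
    """Application-specific reduce operation (example): sum values per key."""
    return key, sum(values)


def run_map_task(block, map_op, num_partitions):
    """Application-independent map module: read the assigned block, apply
    map_op, and store the pairs in one intermediate structure per reducer."""
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for key, value in map_op(block):
        # Assumption: hash partitioning decides which reducer sees a key.
        partitions[hash(key) % num_partitions][key].append(value)
    return partitions


def run_reduce_task(index, map_outputs, reduce_op):
    """Application-independent reduce module: gather this reducer's subset
    of intermediate pairs, then combine values that share the same key."""
    merged = defaultdict(list)
    for partitions in map_outputs:
        for key, values in partitions[index].items():
            merged[key].extend(values)
    return dict(reduce_op(key, values) for key, values in merged.items())


def supervise(blocks, map_op, reduce_op, num_reducers=2, workers=4):
    """Supervisory role: assign input blocks to map tasks, then start the
    reduce tasks once all intermediate data is available."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Map tasks run concurrently, one per input block.
        map_outputs = list(
            pool.map(lambda b: run_map_task(b, map_op, num_reducers), blocks)
        )
        # Reduce tasks run concurrently, one per partition index.
        reduce_outputs = list(
            pool.map(
                lambda i: run_reduce_task(i, map_outputs, reduce_op),
                range(num_reducers),
            )
        )
    result = {}
    for output in reduce_outputs:
        result.update(output)
    return result
```

For instance, calling `supervise([["a b a"], ["b c"], ["a"]], word_count_map, sum_reduce)` counts each word across the three input blocks. The map and reduce modules never inspect the data themselves, so swapping in a different `map_op` and `reduce_op` changes the application without touching the coordination code, which mirrors the claim's split between application-independent modules and application-specific operations.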
