Profiling in a massive parallel processing environment

  • US 9,251,212 B2
  • Filed: 03/27/2009
  • Issued: 02/02/2016
  • Est. Priority Date: 03/27/2009
  • Status: Active Grant
First Claim

1. A computer-implemented method of profiling a data set in a parallel processing environment, comprising:

    partitioning an initial data set stored in a row based format vertically according to multiple attribute subsets, wherein partitioning comprises extracting data for the multiple attribute subsets from the initial data set;

    storing the extracted data for the multiple attribute subsets in a plurality of staging files, the plurality of staging files including extracted data that is static;

    profiling the multiple attribute subsets based on a specific attribute value, wherein different staging files are provided to different machines in a plurality of machines for profiling;

    determining a set of row identifiers that satisfy the specific attribute value identified in the profiling of the different staging files processed on the plurality of machines;

    extracting, for each row identifier in the set of row identifiers, a row identifier, a column identifier, and a value of an attribute for a set of columns associated with each row identifier;

    merging the row identifier, the column identifier, and the value of the attribute for the set of columns associated with each row identifier to form a profiled subset of the initial data set; and

    transmitting, displaying or storing the profiled subset of the initial data set, a further processed version, or combinations thereof.
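
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the sample rows, the column names, the `id` row identifier, and every function name are hypothetical, immutable tuples stand in for static staging files, and worker threads stand in for the plurality of machines.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical row-based data set; "id" plays the role of the row identifier.
ROWS = [
    {"id": 1, "name": "Ann", "city": "Oslo", "age": 34, "plan": "gold"},
    {"id": 2, "name": "Bob", "city": "Rome", "age": 41, "plan": "free"},
    {"id": 3, "name": "Cy", "city": "Oslo", "age": 29, "plan": "gold"},
]
# Hypothetical attribute subsets used for the vertical partitioning.
ATTRIBUTE_SUBSETS = [("name", "city"), ("age", "plan")]


def partition_vertically(rows, subsets):
    """Extract each attribute subset into its own immutable 'staging file'."""
    return [
        tuple((row["id"], {col: row[col] for col in subset}) for row in rows)
        for subset in subsets
    ]


def profile_staging_file(staging, column, value):
    """Return the row identifiers in one staging file where column == value."""
    return {rid for rid, attrs in staging if attrs.get(column) == value}


def profile(rows, subsets, column, value):
    staging_files = partition_vertically(rows, subsets)

    # Each staging file goes to a different worker (threads stand in for machines).
    with ThreadPoolExecutor() as pool:
        hits = pool.map(
            lambda staging: profile_staging_file(staging, column, value),
            staging_files,
        )
        row_ids = set().union(*hits)

    # Extract (row identifier, column identifier, value) triples for matching rows.
    triples = [
        (rid, col, val)
        for staging in staging_files
        for rid, attrs in staging
        if rid in row_ids
        for col, val in attrs.items()
    ]

    # Merge the triples into the profiled subset, keyed by row identifier.
    profiled = {}
    for rid, col, val in triples:
        profiled.setdefault(rid, {})[col] = val
    return profiled


if __name__ == "__main__":
    # Final step of the claim: display the profiled subset.
    print(profile(ROWS, ATTRIBUTE_SUBSETS, "city", "Oslo"))
```

In this sketch only the staging file that contains the profiled column yields matching row identifiers; the other staging files contribute their columns during the extract and merge steps.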
