Profiling in a massive parallel processing environment
First Claim
1. A computer-implemented method of profiling a data set in a parallel processing environment, comprising:
partitioning an initial data set stored in a row based format vertically according to multiple attribute subsets, wherein partitioning comprises extracting data for the multiple attribute subsets from the initial data set;
storing the extracted data for the multiple attribute subsets in a plurality of staging files, the plurality of staging files including extracted data that is static;
profiling the multiple attribute subsets based on a specific attribute value, wherein different staging files are provided to different machines in a plurality of machines for profiling;
determining a set of row identifiers that satisfy the specific attribute value identified in the profiling of the different staging files processed on the plurality of machines;
extracting, for each row identifier in the set of row identifiers, a row identifier, a column identifier, and a value of an attribute for a set of columns associated with each row identifier;
merging the row identifier, the column identifier, and the value of the attribute for the set of columns associated with each row identifier to form a profiled subset of the initial data set; and
transmitting, displaying or storing the profiled subset of the initial data set, a further processed version, or combinations thereof.
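The steps recited in claim 1 can be sketched in a few lines of Python. This is only an illustrative reading of the claim language, not the patented implementation: the sample rows, the column names, the helper function names, the use of in-memory tuples in place of staging files, and the use of threads in place of separate machines are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical row-based data set: each dict is one row keyed by column name.
ROWS = [
    {"row_id": 1, "age": 34, "city": "Austin", "plan": "gold"},
    {"row_id": 2, "age": 51, "city": "Boston", "plan": "basic"},
    {"row_id": 3, "age": 34, "city": "Boston", "plan": "gold"},
    {"row_id": 4, "age": 29, "city": "Austin", "plan": "basic"},
]

# Hypothetical attribute subsets that drive the vertical partitioning.
ATTRIBUTE_SUBSETS = [("age",), ("city", "plan")]


def partition_vertically(rows, subsets):
    """Extract each attribute subset into its own immutable staging table."""
    staging = []
    for subset in subsets:
        table = tuple(
            (row["row_id"], tuple((col, row[col]) for col in subset))
            for row in rows
        )
        staging.append(table)
    return staging


def profile_staging(table, column, value):
    """Return the row ids in one staging table whose `column` equals `value`."""
    return {
        row_id
        for row_id, cells in table
        for col, val in cells
        if col == column and val == value
    }


def profile_in_parallel(staging, column, value):
    """Give each staging table to its own worker and union the matching row ids."""
    with ThreadPoolExecutor(max_workers=len(staging)) as pool:
        results = pool.map(lambda t: profile_staging(t, column, value), staging)
        return set().union(*results)


def extract_and_merge(staging, row_ids):
    """Collect (row id, column id, value) triples for matching rows, merged and sorted."""
    triples = [
        (row_id, col, val)
        for table in staging
        for row_id, cells in table
        if row_id in row_ids
        for col, val in cells
    ]
    return sorted(triples)


staging_tables = partition_vertically(ROWS, ATTRIBUTE_SUBSETS)
matching_ids = profile_in_parallel(staging_tables, "city", "Boston")
profiled_subset = extract_and_merge(staging_tables, matching_ids)
```

In this sketch the staging tables are built once and never modified, mirroring the claim's requirement that the extracted data be static, so each worker can read its table without coordination. The merged triples could then be stored, displayed, or passed on for further processing.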
3 Assignments
0 Petitions
Abstract
A computer-implemented method of profiling a data set in a parallel processing environment includes vertically partitioning an initial data set. One or more attribute subsets are then profiled. A list of subjects is generated, each subject corresponding to a specific attribute value identified in the profiling. Values of multiple attributes are extracted for each identified subject, and the sample results are assembled and merged.
15 Citations
22 Claims
9. A computer network including a parallel processing environment, comprising:
a data source;
multiple projection computers; and
one or more client computers connected to the multiple projection computers and having computer-readable code embedded therein for programming the projection computers to perform a method of profiling a data set in the parallel processing environment, wherein the method comprises:
partitioning an initial data set stored in a row based format vertically according to multiple attribute subsets, wherein partitioning comprises extracting data for the multiple attribute subsets from the initial data set;
storing the extracted data for the multiple attribute subsets in a plurality of staging files, the plurality of staging files including extracted data that is static;
profiling the multiple attribute subsets based on a specific attribute value, wherein different staging files are provided to different projection computers in the multiple projection computers for profiling;
determining a set of row identifiers that satisfy the specific attribute value identified in the profiling of the different staging files processed on the multiple projection computers;
extracting, for each row identifier in the set of row identifiers, a row identifier, a column identifier, and a value of an attribute for a set of columns associated with each row identifier;
merging the row identifier, the column identifier, and the value of the attribute for the set of columns associated with each row identifier to form a profiled subset of the initial data set; and
transmitting, displaying or storing the profiled subset of the initial data set, a further processed version, or combinations thereof.
16. One or more non-transitory processor-readable media having embedded therein processor-readable code to program one or more processors to perform a method of profiling a data set in a parallel processing environment, wherein the method comprises:
partitioning an initial data set stored in a row based format vertically according to multiple attribute subsets, wherein partitioning comprises extracting data for the multiple attribute subsets from the initial data set;
storing the extracted data for the multiple attribute subsets in a plurality of staging files, the plurality of staging files including extracted data that is static;
profiling the multiple attribute subsets based on a specific attribute value, wherein different staging files are provided to different machines in a plurality of machines for profiling;
determining a set of row identifiers that satisfy the specific attribute value identified in the profiling of the different staging files processed on the plurality of machines;
extracting, for each row identifier in the set of row identifiers, a row identifier, a column identifier, and a value of an attribute for a set of columns associated with each row identifier;
merging the row identifier, the column identifier, and the value of the attribute for the set of columns associated with each row identifier to form a profiled subset of the initial data set; and
transmitting, displaying or storing the profiled subset of the initial data set, a further processed version, or combinations thereof.
Specification