Profiling in a massive parallel processing environment

US 9,251,212 B2
Filed: 03/27/2009
Issued: 02/02/2016
Est. Priority Date: 03/27/2009
Status: Active Grant

Abstract

A computer-implemented method of profiling a data set in a parallel processing environment includes vertically partitioning an initial data set. One or more attribute subsets are then profiled. A list of subjects is generated, each corresponding to a specific attribute value identified in the profiling. Values of multiple attributes are extracted for each identified subject, and the sample results are assembled and merged.
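
The flow described in the abstract can be illustrated with a small in-memory sketch. It is only one possible reading of the steps, not the patented implementation: the sample data, the function names (`partition_vertically`, `profile`, `sample`) and the use of plain dictionaries are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical row-based data set: row identifier -> {attribute: value}.
ROWS = {
    1: {"name": "Ann", "city": "Paris", "status": "active"},
    2: {"name": "Bob", "city": "Lyon", "status": "inactive"},
    3: {"name": "Cy", "city": "Paris", "status": "active"},
}

def partition_vertically(rows, attribute_subsets):
    """Extract one column-wise slice of the data per attribute subset."""
    return [
        {rid: {attr: row[attr] for attr in subset} for rid, row in rows.items()}
        for subset in attribute_subsets
    ]

def profile(partition):
    """Map each (attribute, value) pair to the row identifiers that carry it."""
    index = defaultdict(set)
    for rid, attrs in partition.items():
        for attr, value in attrs.items():
            index[(attr, value)].add(rid)
    return index

def sample(rows, partitions, attribute, value, columns):
    """Collect (row id, column id, value) triples for rows where attribute == value."""
    row_ids = set()
    for part in partitions:
        row_ids |= profile(part).get((attribute, value), set())
    # Sorting the set of triples yields a merged, duplicate-free result.
    return sorted((rid, col, rows[rid][col]) for rid in row_ids for col in columns)

parts = partition_vertically(ROWS, [("name",), ("city", "status")])
result = sample(ROWS, parts, "city", "Paris", ["name", "status"])
print(result)
```

Here each partition is profiled in the same process; in the setting the claims describe, each partition would instead be handed to a different machine.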

22 Claims

1. A computer-implemented method of profiling a data set in a parallel processing environment, comprising:
- partitioning an initial data set stored in a row based format vertically according to multiple attribute subsets, wherein partitioning comprises extracting data for the multiple attribute subsets from the initial data set;
  
  storing the extracted data for the multiple attribute subsets in a plurality of staging files, the plurality of staging files including extracted data that is static;
  
  profiling the multiple attribute subsets based on a specific attribute value, wherein different staging files are provided to different machines in a plurality of machines for profiling;
  
  determining a set of row identifiers that satisfy the specific attribute value identified in the profiling of the different staging files processed on the plurality of machines;
  
  extracting, for each row identifier in the set of row identifiers, a row identifier, a column identifier, and a value of an attribute for a set of columns associated with each row identifier;
  
  merging the row identifier, the column identifier, and the value of the attribute for the set of columns associated with each row identifier to form a profiled subset of the initial data set; and
  
  transmitting, displaying or storing the profiled subset of the initial data set, a further processed version, or combinations thereof.
- Dependent claims (2, 3, 4, 5, 6, 7, 8):
  - 2. The method of claim 1, wherein the initial data set comprises a table having m rows and n columns.
  - 3. The method of claim 2, wherein each column corresponds to a specific attribute and each row corresponds to a row identifier.
  - 4. The method of claim 1, further comprising assigning a different machine in the plurality of machines to each of the attribute subsets.
  - 5. The method of claim 4, wherein the profiling comprises concurrently profiling the attribute subsets assigned to the different machines.
  - 6. The method of claim 1, wherein the merging comprises removing duplicate subject sets of row identifiers.
  - 7. The method of claim 1, further comprising:
    - storing the row identifier, the column identifier, and the value of the attribute for the set of columns associated with each row identifier in a file; and
      
      using the file to display at least one or more values of attributes stored in the file for a set of columns for a row identifier.
  - 8. The method of claim 1, wherein the extracted data in the staging files is in a binary format.
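
Claims 4, 5 and 8 together suggest binary staging files that are profiled concurrently, one per worker. The following is a minimal sketch of that idea under loud assumptions: `pickle` stands in for whatever binary format is actually used, threads stand in for separate machines, and the helper names `write_staging_file` and `profile_staging_file` are hypothetical.

```python
import pickle
import tempfile
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def write_staging_file(directory, name, column):
    """Persist one extracted column ({row id: value}) as a binary staging file."""
    path = Path(directory) / f"{name}.bin"
    with open(path, "wb") as fh:
        pickle.dump(column, fh)
    return path

def profile_staging_file(path):
    """Worker task: load one staging file and count how often each value occurs."""
    with open(path, "rb") as fh:
        column = pickle.load(fh)
    return path.stem, Counter(column.values())

# Hypothetical extracted columns, keyed by row identifier.
columns = {
    "city": {1: "Paris", 2: "Lyon", 3: "Paris"},
    "status": {1: "active", 2: "inactive", 3: "active"},
}

with tempfile.TemporaryDirectory() as staging_dir:
    paths = [write_staging_file(staging_dir, n, c) for n, c in columns.items()]
    # One worker per staging file; threads stand in for separate machines.
    with ThreadPoolExecutor(max_workers=len(paths)) as pool:
        profiles = dict(pool.map(profile_staging_file, paths))

print(profiles["city"].most_common(1))
```

Because the staging files are written once and never modified afterwards (the claims call the extracted data static), workers can read them without coordinating with each other.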

9. A computer network including a parallel processing environment, comprising:
- a data source;
  
  multiple projection computers; and
  
  one or more client computers connected to the multiple projection computers and having computer-readable code embedded therein for programming the projection computers to perform a method of profiling a data set in the parallel processing environment, wherein the method comprises:
  
  partitioning an initial data set stored in a row based format vertically according to multiple attribute subsets, wherein partitioning comprises extracting data for the multiple attribute subsets from the initial data set;
  
  storing the extracted data for the multiple attribute subsets in a plurality of staging files, the plurality of staging files including extracted data that is static;
  
  profiling the multiple attribute subsets based on a specific attribute value, wherein different staging files are provided to different projection computers in the multiple projection computers for profiling;
  
  determining a set of row identifiers that satisfy the specific attribute value identified in the profiling of the different staging files processed on the multiple projection computers;
  
  extracting, for each row identifier in the set of row identifiers, a row identifier, a column identifier, and a value of an attribute for a set of columns associated with each row identifier;
  
  merging the row identifier, the column identifier, and the value of the attribute for the set of columns associated with each row identifier to form a profiled subset of the initial data set; and
  
  transmitting, displaying or storing the profiled subset of the initial data set, a further processed version, or combinations thereof.
- Dependent claims (10, 11, 12, 13, 14, 15):
  - 10. The computer network of claim 9, wherein the initial data set comprises a table having m rows and n columns.
  - 11. The computer network of claim 10, wherein each column corresponds to a specific attribute and each row corresponds to a row identifier.
  - 12. The computer network of claim 9, wherein the method further comprises assigning a projection computer in the multiple projection computers to each of the multiple attribute data sets.
  - 13. The computer network of claim 12, wherein the profiling comprises concurrently profiling the multiple data sets assigned to the different projection computer.
  - 14. The computer network of claim 9, wherein the merging comprises removing duplicate subject sets of row identifiers.
  - 15. The computer network of claim 9, wherein the merging comprises sorting the row identifiers.
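
The merge variants in claims 14 and 15 (removing duplicates and sorting by row identifier) might look like the sketch below. The triple layout `(row id, column id, value)` follows the wording of the independent claim; the function name `merge_samples` and the sample data are assumptions for illustration only.

```python
def merge_samples(*partial_results):
    """Merge (row id, column id, value) triples from several workers.

    Duplicates are dropped by collecting the triples in a set, and the
    result is ordered by row identifier, then by column identifier.
    """
    merged = set()
    for triples in partial_results:
        merged.update(triples)
    return sorted(merged, key=lambda t: (t[0], t[1]))

# Hypothetical partial results returned by two projection computers.
from_machine_a = [(3, "name", "Cy"), (1, "name", "Ann")]
from_machine_b = [(1, "name", "Ann"), (1, "status", "active")]

merged = merge_samples(from_machine_a, from_machine_b)
print(merged)
```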

16. One or more non-transitory processor-readable media having embedded therein processor-readable code to program one or more processors to perform a method of profiling a data set in a parallel processing environment, wherein the method comprises:
- partitioning an initial data set stored in a row based format vertically according to multiple attribute subsets, wherein partitioning comprises extracting data for the multiple attribute subsets from the initial data set;
  
  storing the extracted data for the multiple attribute subsets in a plurality of staging files, the plurality of staging files including extracted data that is static;
  
  profiling the multiple attribute subsets based on a specific attribute value, wherein different staging files are provided to different machines in a plurality of machines for profiling;
  
  determining a set of row identifiers that satisfy the specific attribute value identified in the profiling of the different staging files processed on the plurality of machines;
  
  extracting, for each row identifier in the set of row identifiers, a row identifier, a column identifier, and a value of an attribute for a set of columns associated with each row identifier;
  
  merging the row identifier, the column identifier, and the value of the attribute for the set of columns associated with each row identifier to form a profiled subset of the initial data set; and
  
  transmitting, displaying or storing the profiled subset of the initial data set, a further processed version, or combinations thereof.
- Dependent claims (17, 18, 19, 20, 21, 22):
  - 17. The one or more processor-readable media of claim 16, wherein the initial data set comprises a table having m rows and n columns.
  - 18. The one or more processor-readable media of claim 17, wherein each column corresponds to a specific attribute and each row corresponds to a row identifier.
  - 19. The one or more processor-readable media of claim 16, wherein the method further comprises assigning a different machine in the plurality of machines to each of the attribute subsets.
  - 20. The one or more processor-readable media of claim 19, wherein the profiling comprises concurrently profiling the attribute subsets assigned to the different machines.
  - 21. The one or more processor-readable media of claim 16, wherein the merging comprises removing duplicate subject sets of row identifiers.
  - 22. The one or more processor-readable media of claim 16, wherein the merging comprises sorting the row identifiers.

Current Assignee: Business Objects Incorporated (SAP SE), SAP AG (SAP SE)
Original Assignee: Business Objects Incorporated (SAP SE)
Inventors: Cao, Wu; Ganti, Sridhar; Gadhiraju, Balaji
Primary Examiner(s): Trujillo, James
Assistant Examiner(s): Park, Grace A
Application Number: US12/413,289
Publication Number: US 20100250563A1
Time in Patent Office: 2,503 Days
Field of Search: 707/4, 707/970, 707/974
US Class Current: 1/1
CPC Class Codes:
G06F 16/24532 of parallel queries
G06F 16/24545 Selectivity estimation or d...