Profiling data with source tracking

  • US 9,569,434 B2
  • Filed: 08/02/2013
  • Issued: 02/14/2017
  • Est. Priority Date: 10/22/2012
  • Status: Active Grant
First Claim

1. A method for profiling data stored in a data storage system, the method including:

  • accessing multiple collections of records stored in the data storage system over an interface coupled to the data storage system to store quantitative information for each of the multiple collections of records, the quantitative information for each particular collection including, for at least one selected field of the records in the particular collection, a corresponding list of value count entries, with each value count entry including a value appearing in the selected field and a count of the number of records in which the value appears in the selected field; and

    processing the quantitative information of two or more of the collections to generate profiling summary information, the processing including:

    merging the value count entries of corresponding lists for at least one field from each of at least a first collection and a second collection of the two or more collections to generate a combined list of value count entries, and

    aggregating value count entries of the combined list of value count entries to generate a list of distinct field value entries, at least some of the distinct field value entries identifying a distinct value from at least one of the value count entries and including information quantifying a number of records in which the distinct value appears for each of the two or more collections;

    wherein processing the quantitative information of two or more of the collections includes processing the quantitative information of three or more of the collections; and

    the method further including:

    for a first subset of at least two of the three or more collections, generating profiling summary information from the list of distinct field value entries, the profiling summary information including multiple patterns of results of a join operation between the fields of respective collections of records in the first subset; and

    for a second subset of at least two of the three or more collections, different from the first subset, generating profiling summary information from the list of distinct field value entries, the profiling summary information including multiple patterns of results of a join operation between the fields of respective collections of records in the second subset.
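The steps recited in the claim can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the patented implementation: the function names (`census`, `merge_and_aggregate`, `join_patterns`), the in-memory dictionaries, and the pattern labels (`0`, `1`, `M` for zero, one, or many matching records) are all hypothetical choices made for this example. Each collection is reduced to a list of value count entries, the lists are merged with their source tagged, the merged entries are aggregated into one entry per distinct value with a count per collection, and join patterns are then summarized for two different subsets of three collections.

```python
from collections import Counter, defaultdict


def census(records, field):
    """Value count entries for one field of one collection: {value: count}."""
    return Counter(rec[field] for rec in records)


def merge_and_aggregate(censuses):
    """Merge per-collection value count entries, then aggregate by value.

    censuses maps a (hypothetical) collection name to its {value: count}.
    Returns {value: {collection_name: count}}, with 0 where a value is absent.
    """
    # Combined list: every value count entry, tagged with its source.
    combined = [(value, name, count)
                for name, counts in censuses.items()
                for value, count in counts.items()]
    # Aggregate into one distinct field value entry per value.
    distinct = defaultdict(lambda: dict.fromkeys(censuses, 0))
    for value, name, count in combined:
        distinct[value][name] += count
    return dict(distinct)


def join_patterns(distinct, left, right):
    """Count distinct values by how they would pair up in a join.

    The labels are an assumption of this sketch: "0" = no records,
    "1" = exactly one record, "M" = more than one record.
    """
    def shape(n):
        return "0" if n == 0 else "1" if n == 1 else "M"

    patterns = Counter()
    for counts in distinct.values():
        l, r = counts[left], counts[right]
        if l == 0 and r == 0:
            continue  # value appears in neither collection of this subset
        patterns[f"{shape(l)}:{shape(r)}"] += 1
    return dict(patterns)


# Three small collections sharing a key field "k" (made-up data).
A = [{"k": "x"}, {"k": "x"}, {"k": "y"}]
B = [{"k": "x"}, {"k": "z"}]
C = [{"k": "y"}, {"k": "y"}, {"k": "z"}]

censuses = {"A": census(A, "k"), "B": census(B, "k"), "C": census(C, "k")}
distinct = merge_and_aggregate(censuses)

# First subset (A, B) and a different second subset (B, C),
# both derived from the same list of distinct field value entries.
patterns_ab = join_patterns(distinct, "A", "B")
patterns_bc = join_patterns(distinct, "B", "C")
```

In this sketch the list of distinct field value entries is computed once and reused for both subsets, which mirrors the claim's wording that each subset's summary is generated "from the list of distinct field value entries" without rereading the underlying records.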
