
Data profiling

  • US 8,868,580 B2
  • Filed: 09/15/2004
  • Issued: 10/21/2014
  • Est. Priority Date: 09/15/2003
  • Status: Active Grant
First Claim

1. A method for processing data including:

  • reading data records, in a dataflow graph, from a data source, the dataflow graph including components and links, wherein the links direct flows of data between components;

    profiling the data records using the dataflow graph, including sending the data records on a first link to a partitioning component, the partitioning component partitioning the data records among a plurality of partitions;

    generating, in each partition, by a canonicalize component in communication with the partitioning component, a flow of census elements on a respective one of a first plurality of links, including generating a plurality of census elements for each data record, each census element including:

    a field of the data, and a corresponding value occurring within the field of the data record;

    generating, in each partition, by a rollup component in communication with a corresponding canonicalize component, a flow of output census elements on a respective one of a second plurality of links, including combining occurrences of census elements having the same value for the same field into an output census element including the field, the value, and a count of the number of combined census elements;

    combining the flows of output census elements in each partition, and partitioning the combined flow of output census elements by the field and the value;

    adding counts of the number of occurrences of the same value for the same field for the partitioned flows of output census elements to produce, for each field and corresponding value, a single census element that includes a total count of occurrences of that field and corresponding value in the flow of data records;

    storing profile information based on the single census elements; and

    processing data from the data source, including accessing the stored profile information, reading the data from the data source after profiling the data from the data source, processing the data according to the accessed profile information, and outputting a result of the processing.
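
The profiling steps of the claim (partition, canonicalize, per-partition rollup, repartition by field and value, then sum the counts) can be illustrated with a small in-memory sketch. This is not the patented dataflow-graph implementation: it assumes records are plain Python dicts with hashable values, uses round-robin partitioning, and every function name here (`partition_records`, `canonicalize`, `rollup`, `repartition_by_key`, `profile`) is hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import chain


def partition_records(records, n):
    """Split records among n partitions (round-robin is an assumption)."""
    parts = [[] for _ in range(n)]
    for i, rec in enumerate(records):
        parts[i % n].append(rec)
    return parts


def canonicalize(records):
    """Emit one census element (field, value) per field of each record."""
    for rec in records:
        for field, value in rec.items():
            yield (field, value)


def rollup(census_elements):
    """Combine identical (field, value) pairs into counts within one partition."""
    return Counter(census_elements)


def repartition_by_key(output_elements, n):
    """Route each ((field, value), count) element by its key, so that all
    partial counts for the same field and value land in the same partition."""
    parts = [[] for _ in range(n)]
    for key, count in output_elements:
        parts[hash(key) % n].append((key, count))
    return parts


def profile(records, n_partitions=2):
    """Return {(field, value): total count} over all records."""
    # Per-partition canonicalize + rollup.
    local = [rollup(canonicalize(p))
             for p in partition_records(records, n_partitions)]
    # Combine the per-partition flows, then repartition by (field, value).
    combined = chain.from_iterable(c.items() for c in local)
    totals = {}
    for part in repartition_by_key(combined, n_partitions):
        sums = Counter()
        for key, count in part:
            sums[key] += count  # add partial counts for the same key
        totals.update(sums)     # keys are disjoint across partitions
    return totals


if __name__ == "__main__":
    sample = [
        {"state": "MA", "zip": "02139"},
        {"state": "MA", "zip": "02140"},
        {"state": "NY", "zip": "10001"},
    ]
    for key, total in sorted(profile(sample).items()):
        print(key, total)
```

Because the second partitioning is keyed on the field and value, each key's partial counts meet in exactly one partition, so the final sums can be computed independently per partition and merged without conflicts.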

  • 4 Assignments