Data profiling
First Claim
1. A method for processing data including:
- reading data records, in a dataflow graph, from a data source, the dataflow graph including components and links, wherein the links direct flows of data between components;
- profiling the data records using the dataflow graph including sending the data records on a first link to a partitioning component, the partitioning component partitioning the data records among a plurality of partitions;
- generating, in each partition, by a canonicalize component in communication with the partitioning component, a flow of census elements on a respective one of a first plurality of links including generating a plurality of census elements for each data record, each census element including: a field of the data, a corresponding value occurring within the field of the data record;
- generating, in each partition, by a rollup component in communication with a corresponding canonicalize component, a flow of output census elements on a respective one of a second plurality of links including combining occurrences of census elements having the same value for the same field into an output census element including the field, the value, and a count of the number of combined census elements;
- combining the flows of output census elements in each partition, and partitioning the combined flow of output census elements by the field and the value;
- adding counts of the number of occurrences of the same value for the same field for the partitioned flows of output census elements to produce, for each field and corresponding value, a single census element that includes a total count of occurrences of that field and corresponding value in the flow of data records;
- storing profile information based on the single census elements; and
- processing data from the data source, including accessing the stored profile information, reading the data from the data source after profiling the data from the data source, processing the data according to the accessed profile information, and outputting a result of the processing.
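The profiling steps of the claim can be sketched in a few lines of Python. This is only an illustrative reading of the claim language, not the patented implementation: the function names, the round-robin first partitioning, the use of Python's built-in `hash` for the second partitioning, and the sample records are all assumptions introduced here.

```python
from collections import Counter

def partition_records(records, n):
    """Partitioning component (assumed round-robin): deal records into n partitions."""
    parts = [[] for _ in range(n)]
    for i, rec in enumerate(records):
        parts[i % n].append(rec)
    return parts

def canonicalize(partition):
    """Canonicalize component: emit one (field, value) census element per field of each record."""
    for rec in partition:
        for field, value in rec.items():
            yield (field, value)

def local_rollup(census_elements):
    """Rollup component: combine identical (field, value) elements into counts."""
    return Counter(census_elements)

def repartition(local_counts, n):
    """Combine the per-partition flows and repartition them by (field, value)."""
    parts = [[] for _ in range(n)]
    for counts in local_counts:
        for key, count in counts.items():
            # The same (field, value) key always lands in the same partition.
            parts[hash(key) % n].append((key, count))
    return parts

def global_rollup(partition):
    """Add the partial counts so each (field, value) ends up with one total."""
    totals = Counter()
    for key, count in partition:
        totals[key] += count
    return totals

def profile(records, n=2):
    local = [local_rollup(canonicalize(p)) for p in partition_records(records, n)]
    census = {}
    for part in repartition(local, n):
        # Keys are disjoint across partitions, so a plain dict update is safe.
        census.update(global_rollup(part))
    return census

# Hypothetical sample data.
records = [
    {"state": "MA", "zip": "02421"},
    {"state": "MA", "zip": "02139"},
    {"state": "NY", "zip": "10001"},
]
census = profile(records)
print(census[("state", "MA")])  # 2
```

Because the second partitioning is keyed on the (field, value) pair, every partial count for a given pair reaches the same final rollup, which is what lets each pair end up in exactly one census element with a total count.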
Abstract
Processing data includes profiling data from a data source, including reading the data from the data source, computing summary data characterizing the data while reading the data, and storing profile information that is based on the summary data. The data is then processed from the data source. This processing includes accessing the stored profile information and processing the data according to the accessed profile information.
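As a rough illustration of the two-phase flow in the abstract, the sketch below profiles a source in one pass, stores the profile, and then lets a later pass consult the stored profile. The summary statistics chosen (record count and distinct values per field), the JSON file, and the "drop constant fields" rule are hypothetical choices made for this example; the abstract does not say which profile information is stored or how it steers processing.

```python
import json
import os
import tempfile
from collections import defaultdict

def profile_source(records):
    """Single pass over the data: compute summary data while reading."""
    distinct = defaultdict(set)
    n = 0
    for rec in records:
        n += 1
        for field, value in rec.items():
            distinct[field].add(value)
    return {"records": n, "distinct": {f: len(v) for f, v in distinct.items()}}

def store_profile(profile, path):
    """Persist the profile information so later processing can access it."""
    with open(path, "w") as fh:
        json.dump(profile, fh)

def process_with_profile(records, path):
    """Later pass: read the stored profile and drop fields holding one constant value."""
    with open(path) as fh:
        profile = json.load(fh)
    keep = [f for f, d in profile["distinct"].items() if d > 1]
    return [{f: rec[f] for f in keep if f in rec} for rec in records]

# Hypothetical sample data and storage location.
source = [
    {"country": "US", "state": "MA"},
    {"country": "US", "state": "NY"},
]
path = os.path.join(tempfile.mkdtemp(), "profile.json")
store_profile(profile_source(source), path)
result = process_with_profile(source, path)
print(result)  # [{'state': 'MA'}, {'state': 'NY'}]
```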
28 Claims
1. A method for processing data, recited in full under First Claim above.
Dependent claims: 2-21, 24-28.
22. Software stored on a non-transitory computer-readable storage medium including executable instructions for causing a computer system to:
- read data records, in a dataflow graph, from a data source, the dataflow graph including components and links, wherein the links direct flows of data between components;
- profile the data records using the dataflow graph including sending the data records on a first link to a partitioning component, the partitioning component partitioning the data records among a plurality of partitions;
- generating, in each partition, a flow of census elements, by a canonicalize component in communication with the partitioning component, on a respective one of a first plurality of links including generating a plurality of census elements identifying a field and a corresponding value for each data record, each census element including: a field of the data, a corresponding value occurring within the field of the data record;
- generating, in each partition, a flow of output census elements, by a rollup component in communication with a corresponding canonicalize component, on a respective one of a second plurality of links including combining occurrences of census elements having the same value for the same field into an output census element including the field, the value, and a count of the number of combined census elements;
- combining the flows of output census elements in each partition, and partitioning the combined flow of output census elements by the field and the value;
- adding counts of the number of occurrences of the same value for the same field for the partitioned flows of output census elements to produce, for each field and corresponding value, a single census element that includes a total count of occurrences of that field and corresponding value in the flow of data records;
- store profile information based on the single census elements; and
- process data from the data source by accessing the stored profile information, reading the data from the data source after profiling the data from the data source, processing the data according to the accessed profile information, and outputting a result of the processing.
23. A data processing system including:
- a computer system, including a plurality of processors;
- a data source accessible to the computer system;
- a data storage subsystem including a non-transitory computer-readable storage medium in communication with the computer system;
- with a dataflow graph configured to execute on a plurality of the processors, the dataflow graph including components and links, wherein the links direct flows of data between components and the components including:
  - a read data component configured to read data records from a data source,
  - a first partitioning component connected to the read data component by a link and configured to partition the data records among a plurality of partitions corresponding to different processors;
  - a plurality of canonicalize components, each canonicalize component in communication with the first partitioning component and configured to generate a flow of census elements including generating a plurality of census elements for each data record, each census element including: a field of the data, a corresponding value occurring within the field of the data record;
  - a plurality of local rollup components, each local rollup component in communication with a canonicalize component and configured to generate, in each partition, a flow of output census elements including combining occurrences of census elements having the same value for the same field into an output census element including the field, the value, and a count of the number of combined census elements;
  - a second partitioning component connected to each local rollup component in the plurality of local rollup components by a link and configured to combine the flows of output census elements in each partition, and to partition the combined flow of output census elements by the field and the value;
  - a plurality of global rollup components, each global rollup component connected to the second partitioning component by a link and configured to: add counts of the number of occurrences of the same value for the same field for the partitioned flows of output census elements to produce, for each field and corresponding value, a single census element that includes a total count of occurrences of that field and corresponding value in the flow of data records, and store profile information based on the single census elements; and
- a processing module connected over communication paths to the data source and the data storage subsystem and configured to execute on the computer system to access the stored profile information, to read data from the data source after the profiling module reads the data from the data source, to process the data from the data source according to the accessed profile information, and to output a result of the processing.
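The component layout of the system claim (one canonicalize and local rollup pair per partition, a second partitioning step, then several global rollups) maps naturally onto a worker pool. The sketch below is a loose illustration under stated assumptions: worker threads stand in for the claimed processors, the striding split and the `hash`-based second partitioning are choices made here, and none of the names come from the patent.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def canonicalize_and_rollup(partition):
    """One canonicalize component feeding one local rollup component."""
    return Counter((field, value) for rec in partition for field, value in rec.items())

def global_rollup(pairs):
    """Global rollup component: add the partial counts routed to this partition."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return totals

def run_graph(records, workers=2):
    # First partitioning component (assumed): stride the records across workers.
    parts = [records[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each partition is canonicalized and rolled up locally, concurrently.
        local = list(pool.map(canonicalize_and_rollup, parts))
        # Second partitioning component: route partial counts by (field, value).
        shuffled = [[] for _ in range(workers)]
        for counts in local:
            for key, count in counts.items():
                shuffled[hash(key) % workers].append((key, count))
        totals = list(pool.map(global_rollup, shuffled))
    census = Counter()
    for t in totals:
        census.update(t)  # keys are disjoint across global rollups
    return census

# Hypothetical sample data.
rows = [{"color": "red"}, {"color": "blue"}, {"color": "red"}, {"color": "red"}]
print(run_graph(rows)[("color", "red")])  # 3
```

Local rollups shrink each partition's flow before the second partitioning, so only one partial count per distinct (field, value) pair per partition has to cross between workers.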
Specification