Data profiling
First Claim
1. A method for processing data including:
  reading data records from a data source;
  profiling the data records, including:
    sending the data records to a partitioning component, the partitioning component partitioning the data records into a plurality of partitions;
    generating, in each partition, a plurality of census elements for each data record, each census element including:
      a field of the data; and
      a corresponding value occurring within the field of the data record;
    in each partition, combining occurrences of census elements having the same value for the same field into an output census element including the field, the value, and a count of the number of combined census elements;
    partitioning the output census elements by the field and the value included in each output census element, wherein output census elements that have the same value for the same field are partitioned into the same partition; and
    adding counts of the number of occurrences of the same value for the same field for the partitioned output census elements to produce, for each field and corresponding value, a single census element that includes a total count of occurrences of that field and corresponding value in the data records;
  storing profile information; and
  processing the data records, including:
    accessing the stored profile information;
    reading the data records from the data source;
    processing the data records according to the profile information; and
    outputting a result of the processing.
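The profiling steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the round-robin first partitioning, and the CRC32-based routing by (field, value) are all assumptions chosen for illustration, and field values are assumed to be hashable.

```python
from collections import Counter
from zlib import crc32

def partition_records(records, n):
    """Split records into n partitions (round-robin is an assumed scheme)."""
    parts = [[] for _ in range(n)]
    for i, rec in enumerate(records):
        parts[i % n].append(rec)
    return parts

def census_elements(record):
    """Yield one (field, value) census element per field of a record."""
    for field, value in record.items():
        yield (field, value)

def local_rollup(partition):
    """Within one partition, combine equal (field, value) pairs into counts."""
    return Counter(elem for rec in partition for elem in census_elements(rec))

def repartition(local_counts, n):
    """Route each output census element by a hash of its (field, value) key,
    so equal keys from every partition land in the same new partition."""
    parts = [[] for _ in range(n)]
    for counts in local_counts:
        for key, count in counts.items():
            index = crc32(repr(key).encode("utf-8")) % n
            parts[index].append((key, count))
    return parts

def global_rollup(partition):
    """Add the partial counts for each (field, value) key."""
    totals = Counter()
    for key, count in partition:
        totals[key] += count
    return totals

def profile(records, n=2):
    """Run the whole census: one total count per (field, value)."""
    local = [local_rollup(p) for p in partition_records(records, n)]
    census = {}
    for part in repartition(local, n):
        # Keys never collide across partitions because routing is by key.
        census.update(global_rollup(part))
    return census
```

Because the second partitioning is keyed on the field and value together, each global rollup sees every partial count for its keys, which is why the per-partition results can simply be merged at the end.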
Abstract
Processing data includes profiling data from a data source, including reading the data from the data source, computing summary data characterizing the data while reading the data, and storing profile information that is based on the summary data. The data is then processed from the data source. This processing includes accessing the stored profile information and processing the data according to the accessed profile information.
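The two-phase flow in the abstract (profile while reading, store the profile, then process according to it) might look like the following Python sketch. Everything concrete here is an assumption made for illustration: the JSON file as the profile store, the helper names, and in particular the "flag rare values" rule used as the profile-driven processing step, which the abstract does not specify.

```python
import json
from collections import Counter

def profile_source(read_records):
    """One pass over the source, computing per-field value counts while reading."""
    summary = {}
    for rec in read_records():
        for field, value in rec.items():
            summary.setdefault(field, Counter())[str(value)] += 1
    return summary

def store_profile(summary, path):
    """Persist the profile information (JSON is an assumed storage format)."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(summary, f)

def load_profile(path):
    """Access the stored profile information."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def process_source(read_records, profile_info, min_count=2):
    """Second pass over the source: mark fields whose value occurs fewer than
    min_count times overall (a hypothetical profile-driven rule)."""
    results = []
    for rec in read_records():
        rare = [
            field
            for field, value in rec.items()
            if profile_info.get(field, {}).get(str(value), 0) < min_count
        ]
        results.append({"record": rec, "rare_fields": rare})
    return results
```

The point of the sketch is the ordering: the source is read once to build the summary, and read again later by a step that consults the stored profile rather than recomputing it.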
54 Claims
1. A method for processing data, as recited in full under First Claim above. Dependent claims: 2-21.
22. A non-transitory medium storing a computer program including instructions for causing a computer to:
  read data records from a data source;
  profile the data records, including:
    sending the data records to a partitioning component, the partitioning component partitioning the data records into a plurality of partitions;
    generating, in each partition, a plurality of census elements for each data record, each census element including:
      a field of the data; and
      a corresponding value occurring within the field of the data record;
    in each partition, combining occurrences of census elements having the same value for the same field into an output census element including the field, the value, and a count of the number of combined census elements;
    partitioning the output census elements by the field and the value included in each output census element, wherein output census elements that have the same value for the same field are partitioned into the same partition; and
    adding counts of the number of occurrences of the same value for the same field for the partitioned output census elements to produce, for each field and corresponding value, a single census element that includes a total count of occurrences of that field and corresponding value in the data records;
  store profile information; and
  process the data records, including:
    accessing the stored profile information;
    reading the data records from the data source;
    processing the data records according to the profile information; and
    outputting a result of the processing.
Dependent claims: 23-37.
38. A data processing system including:
  a computer system, including a plurality of processors;
  a data source accessible to the computer system;
  a data storage subsystem in communication with the computer system;
  a read data component configured to read data records from a data source;
  a first partitioning component in communication with the read data component, the first partitioning component configured to partition the data records into a plurality of partitions;
  a plurality of canonicalize components, each canonicalize component in communication with the first partitioning component and configured to generate, in each partition, a plurality of census elements for each data record, each census element including:
    a field of the data; and
    a corresponding value occurring within the field of the data record;
  a plurality of local rollup components, each local rollup component in communication with a canonicalize component and configured to, in each partition, combine occurrences of census elements having the same value for the same field into an output census element including the field, the value, and a count of the number of combined census elements;
  a second partitioning component in communication with each local rollup component and configured to partition the output census elements by the field and the value included in each output census element, wherein output census elements that have the same value for the same field are partitioned into the same partition;
  a plurality of global rollup components, each global rollup component in communication with the second partitioning component and configured to:
    add counts of the number of occurrences of the same value for the same field for the partitioned output census elements to produce, for each field and corresponding value, a single census element that includes a total count of occurrences of that field and corresponding value in the data records; and
    store profile information; and
  a processing module in communication with the data source and the data storage subsystem and configured to process the data records, including:
    accessing the stored profile information;
    reading the data records from the data source;
    processing the data records according to the profile information; and
    outputting a result of the processing.
Dependent claims: 39-48.
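The component chain of claim 38 (first partitioning, canonicalize, local rollup, second partitioning, global rollup) can be sketched as a small staged pipeline in Python. This is an illustration under stated assumptions, not the claimed system: a thread pool stands in for the plurality of processors, canonicalizing is assumed to mean rendering each value as a string, and the slicing and CRC32 routing schemes are arbitrary choices.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from zlib import crc32

def first_partition(records, n):
    """First partitioning component: slice records into n partitions."""
    return [records[i::n] for i in range(n)]

def canonicalize(partition):
    """Canonicalize component: emit (field, value) census elements, with
    values rendered as strings (an assumed canonical form)."""
    return [(field, str(value)) for rec in partition for field, value in rec.items()]

def local_rollup(elements):
    """Local rollup component: count equal census elements in one partition."""
    return Counter(elements)

def second_partition(local_outputs, n):
    """Second partitioning component: route by (field, value) so equal keys
    from all local rollups reach the same global rollup."""
    parts = [[] for _ in range(n)]
    for counts in local_outputs:
        for key, count in counts.items():
            parts[crc32(repr(key).encode("utf-8")) % n].append((key, count))
    return parts

def global_rollup(partition):
    """Global rollup component: add partial counts into one total per key."""
    totals = Counter()
    for key, count in partition:
        totals[key] += count
    return dict(totals)

def run_profile(records, n=2):
    """Wire the components together, one worker per partition at each stage."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        canon = list(pool.map(canonicalize, first_partition(records, n)))
        local = list(pool.map(local_rollup, canon))
        merged = list(pool.map(global_rollup, second_partition(local, n)))
    profile_info = {}
    for part in merged:
        profile_info.update(part)
    return profile_info
```

Running the local rollup before the second partitioning keeps the volume of data exchanged between stages proportional to the number of distinct values per partition rather than to the number of records.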
49. A data processing system including:
  means for reading data records from a data source;
  means for profiling the data records, including:
    sending the data records to a partitioning component, the partitioning component partitioning the data records into a plurality of partitions;
    generating, in each partition, a plurality of census elements for each data record, each census element including:
      a field of the data; and
      a corresponding value occurring within the field of the data record;
    in each partition, combining occurrences of census elements having the same value for the same field into an output census element including the field, the value, and a count of the number of combined census elements;
    partitioning the output census elements by the field and the value included in each output census element, wherein output census elements that have the same value for the same field are partitioned into the same partition; and
    adding counts of the number of occurrences of the same value for the same field for the partitioned output census elements to produce, for each field and corresponding value, a single census element that includes a total count of occurrences of that field and corresponding value in the data records;
  means for storing profile information; and
  means for processing the data records, including:
    accessing the stored profile information;
    reading the data records from the data source;
    processing the data records according to the profile information; and
    outputting a result of the processing.
Dependent claims: 50-54.
Specification