Techniques for integrating validation results in data profiling and related systems and methods
Abstract
According to some aspects, techniques for configuring a data processing system are provided that increase flexibility and efficiency of generation of a data profile of a dataset. The data processing system may produce a value census and a validation census of the dataset in separate processing steps. The value census may then be enriched with contents of the validation census by processing the validation census in a manner that allows matching of field-value pairs of the dataset between the two censuses.
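The two-census approach described above can be illustrated with a minimal Python sketch. It is not the patented implementation. The record layout (one dict per record), the function names, the example rules, and the issue labels are all hypothetical, and the "second value" of the validation census is assumed here to be a count of flagged occurrences. The sketch builds a value census and a validation census in separate passes, then enriches the value census by matching field-value pairs between the two.

```python
from collections import Counter

# Hypothetical example data: each record is a dict keyed by field name.
records = [
    {"age": "34", "zip": "02139"},
    {"age": "-1", "zip": "02139"},
    {"age": "-1", "zip": "ABCDE"},
]

# Hypothetical validation specification:
# field name -> list of (issue label, test returning True when the value is invalid).
validation_spec = {
    "age": [("negative", lambda v: v.lstrip("-").isdigit() and int(v) < 0)],
    "zip": [("not_five_digits", lambda v: not (len(v) == 5 and v.isdigit()))],
}


def build_value_census(records):
    """Count how often each (field, value) pair occurs in the dataset."""
    return Counter((f, v) for rec in records for f, v in rec.items())


def build_validation_census(records, spec):
    """Count how often each (field, value, issue) triple is flagged invalid.

    Assumption: the count stands in for the claim's "second value".
    """
    census = Counter()
    for rec in records:
        for field, value in rec.items():
            for issue, is_invalid in spec.get(field, []):
                if is_invalid(value):
                    census[(field, value, issue)] += 1
    return census


def build_profile(value_census, validation_census):
    """Enrich value-census entries with matching validation-census entries."""
    # Index the validation census by (field, value) so matching is a lookup.
    issues_by_pair = {}
    for (field, value, issue), n in validation_census.items():
        issues_by_pair.setdefault((field, value), []).append((issue, n))

    profile = []
    for (field, value), count in value_census.items():
        profile.append({
            "field": field,
            "value": value,
            "count": count,
            # Empty list when no validation-census entry matches this pair.
            "issues": issues_by_pair.get((field, value), []),
        })
    return profile


value_census = build_value_census(records)
validation_census = build_validation_census(records, validation_spec)
profile = build_profile(value_census, validation_census)
```

Because the two censuses are produced independently, the validation rules in this sketch could be changed and the validation census rebuilt without recounting values; only the final matching step would need to run again.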
15 Claims
1. A computer-implemented method of operating a data processing system to generate a data profile based on:
a dataset having an associated record format defining a plurality of fields;
a value census for the dataset comprising a first plurality of values each having an associated field of the plurality of fields and a plurality of first count values, wherein a first count value indicates a number of times a respective field and value combination occurs in the at least one dataset; and
a validation specification comprising a plurality of validation rules defining criteria for invalidity for one or more fields of the plurality of fields,
the method comprising:
generating a validation census based at least in part on the dataset and the validation specification, the validation census comprising a second plurality of values each having an associated field of the plurality of fields, an indication of invalidity, and a second value; and
generating a data profile of the at least one dataset based at least in part on the value census and the validation census, wherein generating the data profile comprises:
matching ones of the second plurality of values of the validation census and their associated fields with ones of the first plurality of values of the value census and their associated fields; and
for ones of the first plurality of values of the value census and their associated fields matching at least one of the second plurality of values of the validation census and their associated fields, recording in the data profile the value associated with the field, at least one indication of invalidity based on the validation census, and at least one of the second values.
8. A computer system comprising:
at least one processor;
at least one user interface device; and
at least one computer readable medium comprising processor-executable instructions that, when executed, cause the at least one processor to perform a method of generating a data profile based on:
a dataset having an associated record format defining a plurality of fields;
a value census for the dataset comprising a first plurality of values each having an associated field of the plurality of fields and a plurality of first count values, wherein a first count value indicates a number of times a respective field and value combination occurs in the at least one dataset; and
a validation specification comprising a plurality of validation rules defining criteria for invalidity for one or more fields of the plurality of fields,
the method comprising:
generating, based at least in part on the dataset and the validation specification, a validation census comprising a second plurality of values each having an associated field of the plurality of fields, an indication of invalidity, and a second value; and
generating, based at least in part on the value census and the validation census, a data profile of the at least one dataset by:
matching ones of the second plurality of values of the validation census and their associated fields with ones of the first plurality of values of the value census and their associated fields; and
for ones of the first plurality of values of the value census and their associated fields matching at least one of the second plurality of values of the validation census and their associated fields, recording in the data profile the value associated with the field, at least one indication of invalidity based on the validation census, and at least one of the second values.
15. A computer system for generating a data profile based on:
a dataset having an associated record format defining a plurality of fields;
a value census for the dataset comprising a first plurality of values each having an associated field of the plurality of fields and a plurality of first count values, wherein a first count value indicates a number of times a respective field and value combination occurs in the at least one dataset; and
a validation specification comprising a plurality of validation rules defining criteria for invalidity for one or more fields of the plurality of fields;
comprising:
at least one processor;
means for generating, based at least in part on the dataset and the validation specification, a validation census comprising a second plurality of values each having an associated field of the plurality of fields, an indication of invalidity, and a second value; and
means for generating, based at least in part on the value census and the validation census, a data profile of the at least one dataset by:
matching ones of the second plurality of values of the validation census and their associated fields with ones of the first plurality of values of the value census and their associated fields; and
for ones of the first plurality of values of the value census and their associated fields matching at least one of the second plurality of values of the validation census and their associated fields, recording in the data profile the value associated with the field, at least one indication of invalidity based on the validation census, and at least one of the second values.
Specification