Method and system for accelerated data quality enhancement
First Claim
1. A computer-implemented method comprising:
generating a set of candidate conditional functional dependencies based on a set of candidate seeds by using an ontology of a data set, said data set comprising records comprising a plurality of attributes and a plurality of values for said attributes, said plurality of attributes comprising attributes having multiple and different values, wherein said ontology comprises links that indicate which of said attributes are related, said candidate seeds comprising instances of related attributes;
applying said candidate conditional functional dependencies individually to said data set to obtain a set of corresponding result values for said candidate conditional functional dependencies;
refining said candidate conditional functional dependencies individually, said refining comprising, for each of said conditional functional dependencies;
incrementing a first count of records in a first subset of said plurality of records that are consistent with a conditional functional dependency, wherein all values in a pattern tuple of said conditional functional dependency match respective values in a record that is consistent with said conditional functional dependency;
incrementing a second count of records in said first subset of said plurality of records that are inconsistent with said conditional functional dependency, wherein all values in a pattern tuple of the antecedent of said conditional functional dependency match respective values, but values in said pattern tuple of the consequent of said conditional functional dependency do not match respective values, in a record that is inconsistent with said conditional functional dependency;
incrementing a third count of records in said first subset of said plurality of records that are not consistent with said conditional functional dependency and are not inconsistent with said conditional functional dependency;
determining whether a first measure based on said first and third counts satisfies a first threshold value, wherein if said first measure fails to satisfy said first threshold value then a condition is removed from said antecedent of said conditional functional dependency and said refining then continues for a second subset of said plurality of records; and
determining whether a second measure based on said second and third counts satisfies a second threshold value, wherein if said second measure fails to satisfy said second threshold value then said first measure is reduced and said refining then continues for said second subset of said plurality of records;
terminating said applying and said refining when said candidate conditional functional dependencies individually reach a quiescent state; and
selecting a relevant set of said candidate conditional functional dependencies to be used as data quality rules for said data set.
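The three counts recited in the refining step can be illustrated with a minimal Python sketch. It is not the patented implementation: the `CFD` class, the dictionary representation of records and pattern tuples, and the helper names `classify`, `count_records`, and `drop_condition` are all hypothetical, chosen only for illustration. The claim does not give formulas for the first and second measures, so none are assumed here.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class CFD:
    """Hypothetical representation of a conditional functional dependency.

    Both parts of the pattern tuple map an attribute name to a required value.
    """
    antecedent: dict
    consequent: dict


def classify(record, cfd):
    """Place one record in one of the three categories named in the claim."""
    # Antecedent pattern does not match: neither consistent nor inconsistent.
    if any(record.get(a) != v for a, v in cfd.antecedent.items()):
        return "neither"
    # Antecedent matches and every consequent value matches: consistent.
    if all(record.get(a) == v for a, v in cfd.consequent.items()):
        return "consistent"
    # Antecedent matches but the consequent does not: inconsistent.
    return "inconsistent"


def count_records(records, cfd):
    """Increment the first, second, and third counts over a subset of records."""
    counts = Counter(consistent=0, inconsistent=0, neither=0)
    for record in records:
        counts[classify(record, cfd)] += 1
    return counts


def drop_condition(cfd, attribute):
    """Return a copy of the CFD with one antecedent condition removed.

    Which condition to remove is an assumption left to the caller.
    """
    remaining = {a: v for a, v in cfd.antecedent.items() if a != attribute}
    return CFD(antecedent=remaining, consequent=dict(cfd.consequent))


if __name__ == "__main__":
    rule = CFD(antecedent={"zip": "10001"}, consequent={"city": "New York"})
    subset = [
        {"zip": "10001", "city": "New York"},
        {"zip": "10001", "city": "Boston"},
        {"zip": "02134", "city": "Boston"},
    ]
    print(count_records(subset, rule))
```

A caller would derive its own measures from these counts and compare them with its thresholds; `drop_condition` corresponds to the step in which a condition is removed from the antecedent when the first measure fails its threshold.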
Abstract
Embodiments of the present invention solve the technical problem of identifying, collecting, and managing rules that improve poor-quality data on enterprise initiatives ranging from data governance to business intelligence. In a specific embodiment of the present invention, a method is provided for producing data quality rules for a data set. A set of candidate conditional functional dependencies is generated, comprising candidate seeds of attributes that are within a certain degree of relatedness in the ontology of the data set. The candidate conditional functional dependencies are then applied to the data and refined until they reach a quiescent state, in which they have not been refined even though the data they have been applied to has been stable. The resulting refined candidate conditional functional dependencies are the data enhancement rules for the data set and other related data sets. In another specific embodiment of the present invention, a computer system for the development of data quality rules is provided having a rule repository, a data quality rules discovery engine, and a user interface.
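As an illustration of how candidate seeds might be drawn from an ontology, the following Python sketch treats the ontology as an undirected graph of attribute links and interprets "degree of relatedness" as a maximum number of hops between two attributes. Both the adjacency-dictionary representation and the hop-count interpretation are assumptions made for this example, and the function names `related_within` and `candidate_seeds` are hypothetical.

```python
from collections import deque


def related_within(ontology, start, max_hops):
    """Breadth-first search: attributes reachable from start in <= max_hops links."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the allowed degree of relatedness
        for neighbor in ontology.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return {attr for attr in distance if attr != start}


def candidate_seeds(ontology, max_hops):
    """Return unordered attribute pairs that are related within max_hops."""
    seeds = set()
    for attribute in ontology:
        for other in related_within(ontology, attribute, max_hops):
            seeds.add(frozenset((attribute, other)))
    return seeds


if __name__ == "__main__":
    # Hypothetical ontology: each key lists the attributes it is linked to.
    ontology = {
        "zip": ["city"],
        "city": ["zip", "state"],
        "state": ["city"],
        "phone": [],
    }
    for seed in sorted(sorted(s) for s in candidate_seeds(ontology, 1)):
        print(seed)
```

Raising `max_hops` admits more loosely related attribute pairs as seeds, which enlarges the set of candidate dependencies to be refined.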
23 Claims
13. A computer-implemented method comprising:
generating a set of candidate conditional functional dependencies based on a set of candidate seeds by using an ontology of a data set, said data set comprising a plurality of records comprising a plurality of attributes and a plurality of values for said attributes, said plurality of attributes comprising attributes having multiple and different values, wherein said ontology comprises links that indicate which of said attributes are related, each of said candidate seeds comprising instances of related attributes;
applying said candidate conditional functional dependencies individually to said data set to obtain a set of corresponding result values for each of said candidate conditional functional dependencies;
refining said candidate conditional functional dependencies individually, said refining comprising, for each of said conditional functional dependencies;
incrementing a first count of records in a first subset of said plurality of records that are consistent with a conditional functional dependency, wherein all values in a pattern tuple of said conditional functional dependency match respective values in a record that is consistent with said conditional functional dependency;
incrementing a second count of records in said first subset of said plurality of records that are inconsistent with said conditional functional dependency, wherein all values in a pattern tuple of the antecedent of said conditional functional dependency match respective values, but values in said pattern tuple of the consequent of said conditional functional dependency do not match respective values, in a record that is inconsistent with said conditional functional dependency;
incrementing a third count of records in said first subset of said plurality of records that are not consistent with said conditional functional dependency and are not inconsistent with said conditional functional dependency;
determining whether a first measure based on said first and third counts satisfies a first threshold value, wherein if said first measure fails to satisfy said first threshold value then a condition is removed from said antecedent of said conditional functional dependency and said refining then continues for a second subset of said plurality of records; and
determining whether a second measure based on said second and third counts satisfies a second threshold value, wherein if said second measure fails to satisfy said second threshold value then said first measure is reduced and said refining then continues for said second subset of said plurality of records;
terminating said applying and said refining when said candidate conditional functional dependencies individually reach a quiescent state;
selecting a relevant set of said candidate conditional functional dependencies to be used as said data quality rules for said data set; and
enhancing the data quality of said data set by checking the data of said data set against said relevant set and screening said data if said data does not follow a rule contained in said relevant set.
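The final step of claim 13, checking data against the selected rules and screening data that does not follow a rule, can be sketched as follows. This is an illustrative reading only: representing each rule as an (antecedent, consequent) pair of dictionaries, and interpreting "screening" as separating violating records from the rest, are assumptions; the names `violates` and `screen` are hypothetical.

```python
def violates(record, antecedent, consequent):
    """True when the record matches the antecedent pattern but not the consequent."""
    matches_antecedent = all(record.get(a) == v for a, v in antecedent.items())
    matches_consequent = all(record.get(a) == v for a, v in consequent.items())
    return matches_antecedent and not matches_consequent


def screen(records, rules):
    """Split records into those that follow every rule and those screened out."""
    passed, screened = [], []
    for record in records:
        if any(violates(record, ant, con) for ant, con in rules):
            screened.append(record)
        else:
            passed.append(record)
    return passed, screened


if __name__ == "__main__":
    # Hypothetical rule: records with zip 10001 should have city New York.
    rules = [({"zip": "10001"}, {"city": "New York"})]
    data = [
        {"zip": "10001", "city": "New York"},
        {"zip": "10001", "city": "Boston"},
        {"zip": "02134", "city": "Boston"},
    ]
    passed, screened = screen(data, rules)
    print("passed:", passed)
    print("screened:", screened)
```

Records whose antecedent does not match a rule are left alone, since the rule says nothing about them.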
19. A computer system comprising:
a rule repository operable for storing data quality rules;
a graphical user interface comprising a display window and capable of receiving a data set, said data set comprising a plurality of records comprising a plurality of attributes and a plurality of values for said attributes, said plurality of attributes comprising attributes having multiple values, an ontology comprising links that indicate which of said attributes are related, and a set of rule generation parameters;
a data quality rules discovery engine capable of receiving said data set, said ontology, and said set of rule generation parameters from said user interface, generating said set of data quality rules, and sending said set of data quality rules to said rule repository, wherein data quality rules generated by said data quality rules discovery engine are displayed in said display window;
wherein said data quality rules discovery engine formulates a set of candidate conditional functional dependencies based on a set of candidate seeds by using said ontology, said candidate seeds comprising instances of related attributes; and
wherein said data quality rules discovery engine refines said set of candidate conditional functional dependencies by;
incrementing a first count of records in a first subset of said plurality of records that are consistent with a conditional functional dependency, wherein all values in a pattern tuple of said conditional functional dependency match respective values in a record that is consistent with said conditional functional dependency;
incrementing a second count of records in said first subset of said plurality of records that are inconsistent with said conditional functional dependency, wherein all values in a pattern tuple of the antecedent of said conditional functional dependency match respective values, but values in said pattern tuple of the consequent of said conditional functional dependency do not match respective values, in a record that is inconsistent with said conditional functional dependency;
incrementing a third count of records in said first subset of said plurality of records that are not consistent with said conditional functional dependency and are not inconsistent with said conditional functional dependency;
determining whether a first measure based on said first and third counts satisfies a first threshold value, wherein if said first measure fails to satisfy said first threshold value then a condition is removed from said antecedent of said conditional functional dependency and said refining then continues for a second subset of said plurality of records; and
determining whether a second measure based on said second and third counts satisfies a second threshold value, wherein if said second measure fails to satisfy said second threshold value then said first measure is reduced and said refining then continues for said second subset of said plurality of records;
wherein said data quality rules discovery engine terminates refining of said set of candidate conditional functional dependencies when said set of conditional functional dependencies reaches a quiescent state and becomes said data quality rules.
Specification