Method and system for accelerated data quality enhancement

  • US 8,700,577 B2
  • Filed: 05/13/2010
  • Issued: 04/15/2014
  • Est. Priority Date: 12/07/2009
  • Status: Active Grant
First Claim

1. A computer-implemented method comprising:

  • generating a set of candidate conditional functional dependencies based on a set of candidate seeds by using an ontology of a data set, said data set comprising records comprising a plurality of attributes and a plurality of values for said attributes, said plurality of attributes comprising attributes having multiple and different values, wherein said ontology comprises links that indicate which of said attributes are related, said candidate seeds comprising instances of related attributes;

    applying said candidate conditional functional dependencies individually to said data set to obtain a set of corresponding result values for said candidate conditional functional dependencies;

    refining said candidate conditional functional dependencies individually, said refining comprising, for each of said conditional functional dependencies:

    incrementing a first count of records in a first subset of said plurality of records that are consistent with a conditional functional dependency, wherein all values in a pattern tuple of said conditional functional dependency match respective values in a record that is consistent with said conditional functional dependency;

    incrementing a second count of records in said first subset of said plurality of records that are inconsistent with said conditional functional dependency, wherein all values in a pattern tuple of the antecedent of said conditional functional dependency match respective values, but values in said pattern tuple of the consequent of said conditional functional dependency do not match respective values, in a record that is inconsistent with said conditional functional dependency;

    incrementing a third count of records in said first subset of said plurality of records that are not consistent with said conditional functional dependency and are not inconsistent with said conditional functional dependency;

    determining whether a first measure based on said first and third counts satisfies a first threshold value, wherein if said first measure fails to satisfy said first threshold value then a condition is removed from said antecedent of said conditional functional dependency and said refining then continues for a second subset of said plurality of records; and

    determining whether a second measure based on said second and third counts satisfies a second threshold value, wherein if said second measure fails to satisfy said second threshold value then said first measure is reduced and said refining then continues for said second subset of said plurality of records;

    terminating said applying and said refining when said candidate conditional functional dependencies individually reach a quiescent state; and

    selecting a relevant set of said candidate conditional functional dependencies to be used as data quality rules for said data set.
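
The three counts in the refining step can be illustrated with a minimal sketch. It is not the patented implementation. The claim does not define how the measures are computed or which condition is removed from the antecedent, so the names `CFD`, `count_records`, `first_measure`, and `drop_condition`, the ratio used for the measure, and the rule of dropping the most recently added condition are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple

Record = Dict[str, str]


@dataclass
class CFD:
    """Hypothetical container for one conditional functional dependency."""
    antecedent: Dict[str, str]  # pattern values that select the records the rule applies to
    consequent: Dict[str, str]  # pattern values those records are expected to carry


def matches(pattern: Dict[str, str], record: Record) -> bool:
    """True when every value in the pattern equals the record's value."""
    return all(record.get(attr) == value for attr, value in pattern.items())


def count_records(cfd: CFD, records: Iterable[Record]) -> Tuple[int, int, int]:
    """Return (consistent, inconsistent, neither) counts for a subset of records."""
    consistent = inconsistent = neither = 0
    for record in records:
        if not matches(cfd.antecedent, record):
            neither += 1        # third count: the rule does not apply to this record
        elif matches(cfd.consequent, record):
            consistent += 1     # first count: the whole pattern tuple matches
        else:
            inconsistent += 1   # second count: antecedent matches, consequent does not
    return consistent, inconsistent, neither


def first_measure(consistent: int, neither: int) -> float:
    """Assumed measure: share of consistent records among consistent and neither."""
    total = consistent + neither
    return consistent / total if total else 0.0


def drop_condition(cfd: CFD) -> CFD:
    """Assumed refinement: remove the most recently added antecedent condition."""
    kept = dict(list(cfd.antecedent.items())[:-1])
    return CFD(antecedent=kept, consequent=dict(cfd.consequent))


records = [
    {"city": "Paris", "country": "FR", "currency": "EUR"},
    {"city": "Paris", "country": "FR", "currency": "USD"},
    {"city": "Lyon", "country": "FR", "currency": "EUR"},
    {"city": "Boston", "country": "US", "currency": "USD"},
]
rule = CFD(antecedent={"country": "FR", "city": "Paris"},
           consequent={"currency": "EUR"})

counts = count_records(rule, records)
FIRST_THRESHOLD = 0.5  # assumed value, chosen only for this example
if first_measure(counts[0], counts[2]) < FIRST_THRESHOLD:
    rule = drop_condition(rule)  # generalize the rule, then re-count
refined_counts = count_records(rule, records)
```

With this sample data the initial rule matches few records, so the assumed measure falls below the assumed threshold and the `city` condition is removed. The re-count then covers every record with `country` equal to `FR`. In the claim the refinement continues on a second subset of records. The sketch reuses the same records only to stay short.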
