Method and system for accelerated data quality enhancement

US 8,700,577 B2
Filed: 05/13/2010
Issued: 04/15/2014
Est. Priority Date: 12/07/2009
Status: Active Grant


Abstract

Embodiments of the present invention solve the technical problem of identifying, collecting, and managing rules that improve poor quality data on enterprise initiatives ranging from data governance to business intelligence. In a specific embodiment of the present invention, a method is provided for producing data quality rules for a data set. A set of candidate conditional functional dependencies are generated comprised of candidate seeds of attributes that are within a certain degree of relatedness in the ontology of the data set. The candidate conditional functional dependencies are then applied to the data and refined until they reach a quiescent state where they have not been refined even though the data they have been applied to has been stable. The resulting refined candidate conditional functional dependencies are the data enhancement rules for the data set and other related data sets. In another specific embodiment of the present invention, a computer system for the development of data quality rules is provided having a rule repository, a data quality rules discovery engine, and a user interface.
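
As an illustration of how candidate seeds might be drawn from an ontology, the following is a minimal sketch and not the patented implementation. It assumes the ontology can be modeled as undirected links between attribute names, each with a numeric strength, and that a combination is kept when the average strength of its links meets a threshold. The attribute names, the strengths, and the names `LINKS`, `candidate_seeds`, `size`, and `min_avg_strength` are all hypothetical.

```python
from itertools import combinations

# Hypothetical ontology: undirected links between attributes, each with a
# strength in [0, 1]. Attribute names and strengths are invented for the demo.
LINKS = {
    frozenset({"country", "currency"}): 0.9,
    frozenset({"country", "dial_code"}): 0.8,
    frozenset({"currency", "dial_code"}): 0.3,
    frozenset({"country", "customer_name"}): 0.1,
}


def candidate_seeds(attributes, links, size=2, min_avg_strength=0.5):
    """Return (combination, average strength) pairs that pass the threshold.

    A combination is skipped when any pair inside it has no link at all,
    and discarded when its average link strength is below the threshold.
    """
    seeds = []
    for combo in combinations(sorted(attributes), size):
        strengths = [links.get(frozenset(pair)) for pair in combinations(combo, 2)]
        if any(s is None for s in strengths):
            continue  # at least one pair is unrelated in the ontology
        average = sum(strengths) / len(strengths)
        if average >= min_avg_strength:
            seeds.append((combo, average))
    return seeds


if __name__ == "__main__":
    attrs = ["country", "currency", "dial_code", "customer_name"]
    for combo, average in candidate_seeds(attrs, LINKS):
        print(combo, round(average, 2))
```

The `size` parameter stands in for a user-adjustable number of attributes per seed; raising it to 3 in this toy ontology leaves only the combination of country, currency, and dial code.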

19 Citations

23 Claims

1. A computer-implemented method comprising:
- generating a set of candidate conditional functional dependencies based on a set of candidate seeds by using an ontology of a data set, said data set comprising records comprising a plurality of attributes and a plurality of values for said attributes, said plurality of attributes comprising attributes having multiple and different values, wherein said ontology comprises links that indicate which of said attributes are related, said candidate seeds comprising instances of related attributes;
  
  applying said candidate conditional functional dependencies individually to said data set to obtain a set of corresponding result values for said candidate conditional functional dependencies;
  
  refining said candidate conditional functional dependencies individually, said refining comprising, for each of said conditional functional dependencies:
  
  incrementing a first count of records in a first subset of said plurality of records that are consistent with a conditional functional dependency, wherein all values in a pattern tuple of said conditional functional dependency match respective values in a record that is consistent with said conditional functional dependency;
  
  incrementing a second count of records in said first subset of said plurality of records that are inconsistent with said conditional functional dependency, wherein all values in a pattern tuple of the antecedent of said conditional functional dependency match respective values, but values in said pattern tuple of the consequent of said conditional functional dependency do not match respective values, in a record that is inconsistent with said conditional functional dependency;
  
  incrementing a third count of records in said first subset of said plurality of records that are not consistent with said conditional functional dependency and are not inconsistent with said conditional functional dependency;
  
  determining whether a first measure based on said first and third counts satisfies a first threshold value, wherein if said first measure fails to satisfy said first threshold value then a condition is removed from said antecedent of said conditional functional dependency and said refining then continues for a second subset of said plurality of records; and
  
  determining whether a second measure based on said second and third counts satisfies a second threshold value, wherein if said second measure fails to satisfy said second threshold value then said first measure is reduced and said refining then continues for said second subset of said plurality of records;
  
  terminating said applying and said refining when said candidate conditional functional dependencies individually reach a quiescent state; and
  
  selecting a relevant set of said candidate conditional functional dependencies to be used as data quality rules for said data set.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12):
  - 2. The computer-implemented method from claim 1, wherein:
    - a first number of conditions in each of said candidate conditional functional dependencies can be adjusted by a user prior to generating said set of candidate conditional functional dependencies; and
      
      a second number of attributes in a candidate seed of said set of candidate seeds can be adjusted by a user prior to generating said set of candidate seeds.
  - 3. The computer-implemented method from claim 1, wherein:
    - said set of candidate conditional functional dependencies has a predetermined number of said candidate conditional functional dependencies; and
      
      said predetermined number of conditional functional dependencies can be adjusted by a user.
  - 4. The computer-implemented method from claim 1, wherein said candidate conditional functional dependencies in said relevant set have a corresponding set of combined result signatures that have the best goodness of fit in terms of a maximum degree of coverage of said data set and a minimum proximity between a detected error rate and said predetermined error estimate.
  - 5. The computer-implemented method from claim 4, further comprising:
    - ranking said candidate conditional functional dependencies in said relevant set according to an interestingness factor;
      
      wherein said interestingness factor increases for a particular one of said candidate conditional functional dependencies as a portion of said data set consisting of a data value on which said particular one of said candidate conditional functional dependencies is based on decreases.
  - 6. The computer-implemented method from claim 1, wherein the size of a data segment of said data set that said candidate conditional functional dependencies are applied to during said applying is set by a predetermined scan period.
  - 7. The computer-implemented method from claim 6, wherein said predetermined coverage estimate, said predetermined error estimate, and said predetermined scan period can be adjusted by a user.
  - 8. The computer-implemented method from claim 1, wherein said quiescent state is achieved for a specific one of said candidate conditional functional dependencies when said specific one of said candidate conditional functional dependencies has been applied individually to a series of said data segments without said refining altering said specific candidate conditional functional dependencies, wherein said series of said data segments contain an amount of data points equal in size to a predetermined window period and contain stable data.
  - 9. The computer-implemented method from claim 8, wherein said predetermined window period can be adjusted by a user.
  - 10. The computer-implemented method from claim 1, wherein strength values are associated with said links, each of said links associated with a respective strength value, wherein said generating comprises:
    - computing an average strength value for each combination of related attributes; and
      
      discarding combinations of related attributes having an average strength value that fails to satisfy a threshold value.
  - 11. The computer-implemented method from claim 1, wherein said refining further comprises identifying and eliminating a high entropy attribute from a subset of said plurality of attributes, said subset comprising multiple attributes and associated with a candidate conditional functional dependency, said high entropy attribute having the most different values relative to any of the other attributes in said subset of said plurality of attributes.
  - 12. The computer-implemented method from claim 1, further comprising repeating said applying if said set of corresponding result values does not have a result signature that meets a predetermined expectation, wherein said predetermined expectation is set by a predetermined coverage estimate of a first portion of said data set that is covered by an individual one of said candidate conditional functional dependencies, and a predetermined error estimate of a second portion of said data set that will be erroneous.
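
The counting and threshold steps recited in claim 1 can be sketched as follows. This is a simplified illustration under stated assumptions, not the claimed method itself: a rule is modeled as two dictionaries of attribute/value patterns (antecedent and consequent), and the two measures are assumed to be simple ratios, since the claim only says which counts each measure is based on. The names `classify`, `evaluate`, and the formulas for `first_measure` and `second_measure` are hypothetical.

```python
from collections import Counter


def classify(record, antecedent, consequent):
    """Classify one record against a rule given as two pattern dicts."""
    if any(record.get(attr) != value for attr, value in antecedent.items()):
        return "unmatched"  # neither consistent nor inconsistent
    if all(record.get(attr) == value for attr, value in consequent.items()):
        return "consistent"  # every value in the pattern tuple matches
    return "inconsistent"  # antecedent matches, consequent does not


def evaluate(records, antecedent, consequent, first_threshold, second_threshold):
    """Count records in one subset and test both measures against thresholds."""
    counts = Counter(classify(r, antecedent, consequent) for r in records)
    first = counts["consistent"]
    second = counts["inconsistent"]
    third = counts["unmatched"]

    # ASSUMED formulas: the claim does not spell out how the measures are built.
    first_measure = first / (first + third) if first + third else 0.0
    second_measure = 1 - second / (second + third) if second + third else 1.0

    return {
        "counts": (first, second, third),
        "first_measure": first_measure,
        "second_measure": second_measure,
        # First measure fails: a condition would be removed from the antecedent.
        "remove_condition": first_measure < first_threshold,
        # Second measure fails: the claim says the first measure is then reduced.
        "reduce_first_measure": second_measure < second_threshold,
    }


if __name__ == "__main__":
    subset = [
        {"country": "FR", "currency": "EUR"},
        {"country": "FR", "currency": "EUR"},
        {"country": "FR", "currency": "USD"},
        {"country": "DE", "currency": "EUR"},
    ]
    print(evaluate(subset, {"country": "FR"}, {"currency": "EUR"}, 0.5, 0.6))
```

In either failure case the claim has refinement continue on a second subset of records; a caller would apply the indicated adjustment and then call `evaluate` again on that next subset.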

13. A computer-implemented method comprising:
- generating a set of candidate conditional functional dependencies based on a set of candidate seeds by using an ontology of a data set, said data set comprising a plurality of records comprising a plurality of attributes and a plurality of values for said attributes, said plurality of attributes comprising attributes having multiple and different values, wherein said ontology comprises links that indicate which of said attributes are related, each of said candidate seeds comprising instances of related attributes;
  
  applying said candidate conditional functional dependencies individually to said data set to obtain a set of corresponding result values for each of said candidate conditional functional dependencies;
  
  refining said candidate conditional functional dependencies individually, said refining comprising, for each of said conditional functional dependencies:
  
  incrementing a first count of records in a first subset of said plurality of records that are consistent with a conditional functional dependency, wherein all values in a pattern tuple of said conditional functional dependency match respective values in a record that is consistent with said conditional functional dependency;
  
  incrementing a second count of records in said first subset of said plurality of records that are inconsistent with said conditional functional dependency, wherein all values in a pattern tuple of the antecedent of said conditional functional dependency match respective values, but values in said pattern tuple of the consequent of said conditional functional dependency do not match respective values, in a record that is inconsistent with said conditional functional dependency;
  
  incrementing a third count of records in said first subset of said plurality of records that are not consistent with said conditional functional dependency and are not inconsistent with said conditional functional dependency;
  
  determining whether a first measure based on said first and third counts satisfies a first threshold value, wherein if said first measure fails to satisfy said first threshold value then a condition is removed from said antecedent of said conditional functional dependency and said refining then continues for a second subset of said plurality of records; and
  
  determining whether a second measure based on said second and third counts satisfies a second threshold value, wherein if said second measure fails to satisfy said second threshold value then said first measure is reduced and said refining then continues for said second subset of said plurality of records;
  
  terminating said applying and said refining when said candidate conditional functional dependencies individually reach a quiescent state;
  
  selecting a relevant set of said candidate conditional functional dependencies to be used as said data quality rules for said data set; and
  
  enhancing the data quality of said data set by checking the data of said data set against said relevant set and screening said data if said data does not follow a rule contained in said relevant set.
- Dependent claims (14, 15, 16, 17, 18):
  - 14. The computer-implemented method from claim 13, further comprising continuing to apply said relevant set to enhance the data quality of a group of additional data sets that are related in content to said data set.
  - 15. The computer-implemented method from claim 13, further comprising exporting said relevant set to one of a data quality product and an external data base management system.
  - 16. The computer-implemented method from claim 15, wherein said data quality product is one of TS Discovery, Informatica IDE/IDQ and Oracle Data Integrator.
  - 17. The computer-implemented method from claim 13, wherein said refining further comprises identifying and eliminating a high entropy attribute from a subset of said plurality of attributes, said subset comprising multiple attributes and associated with a candidate conditional functional dependency, said high entropy attribute having the most different values relative to any of the other attributes in said subset of said plurality of attributes.
  - 18. The computer-implemented method from claim 13, further comprising repeating said applying if said set of corresponding result values does not have a result signature that meets a predetermined expectation, wherein said predetermined expectation is set by a predetermined coverage estimate of a first portion of said data set that is covered by an individual one of said candidate conditional functional dependencies, and a predetermined error estimate of a second portion of said data set that will be erroneous.
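
The final step of claim 13, checking data against the selected rules and screening records that do not follow them, might look like the sketch below. It is an illustration only: rules are assumed to be `(antecedent, consequent)` pairs of attribute/value dictionaries, and "screening" is assumed to mean separating violating records from the rest. The function names `violates` and `screen` are hypothetical.

```python
def violates(record, rule):
    """True when the record matches the rule's antecedent but not its consequent."""
    antecedent, consequent = rule
    applies = all(record.get(attr) == value for attr, value in antecedent.items())
    return applies and any(
        record.get(attr) != value for attr, value in consequent.items()
    )


def screen(records, rules):
    """Split records into those that follow every rule and those that do not.

    Flagged records are returned with the indexes of the rules they break,
    so a reviewer can see why each one was screened out.
    """
    passed, flagged = [], []
    for record in records:
        broken = [i for i, rule in enumerate(rules) if violates(record, rule)]
        if broken:
            flagged.append((record, broken))
        else:
            passed.append(record)
    return passed, flagged


if __name__ == "__main__":
    rules = [({"country": "FR"}, {"currency": "EUR"})]
    data = [
        {"country": "FR", "currency": "EUR"},
        {"country": "FR", "currency": "USD"},
        {"country": "DE", "currency": "CHF"},
    ]
    passed, flagged = screen(data, rules)
    print(len(passed), "passed;", len(flagged), "flagged")
```

Records to which no rule applies are treated as passing here, which is a design choice of this sketch rather than something the claim states.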

19. A computer system comprising:
- a rule repository operable for storing data quality rules;
  
  a graphical user interface comprising a display window and capable of receiving a data set, said data set comprising a plurality of records comprising a plurality of attributes and a plurality of values for said attributes, said plurality of attributes comprising attributes having multiple values, an ontology comprising links that indicate which of said attributes are related, and a set of rule generation parameters;
  
  a data quality rules discovery engine capable of receiving said data set, said ontology, and said set of rule generation parameters from said user interface, generating said set of data quality rules, and sending said set of data quality rules to said rule repository, wherein data quality rules generated by said data quality rules discovery engine are displayed in said display window;
  
  wherein said data quality rules discovery engine formulates a set of candidate conditional functional dependencies based on a set of candidate seeds by using said ontology, said candidate seeds comprising instances of related attributes; and
  
  wherein said data quality rules discovery engine refines said set of candidate conditional functional dependencies by:
  
  incrementing a first count of records in a first subset of said plurality of records that are consistent with a conditional functional dependency, wherein all values in a pattern tuple of said conditional functional dependency match respective values in a record that is consistent with said conditional functional dependency;
  
  incrementing a second count of records in said first subset of said plurality of records that are inconsistent with said conditional functional dependency, wherein all values in a pattern tuple of the antecedent of said conditional functional dependency match respective values, but values in said pattern tuple of the consequent of said conditional functional dependency do not match respective values, in a record that is inconsistent with said conditional functional dependency;
  
  incrementing a third count of records in said first subset of said plurality of records that are not consistent with said conditional functional dependency and are not inconsistent with said conditional functional dependency;
  
  determining whether a first measure based on said first and third counts satisfies a first threshold value, wherein if said first measure fails to satisfy said first threshold value then a condition is removed from said antecedent of said conditional functional dependency and said refining then continues for a second subset of said plurality of records; and
  
  determining whether a second measure based on said second and third counts satisfies a second threshold value, wherein if said second measure fails to satisfy said second threshold value then said first measure is reduced and said refining then continues for said second subset of said plurality of records;
  
  wherein said data quality rules discovery engine terminates refining of said set of candidate conditional functional dependencies when said set of conditional functional dependencies reaches a quiescent state and becomes said data quality rules.
- Dependent claims (20, 21, 22, 23):
  - 20. The computer system from claim 19, said graphical user interface further capable of displaying and receiving said rule generation parameters, an address of said data set, an address of a related data set, and an address of said ontology, and said set of data quality rules;
    - and wherein said rule generation parameters can be adjusted by the user through said graphical user interface.
  - 21. The computer system from claim 19, further comprising a data exchanger plug-in capable of exporting a relevant set of said data quality rules to one of a data quality product and an external data base management system.
  - 22. The computer system from claim 19, wherein said data quality rules discovery engine also refines said candidate conditional functional dependencies iteratively if they do not meet a predetermined expectation when applied to said data set, wherein said predetermined expectation is set by a predetermined coverage estimate of a first portion of said data set that is covered by an individual one of said candidate conditional functional dependencies and by a predetermined error estimate of a second portion of said data set that will be erroneous.
  - 23. The computer system from claim 19, wherein said data quality rules discovery engine also refines said candidate conditional functional dependencies by identifying and eliminating a high entropy attribute from a subset of said plurality of attributes, said subset comprising multiple attributes and associated with a candidate conditional functional dependency, said high entropy attribute having the most different values relative to any of the other attributes in said subset of said plurality of attributes.
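
The high entropy attribute elimination recited in claims 11, 17, and 23 can be illustrated with a short sketch. Following the claim wording, the high entropy attribute is taken here to be the one with the most different values among the attributes associated with a candidate rule; a distinct-value count is used rather than a Shannon entropy computation, which is an assumption of this sketch. The function names and the tie-breaking behavior (first attribute wins) are hypothetical.

```python
def most_varied_attribute(records, attributes):
    """Return the attribute with the largest number of distinct values."""
    distinct = {attr: len({r.get(attr) for r in records}) for attr in attributes}
    # max() keeps the first attribute on ties; the claims do not address ties.
    return max(attributes, key=lambda attr: distinct[attr])


def drop_high_entropy_attribute(records, attributes):
    """Remove the most varied attribute from a multi-attribute subset."""
    if len(attributes) < 2:
        return list(attributes)  # the claims require a subset of multiple attributes
    worst = most_varied_attribute(records, attributes)
    return [attr for attr in attributes if attr != worst]


if __name__ == "__main__":
    data = [
        {"customer_id": 1, "country": "FR", "currency": "EUR"},
        {"customer_id": 2, "country": "FR", "currency": "EUR"},
        {"customer_id": 3, "country": "DE", "currency": "EUR"},
    ]
    print(drop_high_entropy_attribute(data, ["customer_id", "country", "currency"]))
```

An identifier-like column such as `customer_id` takes a different value in nearly every record, so a rule conditioned on it would cover almost no data; dropping it is what this sketch demonstrates.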

Current Assignee: Accenture Global Services Limited (Accenture PLC)
Original Assignee: Accenture Global Services GmbH (Accenture PLC)
Inventors: Yeh, Peter Zei-Chan; Puri, Colin Anil
Primary Examiner(s): HICKS, MICHAEL J
Application Number: US12/779,830
Publication Number: US 20110138312A1
Time in Patent Office: 1,433 Days
Field of Search: 707/692
US Class Current: 707/692
CPC Class Codes:
G06F 16/24565 Triggers; Constraints
G06F 16/2465 Query processing support fo...