Detecting duplicate records in databases
First Claim
1. In a database having records stored on a medium, a method for identifying possible duplicate data records comprising:
a) providing multiple records in one or more tables that include multiple fields; and
b) identifying two or more records within a table as possible duplicates by measuring a co-occurrence of data in two or more hierarchically related fields of the table.
Abstract
The invention concerns the detection of duplicate tuples in a database. Previous domain-independent detection of duplicate tuples relied on standard similarity functions (e.g., edit distance, cosine metric) between multi-attribute tuples. However, such prior art approaches result in large numbers of false positives if they are used to identify domain-specific abbreviations and conventions. In accordance with the invention, a process for duplicate detection is implemented based on interpreting records from multiple dimensional tables in a data warehouse, which are associated with hierarchies specified through key-foreign key relationships in a snowflake schema. The invention exploits the extra knowledge available from the table hierarchy to develop a high-quality, scalable duplicate detection process.
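As an illustration only, the co-occurrence idea in the abstract can be sketched as follows. The sketch assumes a two-level hierarchy (city under state) and scores two parent values by how much their sets of child values overlap; the function names, the sample rows, and the choice of normalizing by the smaller child set are assumptions made for this example, not the measure defined in the specification.

```python
from collections import defaultdict


def children_by_parent(rows):
    """Map each parent value to the set of child values that occur with it."""
    groups = defaultdict(set)
    for child, parent in rows:
        groups[parent].add(child)
    return groups


def co_occurrence(groups, a, b):
    """Share of the smaller child set that also occurs under the other parent.

    This normalization is an assumption made for illustration.
    """
    children_a = groups.get(a, set())
    children_b = groups.get(b, set())
    if not children_a or not children_b:
        return 0.0
    shared = children_a & children_b
    return len(shared) / min(len(children_a), len(children_b))


# Hypothetical (city, state) rows; "WA" and "Washington" name the same state.
rows = [
    ("Seattle", "WA"), ("Redmond", "WA"), ("Tacoma", "WA"),
    ("Seattle", "Washington"), ("Redmond", "Washington"),
    ("Madison", "WI"), ("Milwaukee", "WI"),
]
groups = children_by_parent(rows)
score_same_state = co_occurrence(groups, "WA", "Washington")
score_other_state = co_occurrence(groups, "WA", "WI")
```

A textual similarity function would rate "WA" closer to "WI" than to "Washington"; the child sets point the other way, which is the kind of extra knowledge the abstract attributes to the table hierarchy.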
51 Claims
1. In a database having records stored on a medium, a method for identifying possible duplicate data records comprising:
a) providing multiple records in one or more tables that include multiple fields; and
b) identifying two or more records within a table as possible duplicates by measuring a co-occurrence of data in two or more hierarchically related fields of the table.
Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17)
18. The method of claim 1 wherein there are a plurality of hierarchically related fields at multiple field levels of a hierarchy and wherein identifying possible duplicate records is performed in a top down traversal of the related fields to identify possible duplicate field contents in at least two field levels in two or more records.
21. In a database having records stored on a medium, a method for identifying possible duplicate records comprising:
a) evaluating a first containment metric for tokens contained in a first field of multiple records from one or more database tables to select a first set of candidate records as possible duplicate records;
b) evaluating a second containment metric for tokens contained in a second field from the one or more database tables that is hierarchically connected to the first set of candidate records; and
c) identifying two or more records within the one or more tables as possible duplicate records by using the first and second containment metrics to produce an output in the event a duplicate record threshold is met.
Dependent Claims (22, 23, 24)
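The three steps of claim 21 can be pictured with a minimal sketch, offered under stated assumptions rather than as the claimed method. It assumes whitespace tokenization, a containment metric equal to the fraction of one field's tokens found in the other, and separate thresholds `t1` and `t2` for the two fields; the record layout, field names, and threshold values are hypothetical.

```python
def tokens(text):
    """Lower-cased whitespace tokens of a field value (assumed tokenizer)."""
    return set(text.lower().split())


def containment(a, b):
    """Fraction of the tokens of a that also appear in b."""
    tokens_a, tokens_b = tokens(a), tokens(b)
    if not tokens_a:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a)


def best_containment(a, b):
    """Containment taken in whichever direction scores higher."""
    return max(containment(a, b), containment(b, a))


def possible_duplicates(records, first, second, t1, t2):
    """Return index pairs that pass the threshold on both fields."""
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            r, s = records[i], records[j]
            # Step a: the first field selects candidate pairs.
            if best_containment(r[first], s[first]) < t1:
                continue
            # Steps b and c: the hierarchically related field confirms them.
            if best_containment(r[second], s[second]) >= t2:
                pairs.append((i, j))
    return pairs


# Hypothetical customer records with a related city field.
records = [
    {"name": "Compuware Corp", "city": "Joplin"},
    {"name": "Compuware", "city": "Joplin MO"},
    {"name": "Compuware Systems", "city": "Detroit"},
]
found = possible_duplicates(records, "name", "city", 0.8, 0.8)
```

Records 1 and 2 pass the name test, since every token of "Compuware" appears in "Compuware Systems", but their cities share no tokens, so only the pair (0, 1) is reported.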
25. In a database having records stored on a medium, a method for identifying possible duplicate records comprising:
a) providing multiple tables that are related by key-foreign key relations between tables to form a hierarchy of records;
b) identifying duplicate contents within at least one field within the records of a first table;
c) grouping the records of a second table related to said first table based on the identification of the duplicate fields in the first table; and
d) identifying duplicate record pairs from the first and second tables based on a search for duplicate contents of fields in the records of the second table that were grouped based on duplicate contents of fields from records in the first table.
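A rough sketch of the grouping in claim 25 follows, assuming two hypothetical tables (`states` and `cities`) linked by a `state_id` foreign key. Exact matching after normalization stands in for whatever duplicate test the specification actually uses; every table, field, and function name here is an assumption for illustration.

```python
from collections import defaultdict
from itertools import combinations


def normalize(text):
    """Lower-case the value and drop everything except letters and digits."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


def representatives(table, field):
    """Map each record id to the first id seen with the same normalized value."""
    first_id_for_value = {}
    rep_of = {}
    for rec_id, rec in table.items():
        key = normalize(rec[field])
        rep_of[rec_id] = first_id_for_value.setdefault(key, rec_id)
    return rep_of


def duplicate_pairs(table, field):
    """All id pairs in the table that share a representative."""
    rep_of = representatives(table, field)
    return [(a, b) for a, b in combinations(sorted(table), 2)
            if rep_of[a] == rep_of[b]]


# First table (parent) and second table (child), keyed by record id.
states = {1: {"name": "Wash."}, 2: {"name": "WASH"}, 3: {"name": "Oregon"}}
cities = {
    10: {"name": "Seattle", "state_id": 1},
    11: {"name": "seattle", "state_id": 2},
    12: {"name": "Portland", "state_id": 3},
    13: {"name": "Seattle", "state_id": 3},
}

# Step b: duplicates within the first table.
state_rep = representatives(states, "name")
state_pairs = duplicate_pairs(states, "name")

# Step c: group second-table records by the duplicate group of their parent.
grouped = defaultdict(dict)
for city_id, city in cities.items():
    grouped[state_rep[city["state_id"]]][city_id] = city

# Step d: look for duplicates only inside each group.
city_pairs = []
for group in grouped.values():
    city_pairs.extend(duplicate_pairs(group, "name"))
```

Cities 10 and 11 are compared because their parent states were found to be duplicates; city 13 has the same name but sits under a different state, so it is never paired with them.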
26. A system for evaluating records in a database to determine a presence of duplicate records comprising:
a) one or more computers for storing data that is organized according to a hierarchy of related fields in one or more database tables; and
b) a database management system including a processor for selectively extracting records from the one or more database tables and including processor components for evaluating the contents of said records;
c) said processor including a duplicate determination component that i) accesses multiple records in one or more database tables that include multiple fields; and
ii) identifies two or more records within a table as duplicates by measuring a co-occurrence of data in hierarchically related fields of the one or more tables.
Dependent Claims (27, 28, 29, 30, 31)
32. For use with a database having records stored on a medium, a machine readable medium including instructions for identifying possible duplicate data records comprising instructions for:
a) providing multiple records in one or more tables that include multiple fields; and
b) identifying two or more records within a table as possible duplicates by measuring a co-occurrence of data in two or more hierarchically related fields of the table.
Dependent Claims (33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51)
Specification