Detecting duplicate records in databases
First Claim
1. In a database having records stored on a medium, a method for identifying possible duplicate data records comprising:
a) providing multiple records in one or more tables that include multiple fields; and
b) identifying two or more records within a table as possible duplicates by measuring a co-occurrence of data in two or more hierarchically related fields of the table.
Abstract
The invention concerns the detection of duplicate tuples in a database. Previous domain-independent detection of duplicate tuples relied on standard similarity functions (e.g., edit distance, cosine metric) between multi-attribute tuples. However, such prior art approaches result in large numbers of false positives if they are used to identify domain-specific abbreviations and conventions. In accordance with the invention, a process for duplicate detection is implemented based on interpreting records from multiple dimensional tables in a data warehouse, which are associated with hierarchies specified through key-foreign key relationships in a snowflake schema. The invention exploits the extra knowledge available from the table hierarchy to develop a high-quality, scalable duplicate detection process.
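As a rough illustration of what measuring co-occurrence across hierarchically related fields could look like, the sketch below treats two parent-level values (for example, two country names) as possible duplicates when the sets of child-level values recorded under them (for example, states) overlap strongly. This is not the patent's own procedure: the Jaccard overlap score, the 0.5 threshold, the function names, and the sample rows are all assumptions chosen for illustration.

```python
from collections import defaultdict
from itertools import combinations


def children_by_parent(rows):
    """Map each parent-field value to the set of child-field values seen with it."""
    groups = defaultdict(set)
    for child, parent in rows:
        groups[parent].add(child)
    return groups


def cooccurrence(a, b):
    """Jaccard overlap of two child sets (an assumed metric, not the patent's)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def possible_duplicates(rows, threshold=0.5):
    """Return (parent, parent, score) for pairs whose child sets overlap enough."""
    groups = children_by_parent(rows)
    pairs = []
    for p, q in combinations(sorted(groups), 2):
        score = cooccurrence(groups[p], groups[q])
        if score >= threshold:
            pairs.append((p, q, score))
    return pairs


# Hypothetical (state, country) rows from a two-level hierarchy.
ROWS = [
    ("WA", "USA"), ("CA", "USA"), ("OR", "USA"),
    ("WA", "United States"), ("CA", "United States"),
    ("BC", "Canada"), ("ON", "Canada"),
]

if __name__ == "__main__":
    for p, q, score in possible_duplicates(ROWS):
        print(f"{p!r} and {q!r} may be duplicates (overlap {score:.2f})")
```

A plain string-similarity function would score "USA" and "United States" as quite different, whereas the shared child values flag them as candidates. A real system would still need a further check before merging any records.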
Specification