Detecting duplicate records in database
Abstract
The invention concerns the detection of duplicate tuples in a database. Previous domain-independent detection of duplicate tuples relied on standard similarity functions (e.g., edit distance, cosine metric) between multi-attribute tuples. However, such prior art approaches produce large numbers of false positives when used to identify domain-specific abbreviations and conventions. In accordance with the invention, a process for duplicate detection is implemented based on interpreting records from multiple dimensional tables in a data warehouse, which are associated with hierarchies specified through key-foreign key relationships in a snowflake schema. The invention exploits the extra knowledge available from the table hierarchy to develop a high-quality, scalable duplicate detection process.
6 Claims
1. In a database having records stored on a computer readable medium, a computer implemented method for identifying possible duplicate data records comprising:
a) providing multiple records in one or more tables that include multiple fields; and
b) identifying two or more records within a table as possible duplicates by measuring a co-occurrence of data in two or more hierarchically related fields of the table;
wherein measuring co-occurrence is performed by:
a) identifying a candidate set of records having a first record field from records in the table or tables; and
b) determining a commonality between records having first record fields based on tokens in a set of one or more child record fields related to the first record fields to identify possible duplicate records from the candidate set of records, which predicts a record v1 is a possible duplicate of another record v2 if a containment metric of tokens from a field of v1 in the record v2 is greater than or equal to a threshold value;
wherein a) a textual similarity between tokens in said first field is compared in determining a token containment metric;
b) a textual similarity between tokens in said child record fields is compared in determining a foreign key containment metric; and
c) both the token containment metric and the foreign key containment metric are combined in identifying possible duplicate records in the candidate set of records; and
wherein a choice is made between a test of the token containment metric and a test of the foreign key containment metric based on an information content of the tokens used to determine said token containment metric and foreign key containment metric.
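The containment test recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: it assumes whitespace tokenization, exact token matching in place of textual similarity, and an unweighted containment ratio (the fraction of v1's tokens that also appear in v2). The function names, the sample state/city data, and the 0.5 threshold are all hypothetical.

```python
def tokens(text):
    """Split a field value into a set of lowercase tokens (assumed tokenizer)."""
    return set(text.lower().split())


def containment(s1, s2):
    """Fraction of the elements of s1 that also appear in s2."""
    if not s1:
        return 0.0
    return len(s1 & s2) / len(s1)


def token_containment(v1, v2):
    """Containment of v1's own field tokens in v2's field tokens."""
    return containment(tokens(v1), tokens(v2))


def fk_containment(children1, children2):
    """Containment of v1's child-record values in v2's child-record values."""
    return containment(set(children1), set(children2))


def is_possible_duplicate(metric_value, threshold=0.5):
    """Threshold test: metric greater than or equal to the threshold."""
    return metric_value >= threshold


# Hypothetical parent records (states) with their child records (cities).
children = {
    "Washington": ["seattle", "redmond", "tacoma"],
    "WA": ["seattle", "redmond"],
}

tcm = token_containment("WA", "Washington")
fkcm = fk_containment(children["WA"], children["Washington"])
```

In this made-up example the two names share no tokens, so the token test alone would miss the pair, while the child records overlap fully and the foreign key test flags it. That is the kind of evidence the table hierarchy is meant to supply.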
2. In a database having records stored on a computer readable medium, a computer implemented method for identifying possible duplicate data records comprising:
a) providing multiple records in one or more tables that include multiple fields; and
b) identifying two or more records within a table as possible duplicates by measuring a co-occurrence of data in two or more hierarchically related fields of the table;
wherein measuring co-occurrence is performed by:
a) identifying a candidate set of records having a first record field from records in the table or tables; and
b) determining a commonality between records having first record fields based on tokens in a set of one or more child record fields related to the first record fields to identify possible duplicate records from the candidate set of records, which predicts a record v1 is a possible duplicate of another record v2 if a containment metric of tokens from a field of v1 in the record v2 is greater than or equal to a threshold value;
wherein a) a textual similarity between tokens in said first field is compared in determining a token containment metric;
b) a textual similarity between tokens in said child record fields is compared in determining a foreign key containment metric; and
c) both the token containment metric and the foreign key containment metric are combined in identifying possible duplicate records in the candidate set of records; and
wherein a determination of a possible duplicate is made based on the token containment metric and the foreign key containment metric, wherein tokens used to determine said token containment metric and foreign key containment metric are weighted according to their information content.
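Claim 2 weights tokens by their information content but does not fix a formula. The sketch below assumes inverse document frequency (IDF), a common choice, as the weight; the function names and the sample company names are hypothetical.

```python
import math
from collections import Counter


def idf_weights(token_sets):
    """IDF weight per token: log(N / number of sets containing the token)."""
    n = len(token_sets)
    df = Counter(tok for s in token_sets for tok in s)
    return {tok: math.log(n / count) for tok, count in df.items()}


def weighted_containment(s1, s2, weights):
    """Weight of s1's tokens that also occur in s2, over the total weight of s1."""
    total = sum(weights.get(t, 0.0) for t in s1)
    if total == 0.0:
        return 0.0
    shared = sum(weights.get(t, 0.0) for t in s1 & s2)
    return shared / total


# Hypothetical company-name field values, already tokenized.
records = [
    {"acme", "corp"},
    {"acme", "corporation"},
    {"globex", "corp"},
    {"initech", "corp"},
]
w = idf_weights(records)
score = weighted_containment(records[0], records[1], w)
```

Here the frequent token "corp" carries less weight than the rarer "acme", so the shared rare token lifts the score above the unweighted ratio of 0.5.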
4. For use with a database having records stored on a computer readable medium, a machine readable medium including instructions for identifying possible duplicate data records comprising instructions for:
a) providing multiple records in one or more tables that include multiple fields; and
b) identifying two or more records within a table as possible duplicates by measuring a co-occurrence of data in two or more hierarchically related fields of the table;
wherein measuring co-occurrence is performed by:
a) identifying a candidate set of records having a first record field from records in the table or tables; and
b) determining a commonality between child record fields related to the first record field based on tokens in a set of one or more child record fields to identify possible duplicate records from the candidate set of records;
wherein the instructions predict a record v1 is a possible duplicate of another record v2 if a containment metric of tokens from a field of v1 in the record v2 is greater than or equal to a threshold value;
wherein a) a textual similarity between tokens in said first field is compared in determining a token containment metric;
b) a textual similarity between tokens in said child record fields is compared in determining a foreign key containment metric; and
c) both the token containment metric and the foreign key containment metric are combined in identifying possible duplicate records in the candidate set of records; and
wherein a choice is made between a test of the token containment metric and a test of the foreign key containment metric based on an information content of the tokens used to determine said token containment metric and foreign key containment metric.
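The final wherein clause of claim 4 leaves open how the choice between the two tests is made. One possible reading, assumed here purely for illustration, is that the test backed by the more informative tokens decides the outcome. The function name, the tie-breaking rule, the 0.5 threshold, and the sample values are hypothetical.

```python
def choose_test(tcm, fkcm, token_info, child_info, threshold=0.5):
    """Apply the test whose tokens carry more information content.

    tcm, fkcm: token and foreign key containment metric values in [0, 1].
    token_info, child_info: total information content (for example, summed
    IDF weights) of the tokens behind each metric.
    Ties go to the token test; this rule is an assumption of the sketch.
    """
    if token_info >= child_info:
        return "token", tcm >= threshold
    return "foreign_key", fkcm >= threshold


# Made-up values: the child-record tokens are the more informative ones,
# so the foreign key test decides.
which, duplicate = choose_test(tcm=0.0, fkcm=1.0, token_info=0.3, child_info=2.1)
```

Other readings, such as a weighted vote between the two tests, are equally compatible with the claim language.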
5. For use with a database having records stored on a computer readable medium, a machine readable medium including instructions for identifying possible duplicate data records comprising instructions for:
a) providing multiple records in one or more tables that include multiple fields; and
b) identifying two or more records within a table as possible duplicates by measuring a co-occurrence of data in two or more hierarchically related fields of the table;
wherein measuring co-occurrence is performed by:
a) identifying a candidate set of records having a first record field from records in the table or tables; and
b) determining a commonality between child record fields related to the first record field based on tokens in a set of one or more child record fields to identify possible duplicate records from the candidate set of records;
wherein the instructions predict a record v1 is a possible duplicate of another record v2 if a containment metric of tokens from a field of v1 in the record v2 is greater than or equal to a threshold value;
wherein a) a textual similarity between tokens in said first field is compared in determining a token containment metric;
b) a textual similarity between tokens in said child record fields is compared in determining a foreign key containment metric; and
c) both the token containment metric and the foreign key containment metric are combined in identifying possible duplicate records in the candidate set of records; and
wherein a determination of a possible duplicate is made based on the token containment metric and the foreign key containment metric, wherein tokens used to determine said token containment metric and foreign key containment metric are weighted according to their information content.
Specification