Detecting duplicate records in database

US 6,961,721 B2
Filed: 06/28/2002
Issued: 11/01/2005
Est. Priority Date: 06/28/2002
Status: Active Grant

Abstract

The invention concerns a detection of duplicate tuples in a database. Previous domain independent detection of duplicated tuples relied on standard similarity functions (e.g., edit distance, cosine metric) between multi-attribute tuples. However, such prior art approaches result in large numbers of false positives if they are used to identify domain-specific abbreviations and conventions. In accordance with the invention, a process for duplicate detection is implemented based on interpreting records from multiple dimensional tables in a data warehouse, which are associated with hierarchies specified through key-foreign key relationships in a snowflake schema. The invention exploits the extra knowledge available from the table hierarchy to develop a high quality, scalable duplicate detection process.
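
To make the hierarchy idea in the abstract concrete, the following is a minimal sketch of how child values might be grouped under the parent row they reference through a foreign key. The table contents, the `countries` and `states` names, and the `children_by_parent` helper are all hypothetical illustrations, not taken from the patent.

```python
from collections import defaultdict

# Hypothetical snowflake-style dimension tables: each state row points to its
# parent country row through a foreign key (the third tuple element).
countries = {1: "United States", 2: "USA", 3: "Canada"}
states = [
    # (state_id, state_name, country_id)
    (10, "Washington", 1),
    (11, "WA", 2),
    (12, "California", 1),
    (13, "California", 2),
    (14, "Ontario", 3),
]


def children_by_parent(child_rows):
    """Group lower-cased child names under the parent key they reference."""
    groups = defaultdict(set)
    for _child_id, name, parent_id in child_rows:
        groups[parent_id].add(name.lower())
    return dict(groups)


child_sets = children_by_parent(states)

# Country rows 1 and 2 share no name tokens, but their child sets overlap,
# which is the kind of co-occurrence evidence the abstract refers to.
shared = child_sets[1] & child_sets[2]
```

Here "United States" and "USA" have no textual overlap, yet both have a child named "california"; that overlap is the extra signal a hierarchy-aware approach can use.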

6 Claims

1. In a database having records stored on a computer readable medium, a computer implemented method for identifying possible duplicate data records comprising:
- a) providing multiple records in one or more tables that include multiple fields; and
  
  b) identifying two or more records within a table as possible duplicates by measuring a co-occurrence of data in two or more hierarchically related fields of the table;
  
  wherein measuring co-occurrence is performed by;
  
  a) identifying a candidate set of records having a first record field from records in the table or tables; and
  
  b) determining a commonality between records having first record fields based on tokens in the a set of one or more child record fields related to the first record fields to identify possible duplicate records from the candidate set of records, which predicts a record v₁ is a possible duplicate of another record v₂ if a containment metric of tokens from a field of v₁ in the record v₂ is greater than or equal to a threshold value;
  
  wherein a) a textual similarity between tokens in said first field compared in determining a token containment metric;
  
  b) a textual similarity between tokens in said child record fields is compared in determining a foreign key containment metric; and
  
  c) both the token containment metric and the foreign key containment metric are combined in identifying possible duplicate records in the candidate set of records; and
  
  wherein a choice is made between a test of the token containment metric and test of the foreign key containment metric based on an information content of the tokens used to determine said token containment metric and foreign key containment metric.
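
The threshold test recited in claim 1 can be illustrated with a small sketch. It assumes several simplifications that the claim does not state: tokens are compared by exact match rather than by textual similarity, containment is the plain fraction of one set found in the other, and the two metrics are combined with a simple "either test passes" rule. The function names and the default thresholds of 0.5 are hypothetical.

```python
def tokens(text):
    """Split a field value into a set of lower-cased whitespace tokens."""
    return set(text.lower().split())


def containment(a, b):
    """Fraction of the elements of set a that also appear in set b."""
    if not a:
        return 0.0
    return len(a & b) / len(a)


def is_possible_duplicate(name1, children1, name2, children2,
                          tcm_threshold=0.5, fkcm_threshold=0.5):
    """Flag record 1 as a possible duplicate of record 2.

    tcm:  token containment of record 1's name tokens in record 2's name.
    fkcm: containment of record 1's child values in record 2's child values.
    The "or" combination is an assumption made for this sketch.
    """
    tcm = containment(tokens(name1), tokens(name2))
    fkcm = containment(children1, children2)
    return tcm >= tcm_threshold or fkcm >= fkcm_threshold


# "USA" shares no tokens with "United States", but all of its children do.
usa_vs_us = is_possible_duplicate(
    "USA", {"california", "texas"},
    "United States", {"california", "texas", "washington"},
)
usa_vs_canada = is_possible_duplicate(
    "USA", {"california", "texas"},
    "Canada", {"ontario"},
)
```

Note that containment is asymmetric: it measures how much of the first record is found in the second, so swapping the arguments can change the result.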

2. In a database having records stored on a computer readable medium, a computer implemented method for identifying possible duplicate data records comprising:
- a) providing multiple records in one or more tables that include multiple fields; and
  
  b) identifying two or more records within a table as possible duplicates by measuring a co-occurrence of data in two or more hierarchically related fields of the table;
  
  wherein measuring co-occurrence is performed by;
  
  a) identifying a candidate set of records having a first record field from records in the table or tables; and
  
  b) determining a commonality between records having first record fields based on tokens in the a set of one or more child record fields related to the first record fields to identify possible duplicate records from the candidate set of records, which predicts a record v₁ is a possible duplicate of another record v₂ if a containment metric of tokens from a field of v₁ in the record v₂ is greater than or equal to a threshold value;
  
  wherein a) a textual similarity between tokens in said first field compared in determining a token containment metric;
  
  b) a textual similarity between tokens in said child record fields is compared in determining a foreign key containment metric; and
  
  c) both the token containment metric and the foreign key containment metric are combined in identifying possible duplicate records in the candidate set of records; and
  
  wherein a determination of a possible duplicate is made based on the token containment metric and the foreign key containment metric wherein tokens used to determine said token containment metric and foreign key containment metric are weighted according to their information content.

3. The method of claim 2 wherein tokens of linked child field are identified by grouping contents of hierarchically linked records that are possible duplicates based on a combination of containment metrics for tokens of the first field.
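
Claim 2 weights tokens according to their information content. One common way to approximate information content is inverse document frequency (IDF); the sketch below uses that as a stand-in and is not necessarily the weighting the patent specifies. The `idf_weights` and `weighted_containment` names and the sample data are hypothetical.

```python
import math
from collections import Counter


def idf_weights(token_sets):
    """Assign each token the weight log(N / frequency) over N token sets."""
    n = len(token_sets)
    freq = Counter(t for s in token_sets for t in s)
    return {t: math.log(n / f) for t, f in freq.items()}


def weighted_containment(a, b, weights):
    """Weighted fraction of set a's tokens that also appear in set b."""
    total = sum(weights.get(t, 0.0) for t in a)
    if total == 0:
        return 0.0
    return sum(weights.get(t, 0.0) for t in a & b) / total


names = [
    {"acme", "corp"},
    {"acme", "inc"},
    {"zenith", "corp"},
    {"apex", "corp"},
]
weights = idf_weights(names)

# Sharing the rarer token "acme" counts for more than sharing the common
# token "corp", although the unweighted containment is 0.5 in both cases.
share_rare = weighted_containment(names[0], names[1], weights)
share_common = weighted_containment(names[0], names[2], weights)
```

With weighting, a match on a rare token moves the metric further toward the threshold than a match on a token that appears in most records.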

4. For use with a database having records stored on a computer readable medium, a machine readable medium including instructions for identifying possible duplicate data records comprising instructions for:
- a) providing multiple records in one or more tables that include multiple fields; and
  
  b) identifying two or more records within a table as possible duplicates by measuring a co-occurrence of data in two or more hierarchically related fields of the table;
  
  wherein measuring co-occurrence is performed by;
  
  a) identifying a candidate set of records having a first record field from records in the table or tables; and
  
  b) determining a commonality between child record fields related to the first record field based on tokens in a set of one or more child record fields to identify possible duplicate records from the candidate set of records;
  
  wherein the instructions predict a record v₁ is a possible duplicate of another record v₂ if a containment metric of tokens from a field of v₁ in the record v₂ is greater than or equal to a threshold value;
  
  wherein a) a textual similarity between tokens in said first field is compared in determining a token containment metric;
  
  b) a textual similarity between tokens in said child record fields is compared in determining a foreign key containment metric; and
  
  c) both the token containment metric and the foreign key containment metric are combined in identifying possible duplicate records in the candidate set of records; and
  
  wherein a choice is made between a test of the token containment metric and a test of the foreign key containment metric based on an information content of the tokens use to determine said token containment metric and foreign key containment metric.

5. For use with a database having records stored on a computer readable medium, a machine readable medium including instructions for identifying possible duplicate data records comprising instructions for:
- a) providing multiple records in one or more tables that include multiple fields; and
  
  b) identifying two or more records within a table as possible duplicates by measuring a co-occurrence of data in two or more hierarchically related fields of the table;
  
  wherein measuring co-occurrence is performed by;
  
  a) identifying a candidate set of records having a first record field from records in the table or tables; and
  
  b) determining a commonality between child record fields related to the first record field based on tokens in a set of one or more child record fields to identify possible duplicate records from the candidate set of records;
  
  wherein the instructions predict a record v₁ is a possible duplicate of another record v₂ if a containment metric of tokens from a field of v₁ in the record v₂ is greater than or equal to a threshold value;
  
  wherein a) a textual similarity between tokens in said first field is compared in determining a token containment metric;
  
  b) a textual similarity between tokens in said child record fields is compared in determining a foreign key containment metric; and
  
  c) both the token containment metric and the foreign key containment metric are combined in identifying possible duplicate records in the candidate set of records; and
  
  wherein a determination of a possible duplicate is made based on the token containment metric and the foreign key containment metric wherein tokens used to determine said token containment metric and foreign key containment metric are weighted according to their information content.

6. The medium of claim 5 wherein tokens of linked child field are identified by grouping contents of hierarchically linked records that are possible duplicates based on a combination of containment metrics for tokens of the first field.

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Chaudhuri, Surajit; Ganti, Venkatesh; Ananthakrishna, Rohit
Primary Examiner(s): Kindred, Alford
Assistant Examiner(s): Dang, Thanh Ha T
Application Number: US10/186,031
Publication Number: US 20040003005A1
Time in Patent Office: 1,222 Days
Field of Search: 707/101, 707/1
US Class Current: 1/1
CPC Class Codes:
G06F 16/215   Improving data quality; Dat...
Y10S 707/99931   Database or file accessing
Y10S 707/99942   Manipulating data structure...
