Detecting duplicate records in databases

US 20040003005A1
Filed: 06/28/2002
Published: 01/01/2004
Est. Priority Date: 06/28/2002
Status: Active Grant

Abstract

The invention concerns the detection of duplicate tuples in a database. Previous domain-independent detection of duplicate tuples relied on standard similarity functions (e.g., edit distance, cosine metric) between multi-attribute tuples. However, such prior art approaches produce large numbers of false positives when they are used to identify domain-specific abbreviations and conventions. In accordance with the invention, a process for duplicate detection is implemented based on interpreting records from multiple dimensional tables in a data warehouse, which are associated with hierarchies specified through key-foreign key relationships in a snowflake schema. The invention exploits the extra knowledge available from the table hierarchy to develop a high-quality, scalable duplicate detection process.

51 Claims

1. In a database having records stored on a medium, a method for identifying possible duplicate data records comprising:
- a) providing multiple records in one or more tables that include multiple fields; and
  
  b) identifying two or more records within a table as possible duplicates by measuring a co-occurrence of data in two or more hierarchically related fields of the table.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17)
- - 2. The method of claim 1 wherein measuring co-occurrence is performed by:
    - a) identifying a candidate set of records having a first record field from records in the table or tables; and
      
      b) determining a commonality between records having first record fields based on tokens in a set of one or more child record fields related to the first record fields to identify possible duplicate records from the candidate set of records.
  - 3. The method of claim 2 wherein determining the commonality is performed by determining a containment metric between corresponding fields of two records.
  - 4. The method of claim 3 wherein the containment metric is compared to a dynamically determined containment metric threshold, said containment metric threshold determined based upon the token content of records that are grouped together based on the contents of the first record field.
  - 5. The method of claim 2 which predicts a record v₁ is a possible duplicate of another record v₂ if a containment metric of tokens from a field of v₁ in the record v₂ is greater than or equal to a threshold value.
  - 6. The method of claim 5 wherein a textual similarity between tokens in said first record field is compared in determining a token containment metric.
  - 7. The method of claim 6 wherein a textual similarity between tokens in said child record fields is compared in determining a foreign key containment metric.
  - 8. The method of claim 5 wherein:
    - a) a textual similarity between tokens in said first field is compared in determining a token containment metric;
      
      b) a textual similarity between tokens in said child record fields is compared in determining a foreign key containment metric; and
      
      c) both the token containment metric and the foreign key containment metric are combined in identifying possible duplicate records in the candidate set of records.
  - 9. The method of claim 8 wherein the token containment metric is compared with a first threshold and the foreign key containment metric is compared with a second threshold.
  - 10. The method of claim 8 wherein a choice is made between a test of the token containment metric and a test of the foreign key containment metric based on an information content of the tokens used to determine said token containment metric and foreign key containment metric.
  - 11. The method of claim 8 wherein a determination of a possible duplicate is made based on the token containment metric and the foreign key containment metric wherein tokens used to determine said token containment metric and foreign key containment metric are weighted according to their information content.
  - 12. The method of claim 2 wherein identifying the candidate set of one or more records is performed by evaluating tokens in the first field of a multiple number of records and recording data in a token table for tokens having a frequency of greater than one.
  - 13. The method of claim 2 wherein determining commonality is performed by evaluating tokens in the child field of a multiple number of records and recording data in a children table for child records having a frequency of greater than one.
  - 14. The method of claim 11 wherein tokens of a linked child field are identified by grouping contents of hierarchically linked records that are possible duplicates based on a combination of containment metrics for tokens of the first field.
  - 15. The method of claim 2 additionally comprising maintaining a translation table of possible duplicate records which is updated as candidate records are eliminated while traversing the hierarchy of fields.
  - 16. The method of claim 15 which simulates the replacement of possible duplicate records by canonical records in the database.
  - 17. The method of claim 15 wherein the translation table is obtained using views and queries over tables in the database.
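
The containment test recited in claims 3 and 5 can be illustrated with a minimal Python sketch. The whitespace tokenizer, the function names, and the fixed threshold of 0.8 are assumptions made for illustration; the claims do not specify them, and claim 4 describes a dynamically determined threshold that this sketch does not attempt.

```python
def tokenize(text):
    """Split a field value into a set of lowercase tokens (assumed tokenizer)."""
    return set(text.lower().split())


def containment(a, b):
    """Fraction of the tokens in a that also occur in b."""
    if not a:
        return 0.0
    return len(a & b) / len(a)


def is_possible_duplicate(v1, v2, threshold=0.8):
    """Flag v1 as a possible duplicate of v2 when the containment of
    v1's tokens in v2 meets the (hypothetical) threshold."""
    return containment(tokenize(v1), tokenize(v2)) >= threshold


# Every token of the shorter name appears in the longer one.
short_in_long = is_possible_duplicate("United States", "United States of America")
# Only 2 of 4 tokens of the longer name appear in the shorter one.
long_in_short = is_possible_duplicate("United States of America", "United States")
```

The metric is asymmetric, so the direction of the comparison matters. The sketch also uses exact token matches, whereas claims 6 and 7 contemplate a textual similarity between tokens.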

18. The method of claim 1 wherein there are a plurality of hierarchically related fields at multiple field levels of a hierarchy and wherein identifying possible duplicate records is performed in a top down traversal of the related fields to identify possible duplicate field contents in at least two field levels in two or more records.
- View Dependent Claims (19, 20)
- - 19. The method of claim 18 wherein there is more than one table in the database and multiple tables are linked by means of a key-foreign key relationship.
  - 20. The method of claim 18 which predicts a record v₁ is a possible duplicate of another record v₂ if a token containment metric between tokens from one or more fields of a hierarchy of fields in v₁ and those for the record v₂ is greater than or equal to a threshold value.
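
One way to picture the top down traversal of claim 18 is the following Python sketch. It assumes a two-level hierarchy of hypothetical rows `(record_id, name, parent_id)`, uses plain token containment in either direction as the duplicate test, and keeps a per-level translation table in the spirit of claim 15. None of these names or data come from the patent text.

```python
def tokens(text):
    """Lowercased whitespace tokens of a field value (assumed tokenizer)."""
    return set(text.lower().split())


def containment(a, b):
    """Fraction of the tokens in a that also occur in b."""
    return len(a & b) / len(a) if a else 0.0


def dedupe_level(rows, threshold=0.8):
    """Compare rows that share a parent; return {duplicate_id: canonical_id}."""
    translation = {}
    for i, (id1, name1, parent1) in enumerate(rows):
        if id1 in translation:
            continue  # already mapped to an earlier canonical record
        for id2, name2, parent2 in rows[i + 1:]:
            if id2 in translation or parent1 != parent2:
                continue
            t1, t2 = tokens(name1), tokens(name2)
            if max(containment(t1, t2), containment(t2, t1)) >= threshold:
                translation[id2] = id1
    return translation


def top_down(levels, threshold=0.8):
    """levels: row lists ordered from the top of the hierarchy downward.
    Parents found to be duplicates are replaced by their canonical id
    before the next level is grouped and compared."""
    result, parent_map = [], {}
    for rows in levels:
        rewritten = [(rid, name, parent_map.get(parent, parent))
                     for rid, name, parent in rows]
        parent_map = dedupe_level(rewritten, threshold)
        result.append(parent_map)
    return result


countries = [(1, "United States", None),
             (2, "United States of America", None),
             (3, "Canada", None)]
states = [(10, "Washington", 1), (11, "Washington State", 2), (12, "Ontario", 3)]
translations = top_down([countries, states])
# translations == [{2: 1}, {11: 10}]
```

In this example the two Washington rows are only compared because their parent countries were merged at the level above, which is the benefit of processing the hierarchy from the top.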

21. In a database having records stored on a medium, a method for identifying possible duplicate records comprising:
- a) evaluating a first containment metric for tokens contained in a first field of multiple records from one or more database tables to select a first set of candidate records as possible duplicate records;
  
  b) evaluating a second containment metric for tokens contained in a second field from the one or more database tables that is hierarchically connected to the first set of candidate records; and
  
  c) identifying two or more records within the one or more tables as possible duplicate records by using the first and second containment metrics to produce an output in the event a duplicate record threshold is met.
- View Dependent Claims (22, 23, 24)
- - 22. The method of claim 21 wherein multiple tables are related by relations between tables to form a hierarchy which combines to form a set of multiple attribute records.
  - 23. The method of claim 21 wherein the first and second containment metrics are weighted by a weighting factor that is based on contents of the first and second fields.
  - 24. The method of claim 21 wherein the containment metric for the second set is based upon a group of records that are related to the candidate records used in determining the first set.
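
Claims 21 to 23 combine a containment metric over a record's own tokens with a second one over hierarchically connected child records. The Python sketch below is one possible reading, not the claimed implementation: the IDF-style weight `log(1 + N / df)`, the record layout with `name` and `children` keys, the two thresholds, and the "either test passes" combination rule are all assumptions introduced for illustration.

```python
import math
from collections import Counter


def tokens(text):
    """Lowercased whitespace tokens of a field value (assumed tokenizer)."""
    return set(text.lower().split())


def idf_weights(token_sets):
    """Assumed weighting: log(1 + N / df), so rarer tokens weigh more."""
    n = len(token_sets)
    df = Counter(t for s in token_sets for t in s)
    return {t: math.log(1 + n / count) for t, count in df.items()}


def weighted_containment(a, b, weights):
    """Weighted share of the items in a that also occur in b."""
    total = sum(weights.get(t, 1.0) for t in a)
    if total == 0:
        return 0.0
    return sum(weights.get(t, 1.0) for t in a & b) / total


def possible_duplicate(r1, r2, name_w, child_w,
                       tcm_threshold=0.8, fkcm_threshold=0.8):
    """Hypothetical combination: flag r1 if either metric meets its threshold."""
    tcm = weighted_containment(tokens(r1["name"]), tokens(r2["name"]), name_w)
    fkcm = weighted_containment(r1["children"], r2["children"], child_w)
    return tcm >= tcm_threshold or fkcm >= fkcm_threshold


records = [
    {"name": "USA", "children": {"washington", "oregon", "texas"}},
    {"name": "United States",
     "children": {"washington", "oregon", "texas", "ohio"}},
    {"name": "Canada", "children": {"ontario", "quebec"}},
]
name_w = idf_weights([tokens(r["name"]) for r in records])
child_w = idf_weights([r["children"] for r in records])

usa_vs_us = possible_duplicate(records[0], records[1], name_w, child_w)
usa_vs_canada = possible_duplicate(records[0], records[2], name_w, child_w)
```

Here "USA" and "United States" share no tokens, so the first metric alone would miss the pair; the overlap of their child records is what flags them.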

25. In a database having records stored on a medium, a method for identifying possible duplicate records comprising:
- a) providing multiple tables that are related by key-foreign key relations between tables to form a hierarchy of records;
  
  b) identifying duplicate contents within at least one field within the records of a first table;
  
  c) grouping the records of a second table related to said first table based on the identification of the duplicate fields in the first table; and
  
  d) identifying duplicate record pairs from the first and second tables based on a search for duplicate contents of fields in the records of the second table that were grouped based on duplicate contents of fields from records in the first table.

26. A system for evaluating records in a database to determine a presence of duplicate records comprising:
- a) one or more computers for storing data that is organized according to a hierarchy of related fields in one or more database tables; and
  
  b) a database management system including a processor for selectively extracting records from the one or more database tables and including processor components for evaluating the contents of said records;
  
  c) said processor including a duplicate determination component that i) accesses multiple records in one or more database tables that include multiple fields; and
  
  ii) identifies two or more records within a table as duplicates by measuring a co-occurrence of data in hierarchically related fields of the one or more tables.
- View Dependent Claims (27, 28, 29, 30, 31)
- - 27. The system of claim 26 wherein the duplicate determination component:
    - a) identifies a candidate set of records having a first record field from records in the table or tables;
      
      b) identifies a set of tokens from other child record fields that are hierarchically linked to the first record field in the candidate set of records; and
      
      c) determines a commonality between tokens in the child record fields to identify possible duplicate records from the candidate set of records.
  - 28. The system of claim 26 wherein the duplicate determination component predicts a record v₁ is a possible duplicate of another record v₂ if a containment metric of tokens from a field of v₁ in the record v₂ is greater than or equal to a threshold value.
  - 29. The system of claim 27 wherein the candidate set of one or more records is identified by the duplicate determination component by evaluating tokens in the first field of a multiple number of records and additionally comprising a token table for maintaining a listing of tokens having a frequency of greater than one.
  - 30. The system of claim 27 wherein determining commonality is performed by evaluating tokens in the child field of a multiple number of records and recording data in a children table for child records having a frequency of greater than one.
  - 31. The system of claim 27 additionally comprising a translation table of possible duplicate records which is updated by the duplicate determination component as candidate records are eliminated as the hierarchy of fields is traversed by the duplicate detection component.
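
The token table of claim 29 (and of method claim 12) can be sketched as an inverted index that keeps only tokens seen in more than one record. This Python example is illustrative only: it reads "frequency" as the number of records containing a token, and the function names and sample records are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_token_table(records):
    """records: {record_id: field text}.
    Returns {token: sorted record ids} for tokens found in more than one record."""
    postings = defaultdict(set)
    for rid, text in records.items():
        for tok in set(text.lower().split()):
            postings[tok].add(rid)
    return {tok: sorted(ids) for tok, ids in postings.items() if len(ids) > 1}


def candidate_pairs(token_table):
    """Record pairs that share at least one token from the token table."""
    pairs = set()
    for ids in token_table.values():
        pairs.update(combinations(ids, 2))
    return pairs


records = {1: "United States", 2: "United States of America", 3: "Canada"}
table = build_token_table(records)
pairs = candidate_pairs(table)
# table == {"united": [1, 2], "states": [1, 2]}; pairs == {(1, 2)}
```

Records that share no token with any other record never enter the candidate set, so the more expensive containment comparisons only run on the surviving pairs.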

32. For use with a database having records stored on a medium, a machine readable medium including instructions for identifying possible duplicate data records comprising instructions for:
- a) providing multiple records in one or more tables that include multiple fields; and
  
  b) identifying two or more records within a table as possible duplicates by measuring a co-occurrence of data in two or more hierarchically related fields of the table.
- View Dependent Claims (33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51)
- - 33. The medium of claim 32 wherein measuring co-occurrence is performed by:
    - a) identifying a candidate set of records having a first record field from records in the table or tables; and
      
      b) determining a commonality between child record fields related to the first record field based on tokens in a set of one or more child record fields to identify possible duplicate records from the candidate set of records.
  - 34. The medium of claim 33 wherein determining the commonality is performed by determining a containment metric between corresponding fields of two records.
  - 35. The medium of claim 34 wherein the containment metric is compared to a dynamically determined containment metric threshold, said containment metric threshold determined based upon the token content of records that are grouped together based on the contents of the first record field.
  - 36. The medium of claim 33 wherein the instructions predict a record v₁ is a possible duplicate of another record v₂ if a containment metric of tokens from a field of v₁ in the record v₂ is greater than or equal to a threshold value.
  - 37. The medium of claim 36 wherein a textual similarity between tokens in said first record field is compared in determining a token containment metric.
  - 38. The medium of claim 37 wherein a textual similarity between tokens in said child record fields is compared in determining a foreign key containment metric.
  - 39. The medium of claim 36 wherein:
    - a) a textual similarity between tokens in said first field is compared in determining a token containment metric;
      
      b) a textual similarity between tokens in said child record fields is compared in determining a foreign key containment metric; and
      
      c) both the token containment metric and the foreign key containment metric are combined in identifying possible duplicate records in the candidate set of records.
  - 40. The medium of claim 39 wherein the token containment metric is compared with a first threshold and the foreign key containment metric is compared with a second threshold.
  - 41. The medium of claim 39 wherein a choice is made between a test of the token containment metric and a test of the foreign key containment metric based on an information content of the tokens used to determine said token containment metric and foreign key containment metric.
  - 42. The medium of claim 39 wherein a determination of a possible duplicate is made based on the token containment metric and the foreign key containment metric wherein tokens used to determine said token containment metric and foreign key containment metric are weighted according to their information content.
  - 43. The medium of claim 33 wherein identifying the candidate set of one or more records is performed by evaluating tokens in the first field of a multiple number of records and recording data in a token table for tokens having a frequency of greater than one.
  - 44. The medium of claim 33 wherein determining commonality is performed by evaluating tokens in the child field of a multiple number of records and recording data in a children table for child records having a frequency of greater than one.
  - 45. The medium of claim 42 wherein tokens of a linked child field are identified by grouping contents of hierarchically linked records that are possible duplicates based on a combination of containment metrics for tokens of the first field.
  - 46. The medium of claim 33 additionally comprising maintaining a translation table of possible duplicate records which is updated as candidate records are eliminated while traversing the hierarchy of fields.
  - 47. The medium of claim 46 which simulates the replacement of possible duplicate records by canonical records in the database.
  - 48. The medium of claim 46 wherein the translation table is obtained using views and queries over tables in the database.
  - 49. The medium of claim 32 wherein there are a plurality of hierarchically related fields at multiple field levels of a hierarchy and wherein identifying possible duplicate records is performed in a top down traversal of the related fields to identify possible duplicate field contents in at least two field levels in two or more records.
  - 50. The medium of claim 49 wherein there is more than one table in the database and multiple tables are linked by means of a key-foreign key relationship.
  - 51. The medium of claim 49 which predicts a record v₁ is a possible duplicate of another record v₂ if a token containment metric between tokens from one or more fields of a hierarchy of fields in v₁ and those for the record v₂ is greater than or equal to a threshold value.

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Surajit Chaudhuri, Venkatesh Ganti, Rohit Ananthakrishna
Granted Patent: US 6,961,721 B2
US Class Current: 707/200
CPC Class Codes:
G06F 16/215   Improving data quality; Dat...
Y10S 707/99931   Database or file accessing
Y10S 707/99942   Manipulating data structure...
