Duplicate data elimination system

US 20040249789A1
Filed: 06/04/2003
Published: 12/09/2004
Est. Priority Date: 06/04/2003
Status: Active Grant

First Claim

1. A process for finding similar data records from a set of data records comprising:

providing a number of data records from which one or more canonical data records are identified;

determining a similarity score for data records based on the contents of the records;

grouping together data records whose similarity score with respect to each other is greater than a threshold to form one or more groups of data records that form nodes of a graph wherein edges between nodes represent a similarity score between records of a group; and

within each said group, identifying a canonical record based on the similarity of data records to each other within the group.

Abstract

A process for finding similar data records from a set of data records. A database table or tables provide a number of data records from which one or more canonical data records are identified. Tokens are identified within the data records and classified according to attribute field. A similarity score is assigned to data records in relation to other data records based on a similarity between tokens of the data records. Data records whose similarity score with respect to each other is greater than a threshold form one or more groups of data records. The records or tuples form nodes of a graph wherein edges between nodes represent a similarity score between records of a group. Within each group a canonical record is identified based on the similarity of data records to each other within the group.
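
The steps in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the whitespace tokenizer, the per-field Jaccard score, the use of connected components as groups, and all function names (`tokens_by_field`, `similarity`, `build_graph`, `groups`, `canonical`) are assumptions chosen for illustration, since the abstract does not specify a particular similarity function or grouping procedure.

```python
from itertools import combinations

def tokens_by_field(record):
    """Split each attribute value into lowercase tokens, keyed by attribute field."""
    return {field: set(str(value).lower().split()) for field, value in record.items()}

def similarity(a, b):
    """Assumed score: mean Jaccard overlap of token sets over the shared fields."""
    ta, tb = tokens_by_field(a), tokens_by_field(b)
    fields = ta.keys() & tb.keys()
    if not fields:
        return 0.0
    scores = []
    for f in fields:
        union = ta[f] | tb[f]
        scores.append(len(ta[f] & tb[f]) / len(union) if union else 1.0)
    return sum(scores) / len(scores)

def build_graph(records, threshold):
    """Nodes are record indices; an edge holds the score of a pair above the threshold."""
    edges = {i: {} for i in range(len(records))}
    for i, j in combinations(range(len(records)), 2):
        s = similarity(records[i], records[j])
        if s > threshold:
            edges[i][j] = s
            edges[j][i] = s
    return edges

def groups(edges):
    """Assumed grouping: connected components of the thresholded graph."""
    seen, out = set(), []
    for start in edges:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbor in edges[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        out.append(sorted(component))
    return out

def canonical(group, edges):
    """Pick the member with the highest summed similarity to the rest of its group."""
    return max(group, key=lambda n: sum(s for m, s in edges[n].items() if m in group))

# Hypothetical sample data, not taken from the patent.
records = [
    {"name": "Acme Corp", "city": "Seattle"},
    {"name": "Acme Corp Inc", "city": "Seattle"},
    {"name": "Acme Corporation", "city": "Seattle"},
    {"name": "Globex", "city": "Boston"},
]
graph = build_graph(records, threshold=0.6)
found = groups(graph)                                # [[0, 1, 2], [3]]
representatives = [canonical(g, graph) for g in found]  # [0, 3]
```

With the sample data, the three "Acme" records fall into one group and record 0 is chosen as canonical because its summed similarity to the other two members is the largest. The pairwise comparison is quadratic in the number of records, which is acceptable for a sketch but not for a large table.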

21 Claims

1. A process for finding similar data records from a set of data records comprising:
- providing a number of data records from which one or more canonical data records are identified;
  
  determining a similarity score for data records based on the contents of the records;
  
  grouping together data records whose similarity score with respect to each other is greater than a threshold to form one or more groups of data records that form nodes of a graph wherein edges between nodes represent a similarity score between records of a group; and
  
  within each said group, identifying a canonical record based on the similarity of data records to each other within the group.
- Dependent claims (2, 3, 4, 5, 6)
  - 2. The process of claim 1 wherein inter record scores are determined for all records in the reference table and a first group is identified as a group having a record node with a highest total similarity score.
  - 3. The process of claim 2 wherein all records within a group are removed from the graph before additional groups and canonical records are identified.
  - 4. The method of claim 1 wherein the canonical record is chosen from data records in a group by summing the similarity scores of records found to be similar to each other based on the threshold.
  - 5. The method of claim 1 wherein the set of records is from a database table and data records similar to the canonical record of a group are removed from the database table while leaving the canonical record.
  - 6. The method of claim 1 wherein the similarity score is determined by identifying tokens contained within the data records and classifying the tokens according to attribute field and assigning a similarity score to data records in relation to other data records based on a similarity between tokens of said data records.
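
One way to read claims 2 and 3 together is as a greedy loop: pick the node with the highest total similarity score, form a group around it, remove that group from the graph, and repeat. The sketch below follows that reading under stated assumptions: the graph is given as a symmetric adjacency dictionary that already contains only above-threshold edges, a group is taken to be the chosen node plus its direct neighbors, and the function name `greedy_groups` is hypothetical.

```python
def greedy_groups(edges):
    """Return a list of (canonical_node, sorted_members) pairs.

    edges: {node: {neighbor: score}}, symmetric, above-threshold pairs only.
    """
    # Work on a copy so the caller's graph is left untouched.
    remaining = {node: dict(nbrs) for node, nbrs in edges.items()}
    result = []
    while remaining:
        # Node with the highest summed similarity; ties go to the earliest node.
        center = max(remaining, key=lambda n: sum(remaining[n].values()))
        members = {center, *remaining[center]}
        result.append((center, sorted(members)))
        # Remove the whole group before looking for the next one.
        for node in members:
            del remaining[node]
        for nbrs in remaining.values():
            for node in members:
                nbrs.pop(node, None)
    return result

# Hypothetical scores for five records.
example = {
    0: {1: 0.9, 2: 0.7},
    1: {0: 0.9},
    2: {0: 0.7, 3: 0.65},
    3: {2: 0.65},
    4: {},
}
print(greedy_groups(example))  # [(0, [0, 1, 2]), (3, [3]), (4, [4])]
```

Record 3 ends up alone because its only similar record, 2, was already absorbed into the first group. Whether the claimed process would treat that case the same way is not stated in the claims.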

7. A process for finding similar data records from a set of data records comprising:
- providing a number of data records from which one or more canonical data records are identified;
  
  identifying tokens contained within the data records and classifying the tokens according to attribute field; and
  
  preparing a graph by assigning a similarity score to data records in the reference table in relation to other records based on a similarity between tokens of said data records and assigning records to nodes of said graph and similarity scores to edges that interconnect the nodes;
  
  partitioning the graph by grouping together data records whose similarity score with respect to each other is greater than a threshold to form one or more groups of data records; and
  
  within a group, identifying a canonical record based on the similarity of data records to each other within the group.
- Dependent claims (8, 9, 10)
  - 8. The method of claim 7 wherein the step of partitioning comprises dividing the nodes of said graph into disjoint groups.
  - 9. The method of claim 8 wherein the step of partitioning is performed to provide overlapping groups and wherein canonical records are chosen based on a total score of records found to be similar to the canonical records including records belonging to more than one group.
  - 10. The method of claim 7 wherein the data records are part of a database table and additionally comprising removing from the database table records in a group that are represented by a canonical record.
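
The table cleanup in claim 10 can be sketched with Python's standard `sqlite3` module. The table name `customers`, its `id` primary key, the helper name `remove_duplicates`, and the shape of the `groups` argument are all assumptions made for this example; the claim does not name a schema or a database product.

```python
import sqlite3

def remove_duplicates(conn, table, groups):
    """Delete every group member except the canonical record.

    groups: iterable of (canonical_id, member_ids) pairs.
    table:  trusted table name (it is interpolated into the SQL text).
    Returns the number of rows scheduled for deletion.
    """
    doomed = [(m,) for canon, members in groups for m in members if m != canon]
    conn.executemany(f"DELETE FROM {table} WHERE id = ?", doomed)
    conn.commit()
    return len(doomed)

# Hypothetical in-memory table with three near-duplicates and one distinct row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO customers (id, name) VALUES (?, ?)",
    [(1, "Acme Corp"), (2, "Acme Corp Inc"), (3, "Acme Corporation"), (4, "Globex")],
)
removed = remove_duplicates(conn, "customers", [(1, [1, 2, 3]), (4, [4])])
kept = [row[0] for row in conn.execute("SELECT id FROM customers ORDER BY id")]
print(removed, kept)  # 2 [1, 4]
```

Only the canonical rows survive, so each group is represented once in the table afterwards.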

11. A system for removing duplicate data records from a set of data records comprising:
- a database management system containing a number of data records contained in one or more tables from which one or more data records are removed;
  
  a processor for identifying tokens contained within the data records and classifying the tokens according to attribute field and wherein said processor assigns a similarity score to data records in the reference table in relation to other data records based on a similarity between tokens of said data records; and
  
  wherein said processor groups together data records whose similarity score with respect to each other is greater than a threshold to form one or more groups of data records that form nodes of a graph wherein edges between nodes represent a similarity score between records of a group; and
  
  then identifies a canonical record based on the similarity of data records to each other within the group.
- Dependent claims (12, 13, 14)
  - 12. The system of claim 11 wherein inter record scores are determined by the processor for all records in the reference table and a first group is identified as a group having a record node with a highest total similarity score.
  - 13. The system of claim 12 wherein all records within a group are removed from the graph before additional groups and canonical records are identified.
  - 14. The system of claim 11 wherein the canonical record is chosen from data records in a group by summing the similarity scores of records found to be similar to each other based on the threshold.

15. Apparatus for finding a canonical data record for a set of two or more data records comprising:
- means for providing a reference table having a number of reference records from which canonical data records are identified;
  
  means for identifying reference table tokens contained within the reference records of the reference table and classifying the reference table tokens according to attribute field; and
  
  means for assigning a similarity score to evaluation data records in the reference table in relation to other records based on a similarity between tokens of said evaluation data records;
  
  means for grouping together evaluation records whose similarity score with respect to each other is greater than a threshold to form groups of records that form nodes of a graph wherein edges between nodes represent a similarity between records of a group wherein each said group identifying a canonical record based on the similarity of data records to each other within the group.

16. A machine readable medium including instructions for executing on a computer for finding similar data records from a set of data records comprising instructions for:
- obtaining a number of data records from which one or more canonical data records are identified;
  
  determining a similarity score for data records based on the contents of the records;
  
  grouping together data records whose similarity score with respect to each other is greater than a threshold to form one or more groups of data records that form nodes of a graph wherein edges between nodes represent a similarity score between records of a group; and
  
  identifying a canonical record within each group based on the similarity of data records to each other within the group.
- Dependent claims (17, 18, 19, 20, 21)
  - 17. The machine readable medium of claim 16 wherein inter record scores are determined for all records in the reference table and a first group is identified as a group containing a record node having a highest total similarity score determined from its similarity score with other record nodes.
  - 18. The machine readable medium of claim 17 wherein all records within a group are removed from the graph before additional groups and canonical records are identified.
  - 19. The machine readable medium of claim 16 wherein the canonical record is chosen from data records in a group by summing the similarity scores of records found to be similar to each other based on the threshold.
  - 20. The machine readable medium of claim 16 wherein the set of records is from a database table and data records similar to the canonical record of a group are removed from the database table while leaving the canonical record.
  - 21. The machine readable medium of claim 16 wherein the similarity score is determined by identifying tokens contained within the data records and classifying the tokens according to attribute field and assigning a similarity score to data records in relation to other data records based on a similarity between tokens of said data records.

Current Assignee
Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee
Microsoft Corporation
Inventors
Chaudhuri, Surajit; Ganti, Venkatesh; Kapoor, Rahul

Granted Patent

US 7,287,019 B2
US Class Current

707/2
CPC Class Codes

G06F 16/215   Improving data quality; Dat...

G06F 2216/03   Data mining

Y10S 707/99932   Access augmentation or opti...

Y10S 707/99933   Query processing, i.e. sear...

Y10S 707/99934   Query formulation, input pr...

Y10S 707/99935   Query augmenting and refini...

Y10S 707/99936   Pattern matching access
