Duplicate data elimination system

US 7,287,019 B2
Filed: 06/04/2003
Issued: 10/23/2007
Est. Priority Date: 06/04/2003
Status: Active Grant

Abstract

A process for finding similar data records in a set of data records. A database table or tables provide a number of data records from which one or more canonical data records are identified. Tokens are identified within the data records and classified according to attribute field. A similarity score is assigned to data records in relation to other data records based on a similarity between tokens of the data records. Data records whose similarity score with respect to each other is greater than a threshold form one or more groups of data records. The records or tuples form nodes of a graph wherein edges between nodes represent a similarity score between records of a group. Within each group a canonical record is identified based on the similarity of data records to each other within the group.
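
The attribute-aware token comparison described in the abstract can be illustrated with a minimal sketch. The abstract does not name a similarity metric, so the Jaccard overlap used here is an assumption, as are the names `Record`, `field_tokens`, and `similarity`. The only point it demonstrates is that a token counts as a match only when it appears under the same attribute field in both records.

```python
from typing import Dict, Set, Tuple

# Hypothetical record shape: attribute field name -> raw text value.
Record = Dict[str, str]


def field_tokens(record: Record) -> Set[Tuple[str, str]]:
    """Split each field on whitespace and tag every token with its field."""
    return {
        (field, token.lower())
        for field, value in record.items()
        for token in value.split()
    }


def similarity(a: Record, b: Record) -> float:
    """Jaccard overlap of (field, token) pairs (an assumed stand-in metric)."""
    ta, tb = field_tokens(a), field_tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


r1 = {"name": "Boeing Company", "city": "Seattle"}
r2 = {"name": "Seattle Company", "city": "Boeing"}
# Both records contain the same three words, but only "company" sits in the
# same field in each, so the score is well below 1.
print(similarity(r1, r2))
```

Tagging each token with its field is what lets the same word contribute differently depending on where it occurs; a field-blind comparison of `r1` and `r2` would treat them as identical.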

12 Claims

1. A computer implemented method for removing similar data records from a set of data records comprising:
- providing a number of data records from the set of data records, from which one or more canonical data records are to be identified;
  
  determining similarity scores for the number of data records based on the contents of the data records;
  
  grouping together data records whose similarity score with respect to each other is greater than a threshold, wherein one or more groups of data records form nodes of a graph, and further wherein edges between nodes represent a similarity score between data records of a group; and
  
  within each said group of data records, identifying a canonical record based on the similarity of data records to each other within each group,
  
  wherein the set of data records is from a database table and within each group of data records, data records similar to the canonical record of the group are removed from the database table while leaving the canonical record stored in the database table,
  
  wherein the similarity scores are determined by identifying tokens contained within data records and distinguishing the tokens according to respective attribute fields associated with the tokens and assigning a similarity score to data records in relation to other data records based on a similarity between tokens of said data records and said other data records, wherein a same token in two different data records is evaluated differently when assigning similarity score if the same token in the two different data records is associated with two different attribute fields respectively.
- Dependent claims (2, 3, 4)
  - 2. The method of claim 1 wherein inter record similarity scores are determined for all data records in a reference table and a first group is identified as a group having a record node with a highest total similarity score.
  - 3. The method of claim 2 wherein all data records within a group are removed from the graph before additional groups and canonical records are identified.
  - 4. The method of claim 1 wherein the canonical record is chosen from data records in each group by summing the similarity scores of data records found to be similar to each other based on the threshold.
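
One way to read claims 1 through 4 together is as a greedy loop over a thresholded similarity graph. The sketch below is an interpretation, not the patented implementation: it assumes a group consists of the highest-scoring node plus its direct neighbours, and the names `group_records`, `ids`, `score`, and `threshold` are hypothetical.

```python
from typing import Callable, Dict, Hashable, List, Set, Tuple


def group_records(
    ids: List[Hashable],
    score: Callable[[Hashable, Hashable], float],
    threshold: float,
) -> List[Tuple[Hashable, Set[Hashable]]]:
    """Return (canonical_id, member_ids) pairs, strongest group first."""
    # Build the graph: keep an edge only when the score exceeds the threshold.
    edges: Dict[Hashable, Dict[Hashable, float]] = {i: {} for i in ids}
    for n, a in enumerate(ids):
        for b in ids[n + 1:]:
            s = score(a, b)
            if s > threshold:
                edges[a][b] = s
                edges[b][a] = s

    remaining = list(ids)  # input order is kept so ties break predictably
    groups: List[Tuple[Hashable, Set[Hashable]]] = []
    while remaining:
        alive = set(remaining)

        def total(node: Hashable) -> float:
            # Sum of edge weights to nodes still in the graph (claims 2 and 4).
            return sum(w for nb, w in edges[node].items() if nb in alive)

        canonical = max(remaining, key=total)
        # Assumption: the group is the canonical node plus its live neighbours.
        members = {canonical} | {nb for nb in edges[canonical] if nb in alive}
        groups.append((canonical, members))
        # Remove the whole group before looking for the next one (claim 3).
        remaining = [i for i in remaining if i not in members]
    return groups


pair_scores = {
    frozenset(("a", "b")): 0.9,
    frozenset(("a", "c")): 0.8,
    frozenset(("b", "c")): 0.5,
}
result = group_records(
    ["a", "b", "c", "d"],
    lambda x, y: pair_scores.get(frozenset((x, y)), 0.0),
    threshold=0.6,
)
```

Here record `a` has the highest total (0.9 + 0.8), so it becomes the canonical record of the first group, and `d`, which has no edges, ends up in a group of its own. Computing all pairwise scores is quadratic in the number of records; the sketch makes no attempt to avoid that.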

5. A system for removing duplicate data records from a set of data records comprising:
- a processor;
  
  an application program executed by the processor for:
  
  providing a number of data records from the set of data records, from which one or more canonical data records are to be identified;
  
  determining similarity scores for the number of data records based on the contents of the data records;
  
  grouping together data records whose similarity score with respect to each other is greater than a threshold, wherein one or more groups of data records form nodes of a graph, and further wherein edges between nodes represent a similarity score between data records of a group; and
  
  within each said group of data records, identifying a canonical record based on the similarity of data records to each other within each group,
  
  wherein the set of data records is from a database table and within each group of data records, data records similar to the canonical record of the group are removed from the database table while leaving the canonical record stored in the database table,
  
  wherein the similarity scores are determined by identifying tokens contained within data records and distinguishing the tokens according to respective attribute fields associated with the tokens and assigning a similarity score to data records in relation to other data records based on a similarity between tokens of said data records and said other data records, wherein a same token in two different data records is evaluated differently when assigning similarity score if the same token in the two different data records is associated with two different attribute fields respectively.
- Dependent claims (6, 7, 8)
  - 6. The system of claim 5 wherein inter record scores are determined for all records in a reference table and a first group is identified as a group having a record node with a highest total similarity score.
  - 7. The system of claim 6 wherein all records within a group are removed from the graph before additional groups and canonical records are identified.
  - 8. The system of claim 5 wherein the canonical record is chosen from data records in a group by summing the similarity scores of records found to be similar to each other based on the threshold.

9. A machine readable medium including instructions for executing on a computer for finding and removing similar data records from a set of data records comprising instructions for:
- providing a number of data records from the set of data records, from which one or more canonical data records are to be identified;
  
  determining similarity scores for the number of data records based on the contents of the data records;
  
  grouping together data records whose similarity score with respect to each other is greater than a threshold, wherein one or more groups of data records form nodes of a graph, and further wherein edges between nodes represent a similarity score between data records of a group; and
  
  within each said group of data records, identifying a canonical record based on the similarity of data records to each other within each group,
  
  wherein the set of data records is from a database table and within each group of data records, data records similar to the canonical record of the group are removed from the database table while leaving the canonical record stored in the database table,
  
  wherein the similarity scores are determined by identifying tokens contained within data records and distinguishing the tokens according to respective attribute fields associated with the tokens and assigning a similarity score to data records in relation to other data records based on a similarity between tokens of said data records and said other data records, wherein a same token in two different data records is evaluated differently when assigning similarity score if the same token in the two different data records is associated with two different attribute fields respectively.
- Dependent claims (10, 11)
  - 10. The machine readable medium of claim 9 wherein inter record scores are determined for all records in a reference table and a first group is identified as a group containing a record node having a highest total similarity score determined from its similarity score with other record nodes.
  - 11. The machine readable medium of claim 10 wherein all records within a group are removed from the graph before any additional groups and canonical records are identified.

12. The machine readable medium of claim 9 wherein the canonical record is chosen from data records in a group by summing the similarity scores of records found to be similar to each other based on the threshold.
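
The final step shared by the independent claims, removing the records similar to a canonical record from the database table while leaving the canonical record stored, might look like the following sketch using Python's standard `sqlite3` module. The table name `customers`, its `id` column, and the function `remove_duplicates` are hypothetical; the claims do not prescribe a database engine or schema.

```python
import sqlite3
from typing import Iterable, Set, Tuple


def remove_duplicates(
    conn: sqlite3.Connection,
    groups: Iterable[Tuple[int, Set[int]]],
) -> int:
    """Delete every non-canonical member of each group; return rows removed."""
    removed = 0
    with conn:  # one transaction: commits on success, rolls back on error
        for canonical, members in groups:
            doomed = [(m,) for m in members if m != canonical]
            # Hypothetical schema: a `customers` table keyed by `id`.
            conn.executemany("DELETE FROM customers WHERE id = ?", doomed)
            removed += len(doomed)
    return removed


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "Boeing Company"), (2, "Boeing Co"), (3, "Boeing Corp"), (4, "Acme")],
)
# Groups as (canonical_id, member_ids); rows 2 and 3 duplicate row 1.
deleted = remove_duplicates(conn, [(1, {1, 2, 3}), (4, {4})])
kept = [row[0] for row in conn.execute("SELECT id FROM customers ORDER BY id")]
```

Running the deletions inside a single transaction means the table is never left with a group half removed if an error interrupts the loop.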


Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Chaudhuri, Surajit; Ganti, Venkatesh; Kapoor, Rahul
Primary Examiner(s): Pham, Hung Q
Application Number: US10/453,992
Publication Number: US 20040249789A1
Time in Patent Office: 1,602 Days
Field of Search: 707/2, 707/6, 707/7, 707/10, 707/3, 707/4, 707/5, 707/200
US Class Current: 1/1
CPC Class Codes:
G06F 16/215   Improving data quality; Dat...
G06F 2216/03   Data mining
Y10S 707/99932   Access augmentation or opti...
Y10S 707/99933   Query processing, i.e. sear...
Y10S 707/99934   Query formulation, input pr...
Y10S 707/99935   Query augmenting and refini...
Y10S 707/99936   Pattern matching access
