
Duplicate data elimination system

  • US 7,287,019 B2
  • Filed: 06/04/2003
  • Issued: 10/23/2007
  • Est. Priority Date: 06/04/2003
  • Status: Active Grant
First Claim

1. A computer implemented method for removing similar data records from a set of data records comprising:

    providing a number of data records from the set of data records, from which one or more canonical data records are to be identified;

    determining similarity scores for the number of data records based on the contents of the data records;

    grouping together data records whose similarity score with respect to each other is greater than a threshold, wherein one or more groups of data records form nodes of a graph, and further wherein edges between nodes represent a similarity score between data records of a group; and

    within each said group of data records, identifying a canonical record based on the similarity of data records to each other within each group,

    wherein the set of data records is from a database table and within each group of data records, data records similar to the canonical record of the group are removed from the database table while leaving the canonical record stored in the database table,

    wherein the similarity scores are determined by identifying tokens contained within data records and distinguishing the tokens according to respective attribute fields associated with the tokens and assigning a similarity score to data records in relation to other data records based on a similarity between tokens of said data records and said other data records, wherein a same token in two different data records is evaluated differently when assigning similarity score if the same token in the two different data records is associated with two different attribute fields respectively.
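
The steps recited in the claim can be illustrated with a minimal Python sketch. It is not the patented implementation; several choices are assumptions made only for illustration. The claim does not specify a similarity formula, so the sketch uses Jaccard overlap of (field, token) pairs. Grouping is done by connected components over above-threshold edges, and the canonical record is taken to be the one with the highest total similarity to the rest of its group. The names `field_tokens`, `similarity`, `deduplicate` and the default `threshold=0.5` are hypothetical.

```python
from itertools import combinations


def field_tokens(record):
    """Tokens tagged with their attribute field, so the same word in two
    different fields counts as two different tokens."""
    return {
        (field, token)
        for field, value in record.items()
        for token in str(value).lower().split()
    }


def similarity(a, b):
    """Jaccard overlap of field-tagged tokens (an assumed scoring rule)."""
    ta, tb = field_tokens(a), field_tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def deduplicate(records, threshold=0.5):
    """Return the records that remain after each group of similar records
    is reduced to one canonical record."""
    n = len(records)
    parent = list(range(n))  # union-find forest over record indices
    edges = {}               # (i, j) with i < j -> similarity score

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Link every pair whose score exceeds the threshold.
    for i, j in combinations(range(n), 2):
        score = similarity(records[i], records[j])
        if score > threshold:
            edges[(i, j)] = score
            parent[find(i)] = find(j)

    # Collect groups (connected components), members in ascending index order.
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)

    kept = []
    for members in groups.values():
        def total(i):
            # Sum of edge scores from record i to the other group members.
            return sum(
                edges.get((min(i, j), max(i, j)), 0.0)
                for j in members if j != i
            )
        # Assumed rule: the most "central" record is canonical; on a tie,
        # max() keeps the earliest record.
        kept.append(max(members, key=total))

    return [records[i] for i in sorted(kept)]
```

In this sketch, a record such as `{"name": "John Smith", "city": "Boston"}` shares only one tagged token with `{"name": "Boston Smith", "city": "John"}`, even though both contain the same three words. This mirrors the claim's requirement that a token associated with different attribute fields be evaluated differently.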

  • 2 Assignments