
Batch automated blocking and record matching

  • US 7,899,796 B1
  • Filed: 11/23/2005
  • Issued: 03/01/2011
  • Est. Priority Date: 11/23/2004
  • Status: Active Grant
First Claim

1. A method of identifying duplicate records in a database comprised of a plurality of records arranged in rows and columns, the method comprising:

  • assigning a unique identifier to all records in the database that do not already have a unique identifier;

    creating a blocking subset of between 1 and all of the columns in the database;

    creating a set (S) of subsets (s), the subsets (s) consisting of the unique identifiers of records (r) from the database wherein the number of subsets (s) is less than or equal to a heuristic value wherein the heuristic value is a positive integer, m; and

    each record (r) in subset (s) has the same value in at least one of the columns in the blocking subset, the creating a set (S) of subsets (s) step constructs set (S) by first constructing a set (T) which may contain both sets with more than m members and sets with fewer than m members;

    for every subset (s) in set (S), applying a pair-wise matching algorithm to compare every record (r) in the subset (s) to one another; and

    outputting the unique identifiers of record matches identified by the pair-wise matching algorithm.
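
The steps of the claim can be illustrated with a minimal Python sketch. It is not the patented implementation, and several details are assumptions made only for illustration: records are dictionaries with an optional `id` key, the function names (`assign_ids`, `build_blocks`, `find_matches`) and the `is_match` callback are hypothetical, and m is treated as a cap on the size of each subset. The claim wording itself bounds the number of subsets by m and does not say how set (T) is reduced to set (S); here, candidate blocks outside the range of 2 to m members are simply dropped.

```python
from collections import defaultdict
from itertools import combinations


def assign_ids(records):
    """Give every record that lacks an 'id' a fresh unique integer identifier."""
    used = {r["id"] for r in records if r.get("id") is not None}
    next_id = 1
    for r in records:
        if r.get("id") is None:
            while next_id in used:
                next_id += 1
            r["id"] = next_id
            used.add(next_id)
    return records


def build_blocks(records, blocking_columns, m):
    """Build candidate set T, then derive S from it."""
    # T: one candidate block per (column, value) pair; blocks may be any size.
    t = defaultdict(set)
    for r in records:
        for col in blocking_columns:
            value = r.get(col)
            if value is not None:
                t[(col, value)].add(r["id"])
    # S: ASSUMPTION - keep only blocks with 2..m members; the claim does not
    # specify how oversized blocks in T are handled.
    return [ids for ids in t.values() if 2 <= len(ids) <= m]


def find_matches(records, blocking_columns, m, is_match):
    """Compare every pair inside each block and return matching id pairs."""
    assign_ids(records)
    by_id = {r["id"]: r for r in records}
    matches = set()
    for block in build_blocks(records, blocking_columns, m):
        for a, b in combinations(sorted(block), 2):
            if is_match(by_id[a], by_id[b]):
                matches.add((a, b))
    return sorted(matches)


# Hypothetical sample data; 'name', 'zip' and 'phone' are illustrative columns.
records = [
    {"id": 1, "name": "Ann Lee", "zip": "02139", "phone": "555-0100"},
    {"name": "Ann Lee", "zip": "02139", "phone": "555-0199"},
    {"name": "Bob Roy", "zip": "10001", "phone": "555-0100"},
    {"name": "Cy Wu", "zip": "94103", "phone": "555-0142"},
]


def same_name(a, b):
    """Toy pair-wise matcher: case-insensitive name equality."""
    return a["name"].lower() == b["name"].lower()


print(find_matches(records, ["zip", "phone"], m=3, is_match=same_name))
```

Because only records that share a value in a blocking column are ever compared, the pair-wise matcher runs on a few small blocks rather than on every pair of records in the database.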
