Batch automated blocking and record matching
First Claim
Patent Images
1. A method of identifying duplicate records in a database comprised of a plurality of records arranged in rows and columns, the method comprising:
- assigning a unique identifier to all records in the database that do not already have a unique identifier;
creating a blocking subset of between 1 and all of the columns in the database;
creating a set (S) of subsets (s), the subsets (s) consisting of the unique identifiers of records (r) from the database wherein the number of subsets (s) is less than or equal to a heuristic value wherein the heuristic value is a positive integer, m; and
each record (r) in subset (s) has the same value in at least one of the columns in the blocking subset, the creating a set (S) of subsets (s) step constructs set (S) by first constructing a set (T) which may contain both sets with more than m members and sets with fewer than m members,for every subset (s) in set (S), applying a pair-wise matching algorithm to compare every record (r) in the subset (s) to one another; and
outputting the unique identifiers of record matches identified by the pair-wise matching algorithm.
4 Assignments
0 Petitions
Accused Products
Abstract
Batch, or “offline”, blocking takes a set of records and generates sets (or blocks, hence the name blocking) of potentially matching records for the entire set. The blocks of potential matches are then passed to a matching process to evaluate which records match. Applications include but are not limited to individual matching such as student identification, householding, business matching, supply chain matching, financial matching, news or text matching, and other applications.
-
Citations
16 Claims
-
1. A method of identifying duplicate records in a database comprised of a plurality of records arranged in rows and columns, the method comprising:
-
assigning a unique identifier to all records in the database that do not already have a unique identifier; creating a blocking subset of between 1 and all of the columns in the database; creating a set (S) of subsets (s), the subsets (s) consisting of the unique identifiers of records (r) from the database wherein the number of subsets (s) is less than or equal to a heuristic value wherein the heuristic value is a positive integer, m; and
each record (r) in subset (s) has the same value in at least one of the columns in the blocking subset, the creating a set (S) of subsets (s) step constructs set (S) by first constructing a set (T) which may contain both sets with more than m members and sets with fewer than m members,for every subset (s) in set (S), applying a pair-wise matching algorithm to compare every record (r) in the subset (s) to one another; and outputting the unique identifiers of record matches identified by the pair-wise matching algorithm. - View Dependent Claims (2, 3, 4)
-
-
5. A method of identifying duplicate records between a first database and a second database, each database comprised of a plurality of records arranged in rows and columns comprising:
-
assigning a unique identifier to all records in the first database and second database that do not already have a unique identifier; creating a blocking subset of between 1 and all of the columns in the second database; mapping corresponding columns in the first database onto the blocking subset; creating a set (S) of subsets (s), the subsets (s) consisting of the unique identifiers of records (r) from the first database and/or second database wherein the number of subsets (s) is less than or equal to a heuristic value wherein the heuristic value is a positive integer, m; and
each record (r) in subset (s) has the same value in at least one of the columns in the blocking subset, the creating a set (S) of subsets (s) step constructs set (S) by first constructing a set (T) which may contain both sets with more than m members and sets with fewer than m members;for every subset (s) in set (S), applying a pair-wise matching algorithm to compare every record (r) in the subset (s) to one another, optionally omitting comparisons of records (r) which were already compared in other subsets (s); and outputting the unique identifiers of record matches identified by the pair-wise matching algorithm, optionally omitting duplicate matches. - View Dependent Claims (6, 7, 8, 9, 10)
-
-
11. A method of processing a bag of sets (Q) that produces a set of sets (R) comprising:
-
constructing a tree data structure, which has a one-to-one correspondence between the nodes on every path between the root of the tree data structure and a leaf of the tree data structure and the elements of a set in (R); and if two distinct sets t and u, both in (R), contain identical prefixes, then one path in the tree data structure, starting at the root of the tree data structure, maps to the prefix, wherein elements of a set of sets (V) in (Q) are represented by totally ordered values, where identical elements are represented by identical values, and distinct elements are represented by distinct values; and further wherein for every unique set (s) in (Q), if (Q) does not contain a superset of (s), then (R) contains (s), and, if (Q) does contain a superset of (s) then R does not contain (s), if (s) s is a superset of a set of sets (U) in the tree data structure, removing sets (U) from the tree data structure; and
adding (s) s to the tree data structure. - View Dependent Claims (12, 13, 14, 15, 16)
-
Specification