Batch automated blocking and record matching

US 7,899,796 B1
Filed: 11/23/2005
Issued: 03/01/2011
Est. Priority Date: 11/23/2004
Status: Active Grant

First Claim

Patent Images

1. A method of identifying duplicate records in a database comprised of a plurality of records arranged in rows and columns, the method comprising:

assigning a unique identifier to all records in the database that do not already have a unique identifier;

creating a blocking subset of between 1 and all of the columns in the database;

creating a set (S) of subsets (s), the subsets (s) consisting of the unique identifiers of records (r) from the database wherein the number of subsets (s) is less than or equal to a heuristic value wherein the heuristic value is a positive integer, m; and

each record (r) in subset (s) has the same value in at least one of the columns in the blocking subset, the creating a set (S) of subsets (s) step constructs set (S) by first constructing a set (T) which may contain both sets with more than m members and sets with fewer than m members,for every subset (s) in set (S), applying a pair-wise matching algorithm to compare every record (r) in the subset (s) to one another; and

outputting the unique identifiers of record matches identified by the pair-wise matching algorithm.

View all claims

4 Assignments

Timeline View

Assignment View

0 Petitions

Accused Products

Abstract

Batch, or “offline”, blocking takes a set of records and generates sets (or blocks, hence the name blocking) of potentially matching records for the entire set. The blocks of potential matches are then passed to a matching process to evaluate which records match. Applications include but are not limited to individual matching such as student identification, householding, business matching, supply chain matching, financial matching, news or text matching, and other applications.

Citations

16 Claims

1. A method of identifying duplicate records in a database comprised of a plurality of records arranged in rows and columns, the method comprising:
- assigning a unique identifier to all records in the database that do not already have a unique identifier;
  
  creating a blocking subset of between 1 and all of the columns in the database;
  
  creating a set (S) of subsets (s), the subsets (s) consisting of the unique identifiers of records (r) from the database wherein the number of subsets (s) is less than or equal to a heuristic value wherein the heuristic value is a positive integer, m; and
  
  each record (r) in subset (s) has the same value in at least one of the columns in the blocking subset, the creating a set (S) of subsets (s) step constructs set (S) by first constructing a set (T) which may contain both sets with more than m members and sets with fewer than m members,for every subset (s) in set (S), applying a pair-wise matching algorithm to compare every record (r) in the subset (s) to one another; and
  
  outputting the unique identifiers of record matches identified by the pair-wise matching algorithm.
- View Dependent Claims (2, 3, 4)
- - 2. The method of claim 1 further comprising removing any subset (s) which is a subset of another subset (s).
  - 3. The method of claim 1 wherein the applying a pair-wise matching algorithm step further comprises omitting comparisons of records (r) which were already compared in other subsets (s).
  - 4. The method of claim 1 wherein the outputting step further comprises omitting duplicate matches.

5. A method of identifying duplicate records between a first database and a second database, each database comprised of a plurality of records arranged in rows and columns comprising:
- assigning a unique identifier to all records in the first database and second database that do not already have a unique identifier;
  
  creating a blocking subset of between 1 and all of the columns in the second database;
  
  mapping corresponding columns in the first database onto the blocking subset;
  
  creating a set (S) of subsets (s), the subsets (s) consisting of the unique identifiers of records (r) from the first database and/or second database wherein the number of subsets (s) is less than or equal to a heuristic value wherein the heuristic value is a positive integer, m; and
  
  each record (r) in subset (s) has the same value in at least one of the columns in the blocking subset, the creating a set (S) of subsets (s) step constructs set (S) by first constructing a set (T) which may contain both sets with more than m members and sets with fewer than m members;
  
  for every subset (s) in set (S), applying a pair-wise matching algorithm to compare every record (r) in the subset (s) to one another, optionally omitting comparisons of records (r) which were already compared in other subsets (s); and
  
  outputting the unique identifiers of record matches identified by the pair-wise matching algorithm, optionally omitting duplicate matches.
- View Dependent Claims (6, 7, 8, 9, 10)
- - 6. The method of claim 5 where the second database is presumed to be free of duplicate records.
  - 7. The method of claim 6 where every subset (s) is required to contain at least one unique identifier from the first database.
  - 8. The method of claim 5 further comprising removing any subset (s) which is a subset of another subset (s).
  - 9. The method of claim 5 wherein the applying a pair-wise matching algorithm step further comprises omitting comparisons of records (r) which were already compared in other subsets (s).
  - 10. The method of claim 5 wherein the outputting step further comprises omitting duplicate matches.

11. A method of processing a bag of sets (Q) that produces a set of sets (R) comprising:
- constructing a tree data structure, which has a one-to-one correspondence between the nodes on every path between the root of the tree data structure and a leaf of the tree data structure and the elements of a set in (R); and
  
  if two distinct sets t and u, both in (R), contain identical prefixes, then one path in the tree data structure, starting at the root of the tree data structure, maps to the prefix,wherein elements of a set of sets (V) in (Q) are represented by totally ordered values, where identical elements are represented by identical values, and distinct elements are represented by distinct values; and
  
  further wherein for every unique set (s) in (Q), if (Q) does not contain a superset of (s), then (R) contains (s), and, if (Q) does contain a superset of (s) then R does not contain (s), if (s) s is a superset of a set of sets (U) in the tree data structure, removing sets (U) from the tree data structure; and
  
  adding (s) s to the tree data structure.
- View Dependent Claims (12, 13, 14, 15, 16)
- - 12. The method of claim 11 in which the construction of (R) comprises adding sets to the tree data structure in non-increasing order of the size of the sets.
  - 13. The method of claim 12 in which adding a set (s) to the tree data structure comprises adding (s) to the tree data structure if (s) is not a subset of some set in the tree data structure.
  - 14. The method of claim 11 in which the construction of (R) comprises adding sets to the tree data structure in non-decreasing order of the size of the sets.
  - 15. The method of claim 11 in which the elements of each set (V) in (Q) are sorted in increasing order.
  - 16. The method of claim 11 in which the elements of each set (V) in (Q) are sorted in decreasing order.

Specification

Resources

Litigation Campaign Assessment

Current Assignee
International Business Machines Corporation
Original Assignee
Open Invention Network LLC
Inventors
Goldberg, Arthur, Borthwick, Andrew, Cheung, Put, Winkel, Adam
Primary Examiner(s)
Abel-Jalil; Neveen
Assistant Examiner(s)
Hoffler; Raheem

Application Number

US11/285,215
Time in Patent Office

1,924 Days
Field of Search

None
US Class Current

707/692
CPC Class Codes

G06F 16/24556 Aggregation; Duplicate elim...

Batch automated blocking and record matching

First Claim

4 Assignments

0 Petitions

Accused Products

Abstract

Citations

16 Claims

Specification

Solutions

Use Cases

Quick Links

Batch automated blocking and record matching

First Claim

4 Assignments

Subscription Required

Subscription Required

0 Petitions

Subscription Required

Accused Products

Subscription Required

Abstract

Citations

16 Claims

Specification

Subscription Required

Solutions

Use Cases

Quick Links