Dynamic record blocking
First Claim
1. A method of using a plurality of distributed computing nodes each comprising a processor and associated storage to dynamically block records, comprising:
(a) using the distributed computing nodes comprising a processor and associated storage, grouping together records of a data set with a first set of shared properties into blocks; and
(b) using the distributed computing nodes comprising a processor and associated storage, for a block that is intractably large and therefore requires further grouping records into sub-blocks, automatically discovering based on the contents of the intractably large block, at least one second set of shared properties that enables creation of sub-blocks of tractable size, wherein the second set is different from the first set.
Abstract
Dynamic blocking determines which pairs of records in a data set should be examined as potential duplicates. Records are grouped together into blocks by shared properties that are indicators of duplication. Blocks that are too large to be efficiently processed are further subdivided by other properties chosen in a data-driven way. We demonstrate the viability of this algorithm for large data sets, having scaled the system up to work on billions of records on an 80-node Hadoop cluster.
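The grouping and subdivision described in the abstract can be illustrated with a minimal single-process sketch. It is not the patented implementation and does not use Hadoop. The record fields (`id`, `first`, `last`, `zip`), the threshold `MAX_BLOCK_SIZE`, and the rule for choosing sub-blocking properties (try each remaining property in a fixed order) are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations

MAX_BLOCK_SIZE = 3  # hypothetical threshold for a "tractable" block

# Hypothetical sample records; field names are illustrative only.
RECORDS = [
    {"id": 1, "first": "john", "last": "smith", "zip": "10001"},
    {"id": 2, "first": "jon", "last": "smith", "zip": "10001"},
    {"id": 3, "first": "mary", "last": "smith", "zip": "94105"},
    {"id": 4, "first": "mary", "last": "smith", "zip": "60601"},
    {"id": 5, "first": "ann", "last": "smith", "zip": "73301"},
    {"id": 6, "first": "ann", "last": "jones", "zip": "73301"},
]


def group_by(records, prop):
    """Group records by the value of one property, skipping missing values."""
    groups = defaultdict(list)
    for rec in records:
        value = rec.get(prop)
        if value is not None:
            groups[value].append(rec)
    return groups


def dynamic_blocks(records, props, start=0, key=()):
    """Yield (key, block) pairs whose blocks hold at most MAX_BLOCK_SIZE records.

    `key` is the tuple of (property, value) pairs shared by every record in
    the block. Oversized blocks are subdivided using properties that come
    later in `props`, so each property combination is produced only once.
    An oversized block with no properties left to try is dropped here.
    """
    for i in range(start, len(props)):
        prop = props[i]
        for value, block in group_by(records, prop).items():
            if len(block) < 2:
                continue  # a single record has nothing to be compared with
            new_key = key + ((prop, value),)
            if len(block) <= MAX_BLOCK_SIZE:
                yield new_key, block
            else:
                yield from dynamic_blocks(block, props, i + 1, new_key)


def candidate_pairs(records, props):
    """Collect the record-id pairs that share at least one tractable block."""
    pairs = set()
    for _key, block in dynamic_blocks(records, props):
        ids = sorted(rec["id"] for rec in block)
        pairs.update(combinations(ids, 2))
    return pairs


if __name__ == "__main__":
    for key, block in dynamic_blocks(RECORDS, ["last", "zip", "first"]):
        print(key, [rec["id"] for rec in block])
    print(sorted(candidate_pairs(RECORDS, ["last", "zip", "first"])))
```

In this sample the block for `last == "smith"` holds five records, which exceeds the assumed threshold, so it is split further by `zip` and `first` instead of generating all ten of its pairs.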
22 Claims
12. A system for dynamically blocking records, comprising:
a plurality of distributed computing nodes each comprising a processor and associated non-transitory memory, the nodes each storing in non-transitory storage codes that when executed in parallel:
groups together records of a data set with a first set of similar properties into blocks; and
for a block that is intractably large and therefore requires further grouping records into sub-blocks, automatically discovers based on the contents of the intractably large block, at least one second set of shared properties that enables creation of sub-blocks of tractable size, wherein the second set is different from the first set.
Specification