DYNAMIC RECORD BLOCKING
Abstract
Dynamic blocking determines which pairs of records in a data set should be examined as potential duplicates. Records are grouped together into blocks by shared properties that are indicators of duplication. Blocks that are too large to be efficiently processed are further subdivided by other properties chosen in a data-driven way. We demonstrate the viability of this algorithm for large data sets. We have scaled this system up to work on billions of records on an 80-node Hadoop cluster.
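The blocking idea summarized in the abstract can be illustrated with a minimal single-process sketch. It is not the patented implementation and does not model the distributed Hadoop execution. The function names (`dynamic_blocks`, `candidate_pairs`), the record layout, the property names, and the `max_block_size` threshold are all hypothetical. The sketch also assumes one simple sub-blocking strategy: an oversize block is split on each later property in a fixed order, using only the values that actually occur inside that block, and oversize blocks that cannot be split further are dropped.

```python
from collections import defaultdict
from itertools import combinations


def dynamic_blocks(records, properties, max_block_size):
    """Return {block key: tuple of record ids} for blocks of tractable size.

    records: dict mapping record id -> dict of property name -> value.
    properties: ordered list of property names used as blocking keys.
    max_block_size: hypothetical threshold above which a block is oversize.
    """
    accepted = {}
    queue = []  # items: (block key, member ids, index of next property to try)

    # Top-level blocks: one block per (property, value) pair.
    for i, prop in enumerate(properties):
        groups = defaultdict(list)
        for rid, rec in records.items():
            value = rec.get(prop)
            if value is not None:
                groups[((prop, value),)].append(rid)
        queue.extend((key, ids, i + 1) for key, ids in groups.items())

    while queue:
        key, ids, next_prop = queue.pop()
        if len(ids) < 2:
            continue  # a singleton block yields no candidate pairs
        if len(ids) <= max_block_size:
            accepted[key] = tuple(sorted(ids))
            continue
        # Oversize block: sub-block on each later property, using only the
        # values found in this block. If no property remains, the block is
        # dropped (an assumption of this sketch).
        for j in range(next_prop, len(properties)):
            prop = properties[j]
            groups = defaultdict(list)
            for rid in ids:
                value = records[rid].get(prop)
                if value is not None:
                    groups[key + ((prop, value),)].append(rid)
            queue.extend((k, sub, j + 1) for k, sub in groups.items())
    return accepted


def candidate_pairs(blocks):
    """Collect the record pairs to examine as potential duplicates."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(ids, 2))
    return pairs


# Made-up example data, for illustration only.
records = {
    1: {"first": "ann", "last": "smith", "zip": "98101"},
    2: {"first": "ann", "last": "smith", "zip": "98101"},
    3: {"first": "bob", "last": "smith", "zip": "10001"},
    4: {"first": "bob", "last": "smith", "zip": "10001"},
    5: {"first": "cat", "last": "jones", "zip": "10001"},
}
blocks = dynamic_blocks(records, ["first", "last", "zip"], max_block_size=2)
pairs = candidate_pairs(blocks)
```

In this toy data the `last = smith` block holds four records and exceeds the threshold, so it is split by `zip` into two tractable sub-blocks. Only the pairs inside accepted blocks are compared, rather than all ten possible pairs.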
22 Claims
1. A method of using a plurality of distributed computing nodes each comprising a processor and associated storage to dynamically block, comprising:
(a) grouping together records of a data set with similar properties into blocks; and
(b) for a block that is intractably large and therefore requires further blocking, automatically discovering based on the contents of the intractably large block, sets of shared properties that enable creation of sub-blocks of tractable size.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
12. A system for dynamically blocking comprising:
a plurality of distributed computing nodes each comprising a processor and associated non-transitory memory, the nodes each storing in non-transitory storage codes that when executed in parallel;
groups together records of a data set with similar properties into blocks; and
for a block that is intractably large and therefore requires further blocking, automatically discovers based on the contents of the intractably large block, sets of shared properties that enable creation of sub-blocks of tractable size.
Dependent claims: 13, 14, 15, 16, 17, 18, 19, 20, 21, 22
Specification