Dynamic record blocking

US 8,645,399 B2
Filed: 01/12/2012
Issued: 02/04/2014
Est. Priority Date: 01/03/2012
Status: Active Grant

Abstract

Dynamic blocking determines which pairs of records in a data set should be examined as potential duplicates. Records are grouped together into blocks by shared properties that are indicators of duplication. Blocks that are too large to be efficiently processed are further subdivided by other properties chosen in a data-driven way. We demonstrate the viability of this algorithm for large data sets. We have scaled this system up to work on billions of records on an 80-node Hadoop cluster.
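
The grouping and subdivision described in the abstract can be illustrated with a small single-machine sketch in Python. It is not the patented implementation: the function names `group_by` and `dynamic_block`, the `max_size` threshold, and the rule of trying each later property in turn are all assumptions made for illustration. The sketch also simply drops an oversized block once no further properties remain, which is one possible policy among several.

```python
from collections import defaultdict

def group_by(records, prop):
    """Group records by the value of one property, skipping missing values."""
    groups = defaultdict(list)
    for rec in records:
        value = rec.get(prop)
        if value is not None:
            groups[value].append(rec)
    return groups

def dynamic_block(records, properties, max_size=3):
    """Return {block_key: records}; block_key is a tuple of (property, value).

    max_size is a hypothetical tractability threshold. Blocks may overlap,
    because a record can share different properties with different records.
    """
    blocks = {}
    pending = []  # entries: (key, members, index of last property used)
    # Top-level blocks: one per distinct value of each property.
    for i, prop in enumerate(properties):
        for value, members in group_by(records, prop).items():
            pending.append((((prop, value),), members, i))
    while pending:
        key, members, last = pending.pop()
        if len(members) < 2:
            continue  # a single record has nothing to be compared against
        if len(members) <= max_size:
            blocks[key] = members  # tractable: keep as a final block
            continue
        # Oversized: subdivide using properties found in this block's records.
        for j in range(last + 1, len(properties)):
            prop = properties[j]
            for value, sub in group_by(members, prop).items():
                pending.append((key + ((prop, value),), sub, j))
    return blocks
```

Calling `dynamic_block(records, ["first", "last", "zip"])` on a list of dictionaries returns only blocks of at most `max_size` records; an oversized block such as all records sharing one first name is replaced by sub-blocks keyed on first name plus last name, or first name plus zip code.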

22 Claims

1. A method of using a plurality of distributed computing nodes each comprising a processor and associated storage to dynamically block records, comprising:
- (a) using the distributed computing nodes comprising a processor and associated storage, grouping together records of a data set with a first set of shared properties into blocks; and
- (b) using the distributed computing nodes comprising a processor and associated storage, for a block that is intractably large and therefore requires further grouping records into sub-blocks, automatically discovering based on the contents of the intractably large block, at least one second set of shared properties that enables creation of sub-blocks of tractable size, wherein the second set is different from the first set.
- Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
  - 2. The method of claim 1 wherein automatically discovering comprises automatically analyzing the set of properties present in records in an oversized block to dynamically guide subdivision of that oversized block.
  - 3. The method of claim 1 further including processing the discovered sets of shared record properties in parallel using the plurality of distributed computing nodes to recursively or iteratively subdivide or partition, based on similar properties, at least one intractably large block into blocks of tractable size.
  - 4. The method of claim 1 further including dynamically adjusting the discovered second set of shared properties in response to the composition of the data set and block size.
  - 5. The method of claim 1 wherein the data set comprises a massive database of personal information from diverse data sources for an online people search.
  - 6. The method of claim 1 wherein discovering applies a ramp parameter by which the maximum number of comparisons in the block of intractable size is increased with each recursion or iteration to provide a data-driven way to trade off between sub-blocking and linkage.
  - 7. The method of claim 1 further including allowing sets of records to overlap.
  - 8. The method of claim 1 further including applying the method to data sets when there is no obvious quickly-calculable metric between records and the number of records makes even a fast calculation for all pairs intractable.
  - 9. The method of claim 1 further including creating multiple top-level blocks that can be worked on independently and in parallel.
  - 10. The method of claim 1 further including dynamically adjusting the creation of sub-blocks based on several record property dimensions along which the records may vary, thereby avoiding the need to define a single ordering that places similar records next to each other.
  - 11. The method of claim 1 further including allowing maximum block size to be a function of block key length.
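
Claims 6 and 11 describe two size controls: a ramp parameter that raises the allowed number of comparisons with each recursion or iteration, and a maximum block size that depends on block key length. The claims do not state formulas, so the following Python sketch assumes simple geometric growth for both. The names `max_comparisons`, `max_block_size`, and `should_subdivide`, and the defaults `base`, `ramp`, and `growth`, are hypothetical.

```python
def pair_count(n):
    """Number of pairwise comparisons within a block of n records."""
    return n * (n - 1) // 2

def max_comparisons(iteration, base=1000, ramp=2.0):
    """Assumed comparison budget: grows by a factor of `ramp` per iteration."""
    return int(base * ramp ** iteration)

def should_subdivide(n_records, iteration):
    """True while the block still needs more comparisons than the budget."""
    return pair_count(n_records) > max_comparisons(iteration)

def max_block_size(key_length, base=100, growth=4):
    """Assumed size limit: blocks with longer keys may hold more records."""
    return base * growth ** (key_length - 1)
```

Under these assumed defaults a block of 100 records needs 4950 comparisons, so it is subdivided at iteration 0 (budget 1000) but accepted for linkage at iteration 3 (budget 8000). A larger ramp shifts work from sub-blocking toward linkage sooner.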

12. A system for dynamically blocking records, comprising:
- a plurality of distributed computing nodes each comprising a processor and associated non-transitory memory, the nodes each storing in non-transitory storage codes that when executed in parallel;
  - groups together records of a data set with a first set of similar properties into blocks; and
  - for a block that is intractably large and therefore requires further grouping records into sub-blocks, automatically discovers based on the contents of the intractably large block, at least one second set of shared properties that enables creation of sub-blocks of tractable size, wherein the second set is different from the first set.
- Dependent Claims (13, 14, 15, 16, 17, 18, 19, 20, 21, 22)
  - 13. The system of claim 12 wherein the code automatically analyzes the set of properties present in records in an oversized block to dynamically guide subdivision into sub-blocks.
  - 14. The system of claim 12 wherein the processors process the discovered sets of shared record properties in parallel to recursively or iteratively subdivide or partition, based on similar properties, at least one intractably large block into blocks of tractable size.
  - 15. The system of claim 12 wherein the processors dynamically adjust the discovered second set of shared properties in response to the composition of the data set and block size.
  - 16. The system of claim 12 wherein the data set comprises a massive database of personal information from diverse data sources for an online people search.
  - 17. The system of claim 12 wherein the processors apply a ramp parameter by which the maximum number of comparisons in the block of intractable size is increased with each recursion or iteration to provide a data-driven way to trade off between sub-blocking and linkage.
  - 18. The system of claim 12 wherein the processors allow sets of records to overlap.
  - 19. The system of claim 12 wherein the processors apply reduction to data sets when there is no obvious quickly-calculable metric between records and the number of records makes even a fast calculation for all pairs intractable.
  - 20. The system of claim 12 wherein the processors create multiple top-level blocks that can be worked on independently and in parallel.
  - 21. The system of claim 12 wherein the processors dynamically adjust the creation of sub-blocks based on several record property dimensions along which the records may vary, thereby avoiding the need to define a single ordering that places similar records next to each other.
  - 22. The system of claim 12 further including allowing maximum block size to be a function of block key length.
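
Claims 18 and 20 together imply that blocks can share records and can still be processed independently. A minimal Python sketch of that idea follows, using a thread pool as a stand-in for the distributed computing nodes. The function names and the dictionary-of-lists block layout are assumptions for illustration, not part of the patent.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

def candidate_pairs(block):
    """All unordered pairs of record ids within one block."""
    return {tuple(sorted(pair)) for pair in combinations(block, 2)}

def all_candidate_pairs(blocks, workers=4):
    """Process each block independently, then merge the results.

    blocks maps a block key to a list of record ids. Because blocks may
    overlap, the same pair can come out of several blocks; the set union
    keeps each pair once.
    """
    merged = set()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for pairs in pool.map(candidate_pairs, blocks.values()):
            merged |= pairs
    return merged
```

For example, two blocks `[1, 2, 5]` and `[1, 2, 3]` both contain the pair `(1, 2)`, and the merged result lists it only once, giving five candidate pairs rather than six.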

Current Assignee: Peopleconnect Incorporated
Original Assignee: Intelius Incorporated
Inventors: McNeill, William P.; Borthwick, Andrew
Primary Examiner(s): Coby, Frantz
Application Number: US13/349,414
Publication Number: US 20130173560A1
Time in Patent Office: 754 Days
Field of Search: 707/738, 707/737, 707/692, 707/698, 707/812, 717/159, 711/162, 711/147, 382/305, 375/240.24, 375/240, 375/24
US Class Current: 707/752
CPC Class Codes: G06F 16/215 Improving data quality; Dat...
