RECORD LINKAGE BASED ON A TRAINED BLOCKING SCHEME
Abstract
Some implementations disclosed herein provide techniques and arrangements to train a blocking scheme using both labeled data and unlabeled data. For example, training the blocking scheme may include iteratively: learning a conjunction, identifying first matches in the labeled data and the unlabeled data that are uncovered by the conjunction, and identifying second matches in the labeled data and the unlabeled data that are covered by the conjunction. The conjunctions learned across the iterations may be combined using a disjunction. A search engine may use the blocking scheme when searching for records that match an entity.
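The iterative training described in the abstract can be sketched as a greedy sequential-covering loop. This is a minimal illustration under stated assumptions, not the patented procedure: the predicates (`same_zip`, `same_last_initial`, `same_first_initial`), the scoring rule (labeled matches covered minus a penalty per unlabeled pair covered), the `max_size` and `penalty` parameters, and the sample records are all hypothetical.

```python
from itertools import combinations

# Hypothetical blocking predicates: each takes two records (dicts) and
# returns True when the pair would fall into the same block.
def same_zip(a, b):
    return a["zip"] == b["zip"]

def same_last_initial(a, b):
    return a["last"][:1] == b["last"][:1]

def same_first_initial(a, b):
    return a["first"][:1] == b["first"][:1]

PREDICATES = [same_zip, same_last_initial, same_first_initial]

def covers(conjunction, pair):
    """A conjunction covers a pair when every predicate in it holds."""
    a, b = pair
    return all(p(a, b) for p in conjunction)

def learn_conjunction(uncovered, unlabeled_pairs, max_size=2, penalty=1.0):
    """Assumed scoring: reward covered labeled matches, penalize covered
    unlabeled pairs (a proxy for the number of candidate comparisons)."""
    best, best_score = None, 0.0
    for size in range(1, max_size + 1):
        for conj in combinations(PREDICATES, size):
            gain = sum(covers(conj, m) for m in uncovered)
            cost = sum(covers(conj, u) for u in unlabeled_pairs)
            score = gain - penalty * cost
            if gain > 0 and score > best_score:
                best, best_score = conj, score
    return best

def train_blocking_scheme(labeled_matches, unlabeled_pairs):
    """Return (scheme, leftover); scheme is a list of conjunctions that is
    read as a disjunction, leftover holds matches no conjunction covered."""
    scheme, uncovered = [], list(labeled_matches)
    while uncovered:
        conj = learn_conjunction(uncovered, unlabeled_pairs)
        if conj is None:  # no conjunction scores above zero: stop
            break
        # Split the remaining matches into uncovered and covered sets.
        uncovered = [m for m in uncovered if not covers(conj, m)]
        scheme.append(conj)
    return scheme, uncovered

def scheme_covers(scheme, pair):
    """Disjunction of conjunctions: any one conjunction suffices."""
    return any(covers(conj, pair) for conj in scheme)

def rec(first, last, zip_code):
    return {"first": first, "last": last, "zip": zip_code}

# Hypothetical sample data: labeled matching pairs and unlabeled pairs.
labeled = [
    (rec("Ann", "Lee", "98052"), rec("Ann", "Smith", "98052")),
    (rec("Bob", "Ray", "10001"), rec("Bob", "Rae", "10002")),
]
unlabeled = [
    (rec("Cat", "Lim", "98052"), rec("Cal", "Orr", "60601")),
    (rec("Dan", "Poe", "30301"), rec("Dee", "Fox", "73301")),
    (rec("Eve", "Kim", "10001"), rec("Gus", "Hall", "10001")),
    (rec("Hal", "Moss", "20001"), rec("Ian", "Mann", "94105")),
]

scheme, leftover = train_blocking_scheme(labeled, unlabeled)
scheme_names = [tuple(p.__name__ for p in conj) for conj in scheme]
```

With this sample data, each single predicate covers as many unlabeled pairs as labeled matches, so the loop prefers two-predicate conjunctions and needs two iterations to cover both labeled matches. The unlabeled pairs stand in for the cost of an overly broad block; other scoring rules would fit the same loop.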
20 Claims
1. A method under control of one or more processors configured with executable instructions, the method comprising:
selecting a set of unlabeled data;
selecting a set of labeled data;
learning one or more conjunctions;
identifying matches in the labeled data and the unlabeled data that are uncovered by each of the one or more conjunctions;
identifying the matches in the labeled data and the unlabeled data that are covered by each of the one or more conjunctions; and
combining the one or more conjunctions to create a blocking scheme.
Dependent claims: 2, 3, 4, 5, 6, 7
8. Computer-readable media including instructions executable by one or more processors to perform operations comprising:
identifying multiple sets of records;
grouping, based on a blocking scheme, records selected from the multiple sets of records to create blocks of records, the blocking scheme trained using training data that comprises at least a sample of unlabeled data;
comparing the records within each block of the blocks of records; and
identifying matching records that refer to a same entity based on the comparing to create linked records.
Dependent claims: 9, 10, 11, 12, 13
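The operations recited in claim 8 (grouping records into blocks, comparing within each block, and linking matches) can be illustrated with a short sketch. Everything specific here is an assumption made for illustration: the key functions, the hard-coded `SCHEME` (in practice it would come from training), the name-similarity comparison based on the standard library's `difflib.SequenceMatcher`, the `threshold` value, and the sample records.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical key functions; a conjunction is a tuple of them, and two
# records share a block when all keys of some conjunction are equal.
def zip_code(r):
    return r["zip"]

def last_initial(r):
    return r["last"][:1].lower()

def first_initial(r):
    return r["first"][:1].lower()

# Assumed, hand-written scheme: a disjunction (list) of conjunctions (tuples).
SCHEME = [(zip_code, first_initial), (last_initial, first_initial)]

def build_blocks(records, scheme):
    """Group records by the key tuple of each conjunction in the scheme."""
    blocks = defaultdict(list)
    for record in records:
        for index, conjunction in enumerate(scheme):
            key = (index,) + tuple(f(record) for f in conjunction)
            blocks[key].append(record)
    return blocks

def similar(a, b, threshold=0.8):
    """Assumed comparison: fuzzy similarity of the full names."""
    name_a = f'{a["first"]} {a["last"]}'.lower()
    name_b = f'{b["first"]} {b["last"]}'.lower()
    return SequenceMatcher(None, name_a, name_b).ratio() >= threshold

def link_records(set_a, set_b, scheme):
    """Compare only records that share a block and come from different sets."""
    blocks = build_blocks(set_a + set_b, scheme)
    linked = set()
    for block in blocks.values():
        for a, b in combinations(block, 2):
            if a["source"] != b["source"] and similar(a, b):
                linked.add(tuple(sorted((a["id"], b["id"]))))
    return sorted(linked)

# Hypothetical sample record sets.
set_a = [
    {"id": "a1", "source": "A", "first": "Bob", "last": "Ray", "zip": "10001"},
    {"id": "a2", "source": "A", "first": "Eve", "last": "Kim", "zip": "10001"},
]
set_b = [
    {"id": "b1", "source": "B", "first": "Bob", "last": "Rae", "zip": "10002"},
    {"id": "b2", "source": "B", "first": "Gus", "last": "Hall", "zip": "10001"},
]

links = link_records(set_a, set_b, SCHEME)
```

In this sample only `a1` and `b1` share a block, so a single comparison is made instead of the four that comparing every cross-set pair would require. That reduction in comparisons is the purpose of blocking.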
14. A computing device comprising:
one or more processors;
computer-readable media including instructions executable by the one or more processors to perform acts comprising:
learning a blocking scheme from a set of labeled data and unlabeled data, the learning comprising repeatedly performing the following acts until particular criteria are satisfied:
learning a conjunction;
identifying first matches in the labeled data and the unlabeled data that are uncovered by the conjunction;
identifying second matches in the labeled data and the unlabeled data that are covered by the conjunction; and
combining the conjunction learned in each iteration.
17. The computing device of claim 16, wherein:
a first conjunction is learned in a first iteration and a second conjunction is learned in the second iteration that occurs after the first iteration; and
the first conjunction includes one or more predicates.
Dependent claims: 18, 19
20. The computing device of claim 16, further comprising performing a search on multiple sets of records to identify records that match an entity based on the blocking scheme.
Specification