Systems and methods for record linkage and paraphrase generation using surrogate learning
First Claim
1. A method of using a processor and a memory for classifying data associated with a feature space X to a set of classes y={0,1}, wherein features defining the feature space X are partitioned into X=X1×X2, a random feature vector x∈X is denoted correspondingly as x=(x1, x2), and feature x1 is a binary random variable related to a block size, the method comprising:
estimating P(x1|x2) from a set of unlabeled data;
estimating P(x1=0|x2) from a set of labeled data;
determining whether to classify a portion of the data to y=0 or y=1 based on the estimated P(x1=0|x2);
logically associating the portion of the data in the memory with the class y=0 or the class y=1 based on the determination; and
linking data based at least in part on the logically associating step.
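The estimation and decision steps of this claim can be illustrated with a minimal Python sketch. The claim does not say how P(x1=0|x2) is estimated or what decision rule is applied, so everything specific here is an assumption: x2 is treated as a discrete, hashable value, the probability is estimated by smoothed frequency counts, the 0.5 threshold is arbitrary, and the choice to map a high P(x1=0|x2) to y=0 is illustrative only. The function names are hypothetical.

```python
from collections import Counter


def estimate_p_x1_zero_given_x2(pairs, smoothing=1.0):
    """Estimate P(x1=0 | x2) by smoothed counting over (x1, x2) pairs.

    No class label y is needed, so the pairs can come from unlabeled data.
    Assumption: x1 is 0 or 1 and x2 is a discrete, hashable value.
    """
    zero_counts = Counter()
    totals = Counter()
    for x1, x2 in pairs:
        totals[x2] += 1
        if x1 == 0:
            zero_counts[x2] += 1

    def p(x2):
        # Laplace smoothing; a value of x2 never seen before yields 0.5.
        return (zero_counts[x2] + smoothing) / (totals[x2] + 2.0 * smoothing)

    return p


def classify(p_x1_zero, x2, threshold=0.5):
    """Return 0 if the estimated P(x1=0 | x2) exceeds the threshold, else 1.

    Assumption: both the threshold and the direction of the mapping are
    illustrative; the claim only says the decision is based on the estimate.
    """
    return 0 if p_x1_zero(x2) > threshold else 1
```

In a record-linkage setting, each pair would describe one update-record/candidate-record comparison, and the class assigned by `classify` would then decide whether the two records are linked.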
Abstract
A method of using unlabeled data to train a classifier is disclosed. In one embodiment related to record linkage, the method entails retrieving a set of candidate data records from a master database based on at least one update record. Next, a surrogate learning technique is used to identify one of the candidate data records as a match for the update record. Lastly, the exemplary method links or merges the update record and the identified candidate data record.
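The three stages named in the abstract (retrieve candidates, identify a match, link or merge) can be sketched in Python as follows. This is only an outline under stated assumptions: records are plain dictionaries, `block_key` and `match_score` are hypothetical callables supplied by the caller, `match_score` merely stands in for the surrogate-learning classifier, and the merge policy (update fields override master fields) is an illustrative choice not taken from the source.

```python
def link_update_record(update, master, block_key, match_score):
    """Retrieve candidates, pick the best-scoring one, merge it with the update.

    Assumptions: `block_key(record)` returns a hashable blocking key and
    `match_score(update, candidate)` returns a number where higher means a
    more likely match. Both are hypothetical placeholders.
    """
    # Stage 1: retrieve candidate records that share the update's blocking key.
    key = block_key(update)
    candidates = [rec for rec in master if block_key(rec) == key]
    if not candidates:
        return None  # nothing to link against

    # Stage 2: identify the single best candidate according to the scorer.
    best = max(candidates, key=lambda rec: match_score(update, rec))

    # Stage 3: merge; fields from the update record override the master's.
    return {**best, **update}
```

A caller would pass in whatever blocking key and scoring function its data calls for; returning `None` for an empty candidate set leaves the decision about creating a new master record to the caller.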
15 Claims
1. A method of using a processor and a memory for classifying data associated with a feature space X to a set of classes y={0,1}, wherein features defining the feature space X are partitioned into X=X1×X2, a random feature vector x∈X is denoted correspondingly as x=(x1, x2), and feature x1 is a binary random variable related to a block size, the method comprising:
estimating P(x1|x2) from a set of unlabeled data;
estimating P(x1=0|x2) from a set of labeled data;
determining whether to classify a portion of the data to y=0 or y=1 based on the estimated P(x1=0|x2);
logically associating the portion of the data in the memory with the class y=0 or the class y=1 based on the determination; and
linking data based at least in part on the logically associating step.
Dependent claims: 2, 3, 4, 5
6. A system having a processor and a memory for classifying data associated with a feature space X to a set of classes y={0,1}, wherein features defining the feature space X are partitioned into X=X1×X2, a random feature vector x∈X is denoted correspondingly as x=(x1, x2), and feature x1 is a binary random variable related to a block size, the system further comprising:
means for estimating P(x1|x2) from a set of unlabeled data;
means for estimating P(x1=0|x2) from a set of labeled data;
means for determining whether to classify a portion of the data to y=0 or y=1 based on the estimated P(x1=0|x2);
means, responsive to the determination, for logically associating the portion of the data in the memory with the class y=0 or the class y=1; and
means, responsive to the logical association, for linking data.
Dependent claims: 7, 8, 9
10. A method of using a processor and a memory for linking or merging update records with a master database of data records, the method comprising:
performing a blocking operation and retrieving a set of candidate data records from the master database based on at least one update record;
using surrogate learning to identify one of the candidate data records as a match for the one update record, the surrogate learning being based at least in part on a feature representing the inverse block size resulting from the blocking operation; and
linking or merging the update record and the identified one of the candidate data records.
Dependent claims: 11, 12
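The inverse-block-size feature recited in claim 10 can be illustrated with a short Python sketch. The claim does not define how blocking is performed or how an empty block is handled, so the equality-based blocking key, the function name, and the convention of returning 0.0 for an empty block are all assumptions made for illustration.

```python
def block_with_inverse_size(update, master, block_key):
    """Run a simple blocking step and compute the inverse block size.

    Assumption: `block_key(record)` is a hypothetical callable returning a
    hashable key; records whose key equals the update's key form the block.
    """
    key = block_key(update)
    candidates = [rec for rec in master if block_key(rec) == key]
    # Inverse block size: 1/n for a block of n candidates.
    # Assumed convention: an empty block yields 0.0 rather than an error.
    inverse_block_size = 1.0 / len(candidates) if candidates else 0.0
    return candidates, inverse_block_size
```

A block containing a single candidate gives the largest feature value (1.0), and larger blocks give smaller values.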
13. A system having a processor and a memory for linking or merging update records with a master database of data records, the system comprising:
a blocking module comprising a set of code stored in the memory and executed by the processor and adapted to perform a blocking operation and retrieve a set of candidate data records from the master database based on at least one update record;
a surrogate-learning-based module comprising a set of code stored in the memory and executed by the processor and adapted to identify one of the candidate data records as a match for the one update record based at least in part on a feature representing the inverse block size resulting from the blocking operation; and
a linking or merging module comprising a set of code stored in the memory and executed by the processor and adapted to link or merge the update record and the identified one of the candidate data records.
Dependent claims: 14, 15
Specification