Probabilistic record linkage model derived from training data
Abstract
A method of training a system from examples achieves high accuracy by finding the optimal weighting of different clues indicating whether two data items such as database records should be matched or linked. The trained system provides three possible outputs when presented with two data items: yes, no or I don't know (human intervention required). A maximum entropy model can be used to determine whether the two records should be linked or matched. Using the trained maximum entropy model, a high probability indicates that the pair should be linked, a low probability indicates that the pair should not be linked, and intermediate probabilities are generally held for human review.
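The three-way output described in the abstract can be sketched in Python as a pair of thresholds applied to the model's probability. This is only an illustration: the function name `decide` and the threshold values 0.9 and 0.1 are assumptions, since the abstract does not state where the cut-offs lie.

```python
def decide(probability, link_threshold=0.9, no_link_threshold=0.1):
    """Map a link probability to one of three outcomes.

    The default thresholds are hypothetical values chosen for illustration;
    the abstract only says that high probabilities mean link, low
    probabilities mean no link, and intermediate ones go to human review.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    if probability >= link_threshold:
        return "link"
    if probability <= no_link_threshold:
        return "no-link"
    return "review"  # intermediate probability: hold for a human decision
```

Under these assumed thresholds, a pair scored at 0.95 would be linked automatically, a pair scored at 0.05 would be rejected, and a pair scored at 0.5 would be held for review.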
19 Claims
1. A computer-assisted process for determining linkages between data records comprising:
constructing a predictive model based at least in part on a product divided by a sum of products;
training said predictive model with record pair linkage data, including the step of applying at least one machine learning method on a corpus of record pairs presented so as to indicate decisions made by at least one human decision maker as to whether said record pairs should be linked; and
using said trained predictive model to automatically identify records that have a predetermined type of similarity to other data.
said predictive model comprises a minimum divergence model.
9. A method as in claim 8 wherein said minimum divergence model comprises a maximum entropy model.
10. A method as in claim 8 wherein said training step includes calculating a probability L/(L+N) where L is the product of the weights of all features indicating that first and second data items bear a predetermined relationship, and N is the product of the weights of all features indicating that said first and second data items do not bear said predetermined relationship.
19. The process of claim 1 further including determining a set of weights each corresponding to features empirically selected to indicate either that a pair of data items bear said predetermined relationship or that said plural data items do not bear said predetermined relationship, said features and said set of weights providing a
3. A computer-assisted process for linking records in at least one database including:
assigning weights to plural different factors predicting a link or non-link decision, using said assigned weights to calculate a probability=L/(L+N) where L=product of the weights of all features indicating link, and N=product of the weights of all features indicating no-link; and
using said calculated probability to generate a predictive model; and
applying said predictive model to automatically identify records within said at least one database that bear a predetermined relationship to one another.
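As an illustration of the probability recited in claim 3, the following Python sketch computes L/(L+N) from two lists of weights. The function name, argument names and example weights are assumptions made for illustration and do not come from the patent; the sketch also assumes every weight is positive, so the denominator cannot be zero.

```python
from math import prod


def link_probability(link_weights, no_link_weights):
    """Return L / (L + N) for one record pair.

    link_weights:    weights of the features indicating "link"
    no_link_weights: weights of the features indicating "no-link"
    All weights are assumed to be positive (an assumption of this sketch).
    """
    if any(w <= 0 for w in list(link_weights) + list(no_link_weights)):
        raise ValueError("weights must be positive")
    L = prod(link_weights)      # product over link features (1 if none fired)
    N = prod(no_link_weights)   # product over no-link features (1 if none fired)
    return L / (L + N)


# Hypothetical example: two link features and one no-link feature fired.
p = link_probability([4.0, 2.0], [2.0])  # L = 8, N = 2
```

With these example weights, L = 8 and N = 2, so the probability is 8/(8+2) = 0.8. When no feature fires on either side, both products are 1 and the probability is 0.5.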
11. Apparatus for training a computer-based predictive model based at least in part on a product divided by a sum of products for determining whether at least two data items have a predetermined relationship, said apparatus comprising:
an input device that accepts a training corpus comprising plural pairs of data items and an indication as to whether each of said plural pairs bears a predetermined relationship;
a feature filter that accepts a pool of possible features, and outputs, in response to said training corpus, a filtered feature pool comprising a subset of said pool; and
a maximum entropy parameter estimator responsive to said training corpus, said estimator developing weights for each of said features within said filtered feature pool for use with said computer-based predictive model.
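The feature filter of claim 11 takes a pool of candidate features and a training corpus and outputs a subset of the pool. The claim does not say how the subset is chosen, so the Python sketch below assumes one simple criterion: keep a feature only if it fires on at least `min_count` training pairs. The function name, the `min_count` parameter, the record layout and the example data are all hypothetical.

```python
from collections import Counter


def filter_features(feature_pool, training_corpus, min_count=2):
    """Keep features that fire on at least min_count training pairs.

    feature_pool:    dict mapping a feature name to a predicate on a pair
    training_corpus: list of (pair, linked) tuples; the linked flag is
                     ignored by this count-based filter (an assumed criterion)
    """
    counts = Counter()
    for pair, _linked in training_corpus:
        for name, fires in feature_pool.items():
            if fires(pair):
                counts[name] += 1
    return {name: fn for name, fn in feature_pool.items()
            if counts[name] >= min_count}


# Hypothetical feature pool and training corpus.
pool = {
    "same_last_name": lambda p: p[0]["last"] == p[1]["last"],
    "same_zip": lambda p: p[0]["zip"] == p[1]["zip"],
}
corpus = [
    (({"last": "Smith", "zip": "10001"}, {"last": "Smith", "zip": "10002"}), True),
    (({"last": "Jones", "zip": "10003"}, {"last": "Jones", "zip": "10003"}), True),
    (({"last": "Lee", "zip": "10004"}, {"last": "Kim", "zip": "10005"}), False),
]
filtered = filter_features(pool, corpus)
```

In this example `same_last_name` fires on two pairs and `same_zip` on one, so only `same_last_name` survives the filter. The surviving features would then be passed to the parameter estimator, which is not sketched here.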
16. Apparatus for determining whether pairs of data items bear a predetermined relationship, said apparatus comprising:
an input system that accepts pairs of data items; and
a discriminator that determines whether each pair of data items bears a predetermined relationship, said discriminator including a trained computer-based minimum divergence model based at least in part on a product divided by a sum of products, wherein said discriminator computes the probability that said pair of data items bears said predetermined relationship.
Specification