Probabilistic record linkage model derived from training data

US 6,523,019 B1
Filed: 10/28/1999
Issued: 02/18/2003
Est. Priority Date: 09/21/1999
Status: Expired due to Term

Abstract

A method of training a system from examples achieves high accuracy by finding the optimal weighting of different clues indicating whether two data items such as database records should be matched or linked. The trained system provides three possible outputs when presented with two data items: yes, no or I don't know (human intervention required). A maximum entropy model can be used to determine whether the two records should be linked or matched. Using the trained maximum entropy model, a high probability indicates that the pair should be linked, a low probability indicates that the pair should not be linked, and intermediate probabilities are generally held for human review.
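
The three-way outcome described in the abstract can be sketched as a pair of probability thresholds. This is only an illustration: the function name `decide` and the cut-offs 0.2 and 0.8 are assumptions made here, not values stated in the patent.

```python
def decide(probability, low=0.2, high=0.8):
    """Map a link probability to one of three outcomes.

    `low` and `high` are hypothetical thresholds chosen for illustration.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    if probability >= high:
        return "link"      # "yes": confident match
    if probability <= low:
        return "no-link"   # "no": confident non-match
    return "review"        # "I don't know": hold for human review


for p in (0.95, 0.50, 0.05):
    print(p, decide(p))
```

Widening the gap between the two thresholds would send more pairs to human review in exchange for fewer automatic decisions.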

19 Claims

1. A computer-assisted process for determining linkages between data records comprising:
- constructing a predictive model based at least in part on a product divided by a sum of products;
  
  training said predictive model with record pair linkage data, including the step of applying at least one machine learning method on a corpus of record pairs presented so as to indicate decisions made by at least one human decision maker as to whether said record pairs should be linked; and
  
  using said trained predictive model to automatically identify records that have a predetermined type of similarity to other data.
- Dependent claims (2, 8, 9, 10, 19)
  - 2. A process as in claim 1 wherein said predictive model comprises a maximum entropy model.
  - 8. The process of claim 1 wherein:
  - 9. A method as in claim 8 wherein said minimum divergence model comprises a maximum entropy model.
  - 10. A method as in claim 8 wherein said training step includes calculating a probability L/(L+N) where L is the product of the weights of all features indicating that first and second data items bear a predetermined relationship, and N is the product of the weights of all features indicating that said first and second data items do not bear said predetermined relationship.
  - 19. The process of claim 1 further including determining a set of weights each corresponding to features empirically selected to indicate either that a pair of data items bear said predetermined relationship or that said plural data items do not bear said predetermined relationship, said features and said set of weights providing a

3. A computer-assisted process for linking records in at least one database including:
- assigning weights to plural different factors predicting a link or non-link decision, using said assigned weights to calculate a probability=L/(L+N) where L=product of the weights of all features indicating link, and N=product of the weights of all features indicating no-link; and
  
  using said calculated probability to generate a predictive model; and
  
  applying said predictive model to automatically identify records within said at least one database that bear a predetermined relationship to one another.
- Dependent claims (4, 5, 6, 7)
  - 4. The process of claim 3 further including constructing said predictive model using the maximum entropy modeling technique.
  - 5. The process of claim 4 further including executing said maximum entropy modeling technique on a corpus of record pairs which have been marked by at least one person with a decision as to that person's degree of certainty that the record pair should be linked.
  - 6. The process of claim 3 further including creating a predictive model based on said calculated probability, including constructing said predictive model using a machine learning technique.
  - 7. The process of claim 6 further including executing said machine learning technique on a corpus of record pairs which have been marked by at least one person with a decision as to that person's degree of certainty that each record pair should be linked.
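
The probability recited in claim 3, L/(L+N), can be illustrated with a few lines of Python. This is a minimal sketch, assuming that each feature that fires on a record pair contributes one positive weight; the function name `link_probability` and the example weights are hypothetical.

```python
from math import prod


def link_probability(link_weights, nolink_weights):
    """Return L / (L + N).

    L is the product of the weights of the features indicating a link,
    N is the product of the weights of the features indicating no link.
    Weights are assumed to be positive; an empty list gives a product of 1.
    """
    L = prod(link_weights)
    N = prod(nolink_weights)
    return L / (L + N)


# Two hypothetical link features and one hypothetical no-link feature:
# L = 4.0 * 2.5 = 10.0, N = 5.0, so the probability is 10 / 15.
print(link_probability([4.0, 2.5], [5.0]))
```

When no feature fires at all, both products are 1 and the sketch returns 0.5, which reflects having no evidence either way.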

11. Apparatus for training a computer-based predictive model based at least in part on a product divided by a sum of products for determining whether at least two data items have a predetermined relationship, said apparatus comprising:
- an input device that accepts a training corpus comprising plural pairs of data items and an indication as to whether each of said plural pairs bears a predetermined relationship;
  
  a feature filter that accepts a pool of possible features, and outputs, in response to said training corpus, a filtered feature pool comprising a subset of said pool; and
  
  a maximum entropy parameter estimator responsive to said training corpus, said estimator developing weights for each of said features within said filtered feature pool for use with said computer-based predictive model.
- Dependent claims (12, 13, 14, 15)
  - 12. Apparatus as in claim 11 wherein said feature filter discards features not useful in discriminating between plural pairs of data items that bear a predetermined relationship and plural pairs of data items that may not bear a predetermined relationship.
  - 13. Apparatus as in claim 11 wherein said feature filter discards features not useful in discriminating between plural pairs of data items that do not bear a predetermined relationship and plural pairs of data items that may bear a predetermined relationship.
  - 14. Apparatus as in claim 11 wherein said estimator constructs a model which calculates a linkage probability based on features within the filtered feature pool that indicate an absence of linkage and features within the filtered feature pool that indicate linkage.
  - 15. Apparatus as in claim 11 wherein said estimator outputs a real-number parameter for each feature in the filtered feature pool, said real-number parameter indicating a weight.
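
The pipeline of claim 11 (training corpus, feature filter, parameter estimator) can be sketched roughly as follows. Everything concrete here is an assumption made for illustration: the feature pool, the record fields, the "fires at least `min_count` times" filter criterion, and the learning rate. Maximum entropy parameters are commonly estimated with iterative scaling; this sketch swaps in plain gradient ascent on the same conditional log-likelihood to stay short.

```python
import math

# Hypothetical feature pool: name -> (predicate on a record pair, predicted outcome).
FEATURE_POOL = {
    "same_last_name": (lambda a, b: a["last"] == b["last"], "link"),
    "same_birth_year": (lambda a, b: a["born"] == b["born"], "link"),
    "different_birth_year": (lambda a, b: a["born"] != b["born"], "nolink"),
    "same_shoe_size": (
        lambda a, b: a.get("shoe") is not None and a.get("shoe") == b.get("shoe"),
        "link",
    ),
}


def filter_features(pool, corpus, min_count=2):
    """Keep features that fire on at least min_count training pairs (assumed criterion)."""
    kept = {}
    for name, (pred, outcome) in pool.items():
        if sum(1 for a, b, _ in corpus if pred(a, b)) >= min_count:
            kept[name] = (pred, outcome)
    return kept


def pair_probability(features, weights, a, b):
    """L / (L + N) over the features that fire on the pair (a, b)."""
    link = nolink = 1.0
    for name, (pred, outcome) in features.items():
        if pred(a, b):
            if outcome == "link":
                link *= weights[name]
            else:
                nolink *= weights[name]
    return link / (link + nolink)


def estimate_weights(features, corpus, steps=200, rate=0.5):
    """Gradient ascent on the log-likelihood, working with log-weights."""
    log_w = {name: 0.0 for name in features}
    for _ in range(steps):
        weights = {n: math.exp(v) for n, v in log_w.items()}
        grad = {name: 0.0 for name in features}
        for a, b, linked in corpus:
            err = (1.0 if linked else 0.0) - pair_probability(features, weights, a, b)
            for name, (pred, outcome) in features.items():
                if pred(a, b):
                    grad[name] += err if outcome == "link" else -err
        for name in log_w:
            log_w[name] += rate * grad[name] / len(corpus)
    return {n: math.exp(v) for n, v in log_w.items()}  # one real-number weight per feature


def rec(last, born):
    return {"last": last, "born": born}


# Tiny made-up corpus of (record, record, human link decision) triples.
CORPUS = [
    (rec("Smith", 1970), rec("Smith", 1970), True),
    (rec("Smith", 1970), rec("Smith", 1985), False),
    (rec("Jones", 1962), rec("Jones", 1962), True),
    (rec("Lee", 1990), rec("Kim", 1991), False),
]

filtered = filter_features(FEATURE_POOL, CORPUS)
weights = estimate_weights(filtered, CORPUS)
p_match = pair_probability(filtered, weights, rec("Ng", 1980), rec("Ng", 1980))
p_nonmatch = pair_probability(filtered, weights, rec("Ng", 1980), rec("Ng", 1999))
print(sorted(filtered), p_match, p_nonmatch)
```

On this toy corpus the filter drops `same_shoe_size`, which never fires, and the estimator returns one positive weight for each remaining feature.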

16. Apparatus for determining whether pairs of data items bear a predetermined relationship, said apparatus comprising:
- an input system that accepts pairs of data items; and
  
  a discriminator that determines whether each pair of data items bears a predetermined relationship, said discriminator including a trained computer-based minimum divergence model based at least in part on a product divided by a sum of products, wherein said discriminator computes the probability that said pair of data items bears said predetermined relationship.
- Dependent claims (17, 18)
  - 17. Apparatus as in claim 16 wherein said computer-based minimum divergence model comprises a trained maximum entropy model.
  - 18. Apparatus as in claim 16 wherein said discriminator calculates the probability of linkage as L/(N+L) where L is the product of weighted features indicating that said data items bear said predetermined relationship, and N is the product of weighted features indicating said plural data items do not bear said predetermined relationship.
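
The phrase "a product divided by a sum of products" can be related to the standard conditional maximum entropy form. The notation below is an assumption introduced for illustration: x is a record pair, y is one of the two outcomes, f_i are binary features, and each weight alpha_i is the exponential of a real parameter lambda_i.

```latex
% Conditional maximum entropy model over two outcomes, y \in \{\text{link}, \text{no-link}\},
% with binary features f_i(x, y) \in \{0, 1\} and weights \alpha_i = e^{\lambda_i} > 0.
\[
  p(y \mid x) \;=\;
  \frac{\prod_i \alpha_i^{\,f_i(x,\,y)}}
       {\sum_{y'} \prod_i \alpha_i^{\,f_i(x,\,y')}}
\]
% Writing the two products separately,
\[
  L = \prod_i \alpha_i^{\,f_i(x,\,\text{link})},
  \qquad
  N = \prod_i \alpha_i^{\,f_i(x,\,\text{no-link})},
\]
% the probability of linkage reduces to
\[
  p(\text{link} \mid x) \;=\; \frac{L}{L + N}.
\]
```

Under these assumptions, L collects the weights of the features that fire for the link outcome and N those that fire for the no-link outcome, which matches the L/(L+N) expression recited in the claims.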

Current Assignee: Open Invention Network LLC
Original Assignee: Choicemaker Technologies, Inc.
Inventors: Borthwick, Andrew E.
Primary Examiner(s): Follansbee, John A.
Assistant Examiner(s): Hirl, Joseph P.
Application Number: US09/429,514
Time in Patent Office: 1,209 Days
Field of Search: 706/45, 706/46, 706/15, 707/2
US Class Current: 706/45
CPC Class Codes:
G06F 16/215   Improving data quality; Dat...
G06N 20/00   Machine learning
G06N 7/01   Probabilistic graphical mod...
Y10S 707/99932   Access augmentation or opti...
