System and method for automatic weight generation for probabilistic matching

US 8,332,366 B2
Filed: 06/01/2007
Issued: 12/11/2012
Est. Priority Date: 06/02/2006
Status: Active Grant

Abstract

Embodiments of the invention provide a system and method of automatically generating weights for matching data records. Each field of a record may be compared by an exact match and/or close matches, and each record comparison can result in a mathematical score which is the sum of the field comparison scores. To sum up the field scores accurately, the automatic weight generation process comprises an iterative process. In one embodiment, initial weights are computed based upon unmatched-set probabilities and default discrepancy weights associated with attributes in the comparison algorithm. A bulk cross-match is performed across the records using the initial weights, and a candidate matched set is computed for updating the discrepancy probabilities. New weights are computed based upon the unmatched probabilities and the updated discrepancy probabilities. The new weights are then tested for convergence against the old weights, and the process repeats with the new weight table until the weights converge to their final values.
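
The iterative process described in the abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the log10(m/u) weight formula (a Fellegi-Sunter style log-likelihood ratio), the exact-match-only field comparison, the single score threshold used to select the candidate matched set, and all names (`generate_weights`, `weights_from`, `default_discrepancy`, `match_threshold`, `tol`) are hypothetical and do not come from the text above.

```python
import math
from itertools import combinations

EPS = 1e-6  # keeps probabilities away from 0 and 1 so the logarithm stays finite


def weights_from(m, u):
    """Agreement weight per attribute as log10(m/u) (assumed formula)."""
    return {a: math.log10(m[a] / u[a]) for a in u}


def generate_weights(records, u, default_discrepancy=0.1,
                     match_threshold=0.9, tol=1e-3, max_iter=50):
    """Iterate cross-match -> re-estimate -> convergence test (hypothetical sketch).

    records: list of dicts sharing the attribute keys of u
    u: per-attribute probability that two unmatched records agree
    """
    attrs = list(u)
    # Initial weights: unmatched probabilities plus a default discrepancy rate.
    m = {a: 1.0 - default_discrepancy for a in attrs}
    weights = weights_from(m, u)
    for _ in range(max_iter):
        # Bulk cross-match: score every pair with the current weights.
        matched = [
            (r, s) for r, s in combinations(records, 2)
            if sum(weights[a] for a in attrs if r[a] == s[a]) >= match_threshold
        ]
        if not matched:
            break  # nothing to learn from; keep the current weights
        # Re-estimate per-attribute agreement rates from the candidate matched set.
        m = {}
        for a in attrs:
            rate = sum(r[a] == s[a] for r, s in matched) / len(matched)
            m[a] = min(max(rate, EPS), 1.0 - EPS)
        new_weights = weights_from(m, u)
        change = max(abs(new_weights[a] - weights[a]) for a in attrs)
        weights = new_weights
        if change <= tol:
            break  # weights have converged
    return weights
```

Only agreement contributes to a pair's score in this sketch; a fuller version would also assign weights to disagreement and to close (non-exact) matches.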

286 Citations

27 Claims

1. A computer-implemented method of automatically generating weights for associating a plurality of data records from one or more data sources at one or more physical locations, comprising:
- at one or more computer systems or devices coupled to the one or more databases at the one or more physical locations;
  
  generating unmatched probabilities for a set of candidate data records, wherein the unmatched probabilities are computed per attribute for each pair of data records in the set of candidate data records;
  
  determining default discrepancy probabilities per attribute for each pair of data records in the set of candidate data records based upon a data quality parameter;
  
  calculating initial weights per attribute based upon the unmatched probabilities and the default discrepancy probabilities; and
  
  iterating a process comprising the steps of:
  
  comparing each pair of data records in the set of candidate data records using the initial weights per attribute;
  
  determining a candidate matched set with results from the comparing step;
  
  generating true discrepancy probabilities with scoring information from the candidate matched set;
  
  calculating new weights per attribute based upon the unmatched probabilities and the true discrepancy probabilities to adjust performance of the association of data records; and
  
  testing for weight convergence and using the new weights if a difference between the current weights and the new weights is larger than a predetermined amount.
- Dependent claims 2-12:
  - 2. The method of claim 1, further comprising loading all or a portion of the data records from a plurality of data sources.
  - 3. The method of claim 1, further comprising generating frequency data for one or more attributes contained in the set of candidate data records.
  - 4. The method of claim 1, further comprising generating candidate anonymous data.
  - 5. The method of claim 4, further comprising enabling a user to review and modify a list identifying the candidate anonymous data.
  - 6. The method of claim 1, wherein the step of determining a candidate matched set further comprises determining, for each data record, whether it scores at or above a first threshold and a second threshold, wherein the first threshold pertains to an overall match score and wherein the second threshold pertains to a percentage of a possible match per attribute type.
  - 7. The method of claim 1, wherein the step of generating true discrepancy probabilities further comprises taking each pair of data records in the candidate matched set and calculating probabilities of matching with respect to each attribute.
  - 8. The method of claim 1, wherein the step of testing for weight convergence further comprises:
    - determining the difference between the initial weights and the new weights; and
      
      repeating the process using the new weights if the difference is larger than the predetermined amount.
  - 9. The method of claim 1, further comprising calculating candidate thresholds, wherein the candidate thresholds include auto-link and clerical-review thresholds.
  - 10. The method of claim 1, further comprising:
    - recording review results, wherein the review results include information on matched pairs and unmatched pairs in the plurality of data records; and
      
      iterating the process using the recorded review results.
  - 11. The method of claim 1, wherein the unmatched probabilities are generated by frequency counts, bootstrap sampling, or a combination thereof.
  - 12. The method of claim 1, wherein the default discrepancy probabilities are determined using stored weights.
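
Claim 11 names frequency counts and bootstrap sampling as ways to generate the unmatched probabilities but gives no formulas. The sketch below shows one plausible reading for a single attribute: the chance that two randomly drawn records agree on that attribute. The function names, the sampling-with-replacement scheme, and the `n_pairs` and `seed` parameters are assumptions made for illustration only.

```python
import random
from collections import Counter


def u_by_frequency(values):
    """Chance that two random draws (with replacement) agree: sum of squared frequencies."""
    n = len(values)
    return sum((count / n) ** 2 for count in Counter(values).values())


def u_by_bootstrap(values, n_pairs=10_000, seed=0):
    """Estimate the same chance by sampling random pairs and counting agreements."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    agreements = 0
    for _ in range(n_pairs):
        if rng.choice(values) == rng.choice(values):
            agreements += 1
    return agreements / n_pairs
```

The frequency-count version is exact for the given values, while the bootstrap version trades accuracy for the ability to handle comparisons that have no closed form, such as approximate string matches.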

13. A system, comprising:
- at least one processor; and
  
  one or more computer readable storage media storing program instructions translatable by the at least one processor to implement a method of automatically generating weights for associating a plurality of data records from one or more data sources at one or more physical locations, comprising:
  
  generating unmatched probabilities for a set of candidate data records, wherein the unmatched probabilities are computed per attribute for each pair of data records in the set of candidate data records;
  
  determining default discrepancy probabilities per attribute for each pair of data records in the set of candidate data records based upon a data quality parameter;
  
  calculating initial weights per attribute based upon the unmatched probabilities and the default discrepancy probabilities; and
  
  iterating a process comprising the steps of:
  
  comparing each pair of data records in the set of candidate data records using the initial weights per attribute;
  
  determining a candidate matched set with results from the comparing step;
  
  generating true discrepancy probabilities with scoring information from the candidate matched set;
  
  calculating new weights per attribute based upon the unmatched probabilities and the true discrepancy probabilities to adjust performance of the association of data records; and
  
  testing for weight convergence and using the new weights if a difference between the current weights and the new weights is larger than a predetermined amount.

14. A computer readable memory device carrying computer-executable program instructions translatable by one or more processors to implement a method of automatically generating weights for associating a plurality of data records from one or more data sources at one or more physical locations, comprising:
- generating unmatched probabilities for a set of candidate data records, wherein the unmatched probabilities are computed per attribute for each pair of data records in the set of candidate data records;
  
  determining default discrepancy probabilities per attribute for each pair of data records in the set of candidate data records based upon a data quality parameter;
  
  calculating initial weights per attribute based upon the unmatched probabilities and the default discrepancy probabilities; and
  
  iterating a process comprising the steps of:
  
  comparing each pair of data records in the set of candidate data records using the initial weights per attribute;
  
  determining a candidate matched set with results from the comparing step;
  
  generating true discrepancy probabilities with scoring information from the candidate matched set;
  
  calculating new weights per attribute based upon the unmatched probabilities and the true discrepancy probabilities to adjust performance of the association of data records; and
  
  testing for weight convergence and using the new weights if a difference between the current weights and the new weights is larger than a predetermined amount.

15. A system, comprising:
- at least one processor; and
  
  one or more computer readable storage media storing program instructions executable by the at least one processor to automatically generate weights for associating data records from a plurality of data sources at one or more physical locations, wherein the program instructions when executed cause the at least one processor to perform an iteration process comprising the steps of:
  
  comparing each pair of data records in a set of candidate data records using current weights per attribute;
  
  determining a candidate matched set with results from the comparing step;
  
  generating true discrepancy probabilities with scoring information from the candidate matched set;
  
  calculating new weights per attribute based upon the unmatched probabilities and the true discrepancy probabilities to adjust performance of the association of data records; and
  
  repeating the iteration process using the new weights if a difference between the current weights and the new weights is larger than a predetermined amount.
- Dependent claims 16-19:
  - 16. The system of claim 15, wherein the set of candidate data records is a subset of the data records from the plurality of data sources.
  - 17. The system of claim 15, wherein the program instructions when executed further cause the at least one processor to perform:
    - generating unmatched probabilities for the set of candidate data records, wherein the unmatched probabilities are computed per attribute for each pair of data records in the set of candidate data records;
      
      determining default discrepancy probabilities per attribute for each pair of data records in the set of candidate data records based upon a data quality parameter; and
      
      calculating the current weights per attribute based upon the unmatched probabilities and the default discrepancy probabilities.
  - 18. The system of claim 15, further comprising a user interface configured to enable a user to review and modify a list identifying candidate anonymous data derived from the data records from the plurality of data sources or a subset thereof.
  - 19. The system of claim 15, further comprising a non-volatile memory for storing review results, wherein the review results include information on matched pairs and unmatched pairs in the data records from the plurality of data sources or a subset thereof, wherein the program instructions when executed further cause the at least one processor to perform the iteration process using the stored review results.

20. A computer readable memory device storing program instructions executable by a processor to automatically generate weights for associating data records from a plurality of data sources at one or more physical locations, wherein the program instructions when executed cause the processor to perform an iteration process comprising the steps of:
- comparing each pair of data records in a set of candidate data records using current weights per attribute;
  
  determining a candidate matched set with results from the comparing step;
  
  generating true discrepancy probabilities with scoring information from the candidate matched set;
  
  calculating new weights per attribute based upon the unmatched probabilities and the true discrepancy probabilities to adjust performance of the association of data records; and
  
  repeating the iteration process using the new weights if a difference between the current weights and the new weights is larger than a predetermined amount.
- Dependent claims 21-27:
  - 21. The computer readable memory device of claim 20, wherein the program instructions when executed further cause the processor to perform:
    - generating unmatched probabilities for the set of candidate data records, wherein the unmatched probabilities are computed per attribute for each pair of data records in the set of candidate data records;
      
      determining default discrepancy probabilities per attribute for each pair of data records in the set of candidate data records based upon a data quality parameter; and
      
      calculating the current weights per attribute based upon the unmatched probabilities and the default discrepancy probabilities.
  - 22. The computer readable memory device of claim 20, wherein the program instructions when executed further cause the processor to load the set of candidate data records.
  - 23. The computer readable memory device of claim 22, wherein the set of candidate data records includes all or a subset of the data records from the plurality of data sources.
  - 24. The computer readable memory device of claim 20, wherein the program instructions when executed further cause the processor to determine, for each data record, whether it scores at or above a first threshold and a second threshold, wherein the first threshold pertains to an overall match score and wherein the second threshold pertains to a percentage of a possible match per attribute type.
  - 25. The computer readable memory device of claim 20, wherein the program instructions when executed further cause the processor to calculate candidate thresholds, wherein the candidate thresholds include auto-link and clerical-review thresholds.
  - 26. The computer readable memory device of claim 20, wherein the program instructions when executed further cause the processor to perform the iteration process using recorded review results, wherein the recorded review results include information on matched pairs and unmatched pairs in the data records from the plurality of data sources.
  - 27. The computer readable memory device of claim 20, wherein the program instructions when executed further cause the processor to generate the unmatched probabilities by frequency counts, bootstrap sampling, and a hybrid of frequency counts and bootstrap sampling.
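
The two-threshold test in claims 6 and 24 could look roughly like the Python sketch below. The claims do not say how the per-attribute percentage is computed, so this assumes it is the attribute's score divided by the maximum score that attribute can contribute. The function name `in_candidate_matched_set` and its parameters are hypothetical.

```python
def in_candidate_matched_set(attr_scores, max_scores,
                             overall_threshold, pct_threshold):
    """Return True if a scored comparison passes both thresholds.

    attr_scores: score actually earned per attribute type
    max_scores: highest possible score per attribute type (must be > 0)
    """
    # First threshold: the overall match score.
    if sum(attr_scores.values()) < overall_threshold:
        return False
    # Second threshold: share of the possible score earned per attribute type.
    return all(
        attr_scores[a] / max_scores[a] >= pct_threshold
        for a in attr_scores
    )
```

Requiring both tests keeps a comparison out of the candidate matched set when a high total comes from one or two strong attributes while another attribute barely agrees.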

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Schumacher, Scott; Ellard, Scott; Adams, Norman S.
Primary Examiner(s): Vy, Hung T
Application Number: US11/809,792
Publication Number: US 20080005106A1
Time in Patent Office: 2,020 Days
Field of Search: 707/2-7, 707/101, 707/688, 707/736, 707/737, 707/748, 703/2, 706/45, 706/52, 702/179, 704/243, 382/170, 705/33
US Class Current: 707/688
CPC Class Codes: G06F 16/24556 Aggregation; Duplicate elim...
