System and method for automatic weight generation for probabilistic matching
Abstract
Embodiments of the invention provide a system and method of automatically generating weights for matching data records. Each field of a record may be compared by an exact match and/or close matches, and each record comparison can result in a mathematical score that is the sum of the field comparison scores. To sum the field scores accurately, the automatic weight generation process is iterative. In one embodiment, initial weights are computed based upon unmatched-set probabilities and default discrepancy weights associated with attributes in the comparison algorithm. A bulk cross-match is performed across the records using the initial weights, and a candidate matched set is computed for updating the discrepancy probabilities. New weights are computed based upon the unmatched probabilities and the updated discrepancy probabilities. The new weights are then tested for convergence against the old weights, and the process repeats with the new weight table until the weights converge to their final values.
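The iterative process summarized in the abstract can be illustrated with a minimal Python sketch. This is not the patented implementation. It assumes exact-match comparison only, a log-likelihood-ratio weight formula of the kind commonly used in probabilistic record linkage, and a fixed score threshold for selecting the candidate matched set. The function names, the `threshold`, `tol`, and `max_iter` parameters, and the sample records are all hypothetical.

```python
import math
from itertools import combinations


def weights(u, d):
    """Per-attribute (agreement, disagreement) weights.

    Assumed formula: log10 likelihood ratios built from the unmatched
    agreement probability u[a] and the discrepancy probability d[a].
    """
    return {
        a: (math.log10((1 - d[a]) / u[a]), math.log10(d[a] / (1 - u[a])))
        for a in u
    }


def score(r1, r2, w):
    """Sum the per-attribute weights for one pair (exact match only)."""
    return sum(w[a][0] if r1[a] == r2[a] else w[a][1] for a in w)


def generate_weights(records, u, d0, threshold=0.0, tol=1e-3, max_iter=50):
    """Iterate: cross-match, pick matched set, update d, recompute weights."""
    d = dict(d0)
    w = weights(u, d)
    pairs = list(combinations(records, 2))  # bulk cross-match of all pairs
    for _ in range(max_iter):
        # Candidate matched set: pairs scoring at or above the threshold.
        matched = [(r1, r2) for r1, r2 in pairs if score(r1, r2, w) >= threshold]
        if not matched:
            break
        # Re-estimate discrepancy probabilities from the matched set,
        # clamped away from 0 and 1 so the logarithms stay finite.
        for a in u:
            disagree = sum(r1[a] != r2[a] for r1, r2 in matched)
            d[a] = min(max(disagree / len(matched), 1e-4), 1 - 1e-4)
        new_w = weights(u, d)
        # Convergence test: largest change in any weight.
        delta = max(abs(new_w[a][i] - w[a][i]) for a in u for i in (0, 1))
        w = new_w
        if delta < tol:
            break
    return w, d


# Hypothetical sample data and starting probabilities.
records = [
    {"name": "ann", "dob": "1980"},
    {"name": "ann", "dob": "1980"},
    {"name": "ann", "dob": "1981"},
    {"name": "bob", "dob": "1975"},
    {"name": "bob", "dob": "1975"},
    {"name": "cat", "dob": "1990"},
]
u = {"name": 0.2, "dob": 0.1}
d0 = {"name": 0.05, "dob": 0.05}
final_w, final_d = generate_weights(records, u, d0, threshold=-1.0)
```

On a data set this small the re-estimated discrepancy probabilities are dominated by a handful of pairs, so the sketch shows only the shape of the loop, not realistic weight values.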
27 Claims
1. A method of automatically generating weights for associating a plurality of data records, comprising:
generating unmatched probabilities for a set of candidate data records, wherein the unmatched probabilities are computed per attribute for each pair of data records in the set of candidate data records;
determining default discrepancy probabilities per attribute for each pair of data records in the set of candidate data records based upon a data quality parameter;
calculating initial weights per attribute based upon the unmatched probabilities and the default discrepancy probabilities; and
iterating a process comprising the steps of:
comparing each pair of data records in the set of candidate data records using the initial weights;
determining a candidate matched set with results from the comparing step;
generating true discrepancy probabilities with scoring information from the candidate matched set;
calculating new weights based upon the unmatched probabilities and the true discrepancy probabilities; and
testing for weight convergence.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14
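The three initialization steps of claim 1 (unmatched probabilities, default discrepancy probabilities, initial weights) might be sketched as follows. The claim does not give formulas, so everything here is an assumption: the unmatched probability is estimated from attribute value frequencies as the chance that two randomly drawn records agree, the `DEFAULT_DISCREPANCY` table mapping a data quality setting to a probability uses made-up values, and the weights use a log-likelihood-ratio form. All names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical mapping from a data quality setting to a default
# discrepancy probability; the values are illustrative only.
DEFAULT_DISCREPANCY = {"good": 0.01, "normal": 0.05, "poor": 0.20}


def unmatched_probabilities(records, attributes):
    """Estimate, per attribute, the chance that two random records agree.

    Assumption: the sum of squared relative value frequencies, clamped
    below 1 so that later logarithms stay finite.
    """
    n = len(records)
    probs = {}
    for a in attributes:
        counts = Counter(r[a] for r in records)
        p = sum((c / n) ** 2 for c in counts.values())
        probs[a] = min(p, 1 - 1e-6)
    return probs


def default_discrepancies(attributes, data_quality="normal"):
    """Assign the same default discrepancy probability to every attribute."""
    return {a: DEFAULT_DISCREPANCY[data_quality] for a in attributes}


def initial_weights(u, d):
    """Per-attribute (agreement, disagreement) weights as log10 ratios."""
    return {
        a: (math.log10((1 - d[a]) / u[a]), math.log10(d[a] / (1 - u[a])))
        for a in u
    }


# Hypothetical sample data.
sample = [{"name": "ann"}, {"name": "ann"}, {"name": "bob"}, {"name": "cat"}]
u0 = unmatched_probabilities(sample, ["name"])
d_default = default_discrepancies(["name"], "normal")
w0 = initial_weights(u0, d_default)
```

Under these assumptions, a rare value agreement is not distinguished from a common one; a fuller treatment would likely weight agreement by value frequency.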
15. A system, comprising:
at least one processor;
one or more computer readable media storing program instructions executable by the at least one processor to automatically generate weights for associating data records from a plurality of data sources, wherein the program instructions when executed cause the at least one processor to perform an iteration process comprising the steps of:
comparing each pair of data records in a set of candidate data records using current weights;
determining a candidate matched set with results from the comparing step;
generating true discrepancy probabilities with scoring information from the candidate matched set;
calculating new weights based upon the unmatched probabilities and the true discrepancy probabilities; and
repeating the iteration process using the new weights if a difference between the current weights and the new weights is larger than a predetermined amount.
Dependent claims: 16, 17, 18, 19
20. A computer readable medium storing program instructions executable by a processor to automatically generate weights for associating data records from a plurality of data sources, wherein the program instructions when executed cause the processor to perform an iteration process comprising the steps of:
comparing each pair of data records in a set of candidate data records using current weights;
determining a candidate matched set with results from the comparing step;
generating true discrepancy probabilities with scoring information from the candidate matched set;
calculating new weights based upon the unmatched probabilities and the true discrepancy probabilities; and
repeating the iteration process using the new weights if a difference between the current weights and the new weights is larger than a predetermined amount.
Dependent claims: 21, 22, 23, 24, 25, 26, 27
Specification