System and method for automatic weight generation for probabilistic matching
First Claim
1. A computer-implemented method of automatically generating weights for associating a plurality of data records from one or more data sources at one or more physical locations, comprising:
at one or more computer systems or devices coupled to the one or more databases at the one or more physical locations:
generating unmatched probabilities for a set of candidate data records, wherein the unmatched probabilities are computed per attribute for each pair of data records in the set of candidate data records;
determining default discrepancy probabilities per attribute for each pair of data records in the set of candidate data records based upon a data quality parameter;
calculating initial weights per attribute based upon the unmatched probabilities and the default discrepancy probabilities; and
iterating a process comprising the steps of:
comparing each pair of data records in the set of candidate data records using the initial weights per attribute;
determining a candidate matched set with results from the comparing step;
generating true discrepancy probabilities with scoring information from the candidate matched set;
calculating new weights per attribute based upon the unmatched probabilities and the true discrepancy probabilities to adjust performance of the association of data records; and
testing for weight convergence and using the new weights if a difference between the current weights and the new weights is larger than a predetermined amount.
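The first three steps of the claim (unmatched probabilities, default discrepancy probabilities, initial weights) can be sketched in Python as below. The claim does not give formulas, so this is only an illustration under stated assumptions: the unmatched probability is estimated per attribute from value frequencies rather than per record pair, the data quality parameter is mapped to a discrepancy probability as `1 - data_quality`, and the weights are the log10 likelihood ratios familiar from Fellegi-Sunter record linkage. All function names and the sample records are hypothetical.

```python
import math
from collections import Counter

EPS = 1e-6  # assumed clamp so that the logarithms stay finite


def clamp(p, eps=EPS):
    """Keep a probability strictly inside (0, 1)."""
    return min(max(p, eps), 1.0 - eps)


def unmatched_probability(values):
    """Estimate the chance that two unrelated records agree on an attribute.

    Assumption: sum of squared relative frequencies of the observed values.
    """
    counts = Counter(v for v in values if v is not None)
    total = sum(counts.values())
    if total == 0:
        return EPS
    return sum((n / total) ** 2 for n in counts.values())


def default_discrepancy(data_quality):
    """Hypothetical mapping from a data quality parameter in (0, 1) to the
    chance that two truly matching records disagree on an attribute."""
    return 1.0 - data_quality


def initial_weights(records, attributes, data_quality=0.95):
    """Return {attribute: {"agree": w, "disagree": w}} as log10 ratios."""
    d = clamp(default_discrepancy(data_quality))
    weights = {}
    for attr in attributes:
        u = clamp(unmatched_probability([r.get(attr) for r in records]))
        weights[attr] = {
            "agree": math.log10((1.0 - d) / u),
            "disagree": math.log10(d / (1.0 - u)),
        }
    return weights


# Hypothetical sample data, for illustration only.
records = [
    {"name": "ann", "zip": "1"},
    {"name": "bob", "zip": "1"},
    {"name": "ann", "zip": "2"},
    {"name": "cy", "zip": "1"},
]
weights = initial_weights(records, ["name", "zip"])
```

In this sketch, an attribute whose values rarely coincide by chance (here `name`) receives a larger agreement weight than one whose values often coincide (here `zip`).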
Abstract
Embodiments of the invention provide a system and method of automatically generating weights for matching data records. Each field of a record may be compared by an exact match and/or close matches, and each comparison can result in a mathematical score which is the sum of the field comparisons. To sum up the field scores accurately, the automatic weight generation process comprises an iterative process. In one embodiment, initial weights are computed based upon unmatched-set probabilities and default discrepancy weights associated with attributes in the comparison algorithm. A bulk cross-match is performed across the records using the initial weights, and a candidate matched set is computed for updating the discrepancy probabilities. New weights are computed based upon the unmatched probabilities and the updated discrepancy probabilities. The new weights are then tested for convergence against the old weights, and the process is repeated with the new weight table until the weights converge to their final values.
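The abstract describes a record-pair score as a sum of field comparisons and a convergence test on the weights, but gives no formulas. One common way to write these quantities, assuming the log-likelihood-ratio form used in Fellegi-Sunter record linkage, is shown below. The symbols are introduced here for illustration and do not come from the patent text: $u_k$ is the unmatched probability of agreement on attribute $k$, $d_k$ is its discrepancy probability, $c_k(a,b)$ is the comparison outcome for records $a$ and $b$, and $\varepsilon$ is the predetermined convergence amount.

```latex
\begin{aligned}
w_k^{\text{agree}} &= \log_{10}\frac{1 - d_k}{u_k}, \qquad
w_k^{\text{disagree}} = \log_{10}\frac{d_k}{1 - u_k}, \\
S(a,b) &= \sum_{k} w_k^{\,c_k(a,b)}, \qquad
c_k(a,b) \in \{\text{agree}, \text{disagree}\}, \\
\text{repeat while}\quad & \max_k \bigl| w_k^{\text{new}} - w_k^{\text{old}} \bigr| > \varepsilon .
\end{aligned}
```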
27 Claims
1. A computer-implemented method of automatically generating weights for associating a plurality of data records from one or more data sources at one or more physical locations, comprising:
at one or more computer systems or devices coupled to the one or more databases at the one or more physical locations:
generating unmatched probabilities for a set of candidate data records, wherein the unmatched probabilities are computed per attribute for each pair of data records in the set of candidate data records;
determining default discrepancy probabilities per attribute for each pair of data records in the set of candidate data records based upon a data quality parameter;
calculating initial weights per attribute based upon the unmatched probabilities and the default discrepancy probabilities; and
iterating a process comprising the steps of:
comparing each pair of data records in the set of candidate data records using the initial weights per attribute;
determining a candidate matched set with results from the comparing step;
generating true discrepancy probabilities with scoring information from the candidate matched set;
calculating new weights per attribute based upon the unmatched probabilities and the true discrepancy probabilities to adjust performance of the association of data records; and
testing for weight convergence and using the new weights if a difference between the current weights and the new weights is larger than a predetermined amount.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
13. A system, comprising:
at least one processor; and
one or more computer readable storage media storing program instructions translatable by the at least one processor to implement a method of automatically generating weights for associating a plurality of data records from one or more data sources at one or more physical locations, comprising:
generating unmatched probabilities for a set of candidate data records, wherein the unmatched probabilities are computed per attribute for each pair of data records in the set of candidate data records;
determining default discrepancy probabilities per attribute for each pair of data records in the set of candidate data records based upon a data quality parameter;
calculating initial weights per attribute based upon the unmatched probabilities and the default discrepancy probabilities; and
iterating a process comprising the steps of:
comparing each pair of data records in the set of candidate data records using the initial weights per attribute;
determining a candidate matched set with results from the comparing step;
generating true discrepancy probabilities with scoring information from the candidate matched set;
calculating new weights per attribute based upon the unmatched probabilities and the true discrepancy probabilities to adjust performance of the association of data records; and
testing for weight convergence and using the new weights if a difference between the current weights and the new weights is larger than a predetermined amount.
14. A computer readable memory device carrying computer-executable program instructions translatable by one or more processors to implement a method of automatically generating weights for associating a plurality of data records from one or more data sources at one or more physical locations, comprising:
generating unmatched probabilities for a set of candidate data records, wherein the unmatched probabilities are computed per attribute for each pair of data records in the set of candidate data records;
determining default discrepancy probabilities per attribute for each pair of data records in the set of candidate data records based upon a data quality parameter;
calculating initial weights per attribute based upon the unmatched probabilities and the default discrepancy probabilities; and
iterating a process comprising the steps of:
comparing each pair of data records in the set of candidate data records using the initial weights per attribute;
determining a candidate matched set with results from the comparing step;
generating true discrepancy probabilities with scoring information from the candidate matched set;
calculating new weights per attribute based upon the unmatched probabilities and the true discrepancy probabilities to adjust performance of the association of data records; and
testing for weight convergence and using the new weights if a difference between the current weights and the new weights is larger than a predetermined amount.
15. A system, comprising:
at least one processor; and
one or more computer readable storage media storing program instructions executable by the at least one processor to automatically generate weights for associating data records from a plurality of data sources at one or more physical locations, wherein the program instructions when executed cause the at least one processor to perform an iteration process comprising the steps of:
comparing each pair of data records in a set of candidate data records using current weights per attribute;
determining a candidate matched set with results from the comparing step;
generating true discrepancy probabilities with scoring information from the candidate matched set;
calculating new weights per attribute based upon the unmatched probabilities and the true discrepancy probabilities to adjust performance of the association of data records; and
repeating the iteration process using the new weights if a difference between the current weights and the new weights is larger than a predetermined amount.
Dependent claims: 16, 17, 18, 19
20. A computer readable memory device storing program instructions executable by a processor to automatically generate weights for associating data records from a plurality of data sources at one or more physical locations, wherein the program instructions when executed cause the processor to perform an iteration process comprising the steps of:
comparing each pair of data records in a set of candidate data records using current weights per attribute;
determining a candidate matched set with results from the comparing step;
generating true discrepancy probabilities with scoring information from the candidate matched set;
calculating new weights per attribute based upon the unmatched probabilities and the true discrepancy probabilities to adjust performance of the association of data records; and
repeating the iteration process using the new weights if a difference between the current weights and the new weights is larger than a predetermined amount.
Dependent claims: 21, 22, 23, 24, 25, 26, 27
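The iteration process recited in claims 15 and 20 can be sketched in Python as follows. This is a minimal illustration, not the patented implementation, and it rests on several assumptions that are not in the claim text: attributes are compared by exact equality only, the candidate matched set is every pair whose summed score reaches a fixed `threshold`, the true discrepancy probability is the fraction of candidate matched pairs that disagree on an attribute, and weights are log10 likelihood ratios. The function names, `tolerance`, `max_rounds`, and the sample data are all hypothetical.

```python
import math
from itertools import combinations


def clamp(p, eps=1e-6):
    """Keep a probability strictly inside (0, 1) so logarithms stay finite."""
    return min(max(p, eps), 1.0 - eps)


def score_pair(a, b, weights):
    """Sum the per-attribute weights: 'agree' if values are equal, else 'disagree'."""
    return sum(
        w["agree"] if a.get(attr) == b.get(attr) else w["disagree"]
        for attr, w in weights.items()
    )


def new_weights(matched_pairs, unmatched_probs):
    """Recompute weights from discrepancy rates observed in the matched set."""
    weights = {}
    for attr, u in unmatched_probs.items():
        disagreements = sum(1 for a, b in matched_pairs if a.get(attr) != b.get(attr))
        d = clamp(disagreements / len(matched_pairs))
        u = clamp(u)
        weights[attr] = {
            "agree": math.log10((1.0 - d) / u),
            "disagree": math.log10(d / (1.0 - u)),
        }
    return weights


def iterate_weights(records, weights, unmatched_probs,
                    threshold=0.0, tolerance=1e-3, max_rounds=20):
    """Repeat compare / match / re-estimate while the weights keep changing.

    `weights` and `unmatched_probs` must share the same attribute keys.
    """
    for _ in range(max_rounds):
        matched = [
            (a, b) for a, b in combinations(records, 2)
            if score_pair(a, b, weights) >= threshold
        ]
        if not matched:
            break  # nothing to learn from; keep the current weights
        updated = new_weights(matched, unmatched_probs)
        change = max(
            abs(updated[attr][kind] - weights[attr][kind])
            for attr in weights for kind in ("agree", "disagree")
        )
        weights = updated
        if change <= tolerance:
            break  # difference is within the predetermined amount
    return weights


# Hypothetical sample data, for illustration only.
records = [
    {"name": "ann", "zip": "1"},
    {"name": "ann", "zip": "1"},
    {"name": "bob", "zip": "2"},
    {"name": "bob", "zip": "2"},
]
unmatched_probs = {"name": 0.5, "zip": 0.5}
start = {attr: {"agree": math.log10(0.95 / 0.5), "disagree": math.log10(0.05 / 0.5)}
         for attr in unmatched_probs}
result = iterate_weights(records, start, unmatched_probs)
```

The `max_rounds` cap is a safeguard added for the sketch; the claims only describe repeating while the weight difference exceeds the predetermined amount.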
Specification