Statistical record linkage calibration for reflexive, symmetric and transitive distance measures at the field and field value levels without the need for human interaction
Abstract
Disclosed is a system for, and method of, calculating parameters used to determine whether records and entity representations should be linked. The system and method use a symmetric, transitive and reflexive function to allow for linking records and entity representations whose field values differ. The system and method apply iterative techniques such that parameters from each linking iteration are used in the next linking iteration. The system and method need no human interaction in order to calibrate and utilize record matching formulas used for the linking decisions.
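The role of the symmetric, transitive and reflexive function can be illustrated with a short sketch. It assumes, purely for illustration, that the function is realized as "two values are equivalent when they share a canonical key"; any relation defined this way is automatically reflexive, symmetric and transitive, and therefore splits the records into disjoint parts. The names `normalize`, `equivalent`, `partition` and the sample records are hypothetical and are not taken from the disclosure.

```python
from collections import defaultdict

def normalize(value: str) -> str:
    """Hypothetical canonical key: case-folded, letters and digits only."""
    return "".join(ch for ch in value.casefold() if ch.isalnum())

def equivalent(a: str, b: str) -> bool:
    """Equivalence relation induced by the key (an assumption of this sketch)."""
    return normalize(a) == normalize(b)

def partition(records, field):
    """Group records into parts; each part holds records with equivalent values."""
    parts = defaultdict(list)
    for rec in records:
        parts[normalize(rec[field])].append(rec)
    return dict(parts)

# Made-up sample data: three spellings of one surname and one other surname.
records = [
    {"id": 1, "last_name": "O'Brien"},
    {"id": 2, "last_name": "OBrien"},
    {"id": 3, "last_name": "obrien "},
    {"id": 4, "last_name": "Smith"},
]
parts = partition(records, "last_name")
```

Here the three non-identical spellings fall into a single part, which is what allows records with differing field values to be treated as agreeing on that field.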
12 Claims
1. A computer implemented iterative process for generating entity representations in a computer implemented database using a record matching formula and for generating parameters for the record matching formula, each entity representation comprising at least one record, the database comprising a plurality of records, each record comprising a plurality of fields, each field capable of containing a field value, wherein at least a portion of parameters for the record matching formula are configured for a particular field value associated with a selected field, and wherein the process provides for linking records or entity representations with non-identical field values, the process comprising:
applying a symmetric, reflexive and transitive function to each field value in the selected field of each of a plurality of records in the database, whereby applying the symmetric, reflexive and transitive function to each field value in the selected field of each of a plurality of records in the database defines a partition of the plurality of records, wherein the partition of the plurality of records comprises a plurality of parts, each of the parts associated with at least one field value appearing in the selected field;
calculating a first logarithm of a first probability that an arbitrary record in the database is in a part associated with the particular field value, wherein the first probability comprises a ratio of records in the part associated with the particular field value to a total number of records in the database;
forming a plurality of entity representations in the database, each entity representation comprising at least two records linked using a first instance of the record matching formula, at least one entity representation comprising at least two records linked using a first instance of the record matching formula that comprises the first logarithm of the first probability;
calculating a second logarithm of a second probability that an arbitrary entity representation in the database comprises a record that is in the part associated with the particular field value, wherein the second probability comprises a ratio of entity representations in the part associated with the particular field value to a total number of entity representations in the database;
linking at least two entity representations in the database based on a second instance of the record matching formula, wherein the second instance of the record matching formula comprises the second logarithm of the second probability, whereby a number of entity representations in the database is reduced by the linking entity representations relative to a number of entity representations in the database prior to the linking at least two entity representations; and
retrieving information from at least one record in the database.
Dependent claims: 2, 3
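The two probabilities recited in claim 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the equivalence function is assumed to be a hypothetical canonical key (`key`), the logarithm is assumed to be base 2 (the claim does not fix a base), and the records and the grouping into entity representations are made up. Each "unit" is a list of records, so the same function (`log_prob`, a hypothetical name) serves both the record-level and the entity-level calculation.

```python
import math

def key(value: str) -> str:
    """Hypothetical canonical key standing in for the equivalence function."""
    return "".join(ch for ch in value.casefold() if ch.isalnum())

def log_prob(units, field, value):
    """Log2 of the share of units containing a record in the part of `value`.

    `units` is a list of lists of records: singletons before any linking,
    entity representations afterwards.
    """
    target = key(value)
    hits = sum(1 for unit in units if any(key(r[field]) == target for r in unit))
    if hits == 0:
        raise ValueError("value does not occur in the selected field")
    return math.log2(hits / len(units))

# Made-up data: eight records, four of them in the "smith" part.
names = ["Smith", "smith", "SMITH", "Smith", "Jones", "Jones", "Lee", "Kim"]
records = [{"last_name": n} for n in names]

# First probability: every record counts on its own (4 of 8 records).
first = log_prob([[r] for r in records], "last_name", "Smith")

# Assumed result of the first linking pass: the four Smith records form one
# entity representation and the two Jones records form another.
entities = [records[0:4], records[4:6], [records[6]], [records[7]]]

# Second probability: 1 of 4 entity representations contains a "smith" record.
second = log_prob(entities, "last_name", "Smith")
```

With this data the first logarithm is -1.0 and the second is -2.0: once duplicate records have been collapsed, the value looks rarer at the entity level, which is why the parameter is recomputed between linking iterations.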
4. A computer implemented iterative process for generating entity representations in a computer implemented database using a record matching formula and for generating parameters for the record matching formula, each entity representation comprising at least one record, the database comprising a plurality of records, each record comprising a plurality of fields, each field capable of containing a field value, wherein at least a portion of parameters for the record matching formula are configured for a selected field and independent of any particular field value in the selected field, and wherein the process provides for linking records and entity representations with non-identical field values, the process comprising:
applying a symmetric, reflexive and transitive function to each field value in the selected field of each of a plurality of records in the database, whereby applying the symmetric, reflexive and transitive function to each field value in the selected field of each of a plurality of records in the database defines a partition of the plurality of records, wherein the partition of the plurality of records comprises a plurality of parts, each of the parts associated with at least one field value appearing in the particular field;
calculating a first plurality of logarithms of first probabilities that an arbitrary record in the database is in a different first part, wherein each first probability comprises a ratio of records in a particular part to a total number of records in the database;
calculating a first parameter derived from a weighted sum of the first plurality of logarithms of first probabilities;
forming a plurality of entity representations in the database, each entity representation comprising at least two records linked using a first instance of the record matching formula, at least one entity representation comprising at least two records linked using a first instance of the record matching formula that comprises the first parameter;
calculating a second plurality of logarithms of second probabilities that an arbitrary entity representation in the database comprises a record that is in a different part, wherein each second probability comprises a ratio of entity representations comprising a record in a particular part to a total number of entity representations in the database;
calculating a second parameter derived from a weighted sum of the second plurality of logarithms of second probabilities;
linking at least two entity representations in the database based on a second instance of the record matching formula, wherein the second instance of the record matching formula comprises the second parameter, whereby a number of entity representations in the database is reduced by the linking entity representations relative to a number of entity representations in the database prior to the linking at least two entity representations; and
retrieving information from at least one record in the database.
Dependent claims: 5, 6
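The field-level parameter of claim 4 can be sketched in the same spirit. The claim only requires "a weighted sum" of the logarithms; the sketch below assumes, as one plausible choice, that each logarithm is weighted by its own probability, and again assumes base-2 logarithms and a hypothetical canonical key as the equivalence function. The name `field_parameter` and the sample data are illustrative only.

```python
import math
from collections import Counter

def key(value: str) -> str:
    """Hypothetical canonical key standing in for the equivalence function."""
    return "".join(ch for ch in value.casefold() if ch.isalnum())

def field_parameter(units, field):
    """Probability-weighted sum of log2 probabilities over all parts.

    `units` is a list of lists of records. A unit counts once for every
    part in which it has at least one record.
    """
    counts = Counter()
    for unit in units:
        for part in {key(r[field]) for r in unit}:
            counts[part] += 1
    total = len(units)
    return sum((n / total) * math.log2(n / total) for n in counts.values())

# Made-up data: part sizes 4, 2, 1 and 1 out of eight records.
names = ["Smith", "smith", "SMITH", "Smith", "Jones", "Jones", "Lee", "Kim"]
records = [{"last_name": n} for n in names]

# First parameter: computed over individual records.
first_param = field_parameter([[r] for r in records], "last_name")

# Second parameter: computed over an assumed set of entity representations.
entities = [records[0:4], records[4:6], [records[6]], [records[7]]]
second_param = field_parameter(entities, "last_name")
```

For this data the first parameter is -1.75 and the second is -2.0. Unlike the value-specific logarithm of claim 1, the result is a single number for the whole field, independent of which value a given record carries. Note that an entity representation may contain records from several parts, so the entity-level probabilities need not sum to one.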
7. A computer system for iteratively generating entity representations in a computer implemented database using a record matching formula and for generating parameters for the record matching formula, each entity representation comprising at least one record, the database comprising a plurality of records, each record comprising a plurality of fields, each field capable of containing a field value, wherein at least a portion of parameters for the record matching formula are configured for a particular field value associated with a selected field, and wherein the process provides for linking records or entity representations with non-identical field values, the system comprising:
a database comprising a plurality of records, each record comprising a plurality of fields, each field capable of containing a field value;
a processor programmed to apply a symmetric, reflexive and transitive function to each field value in the selected field of each of a plurality of records in the database, whereby applying the symmetric, reflexive and transitive function to each field value in the selected field of each of a plurality of records in the database defines a partition of the plurality of records, wherein the partition of the plurality of records comprises a plurality of parts, each of the parts associated with at least one field value appearing in the selected field;
a processor programmed to calculate a first logarithm of a first probability that an arbitrary record in the database is in a part associated with the particular field value, wherein the first probability comprises a ratio of records in the part associated with the particular field value to a total number of records in the database;
a processor programmed to form and store a plurality of entity representations in the database, each entity representation comprising at least two records linked using a first instance of the record matching formula, at least one entity representation comprising at least two records linked using a first instance of the record matching formula that comprises the first logarithm of the first probability;
a processor programmed to calculate a second logarithm of a second probability that an arbitrary entity representation in the database comprises a record that is in the part associated with the particular field value, wherein the second probability comprises a ratio of entity representations in the part associated with the particular field value to a total number of entity representations in the database; and
a processor programmed to link and store at least two entity representations in the database based on a second instance of the record matching formula, wherein the second instance of the record matching formula comprises a second parameter derived from the second probability, whereby a number of entity representations in the database is reduced by the linking entity representations relative to a number of entity representations in the database prior to the linking at least two entity representations.
Dependent claims: 8, 9
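The linking element, in which the number of entity representations goes down, can be illustrated with a deliberately simple sketch. The claims do not spell out the record matching formula, so everything here is an assumption made for illustration: the score is taken to be the sum, over fields on which two entity representations share an equivalent value, of a per-value weight (imagined as the negated base-2 logarithm of that value's probability), and two entity representations are merged when the score reaches a threshold. The names `match_score`, `link_entities`, the weights and the threshold are hypothetical.

```python
from itertools import combinations

def key(value: str) -> str:
    """Hypothetical canonical key standing in for the equivalence function."""
    return "".join(ch for ch in value.casefold() if ch.isalnum())

def match_score(a, b, weights):
    """Sum the weights of fields on which entities a and b share a part."""
    score = 0.0
    for field, value_weights in weights.items():
        shared = {key(r[field]) for r in a} & {key(r[field]) for r in b}
        if shared:
            # Take the most specific shared value for this field.
            score += max(value_weights.get(part, 0.0) for part in shared)
    return score

def link_entities(entities, weights, threshold):
    """Greedily merge entity pairs whose score reaches the threshold."""
    merged = [list(e) for e in entities]
    changed = True
    while changed:
        changed = False
        for i, j in combinations(range(len(merged)), 2):
            if match_score(merged[i], merged[j], weights) >= threshold:
                merged[i].extend(merged[j])
                del merged[j]
                changed = True
                break  # indices are stale after a merge, so rescan
    return merged

# Made-up entity representations, each holding a single record.
entities = [
    [{"last_name": "Smith", "city": "Boca Raton"}],
    [{"last_name": "smith", "city": "boca raton"}],
    [{"last_name": "Smith", "city": "Miami"}],
]
# Assumed value-specific weights (imagined as -log2 of each probability).
weights = {
    "last_name": {"smith": 1.0},
    "city": {"bocaraton": 3.0, "miami": 2.0},
}
result = link_entities(entities, weights, threshold=3.5)
```

In this example the first two entity representations agree on both fields (score 4.0) and are merged, while the third shares only the common surname (score 1.0) and stays separate, so three entity representations are reduced to two.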
10. A computer system for iteratively generating entity representations in a computer implemented database using a record matching formula and for generating parameters for the record matching formula, each entity representation comprising at least one record, the database comprising a plurality of records, each record comprising a plurality of fields, each field capable of containing a field value, wherein at least a portion of parameters for the record matching formula are configured for a selected field and independent of any particular field value in the selected field, and wherein the process provides for linking records and entity representations with non-identical field values, the system comprising:
a database comprising a plurality of records, each record comprising a plurality of fields, each field capable of containing a field value;
a processor programmed to apply a symmetric, reflexive and transitive function to each field value in the selected field of each of a plurality of records in the database, whereby applying the symmetric, reflexive and transitive function to each field value in the selected field of each of a plurality of records in the database defines a partition of the plurality of records, wherein the partition of the plurality of records comprises a plurality of parts, each of the parts associated with at least one field value appearing in the particular field;
a processor programmed to calculate a first plurality of logarithms of first probabilities that an arbitrary record in the database is in a different first part, wherein each first probability comprises a ratio of records in a particular part to a total number of records in the database;
a processor programmed to calculate a first parameter derived from a weighted sum of the first plurality of logarithms of first probabilities;
a processor programmed to form and store a plurality of entity representations in the database, each entity representation comprising at least two records linked using a first instance of the record matching formula, at least one entity representation comprising at least two records linked using a first instance of the record matching formula that comprises the first parameter;
a processor programmed to calculate a second plurality of logarithms of second probabilities that an arbitrary entity representation in the database comprises a record that is in a different part, wherein each second probability comprises a ratio of entity representations comprising a record in a particular part to a total number of entity representations in the database;
a processor programmed to calculate a second parameter derived from a weighted sum of the second plurality of logarithms of second probabilities; and
a processor programmed to link and store at least two entity representations in the database based on a second instance of the record matching formula, wherein the second instance of the record matching formula comprises the second parameter, whereby a number of entity representations in the database is reduced by the linking entity representations relative to a number of entity representations in the database prior to the linking at least two entity representations.
Dependent claims: 11, 12
Specification