Automated selection of generic blocking criteria
Abstract
Field probabilities associated with fields in a database may be used to create one or more blocking criteria. The blocking criteria may be a set of fields that should be equal among two or more records in a database, so that a search of the records in the database according to the blocking criteria yields a subset of records approximately equal to or less than the specified maximum block size. Generic blocking criteria may also be created. The generic blocking criteria may be used for a batch comparison or batch linking operation within the records of the database.
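The size constraint behind the blocking criteria can be written compactly, using the definitions given in the claims. The symbols below ($N$ for the number of records, $B$ for the desired block size, $\bar{c}_f$ for the average cohort size of field $f$, $p_f$ for its field probability, and $S$ for the selected set of fields) are notation introduced here for illustration. Reading "combining" as multiplication is an assumption, and the figures in the worked example are invented.

```latex
p_f = \frac{\bar{c}_f}{N},
\qquad
N \prod_{f \in S} p_f \;\le\; B

% Illustrative numbers (not from the source):
% N = 10^{6}, \quad B = 100
% surname:    \bar{c} = 500    \Rightarrow p = 5 \times 10^{-4}
% birth year: \bar{c} = 10^{4} \Rightarrow p = 10^{-2}
%
% surname alone:        10^{6} \cdot 5 \times 10^{-4} = 500 > 100
% surname + birth year: 10^{6} \cdot 5 \times 10^{-4} \cdot 10^{-2} = 5 \le 100
```

The product is only an estimate of the resulting block size; it is exact when the field values are statistically independent, a condition the source text does not state.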
21 Claims
1. A computer implemented method of identifying a set of fields applicable to partition a plurality of records in an electronic database into one or more blocks based on a desired block size and independent of specific queries against the database, the method comprising:
receiving a desired block size;
calculating field probabilities for a plurality of fields in the database, wherein each field probability represents an average cohort size for a field divided by the number of records in the database, each of the field probabilities associated with one of the fields in the database, and wherein the average cohort size for each field corresponds to the average number of records containing a same field value in the respective field;
determining a set of fields by combining the field probabilities of one or more fields by mathematical calculation, wherein a product of the combined field probabilities and the number of records in the database is less than or equal to the desired block size, and wherein the set of fields is determined independent of specific queries against the database; and
outputting the set of fields.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
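The steps of claim 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: records are modelled as a list of dictionaries, the average cohort size is taken as the mean number of records per distinct field value, field probabilities are combined by multiplication, and fields are added greedily from most to least selective. The function names `field_probability` and `select_blocking_fields` and the sample data are hypothetical.

```python
from collections import Counter


def field_probability(records, field):
    """Average cohort size for `field` divided by the number of records.

    Assumption: "average cohort size" is the mean number of records per
    distinct value of the field. `records` must be non-empty.
    """
    n = len(records)
    cohort_sizes = Counter(record[field] for record in records)
    average_cohort_size = n / len(cohort_sizes)
    return average_cohort_size / n


def select_blocking_fields(records, fields, desired_block_size):
    """Greedily pick fields until n * product(probabilities) <= block size.

    Returns (chosen_fields, estimated_block_size). chosen_fields is None
    when even all fields together do not reach the desired block size.
    The greedy order is an illustrative choice, not taken from the source.
    """
    n = len(records)
    probabilities = {f: field_probability(records, f) for f in fields}
    chosen = []
    estimate = float(n)
    # Most selective fields (smallest probability) first.
    for f in sorted(fields, key=probabilities.get):
        chosen.append(f)
        estimate *= probabilities[f]
        if estimate <= desired_block_size:
            return chosen, estimate
    return None, estimate


# Hypothetical sample data for illustration only.
records = [
    {"surname": "Smith", "year": 1980, "state": "CA"},
    {"surname": "Smith", "year": 1981, "state": "CA"},
    {"surname": "Jones", "year": 1980, "state": "CA"},
    {"surname": "Jones", "year": 1981, "state": "NY"},
    {"surname": "Brown", "year": 1980, "state": "NY"},
    {"surname": "Brown", "year": 1981, "state": "NY"},
]

chosen, estimate = select_blocking_fields(
    records, ["surname", "year", "state"], desired_block_size=1
)
print(chosen, estimate)
```

Note that nothing in the selection depends on a particular query: only the field statistics, the record count, and the desired block size are used, which matches the "independent of specific queries" language of the claim.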
11. A computer implemented method of creating blocking criteria based on a desired block size, the method comprising:
calculating, using a programmed computer, one or more field probabilities for one or more fields in an electronic database, wherein each field probability represents an average cohort size for a field divided by the number of records in the database, each of the field probabilities associated with one of the fields in the database, and wherein the average cohort size for each field corresponds to the average number of records containing a same field value in the respective field;
determining, using a programmed computer, one or more fields by combining the field probabilities of one or more fields by mathematical calculation, wherein a product of the combined field probabilities and a number of records in the database is less than or equal to the desired block size, and wherein the one or more fields is determined independent of specific queries against the database;
grouping, using a programmed computer, the one or more fields into one or more blocking criteria;
outputting the one or more blocking criteria; and
applying, using a programmed computer, at least one of the one or more blocking criteria to the records of the database to create a smaller group of records in the database.
Dependent claims: 12, 13, 14, 15, 16
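The final "applying" step of claim 11 can be illustrated with a short Python sketch. It assumes a blocking criterion is a tuple of field names that must be equal among records, as the abstract describes, and that records are dictionaries. The function name `apply_blocking_criterion` and the sample records are hypothetical and not taken from the source.

```python
from collections import defaultdict


def apply_blocking_criterion(records, criterion):
    """Partition records into blocks whose criterion fields are all equal.

    `criterion` is a tuple of field names; each block key is the tuple of
    values a record holds for those fields.
    """
    blocks = defaultdict(list)
    for record in records:
        key = tuple(record[field] for field in criterion)
        blocks[key].append(record)
    return dict(blocks)


# Hypothetical sample data for illustration only.
records = [
    {"id": 1, "surname": "Smith", "year": 1980},
    {"id": 2, "surname": "Smith", "year": 1980},
    {"id": 3, "surname": "Smith", "year": 1975},
    {"id": 4, "surname": "Jones", "year": 1980},
]

blocks = apply_blocking_criterion(records, ("surname", "year"))
for key, members in blocks.items():
    print(key, [record["id"] for record in members])
```

Each resulting block is a smaller group of records, so a batch comparison or linking pass only needs to compare records within the same block rather than across the whole database.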
17. A system for identifying a set of fields applicable to partition a plurality of records in an electronic database into one or more blocks based on a desired block size and independent of specific queries against the database, comprising:
an electronic processor configured to receive a desired block size;
an electronic processor configured to calculate field probabilities for a plurality of fields in the database, wherein each field probability represents an average cohort size for a field divided by the number of records in the database, each of the field probabilities associated with one of the fields in the database, and wherein the average cohort size for each field corresponds to the average number of records containing a same field value in the respective field;
an electronic processor configured to determine a set of fields by combining the field probabilities of one or more fields by mathematical calculation, wherein a product of the combined field probabilities and the number of records in the database is less than or equal to the desired block size, and wherein the set of fields is determined independent of specific queries against the database; and
an electronic processor configured to output the set of fields.
Dependent claims: 18, 19, 20, 21
Specification