APPARATUS AND METHOD FOR PROBABILISTIC POPULATION SIZE AND OVERLAP DETERMINATION, REMOTE PROCESSING OF PRIVATE DATA AND PROBABILISTIC POPULATION SIZE AND OVERLAP DETERMINATION FOR THREE OR MORE DATA SETS
Abstract
The invention determines the population size and population overlap in data containing records on the unique entities without unique identifiers for the unique entities and having at least one common type of information with a known distribution of finite expectation by decomposing probabilistic calculations. The computer determines population overlap of unique entities between the data sets by subtracting a probabilistic incremental number of unique entities needed for a larger total number of values of the information with the known distribution from the data sets. The invention can also maintain the security of private data by allowing a remote computer where the original data is stored to download diagnostic and aggregation procedures from another computer over a network. The remote computer performs the functions on the data and forwards the results to the estimate processor computer over the network. The estimate processor determines population size and overlap from aggregate results and forwards this information back to the remote computer over the network. The invention also determines the overlap of three or more data sets by concatenating all combinations of the data sets and determining estimates for all subsets of the combinations of the data sets. The operations involve the cancellation of equivalent terms that have opposite signs.
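The two-data-set calculation described in the abstract can be illustrated with a minimal sketch. It assumes, purely for simplicity, that the information with the known distribution is uniform over `domain_size` possible values (for example, 365 days of the year) and that the quantity observed is the number of distinct values present; neither assumption is stated in the abstract. Under those assumptions the expected number of distinct values among n entities is D(1 - (1 - 1/D)^n), which can be inverted to estimate n. The function names are hypothetical.

```python
import math


def estimate_population(distinct_values: float, domain_size: int) -> float:
    """Invert E[distinct] = D * (1 - (1 - 1/D)**n) for n (uniform assumption)."""
    if not 0 <= distinct_values < domain_size:
        raise ValueError("distinct_values must be in [0, domain_size)")
    return math.log(1 - distinct_values / domain_size) / math.log(1 - 1 / domain_size)


def estimate_overlap(distinct_a: float, distinct_b: float,
                     distinct_combined: float, domain_size: int) -> float:
    """Overlap = smaller population minus the incremental entities needed to
    grow the larger set's distinct-value count to the combined set's count."""
    n_a = estimate_population(distinct_a, domain_size)
    n_b = estimate_population(distinct_b, domain_size)
    n_combined = estimate_population(distinct_combined, domain_size)
    larger, smaller = max(n_a, n_b), min(n_a, n_b)
    incremental = n_combined - larger  # entities the smaller set adds to the larger
    return smaller - incremental
```

The overlap returned here is algebraically the same as `n_a + n_b - n_combined`; it is written in the subtractive form only to mirror the wording of the abstract. The estimate degrades as the distinct-value count approaches `domain_size`, because the inverse grows without bound there.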
13 Citations
25 Claims
1. A method of determining population size of unique entities in data comprising aggregate data of two different data sets that overlap and containing records on unique entities without unique identifiers for the unique entities, comprising:
receiving the aggregate data comprising data set subset information and distribution occurrence information from a holder of the data; and
determining a population size estimate from the aggregate data.
(Dependent claims: 2, 3, 4, 5, 6, 7, 8)
9. A distributed method of determining population size in overlapping data sets, comprising:
performing data diagnostics and aggregation of data in the data sets at a first location; and
determining population size at a second location different from the first location.
10. A computer readable storage controlling a computer via a process of receiving aggregate data comprising data set subset and distribution occurrence information from a holder of the data and determining a population size estimate from the aggregate data.
11. A computer readable storage controlling a computer via a data structure comprising a list of records comprising for a record containing subset identifier fields and records containing subset labels and a number of occurrences fields.
12. A method of diagnosing errors in data sets containing records on unique entities without unique identifiers for the unique entities and having at least one common type of information with a known distribution of finite expectation, comprising:
determining whether the data has valid values;
determining whether the records indicate a variance from the known distribution; and
determining whether the data fills a maximum number of subdivisions of the data in the data sets.
(Dependent claims: 13)
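The three diagnostic determinations recited in claim 12 might be sketched as follows. This is only one possible reading: the chi-square statistic, the caller-supplied threshold, and the interpretation of "fills a maximum number of subdivisions" as every permitted value being present are all assumptions made for illustration, and `diagnose` is a hypothetical name.

```python
from collections import Counter


def diagnose(values, expected_probs, chi2_threshold):
    """Run three illustrative checks on a list of values.

    expected_probs maps each permitted value to its known (non-zero) probability.
    chi2_threshold is a caller-chosen cutoff for the chi-square statistic.
    """
    # Check 1: every value must belong to the permitted domain.
    invalid = [v for v in values if v not in expected_probs]

    # Check 2: compare observed counts with the known distribution.
    counts = Counter(v for v in values if v in expected_probs)
    n = sum(counts.values())
    chi2 = 0.0
    if n:
        chi2 = sum((counts[v] - n * p) ** 2 / (n * p)
                   for v, p in expected_probs.items())

    # Check 3: all subdivisions occupied means a distinct-value count
    # carries no further information about population size.
    saturated = len(counts) == len(expected_probs)

    return {
        "all_values_valid": not invalid,
        "matches_known_distribution": chi2 <= chi2_threshold,
        "subdivisions_saturated": saturated,
    }
```

For example, with four equally likely values, the input `list("aabbccdd")` passes the first two checks but reports saturation, which would signal that a finer subdivision of the data is needed before estimating.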
14. A method of aggregating data of data sets containing records on unique entities without unique identifiers for the unique entities and having at least one common type of information with a known distribution of finite expectation, comprising:
creating a combined data set comprising all records from the previous data set; and
producing aggregate data comprising data set subset and distribution occurrence information.
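A minimal sketch of the aggregation step in claim 14 follows. It assumes the aggregate consists of a subset label (which data sets a value appears in) paired with a number of occurrences, loosely following the data structure of claim 11; the exact layout is not given in this section, so `aggregate` and its output format are hypothetical.

```python
from collections import Counter, defaultdict


def aggregate(data_sets):
    """data_sets maps a data set label to an iterable of values.

    Returns a dict mapping a subset label such as "A", "B" or "AB"
    to the number of distinct values found in exactly that subset.
    """
    # Combined data set: record which data sets each value occurs in.
    membership = defaultdict(set)
    for label, values in data_sets.items():
        for value in values:
            membership[value].add(label)

    # Count distinct values per subset of data sets.
    return dict(Counter("".join(sorted(labels)) for labels in membership.values()))
```

Calling `aggregate({"A": [1, 2, 3, 3], "B": [3, 4]})` yields counts for the labels "A", "AB" and "B". Only these counts, not the underlying records, would need to leave the holder of the data.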
15. A method of determining a population size of unique entities in three or more data sets containing records on the unique entities without unique identifiers for the unique entities, said method comprising:
producing all of the N-way concatenations of the data sets where N equals from 2 to a total number of data sets; and
performing population size estimate calculations using the concatenations of the data sets.
(Dependent claims: 16, 17, 18)
19. A method of determining a population size of unique entities in three or more data sets containing records on the unique entities without unique identifiers for the unique entities, said method comprising:
producing all of the N-way concatenations of the data sets where N equals from 2 to a total number of data sets; and
performing population overlap and size estimate calculations using the concatenations of the data sets comprising canceling equivalent terms of opposite sign for all of the sets of the data sets.
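The N-way concatenation step and the signed combination of estimates can be sketched as below. The sketch uses the standard inclusion-exclusion identity, in which the overlap of all sets equals the alternating sum of the sizes of every union; whether the patent arranges its terms in exactly this way is an assumption. The size estimator is passed in as a parameter, and the function names are hypothetical.

```python
from itertools import combinations


def n_way_concatenations(data_sets):
    """Return every concatenation of N data sets for N = 2 .. len(data_sets)."""
    labels = sorted(data_sets)
    return {
        combo: [record for label in combo for record in data_sets[label]]
        for n in range(2, len(labels) + 1)
        for combo in combinations(labels, n)
    }


def overlap_of_all(data_sets, estimate_size):
    """Overlap common to all data sets via inclusion-exclusion.

    estimate_size is any callable that returns a population size estimate
    for a list of records (single sets and concatenations alike).
    """
    labels = sorted(data_sets)
    groups = {(label,): list(data_sets[label]) for label in labels}
    groups.update(n_way_concatenations(data_sets))
    # Odd-sized combinations enter with a plus sign, even-sized with a minus.
    return sum((-1) ** (len(combo) + 1) * estimate_size(records)
               for combo, records in groups.items())
```

For a quick check, an exact count such as `lambda records: len(set(records))` can stand in for the probabilistic estimator, although it presumes the very identifiers the claimed method does without. Repeating the calculation for every subset of the data sets produces many terms that appear with opposite signs; this sketch evaluates them numerically rather than cancelling them symbolically.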
20. A computer readable storage controlling a computer via a process of producing all of the N-way concatenations of the data sets where N equals from 2 to a total number of data sets and performing estimate calculations using the concatenations of the data sets.
21. An apparatus for population size and overlap determination of three or more data sets, comprising a computer producing all of the N-way concatenations of the data sets where N equals from 2 to a total number of data sets and performing population size and overlap estimate calculations using the concatenations of the data sets for all subsets of the data sets.
22. An apparatus for probabilistic population size determination, comprising a computer probabilistically calculating the population size of unique entities in data, containing records on unique entities without unique identifiers for the unique entities and having at least one common type of information with a known distribution of finite expectation, using decomposed probabilistic calculations based on values of the information with the known distribution.
23. A method using a computer to probabilistically determine a population size of unique entities in data containing records on the unique entities without unique identifiers for the unique entities and having at least one common type of information with a known distribution of finite expectation, said method comprising decomposing probabilistic calculations based on values of the information with the known distribution.
24. A computer program embodied on a computer-readable medium for probabilistically calculating a population size and a population overlap of unique entities in first and second data sets containing records on the unique entities without unique identifiers for the unique entities and having at least one common type of information with a known distribution of finite expectation, said computer program comprising:
a data preparation segment combining the first and second data sets into a combined data set;
a population size measurement segment probabilistically calculating the population size for the first and second data sets using decomposed probabilistic calculations based on values of the information with the known distribution; and
a population overlap measurement segment determining the population overlap of the unique entities between the first and second data sets by subtracting a probabilistic incremental number of unique entities needed for a larger total number of values of the information with the known distribution from either of the first and second data sets to increase to a total number of values of the information with the known distribution in the combined data set from a smaller of the population size of the first and second data sets.
25. An apparatus for population size determination, comprising a computer calculating the population size of unique entities in a database, containing records on the unique entities without unique identifiers for the unique entities and having at least one common type of information with a known distribution of finite expectation, using decomposed probabilistic calculations based on values of the information with the known distribution.
Specification