Method for statistical disclosure limitation
Abstract
A method and system for ensuring statistical disclosure limitation (SDL) of categorical or continuous micro data while maintaining the analytical quality of the micro data. The new SDL methodology exploits the analogy between (1) taking a sample instead of a census, along with some adjustments, including imputation, for missing information, and (2) releasing a subset instead of the original data set, along with some adjustments for records still at disclosure risk. Survey sampling reduces monetary cost in comparison to a census, but entails some loss of information. Similarly, releasing a subset reduces disclosure cost in comparison to the full database, but entails some loss of information. Thus, optimal survey sampling methods can be used for statistical disclosure limitation. The method includes partitioning the database into risk strata, optimal probabilistic substitution, optimal probabilistic subsampling, and optimal sampling weight calibration.
37 Citations
24 Claims
1. A computer-implemented method of processing an original database comprising a plurality of records, comprising:
partitioning the plurality of records into a plurality of risk strata based on a plurality of identifying variables, wherein each risk stratum includes at least one record; and
modifying the plurality of records based on the plurality of risk strata to create a disclosure-limited data file,
wherein the partitioning step comprises:
determining a core risk stratum comprising those records in the plurality of records that have unique data values for identifying variables in a core subset of the plurality of identifying variables; and
determining a further risk stratum comprising those records in the plurality of records that have unique data values for identifying variables in a selected subset of the plurality of identifying variables, the selected subset including each identifying variable in the core subset and at least one identifying variable not in the core subset.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 21, 22
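The partitioning step of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. The function names, the sample records, and the variable names (`age`, `sex`, `zip`) are hypothetical. The sketch also assumes that the further stratum excludes records already placed in the core stratum, and that all remaining records fall into a residual `non_unique` stratum; the claim itself does not spell out either point.

```python
from collections import Counter

def unique_on(records, variables):
    """Indices of records whose value combination on `variables` occurs exactly once."""
    keys = [tuple(r[v] for v in variables) for r in records]
    counts = Counter(keys)
    return {i for i, k in enumerate(keys) if counts[k] == 1}

def partition_risk_strata(records, core_vars, extra_vars):
    """Split record indices into core, further, and residual strata (illustrative only)."""
    # Core stratum: unique on the core subset of identifying variables.
    core = unique_on(records, core_vars)
    # Further stratum: unique on the core subset plus at least one extra variable.
    # Assumption: records already in the core stratum are excluded so strata are disjoint.
    further = unique_on(records, core_vars + extra_vars) - core
    # Assumption: everything else forms a residual, lower-risk stratum.
    rest = set(range(len(records))) - core - further
    return {"core": core, "further": further, "non_unique": rest}

# Hypothetical micro data with three identifying variables.
records = [
    {"age": 34, "sex": "F", "zip": "27709"},
    {"age": 34, "sex": "F", "zip": "27514"},
    {"age": 51, "sex": "M", "zip": "27709"},
    {"age": 34, "sex": "M", "zip": "27709"},
    {"age": 34, "sex": "M", "zip": "27709"},
]
strata = partition_risk_strata(records, core_vars=["age", "sex"], extra_vars=["zip"])
print(strata)
```

The claim also requires that each risk stratum contain at least one record; this sketch does not enforce that, so a caller would need to drop or merge empty strata.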
19. A method of creating a disclosure-limited data file by substituting at least one data value in at least one record in a database comprising a plurality of records, comprising:
selecting a partner record for each record in the plurality of records;
partitioning the plurality of records into a plurality of risk strata based on a plurality of identifying variables;
determining a respective substitution probability for each risk stratum in the plurality of risk strata by minimizing a cost function for substitution subject to a bias constraint; and
replacing data associated with at least one of the plurality of identifying variables in each record in a sample of records selected from the plurality of records to create the disclosure-limited data file, wherein (1) the sample of records is chosen based on the respective substitution probabilities, and (2) the replaced data is obtained from the corresponding partner record.
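The replacement step of claim 19 can be sketched as follows. This is an illustration under stated assumptions, not the claimed method. The partner rule used here (the other record agreeing on the most matching variables) is hypothetical, since the claim does not say how partners are chosen. The per-stratum substitution probabilities are passed in as given values; the claimed optimization (minimizing a cost function subject to a bias constraint) is not implemented. All names and data are invented for the example.

```python
import random

def select_partner(i, records, match_vars):
    """Hypothetical partner rule: the other record agreeing on the most match_vars.
    Ties go to the lowest index because max() keeps the first maximal element."""
    def agreement(j):
        return sum(records[i][v] == records[j][v] for v in match_vars)
    return max((j for j in range(len(records)) if j != i), key=agreement)

def substitute(records, stratum_of, sub_prob, id_vars, match_vars, seed=0):
    """Return a modified copy of `records` plus the indices that were substituted."""
    rng = random.Random(seed)
    partners = [select_partner(i, records, match_vars) for i in range(len(records))]
    out = [dict(r) for r in records]  # leave the original database untouched
    substituted = []
    for i in range(len(records)):
        # Each record is sampled for substitution with its stratum's probability.
        if rng.random() < sub_prob[stratum_of[i]]:
            donor = records[partners[i]]
            for v in id_vars:
                out[i][v] = donor[v]  # replaced data comes from the partner record
            substituted.append(i)
    return out, substituted

# Hypothetical data: one record per index, with a precomputed stratum label.
records = [
    {"age": 34, "sex": "F", "zip": "27709"},
    {"age": 34, "sex": "F", "zip": "27514"},
    {"age": 51, "sex": "M", "zip": "27709"},
    {"age": 34, "sex": "M", "zip": "27709"},
    {"age": 34, "sex": "M", "zip": "27709"},
]
stratum_of = ["further", "further", "core", "non_unique", "non_unique"]
# Assumed probabilities, chosen as 0 or 1 so the example is deterministic.
sub_prob = {"core": 1.0, "further": 0.0, "non_unique": 0.0}

released, substituted = substitute(
    records, stratum_of, sub_prob, id_vars=["age"], match_vars=["sex", "zip"]
)
print(substituted, released[2])
```

In this example only the core-stratum record is substituted, and its `age` is taken from the partner that matches it on `sex` and `zip`.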
20. A method of creating a disclosure-limited data file by selecting a subsample of records from a database comprising a plurality of records, comprising:
partitioning the plurality of records into a plurality of risk strata based on a plurality of identifying variables;
determining a respective subsampling probability for each risk stratum in the plurality of risk strata by minimizing a cost function for subsampling subject to a variance constraint; and
selecting, from the plurality of records, the subsample of records based on the respective subsampling probabilities and the plurality of risk strata to create the disclosure-limited data file.
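To make the idea of "minimizing a cost function for subsampling subject to a variance constraint" concrete, the sketch below solves one stylized version of such a problem. The cost function and the variance proxy are assumptions chosen for illustration, not the forms used in the patent. The cost is taken as sum_h cost[h] * N[h] * p[h] (the expected disclosure cost of the records retained), and the constraint as sum_h N[h] * S2[h] * (1/p[h] - 1) <= V, with 0 < p[h] <= 1. The Lagrangian solution gives p[h] proportional to sqrt(S2[h] / cost[h]). Strata whose probability would exceed 1 are pinned at 1 and the rest are re-solved, as in the usual treatment of over-allocation in optimal allocation. All names and numbers are hypothetical, and the sketch assumes cost[h] > 0, S2[h] > 0 and V > 0.

```python
import math
import random

def optimal_subsampling_probs(N, S2, cost, V):
    """Minimise sum cost[h]*N[h]*p[h] s.t. sum N[h]*S2[h]*(1/p[h]-1) <= V, 0 < p[h] <= 1.
    Assumed stylised cost and variance forms; not taken from the patent."""
    a = {h: cost[h] * N[h] for h in N}   # cost coefficient per stratum
    b = {h: N[h] * S2[h] for h in N}     # variance coefficient per stratum
    fixed = set()                        # strata pinned at p = 1
    while True:
        free = [h for h in N if h not in fixed]
        # Pinned strata contribute no variance, so the free strata share all of V.
        budget = V + sum(b[h] for h in free)
        scale = sum(math.sqrt(a[h] * b[h]) for h in free) / budget
        p = {h: math.sqrt(b[h] / a[h]) * scale for h in free}
        over = [h for h in free if p[h] > 1.0]
        if not over:
            break
        fixed.update(over)
    p.update({h: 1.0 for h in fixed})
    return p

def draw_subsample(stratum_of, p, seed=0):
    """Keep each record independently with its stratum's subsampling probability."""
    rng = random.Random(seed)
    return [i for i, h in enumerate(stratum_of) if rng.random() < p[h]]

# Hypothetical inputs: stratum sizes, per-record variance terms, disclosure costs.
N = {"core": 100, "further": 400, "non_unique": 500}
S2 = {"core": 1.0, "further": 1.0, "non_unique": 1.0}
cost = {"core": 4.0, "further": 1.0, "non_unique": 0.25}
p = optimal_subsampling_probs(N, S2, cost, V=500.0)
achieved_variance = sum(N[h] * S2[h] * (1 / p[h] - 1) for h in N)
print(p, achieved_variance)
```

Under these assumed inputs the high-cost core stratum receives the smallest retention probability, the lowest-cost stratum is kept in full, and the variance constraint is met with equality.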
23. A computer-implemented system for creating a disclosure-limited data file by substituting at least one data value in at least one record in a database comprising a plurality of records, comprising:
a mechanism configured to select a partner record for each record in the plurality of records;
a mechanism configured to partition the plurality of records into a plurality of risk strata based on a plurality of identifying variables;
a mechanism configured to determine a respective substitution probability for each risk stratum in the plurality of risk strata by minimizing a cost function for substitution subject to a bias constraint; and
a mechanism configured to replace data associated with at least one of the plurality of identifying variables in each record in a sample of records selected from the plurality of records to create the disclosure-limited data file, wherein (1) the sample of records is chosen based on the respective substitution probabilities, and (2) the replaced data is obtained from the corresponding partner record.
24. A computer-implemented system for creating a disclosure-limited data file by selecting a subsample of records from a database comprising a plurality of records, comprising:
a mechanism configured to partition the plurality of records into a plurality of risk strata based on a plurality of identifying variables;
a mechanism configured to determine a respective subsampling probability for each risk stratum in the plurality of risk strata by minimizing a cost function for subsampling subject to a variance constraint; and
a mechanism configured to select, from the plurality of records, the subsample of records based on the respective subsampling probabilities and the plurality of risk strata to create the disclosure-limited data file.
Specification