Methods and apparatus for privacy preserving data mining using statistical condensing approach
Abstract
Methods and apparatus for generating at least one output data set from at least one input data set for use in association with a data mining process are provided. First, data statistics are constructed from the at least one input data set. Then, an output data set is generated from the data statistics. The output data set differs from the input data set but maintains one or more correlations from within the input data set. The correlations may be the inherent correlations between different dimensions of a multidimensional input data set. A significant amount of information from the input data set may be hidden so that the privacy level of the data mining process may be increased.
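The idea of regenerating data from statistics alone can be illustrated with a short sketch. It summarizes a set of records as first-order sums and second-order sums of pairwise products, derives a mean vector and covariance matrix from them, and draws new records with the same correlation structure. Gaussian sampling through a Cholesky factor is an assumption of this sketch, since the abstract does not prescribe a sampling scheme; the names `summarize`, `mean_and_cov`, `cholesky`, and `synthesize` are hypothetical.

```python
import math
import random


def summarize(records):
    """Return (count, first-order sums, second-order sums of pairwise products)."""
    d = len(records[0])
    first = [sum(r[j] for r in records) for j in range(d)]
    second = [[sum(r[i] * r[j] for r in records) for j in range(d)] for i in range(d)]
    return len(records), first, second


def mean_and_cov(n, first, second):
    """Derive the mean vector and population covariance from the sums."""
    d = len(first)
    mean = [s / n for s in first]
    cov = [[second[i][j] / n - mean[i] * mean[j] for j in range(d)] for i in range(d)]
    return mean, cov


def cholesky(a, jitter=1e-9):
    """Lower-triangular factor of a symmetric matrix (jitter guards the diagonal)."""
    d = len(a)
    low = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(max(s, 0.0) + jitter)
            else:
                low[i][j] = s / low[j][j]
    return low


def synthesize(n_out, mean, cov, rng):
    """Draw n_out records as mean + L z with z standard normal (an assumption)."""
    low = cholesky(cov)
    d = len(mean)
    out = []
    for _ in range(n_out):
        z = [rng.gauss(0.0, 1.0) for _ in range(d)]
        out.append([mean[i] + sum(low[i][k] * z[k] for k in range(i + 1)) for i in range(d)])
    return out
```

Only the counts and sums are needed to produce the output; the original records are not consulted once they have been summarized.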
15 Claims
1. A method of generating at least one output data set from at least one multidimensional static input data set for use in association with a data mining process, comprising the steps of:
generating data statistics from the at least one multidimensional static input data set in an iterative manner in accordance with one or more records from the at least one static input data set included in at least one condensed data group, and further comprising the steps of forming at least one condensed data group having a specific number of records from the static data set closest to a given record of the static data set, generating first order statistics and second order statistics for the at least one condensed data group, deleting records from the static data set that are included in the at least one condensed data group, determining if records remain in the static data set, and forming additional condensed data groups if records remain in the static data set;
generating the at least one output data set from the data statistics, wherein the output data set differs from the static input data set but maintains one or more correlations from within the static input data set; and
storing the at least one output data set in a storage device for use by a user.
Dependent claims: 2, 3, 4, 5
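The iterative grouping recited in claim 1 can be sketched as follows: pick a record, take the `k` records closest to it as a condensed group, record the group's first-order and second-order sums, delete those records, and repeat while records remain. This is a minimal sketch under stated assumptions, not the patented implementation: the claim does not say how the "given record" is chosen (a random pick is assumed here), squared Euclidean distance is assumed as the closeness measure, and the final group is simply allowed to hold fewer than `k` records. The names `sq_dist` and `condense_static` are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance (assumed closeness measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def condense_static(records, k, rng):
    """Form condensed groups of up to k records and keep only their statistics."""
    remaining = list(records)
    groups = []
    while remaining:  # form additional groups while records remain
        seed = rng.choice(remaining)  # assumption: the given record is picked at random
        remaining.sort(key=lambda r: sq_dist(r, seed))
        # The k closest records (the seed included) form the group and are
        # deleted from the remaining data set.
        members, remaining = remaining[:k], remaining[k:]
        d = len(seed)
        first = [sum(r[j] for r in members) for j in range(d)]
        second = [[sum(r[i] * r[j] for r in members) for j in range(d)] for i in range(d)]
        groups.append({"n": len(members), "first": first, "second": second})
    return groups
```

Each returned group holds a count and two sets of sums, which is enough to derive a mean vector and covariance matrix per group.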
6. Apparatus for generating at least one output data set from at least one multidimensional static input data set for use in association with a data mining process, the apparatus comprising:
a memory; and
at least one processor coupled to the memory operative to:
(i) generate data statistics from the at least one multidimensional static input data set in an iterative manner in accordance with one or more records from the at least one static input data set included in at least one condensed data group, and further comprising the steps of forming at least one condensed data group having a specific number of records from the static data set closest to a given record of the static data set, generating first order statistics and second order statistics for the at least one condensed data group, deleting records from the static data set that are included in the at least one condensed data group, determining if records remain in the static data set, and forming additional condensed data groups if records remain in the static data set;
(ii) generate the at least one output data set from the data statistics, wherein the output data set differs from the static input data set but maintains one or more correlations from within the static input data set; and
(iii) store the at least one output data set in a storage device for use by a user.
Dependent claims: 7, 8, 9, 10
11. A method for making a computer implemented process to enable generation of at least one output data set from at least one multidimensional static input data set for use in association with a data mining process, the method comprising the steps of:
instantiating first computer instructions onto a computer readable medium, the first computer instructions configured to generate data statistics from the at least one multidimensional static input data set in an iterative manner in accordance with one or more records from the at least one static input data set included in at least one condensed data group;
instantiating second computer instructions onto a computer readable medium, the second computer instructions configured to generate the at least one output data set from the data statistics, wherein the output data set differs from the static input data set but maintains one or more correlations from within the static input data set, and further comprising the steps of forming at least one condensed data group having a specific number of records from the static data set closest to a given record of the static data set, generating first order statistics and second order statistics for the at least one condensed data group, deleting records from the static data set that are included in the at least one condensed data group, determining if records remain in the static data set, and forming additional condensed data groups if records remain in the static data set; and
instantiating third computer instructions onto a computer readable medium, the third computer instructions configured to store the at least one output data set on a storage device for use by a user.
12. A method of generating at least one output data set from at least one multidimensional dynamic input data set for use in association with a data mining process, comprising the steps of:
generating data statistics from the at least one multidimensional dynamic input data set in an iterative manner in accordance with one or more records from the at least one dynamic input data set included in at least one condensed data group, and further comprising the steps of receiving a record from the dynamic data set, finding a closest condensed data group to add the record to or creating a condensed group having the record if the record is the first received from the dynamic data set, generating first order statistics and second order statistics for the closest condensed data group, determining if the number of records in the closest condensed data group is larger than an indistinguishability factor, and splitting the closest condensed data group into two groups and updating the first order statistics and second order statistics if the number of records in the closest condensed data group is larger than the indistinguishability factor;
generating the at least one output data set from the data statistics, wherein the output data set differs from the dynamic input data set but maintains one or more correlations from within the dynamic input data set; and
storing the at least one output data set in a storage device for use by a user.
Dependent claims: 13
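The record-at-a-time procedure recited in claim 12 can be sketched in the same spirit: the first record creates a group, each later record joins the group with the closest centroid, the group's sums are updated, and a group that grows larger than the indistinguishability factor `k` is split in two. This is a sketch under loudly stated assumptions. It keeps each group's member records in memory and splits at the median along the highest-variance dimension purely to keep the split simple; the claim does not specify the split rule, and a privacy-preserving implementation would presumably split from the statistics alone. All function names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance (assumed closeness measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def make_group(members):
    """Build a group with its first-order and second-order sums."""
    d = len(members[0])
    return {
        "members": list(members),  # kept only to simplify splitting (assumption)
        "first": [sum(r[j] for r in members) for j in range(d)],
        "second": [[sum(r[i] * r[j] for r in members) for j in range(d)] for i in range(d)],
    }


def add_record(group, r):
    """Update the sums incrementally for one new record."""
    group["members"].append(r)
    for i, ri in enumerate(r):
        group["first"][i] += ri
        for j, rj in enumerate(r):
            group["second"][i][j] += ri * rj


def centroid(group):
    n = len(group["members"])
    return [s / n for s in group["first"]]


def split(group):
    """Split at the median along the highest-variance dimension (assumed rule)."""
    n = len(group["members"])
    d = len(group["first"])
    var = [group["second"][j][j] / n - (group["first"][j] / n) ** 2 for j in range(d)]
    axis = max(range(d), key=lambda j: var[j])
    ordered = sorted(group["members"], key=lambda r: r[axis])
    half = n // 2
    return make_group(ordered[:half]), make_group(ordered[half:])


def condense_stream(stream, k):
    """Process records one at a time, splitting any group that exceeds k records."""
    groups = []
    for r in stream:
        if not groups:  # first record received: create a group holding it
            groups.append(make_group([r]))
            continue
        idx = min(range(len(groups)), key=lambda i: sq_dist(centroid(groups[i]), r))
        add_record(groups[idx], r)
        if len(groups[idx]["members"]) > k:
            groups[idx:idx + 1] = split(groups[idx])  # statistics recomputed per half
    return groups
```

Following the wording of the claim, a split is triggered as soon as a group holds more than `k` records, so the two resulting groups each hold fewer than `k` records until further records arrive.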
14. Apparatus for generating at least one output data set from at least one multidimensional dynamic input data set for use in association with a data mining process, the apparatus comprising:
a memory; and
at least one processor coupled to the memory operative to:
(i) generate data statistics from the at least one multidimensional dynamic input data set in an iterative manner in accordance with one or more records from the at least one dynamic input data set included in at least one condensed data group, and further comprising the steps of receiving a record from the dynamic data set, finding a closest condensed data group to add the record to or creating a condensed group having the record if the record is the first received from the dynamic data set, generating first order statistics and second order statistics for the closest condensed data group, determining if the number of records in the closest condensed data group is larger than an indistinguishability factor, and splitting the closest condensed data group into two groups and updating the first order statistics and second order statistics if the number of records in the closest condensed data group is larger than the indistinguishability factor;
(ii) generate the at least one output data set from the data statistics, wherein the output data set differs from the dynamic input data set but maintains one or more correlations from within the dynamic input data set; and
(iii) store the at least one output data set in a storage device for use by a user.
Dependent claims: 15
Specification