Privacy Preserving Statistical Analysis on Distributed Databases

First Claim
1. A method for securely determining aggregate statistics on private data, comprising the steps of:
sampling, at one or more clients, data X^n and Y^n to obtain sampled data X̃^m and Ỹ^m, wherein m is a sampling parameter substantially smaller than a length n of the data;
encrypting the sampled data X̃^m and Ỹ^m to obtain encrypted data X̌^m and Y̌^m;
combining the encrypted data X̌^m and Y̌^m to obtain combined encrypted data;
randomizing the combined encrypted data to obtain randomized data X̄^m, Ȳ^m;
estimating, at an authorized third-party processor, a joint distribution T̂_{X^n,Y^n} of the data X^n and Y^n from the randomized encrypted data X̄^m, Ȳ^m, such that a differential privacy requirement on the data X^n and Y^n is satisfied.
Abstract
Aggregate statistics are securely determined on private data by first sampling independent first and second data at one or more clients to obtain sampled data, wherein a sampling parameter is substantially smaller than a length of the data. The sampled data are encrypted to obtain encrypted data, which are then combined. The combined encrypted data are randomized to obtain randomized data. At an authorized third-party processor, a joint distribution of the first and second data is estimated from the randomized encrypted data, such that a differential privacy requirement of the first and second data is satisfied.
42 Citations
PRIVACY MEASUREMENT AND QUANTIFICATION  
Patent #
US 20150261959A1
Filed 02/20/2015

Current Assignee
TATA Consultancy Services Limited

Sponsoring Entity
TATA Consultancy Services Limited

Privacy measurement and quantification  
Patent #
US 9,547,768 B2
Filed 02/20/2015

Current Assignee
TATA Consultancy Services Limited

Sponsoring Entity
TATA Consultancy Services Limited

PRIVACY-AWARE QUERY MANAGEMENT SYSTEM
Patent #
US 20170169253A1
Filed 12/10/2015

Current Assignee
Neustar Incorporated

Sponsoring Entity
Neustar Incorporated

Emoji frequency detection and deep link frequency  
Patent #
US 9,705,908 B1
Filed 09/24/2016

Current Assignee
Apple Inc.

Sponsoring Entity
Apple Inc.

Emoji frequency detection and deep link frequency  
Patent #
US 9,712,550 B1
Filed 09/24/2016

Current Assignee
Apple Inc.

Sponsoring Entity
Apple Inc.

Emoji frequency detection and deep link frequency  
Patent #
US 9,894,089 B2
Filed 06/19/2017

Current Assignee
Apple Inc.

Sponsoring Entity
Apple Inc.

Information determination apparatus, information determination method and recording medium  
Patent #
US 9,959,427 B2
Filed 04/14/2015

Current Assignee
NEC Corporation

Sponsoring Entity
NEC Corporation

DIFFERENTIAL PRIVACY WITH CLOUD DATA  
Patent #
US 20180198602A1
Filed 11/16/2017

Current Assignee
Apple Inc.

Sponsoring Entity
Apple Inc.

Privacy-aware query management system
Patent #
US 10,108,818 B2
Filed 12/10/2015

Current Assignee
Neustar Incorporated

Sponsoring Entity
Neustar Incorporated

Learning new words  
Patent #
US 10,133,725 B2
Filed 04/03/2017

Current Assignee
Apple Inc.

Sponsoring Entity
Apple Inc.

Emoji frequency detection and deep link frequency  
Patent #
US 10,154,054 B2
Filed 06/30/2017

Current Assignee
Apple Inc.

Sponsoring Entity
Apple Inc.

Differentially private processing and database storage  
Patent #
US 10,192,069 B2
Filed 07/07/2016

Current Assignee
LeapYear Technologies Inc.

Sponsoring Entity
LeapYear Technologies Inc.

Differentially private processing and database storage  
Patent #
US 10,229,287 B2
Filed 10/25/2017

Current Assignee
LeapYear Technologies Inc.

Sponsoring Entity
LeapYear Technologies Inc.

Efficient implementation for differential privacy using cryptographic functions  
Patent #
US 10,229,282 B2
Filed 09/23/2016

Current Assignee
Apple Inc.

Sponsoring Entity
Apple Inc.

Differentially private processing and database storage  
Patent #
US 10,242,224 B2
Filed 10/25/2017

Current Assignee
LeapYear Technologies Inc.

Sponsoring Entity
LeapYear Technologies Inc.

System and method for privacy management of infinite data streams  
Patent #
US 10,366,249 B2
Filed 10/14/2016

Current Assignee
Samsung Electronics Co. Ltd.

Sponsoring Entity
Samsung Electronics Co. Ltd.

Tracking privacy budget with distributed ledger  
Patent #
US 10,380,366 B2
Filed 04/25/2017

Current Assignee
SAP SE

Sponsoring Entity
SAP SE

Differentially private database permissions system  
Patent #
US 10,430,605 B1
Filed 11/29/2018

Current Assignee
LeapYear Technologies Inc.

Sponsoring Entity
LeapYear Technologies Inc.

Differential privacy and outlier detection within a non-interactive model
Patent #
US 10,445,527 B2
Filed 12/21/2016

Current Assignee
SAP SE

Sponsoring Entity
SAP SE

Emoji frequency detection and deep link frequency  
Patent #
US 10,454,962 B2
Filed 10/12/2018

Current Assignee
Apple Inc.

Sponsoring Entity
Apple Inc.

Differentially private database queries involving rank statistics  
Patent #
US 10,467,234 B2
Filed 07/19/2018

Current Assignee
LeapYear Technologies Inc.

Sponsoring Entity
LeapYear Technologies Inc.

Differentially private density plots  
Patent #
US 10,489,605 B2
Filed 04/23/2018

Current Assignee
LeapYear Technologies Inc.

Sponsoring Entity
LeapYear Technologies Inc.

Efficient implementation for differential privacy using cryptographic functions  
Patent #
US 10,552,631 B2
Filed 03/08/2019

Current Assignee
Apple Inc.

Sponsoring Entity
Apple Inc.

Differentially private processing and database storage  
Patent #
US 10,586,068 B2
Filed 01/02/2019

Current Assignee
LeapYear Technologies Inc.

Sponsoring Entity
LeapYear Technologies Inc.

User experience using privatized crowdsourced data  
Patent #
US 10,599,867 B2
Filed 11/07/2017

Current Assignee
Apple Inc.

Sponsoring Entity
Apple Inc.

User experience using privatized crowdsourced data  
Patent #
US 10,599,868 B2
Filed 11/07/2017

Current Assignee
Apple Inc.

Sponsoring Entity
Apple Inc.

Differentially private budget tracking using Renyi divergence  
Patent #
US 10,642,847 B1
Filed 05/09/2019

Current Assignee
LeapYear Technologies Inc.

Sponsoring Entity
LeapYear Technologies Inc.

Learning new words  
Patent #
US 10,701,042 B2
Filed 10/12/2018

Current Assignee
Apple Inc.

Sponsoring Entity
Apple Inc.

Differential privacy using a multi-bit histogram
Patent #
US 10,726,139 B2
Filed 09/30/2017

Current Assignee
Apple Inc.

Sponsoring Entity
Apple Inc.

Differentially private machine learning using a random forest classifier  
Patent #
US 10,726,153 B2
Filed 09/27/2018

Current Assignee
LeapYear Technologies Inc.

Sponsoring Entity
LeapYear Technologies Inc.

DATA OBFUSCATION SYSTEM, METHOD, AND COMPUTER IMPLEMENTATION OF DATA OBFUSCATION FOR SECRET DATABASES  
Patent #
US 20110179011A1
Filed 05/12/2009

Current Assignee
New BIS Safe Luxco S.A R.L

Sponsoring Entity
New BIS Safe Luxco S.A R.L

PRIVACY-PRESERVING AGGREGATION OF TIME-SERIES DATA
Patent #
US 20120204026A1
Filed 02/04/2011

Current Assignee
Palo Alto Research Center Inc.

Sponsoring Entity
Palo Alto Research Center Inc.

Privacy-Preserving Probabilistic Inference Based on Hidden Markov Models
Patent #
US 20120254606A1
Filed 03/31/2011

Current Assignee
Mitsubishi Electric Research Laboratories

Sponsoring Entity
Mitsubishi Electric Research Laboratories

Privacy-Preserving Probabilistic Inference Based on Hidden Markov Models
Patent #
US 20120254605A1
Filed 03/30/2011

Current Assignee
Mitsubishi Electric Research Laboratories

Sponsoring Entity
Mitsubishi Electric Research Laboratories

Privacy-Preserving Probabilistic Inference Based on Hidden Markov Models
Patent #
US 20120254612A1
Filed 03/30/2011

Current Assignee
Mitsubishi Electric Research Laboratories

Sponsoring Entity
Mitsubishi Electric Research Laboratories

METHODS AND APPARATUS TO ANONYMIZE A DATASET OF SPATIAL DATA  
Patent #
US 20130145473A1
Filed 12/05/2011

Current Assignee
AT&T Intellectual Property I LP (A Nevada Limited Partnership)

Sponsoring Entity
Entong Shen, Divesh Srivastava, Graham R. Cormode, Cecilia M. Procopiuc

PRIVATE DECAYED SUM ESTIMATION UNDER CONTINUAL OBSERVATION  
Patent #
US 20130212690A1
Filed 08/14/2012

Current Assignee
Thomson Licensing

Sponsoring Entity
Thomson Licensing

Privacy-preserving aggregation of Time-series data
Patent #
US 8,555,400 B2
Filed 02/04/2011

Current Assignee
Palo Alto Research Center Inc.

Sponsoring Entity
Palo Alto Research Center Inc.

Privacy preserving statistical analysis for distributed databases  
Patent #
US 8,893,292 B2
Filed 11/14/2012

Current Assignee
Mitsubishi Electric Research Laboratories

Sponsoring Entity
Mitsubishi Electric Research Laboratories

Method And Apparatus For Privacy-Preserving Data Mapping Under A Privacy-Accuracy Trade-Off
Patent #
US 20150235051A1
Filed 08/19/2013

Current Assignee
Thomson Licensing

Sponsoring Entity
Thomson Licensing

METHOD AND APPARATUS FOR NEARLY OPTIMAL PRIVATE CONVOLUTION  
Patent #
US 20150286827A1
Filed 11/27/2013

Current Assignee
Nadia Fawaz, Aleksandar Todorov Nikolov, Thomson Licensing

Sponsoring Entity
Nadia Fawaz, Aleksandar Todorov Nikolov, Thomson Licensing

PRIVACY-ENHANCING TECHNOLOGIES FOR MEDICAL TESTS USING GENOMIC DATA
Patent #
US 20160224735A1
Filed 01/12/2016

Current Assignee
École Polytechnique Fédérale de Lausanne (EPFL)

Sponsoring Entity
Amalio Telenti, Jean Pierre Hubaux, Erman Ayday, Jean Louis Raisaro, Paul Jack Mclaren, Mathias Humbert, Jacques Rougemont, Jacques Fellay

12 Claims
 1. A method for securely determining aggregate statistics on private data, comprising the steps of:
sampling, at one or more clients, data X^n and Y^n to obtain sampled data X̃^m and Ỹ^m, wherein m is a sampling parameter substantially smaller than a length n of the data; encrypting the sampled data X̃^m and Ỹ^m to obtain encrypted data X̌^m and Y̌^m; combining the encrypted data X̌^m and Y̌^m to obtain combined encrypted data; randomizing the combined encrypted data to obtain randomized data X̄^m, Ȳ^m; estimating, at an authorized third-party processor, a joint distribution T̂_{X^n,Y^n} of the data X^n and Y^n from the randomized encrypted data X̄^m, Ȳ^m, such that a differential privacy requirement on the data X^n and Y^n is satisfied. View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 11, 12)
 10. A method for securely determining aggregate statistics on private data, comprising the steps of:
sampling, at one or more client processors, first data and second data to obtain sampled data, wherein a sampling parameter is substantially smaller than a length of the data; encrypting the sampled data to obtain encrypted data; combining the encrypted data to obtain combined encrypted data; randomizing the combined encrypted data to obtain randomized data; estimating, at an authorized third-party processor, a joint distribution of the first data and the second data from the randomized encrypted data such that a differential privacy requirement of the first data and the second data is satisfied.
Specification
U.S. patent application Ser. No. 13/676,528, "Privacy Preserving Statistical Analysis for Distributed Databases," filed Nov. 14, 2012 by Wang et al., is incorporated herein by reference. That application determines aggregate statistics by randomizing data from two sources for a server, which computes an empirical distribution, specifically a conditional and joint distribution of the data, while maintaining privacy of the data.
This invention relates generally to secure computing, and more particularly to performing secure aggregate statistical analysis on private data by a third party.
One of the most noticeable technological trends is the emergence and proliferation of largescale distributed databases. Public and private enterprises are collecting tremendous amounts of data on individuals, their activities, their preferences, their locations, spending habits, medical and financial histories, and so on. These enterprises include government organizations, health providers, financial institutions, Internet search engines, social networks, cloud service providers, and many others. Naturally, interested parties could potentially discern meaningful patterns and gain valuable insights if they were able to access and correlate the information across the databases.
For example, a researcher may want to determine the correlations between individual income and personal characteristics such as gender, race, age, education, etc., or a medical researcher may want to study the relationships between disease prevalence and individual environmental factors. In such applications, it is imperative to maintain the privacy of individuals, while ensuring that the useful aggregate statistical information is revealed only to the authorized parties. Indeed, unless the public is satisfied that their privacy is being preserved, they would not provide their consent for the collection and use of their personal information. Additionally, the inherent distribution of this data across multiple databases presents a significant challenge, as privacy concerns and policy would likely prevent direct sharing of data to facilitate statistical analysis in a centralized fashion. Thus, tools must be developed for performing statistical analysis on large and distributed databases, while addressing these privacy and policy concerns.
It is known that conventional mechanisms for privacy, such as k-anonymization, do not provide adequate privacy. Specifically, an informed adversary can link an arbitrary amount of side information to an anonymized database and defeat the anonymization mechanism. In response to the vulnerabilities of simple anonymization mechanisms, a stricter notion of privacy, known as differential privacy, has been developed. Informally, differential privacy ensures that the result of a function computed on a database of respondents is almost insensitive to the presence or absence of a particular respondent. More formally, when the function is evaluated on databases differing in only one respondent, the probability of outputting the same result is almost unchanged.
Mechanisms that provide differential privacy typically involve output perturbation, e.g., when Laplacian noise is added to the result of a function computed on a database, the noise provides differential privacy to the individual respondents in the database. Nevertheless, it can be shown that input perturbation approaches, such as the randomized response mechanism, where noise is added to the data, also provide differential privacy to the respondents.
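As a concrete illustration of input perturbation, the following minimal randomized-response sketch reports a binary value truthfully with probability γ/(γ+1) and flips it otherwise, which bounds the likelihood ratio of any report by γ and hence gives each respondent ln(γ)-differential privacy. The function names and the binary alphabet are illustrative assumptions, not from the patent:

```python
import random

def randomized_response(bit, gamma=3.0):
    # Report the true bit with probability gamma/(gamma+1), otherwise flip it.
    # Any output is at most gamma times likelier under one input than under
    # the other, so each respondent gets ln(gamma)-differential privacy.
    if random.random() < gamma / (gamma + 1.0):
        return bit
    return 1 - bit

def debiased_mean(reports, gamma=3.0):
    # Invert the known flip rate to recover an unbiased estimate of the
    # true mean from the noisy reports.
    p = gamma / (gamma + 1.0)
    raw = sum(reports) / len(reports)
    return (raw - (1.0 - p)) / (2.0 * p - 1.0)

random.seed(0)
reports = [randomized_response(1) for _ in range(20000)]
estimate = debiased_mean(reports)  # close to the true mean of 1.0
```

The debiasing step is what preserves statistical utility: individual reports are noisy, but the aggregate can be recovered because the perturbation channel is known.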
It is desired to protect the privacy of individual respondents in a database, to prevent unauthorized parties from computing joint or marginal empirical probability distributions of the data, and to achieve a superior tradeoff between privacy and utility compared to simply performing post randomization (PRAM) on the database.
Sampling can be used for crowd-blending privacy. This is a strictly relaxed version of differential privacy, but it is known that a pre-sampling step applied to a crowd-blending privacy mechanism can achieve a desired amount of differential privacy.
The related application Ser. No. 13/676,528 first independently randomizes data X and Y to obtain randomized data X̂ and Ŷ. The first randomizing preserves the privacy of the data X and Y. Then, the randomized data X̂ and Ŷ are randomized a second time to obtain randomized data X̃ and Ỹ for a server, and helper information T_{X̃|X̂} and T_{Ỹ|Ŷ} for a client, where T represents an empirical distribution, and where the second randomizing preserves the privacy of the aggregate statistics of the data X and Y. The server then determines statistics T_{X̃,Ỹ}. Last, the client applies the helper information T_{X̃|X̂} and T_{Ỹ|Ŷ} to T_{X̃,Ỹ} to obtain an estimate Ṫ_{X,Y}, wherein "|" and "," between X and Y represent a conditional and a joint distribution, respectively.
The embodiments of the invention provide a method for obtaining aggregate statistics from client data stored at a server, while preserving the privacy of data. The data are vertically partitioned.
The method uses a data release mechanism based on post randomization (PRAM), encryption and random sampling to maintain the privacy, while allowing the server to perform an accurate statistical analysis of the data. The encryption ensures that the server obtains no information about the client data, while the PRAM and sampling ensures individual privacy is maintained against third parties receiving the statistics.
Means are provided to characterize how much the composition of random sampling with PRAM increases the differential privacy compared to using PRAM alone. The statistical utility of the method is analyzed by bounding the estimation error, i.e., the expected ℓ_2-norm error between a true empirical distribution and an estimated distribution, as a function of the number of samples, the PRAM noise, and other parameters.
The analysis indicates a tradeoff between increasing PRAM noise versus decreasing the number of samples to maintain a desired level of privacy ε. An optimal number of samples m* that balances this tradeoff and maximizes the utility is determined.
Our mechanism maintains the privacy of client data stored and analyzed by a server. We achieve this with a privacy mechanism involving sampling and post randomization (PRAM), which is a generalization of randomized response. The mechanism prevents unauthorized parties from computing the joint or marginal empirical probability distributions of the data. We achieve this using random encrypting pads, which can only be reversed by authorized parties. The mechanism achieves a superior tradeoff between privacy and utility compared to simply performing PRAM on the data.
Sampling the client data enhances privacy with respect to the individual respondents, while retaining the utility provided to an authorized third party interested in the joint and marginal empirical probability distributions.
We enhance differential privacy by sampling data uniformly at a fixed rate without replacement. We formulate privacy using the definition of differential privacy wherein neighboring (adjacent) databases differ by a single replacement rather than by a single addition or deletion. Furthermore, we consider vertically partitioned distributed databases, which are maintained by mutually untrusting clients. To compute joint statistics, a join operation is required on the databases, which implies that individual clients cannot independently blend their respondents without altering the joint statistics across all databases.
The databases are sanitized and combined 110, and stored at a server 120 so that an authorized third party can analyze, e.g., salaries in the population, while ensuring that the privacy of the individual respondents is maintained. It is understood that other data can be maintained by the clients.
Conceptually, a data release mechanism involves the "sanitization" of the data, via some form of perturbation or transform, to preserve individual privacy before the data are made available to the server. The suitability of the method used to sanitize the data is determined by the extent to which rigorously defined privacy constraints are met.
The embodiments of our invention provide a method for obtaining aggregate statistics from the data stored at a server, while preserving the privacy of data.
Problem Formulation
In this section, we describe the general setup. The clients, Alice and Bob, wish to release their data to enable privacy-preserving data analysis by an authorized third party. For simplicity, we describe our problem formulation and results for two clients. However, our methods can be generalized to more than two clients.
It should also be understood that the third party can be an authorized third party accessing a third-party processor. That is, the authorized third party directs the processor to perform certain functions. The third party also supplies the appropriate keys to the processor so that the processor can perform decryption as necessary. Therefore, the terms authorized third party and authorized third-party processor as used herein are essentially synonymous. Those of ordinary skill in the art would understand that the actual computations are performed by the processor under the direction of the third party.
Consider a data mining application in which Alice and Bob are mutually untrusting clients. The two databases are to be made available for third-party analysis with authorization granted by the clients, such that statistical measures can be computed either on the individual databases, or on some combination of the two databases. The clients should have flexible access control over the data. For example, if the third party is granted access by Alice but not by Bob, then the third party can only access Alice's data. In addition, the server should only host the data and not be able to access the underlying information. Therefore, the data are sanitized before being released, to protect individual privacy.
We have the following privacy and utility requirements.
Database Security:
Only third parties authorized by the clients can extract statistical information from the database.
Respondent Privacy:
Individual privacy of the respondents related to the data is maintained against the server as well as the third parties.
Statistical Utility:
The authorized third party, i.e., one possessing appropriate keys, can compute joint and marginal distributions of the data provided by the clients.
Complexity:
The overall communication and computation requirements of the system should be reasonable.
In the following sections, we describe our system framework and formalize the notions of privacy and utility.
Type and Matrix Notation
The type, or empirical distribution, of a sequence X^n is defined as a mapping T_{X^n}: X → [0,1] given by
T_{X^n}(x) := (1/n) |{i : X_i = x}|.
A joint type of two sequences X^n and Y^n is defined as a mapping T_{X^n,Y^n}: X × Y → [0,1] given by
T_{X^n,Y^n}(x, y) := (1/n) |{i : (X_i, Y_i) = (x, y)}|.
For notational convenience, when working with finite-domain type/distribution functions, we drop the arguments and use these functions as vectors and matrices. For example, we can represent a distribution function P_X: X → [0,1] as the |X| × 1 column vector P_X, with values arranged according to a fixed consistent ordering of X. Thus, with a slight abuse of notation, using the values of X to index the vector, the x-th element of the vector, P_X[x], is given by P_X(x). Similarly, a conditional distribution function P_{Y|X}: Y × X → [0,1] can be represented as a |Y| × |X| matrix P_{Y|X}, defined by P_{Y|X}[y, x] := P_{Y|X}(y|x). For example, by using this notation, the elementary probability identity
P_Y(y) = Σ_x P_{Y|X}(y|x) P_X(x)
can be written in matrix form as P_Y = P_{Y|X} P_X.
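To make the notation concrete, here is a small sketch of empirical types in Python; the alphabets and sample sequences are illustrative assumptions, and the marginal computed at the end is the vector identity P_Y = P_{Y|X} P_X specialized to empirical types:

```python
from collections import Counter

X_ALPHABET = ["a", "b"]
Y_ALPHABET = [0, 1]

def joint_type(xs, ys):
    # Empirical joint distribution T_{X^n,Y^n}, stored as a dict over X x Y;
    # each entry is the fraction of respondents with that (x, y) pair.
    n = len(xs)
    counts = Counter(zip(xs, ys))
    return {(x, y): counts[(x, y)] / n for x in X_ALPHABET for y in Y_ALPHABET}

xs = ["a", "a", "b", "a"]
ys = [0, 1, 1, 1]
T = joint_type(xs, ys)

# Marginal type of Y, obtained by summing the joint type over x.
T_Y = {y: sum(T[(x, y)] for x in X_ALPHABET) for y in Y_ALPHABET}
```

Indexing dicts by alphabet symbols mirrors the document's "abuse of notation" of indexing vectors by the values of X.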
System Framework
Database Model:
The data 101 maintained by Alice is modeled as a sequence X^n := (X_1, X_2, . . . , X_n), with each X_i taking values in a finite alphabet X. Likewise, Bob's data 102 are modeled as a sequence of random variables Y^n := (Y_1, Y_2, . . . , Y_n), with each Y_i taking values in a finite alphabet Y. The length n of the sequences represents the total number of respondents in the database, and each (X_i, Y_i) pair represents the data of respondent i collectively maintained by Alice and Bob, with the alphabet X × Y representing the domain of the data of each respondent.
Data Processing and Release:
Each client applies a data release mechanism to their respective data to produce an encryption of the data for the server and a decryption key for the third party. These mechanisms are denoted by the randomized mappings F_A: X^n → O_A × K_A and F_B: Y^n → O_B × K_B, where K_A 201 and K_B 202 are suitable key spaces, and O_A 203 and O_B 204 are suitable encryption spaces. The encryptions and keys are produced and given by
(O_A, K_A) := F_A(X^n),
(O_B, K_B) := F_B(Y^n).
The encryptions O_A and O_B are passed to the server, which performs processing, and the keys K_A and K_B are sent to the third party 225. The server processes O_A and O_B, producing an output O via a random mapping M: O_A × O_B → O 220, as given by
O := M(O_A, O_B).
Statistical Recovery:
To enable the recovery of the statistics of the database, the processed output O is provided to the third party via the server, and the encryption keys K_A and K_B are provided to the third party by the clients. The third party produces an estimate of the joint type (empirical distribution) of Alice and Bob's data sequences, denoted by T̂_{X^n,Y^n} 250, as a function of O, K_A, and K_B, as given by
T̂_{X^n,Y^n} := g(O, K_A, K_B),
where g: O × K_A × K_B → [0,1]^{X×Y} is a reconstruction function.
The object is to design a system within the above framework, by specifying the mappings F_{A}, F_{B}, M, and g, that optimize the system performance requirements, which are formulated below.
Privacy and Utility Conditions
We now formulate the privacy and utility requirements for our framework.
Privacy Against the Server:
During the course of system operation, the clients do not want to reveal any information about their data to the server, not even aggregate statistics. A strong statistical condition that guarantees this security is the requirement of statistical independence between the data and the encrypted versions stored at the server. The statistical requirement of independence guarantees security even against an adversarial server with unbounded resources, and does not require any unproven assumptions.
Respondent Privacy:
The data should be kept private from all other parties, including any authorized third parties who attempt to recover the statistics. We formalize this notion using ε-differential privacy for the respondents as follows:
Given the above framework, the system provides ε-differential privacy if, for all data (x^n, y^n) and (ẋ^n, ẏ^n) in X^n × Y^n within Hamming distance d_H((x^n, y^n), (ẋ^n, ẏ^n)) ≤ 1, and all S ⊆ O × K_A × K_B,
Pr[(O, K_A, K_B) ∈ S | (X^n, Y^n) = (x^n, y^n)] ≤ e^ε Pr[(O, K_A, K_B) ∈ S | (X^n, Y^n) = (ẋ^n, ẏ^n)].
This rigorous definition of privacy satisfies the privacy axioms. Under the assumption that the data are i.i.d., this definition results in a strong privacy guarantee. An attacker with knowledge of all except one of the respondents cannot recover the data of the sole missing respondent.
Utility for Authorized Third Parties:
The utility of the estimate is measured by the expected ℓ_2-norm error of the estimated type vector, given by
E‖T̂_{X^n,Y^n} − T_{X^n,Y^n}‖_2,
with the goal being the minimization of this error.
System Complexity:
The communication and computational complexity of the system are also of concern. The computational complexity can be represented by the complexity of implementing the mappings (F_A, F_B, M, and g) that specify a given system. Ideally, we try to minimize the computational complexity of all of these mappings, simplifying the operations that each party performs. The communication requirements are given by the cardinalities of the symbol alphabets (O_A, O_B, K_A, K_B, and O). The logarithms of these alphabet sizes indicate the sufficient length for the messages that must be transmitted in our system.
Proposed System and Analysis
We describe the details of our system, and analyze its privacy and utility performance. First, we describe how our system utilizes sampling and additive encryption, enabling the server to join and perturb encrypted data in order to facilitate the release of sanitized data to the third party. Next, we analyze the privacy of our system and show that our sampling enhances privacy, thereby reducing the amount of noise that is added during the perturbation step to obtain a desired level of privacy. Finally, we analyze the accuracy of the joint type reconstruction, producing a bound on the utility as a function of the system parameters, that is to say, the noise added during perturbation, and the sampling factor.
System Architecture
1. Sampling: The clients respectively randomly sample 205 their data X^n and Y^n, producing shortened sequences X̃^m and Ỹ^m.
2. Encryption: The clients encrypt 210 the shortened sequence before passing the shortened sequences to the server.
3. Perturbation: The third party processor combines and perturbs 220 the encrypted sequences.
4. Release: The third party processor 225 obtains the sanitized data from the server and the encryption keys from the clients, allowing the approximate recovery of data statistics T_{X, Y }230.
It is assumed that the clients, server and third party each have at least one processor to perform the steps outlined above. The processors are connected to memories, e.g., databases, and input/output interfaces for communicating with each other.
A key aspect of these steps is that the encryption and perturbation schemes commute, thus allowing the server to essentially perform perturbation on the encrypted sequences, and the authorized third party to subsequently decrypt the perturbed data. In this section, we describe the details of each step from a theoretical perspective by applying mathematical abstractions and assumptions. Later, we describe practical implementations towards realizing our system.
Another aspect of the invention is that the sequence of sampling, randomization and encryption can be altered so long as the encryption parameters are determined by the client processors and the decryption parameters are provided to the authorized thirdparty processor.
Sampling:
The clients sample 205 the data to reduce the length-n database sequences (X^n, Y^n) to m randomly drawn samples, where m is substantially smaller than n. The samples are drawn uniformly without replacement, and both clients sample at the same locations. We let
(X̃^m, Ỹ^m) := (X̃_1, . . . , X̃_m, Ỹ_1, . . . , Ỹ_m)
denote the intermediate result after sampling. Mathematically, the sampling result is described by
(X̃_i, Ỹ_i) = (X_{I_i}, Y_{I_i}) for all i in {1, . . . , m},
where I_1, . . . , I_m are drawn uniformly without replacement from {1, . . . , n}.
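The common-location sampling can be sketched as follows; using a shared seed to model the agreed index sequence I_1, . . . , I_m is an assumption of this sketch, not a mechanism specified in the text:

```python
import random

def common_sample(xs, ys, m, seed=7):
    # Both clients draw the same m locations uniformly without replacement,
    # so the sampled (x, y) pairs stay aligned across the two databases.
    assert len(xs) == len(ys) and m <= len(xs)
    rng = random.Random(seed)  # shared randomness models the common I_1..I_m
    locations = rng.sample(range(len(xs)), m)
    return [xs[i] for i in locations], [ys[i] for i in locations]

alice = list(range(10))          # X^n
bob = [10 * x for x in alice]    # Y^n, aligned respondent by respondent
sx, sy = common_sample(alice, bob, 4)
```

Keeping the locations identical is what makes the later join meaningful: each sampled pair still belongs to a single respondent.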
Encryption:
The clients independently encrypt 210 the sampled data sequences with an additive (one-time pad) encryption scheme. Alice uses an independent uniform key sequence V^m ∈ X^m 301, and produces the encrypted sequence
X̌^m := X̃^m ⊕ V^m := (X̃_1 + V_1, . . . , X̃_m + V_m),
where ⊕ denotes addition applied element-by-element over the sequences. The addition operation can be any suitably defined group addition operation over the finite set.
Similarly, Bob encrypts his data with the independent uniform key sequence W^m ∈ Y^m 302, to produce the encrypted sequence
Y̌^m := Ỹ^m ⊕ W^m := (Ỹ_1 + W_1, . . . , Ỹ_m + W_m).
Alice and Bob send these encrypted sequences to the server, and provide the keys to the third party to enable data release.
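A minimal sketch of the additive one-time pad over a finite alphabet, using addition modulo the alphabet size as the group operation (an assumption of this sketch; any group addition over the finite set works):

```python
import random

def otp_encrypt(seq, key, size):
    # Element-wise group addition of data and uniform pad, modulo |alphabet|.
    return [(s + k) % size for s, k in zip(seq, key)]

def otp_decrypt(enc, key, size):
    # Subtracting the pad inverts the group addition exactly.
    return [(c - k) % size for c, k in zip(enc, key)]

SIZE = 3
rng = random.Random(0)
data = [2, 0, 1, 2]
pad = [rng.randrange(SIZE) for _ in data]  # independent uniform key sequence
cipher = otp_encrypt(data, pad, SIZE)
recovered = otp_decrypt(cipher, pad, SIZE)
```

Because the pad is uniform and independent of the data, the ciphertext is statistically independent of the plaintext, which is exactly the "privacy against the server" condition above.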
Perturbation:
The server joins the encrypted data sequences, forming $((\check{X}_1, \check{Y}_1), \ldots, (\check{X}_m, \check{Y}_m))$, and perturbs 220 the sequences by applying an independent PRAM mechanism, producing the perturbed results $(\bar{X}^m, \bar{Y}^m)$.
Using the matrix $A := P_{\bar{X},\bar{Y} \mid \check{X},\check{Y}}$ to represent the conditional distribution of the PRAM mechanism, this operation can be represented by
$$\Pr[(\bar{X}_i, \bar{Y}_i) = (a,b) \mid (\check{X}_i, \check{Y}_i) = (c,d)] = A_{(a,b),(c,d)},$$
independently for each $i \in \{1, \ldots, m\}$.
By design, we specify that $A$ is a $\gamma$-diagonal matrix, for a parameter $\gamma > 1$, given by
$$A_{(a,b),(a,b)} = \frac{\gamma}{q}$$
and
$$A_{(a,b),(c,d)} = \frac{1}{q} \quad \text{for } (a,b) \neq (c,d),$$
where $q := \gamma + |\mathcal{X}||\mathcal{Y}| - 1$ is a normalizing constant.
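A γ-diagonal PRAM channel can be sketched as follows (illustrative Python; the joint alphabet $\mathcal{X} \times \mathcal{Y}$ is flattened to integers 0..k-1, and the helper names are ours):

```python
import random

def gamma_diagonal(k, gamma):
    """k x k gamma-diagonal matrix: gamma/q on the diagonal and 1/q
    elsewhere, with q = gamma + k - 1, so every column sums to 1."""
    q = gamma + k - 1
    return [[(gamma if i == j else 1.0) / q for j in range(k)] for i in range(k)]

def pram(seq, A, rng=random):
    """Perturb each symbol independently: symbol s is replaced by i
    with probability A[i][s]."""
    k = len(A)
    return [rng.choices(range(k), weights=[A[i][s] for i in range(k)])[0]
            for s in seq]

A = gamma_diagonal(4, gamma=9.0)          # keeps a symbol with prob 9/12
perturbed = pram([0, 1, 2, 3, 0], A, random.Random(2))
```

Each column of A is a valid conditional distribution, so the mechanism is a memoryless channel applied symbol by symbol.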
Release:
In order to recover the statistics, the third party decrypts the perturbed sequences using the keys provided by the clients, obtaining
$$(\hat{X}^m, \hat{Y}^m) := (\bar{X}^m \ominus V^m, \bar{Y}^m \ominus W^m),$$
where ⊖ denotes the element-by-element group subtraction that inverts ⊕,
which is effectively the data sanitized by the sampling and the PRAM. The third party produces the joint type estimate by inverting the matrix A and multiplying the inverse with the joint type of the sanitized data, as follows:
$$\hat{T}_{X^n, Y^n} := A^{-1} T_{\hat{X}^m, \hat{Y}^m}.$$
Due to the γ-diagonal property of the matrix A, the PRAM perturbation is essentially an additive operation that commutes with the additive encryption. This allows the server to perturb the encrypted data while preserving the perturbation when the encryption is removed. The decrypted, sanitized data $(\hat{X}^m, \hat{Y}^m)$ recovered by the third party are essentially the sampled data perturbed by the PRAM.
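The release computation can be sketched as follows (illustrative Python). For the γ-diagonal A, the inverse can be applied in closed form, $A^{-1}t = (qt - 1)/(\gamma - 1)$ element-wise when $t$ sums to 1 — a Sherman–Morrison identity; this shortcut is our observation, not stated in the text.

```python
def joint_type(seq, k):
    """Empirical distribution (joint type) over a flattened alphabet 0..k-1."""
    t = [0.0] * k
    for s in seq:
        t[s] += 1.0 / len(seq)
    return t

def invert_pram(t_obs, k, gamma):
    """Apply A^{-1} for the gamma-diagonal A, with q = gamma + k - 1."""
    q = gamma + k - 1
    return [(q * v - 1.0) / (gamma - 1.0) for v in t_obs]

# Sanity check: applying A to a type and then inverting recovers it.
k, gamma = 3, 4.0
q = gamma + k - 1
t = [0.5, 0.3, 0.2]
At = [((gamma - 1.0) * t[i] + 1.0) / q for i in range(k)]  # (A t)_i
t_hat = invert_pram(At, k, gamma)
```

On real perturbed data the result is only an estimate: it is unbiased, but individual entries may fall slightly outside [0, 1], so it is not itself guaranteed to be a distribution.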
Sampling Enhances Privacy
We analyze the privacy of our system. Specifically, we show how sampling in conjunction with PRAM enhances the overall privacy for the respondents in comparison to using PRAM alone. Note that if PRAM, with the γ-diagonal matrix A, were applied alone to the full databases, the resulting perturbed data would have ln(γ)-differential privacy. In the following, we show that the combination of sampling and PRAM results in sampled and perturbed data with enhanced privacy. Our system provides ε-differential privacy for the respondents, where
$$\varepsilon = \ln\left(1 + \frac{m}{n}(\gamma - 1)\right). \qquad (1)$$
We use the following notation for the set of all possible samplings
$$\Theta := \{\pi \mid \pi := (\pi_1, \ldots, \pi_m) \in \{1, \ldots, n\}^m,\ \pi_i \neq \pi_j\ \forall i \neq j\}.$$
The sampling locations $(I_1, \ldots, I_m)$ are uniformly drawn from the set $\Theta$. We define $\Theta_k := \{\pi \in \Theta \mid \exists i,\ \pi_i = k\}$ to denote the subset of samplings that select location $k$, and $\Theta_k^c := \Theta \setminus \Theta_k$ to denote the subset of samplings that do not select location $k$. For $\pi \in \Theta_k$, we define $\Theta_k(\pi) := \{\pi' \in \Theta_k^c \mid d_H(\pi, \pi') = 1\}$ as the subset of $\Theta_k^c$ that replaces the selection of location $k$ with any other non-selected location. We denote $\pi \in \Theta$ as a sampling function for the data sequences, that is, $\pi(X^n) := (X_{\pi_1}, \ldots, X_{\pi_m})$, and similarly for $\pi(Y^n)$. Using the above notation, we can rewrite the following conditional probability,
where in the last equality we have rearranged the summations to embed the summation over $\pi' \in \Theta_k^c$ into the summation over $\pi \in \Theta_k$.
Note that summing over all $\pi' \in \Theta_k(\pi)$ within a summation over all $\pi \in \Theta_k$ covers all $\pi' \in \Theta_k^c$, but overcounts each $\pi'$ exactly $m$ times, because each $\pi' \in \Theta_k^c$ belongs to $m$ of the $\Theta_k(\pi)$ sets across all $\pi \in \Theta_k$. Hence, a $(1/m)$ factor has been added to account for this overcount.
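The overcounting claim is easy to verify exhaustively on a tiny instance (illustrative Python; Θ is enumerated as ordered m-tuples of distinct locations, and the helper name is ours):

```python
from itertools import permutations
from collections import Counter

n, m, k = 5, 3, 1
Theta = list(permutations(range(1, n + 1), m))        # all samplings
Theta_k = [p for p in Theta if k in p]                # samplings selecting k
Theta_k_c = [p for p in Theta if k not in p]          # samplings avoiding k

def theta_k_of(pi):
    """Theta_k(pi): replace the selection of location k by any
    non-selected location (Hamming distance 1 from pi)."""
    i = pi.index(k)
    return [pi[:i] + (v,) + pi[i + 1:] for v in range(1, n + 1) if v not in pi]

counts = Counter(p for pi in Theta_k for p in theta_k_of(pi))
# Every pi' in Theta_k^c is covered, and each exactly m times.
```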
For the above expansion, we use the following shorthand notation for the summation terms:
$$\alpha(\pi) := P_{\hat{X}^m, \hat{Y}^m \mid \tilde{X}^m, \tilde{Y}^m}(\hat{x}^m, \hat{y}^m \mid \pi(x^n), \pi(y^n))$$
$$\beta(\pi) := P_{\hat{X}^m, \hat{Y}^m \mid \tilde{X}^m, \tilde{Y}^m}(\hat{x}^m, \hat{y}^m \mid \pi(\dot{x}^n), \pi(\dot{y}^n))$$
Thus, the probability ratio can be written as
where $\pi^* \in \Theta_k$ denotes the sampling that maximizes the ratio. Given the γ-diagonal structure of the matrix A, we have that
$$\gamma^{-1} \alpha(\pi^*) \leq \beta(\pi^*),$$
because $(\pi^*(x^n), \pi^*(y^n))$ and $(\pi^*(\dot{x}^n), \pi^*(\dot{y}^n))$ differ in only one location,
$$\gamma^{-1} \alpha(\pi^*) \leq \alpha(\pi'), \quad \forall \pi' \in \Theta_k(\pi^*),$$
because $(\pi^*(x^n), \pi^*(y^n))$ and $(\pi'(x^n), \pi'(y^n))$ differ in only one location, and
$$\alpha(\pi') = \beta(\pi'), \quad \forall \pi' \in \Theta_k(\pi^*),$$
because $(\pi'(x^n), \pi'(y^n)) = (\pi'(\dot{x}^n), \pi'(\dot{y}^n))$. Given these constraints, we can bound the likelihood ratio by $e^{\varepsilon}$, with $\varepsilon$ as in Equation (1).
To show ε-differential privacy for a given ε, we only need to upper-bound the probability ratio by $e^{\varepsilon}$, as done above. A natural question is whether this bound is tight, that is, whether there exists a smaller ε for which the bound also holds, which would make the system more private. The following example shows that the value for ε is tight.
Let $a$ and $b$ be two distinct elements in $(\mathcal{X} \times \mathcal{Y})$. Let $(x^n, y^n) = (b, a, a, \ldots, a)$, $(\dot{x}^n, \dot{y}^n) = (a, a, \ldots, a)$ and $(\hat{x}^m, \hat{y}^m) = (b, b, \ldots, b)$. Let $E$ denote the event that the first element (where the two databases differ) is sampled, which occurs with probability $(m/n)$. We can determine the likelihood ratio as follows:
$$\frac{P(\hat{x}^m, \hat{y}^m \mid x^n, y^n)}{P(\hat{x}^m, \hat{y}^m \mid \dot{x}^n, \dot{y}^n)} = \Pr[E]\,\gamma + \Pr[E^c] = 1 + \frac{m}{n}(\gamma - 1) = e^{\varepsilon}.$$
Thus, the value of ε is indeed tight.
From the privacy analysis, for given system parameters of database size $n$, number of samples $m$, and desired privacy level ε, the level of PRAM perturbation, specified by the γ parameter of the matrix A, is
$$\gamma = 1 + \frac{n}{m}\left(e^{\varepsilon} - 1\right). \qquad (2)$$
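Assuming the privacy amplification relation $\varepsilon = \ln(1 + (m/n)(\gamma - 1))$ derived in the analysis, the following sketch (illustrative Python, our function names) computes either quantity from the other:

```python
import math

def privacy_epsilon(n, m, gamma):
    """epsilon achieved by sampling m of n records followed by
    ln(gamma)-differentially-private PRAM."""
    return math.log(1.0 + (m / n) * (gamma - 1.0))

def gamma_for_epsilon(n, m, eps):
    """PRAM level gamma needed to hit a target epsilon (inverse map)."""
    return 1.0 + (n / m) * (math.exp(eps) - 1.0)

n, m = 10_000, 100
g = gamma_for_epsilon(n, m, 0.5)
```

Note that `privacy_epsilon(n, m, g) < math.log(g)` whenever m < n: sampling strictly improves on applying PRAM alone.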
Privacy against the server is obtained as a consequence of the one-time-pad encryption 210 performed on the data before transmission to the server. The encrypted sequences received by the server are statistically independent of the original database, as a consequence of the independence and uniform randomness of the keys.
Utility Analysis
In this subsection, we analyze the utility of our system. Our main result is a theoretical bound on the expected $\ell_2$-norm of the joint type estimation error. Analysis of this bound indicates the tradeoffs between utility and privacy level ε as a function of the sampling parameter m and the PRAM perturbation level γ. Given this error bound, we can compute the optimal sampling parameter m for minimizing the error bound while achieving a fixed privacy level ε.
For our system, the expected $\ell_2$-norm error of the joint type estimate is bounded by
$$E\left\|A^{-1} T_{\hat{X}^m, \hat{Y}^m} - T_{X^n, Y^n}\right\|_2 \leq \sqrt{\frac{n-m}{m(n-1)}} + c\,\sqrt{\frac{|\mathcal{X}||\mathcal{Y}|}{4m}}, \qquad (3)$$
where $c$ is the condition number of the γ-diagonal matrix $A$, given by
$$c = \frac{\gamma + |\mathcal{X}||\mathcal{Y}| - 1}{\gamma - 1}.$$
Applying the triangle inequality, we can bound the error as the sum of the error introduced by sampling and the error introduced by PRAM, as follows,
$$E\|A^{-1} T_{\hat{X}^m, \hat{Y}^m} - T_{X^n, Y^n}\|_2 \leq E\|T_{\tilde{X}^m, \tilde{Y}^m} - T_{X^n, Y^n}\|_2 + E\|A^{-1} T_{\hat{X}^m, \hat{Y}^m} - T_{\tilde{X}^m, \tilde{Y}^m}\|_2.$$
We analyze and bound the sampling error,
$$E\|T_{\tilde{X}^m, \tilde{Y}^m} - T_{X^n, Y^n}\|_2,$$
by bounding the conditional expectation
$$E\left[\|T_{\tilde{X}^m, \tilde{Y}^m} - T_{X^n, Y^n}\|_2 \mid T_{X^n, Y^n}\right].$$
For a given $(x,y) \in \mathcal{X} \times \mathcal{Y}$, the sampled type $T_{\tilde{X}^m, \tilde{Y}^m}(x,y)$, conditioned on $T_{X^n, Y^n}$, is a hypergeometric random variable normalized by $m$, with expectation and variance given by
$$E[T_{\tilde{X}^m, \tilde{Y}^m}(x,y) \mid T_{X^n, Y^n}] = T_{X^n, Y^n}(x,y),$$
$$\mathrm{Var}[T_{\tilde{X}^m, \tilde{Y}^m}(x,y) \mid T_{X^n, Y^n}] = \frac{1}{m}\,T_{X^n, Y^n}(x,y)\bigl(1 - T_{X^n, Y^n}(x,y)\bigr)\,\frac{n-m}{n-1}.$$
Applying Jensen's inequality to the conditioned sampling error, together with $\sum_{x,y} T_{X^n, Y^n}(x,y)(1 - T_{X^n, Y^n}(x,y)) \leq 1$, yields
$$E\left[\|T_{\tilde{X}^m, \tilde{Y}^m} - T_{X^n, Y^n}\|_2 \mid T_{X^n, Y^n}\right] \leq \sqrt{\sum_{x,y} \mathrm{Var}[T_{\tilde{X}^m, \tilde{Y}^m}(x,y) \mid T_{X^n, Y^n}]} \leq \sqrt{\frac{n-m}{m(n-1)}}.$$
Applying the smoothing theorem (law of total expectation), the sampling error can be bounded by
$$E\|T_{\tilde{X}^m, \tilde{Y}^m} - T_{X^n, Y^n}\|_2 \leq \sqrt{\frac{n-m}{m(n-1)}}.$$
Next, to analyze and bound the PRAM error given by
$$E\|A^{-1} T_{\hat{X}^m, \hat{Y}^m} - T_{\tilde{X}^m, \tilde{Y}^m}\|_2,$$
we use the following linear algebra.
Let $A$ be an invertible matrix and $(x, y)$ be vectors that satisfy $Ax = y$. For any vectors $(\hat{x}, \hat{y})$ such that $\hat{x} = A^{-1}\hat{y}$, we have
$$\frac{\|\hat{x} - x\|_2}{\|x\|_2} \leq c\,\frac{\|\hat{y} - y\|_2}{\|y\|_2},$$
where $c$ is the condition number of the matrix $A$.
To bound the PRAM error, we use the following consequence: because the largest singular value of the γ-diagonal matrix $A$ is 1,
$$\|A^{-1}\hat{y} - x\|_2 \leq c\,\|\hat{y} - y\|_2,$$
which allows us to bound the conditional expectation of the PRAM error as follows:
$$E\left[\|A^{-1} T_{\hat{X}^m, \hat{Y}^m} - T_{\tilde{X}^m, \tilde{Y}^m}\|_2 \mid T_{\tilde{X}^m, \tilde{Y}^m}\right] \leq c\,E\left[\|T_{\hat{X}^m, \hat{Y}^m} - A\,T_{\tilde{X}^m, \tilde{Y}^m}\|_2 \mid T_{\tilde{X}^m, \tilde{Y}^m}\right].$$
For $(x,y) \in \mathcal{X} \times \mathcal{Y}$, the perturbed and sampled type $T_{\hat{X}^m, \hat{Y}^m}(x,y)$, conditioned on $T_{\tilde{X}^m, \tilde{Y}^m}$, is a Poisson-binomial random variable normalized by $m$, with expectation
$$E[T_{\hat{X}^m, \hat{Y}^m}(x,y) \mid T_{\tilde{X}^m, \tilde{Y}^m}] = (A\,T_{\tilde{X}^m, \tilde{Y}^m})(x,y)$$
and variance at most $1/(4m)$. We can bound the following conditional expectation using Jensen's inequality to yield
$$E\left[\|T_{\hat{X}^m, \hat{Y}^m} - A\,T_{\tilde{X}^m, \tilde{Y}^m}\|_2 \mid T_{\tilde{X}^m, \tilde{Y}^m}\right] \leq \sqrt{\frac{|\mathcal{X}||\mathcal{Y}|}{4m}}.$$
Combining these bounds yields
$$E\left[\|A^{-1} T_{\hat{X}^m, \hat{Y}^m} - T_{\tilde{X}^m, \tilde{Y}^m}\|_2 \mid T_{\tilde{X}^m, \tilde{Y}^m}\right] \leq c\,\sqrt{\frac{|\mathcal{X}||\mathcal{Y}|}{4m}},$$
which, upon applying the smoothing theorem, yields the following bound on the PRAM error:
$$E\|A^{-1} T_{\hat{X}^m, \hat{Y}^m} - T_{\tilde{X}^m, \tilde{Y}^m}\|_2 \leq c\,\sqrt{\frac{|\mathcal{X}||\mathcal{Y}|}{4m}}.$$
Combining the individual bounds on the sampling and PRAM errors via the triangle inequality yields the bound of Equation (3) on the expected $\ell_2$-norm error of the type estimate formed from the sampled and perturbed data.
Because $A$ is a γ-diagonal matrix, its condition number $c$ is given by
$$c = \frac{\gamma + |\mathcal{X}||\mathcal{Y}| - 1}{\gamma - 1}.$$
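Writing $k = |\mathcal{X}||\mathcal{Y}|$ and $J$ for the all-ones $k \times k$ matrix, this follows from the eigen-structure of $A$ (a short verification sketch):

```latex
A = \frac{(\gamma - 1)\,I + J}{q}, \qquad q = \gamma + k - 1.
% J has eigenvalue k on the all-ones vector and 0 on its orthogonal
% complement, so the eigenvalues of A are
\lambda_{\max}(A) = \frac{(\gamma - 1) + k}{q} = 1, \qquad
\lambda_{\min}(A) = \frac{\gamma - 1}{q},
% and since A is symmetric, its condition number is the eigenvalue ratio:
c = \frac{\lambda_{\max}(A)}{\lambda_{\min}(A)}
  = \frac{q}{\gamma - 1}
  = \frac{\gamma + k - 1}{\gamma - 1}.
```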
Optimal Sampling Parameter m*
Given a fixed PRAM perturbation parameter γ, the error bound decays on the order of $O(1/\sqrt{m})$ as a function of the sampling parameter m. However, as m increases, ε as given in Equation (1) also increases, decreasing privacy. When we instead fix the overall privacy level ε, adjusting γ as a function of m as given by Equation (2) to maintain the desired level of privacy, we observe that increasing m too much causes the error bound to grow, as illustrated in the accompanying figure.
Intuitively, this can be explained as follows: with m too large, we must increase the PRAM perturbation by lowering γ to maintain the same level of privacy, which has the adverse effect of increasing the error bound through the condition number c.
On the other hand, with m too small, too few samples are taken, resulting in an inaccurate type estimate, as also illustrated in the accompanying figure.
The theoretically optimal sample size m* is the value of the sampling parameter that minimizes the error upper bound of Equation (3), with γ set as a function of m and ε according to Equation (2).
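Assuming the bound has the form $\sqrt{(n-m)/(m(n-1))} + c\sqrt{|\mathcal{X}||\mathcal{Y}|/(4m)}$ with $c = (\gamma + |\mathcal{X}||\mathcal{Y}| - 1)/(\gamma - 1)$ (our reconstruction of Equation (3)), the optimal m* can simply be found numerically; the closed-form optimizer is omitted here (illustrative Python, our function names):

```python
import math

def error_bound(n, m, k, eps):
    """Reconstructed utility bound at fixed privacy eps: gamma is set
    via gamma = 1 + (n/m)(e^eps - 1) (Equation (2)), then the bound is
    the sampling term plus the condition-number-scaled PRAM term."""
    gamma = 1.0 + (n / m) * (math.exp(eps) - 1.0)
    c = (gamma + k - 1.0) / (gamma - 1.0)      # condition number of A
    sampling_term = math.sqrt((n - m) / (m * (n - 1.0)))
    pram_term = c * math.sqrt(k / (4.0 * m))
    return sampling_term + pram_term

# Brute-force the optimal m* for a fixed privacy level eps.
n, k, eps = 100_000, 16, 0.5
m_star = min(range(1, n + 1), key=lambda m: error_bound(n, m, k, eps))
```

The optimum is interior: too small an m inflates the sampling term, while too large an m forces γ down and inflates the condition number, matching the intuition in the text above.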
Although the invention has been described by way of examples of preferred embodiments, it is to be understood that various other adaptations and modifications can be made within the spirit and scope of the invention. Therefore, it is the object of the appended claims to cover all such variations and modifications as come within the true spirit and scope of the invention.