PRIVACY-SENSITIVE RANKING OF USER DATA

First Claim
1. A computer-executable method for privacy-sensitive ranking of aggregated data, comprising:
distributing secret keys to a plurality of devices;
generating a plurality of probability density functions in a privacy-preserving way using encrypted data received from a subset of the plurality of devices, wherein the encrypted data is encrypted with one or more of the secret keys;
generating a plurality of probability mass functions, each probability mass function associated with a corresponding probability density function;
computing a plurality of distance values, each respective distance value being a measure of distance from a probability mass function to a second distribution; and
ranking the probability mass functions and/or associated attributes according to their respective distance from the second distribution.
Abstract
One embodiment of the present invention provides a system for privacy-sensitive ranking of aggregated data. During operation, the system distributes secret keys to a plurality of devices. The system then generates a plurality of probability density functions in a privacy-preserving way using encrypted data received from a subset of the plurality of devices. The encrypted data is data that has been encrypted with one or more of the secret keys by the subset of devices. The system then generates a plurality of probability mass functions, each probability mass function associated with a corresponding probability density function. Subsequently, the system computes a plurality of distance values, each respective distance value being a measure of distance from a probability mass function to a second distribution. The system then ranks the probability mass functions and/or associated attributes according to their respective distance from the second distribution.
17 Citations
Emoji frequency detection and deep link frequency  
Patent #
US 9,705,908 B1
Filed 09/24/2016

Current Assignee
Apple Inc.

Sponsoring Entity
Apple Inc.

Emoji frequency detection and deep link frequency  
Patent #
US 9,712,550 B1
Filed 09/24/2016

Current Assignee
Apple Inc.

Sponsoring Entity
Apple Inc.

Emoji frequency detection and deep link frequency  
Patent #
US 9,894,089 B2
Filed 06/19/2017

Current Assignee
Apple Inc.

Sponsoring Entity
Apple Inc.

Learning new words  
Patent #
US 10,133,725 B2
Filed 04/03/2017

Current Assignee
Apple Inc.

Sponsoring Entity
Apple Inc.

Emoji frequency detection and deep link frequency  
Patent #
US 10,154,054 B2
Filed 06/30/2017

Current Assignee
Apple Inc.

Sponsoring Entity
Apple Inc.

Efficient implementation for differential privacy using cryptographic functions  
Patent #
US 10,229,282 B2
Filed 09/23/2016

Current Assignee
Apple Inc.

Sponsoring Entity
Apple Inc.

Method and system for privacy-preserving order statistics in a star network
Patent #
US 10,356,056 B2
Filed 03/13/2017

Current Assignee
Palo Alto Research Center Inc.

Sponsoring Entity
Palo Alto Research Center Inc.

Emoji frequency detection and deep link frequency  
Patent #
US 10,454,962 B2
Filed 10/12/2018

Current Assignee
Apple Inc.

Sponsoring Entity
Apple Inc.

Efficient implementation for differential privacy using cryptographic functions  
Patent #
US 10,552,631 B2
Filed 03/08/2019

Current Assignee
Apple Inc.

Sponsoring Entity
Apple Inc.

User experience using privatized crowdsourced data  
Patent #
US 10,599,867 B2
Filed 11/07/2017

Current Assignee
Apple Inc.

Sponsoring Entity
Apple Inc.

User experience using privatized crowdsourced data  
Patent #
US 10,599,868 B2
Filed 11/07/2017

Current Assignee
Apple Inc.

Sponsoring Entity
Apple Inc.

Learning new words  
Patent #
US 10,701,042 B2
Filed 10/12/2018

Current Assignee
Apple Inc.

Sponsoring Entity
Apple Inc.

Differential privacy using a multi-bit histogram
Patent #
US 10,726,139 B2
Filed 09/30/2017

Current Assignee
Apple Inc.

Sponsoring Entity
Apple Inc.

Method and System for Seed Based Clustering of Categorical Data  
Patent #
US 20070271292A1
Filed 07/12/2006

Current Assignee
Qualcomm Inc.

Sponsoring Entity
Qualcomm Inc.

UNSUPERVISED PRIORITIZATION AND VISUALIZATION OF CLUSTERS  
Patent #
US 20140143249A1
Filed 03/14/2013

Current Assignee
Novantas Inc.

Sponsoring Entity
Amplero Inc.

METHODS, SYSTEMS, AND COMPUTER READABLE MEDIA FOR RAPID FILTERING OF OPAQUE DATA TRAFFIC  
Patent #
US 20150052601A1
Filed 03/13/2013

Current Assignee
University of North Carolina At Chapel Hill

Sponsoring Entity
University of North Carolina At Chapel Hill, Board of Regents of the University of Michigan, SRI International Inc.

Probabilistic Model For Cyber Risk Forecasting  
Patent #
US 20150381649A1
Filed 06/30/2014

Current Assignee
Neo Prime LLC

Sponsoring Entity
Neo Prime LLC

20 Claims
1. A computer-executable method for privacy-sensitive ranking of aggregated data, comprising:
distributing secret keys to a plurality of devices;
generating a plurality of probability density functions in a privacy-preserving way using encrypted data received from a subset of the plurality of devices, wherein the encrypted data is encrypted with one or more of the secret keys;
generating a plurality of probability mass functions, each probability mass function associated with a corresponding probability density function;
computing a plurality of distance values, each respective distance value being a measure of distance from a probability mass function to a second distribution; and
ranking the probability mass functions and/or associated attributes according to their respective distance from the second distribution.
View Dependent Claims (2, 3, 4, 5, 6, 7)
8. A computer-readable storage medium storing instructions that when executed by a computer cause the computer to perform a method for privacy-sensitive ranking of aggregated data, the method comprising:
distributing secret keys to a plurality of devices;
generating a plurality of probability density functions in a privacy-preserving way using encrypted data received from a subset of the plurality of devices, wherein the encrypted data is encrypted with one or more of the secret keys;
generating a plurality of probability mass functions, each probability mass function associated with a corresponding probability density function;
computing a plurality of distance values, each respective distance value being a measure of distance from a probability mass function to a second distribution; and
ranking the probability mass functions and/or associated attributes according to their respective distance from the second distribution.
View Dependent Claims (9, 10, 11, 12, 13, 14)
15. A computing system for privacy-sensitive ranking of aggregated data, the system comprising:
one or more processors, and a computer-readable medium coupled to the one or more processors having instructions stored thereon that, when executed by the one or more processors, cause the one or more processors to perform operations comprising:
distributing secret keys to a plurality of devices;
generating a plurality of probability density functions in a privacy-preserving way using encrypted data received from a subset of the plurality of devices, wherein the encrypted data is encrypted with one or more of the secret keys;
generating a plurality of probability mass functions, each probability mass function associated with a corresponding probability density function;
computing a plurality of distance values, each respective distance value being a measure of distance from a probability mass function to a second distribution; and
ranking the probability mass functions and/or associated attributes according to their respective distance from the second distribution.
View Dependent Claims (16, 17, 18, 19, 20)
Specification
The subject matter of this application is related to the subject matter of the following application:
U.S. patent application Ser. No. 13/021,538 (Attorney Docket No. PARC20100996USNP), entitled “PRIVACY-PRESERVING AGGREGATION OF TIME-SERIES DATA,” by inventors Runting Shi, Richard Chow, and Tsz Hong Hubert Chan, filed Feb. 4, 2011;
the disclosure of which is incorporated by reference in its entirety herein.
1. Field
The present disclosure relates to privacy-preserving data aggregation. More specifically, this disclosure relates to a method and system for ranking distributions over user data according to information leakage.
2. Related Art
A data aggregation service may collect personal information such as age, gender, income, and social preferences from a group of users. It may aggregate the data and monetize it with third parties. Such services include, for example, online social networks, “data vaults” (such as personal.com, sellboxhq.com), and remote monitoring systems. Aggregates are typically probability density functions over a specific type of data.
The data contributed by users raises privacy concerns. How sensitive data is to user privacy concerns depends on various factors: 1) the entity the data is shared with (e.g., the aggregator), 2) the entity the data is sold to (e.g., the third party), 3) the type of data (e.g., age, gender), and 4) the uniqueness of the data (e.g., how many other users share the information).
Points 1 and 2 above are important considerations for users to build trust in the third party and aggregator and contribute their data. Points 3 and 4 are key to convince users that the contributed intimate data is properly aggregated and anonymized. In many scenarios, users might prefer to retain data that is sensitive, and share data that is not. The challenge is to measure how sensitive one type of data is at the aggregate level when compared to another.
One embodiment of the present invention provides a system for privacy-sensitive ranking of aggregated data. During operation, the system distributes secret keys to a plurality of devices. The system then generates a plurality of probability density functions in a privacy-preserving way using encrypted data received from a subset of the plurality of devices. The encrypted data is data that has been encrypted with one or more of the secret keys by the subset of devices. The system then generates a plurality of probability mass functions, each probability mass function associated with a corresponding probability density function. Subsequently, the system computes a plurality of distance values, each respective distance value being a measure of distance from a probability mass function to a second distribution. Note that the second distribution can be a uniform distribution. The system then ranks the probability mass functions and/or associated attributes according to their respective distance from the second distribution.
In a variation on this embodiment, computing the plurality of distance values includes computing a Jensen-Shannon divergence for each of the distance values.
In a variation on this embodiment, the system also receives a minimum distance value λ_{i,j} from each of the plurality of devices for an attribute j. The system compares each λ_{i,j} to a respective distance value d_j, to determine whether a user i who contributed minimum distance value λ_{i,j} is willing to share data for attribute j. The system computes a ratio γ_j = S_j/N, where S_j = |{i ∈ U : d_j ≤ 1 − λ_{i,j}}| is the number of users that are willing to share attribute j. The system then determines that the ratio γ_j is greater than a predetermined threshold, and shares a probability mass function and/or a probability density function for attribute j with a customer.
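A minimal sketch of this willingness check, assuming plaintext access to the λ values for illustration (in the protocol they come from user devices; all names below are illustrative, not from the source):

```python
def share_attribute(d_j, lambdas, threshold):
    """Return True if enough users are willing to share attribute j.

    d_j: distance of attribute j's distribution from the second distribution.
    lambdas: per-user privacy-sensitivity values lambda_{i,j} in [0, 1].
    """
    # User i is willing to share when d_j <= 1 - lambda_{i,j}.
    s_j = sum(1 for lam in lambdas if d_j <= 1 - lam)
    gamma_j = s_j / len(lambdas)  # ratio of willing users, gamma_j = S_j / N
    return gamma_j > threshold

# Example: distance 0.4, three users with sensitivities 0.2, 0.5, 0.9.
# Two of the three users are willing (gamma_j = 2/3), so sharing proceeds.
print(share_attribute(0.4, [0.2, 0.5, 0.9], 0.5))  # True
```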
In a variation on this embodiment, ranking the probability mass functions and associated attributes includes ranking distances d_j associated with attributes such that d_{ρ_1} ≤ d_{ρ_2} ≤ … ≤ d_{ρ_K}, where ρ_1 = argmin_j d_j and ρ_z = argmin_{j ∉ {ρ_k}_{k=1}^{z−1}} d_j for 2 ≤ z ≤ K, such that j represents an attribute out of a total number of K attributes.
In a variation on this embodiment, the system sends a distance value d_j to a user for attribute j, and receives from the user a request for an increase in monetary retribution in exchange for selling aggregate data associated with attribute j to a customer.
In a variation on this embodiment, the system recomputes a probability density function for attribute j using only data from those users that have indicated a willingness to share their attribute j data, and shares the recomputed probability density function with a customer.
In a variation on this embodiment, generating the plurality of probability density functions includes receiving at least a pair of encrypted vectors from each device of a subset of the plurality of devices. One of the encrypted vectors is associated with a respective set of numerical values and the other encrypted vector is associated with corresponding square values of the set of numerical values, each pair of encrypted vectors encrypted using a respective secret key distributed to a device of the plurality of devices. The system subsequently computes, for each pair of encrypted vector elements associated with a numerical value and a square of the numerical value, a mean and variance of a probability density function. The system then generates the plurality of probability density functions based on the computed mean and variance values.
In the figures, like reference numerals refer to the same figure elements.
The following description is presented to enable any person skilled in the art to make and use the embodiments, and is provided in the context of a particular application and its requirements. Various modifications to the disclosed embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the present disclosure. Thus, the present invention is not limited to the embodiments shown, but is to be accorded the widest scope consistent with the principles and features disclosed herein.
Embodiments of the present invention solve the problem of determining the value of aggregates of user attribute data by ranking aggregates of attribute values according to distance values computed between an estimated probability distribution for each user attribute and another distribution. One possible choice is the uniform distribution, as it does not reveal any particular characteristic of the underlying data except that all values are equally likely. These distance values can be, for example, Jensen-Shannon divergence values. The user attributes include, for example, age, education, or income. The aggregates are probability density functions (e.g., probability distributions) generated from encrypted data received from users.
A data aggregator can use the Jensen-Shannon divergence as a method of measuring the similarity between two probability distributions, one of which is the uniform distribution and the other an estimated distribution. Since the Jensen-Shannon divergence requires discrete distributions, the data aggregator can discretize distributions to compute the relative distance between a discrete distribution of each attribute and a discrete uniform distribution. It can then rank the attributes according to increasing order of information leakage or “value” of the information.
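A minimal sketch of this discretize-and-rank step, assuming plaintext probability mass functions (the attribute names and function names are illustrative, not from the source):

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        # Kullback-Leibler divergence; terms with a_i = 0 contribute nothing.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def rank_attributes(pmfs):
    """Rank attribute PMFs by distance from the uniform distribution.

    Returns attribute names in increasing order of divergence,
    i.e. from least to most information leakage ("value").
    """
    distances = {}
    for name, pmf in pmfs.items():
        uniform = [1 / len(pmf)] * len(pmf)
        distances[name] = js_divergence(pmf, uniform)
    return sorted(distances, key=distances.get)

pmfs = {"age": [0.7, 0.2, 0.1], "gender": [0.5, 0.5]}
print(rank_attributes(pmfs))  # ['gender', 'age'] -- gender is uniform, leaks least
```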
Attributes with distributions that have a greater distance from the uniform distribution offer greater value to customers (e.g., advertisers) because there is more information leakage. The uniform distribution does not reveal interesting information and is least “valuable” to customers, since all values of the uniform distribution are equally probable. Before the data aggregator can rank the estimated distributions of user attributes, the data aggregator must first collect user attribute data and generate the estimated distributions.
The data aggregator initially aggregates user data in a privacy-preserving way by using encrypted user attribute data to compute probability density functions as approximations of actual distributions. A probability density function of a continuous random variable is a function that describes the relative likelihood for the random variable to take on a given value. The aggregation techniques disclosed herein include extending the summing property of secure multiparty computation functions in order to compute probability density functions in a distributed and privacy-preserving way. Secure multiparty computation involves providing multiple parties with a protocol for jointly computing a function while keeping their inputs private.
The data aggregator is one component of a privacy-preserving framework for aggregating user attribute data and monetizing the user data. The framework includes a protocol that is executed between users, the data aggregator, and a customer. Users contribute encrypted and differentially private data to the aggregator, which extracts a statistical model of the underlying data. Differentially private means that the aggregates computed by the aggregator will not be significantly affected by whether or not a specific user contributes his or her profile data. Rather than sending only encrypted data generated from attribute values, users also send encrypted data generated from the square of the attribute values. This allows the aggregator to compute the mean and variance of probability distribution functions.
The data aggregator receives the encrypted user attribute data and can use Gaussian approximations to estimate probability density functions for the user attributes. The aggregation and monetization techniques are privacy-preserving in that they do not reveal personal information to the aggregator, other users, or third parties. Users only disclose an aggregate model of their profiles. This preserves data utility and provides user privacy.
The data aggregator can then rank the value of the data aggregates (e.g., estimated probability density functions). The data aggregator can dynamically assess the value of the data aggregates, using an information-theoretic measure to compute the amount of “valuable” information that it can sell to customers. An information-theoretic measure can provide the divergence (e.g., entropic dissimilarity) between two probability distributions. The data aggregator can use this metric to dynamically value user statistics according to their inherent amount of “valuable” information (e.g., sensitivity). For instance, data aggregators can assess whether age statistics in a group of participants are more sensitive than income statistics. One can measure the sensitivity of different data attributes in terms of the amount of information they leak.
The inventors also developed a pricing scheme that dynamically sets the price for different data attributes according to the amount of information leakage. This is a novel scheme for dynamic pricing of different data attributes based on the amount of “interesting” information they provide, which represents a more realistic estimation of the value of different data attributes compared to fixed pricing schemes.
This disclosure also describes an exemplary privacy-preserving system for aggregation of smart meter data. Smart meters can measure electricity consumption levels (or consumption levels for other utilities) and send the data to a data aggregator. The data aggregator can generate aggregate values and monetize the data without access to the electricity consumption data of individual users.
The disclosed techniques do not depend on a third party for differential privacy, incur low computational overhead, and address linkability issues between contributions and users. To the best of the inventors' knowledge, this disclosure describes the first privacy-preserving aggregation scheme for personal data monetization. This disclosure also provides the first privacy-preserving comparative measure of information leakage of personal data attributes based on the model parameters of data distributions.
The inventors evaluated the privacypreserving framework on a real, anonymized dataset of 100,000 users (obtained from the United States Census Bureau) with different types of attributes. The results show that the framework (i) provides accurate aggregates with as little as 100 participants, (ii) generates revenue for users and data aggregators depending on the number of contributing users and sensitivity of attributes, and (iii) has low computational overhead on user devices (e.g., 0.3 ms for each user, independently of the number of participants).
Aggregator 102 receives encrypted data from smart meters 104-108 and generates a probability density function 116 using the encrypted data. Aggregator 102 may sell probability density function 116 to a customer. Smart meter 104 sends data x_1 and x_1^2 encrypted using key k_1: [x_1]_{k_1} and [x_1^2]_{k_1}. Smart meter 106 sends data x_2 and x_2^2 encrypted using key k_2: [x_2]_{k_2} and [x_2^2]_{k_2}. Smart meter 108 sends data x_3 and x_3^2 encrypted using key k_3: [x_3]_{k_3} and [x_3^2]_{k_3}.
Aggregator 102 can compute the mean using the sum of the encrypted values for N devices (brackets indicate encryption):
Aggregator 102 may also compute the variance using the encrypted square values:
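The mean and variance equations referenced above were rendered as figures and did not survive extraction; the arithmetic they describe is standard. A minimal sketch, using plaintext lists in place of the sums the aggregator decrypts (an assumption for illustration; the names are not from the source):

```python
def aggregate_mean_variance(xs, xs_squared):
    """Mean and variance from the two sums the aggregator learns.

    In the protocol the aggregator only obtains sum(x_i) and sum(x_i^2)
    after combining the encrypted contributions of N devices; here we
    pass the per-device values in the clear purely to show the arithmetic.
    """
    n = len(xs)
    mean = sum(xs) / n
    # Var(X) = E[X^2] - E[X]^2, computable from the two encrypted sums.
    variance = sum(xs_squared) / n - mean ** 2
    return mean, variance

xs = [2.0, 4.0, 6.0]
mean, var = aggregate_mean_variance(xs, [x * x for x in xs])
print(mean, var)  # 4.0 and 8/3
```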
In some scenarios, users may be part of a group of people that agree to reveal encrypted versions of their personal data to advertisers. The advertisers do not receive each person's individual data. Instead, advertisers only have knowledge of the aggregate data for users. In exchange for the users' encrypted data, the users may get special discounts.
Note that in some implementations, users can also contribute additional values, such as x^3, x^4, and higher-order moments, and aggregator 102 may combine them into higher-order approximations using moment-generating functions. The moment-generating function of a random variable X is an alternative specification of its probability distribution based on moments. The expected value of the k-th moments contributed by users equals the k-th moment of the population.
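Under the same plaintext-for-illustration assumption as above, the k-th moment estimate is simply the average of the contributed powers (in the protocol the aggregator would only see the decrypted sum of x_i^k):

```python
def kth_moment(contributions, k):
    """Estimate the population's k-th moment from users' x^k values.

    contributions: per-user values x_i, shown in the clear for
    illustration only; the aggregator actually works from sum(x_i^k).
    """
    return sum(x ** k for x in contributions) / len(contributions)

xs = [1.0, 2.0, 3.0]
print(kth_moment(xs, 3))  # (1 + 8 + 27) / 3 = 12.0
```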
Customer 120 is interested in buying aggregates of users' data. Customer 120 queries aggregator 122 for user information, while users 124 contribute their personal information to aggregator 122. Aggregator 122 acts as a proxy between users 124 and customer 120 by aggregating and monetizing user data. Users 124 contribute encrypted profiles to aggregator 122. Aggregator 122 combines encrypted profiles and determines model parameters (e.g., mean and variance) in cleartext, which it monetizes on behalf of users with potential customers. Below are detailed descriptions of the system model, including descriptions of users 124, aggregator 122, and customer 120.
Users.
Users store a set of personal attributes such as age, gender, and preferences locally. Users may want to monetize their personal information. Each user i ∈ U maintains a profile vector p_i = [x_{i,1}, …, x_{i,K}], where x_{i,j} ∈ D is the value of attribute j of user i and D is a suitable domain for j. For example, if j represents the age of user i, then x_{i,j} ∈ {1, …, M_j}, M_j = 120, and D ⊂ ℕ.
In practice, users can generate their personal profiles manually, or leverage profiles maintained by third parties. Several social networks notably allow subscribers to download their online profiles. A Facebook profile, for example, contains numerous personally identifiable information items (such as age, gender, relationships, location), preferences (movies, music, books, TV shows, brands), media (photos and videos), and social interaction data (list of friends, wall posts, liked items).
Each user i can specify a privacy-sensitivity value 0 ≤ λ_{i,j} ≤ 1 for each attribute j. Large values of λ_{i,j} indicate high privacy sensitivity, or lower willingness to disclose. In practice, λ_{i,j} can assume a limited number of discrete values, which could represent the different levels of sensitivity according to Westin's Privacy Indexes.
Users may want to monetize their profiles while preserving their privacy. For instance, users may be willing to trade an aggregate of their online behavior, such as the frequency at which they visit different categories of websites, rather than the exact time and URLs. Also, users are associated with devices that can perform cryptographic operations including multiplication, exponentiation, and discrete logarithm.
Data Aggregator.
Aggregator 122 is an untrusted third party that performs the following actions: (1) it collects encrypted attributes from users, (2) it aggregates contributed attributes in a privacy-preserving way, and (3) it monetizes users' aggregates according to the amount of “valuable” information that each attribute conveys.
Users and aggregator 122 may sign an agreement upon user registration that authorizes aggregator 122 to access only the aggregated results (but not users' actual attributes), to monetize them with customers, and to take a share of the revenue from the sale. It also binds aggregator 122 to redistribute the rest of the revenue among contributing users.
Customer.
Customer 120 wants to obtain aggregate information about users and is willing to pay for it. Customer 120 can have commercial contracts with multiple data aggregators. Similarly, aggregator 122 can have contracts with multiple customers. Note that although this disclosure describes a system with one customer and one aggregator, different implementations may include any number of customers and/or aggregators. Customer 120 interacts with aggregator 122 but not directly with users. Customer 120 obtains available attributes, and may initiate an aggregation by querying aggregator 122 for specific attributes.
The proposed system model is well-suited to many real-world scenarios, including market research and online tracking use cases. For instance, consider a car dealer that wants to assess user preferences for car brands, their demographics, and income distributions. A data aggregator might collect aggregate information about a representative set of users and monetize it with the car dealer. Companies such as Acxiom currently provide this service, but raise privacy concerns. The solution disclosed herein enables such companies to collect aggregates of personal data instead of actual values and reward users for their participation.
Another example is that of an online publisher (e.g., a news website) that wishes to know more about its online readers. In this case, the aggregator is an online advertiser that collects information about online users and monetizes it with online publishers. Similarly, one can measure the opinion of TV show audiences, target an advertisement for the highest topic of interest among a crowd, and monetize probability distribution functions to provide others with an understanding of local user preferences.
Finally, the proposed model can also be appealing to data aggregators in healthcare. Healthcare data is often fragmented in silos across different organizations and/or individuals. A healthcare aggregator can compile data from various sources and allow third parties to buy access to the data. At the same time, data contributors (e.g., users) receive a fraction of the revenue. The techniques disclosed herein address privacy concerns and help with the pricing of contributed data.
In modeling security, one may consider both passive and active adversaries.
Passive Adversaries.
Semi-honest (or honest-but-curious) passive adversaries monitor user communications and try to infer the individual contributions made by other users. For instance, users may wish to obtain attribute values of other users; similarly, data aggregators and customers may try to learn the values of the attributes from aggregated results. A passive adversary executes the protocol correctly and in the correct order, without interfering with inputs or manipulating the final result.
Active Adversaries.
Active (or malicious) adversaries can deviate from the intended execution of the protocol by inserting, modifying or erasing input or output data. For instance, a subset of malicious users may collude with each other in order to obtain information about other (honest) users or to bias the result of the aggregation. To achieve their goal, malicious users may also collude with either the data aggregator or with the customer. Moreover, a malicious data aggregator may collude with a customer in order to obtain private information about the user attributes.
During operation, aggregator 102 may initially generate and distribute security keys to multiple devices, such as the smart meters depicted in
Aggregator 102 may receive encrypted input data from the multiple devices (operation 204). There may be hundreds of such devices or more. The devices may use the security keys to encrypt data. For example, the smart meters may measure and encrypt any type of data. Such data may represent consumption levels for utilities such as electricity or water.
Aggregator 102 may then compute a sum of the encrypted data and an average of the encrypted data, and determine a mean and variance of a probability density function (operation 206). Note that aggregator 102 may compute the mean and variance for different types of data. For example, aggregator 102 may compute the average consumption of electricity in a city without knowledge of each individual's consumption levels, and then determine the mean and variance of a probability density function for the level of electricity consumption. The details for computing mean and variance from encrypted data are described with respect to
Aggregator 102 may generate a probability density function (operation 208). In some implementations, aggregator 102 may generate the probability density function of a Gaussian distribution. Aggregator 102 may generate many different probability density functions for different types of data. In some implementations, aggregator 102 may also rank the probability density functions according to information leakage. Details for generating and ranking the probability density functions are also described with respect to
Subsequently, aggregator 102 monetizes the aggregate information (operation 210). Aggregator 102 may sell the aggregate information to customers. These customers may have a contractual agreement to pay for access to the probability density functions. In some implementations, system 100 may allow a third party to access the information through an application programming interface (API). Neither aggregator 102 nor third parties have access to the original, unencrypted data for individual users, thereby protecting the privacy of the users.
The sections below describe functions and primitives for the aggregation and monetization of user attribute data, including computing aggregates by estimating the probability density function of user attributes. Note that the inventors decided to use the Gaussian approximation to estimate probability density functions for two reasons. First, it yields precise aggregates even with few users. The central limit theorem states that the arithmetic mean of a sufficiently large number n of independent random variables, each drawn from a distribution with expected value μ and variance σ^2, will be approximately normally distributed as N(μ, σ^2/n). Second, the Gaussian probability density function is fully defined by these two parameters, so no additional coordination among users is needed (after an initialization phase). For information-leakage ranking, the inventors chose an information-theoretic distance function.
With this protocol, there are two possible modes of implementations: batch and interactive. In batch mode, users 124 send their encrypted profiles containing personal attributes to aggregator 122. Aggregator 122 combines encrypted profiles, decrypts them, obtains aggregates for each attribute, and ranks attributes based on the amount of “valuable” information they provide. Aggregator 122 then offers customer 120 access to specific attributes.
In interactive mode, customer 120 initiates a query about specific attributes and users. Aggregator 122 selects the users matching the query, collects encrypted replies, computes aggregates, and monetizes them according to a pricing function.
This protocol is executed between users 124, aggregator 122, and customer 120. Each user i∈U and aggregator 122 may receive the following parameters: the total number of users N, the total number of attributes K, the maximum value M_{j} and minimum value m_{j} for each attribute j, and a time period t (e.g., last month) for which users agree to aggregate their data.
As illustrated in
In one implementation, G is a cyclic group of prime order p in which the decisional Diffie-Hellman assumption holds. H:Z→G is a hash function modeled as a random oracle. Assume that a trusted dealer chooses a generator g∈G, which is public, and N+1 random secret shares s_{0}, s_{1}, . . . , s_{N}∈Z_{p} such that Σ_{i=0}^{N}s_{i}=0. Aggregator 122 obtains the secret s_{0} and each user i∈U obtains a respective secret s_{i}.
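The dealer's share generation can be sketched as follows. This is an illustrative sketch only: the modulus `P` and the `deal_shares` helper are assumptions of this sketch, and a real deployment would use a 1024-bit or larger prime.

```python
import secrets

# Illustrative stand-in for the large prime order p (a real system would
# use a 1024-bit or larger prime).
P = 2**61 - 1

def deal_shares(num_users):
    """Trusted dealer: draw N random user shares, then give the aggregator
    s_0 = -(s_1 + ... + s_N) mod p, so that all N + 1 shares sum to zero."""
    user_shares = [secrets.randbelow(P) for _ in range(num_users)]
    s0 = (-sum(user_shares)) % P
    return s0, user_shares

s0, user_shares = deal_shares(5)
assert (s0 + sum(user_shares)) % P == 0
```

The zero-sum property is what later lets the aggregator decrypt only the aggregate, never an individual contribution.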
Customer 120 begins by sending a query to aggregator 122 (operation 304). The query may contain information about the type of aggregates and users. In some implementations, the query may be formatted as an SQL query. Aggregator 122 then selects users based on the customer query (operation 306). Aggregator 122 may select users based on some basic information, such as user demographics. In some implementations, aggregator 122 may let users decide whether to participate or not when it forwards the customer query to users.
Aggregator 122 forwards the customer's query to users (operation 308). Aggregator 122 may also send to users a public feature extraction function ƒ.
Next, each user i generates a profile vector containing personal attributes and also generates encrypted vectors (operation 310). Each user i generates a profile vector p_{i}∈D^{K} containing personal attributes j∈{1, . . . , K}. K is the number of attributes and the number of dimensions of the profile vector, and D represents the domain. In other words, each user i generates a profile vector p_{i}=[x_{i,1}, . . . , x_{i,K}]. Each attribute j is a value x_{i,j}∈{m_{j}, . . . , M_{j}}, where m_{j}, M_{j}∈Z_{p} are the minimum and the maximum value. Note that computations are in the cyclic group Z_{p}, where p is prime. In some implementations, p is a 1024-bit modulus. In practice, a user can derive p_{i} either from an existing online profile (e.g., Facebook or Google+) or by manually entering values x_{i,j}. The inventors used real values obtained from the U.S. Census Bureau for evaluation.
To privately compute the Gaussian parameters (μ̂_{j}, σ̂_{j}^{2}) for each attribute j and guarantee (ε, δ)-differential privacy, each user i adds noise values r_{i,j} and o_{i,j}, sampled from a symmetric geometric distribution, to its attribute values. In particular, each user i adds noise to both x_{i,j} and x_{i,j}^{2}, as they will be subsequently combined to obliviously compute the parameters of the model that underlies the actual data:
x̂_{i,j}=x_{i,j}+r_{i,j} mod p
x̂_{i,j}^{(2)}=x_{i,j}^{2}+o_{i,j} mod p
where p is the prime order.
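The noise addition can be sketched as below. The sampler uses the fact that the difference of two i.i.d. geometric variables is symmetric geometric; the parameter `alpha` and the `blind` helper are illustrative assumptions, and calibrating `alpha` to a particular (ε, δ) guarantee is not shown here.

```python
import random

def symmetric_geometric(alpha, rng=random):
    """Sample the two-sided geometric distribution, Pr[k] proportional to
    alpha**|k| for 0 < alpha < 1, as the difference of two i.i.d.
    geometric draws."""
    def geometric():
        # Count failures before the first success (success prob. 1 - alpha).
        k = 0
        while rng.random() < alpha:
            k += 1
        return k
    return geometric() - geometric()

def blind(x, p, alpha=0.8, rng=random):
    """Blind an attribute value and its square, as in the protocol:
    x_hat = x + r mod p, x_hat2 = x**2 + o mod p."""
    r = symmetric_geometric(alpha, rng)
    o = symmetric_geometric(alpha, rng)
    return (x + r) % p, (x * x + o) % p

x_hat, x_hat2 = blind(10, p=2**61 - 1)
```

Because the noise is integer-valued, the blinded values stay in Z_p and remain compatible with the discrete-log recovery performed later by the aggregator.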
With x̂_{i,j} and x̂_{i,j}^{(2)}, each user generates the following encrypted vectors (c_{i}, b_{i}), whose components are:

c_{i,j}=H(t)^{s_{i}}g^{x̂_{i,j}}, b_{i,j}=H(t)^{s_{i}}g^{x̂_{i,j}^{(2)}}
Note that b_{i,j }and c_{i,j }represent the encrypted data of each user i for an attribute j in a group of N users, s_{i }represents the secret key for user i, g represents the generator, K represents the number of attributes, and H(t) represents a hash function at time t.
Each user i then sends (c_{i}, b_{i}) to aggregator 122. Note that the encryption scheme guarantees that aggregator 122 is unable to decrypt the vectors (c_{i}, b_{i}). However, aggregator 122 can decrypt aggregates using secret share s_{0}.
Aggregator 122 then computes intermediate values, determines the mean and variance, and computes the probability density function for each attribute (operation 312). To compute the sample mean μ̂_{j} and variance σ̂_{j}^{2} without having access to the individual values x̂_{i,j}, x̂_{i,j}^{(2)} of any user i, aggregator 122 first computes the intermediate values:
V_{j}=H(t)^{s_{0}}Π_{i=1}^{N}c_{i,j}=H(t)^{Σ_{k=0}^{N}s_{k}}g^{Σ_{i=1}^{N}x̂_{i,j}}=g^{Σ_{i=1}^{N}x̂_{i,j}}
W_{j}=H(t)^{s_{0}}Π_{i=1}^{N}b_{i,j}=H(t)^{Σ_{k=0}^{N}s_{k}}g^{Σ_{i=1}^{N}x̂_{i,j}^{(2)}}=g^{Σ_{i=1}^{N}x̂_{i,j}^{(2)}}
Specifically, b_{i,j }and c_{i,j }represent the encrypted data of each user i for an attribute j in a group of N users, s_{0 }represents the secret key for aggregator 122, and H(t) represents a hash function at time t. Also, s_{k }represents the secret key for user k, and g represents the generator.
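The cancellation of the H(t) terms in V_{j} can be verified numerically with toy parameters. The group, generator, and hash construction below are illustrative stand-ins (a real system needs a DDH-hard group of large prime order), not the disclosed implementation:

```python
import hashlib
import secrets

Q = 2**127 - 1   # toy prime modulus for the multiplicative group Z_Q^*
ORDER = Q - 1    # exponents are taken mod the group order
G = 3            # assumed generator

def H(t):
    """Random-oracle stand-in: hash a time period t to a group element."""
    d = int.from_bytes(hashlib.sha256(str(t).encode()).digest(), "big")
    return pow(G, d % ORDER, Q)

def encrypt(s_i, x_hat, t):
    """User-side: c_{i,j} = H(t)^{s_i} * g^{x_hat} mod Q."""
    return (pow(H(t), s_i, Q) * pow(G, x_hat, Q)) % Q

def aggregate(s0, ciphertexts, t):
    """Aggregator-side: V_j = H(t)^{s_0} * prod(c_{i,j}).  Because the
    shares sum to zero, the H(t) factors cancel, leaving g^{sum of x_hat}."""
    v = pow(H(t), s0, Q)
    for c in ciphertexts:
        v = (v * c) % Q
    return v

shares = [secrets.randbelow(ORDER) for _ in range(4)]
s0 = (-sum(shares)) % ORDER
xs = [5, 7, 11, 2]
v = aggregate(s0, [encrypt(s, x, t=1) for s, x in zip(shares, xs)], t=1)
assert v == pow(G, sum(xs), Q)
```

No single ciphertext can be decrypted here: only the product over all users, combined with s_{0}, removes the hash masks.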
To obtain (μ̂_{j}, σ̂_{j}^{2}), aggregator 122 computes the discrete logarithm base g of {V_{j}, W_{j}}:

μ̂_{j}=(1/N)log_{g}V_{j}, σ̂_{j}^{2}=(1/N)log_{g}W_{j}−μ̂_{j}^{2}
Finally, using the derived (μ̂_{j}, σ̂_{j}^{2}), aggregator 122 computes the Gaussian probability density function for each of the K attributes. In some implementations, aggregator 122 may compute probability density functions at different points in time. Users can contribute information regularly, allowing observation of trends and patterns in user attitudes.
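Recovering the parameters then reduces to two small discrete logarithms. The brute-force `dlog` below (viable because attribute sums are small) and the noise-free moment formulas are illustrative assumptions of this sketch; a production system could use, e.g., Pollard's kangaroo method:

```python
def dlog(value, g, q, bound=10**6):
    """Brute-force discrete logarithm base g in Z_q^*."""
    acc = 1
    for e in range(bound):
        if acc == value:
            return e
        acc = (acc * g) % q
    raise ValueError("exponent not found below bound")

def gaussian_params(V, W, g, q, n_users):
    """mu = (1/N) log_g V;  var = (1/N) log_g W - mu**2 (noise ignored)."""
    mu = dlog(V, g, q) / n_users
    var = dlog(W, g, q) / n_users - mu ** 2
    return mu, var

# Toy check: g = 3, N = 4 users whose values sum to 25 and whose squared
# values sum to 199.
Q = 2**127 - 1
mu, var = gaussian_params(pow(3, 25, Q), pow(3, 199, Q), 3, Q, 4)
```

With these inputs the recovered mean is 25/4 and the variance is 199/4 minus the squared mean, matching the usual moment estimates.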
Aggregator 122 may then compute distance measures and rank attributes according to information leakage (operation 314). In order to estimate the amount of valuable information (i.e., sensitivity) that each attribute leaks, the inventors propose to measure the distance between N_{j} and the Uniform distribution U (which does not leak any information). N_{j} represents the estimated distribution for attribute j. Others have studied a related concept for measuring the “interestingness” of textual data by comparing it to an expected model, usually with the Kullback-Leibler (KL) divergence. To the best of the inventors' knowledge, this disclosure is the first to explore this approach in the context of information privacy. Instead of the KL divergence, the inventors rely on the Jensen-Shannon (JS) divergence because it is a (1) symmetric and (2) bounded equivalent of the KL divergence. It is defined as:
JS(u, q)=H(m)−½(H(u)+H(q)), where m=u/2+q/2 and H is the Shannon entropy. As JS lies in [0,1] (when using the logarithm base 2), it quantifies the relative distance between N_{j} and U_{j}, and also allows absolute comparisons with distributions other than the uniform.
Note that in some implementations, one can also compare N_{j} with a Gaussian distribution or other probability distribution. Some implementations may use different similarity functions for measuring divergence, such as the Kullback-Leibler divergence, based on specific requirements (e.g., presence or absence of bounds, symmetry, performance, or data types).
Since JS operates on discrete values, aggregator 122 must first discretize distributions N_{j} and U_{j}. Given the knowledge of the interval {m_{j}, . . . , M_{j}} for each attribute j, one can use Riemann's centered sum to approximate a definite integral, where the number of approximation bins determines the accuracy of the approximation. The inventors choose the number of bins to be M_{j}−m_{j}, and thus guarantee a bin width of 1. One can approximate N_{j} by the discrete random variable dN_{j} with the following mass function:
where pdf_{j} is the probability density function of N_{j} and x_{j}∈{m_{j}, . . . , M_{j}}. For the uniform distribution U_{j}, the discretization to dU_{j} is straightforward, i.e., Pr(dU_{j})=(1/(M_{j}−m_{j}), . . . , 1/(M_{j}−m_{j}))^{T}, where dim(dU_{j})=K.
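The discretization and distance computation can be sketched as follows (bin width 1, centered sum, base-2 logs; the renormalization step and the function names are assumptions of this sketch):

```python
import math

def shannon_entropy(p):
    """Shannon entropy in bits of a probability mass vector."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def js_divergence(u, q):
    """Jensen-Shannon divergence with base-2 logs, so the result is in [0, 1]."""
    m = [(a + b) / 2 for a, b in zip(u, q)]
    return shannon_entropy(m) - (shannon_entropy(u) + shannon_entropy(q)) / 2

def discretize_gaussian(mu, var, lo, hi):
    """Centered Riemann sum with bin width 1 over {lo, ..., hi},
    renormalized so the masses sum to one."""
    sd = math.sqrt(var)
    pdf = [math.exp(-((x - mu) ** 2) / (2 * var)) / (sd * math.sqrt(2 * math.pi))
           for x in range(lo, hi + 1)]
    total = sum(pdf)
    return [v / total for v in pdf]

d_n = discretize_gaussian(mu=35.0, var=120.0, lo=18, hi=90)   # e.g., an "age" attribute
d_u = [1 / len(d_n)] * len(d_n)                               # discretized uniform
distance = js_divergence(d_n, d_u)
```

A peaked (informative) distribution yields a distance near 1, while a near-uniform (uninformative) one yields a distance near 0.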
In one implementation, aggregator 122 can compute distances d_{j}=JS(dN_{j}, dU_{j})∈[0,1] and rank attributes in increasing order of information leakage such that d_{ρ_{1}}≦d_{ρ_{2}}≦ . . . ≦d_{ρ_{K}}, where ρ_{1}=arg min_{j} d_{j} and, for 2≦z≦K, ρ_{z}=arg min_{j∉{ρ_{1}, . . . , ρ_{z−1}}} d_{j}.
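The ranking step above amounts to sorting attribute indices by their distances; a minimal sketch (the function name is an assumption):

```python
def rank_attributes(distances):
    """Return attribute indices rho_1, ..., rho_K in increasing order of
    information leakage d_j."""
    return sorted(range(len(distances)), key=lambda j: distances[j])

# Attribute 1 leaks least, attribute 2 most:
assert rank_attributes([0.4, 0.1, 0.9]) == [1, 0, 2]
```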
At this point, aggregator 122 has computed the 3-tuple (d_{j}, μ̂_{j}, σ̂_{j}^{2}) for each attribute j. Each user i can now decide whether it is comfortable sharing attribute j given distance d_{j} and privacy sensitivity λ_{i,j}. To do so, each user i sends λ_{i,j} to aggregator 122 for comparison. Aggregator 122 then checks which users are willing to share each attribute j and updates a ratio γ_{j}=S_{j}/N, where S_{j} is the number of users that are comfortable sharing, i.e., S_{j}=|{i∈U s.t. d_{j}≦1−λ_{i,j}}|. In some implementations, aggregator 122 may use a majority rule to decide whether or not to monetize attribute j.
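The sharing check and majority rule can be sketched as follows (the function names and the strict 0.5 threshold are assumptions beyond what the text states):

```python
def sharing_ratio(d_j, sensitivities):
    """gamma_j = S_j / N, where user i shares attribute j iff
    d_j <= 1 - lambda_{i,j}."""
    s_j = sum(1 for lam in sensitivities if d_j <= 1 - lam)
    return s_j / len(sensitivities)

def monetize(d_j, sensitivities):
    """Majority rule: monetize attribute j only if most users would share."""
    return sharing_ratio(d_j, sensitivities) > 0.5
```

For example, with sensitivities [0.5, 0.9, 0.1] and d_j = 0.3, two of three users are comfortable sharing, so the attribute would be monetized.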
In some implementations, aggregator 122 may recompute aggregate values using only data from those users that are willing to share their attribute data, and share the recomputed aggregate values. In some implementations, aggregator 122 may send information including distance d_{j }to users, and users may choose to share their attribute data provided that they receive a predetermined increase in monetary retribution.
Note that one can use the disclosed techniques for ranking data aggregates in many scenarios, including: (1) detecting sensitivity of different data types for a set of users, (2) pricing different data types depending on potential economic value, (3) quantifying similarities of distributions among different information types, (4) assessing similarity between the expected behavior of a person and the actual behavior for authorization and access control purposes, (5) diagnosing a health condition (comparison of the symptoms with expected model for a given condition), and (6) providing differentiated privacy guarantees (and costs) based on the sensitivity of information. This can help existing market players introduce differentiated services and pricing based on the sensitivity of information types, depending on the set of users.
Further, the disclosed method for privacy-preserving ranking of user data is oblivious to the nature or type of personal information it ranks, as it does not require access to the underlying data. It works with any number of users, and operates with any type of data that can be expressed in numerical form.
After the ranking phase, aggregator 122 may conclude the process with pricing and revenue phases. Aggregator 122 may determine the cost Cost(j) of each attribute j (operation 316). Note that users typically assign unique monetary value to different types of attributes depending on several factors, such as offline/online activities, the type of third parties involved, privacy sensitivity, level of detail, and fairness.
In some applications, aggregator 122 can measure the value of aggregates depending on their sensitivity, the number of contributing users, and the price of each attribute. One possible way to estimate the value of an aggregate j is to use the following linear model:
Cost(j)=Price(j)·d_{j}·N
where Price(j) is the monetary value that users assign to attribute j. As an example pricing scheme, each attribute may have a relative value of 1. Others have estimated the value of user attributes in a large range from $0.0005 to $33, highlighting the difficulty in determining a fixed price. In practice, this is likely to change depending on the monetization scenario.
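Under this linear model (with the example unit price), the value of an aggregate is, e.g.:

```python
def cost(price_j, d_j, n_users):
    """Cost(j) = Price(j) * d_j * N, per the linear model above."""
    return price_j * d_j * n_users

# 100 contributing users, unit price, JS distance 0.5:
assert cost(1, 0.5, 100) == 50.0
```

Note the model's consequence: an attribute whose distribution matches the uniform (d_j = 0) is worth nothing, regardless of how many users contribute.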
Aggregator 122 may then send data to customers to facilitate purchases of model parameters (operation 318). In some implementations, aggregator 122 may send a set of 2-tuples {(d_{ρ_{z}}, Cost(ρ_{z}))}_{z=1}^{K} to customer 120. Based on the tuples, customer 120 may select a set P of attributes it wishes to purchase. After the purchase is complete, aggregator 122 redistributes revenue R among users and itself, according to an agreement stipulated with the users upon their first registration with aggregator 122.
One implementation of a revenue sharing monetization scheme in which revenue is split among users and aggregator 122 (e.g., aggregator 122 takes commissions) can be as follows:
where i represents a user i, j indicates attribute j, N represents the number of users, A represents aggregator 122, and w_{j} is the commission percentage of aggregator 122. This system is popular in various aggregating schemes, credit-card payments, and online stores (e.g., the iOS App Store). Note that this assumes that w_{j} is fixed for each attribute j.
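The revenue-sharing equation itself appears only as a figure in the disclosure; one plausible reconstruction, in which the aggregator keeps commission w_{j} of Cost(j) and the remainder is split evenly among the N contributing users, is sketched below (an assumed reading, not the disclosed formula):

```python
def split_revenue(cost_j, n_users, w_j):
    """Assumed split: aggregator keeps w_j * Cost(j); each of the N users
    receives an equal share of the remainder."""
    aggregator_cut = w_j * cost_j
    per_user = (1 - w_j) * cost_j / n_users
    return aggregator_cut, per_user

agg_cut, user_share = split_revenue(cost_j=100.0, n_users=10, w_j=0.5)
```

This mirrors commission-based marketplaces: the aggregator's revenue grows with participation, while each user's share shrinks as N grows.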
In some implementations, aggregator 122 may receive the data (e.g., encrypted vectors with user attribute data) from users' devices in exchange for a benefit of economic value to the users as part of a monetizing, advertising, or marketing program. For example, the users may get special discounts based on personal car or brand preferences, or other customized offers. In return, advertisers may receive aggregate data such as distributions of attributes.
To test the relevance and the practicality of the privacy-preserving monetization solution, the inventors measure the quality of aggregates, the overhead, and generated revenue. In particular, the inventors study how the number of protocol participants and their privacy sensitivities affect the accuracy of the Gaussian approximations, the computational performance, the amount of information leaked for each attribute, and revenue.
The inventors analyzed an implementation with secret shares in Z_{p}, where p is a 1024-bit modulus, with the number of users N∈[10, 100000], and each user i associated with a profile p_{i}. The inventors implemented the privacy-preserving protocol in Java, and relied on public libraries for secret key initialization and multithreaded decryption, and on the MAchine Learning for LanguagE Toolkit (MALLET) package for computation of the JS divergence.
The inventors ran the experiments on a machine equipped with Mac OS X 10.8.3, a dual-core Core i5 processor at 2.53 GHz, and 8 GB RAM. Measurements up to 100 users are averaged over 300 iterations, and the rest (from 1 k to 100 k users) are averaged over 3 iterations due to large simulation times.
The inventors populated user profiles with U.S. Census Bureau information. The inventors obtained anonymized offline and online attributes for 100,000 people, and preprocessed the acquired data by removing incomplete profiles (i.e., some respondents preferred not to reveal specific attributes).
The inventors focused on three types of offline attributes: yearly income level, education level, and age. The inventors selected these attributes because (1) a recent study shows that these attributes have high monetary value (and thus privacy sensitivity), and (2) they have significantly different distributions across users. This allowed the inventors to compare retribution models, and measure the accuracy of the Gaussian approximation for a variety of distributions.
“UserRand” displays revenue for each user when a subset of all users are chosen at random to contribute data to the aggregator. “User Indiv.” displays revenue for each user when only users that have privacy sensitivity greater than the data sensitivity contribute data to the aggregator. “UserAll” displays revenue for each user when all users contribute data to the aggregator.
The inventors evaluated four aspects of the privacypreserving scheme: model accuracy, information leakage, overhead and pricing. The results of evaluating these four aspects are described below.
Model Accuracy.
The inventors proposed to approximate empirical probability density functions with Gaussian distributions. The accuracy of approximations is important to assess the relevance of derived data models.
One can measure the accuracy of the Gaussian approximation in more detail with the JS divergence (
These results indicate that, for non-uniform distributions, the Gaussian approximation is accurate with a relatively small number of users (about 100). It is interesting to study this result in light of the central limit theorem, which states that the arithmetic mean of a sufficiently large number of variables will tend to be normally distributed. In other words, the Gaussian approximation quickly converges to the original distribution, which confirms the validity of the experiments. This also means that a customer can obtain accurate models even when requesting aggregates about small groups of users: collecting data about more than 1 k users does not significantly improve the accuracy of approximations, even for more extreme distributions.
Information Leakage.
One can compare the divergence between Gaussian approximations and uniform distributions to measure the information leakage of different attributes.
Overall, observe that education is by far the attribute with the largest distance to the uniform distribution, and therefore arguably the most valuable one. In comparison, income and age are 50% and 75% less “revealing.” Information leakage for age decreases from 100 to 1 k users, as age distribution in the dataset tends towards a uniform distribution. In contrast, education and income are significantly different from a uniform distribution. An important observation is that the amount of valuable information does not increase monotonically with the number of users: For age, it decreases by 30% when the number of users increases from 100 to 1 k, and for education it decreases by 3% when transitioning from 10 k to 5 k users.
These findings show that larger user samples do not necessarily provide better discriminating features. This also shows that users should not decide whether to participate in the protocol solely based on a fixed threshold over total participants, as this may prove to leak slightly more private information.
Overhead.
The inventors also measure the computation overhead for both users and the aggregator. For each user, one execution of the protocol requires 0.284 ms (excluding communication delays), out of which 0.01 ms are spent for the profile generation, 0.024 ms for the feature extraction, 0.026 ms for the differentialprivacy noise addition, and 0.224 ms for encryption of the noisy attribute. In general, user profiles are not subject to change within short time intervals, thus suggesting that userside operations could be executed on resourceconstrained devices such as mobile phones.
From
Pricing.
The price of an attribute aggregate depends on the number of contributing users, the amount of information leakage, and the cost of the attribute. In an implementation, each attribute j has a unit cost of 1 and the aggregator takes a commission w_{j}. There can be three types of privacy sensitivities λ: (i) a uniform random distribution of privacy sensitivities λ_{i,j} for each user i and for each attribute j, (ii) an individual privacy sensitivity λ_{i} for each user (same across different attributes), and (iii) an all-share scenario (λ_{i}=0 and all users contribute). The commission percentage is w_{j}=w=0.1.
Observe that users have an incentive to participate as they earn some revenue (rather than not benefiting at all), but the generated revenue does not generate significant income. Thus, it might encourage user participation from biased demographics (e.g., similar to Amazon Mechanical Turk). In contrast, the aggregator has incentives to attract more users, as its revenue increases with the number of participants. However, customers have an incentive to select fewer users because cost increases with the number of users, and 100 users provide as good an aggregate as 1000 users. This is an intriguing result, as it encourages customers to focus on small groups of users representative of a certain population category.
Passive Adversary.
To ensure privacy of the personal user attributes, the framework relies on the security of the underlying encryption and differentialprivacy methods. Hence, no passive adversary (e.g., a user participating in the monetization protocol, the data aggregator or an external party not involved in the protocol) can learn any of the user attributes. This assumes that the system has performed the key setup phase correctly and that one has chosen a suitable algebraic group (satisfying the DDH assumption) with a large enough prime order (e.g., 1024 bits or more).
Active Adversary.
The framework is resistant to collusion attacks among users and between a subset of users and the aggregator, as each user i encrypts its attribute values with a unique and secret key s_{i}. However, pollution attacks, which try to manipulate the aggregated result by encrypting out-of-scope values, can affect the aggregate result of the protocol. Nevertheless, one can mitigate such attacks by including, in addition to encryption, range checks based on efficient (non-interactive) zero-knowledge proofs of knowledge. Each user could submit, in addition to the encrypted values, a proof that such values are indeed in the plausible range specified by the data aggregator. However, even within a specific range, a user can manipulate its contributed value and thus affect the aggregate. Although nudging users to reveal their true attribute value is an important challenge, it is outside the scope of this disclosure.
Specifically, apparatus 700 can comprise any combination of attribute collector module 702, encryption module 704, utility usage meter 706, and aggregator-device communication module 708. Note that apparatus 700 may also include additional modules and data not depicted in
Some implementations may include attribute collector module 702, which collects user attribute data. Encryption module 704 encrypts data to send to aggregator 122. Some implementations may include utility usage meter 706, which measures electricity, water, or other utility consumption levels. Aggregator-device communication module 708 sends the encrypted data to aggregator 122.
During operation, one or more applications, such as aggregation module 824, are loaded from storage device 806 into memory 804 and then executed by processor 802. While executing the program, processor 802 performs the aforementioned functions. Computer and communication system 800 may be coupled to an optional display 817, keyboard 818, and pointing device 820.
The data structures and code described in this detailed description are typically stored on a computer-readable storage medium, which may be any device or medium that can store code and/or data for use by a computer system. The computer-readable storage medium includes, but is not limited to, volatile memory, non-volatile memory, magnetic and optical storage devices such as disk drives, magnetic tape, CDs (compact discs), DVDs (digital versatile discs or digital video discs), or other media capable of storing computer-readable media now known or later developed.
The methods and processes described in the detailed description section can be embodied as code and/or data, which can be stored in a computer-readable storage medium as described above. When a computer system reads and executes the code and/or data stored on the computer-readable storage medium, the computer system performs the methods and processes embodied as data structures and code and stored within the computer-readable storage medium.
Furthermore, methods and processes described herein can be included in hardware modules or apparatus. These modules or apparatus may include, but are not limited to, an application-specific integrated circuit (ASIC) chip, a field-programmable gate array (FPGA), a dedicated or shared processor that executes a particular software module or a piece of code at a particular time, and/or other programmable-logic devices now known or later developed. When the hardware modules or apparatus are activated, they perform the methods and processes included within them.
The foregoing descriptions of various embodiments have been presented only for purposes of illustration and description. They are not intended to be exhaustive or to limit the present invention to the forms disclosed. Accordingly, many modifications and variations will be apparent to practitioners skilled in the art. Additionally, the above disclosure is not intended to limit the present invention.