METHODS AND APPARATUS TO ANONYMIZE A DATASET OF SPATIAL DATA

Associated Cases: 0
Associated Defendants: 0
Accused Products: 0
Forward Citations: 26
Petitions: 0
Assignments: 2
First Claim
1. A method to anonymize a dataset of spatial data, comprising:
 generating a spatial indexing structure with spatial data;
establishing a height value associated with the spatial indexing structure to generate a plurality of tree nodes, each of the plurality of tree nodes associated with spatial data counts;
calculating a localized noise budget value for respective ones of the tree nodes based on the height value and an overall noise budget; and
anonymizing the plurality of tree nodes with an anonymization process, the anonymization process using the localized noise budget value for respective ones of the tree nodes.
Abstract
Methods and apparatus are disclosed to anonymize a dataset of spatial data. An example method includes generating a spatial indexing structure with spatial data, establishing a height value associated with the spatial indexing structure to generate a plurality of tree nodes, each of the plurality of tree nodes associated with spatial data counts, calculating a localized noise budget value for respective ones of the tree nodes based on the height value and an overall noise budget, and anonymizing the plurality of tree nodes with an anonymization process, the anonymization process using the localized noise budget value for respective ones of the tree nodes.
No References
37 Claims
1. A method to anonymize a dataset of spatial data, comprising:
generating a spatial indexing structure with spatial data;
establishing a height value associated with the spatial indexing structure to generate a plurality of tree nodes, each of the plurality of tree nodes associated with spatial data counts;
calculating a localized noise budget value for respective ones of the tree nodes based on the height value and an overall noise budget; and
anonymizing the plurality of tree nodes with an anonymization process, the anonymization process using the localized noise budget value for respective ones of the tree nodes. (Dependent claims: 5, 7, 8, 9, 11, 12, 13, 14, 15)
2.-4. (canceled)
 6. (canceled)
 10. (canceled)
16. An apparatus to anonymize a dataset of spatial data, comprising:
a spatial decomposition manager to generate a spatial indexing structure with spatial data, the spatial decomposition manager to establish a height value associated with the spatial indexing structure to generate a plurality of tree nodes, each of the plurality of tree nodes associated with spatial data counts;
a privacy budget manager to calculate a localized noise budget value for respective ones of the tree nodes based on the height value and an overall noise budget; and
a noise allocation engine to anonymize the plurality of tree nodes with an anonymization process, the anonymization process using the localized noise budget value for respective ones of the tree nodes. (Dependent claims: 20, 21, 22)
17.-19. (canceled)
 23. (canceled)
25.-26. (canceled)
27. A tangible machine accessible medium having instructions stored thereon that, when executed, cause a machine to, at least:
generate a spatial indexing structure with spatial data;
establish a height value associated with the spatial indexing structure to generate a plurality of tree nodes, each of the plurality of tree nodes associated with spatial data counts;
calculate a localized noise budget value for respective ones of the tree nodes based on the height value and an overall noise budget; and
anonymize the plurality of tree nodes with an anonymization process, the anonymization process using the localized noise budget value for respective ones of the tree nodes. (Dependent claims: 28, 29, 31, 35, 36)
 30. (canceled)
32.-34. (canceled)
37.-47. (canceled)
Specification
This disclosure relates generally to data security, and, more particularly, to methods and apparatus to anonymize a dataset of spatial data.
Data collected and/or otherwise cultivated by a business may permit one or more business services to occur in a manner tailored to one or more corresponding subscribers. The collected data may relate to subscriber behavior(s) and/or other details to allow the one or more services to satisfy subscriber expectations. In some examples, the collected data also includes information deemed private by the corresponding subscribers and, if revealed in a public manner, may expose the subscribers to risk and/or embarrassment.
Methods and apparatus are disclosed to anonymize a dataset of spatial data. An example method includes generating a spatial indexing structure with spatial data, establishing a height value associated with the spatial indexing structure to generate a plurality of tree nodes, each of the plurality of tree nodes associated with spatial data counts, calculating a localized noise budget value for respective ones of the tree nodes based on the height value and an overall noise budget, and anonymizing the plurality of tree nodes with an anonymization process, the anonymization process using the localized noise budget value for respective ones of the tree nodes.
Collected and/or cultivated data may include data that is spatial, such as mobility patterns associated with wireless subscribers. A business organization responsible for the collection, cultivation and/or maintenance of such data may pass the data between its own business units to facilitate one or more services for the subscribers. In other examples, the data may be outsourced to authorized recipients/entities in an effort to satisfy subscriber service expectations, such as providing the data associated with each subscriber to an outsourced call center that processes service call requests.
In the event the data is intercepted en route to an authorized recipient (e.g., an outsourced call center), or if the data is wrongfully obtained via one or more hacking attempts, the one or more subscribers associated with the data may be at financial risk, may suffer embarrassment, and/or one or more local and/or federal laws may impose service provider liability. In other examples, data having one or more elements deemed private may be useful for entities interested in generating one or more conclusions related to, for example, market research. For example, although a data set may include information deemed private, other aspects of the data that are non-private may be useful for generating aggregate conclusions related to the data set, such as a number of people that live in a certain geographical area, a number of people that use a particular cellular tower in a certain geographical area, etc. In the event that the data set is sanitized/anonymized to prevent one or more privacy concerns related to individuals associated therewith, distribution of the data set may occur without concern and provide utility to a user of the sanitized data set.
In other examples, original data may be processed to remove one or more pieces of information in an effort to maintain privacy expectations of one or more persons associated with the original data. For instance, social security numbers, street address information, etc., associated with people may be removed/deleted from database records to ensure privacy expectations are met. However, previous attempts to manually anonymize data have resulted in privacy concerns when a potential attacker has one or more other and/or disparate data sources to combine with the anonymized data. For example, the Massachusetts Group Insurance Commission released data associated with state employees that showed every hospital visit by those employees. Prior to releasing the data, the state removed obvious identifiers such as name, address and social security number. Additionally, despite public concerns for the release of the data, the Governor of Massachusetts assured the public that patient privacy would be maintained. However, graduate student efforts combined publicly available voter records with the processed state employee hospital visit data to identify the Governor's unique hospital visit information from data that was previously deemed "anonymized."
Alternate techniques to transform a data set, such as a spatial data set (e.g., latitude/longitude coordinates associated with subscribers), from having sensitive data to anonymized data may include adding noise drawn from a random distribution. For example, the added noise is applied to each cell in a data grid (e.g., a grid over a geographical region of interest, such as a county) so that the original (true) data points in each grid cell are replaced with noisy values. However, the addition of noise may drown out the remaining value of the data set, yielding a geographical grid having non-zero noise values throughout. While the addition of noise may improve a degree of privacy associated with the data set, such privacy improvement may occur at the expense of data set utility.
Releasing data sets that do not compromise the privacy of data subjects may satisfy differential privacy rules if what can be learned from the released data does not substantially differ whether or not any given data from a particular individual is included. The term "differential privacy," as used herein with respect to published data, refers to data that is published in accordance with a privacy approach that seeks to increase (e.g., maximize) accuracy of data and/or queries against the data, while reducing (e.g., minimizing) the chances that someone can identify one or more particular records associated with the data. If published data complies with differential privacy, then the probability of the output falling in a set of values for a given input is at most e^{ε} times the probability of the output falling in the same set of values for an input that differs in the records of at most one individual.
In one example, D_{1 }and D_{2 }are two example neighboring datasets (e.g., D_{1 }and D_{2 }differ in only one tuple t), such that the absolute value of their difference is one, as shown in example Equation 1.
∥D_{1}−D_{2}∥=1 Equation 1.
In some examples, Equation 1 is interpreted as meaning that t has different values in each of D_{1} and D_{2}, while in other examples, Equation 1 is interpreted as meaning that t is present in only one of the two datasets. If A refers to a randomized algorithm and/or process performed over the datasets, and S refers to an arbitrary set of possible outputs of process A, then process A is said to be ε-differentially private if example Equation 2 is satisfied for all S.
Pr[A(D_{1})∈S]≤e^{ε}Pr[A(D_{2})∈S] Equation 2.
Generally speaking, example Equation 2 indicates that no individual tuple can significantly affect released information because an output distribution generated by process A is nearly the same whether or not any tuple of interest is present in the dataset. Given the process A, which may include randomization component(s), different answers will result for each iteration of process A that processes the same inputs, which further reveals a probability distribution (Pr) over a set of outputs. The noise parameter ε controls how much privacy is afforded (e.g., to an individual, to a privatization process, etc.). As the value of ε goes down, the privacy guarantees and/or likelihood of privacy increase, but at the expense of additional noise.
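To make the guarantee of Equation 2 concrete, the following sketch (illustrative code, not part of the patent) numerically checks that the output densities of the Laplace mechanism on neighboring count queries never differ by more than the e^{ε} factor:

```python
import math

def laplace_density(x, mean, eps):
    """Density of a Laplace distribution with scale 1/eps centered at mean."""
    return (eps / 2.0) * math.exp(-eps * abs(x - mean))

def dp_ratio_bound_holds(f_d1, f_d2, eps, probe_points):
    """True if the density ratio never exceeds e^(eps * |f(D1) - f(D2)|)."""
    bound = math.exp(eps * abs(f_d1 - f_d2))
    for x in probe_points:
        ratio = laplace_density(x, f_d1, eps) / laplace_density(x, f_d2, eps)
        if ratio > bound + 1e-12:
            return False
    return True

# Neighboring datasets: a count query changes by at most 1 (sensitivity 1).
print(dp_ratio_bound_holds(100.0, 101.0, 0.5, [90 + i * 0.5 for i in range(40)]))  # True
```

The bound holds at every probe point because |x−f(D_{1})|−|x−f(D_{2})| can never be less than −|f(D_{1})−f(D_{2})|.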
Example methods and apparatus disclosed herein develop private spatial decompositions (PSDs) using one or more spatial indexing methods in which to organize one or more datasets of interest. Example spatial indexing methods describe a data distribution and include, but are not limited to, quadtrees, R-trees, B-trees, and kd-trees, in which example methods and apparatus disclosed herein partition the spatial data into smaller regions in a manner that complies with differential privacy guarantees (choosing splitting points). Additionally, for each cell of an example PSD, noisy counts are generated to replace true counts in a manner that preserves differential privacy guarantees. Example PSDs disclosed herein balance practicality of privacy concerns with utility concerns so that anonymized data preserves privacy expectations while still allowing one or more useful queries to be performed on the anonymized data.
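As a concrete illustration of this idea, the following is a minimal sketch (not the patent's implementation; all names are illustrative) of a data-independent quadtree PSD: the unit square is split into four quadrants per level down to a fixed height, and each node's true point count is replaced by a count perturbed with Laplace noise scaled by that level's localized budget ε_i.

```python
import random

class QuadNode:
    """One cell of the spatial indexing structure (half-open quadrants)."""
    def __init__(self, x0, y0, x1, y1, points, level):
        self.box = (x0, y0, x1, y1)
        self.true_count = len(points)
        self.noisy_count = None
        self.children = []
        if level > 0:
            mx, my = (x0 + x1) / 2.0, (y0 + y1) / 2.0
            for qx0, qy0, qx1, qy1 in [(x0, y0, mx, my), (mx, y0, x1, my),
                                       (x0, my, mx, y1), (mx, my, x1, y1)]:
                sub = [p for p in points
                       if qx0 <= p[0] < qx1 and qy0 <= p[1] < qy1]
                self.children.append(QuadNode(qx0, qy0, qx1, qy1, sub, level - 1))

def sample_laplace(scale, rng):
    # The difference of two exponential draws is a Laplace(0, scale) sample.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def anonymize(node, eps_per_level, level, rng):
    """Replace each true count with a noisy count; counts have sensitivity 1."""
    node.noisy_count = node.true_count + sample_laplace(1.0 / eps_per_level[level], rng)
    for child in node.children:
        anonymize(child, eps_per_level, level - 1, rng)

rng = random.Random(42)
points = [(rng.random(), rng.random()) for _ in range(200)]
h = 2                                       # leaves at level 0, root at level h
eps_per_level = [1.0 / (h + 1)] * (h + 1)   # uniform split of an overall eps = 1
root = QuadNode(0.0, 0.0, 1.0, 1.0, points, h)
anonymize(root, eps_per_level, h, rng)
```

Because the quadrants are half-open and the points lie in [0, 1), every point falls in exactly one child, so the true counts of the children always sum to the parent's true count.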
Example spatial indexing methods disclosed herein include data-independent decompositions, data-dependent decompositions, and hybrid decompositions that include both data-independent and data-dependent aspects.
Spatial decompositions may be represented as a hierarchical tree decomposition of a geometric space into successively smaller areas (e.g., cells) that have corresponding data points, as illustrated in the example figures of this disclosure.
Let h denote the height of the tree, in which leaves have level zero (0) and the root has level h. If given a total privacy budget of noise parameter ε, then each localized value of ε_{i} associated with each level i for 0≤i≤h is determined in a uniform manner as shown in example Equation 3.
ε=Σ_{i=0}^{h}ε_{i} Equation 3.
In some examples, each localized value of ε_{i }may be established in a uniform manner in which each localized value is the same at each level. In other examples, localized values of ε_{i }may be dissimilar, which may include values equal to zero for level(s) associated with relatively low noisy counts.
An example PSD may be built by computing the original data counts of the example quadtree for each node. Counts for each node may be computed via any number of processes A, including a Laplace mechanism. Let ƒ(D) denote a numeric function over a dataset D. To release ƒ in a manner that complies with differential privacy that is guided by a budget noise parameter ε, the dataset may be published in a manner consistent with example Equation 4.
L(D)=ƒ(D)+X Equation 4.
In the illustrated example of Equation 4, X represents a random variable drawn from a Laplace distribution consistent with example Equation 5.

Pr[X=x]∝e^{−ε|x|/σ(ƒ)} Equation 5.
In the illustrated example of Equation 5, the value of σ(ƒ) represents a sensitivity of ƒ, which relates to a maximum change in ƒ when any single tuple of D changes. In particular, the sensitivity may be represented in a manner consistent with example Equation 6.
σ(ƒ)=max_{D_{1},D_{2}:∥D_{1}−D_{2}∥=1}|ƒ(D_{1})−ƒ(D_{2})| Equation 6.
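A minimal sketch of the Laplace mechanism of Equations 4 through 6 (all names illustrative; the averaging at the end is only a statistical sanity check that the released values are unbiased):

```python
import random

def sample_laplace(scale, rng):
    # The difference of two exponential draws is a Laplace(0, scale) sample.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def laplace_mechanism(true_value, sensitivity, eps, rng):
    """Release f(D) + X with X drawn from Laplace(scale = sensitivity / eps)."""
    return true_value + sample_laplace(sensitivity / eps, rng)

rng = random.Random(7)
# A count query has sensitivity 1: adding or removing one tuple of D
# changes the count by at most one (Equation 6).
releases = [laplace_mechanism(500.0, 1.0, 0.5, rng) for _ in range(50000)]
print(abs(sum(releases) / len(releases) - 500.0) < 0.1)
```

Each individual release may be far from 500, but the mechanism is unbiased, so the empirical mean of many releases concentrates around the true count.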
For data-independent trees/decompositions, in the event the structure of the index (e.g., the node rectangles) is publicly released, there is no corresponding danger of a release of private data associated with any individual. Instead, the node counts are modified to preserve privacy, in which each node stores a number of data points that lie in a spatial cell associated therewith. Adding or deleting a single tuple changes the counts of all nodes on a path from the root to the leaf that contains that changed tuple. As such, to generate a tree that satisfies differential privacy, privacy guarantees for each individual node count are considered as a sequential composition. For example, assume A_{1}, . . . , A_{t} represents t processes such that A_{i} satisfies ε_{i}-differential privacy for 1≤i≤t. A sequential composition of A_{1} through A_{t} satisfies ε-differential privacy in a manner consistent with example Equation 7 to create the example PSD.
ε=Σ_{i=1}^{t}ε_{i} Equation 7.
A subsequent query on the PSD, such as the example tree 250 having the noisy values published, results in varying accuracy. For a query Q, an answer to that query computed over the private tree (i.e., the noisy values) is represented by Q^{˜}. The private tree query Q^{˜ }is a random variable and an unbiased estimator of the true answer (i.e., the node values not surrounded by boxes). The variance of Q^{˜ }is a strong indicator of query accuracy, thus, the error measure may be represented in a manner consistent with example Equation 8.
Err(Q)=VAR(Q^{˜}) Equation 8.
Unlike a tree over the original dataset having unmodified counts (e.g., counts without noise), the PSD may return any number of different results to a query Q.
Based on such differing results, calculating query accuracy may be performed to guide localized noise parameters in an effort to improve accuracy. To analyze the error of a query Q (i.e., computing Q^{˜}), let Y be a set of noisy counts, and let U be a set of nodes used to answer the query Q. The total variance is the sum of the node variances, as shown via example Equation 9.
Err(Q)=Σ_{uεU}Var(Y_{u}) Equation 9.
Starting from the root, all nodes that intersect the query Q are visited and, if fully within the query Q boundary, a noisy count Y_{u} is added to the answer Q^{˜}. Each child of u that intersects with the query Q is visited, and the same procedure is performed (recursively) until the leaves of the tree structure are reached. In the event a leaf a intersects query Q, but is not contained within the query boundary, a uniformity assumption is used to estimate a fraction of Y_{a} to add to the answer Q^{˜}. Let n(Q) be a number of nodes that contribute their counts to the answer Q^{˜}. For each 0≤i≤h, let n_{i} be the number of nodes at level i that are maximally contained in the query Q in a manner consistent with example Equation 10.
n(Q)=Σ_{i=0}^{h}n_{i} Equation 10.
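The recursive query-answering procedure described above, including the uniformity assumption at partially covered leaves, can be sketched over a one-dimensional interval tree (an illustrative structure; the counts are supplied directly so the traversal logic stands alone):

```python
def build(lo, hi, leaf_counts):
    """Binary interval tree over [lo, hi) with unit-width leaves; internal
    counts are the sums of their children's counts."""
    if hi - lo == 1:
        return {"lo": lo, "hi": hi, "count": float(leaf_counts[lo]), "children": []}
    mid = (lo + hi) // 2
    left, right = build(lo, mid, leaf_counts), build(mid, hi, leaf_counts)
    return {"lo": lo, "hi": hi, "count": left["count"] + right["count"],
            "children": [left, right]}

def answer(node, q_lo, q_hi):
    a, b = max(node["lo"], q_lo), min(node["hi"], q_hi)
    if a >= b:
        return 0.0                        # no intersection with the query
    if a == node["lo"] and b == node["hi"]:
        return node["count"]              # node maximally contained in Q
    if not node["children"]:
        # Uniformity assumption: pro-rate the leaf count by overlap fraction.
        return node["count"] * (b - a) / (node["hi"] - node["lo"])
    return sum(answer(c, q_lo, q_hi) for c in node["children"])

root = build(0, 8, [3, 1, 4, 1, 5, 9, 2, 6])
print(answer(root, 2, 6))      # 4 + 1 + 5 + 9 = 19.0
print(answer(root, 1.5, 2.0))  # half of the leaf [1, 2): 0.5
```

In a deployed PSD the node counts would be the noisy counts Y_{u}; here they are exact so the recursion itself can be verified.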
The resulting value n(Q) bounds each n_{i}, and may be used to guide one or more selections of the localized noise parameter ε_{i}. Noise is independently generated in each node and, because the variance of the Laplace mechanism with the noise parameter ε_{i} is consistent with example Equation 11, the error of a query Q may be shown in a manner consistent with example Equation 12.

Var(Y_{u})=2/ε_{i}^{2} Equation 11.

Err(Q)=Σ_{i=0}^{h}n_{i}(2/ε_{i}^{2}) Equation 12.
When developing the PSD with a uniform budget, localized noise parameters may be calculated in a manner consistent with example Equation 13.

ε_{i}=ε/(h+1) Equation 13.
However, query accuracy may be improved by applying a non-uniform budget strategy when building the PSD. By substituting an upper bound for the number of nodes at level i (n_{i}), the corresponding minimized upper bound may be shown in a manner consistent with example Equation 14, subject to example Equation 3. The bound of example Equation 14 is attained for the localized noise parameter ε_{i} in a manner consistent with example Equation 15.

ε_{i}=ε·2^{(h−i)/3}(2^{1/3}−1)/(2^{(h+1)/3}−1) Equation 15.
The bound of example Equation 14 suggests that reducing a value for the tree height h will also reduce the corresponding noise, but that variance only bounds the error from the noisy counts. An additional error consideration is due to a query Q that partly intersects some leaves (e.g., errors that arise from the uniformity assumption). In a worst case scenario, this error is proportional to the number of points in each leaf that is intersected by the query Q (a query may intersect approximately n_{h}=2^{h} leaves). On average, for an input having n points, each leaf may have O(n/4^{h}) points assuming balanced leaf counts. As such, the error due to making the uniformity assumption occurs in a manner consistent with example Equation 17, which suggests a greater benefit from increasing the tree height h because the total error grows as O(n/2^{h}+2^{h/3}).
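One concrete geometric allocation consistent with the discussion above (the exact constants of Equations 14 and 15 are an assumption here) gives level i, counting leaves as level 0, a budget proportional to 2^{(h−i)/3}, normalized so the per-level budgets sum to the overall ε:

```python
def geometric_budget(eps, h):
    """Geometric per-level budgets for a tree of height h (leaves at level 0),
    allocating proportionally more of the budget to levels nearer the leaves."""
    weights = [2.0 ** ((h - i) / 3.0) for i in range(h + 1)]
    total = sum(weights)
    return [eps * w / total for w in weights]

budgets = geometric_budget(1.0, 4)
print(abs(sum(budgets) - 1.0) < 1e-12)  # True: levels sum to the overall eps
print(budgets[0] > budgets[4])          # True: leaves (level 0) get the most
```

Sequential composition (Equation 3) is preserved because the normalization forces the per-level values to sum to ε exactly.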
While example noise budget strategies described above include uniform noise allocation and geometric noise allocation, the noise budget strategies are not limited thereto. One or more alternate and/or additional strategies for dividing the noise parameter ε along a path are possible. In some examples, a quadtree may be built to a depth h with ε_{0}=ε, in which the entire budget is allocated to the leaves (at level zero) and none of the budget is allocated to intermediary nodes along a path. In this example, queries may be computed over a grid defined by the leaf regions, and the hierarchical structure of the tree is not relevant to the outcome.
In other examples, budget conservation may be applied on a node-by-node basis. For some levels i, the noise parameter may be set to zero (ε_{i}=0). As such, no corresponding counts would be released for nodes having a zero noise parameter, thereby allowing prior and/or subsequent nodes to use portion(s) of the noise budget ε.
In addition to data-independent trees derived from corresponding data-independent decompositions, such as the example quadtree 100, trees may be derived from data-dependent decompositions (e.g., kd-trees), in which splitting points, such as medians, are selected based on the data itself.
To compute an example private median for a given set of points (e.g., a dataset), let C={x_{1}, . . . , x_{n}} be a set of n values in non-decreasing order in a range between a low value and a high value, in which the difference between the low and high values is M. The median value for the example dataset may be represented as x_{m}. In some examples, the Laplace mechanism may be applied to return L(C)=x_{m}+X, where X is Laplace noise. However, the sensitivity of the median is of the same order of magnitude as the range M, and the noise value may dwarf x_{m}. At least one consequence of this noise value is that the noisy median may fall outside the range of the low value and high value, which prohibits useful division of the dataset.
To divide a dataset with a data-dependent spatial index, example methods and apparatus disclosed herein employ one or more of a smooth sensitivity approach, an exponential mechanism approach, a cell-based approach and/or a noisy mean approach. The example smooth sensitivity (SS) approach tailors the noise to be more specific to the example set C. However, one tradeoff of the SS approach relative to the Laplace mechanism is that it has weaker privacy guarantees. Assuming that a noise budget of interest ε is greater than zero (0<ε), δ<1, and example Equation 18 is true, then example Equations 19 and 20 define the smooth sensitivity of the median for C.
In the illustrated example Equations 18, 19 and 20, X represents a random variable drawn from the Laplace distribution with parameter 1 and σ_{s}=σ_{s }(median). Additionally, x_{i }is deemed low when i<0, and x_{i }is deemed high when i>n.
The example exponential mechanism (EM) draws output values from a probability distribution over all possible outputs rather than adding random noise to the true values, as in Laplace mechanism approaches. Drawing the output values from a probability distribution satisfies differential privacy conditions when applied to the median. Assuming x is an element between the low and high values, let rank(x) denote the rank of x in C. The example exponential mechanism returns x in a manner consistent with example Equation 21.
In the illustrated example of Equation 21, because all values x between two consecutive values in C have the same rank, they are equally likely to be chosen. As such, EM may be implemented by observing that it chooses an output from an interval consistent with example Equation 22, and a probability proportional to example Equation 23. Conditional on I_{k }being chosen in the first step, EM returns a uniform random value in I_{k}.
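The two-step procedure above can be sketched as follows (the utility function, a negative rank distance with a conventional sensitivity factor of 2 in the exponent, is an assumption standing in for Equations 21 through 23):

```python
import math
import random

def em_median(values, low, high, eps, rng):
    """Exponential mechanism for the median: pick an interval I_k between
    consecutive sorted values with probability proportional to its width
    times exp(-eps * |rank - median rank| / 2), then return a uniform
    random value within the chosen interval."""
    xs = [low] + sorted(values) + [high]
    m = len(values) // 2  # rank of the true median among the n values
    weights = []
    for k in range(len(xs) - 1):
        width = xs[k + 1] - xs[k]
        # Every output in (xs[k], xs[k+1]) has rank k in C, so all such
        # outputs are equally likely, as noted in the text.
        weights.append(width * math.exp(-eps * abs(k - m) / 2.0))
    u = rng.random() * sum(weights)
    for k, w in enumerate(weights):
        if u <= w:
            return xs[k] + rng.random() * (xs[k + 1] - xs[k])
        u -= w
    return xs[-1]

rng = random.Random(3)
out = em_median([2.0, 3.0, 5.0, 8.0, 13.0], 0.0, 20.0, 1.0, rng)
print(0.0 <= out <= 20.0)  # True: the output always stays in [low, high]
```

Unlike the plain Laplace mechanism on the median, the returned value is always inside [low, high], so the dataset can still be usefully divided.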
The example cell-based approach imposes a fixed resolution grid over C, and then computes the median based on the noisy counts in the grid cells. When applied to a hierarchical decomposition, a fixed grid is computed over the entire dataset, and medians are then computed from the subset of grid cells in each node. Cell counts have a sensitivity of 1, and the accuracy depends on the degree of coarseness of the grid relative to the dataset distribution.
The example noisy mean approach replaces a median calculation with a mean calculation. A private mean may be computed by calculating a noisy sum (e.g., with sensitivity M) and a noisy count (e.g., with sensitivity 1), and then calculating the corresponding ratio.
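The noisy mean primitive can be sketched as follows (the even split of the local budget between the sum and the count is an illustrative assumption):

```python
import random

def sample_laplace(scale, rng):
    # The difference of two exponential draws is a Laplace(0, scale) sample.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_mean(values, value_range, eps, rng):
    """Release (noisy sum) / (noisy count): the sum has sensitivity
    value_range (M), the count has sensitivity 1, each charged half of
    the local budget eps."""
    eps_half = eps / 2.0
    noisy_sum = sum(values) + sample_laplace(value_range / eps_half, rng)
    noisy_count = len(values) + sample_laplace(1.0 / eps_half, rng)
    return noisy_sum / noisy_count

rng = random.Random(11)
data = [5.0] * 1000  # true mean is 5.0, with range M = 10
estimates = [noisy_mean(data, 10.0, 1.0, rng) for _ in range(2000)]
print(abs(sum(estimates) / len(estimates) - 5.0) < 0.5)
```

For reasonably large point sets the count noise is small relative to the count itself, so the ratio concentrates near the true mean even though both numerator and denominator are perturbed.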
The privacy guarantee and/or expectation of privacy of a datadependent tree may be obtained by composing individual privacy guarantees for each median, in which each median consumes a portion of the overall privacy budget (noise budget) (ε). As discussed above, smaller values of the privacy budget result in stronger privacy guarantees, but at the expense of greater noise. Stated differently, larger values for each localized noise parameter ε_{i }yields smaller noise in a probabilistic manner. For each calculated median, a portion of the overall budget is consumed, thereby leaving a lower amount of the budget for additional median(s). Additionally, the privacy guarantee of the datadependent trees is obtained by composing the individual privacy guarantees for counts along each roottoleaf path (see example Equation 7). The resulting PSD released is deemed differentially private in a manner consistent with example Equation 24.
ε=Σ_{i=0}^{h−1}ε_{i}^{m}+Σ_{i=0}^{h}ε_{i}^{c} Equation 24.
In the illustrated example Equation 24, the one or more noisy processes A_{i}^{m} employed to calculate the median(s) satisfy differential privacy for 1≤i≤h, corresponding to internal nodes on a path. Additionally, in the illustrated example Equation 24, the one or more noisy processes A_{i}^{c} employed to calculate noisy counts satisfy differential privacy for 0≤i≤h for each of the same corresponding nodes.
Noise magnitude ε_{i }may result in one or more different consequences on an overall accuracy of the tree (for both median calculations and count calculations). A relatively large count noise results in a greater degree of uncertainty for the count of each region in the tree that intersects a query Q. On the other hand, a relatively large median noise may result in a skewed split in one or more node(s). In some examples, children nodes may become unbalanced in terms of the number of points in their corresponding leaves, and may have unbalanced split region sizes. In the event that a median split fails to divide a current point set so that there are a constant fraction of points on each side, then tree level waste occurs.
The median noise budget ε_{median} and the count noise budget ε_{count} may be represented in a manner consistent with example Equations 25 and 26, respectively. Accordingly, the overall noise budget ε is represented as ε=ε_{count}+ε_{median}.

ε_{median}=Σ_{i=0}^{h−1}ε_{i}^{m} Equation 25.

ε_{count}=Σ_{i=0}^{h}ε_{i}^{c} Equation 26.
When ε_{median} is fixed at a first value, it is distributed among one or more internal nodes. In some examples, distribution may occur in a uniform manner in which each level receives an equal portion of the overall median noise budget. In other examples, such as when a hybrid tree is employed that switches to a data-independent quadtree after splitting l levels, the example median budget may be split on a per-level basis for h−l<i≤h, with ε_{i}^{m}=0 for 0≤i≤h−l.
In addition to employing geometric noise in a hierarchical decomposition to improve query accuracy, further query optimizations may be employed after the noisy counts are calculated. Example methods and apparatus disclosed herein process the one or more differentially private output(s) from the PSD in a manner that does not affect the privacy guarantees established therein. In some examples, a tree having root a and four children nodes b, c, d and e has noisy count Y_{v }for node v, which is an element of {a, b, c, d, e}. In the aforementioned examples, assume a uniform noise of ε/2 is used for each count. A first estimate of a true count for root a is Y_{a}, and a second estimate includes the sum of counts of the corresponding leaves (i.e., Y_{b}+Y_{c}+Y_{d}+Y_{e}). An example estimate β_{a }is shown in a manner consistent with example Equation 27, which is an average of both estimates, and a corresponding variance is shown in a manner consistent with example Equation 28.
In the illustrated examples of Equations 27 and 28, the variance of the estimate β_{a }is worse than if Y_{a }alone were used. Accordingly, example Equation 29, which yields example Equation 30, improves the result.
For any non-uniform budgeting approach, if the budget for a is ε_{1 }and the budget for its children is ε_{0}, then example Equation 31 improves the resulting accuracy, which yields example Equation 32.
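The improvement from combining the two estimates can be illustrated with inverse-variance weighting; the sketch below is illustrative (the function name is hypothetical) and relies only on the fact that Laplace noise with parameter ε has variance 2/ε²:

```python
def combined_root_estimate(y_root, children_sum, eps_root, eps_child, f):
    """Combine two unbiased estimates of the root count: the root's own
    noisy count and the sum of its f children's noisy counts, weighting
    each by the inverse of its variance (Laplace(1/eps) noise has
    variance 2/eps**2)."""
    var_root = 2.0 / eps_root ** 2
    var_sum = f * 2.0 / eps_child ** 2      # sum of f independent counts
    w_root, w_sum = 1.0 / var_root, 1.0 / var_sum
    estimate = (w_root * y_root + w_sum * children_sum) / (w_root + w_sum)
    variance = 1.0 / (w_root + w_sum)
    return estimate, variance
```

With a uniform budget of ε/2 per count and a fanout of four (as for root a with children b through e above), the weighting reduces to (4·Y_{a}+Y_{b}+Y_{c}+Y_{d}+Y_{e})/5, whose variance is strictly smaller than that of Y_{a} alone or of the plain average.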
In the illustrated examples of Equations 27 through 32, the number of choices increases exponentially as the tree size increases. One or more counts may be combined, including counts from ancestors, descendants, siblings, etc. To calculate an optimized solution, the ordinary least-squares (OLS) estimate may be employed as a linear statistical inference. Generally speaking, computing the OLS for n unknowns may include solving a linear system with n×n matrices. The relatively large matrices may be inverted to compute the OLS using matrix mathematics. However, the amount of time required to invert an n×n matrix is proportional to the cube of n. Example methods and apparatus disclosed herein calculate the OLS estimate via an example linear time process to achieve the same result as calculating via matrix mathematics, but in a relatively less computationally intensive manner. Accordingly, as the size of the tree increases, the example linear time process improves the efficiency of calculating the OLS estimate. The example linear time process considers one or more inherent symmetries of the matrices defined for the tree structure and assumes that all nodes at a level i have the same Laplace parameter ε_{i }(for either uniform or geometric budget approaches). For example, the linear time process takes advantage of inherent nodal similarities of adjacent nodes of a tree, such as the similarity between the count of a parent node and the sum of the counts of its corresponding children nodes. Such similarities are not exploited by traditional matrix approaches when calculating OLS estimates. Accordingly, the time required to calculate the OLS estimate via the linear time process is proportional to n, rather than to the cube of n as is the case with an n×n matrix approach.
The example OLS estimate denotes ƒ as the fanout of a spatial index having a height of h. Assume that h(v) denotes the height of a particular node v; in the event the node is a leaf, then h(v)=0, and in the event the node is the root, then h(v)=h. Also assume that all paths have length h and all internal nodes have the fanout ƒ, and that u<<v denotes that node u is a leaf in the subtree of v. As noted below, anc(u) denotes the set of all ancestors of u, including node u itself, and par(u) and child(u) denote the parent and children of node u, respectively. OLS is defined as a linear inference over a tree structure in which Y denotes a vector of original noisy counts (e.g., Y_{v }is a noisy count of node v), and ε_{v }denotes a noise parameter for node v. β represents the vector of counts after post-processing the PSD and further represents the OLS estimator if it is consistent in a manner represented by example Equation 33 and minimizes example Equation 34.
β_{v}=Σ_{u∈child(v)}β_{u} (for all internal nodes v) Equation 33.
Σ_{v}ε_{v}^{2}(Y_{v}−β_{v})^{2} Equation 34.
The example OLS β is unbiased for any query Q, and achieves a minimum error for all range queries. Example Equation 35 illustrates a manner in which β may be computed.
In the illustrated example of Equation 35, an array E of h+1 entries is computed in a manner consistent with example Equation 36.
E_{l}=Σ_{j=0}^{l}ƒ^{j}ε_{j}^{2} Equation 36.
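The array E of Equation 36 can be accumulated incrementally so the whole array costs O(h) time; a minimal sketch (the helper name is hypothetical):

```python
def prefix_E(f, eps):
    """Array E of Equation 36: E[l] = sum_{j=0}^{l} f**j * eps[j]**2,
    built incrementally so each entry reuses the previous one."""
    E, total = [], 0.0
    for j, e in enumerate(eps):
        total += (f ** j) * e * e
        E.append(total)
    return E
```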
In the illustrated example of Equation 36, because E_{l}=E_{l−1}+ƒ^{l}ε_{l}^{2}, a corresponding time O(h) is taken to compute the array E. For any node v, Z_{v }is a node transform defined in a manner consistent with example Equation 37, and is used in a multiphase approach to compute the OLS estimator in linear time.
Z_{v}=Σ_{u<<v}Σ_{w∈anc(u)}ε_{h(w)}^{2}Y_{w} Equation 37.
In the illustrated example of Equation 37, Z_{v }is computed for all nodes (v) in two linear traversals of the tree. In particular, an example first phase includes a top-down traversal of the tree, an example second phase includes a bottom-up traversal, and an example third phase includes another top-down traversal. In the example first phase, Z_{v }is computed for all leaves v in a manner consistent with example Equation 38.
Z_{v}=Σ_{w∈anc(v)}ε_{h(w)}^{2}Y_{w} Equation 38.
In the top-down traversal of the first phase, let α_{root}=ε_{h}^{2}Y_{root}, and compute, for each node u, α_{u}=α_{par(u)}+ε_{h(u)}^{2}Y_{u}. When a leaf v is reached, set Z_{v}=α_{v}.
In the example second phase, Z_{v }is computed for all internal nodes v in a manner consistent with example Equation 39. Example phase two is a single bottom-up traversal.
Z_{v}=Σ_{u∈child(v)}Z_{u} Equation 39.
In the example third phase, β_{v }is computed for all nodes v. During the computation of β_{v}, an auxiliary value F_{v }is computed and is defined in a manner consistent with example Equation 40.
F_{v}=Σ_{w∈anc(v)\{v}}β_{w}ε_{h(w)}^{2} Equation 40.
In the illustrated example of Equation 35, v=root is represented in a manner consistent with example Equation 41.
(Σ_{j=0}^{h}ƒ^{j}ε_{j}^{2})β_{root}=E_{h}β_{root}=Z_{root} Equation 41.
A computation of β_{root }reveals β_{root}=Z_{root}/E_{h }and, if F_{root}=0, for any node v not equal to root, compute F_{v }in a manner consistent with example Equation 42.
F_{v}=F_{par(v)}+β_{par(v)}ε_{h(v)+1}^{2} Equation 42.
Using example Equation 35, Z_{root }is identified in a manner consistent with example Equation 43.
Accordingly, the example first, second, and third phases compute the OLS estimator in linear time and improve the query accuracy.
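Under the stated assumptions (a complete tree of fanout ƒ, one Laplace parameter per height, leaves at height 0), the three phases can be sketched end to end. The per-level list representation, node indexing, and function name below are illustrative choices, not from the source; the phase-3 formula β_{v}=(Z_{v}−ƒ^{h(v)}F_{v})/E_{h(v)} is derived from the consistency constraint together with Equations 36, 40, and 42, and reduces to β_{root}=Z_{root}/E_{h} at the root as stated above:

```python
def ols_estimate(Y, f, eps):
    """Linear-time OLS post-processing of noisy counts.

    Y[j] lists the noisy counts at height j, left to right (Y[h] holds the
    root, Y[0] the leaves); eps[j] is the Laplace parameter at height j.
    Node k at height j has parent k // f and children f*k .. f*k + f - 1.
    Returns beta with the same per-level layout as Y.
    """
    h = len(Y) - 1
    # Array E (Equation 36), built incrementally in O(h).
    E = [0.0] * (h + 1)
    E[0] = eps[0] ** 2
    for l in range(1, h + 1):
        E[l] = E[l - 1] + (f ** l) * eps[l] ** 2
    # Phase 1 (top-down): alpha_u = alpha_parent + eps^2 * Y_u; Z at leaves.
    alpha = {h: [eps[h] ** 2 * Y[h][0]]}
    for j in range(h - 1, -1, -1):
        alpha[j] = [alpha[j + 1][k // f] + eps[j] ** 2 * Y[j][k]
                    for k in range(len(Y[j]))]
    # Phase 2 (bottom-up): Z_v is the sum of the children's Z (Equation 39).
    Z = {0: alpha[0]}
    for j in range(1, h + 1):
        Z[j] = [sum(Z[j - 1][f * k: f * k + f]) for k in range(len(Y[j]))]
    # Phase 3 (top-down): beta_v = (Z_v - f^{h(v)} * F_v) / E_{h(v)},
    # with F_v accumulated from the parent (Equation 42).
    beta = {h: [Z[h][0] / E[h]]}
    F = {h: [0.0]}
    for j in range(h - 1, -1, -1):
        F[j] = [F[j + 1][k // f] + beta[j + 1][k // f] * eps[j + 1] ** 2
                for k in range(len(Y[j]))]
        beta[j] = [(Z[j][k] - (f ** j) * F[j][k]) / E[j]
                   for k in range(len(Y[j]))]
    return beta
```

For a two-level tree with ƒ=2, uniform ε, and noisy counts Y_root=9 with leaves (2, 3), the estimator returns β_root=23/3 and leaves (10/3, 13/3), which sum consistently to the root estimate.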
In operation, the example differential privacy engine 400 employs the dataset manager 402 to obtain a dataset of interest for privatization. In some examples, a dataset that includes information deemed private and/or otherwise sensitive in the event of accidental release and/or breach is privatized to remove such sensitive information. If too much noise is applied during the privatization process, the resulting processed dataset may be rendered unusable; on the other hand, if too little noise is applied to the original dataset, then personal information may be released. The example privacy budget manager 404 establishes a privacy budget value, which balances concerns for privacy of the dataset with a corresponding utility of the dataset for one or more queries.
The privatization of a dataset may employ any type of spatial indexing method to generate a spatial decomposition of the dataset, which is selected by the example spatial decomposition manager 406. For example, some datasets may be processed in a data-independent manner, which means that subdividing the dataset in a spatial manner does not adversely affect a privacy concern related to individuals associated with the information of the dataset. For datasets built with a data-independent decomposition (e.g., a quadtree), split locations may be determined based on one or more bounding regions (e.g., splitting rectangles into two or four sections of equal area). On the other hand, some datasets processed in a data-dependent manner may reveal clues and/or information that may allow an attacker to identify one or more individuals based on the spatial arrangement of the information in the dataset. For example, if an attacker knows that three people in a particular zip code earn over $100k per year in a rural subdivision, then a spatial split of the dataset based on population density may reveal which individual dataset points correspond to the three people of interest. In other examples, publicly available U.S. Census data may be employed to derive information when combined with a dataset that would otherwise be considered private. As such, the example tree structure engine 407 invokes the example data independent structure manager 408 to generate data-independent spatial indexing structures, such as quadtrees. Additionally, the example tree structure engine 407 invokes the example data dependent structure manager 410 to generate data-dependent spatial indexing structures, such as kd-trees. Data-dependent decompositions (e.g., kd-tree indexing structures) use the distribution of the dataset within a region to determine where to split.
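The two split styles can be contrasted in a short sketch (the function names are illustrative; in the private setting the kd-tree median would itself be perturbed with the median noise budget before use):

```python
def quadtree_split(region):
    """Data-independent split: divide a rectangle (x0, y0, x1, y1) into
    four equal quadrants, ignoring the data entirely."""
    x0, y0, x1, y1 = region
    mx, my = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    return [(x0, y0, mx, my), (mx, y0, x1, my),
            (x0, my, mx, y1), (mx, my, x1, y1)]

def kdtree_split(points, axis):
    """Data-dependent split: divide a point set at its median along one
    axis, so each side holds roughly half the points."""
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return pts[:mid], pts[mid:]
```

The quadtree split depends only on the bounding region, so publishing it leaks nothing about the points; the kd-tree split depends on the point distribution, which is why its median must be privatized.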
Still further, the example tree structure engine 407 invokes the example hybrid structure manager 412 to generate combined dataindependent and datadependent structures to improve query accuracy when calculating node splits and/or node counts during a privatization process.
During the privatization process, the example privacy budget manager 404 calculates a localized noise parameter via the example noise allocation engine 403 for each node in the tree associated with the dataset, which facilitates calculation of noisy counts for each portion of the tree. The example noise allocation engine 403 includes the example uniform calculator 404A to calculate noisy counts in a uniform manner, and an example non-uniform calculator 404B to calculate noisy counts in a non-uniform manner, such as via a geometric sequence. The example error analyzer 414 calculates a corresponding error of the privatization for each node and compares the results to one or more thresholds. In the event one or more thresholds are triggered by the error analysis (e.g., an error is lower than a threshold, an error is greater than a threshold, etc.), the example error analyzer 414 may modify a tree height (h) and/or one or more localized noise parameters (ε).
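A minimal sketch of applying per-level Laplace noise to true counts follows; the function names are assumptions, and any Laplace sampler with scale 1/ε_{i} would serve equally well:

```python
import math
import random

def laplace_noise(scale):
    """Zero-mean Laplace sample: the difference of two i.i.d. exponential
    variates with mean `scale` is Laplace-distributed with that scale."""
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def noisy_counts(counts_by_level, budgets):
    """Perturb every true count at level i with Laplace(1/eps_i) noise,
    where budgets[i] is the localized noise budget for that level."""
    return [[c + laplace_noise(1.0 / eps) for c in level]
            for level, eps in zip(counts_by_level, budgets)]
```

A larger localized budget ε_{i} yields a smaller noise scale 1/ε_{i}, so levels granted more budget (e.g., the later terms of a geometric sequence) report more accurate counts.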
After the PSD is generated, the example post process optimizer 416 is invoked by the differential privacy engine 400 to improve query accuracy for a query of the privatized dataset. Optimization may include, but is not limited to, establishing a baseline accuracy and employing an ordinary least-squares (OLS) estimate to improve query accuracy.
While an example manner of implementing the differential privacy engine 400 has been illustrated in
A flowchart representative of example machine readable instructions for implementing the differential privacy engine 400 of
As mentioned above, the example processes of
The program 500 of
The example spatial decomposition manager 406 selects a spatial decomposition taxonomy (block 506), such as a data-independent quadtree or a data-dependent kd-tree. In the event the dataset of interest is to be privatized via a data-independent taxonomy (block 508), the example tree structure engine 407 invokes the example data independent structure manager 408 to build the quadtree structure (e.g., the quadtree 100 of
In some examples, the privatization process 500 employs a hybrid taxonomy of data-dependent and data-independent structures (block 514), in which case splits of the dataset occur first before privatization of node counts (block 516). During the hybrid privatization, the budget allocation is verified to remain within the limits set by the example privacy budget manager 404 (block 516). One or more post process optimizations may occur with the generated PSD, such as application of an ordinary least-squares (OLS) estimate to optimize query accuracy without affecting the privacy guarantee of previously determined splits and/or node counts (block 518).
If uniform allocation of the privacy budget ε is selected, the example uniform calculator 404A applies one or more localized noise parameters among all the nodes of the tree in a uniform manner (block 710). For example, the uniform calculator 404A may apply localized noise parameters in a manner consistent with example Equation 13.
Briefly returning to
Returning to
The computer 1000 of the instant example includes a processor 1012. For example, the processor 1012 can be implemented by one or more microprocessors or controllers from any desired family or manufacturer.
The processor 1012 is in communication with a main memory including a volatile memory 1014 and a nonvolatile memory 1016 via a bus 1018. The volatile memory 1014 may be implemented by Synchronous Dynamic Random Access Memory (SDRAM), Dynamic Random Access Memory (DRAM), RAMBUS Dynamic Random Access Memory (RDRAM) and/or any other type of random access memory device. The nonvolatile memory 1016 may be implemented by flash memory and/or any other desired type of memory device. Access to the main memory 1014, 1016 is controlled by a memory controller.
The computer 1000 also includes an interface circuit 1020. The interface circuit 1020 may be implemented by any type of interface standard, such as an Ethernet interface, a universal serial bus (USB), and/or a PCI express interface.
One or more input devices 1022 are connected to the interface circuit 1020. The input device(s) 1022 permit a user to enter data and commands into the processor 1012. The input device(s) can be implemented by, for example, a keyboard, a mouse, a touchscreen, a trackpad, a trackball, isopoint and/or a voice recognition system.
One or more output devices 1024 are also connected to the interface circuit 1020. The output devices 1024 can be implemented, for example, by display devices (e.g., a liquid crystal display, a cathode ray tube display (CRT), a printer and/or speakers). The interface circuit 1020, thus, typically includes a graphics driver card.
The interface circuit 1020 also includes a communication device (e.g., communication device 1056) such as a modem or network interface card to facilitate exchange of data with external computers via a network 1026 (e.g., an Ethernet connection, a digital subscriber line (DSL), a telephone line, coaxial cable, a cellular telephone system, etc.).
The computer 1000 also includes one or more mass storage devices 1028 for storing software and data. Examples of such mass storage devices 1028 include floppy disk drives, hard drive disks, compact disk drives and digital versatile disk (DVD) drives.
The coded instructions 1058 of
From the foregoing, it will be appreciated that the above disclosed methods, apparatus and articles of manufacture facilitate privatization of spatial data in a manner that satisfies differential privacy standards, guarantees and/or expectations, such as those defined by example Equation 2. Prior techniques to privatize spatial data applied a blanket noise value over the entirety of the spatial data to obscure true localized details; such blanket application of noise caused substantial error in the resulting dataset, rendering it unusable for one or more queries.
Although certain example methods, apparatus and articles of manufacture have been described herein, the scope of coverage of this patent is not limited thereto. On the contrary, this patent covers all methods, apparatus and articles of manufacture fairly falling within the scope of the claims of this patent.