ε-differential privacy
Definition
Let ε be a positive real number and $\mathcal{A}$ be a randomized algorithm that takes a dataset as input (representing the actions of the trusted party holding the data). Let $\mathrm{im}\,\mathcal{A}$ denote the image of $\mathcal{A}$. The algorithm $\mathcal{A}$ is said to provide ε-differential privacy if, for all datasets $D_1$ and $D_2$ that differ on a single element (i.e., the data of one person), and all subsets $S$ of $\mathrm{im}\,\mathcal{A}$:

$$\Pr[\mathcal{A}(D_1) \in S] \leq e^{\varepsilon} \cdot \Pr[\mathcal{A}(D_2) \in S],$$

where the probability is taken over the randomness used by the algorithm.
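As a quick worked instance of this inequality (an illustrative example with assumed numbers, not part of the definition), consider datasets consisting of a single bit and a mechanism $\mathcal{A}$ that answers "Yes" with probability 3/4 when the bit is 1 and with probability 1/4 when the bit is 0. Then

$$\frac{\Pr[\mathcal{A}(1) = \text{Yes}]}{\Pr[\mathcal{A}(0) = \text{Yes}]} = \frac{3/4}{1/4} = 3 = e^{\ln 3},$$

and the same bound holds for the output "No" and for the swapped pair of datasets, so $\mathcal{A}$ is $\ln 3 \approx 1.1$-differentially private.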
Example
According to this definition, differential privacy is a condition on the release mechanism (i.e., the trusted party releasing information ''about'' the dataset) and not on the dataset itself. Intuitively, this means that for any two datasets that are similar, a given differentially private algorithm will behave approximately the same on both. The definition gives a strong guarantee that the presence or absence of an individual will not significantly affect the final output of the algorithm.

For example, assume we have a database $D_1$ of medical records where each record is a pair (Name, X), where X is a Boolean denoting whether a person has diabetes (1) or not (0). Now suppose a malicious user (often termed an ''adversary'') wants to find out whether Chandler has diabetes. Suppose he also knows in which row of the database Chandler resides, and that he is only allowed to use a particular form of query $Q_i$ that returns the partial sum of the first $i$ rows of column X in the database. To find Chandler's diabetes status, the adversary executes $Q_i(D_1)$ and $Q_{i-1}(D_1)$, where row $i$ is Chandler's, and then computes their difference. In this example the difference is 1, which indicates that the "Has Diabetes" field in Chandler's row must be 1. This example highlights how individual information can be compromised even without explicitly querying for the information of a specific individual.

Continuing this example, if we construct $D_2$ by replacing (Chandler, 1) with (Chandler, 0), then this adversary can distinguish $D_2$ from $D_1$ by computing $Q_i - Q_{i-1}$ for each dataset. If the adversary were instead required to receive the values $Q_i$ via an ε-differentially private algorithm, for a sufficiently small ε, then he or she would be unable to distinguish between the two datasets.
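The sketch below illustrates this differencing attack in code and how answering through a noise-adding mechanism blunts it. It is a minimal illustration, not a production implementation; the toy data, the helper names, and the choice of ε = 0.1 are assumptions made for the example.

```python
import numpy as np

# Hypothetical toy database: 1 means "has diabetes"; the values are made up
# for illustration. Assume Chandler is in row 5.
has_diabetes = [1, 1, 0, 0, 1, 0]
CHANDLER_ROW = 5

def partial_sum(db, i):
    """Q_i: exact partial sum of the first i rows of the Boolean column."""
    return sum(db[:i])

def noisy_partial_sum(db, i, epsilon):
    """Q_i released through a Laplace noise-adding mechanism.

    A partial-sum query has sensitivity 1 (changing one row changes the
    answer by at most 1), so Laplace noise with scale 1/epsilon gives
    epsilon-differential privacy for a single query.
    """
    return partial_sum(db, i) + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

# Differencing attack on the exact queries: reveals Chandler's bit exactly.
print(partial_sum(has_diabetes, CHANDLER_ROW)
      - partial_sum(has_diabetes, CHANDLER_ROW - 1))   # -> 1

# The same attack on the noisy queries: the one-unit difference is swamped
# by noise when epsilon is small, so Chandler's bit can no longer be read off.
epsilon = 0.1
print(noisy_partial_sum(has_diabetes, CHANDLER_ROW, epsilon)
      - noisy_partial_sum(has_diabetes, CHANDLER_ROW - 1, epsilon))
```

Because the noise scale 1/ε is much larger than the one-unit difference the attacker is looking for, a single pair of noisy answers no longer reveals Chandler's entry.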
Composability and robustness to post-processing
Composability refers to the fact that the joint distribution of the outputs of (possibly adaptively chosen) differentially private mechanisms satisfies differential privacy.
* Sequential composition. If we query an ε-differentially private mechanism $t$ times, and the randomization of the mechanism is independent for each query, then the result is $t\varepsilon$-differentially private. More generally, if there are $n$ independent mechanisms $\mathcal{M}_1, \dots, \mathcal{M}_n$ whose privacy guarantees are $\varepsilon_1, \dots, \varepsilon_n$-differential privacy, respectively, then any function $g$ of them, $g(\mathcal{M}_1, \dots, \mathcal{M}_n)$, is $\left(\sum_{i=1}^{n} \varepsilon_i\right)$-differentially private (see "Privacy Integrated Queries: An Extensible Platform for Privacy-Preserving Data Analysis"). A small worked instance is given below.
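For instance (a worked example with assumed values), answering four queries, each ε-differentially private with ε = 0.25 and independent randomization, yields a combined release that is

$$\varepsilon_{\text{total}} = \sum_{i=1}^{4} \varepsilon_i = 4 \times 0.25 = 1$$

-differentially private; the per-query budgets simply add up.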
Group privacy
In general, ε-differential privacy is designed to protect the privacy between neighboring databases which differ only in one row. This means that no adversary with arbitrary auxiliary information can tell whether one particular participant submitted their information. However, this notion is extendable: we may want to protect databases differing in $k$ rows, which amounts to protecting against an adversary with arbitrary auxiliary information who wants to know whether $k$ particular participants submitted their information. This can be achieved because, if $k$ items change, the probability dilation is bounded by $e^{k\varepsilon}$ instead of $e^{\varepsilon}$.
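This bound can be seen by chaining the single-row guarantee through intermediate datasets (a standard argument, sketched here): if $D_1$ and $D_2$ differ in $k$ rows, choose datasets $D^{(0)} = D_1, D^{(1)}, \dots, D^{(k)} = D_2$ such that consecutive datasets differ in a single row; applying the definition $k$ times gives

$$\Pr[\mathcal{A}(D_1) \in S] \leq e^{\varepsilon} \Pr[\mathcal{A}(D^{(1)}) \in S] \leq \cdots \leq e^{k\varepsilon} \Pr[\mathcal{A}(D_2) \in S].$$

Consequently, running the mechanism with parameter ε/k protects any group of $k$ participants at level ε.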
Hypothesis testing interpretation
One can think of differential privacy as bounding the error rates in a hypothesis test. Consider two hypotheses:
* $H_0$: The individual's data is not in the dataset.
* $H_1$: The individual's data is in the dataset.
Then, there are two error rates:
* False positive rate (FPR): the probability that the attacker concludes the data is in the dataset when it is not (rejecting $H_0$ when $H_0$ is true).
* False negative rate (FNR): the probability that the attacker concludes the data is not in the dataset when it is (accepting $H_0$ when $H_1$ is true).
Ideal protection would imply that the two error rates sum to one, so that the attacker does no better than random guessing; for a fixed (ε, δ) setting, the error rates of any attacker are constrained by
$$\mathrm{FPR} + e^{\varepsilon}\,\mathrm{FNR} \geq 1 - \delta \quad\text{and}\quad \mathrm{FNR} + e^{\varepsilon}\,\mathrm{FPR} \geq 1 - \delta.$$
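As a numerical illustration (with assumed values ε = ln 3 and δ = 0), an attacker who insists on a false positive rate of at most 5% must accept a false negative rate of at least

$$\mathrm{FNR} \geq \frac{1 - \delta - \mathrm{FPR}}{e^{\varepsilon}} = \frac{1 - 0.05}{3} \approx 0.32,$$

that is, the attacker fails to detect the individual's presence roughly a third of the time.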
ε-differentially private mechanisms
Since differential privacy is a probabilistic concept, any differentially private mechanism is necessarily randomized. Some of these, like the Laplace mechanism described below, rely on adding controlled noise to the function that we want to compute. Others, like the exponential mechanism and posterior sampling, instead sample from a problem-dependent family of distributions.

An important definition with respect to ε-differentially private mechanisms is ''sensitivity''. Let $d$ be a positive integer, $\mathcal{D}$ be a collection of datasets, and $f \colon \mathcal{D} \rightarrow \mathbb{R}^d$ be a function. The ''sensitivity'' of the function, denoted $\Delta f$, is defined by
$$\Delta f = \max \lVert f(D_1) - f(D_2) \rVert_1,$$
where the maximum is taken over all pairs of datasets $D_1$ and $D_2$ in $\mathcal{D}$ differing in at most one element, and $\lVert \cdot \rVert_1$ denotes the $L_1$ norm. In the medical database example above, if we consider $f$ to be the partial-sum query $Q_i$, then the sensitivity of the function is one, since changing any one of the entries in the database causes the output of the function to change by either zero or one. This can be generalized to other metric spaces (measures of distance), and must be for certain differentially private algorithms to work, including adding noise from the Gaussian distribution (which requires the $L_2$ norm) instead of the Laplace distribution. There are techniques (described below) by which we can create a differentially private algorithm for such functions, with parameters that vary depending on their sensitivity.
Laplace mechanism
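As a concrete illustration of the noise-addition approach named above, the Laplace mechanism releases $f(D) + Y$, where $Y$ is drawn from the Laplace distribution with scale $\Delta f / \varepsilon$; this calibration yields ε-differential privacy. A minimal sketch follows (the function name, example count, and parameter values are assumptions made for illustration):

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon):
    """Release a real-valued query answer with epsilon-differential privacy.

    Adds Laplace noise with scale sensitivity/epsilon to the exact answer,
    which is the standard calibration for the Laplace mechanism.
    """
    scale = sensitivity / epsilon
    return true_value + np.random.laplace(loc=0.0, scale=scale)

# Illustrative use: a counting query ("how many records have X = 1?") has
# sensitivity 1, since adding or removing one person changes the count by at most 1.
exact_count = 3                      # hypothetical exact answer
noisy_count = laplace_mechanism(exact_count, sensitivity=1, epsilon=0.5)
print(noisy_count)
```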
Randomized response
A simple example, especially developed in the social sciences, is to ask a person to answer the question "Do you have ''attribute A''?" according to the following procedure:
# Toss a coin.
# If heads, then toss the coin again (ignoring the outcome), and answer the question honestly.
# If tails, then toss the coin again and answer "Yes" if heads, "No" if tails.
(The seemingly redundant extra toss in the first case is needed in situations where just the ''act'' of tossing a coin may be observed by others, even if the actual result stays hidden.) The confidentiality then arises from the refutability of the individual responses.

In aggregate, however, these responses are still meaningful, since a positive response is given with probability one quarter by people who do not have ''attribute A'' and with probability three quarters by people who actually possess it. Thus, if ''p'' is the true proportion of people with ''A'', then we expect to obtain (1/4)(1 − ''p'') + (3/4)''p'' = (1/4) + ''p''/2 positive responses. Hence it is possible to estimate ''p'', as in the simulation sketched below. In particular, if ''attribute A'' is synonymous with illegal behavior, then answering "Yes" is not incriminating, insofar as any "Yes" answer can be attributed to the coin tosses rather than to the person's true status.

Although this example, inspired by randomized response, might seem applicable to microdata (i.e., releasing a dataset with each individual's response), by definition differential privacy excludes microdata releases and is only applicable to queries (i.e., aggregating individual responses into one result), since a microdata release would violate the requirements, more specifically the plausible deniability that a subject participated or not.
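The following simulation (a sketch with an assumed sample size and an assumed true proportion) shows how the protocol above lets an analyst recover an estimate of ''p'' from responses that are individually deniable; inverting the expected response rate (1/4) + ''p''/2 gives the estimator 2·(observed rate) − 1/2.

```python
import random

def randomized_response(has_attribute):
    """One respondent follows the two-coin protocol described above.

    Each individual answer is ln(3)-differentially private, since
    Pr[Yes | has A] = 3/4 and Pr[Yes | does not have A] = 1/4.
    """
    if random.random() < 0.5:          # first toss: heads -> answer honestly
        return has_attribute
    else:                              # first toss: tails -> second toss decides the answer
        return random.random() < 0.5   # "Yes" with probability 1/2

# Assumed population for the simulation: 10,000 people, 30% truly have attribute A.
true_p = 0.30
responses = [randomized_response(random.random() < true_p) for _ in range(10_000)]

observed_yes_rate = sum(responses) / len(responses)   # expected value: 1/4 + p/2
estimated_p = 2 * observed_yes_rate - 0.5             # invert the expectation
print(f"estimated p ≈ {estimated_p:.3f} (true p = {true_p})")
```

Each individual answer remains plausibly deniable, yet the aggregate estimate converges to the true proportion as the number of respondents grows.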
Stable transformations
A transformation $T$ is $c$-stable if the Hamming distance between $T(A)$ and $T(B)$ is at most $c$ times the Hamming distance between $A$ and $B$ for any two databases $A$ and $B$. If there is a mechanism $M$ that is ε-differentially private, then the composite mechanism $M \circ T$ is $(\varepsilon \times c)$-differentially private. This can be generalized to group privacy, as the group size can be thought of as the Hamming distance $h$ between $A$ and $B$ (where $A$ contains the group and $B$ does not). In this case $M \circ T$ is $(\varepsilon \times c \times h)$-differentially private.
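For example (with assumed numbers), if a preprocessing step $T$ is 2-stable, meaning that changing one row of the input changes at most two rows of its output, then composing it with a mechanism $M$ that is 0.5-differentially private gives

$$\varepsilon_{M \circ T} = 0.5 \times 2 = 1,$$

so the end-to-end release is 1-differentially private.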
Research
Early research leading to differential privacy
In 1977, Tore Dalenius, a Swedish statistician, formalized the mathematics of cell suppression. His 1977 paper on statistical privacy made a key point about statistical databases: they should not reveal information about an individual that is not otherwise accessible. He also defined a typology for statistical disclosures.

In 1979, Dorothy Denning, Peter J. Denning and Mayer D. Schwartz formalized the concept of a Tracker, an adversary that could learn the confidential contents of a statistical database by creating a series of targeted queries and remembering the results. This and later research showed that privacy properties in a database could only be preserved by considering each new query in light of (possibly all) previous queries. This line of work is sometimes called ''query privacy'', with the final result being that tracking the impact of a query on the privacy of individuals in the database was NP-hard.
21st century
In 2003, Kobbi Nissim and Irit Dinur demonstrated that it is impossible to publish arbitrary queries on a private statistical database without revealing some amount of private information, and that the entire information content of the database can be revealed by publishing the results of a surprisingly small number of random queries, far fewer than was implied by previous work. The general phenomenon is known as the Fundamental Law of Information Recovery, and its key insight, namely that in the most general case privacy cannot be protected without injecting some amount of noise, led to the development of differential privacy.

In 2006, Cynthia Dwork, Frank McSherry, Kobbi Nissim and Adam D. Smith published an article formalizing the amount of noise that needed to be added and proposing a generalized mechanism for doing so. This paper also introduced the first formal definition of differential privacy. Their work was a co-recipient of the 2016 TCC Test-of-Time Award and the 2017 Gödel Prize. Subsequent research has shown that there are many ways to produce very accurate statistics from a database while still ensuring high levels of privacy.
Adoption in real-world applications
To date there are over 12 real-world deployments of differential privacy, the most noteworthy being:
* 2008: U.S. Census Bureau, for showing commuting patterns. (Ashwin Machanavajjhala, Daniel Kifer, John M. Abowd, Johannes Gehrke, and Lars Vilhuber. "Privacy: Theory Meets Practice on the Map". In Proceedings of the 24th International Conference on Data Engineering (ICDE), 2008.)
* 2014: Google's RAPPOR, for telemetry such as learning statistics about unwanted software hijacking users' settings.
Public purpose considerations
There are several public purpose considerations regarding differential privacy that are important to consider, especially for policymakers and policy-focused audiences interested in the social opportunities and risks of the technology:
* Data utility and accuracy. The main concern with differential privacy is the trade-off between data utility and individual privacy. If the privacy loss parameter is set to favor utility, the privacy benefits are lowered (less "noise" is injected into the system); if the privacy loss parameter is set to favor heavy privacy, the accuracy and utility of the dataset are lowered (more "noise" is injected into the system). It is important for policymakers to consider the trade-offs posed by differential privacy in order to help set appropriate best practices and standards around the use of this privacy-preserving practice, especially considering the diversity in organizational use cases. It is worth noting, though, that decreased accuracy and utility is a common issue among all statistical disclosure limitation methods and is not unique to differential privacy. What is unique, however, is how policymakers, researchers, and implementers can consider mitigating the risks presented by this trade-off.
* Data privacy and security. Differential privacy provides a quantified measure of privacy loss with an upper bound, and allows curators to choose the explicit trade-off between privacy and accuracy. It is robust to still-unknown privacy attacks. However, it encourages greater data sharing, which, if done poorly, increases privacy risk. Differential privacy implies that privacy is protected, but this depends very much on the privacy loss parameter chosen and may instead lead to a false sense of security. Finally, though it is designed to be robust against unforeseen future privacy attacks, an attack may yet be devised that we cannot predict.
Attacks in practice
Because differential privacy techniques are implemented on real computers, they are vulnerable to various attacks that cannot be compensated for solely in the mathematics of the techniques themselves. In addition to standard defects of software artifacts that can be identified using testing or fuzzing, implementations of differentially private mechanisms may suffer from the following vulnerabilities:
* Subtle algorithmic or analytical mistakes.
* Timing side-channel attacks. In contrast with timing attacks against implementations of cryptographic algorithms, which typically have a low leakage rate and must be followed with non-trivial cryptanalysis, a timing channel may lead to a catastrophic compromise of a differentially private system, since a targeted attack can be used to exfiltrate the very bit that the system is designed to hide.
* Leakage through floating-point arithmetic. Differentially private algorithms are typically presented in the language of probability distributions, which most naturally leads to implementations using floating-point arithmetic. The abstraction of floating-point arithmetic is leaky, and without careful attention to detail a naive implementation may fail to provide differential privacy. (This is particularly the case for ε-differential privacy, which does not allow any probability of failure, even in the worst case.) For example, the support of a textbook sampler of the Laplace distribution (required, for instance, for the Laplace mechanism) is less than 80% of all double-precision floating-point numbers; moreover, the supports of distributions with different means are not identical. A single sample from a naive implementation of the Laplace mechanism allows distinguishing between two adjacent datasets with probability more than 35%. A sketch of such a textbook sampler is shown after this list.
* Timing channel through floating-point arithmetic. Unlike operations over integers, which are typically constant-time on modern CPUs, floating-point arithmetic exhibits significant input-dependent timing variability. Handling of subnormals can be particularly slow, by as much as 100× compared to the typical case.
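The following is a sketch of the kind of textbook inverse-CDF Laplace sampler alluded to above. It is shown only to illustrate where floating-point structure enters; it is a deliberately naive, assumed implementation, not a recommended one.

```python
import math
import random

def naive_laplace(mean, scale):
    """Textbook inverse-CDF sampler for the Laplace distribution.

    Every output is mean - scale*sign(u)*log(1 - 2|u|) for one of finitely
    many floating-point values u, so the set of reachable outputs is a
    sparse subset of the doubles and depends on the mean. As described
    above, this structure can be used to distinguish adjacent datasets;
    this sketch is NOT a safe way to implement the Laplace mechanism.
    """
    u = random.random() - 0.5            # uniform on [-0.5, 0.5)
    if u == -0.5:                        # avoid log(0) in the astronomically rare edge case
        u = 0.0
    return mean - scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
```

Hardened implementations avoid this structure, for example by snapping the output or by sampling integer-valued noise; those details are beyond this sketch.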
See also
* Implementations of differentially private analyses – deployments of differential privacy
* Quasi-identifier
* Exponential mechanism (differential privacy) – a technique for designing differentially private algorithms
* k-anonymity
* Differentially private analysis of graphs
* Protected health information
* Local differential privacy