The need for PPL research is clear. Say you're a hospital or national census bureau, and you want to release a statistic on some sensitive data you've collected. How can you release aggregate information without revealing too much individual information? Early PPL work focussed on definitions. Some like k-anonymity have been very popular, and contributed much to the problem, but have been debunked for being too weak in some cases. Choosing the wrong notion of privacy to enforce (or worse still, no privacy) can lead to serious embarrassment. Infamous incidents include the AOL search data disaster, discovering health records of a governor of Massachusetts in an "anonymized" health database, and the identification of Netflix users in the Netflix dataset we all know and love. Research in PPL is gaining popularity fast, especially in the theory community since the introduction of differential privacy.
Differential Privacy
The setup for differential privacy can be visualized as above. In the interactive case, a mechanism sits between a dataset or DB and the querier, like a fancy firewall. The querier is allowed to submit certain queries about the DB to the mechanism. The mechanism releases responses to these queries, which are usually the evaluation of the query plus suitable noise. Even against an adversarial querier, the mechanism must not reveal individual examples (rows) from the dataset (DB). The non-interactive use-case is similar, except that responses do not depend on the query. Thus non-interactive privacy includes (and tends to be used synonymously) with anonymization: the mechanism's response is some transform of the DB that preserves sufficient statistics of interest, but removes example-identifying information. The following definitions will be stated for the interactive case, but they work for non-interactive mechanisms also.
Databases D1 and D2 are called neighboring if they differ on exactly one example. An interactive mechanism A has epsilon-differential privacy if for all queries Q (within some class), all possible responses x, and all pairs of neighboring DBs D1 and D2 we have
To understand this definition, first observe that every fixed DB-query pair (D,Q) induces a distribution over the mechanism's random response X. So on D1 and D2 (and fixed query) we have two distributions that we can compare. The above condition merely states that these distributions are pointwise close in a multiplicative sense. Consider now an adversary trying to reveal information about examples in the true DB D, by submitting queries to the mechanism. Even if the adversary knows the mechanism's mapping (up to sources of randomness), and all but one example, and simulates the response distributions induced by all consistent DBs, she cannot possibly distinguish between consistent DBs, as they all match the empirical distribution (built by querying) up to sampling error.
Another way to put it, is that a user deciding whether to include her data as a row in the DB can rest assured that the mechanism's response distribution is barely affected by the outcome of her decision. Either way you look at it, the condition is a very strong guarantee of data privacy indeed.
An Example
Here's a standard little example that illustrates the main concepts, including how to make many interactive mechanism's differentially private. Consider simple subset sum queries, where an indicator f is applied to every row of the DB and summed. The query's value will be the cardinality of the subset of rows matching f's predicate.
First note that on neighboring databases, the query's value can differ by at most one (the databases differ on at most one row, which contributes to one indicator summand). Thus the absolute difference between any two neighboring DBs' query values, called the global sensitivity, is unity.
Next consider a mechanism A that ads noise N~Laplace(1/a) to the query evaluation. With a little algebra, the convenient PDF of the Laplace and the triangle inequality, we can quickly prove the a-differential privacy of our mechanism for any desired level a>0.
This pattern works in many cases. For vector-valued mechanism responses, global sensitivity is in terms of the L1 (or some other) norm, and the noise is vector valued (e.g., iid Laplace). Notice the norm used in global sensitivity matches the norm in the noise r.v.'s PDF.
Utility
Clearly adding too much noise to a query response, or "overly anonymizing" a database, is going to make for a pretty useless statistic. Various definitions of usefulness or utility, quantify how much noise is acceptable. In the interactive case, you intuitively want responses to be close to the query's evaluation on the DB (wrt some norm) with high probability. I.e. responses should be probably approximately correct (ring a bell?). We can hope for something similar from non-interactive mechanisms: on a specified class of statistics, the response and original DB produce close values w.h.p. We can't expect the anonymized DB to be useful wrt any given statistic (think very specific subset-sums), so we only require utility on a restricted class.
Interesting Direction
Research on PPL is very active in the databases, datamining and theory communities. For example four papers on differential privacy appeared at STOC'2009. This post is already too long to enumerate all the interesting papers (please comment on papers you've found interesting!). Interest in the learning community has been relatively slow to take-off (one paper at each of ICML08, NIPS08, no others in ICML/NIPS/COLT 2008-09 that I'm aware of). One particularly interesting theoretical direction is to understand the limits of what can be learned privately while achieving utility. Clearly there is an inherent trade-off between the two competing properties. The trade-off is understood in only special cases.
A few positive examples (many more out there): Barak et al. (PODS'07) presented a mechanism for differentially releasing contingency tables while guaranteeing that all marginals of the released table are L1-close to the true table's marginals, w.h.p. Blum et al. (STOC'08) developed a differentially private non-interactive mechanism that achieves utility wrt VC-classes of predicate queries. Our recent report guarantees high probability pointwise utility of a mechanism for privately releasing SVM classifiers.
Authors have also proved negative results: you can't be too accurate and too private at once. Our afore-mentioned report establishes lower bounds for any private SVM mechanism. Dinur & Nissim (PODS'03) showed that for subset-sum queries of a DB of bits, if the noise rate is o(sqrt(n)) then an adversary can learn a 1-o(1) proportion of the DB (this result was latter extended by Dwork et al. (STOC'07)).
There are still many opportunities in PPL, both theoretical and applied, including for researchers in learning. For the latest research check out recent incarnations of KDD, STOC, FOCS, SIGMOD, PODS, WWW and the usual learning conferences. For those who can make it (I'll be there), IPAM at UCLA is holding a week-long meeting on PPL later this month.



I have worked for quite some time with automatic speech recognition and one thing that often comes to discussion is whether we are allowed or not to distribute models trained on people's voice without their consent (this also applies to NLP researchers gathering their data on the web).
ReplyDelete- Is there a legal ground which says that we are not allowed to derive statistics from this data?
- Can we prove that the derived statistics do not allow identification of the original people?
The first one would probably require asking a law person. The second is probably model specific.
What if you can prove that an example belongs to a DB by comparing the outcome of statistics on the DB with and without that example? Is this a concern for PPL?
Benob, your point that the release of statistics/classifiers/models has privacy implications is an excellent one. It's obvious to most people that anonymized data (such as the AOL dataset) should protect privacy; corporate lawyers worry about releasing datasets for researchers all the time. However there doesn't seem to be much concern about models. I don't know of any legal ground for models.
ReplyDeleteDifferential privacy is strong enough to prove that individuals cannot be identified. However this may be too strong in many cases.
Finally it isn't too hard to show that the definition of differential privacy is equivalent to insensitivity to removing or including an example. So you can't tell from querying the mechanism alone, whether an example is present in the DB.
1) The ICML'08 paper you linked to actually solves a qualitatively different type of problem: they look for distributed learning algorithms that leak only as much as the results of a centralized learning algorithm would leak; however, as you point out, there is no guarantee that the results of the centralized algorithm preserve privacy.
ReplyDeleteThe design of secure distributed versions of centralized algorithms is generally orthogonal to the design of privacy-preserving centralized algorithms. Beimel, Nissim and Omri (CRYPTO 2009) have a nice discussion of how the two problems can interact.
2) [Shameless self-promotion] One more result to add to the learning list: Nissim, Raskhodnkiova and I ("What Can We Learn Privately?", FOCS '08) showed that (essentially) any PAC-learnable concept class is also learnable differentially privately. This result inspired the result of Blum, Ligett and Roth that you refer to (though the BLR paper appeared first). In contrast, Beimel, Nissim and Kasiviswanathan (TCC 2010) recently gave examples of concept classes that provably require exponentially more computation time with privacy than without (under natural cryptographic assumptions).
See you at IPAM!
I thought of two areas of science which may be governed by privacy laws: genetics and neuroscience.
ReplyDeleteGenetic privacy is the total anonymity of one's genetic traits, unless told in discretion to a trusted other person. People with, for example, colour blindness should not be discriminated against in job applications. There will be alternative methods of performing the job.
Neuroscientific privacy is the total anonymity of one's thoughts, again unless otherwise excepted. Also, people should have privacy from listening or reading certain thoughts of other people. Imagine if people could switch input to and output from their mind onto certain tasks.