Home / Topics

Privacy-Preserving Data Mining

Curated by: Chris Clifton

The ever-increasing collection of personal data, and the growing capabilities to analyze that data, pose increased risks to personal privacy. This has long been a concern for the SIGKDD community; in 2003 there was actually a Data Mining Moratorium Actproposed in the U.S. Senate (for more details, see the SIGKDD response. While there have been examples of data mining that people feel are privacy violations, such as Target’s pregancy prediction, most privacy problems have come from security failures leading to data breaches, rather than the data analysis itself.

The SIGKDD community has long been active in research to protect privacy while enabling data analysis. This year, there are two papers on privacy-protection techniques. Both use the model of ε-differential privacy. The idea behind differential privacy is that sufficient noise is added to analysis results to hide the impact of any single individual, thus ensuring that the outcome of the analysis does not reveal information specific to any individual in the data. For a more complete introduction to differential privacy, I recommend this article by Cynthia Dwork or you can watch a tutorial by Christine Task.
While knowledge discovery is often on an aggregate level, access to instance-level data (and the attendant privacy risks) is often a part of the learning process. In Privacy-Preserving Class Ratio Estimation, Arun Iyer, Saketh Nath, and Sunita Sarawagi provide a way around this. The goal, estimating the ratio of instances belonging to different classes, is aggregate knowledge rather than instance level. But how do we estimate this without labeling individual instances (putting the privacy of those individuals at risk)? This paper shows that by labeling at the set level, rather than the instance level, good estimates for class ratios can be obtained, while satisfying the requirements of differential privacy.

KDD 2016 Papers in Privacy

Naïve approaches to differential privacy require a privacy budget be divided among all accesses to the data; the accuracy of an answer is limited by it’s share of the budget. If we simultaneously perform a set of analyses, we can share this budget, giving more accurate results while still protecting privacy. Unfortunately, this becomes a (computationally infeasible) non-convex optimization problem. In the poster presentation Optimal Linear Aggregate Query Processing under Approximate Differential Privacy, authors Ganzhao Yuan, Yin Yang, Zhenjie Zhang and Zhifeng Hao show how to efficiently perform a set of analyses under the slightly relaxed (ε, δ)-differential privacy.

Related KDD2016 Papers

Title & Authors
Ranking Causal Anomalies via Temporal and Dynamical Analysis on Vanishing Correlations
Author(s): Wei Cheng*, NEC Labs America; Kai Zhang, NEC labs America; Haifeng Chen, NEC Research Lab; Guofei Jiang, NEC labs America; Wei Wang, UC Los Angeles
Convex Optimization for Linear Query Processing under Approximate Differential Privacy
Author(s): Ganzhao Yuan*, SCUT; Yin Yang, ; Zhenjie Zhang, ; Zhifeng Hao,
Privacy-preserving Class Ratio Estimation
Author(s): Arun Iyer*, ; Saketh Nath, IIT Bombay; Sunita Sarawagi, IIT Bombay
FRAUDAR: Bounding Graph Fraud in the Face of Camouflage
Author(s): Bryan Hooi*, Carnegie Mellon University; Hyun Ah Song, Carnegie Mellon University; Alex Beutel, Carnegie Mellon University; Neil Shah, Carnegie Mellon University; Kijung Shin, Carnegie Mellon University; Christos Faloutsos, Carnegie Mellon University
Modeling Precursors for Event Forecasting via Nested Multi-Instance Learning
Author(s): Yue Ning*, Virginia Tech; Sathappan Muthiah, Virginia Tech; Huzefa Rangwala, George Mason University; Naren Ramakrishnan, Virginia Tech