![]() |
|
| | Overview | Research | Members | Publications | Presentations | | |
|
RELATED LINKS: |
We address privacy-preserving data mining problem in a distributed system with one data miner and multiple data providers, each of which holds one data tuple. Randomization has been the approach proposed to preserve privacy in such scenario. However, this approach is now proven to be insecure as it has been discovered that some privacy intrusion techniques can be used to reconstruct private information from the randomized data tuples. We introduce an algebraic technique-based scheme. While the randomization approach features a one-way communication scheme (from the data providers to the data miner), we introduce a two-way communication mechanism between the data miner and the data providers with little overhead. In particular, we let the data miner send perturbation guidance to the data providers. Using this intelligence, data providers perturb their data such that the part of their private information transmitted to the data miner is the minimum necessary part for data mining. As a result, our scheme has the benefit of a better tradeoff between accuracy and privacy. Also, we avoid the computationally expensive distribution reconstruction process on the critical time path and distribute the computational load into the data collection process.
Compared to the randomization approach, our new scheme can build data mining models more accurately but disclose less private information. Our scheme also allows each data provider to play a role in determining the tradeoff between accuracy and privacy by choosing its own level of privacy disclosure. Furthermore, our new scheme can be readily integrated as a middleware with existing systems. We implement our scheme to solve privacy-preserving association rule mining problem and privacy-preserving data classification problem, respectively. We study the performance of our scheme in terms of privacy protection and accuracy of data mining models both theoretically and experimentally. In particular, we show that our scheme outperforms the randomization approach by comparing the performance of our approach with that of the randomization approach on real data sets.
We establishes the foundation for the performance measurements of privacy preserving data mining techniques. The performance is measured in terms of the accuracy of data mining results and the privacy protection of sensitive data. On the accuracy side, we address the problem of previous measures and propose a new measure, named “effective sample size”, to solve this problem. We show that our new measure can be bounded without any knowledge of the data being mined, and discuss when the bound can be met. On the privacy protection side, we identify a tacit assumption made by previous measures and show that the assumption is unrealistic in many situations. To solve the problem, we introduce a game theoretic framework for the measurement of privacy. We address issues related to sharing information in a distributed system consisting of autonomous entities, each of which holds a private database. Semi-honest behavior has been widely adopted as themodel for adversarial threats. However, it substantially underestimates the capability of adversaries in reality. We consider a threat space containing more powerful adversaries that includes not only semi-honest but also those malicious adversaries. In particular, we classify malicious adversaries into two widely existing subclasses, called weakly malicious and strongly malicious adversaries, respectively. We define a measure of privacy leakage for information sharing systems and propose protocols that can effectively and efficiently protect privacy against different kinds of malicious adversaries. We address the inference control problem in data cubes with some data known to users through external knowledge. The goal of inference controls is to prevent exact values of sensitive data from being inferred through answers to online analytical processing (OLAP) queries. We present an information theoretic approach for cardinality-based inference control, which simply counts the number of cells that all queries have covered thus far, and compare it with the number of covered cells in a maximum compromisable data cube to determine whether a new query should be answered. Compared to previous approaches in sum-only data cubes, our new approach has a more general framework (applies to MIN, MAX and SUM) and is more effective.
|
| Webmaster: swang@cs.tamu.edu | Copyright © 2000 DataCamo. All rights reserved. |