Updated FAQ (markdown)

abigailgold 2021-06-14 16:14:13 +03:00
parent 4777e239a9
commit a7454aef41

18
FAQ.md

@ -1,6 +1,22 @@
### 1. Why do ML models need privacy protection?
Recent studies show that a malicious third party with access to a trained ML model, even without access to the training data itself, can still reveal sensitive, personal information about the people whose data was used to train the model. For example, it may be possible to reveal whether or not a person's data is part of the model's training set (membership inference), or even infer sensitive attributes about them, such as their salary (attribute inference). For more information see: https://github.com/IBM/ai-privacy-toolkit/wiki/Relevant-papers#membership-inference-attacks
### 2. What do you mean when you say anonymization?
The ML-guided anonymization method implemented in the anonymization module of this toolkit is based on a long-known construct called k-anonymity, which was proposed by L. Sweeney in 2002 to address the problem of releasing personal data while preserving individual privacy. This is a method to reduce the likelihood of any single person being identified when the dataset is linked with other, external data sources. The approach is based on generalizing attributes and possibly deleting records until each record becomes indistinguishable from at least k-1 other records. This generalization is applied only to those attributes that can be linked with other data sources containing identifiers, called quasi-identifiers (QI).
The difference between "classic" k-anonymity methods and the method implemented in this library is that we use an existing machine learning model to guide the anonymization process, which results in a model with higher accuracy for the same level of privacy.
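To make the k-anonymity definition above concrete, here is a minimal sketch in plain Python. It is not the toolkit's implementation and does not use a model to guide the generalization. The record fields, the `is_k_anonymous` helper, and the fixed-width `generalize_age` step are all hypothetical names chosen for illustration.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every combination of QI values occurs in at least k records."""
    groups = Counter(tuple(r[qi] for qi in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())


def generalize_age(record, width=10):
    """Replace an exact age with a range such as '30-39' (illustrative generalization)."""
    low = (record["age"] // width) * width
    return {**record, "age": f"{low}-{low + width - 1}"}


# Toy data: 'age' and 'zip' are treated as quasi-identifiers, 'salary' as sensitive.
records = [
    {"age": 31, "zip": "100**", "salary": 52000},
    {"age": 35, "zip": "100**", "salary": 61000},
    {"age": 38, "zip": "100**", "salary": 58000},
    {"age": 42, "zip": "200**", "salary": 70000},
    {"age": 47, "zip": "200**", "salary": 66000},
]
qi = ["age", "zip"]

# Exact ages are all distinct, so each record is unique on its QIs.
raw_ok = is_k_anonymous(records, qi, k=2)

# After generalizing age into decades, records fall into groups of size 3 and 2.
generalized = [generalize_age(r) for r in records]
gen_ok = is_k_anonymous(generalized, qi, k=2)
```

In this toy data the raw records are not 2-anonymous, while the generalized ones are. Only the quasi-identifiers are touched; the sensitive `salary` column is left as-is.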
### 3. Wasn't k-anonymity declared a failure?
There has been some criticism of anonymization methods in general and k-anonymity in particular, mostly revolving around possible re-identification even after a dataset has been anonymized. Examples such as the Netflix recommendation contest dataset that was linked with an IMDb dataset to enable re-identification of customers (https://arxiv.org/abs/cs/0610105), or the ability to re-identify individuals from credit card records despite the resolution of the data being decreased on several axes (https://science.sciencemag.org/content/347/6221/536), have been used to justify the need for new, more robust methods.
However, a deeper analysis of these cases reveals that this typically occurs when poor anonymization techniques were applied (https://arxiv.org/abs/1808.01113) or when the chosen list of quasi-identifiers was not exhaustive, leaving some attributes that could potentially be linked to external datasets unaccounted for. The success of k-anonymity is highly dependent on the choice of quasi-identifiers, with the risk of re-identification decreasing as the number of QIs increases. When correctly applied, the re-identification risk in a k-anonymized dataset is at most 1/k (https://doi.org/10.1197/jamia.M2716).
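The 1/k bound can be illustrated with a small sketch. It assumes a simple attacker model, which is our assumption for this example: the attacker knows the target's quasi-identifier values, finds the matching group of records, and guesses one of them uniformly at random. The helper name `max_reidentification_risk` and the toy records are hypothetical.

```python
from collections import Counter


def max_reidentification_risk(records, quasi_identifiers):
    """Worst-case chance of picking the right record within a QI group (1 / smallest group)."""
    if not records:
        raise ValueError("records must not be empty")
    groups = Counter(tuple(r[qi] for qi in quasi_identifiers) for r in records)
    return 1 / min(groups.values())


# Toy 2-anonymous data: QI groups of size 3 and 2.
anonymized = [
    {"age": "30-39", "zip": "100**"},
    {"age": "30-39", "zip": "100**"},
    {"age": "30-39", "zip": "100**"},
    {"age": "40-49", "zip": "200**"},
    {"age": "40-49", "zip": "200**"},
]

risk = max_reidentification_risk(anonymized, ["age", "zip"])
k = 2
within_bound = risk <= 1 / k
```

Under this model the smallest group determines the worst case, so a dataset in which every group has at least k records cannot exceed a risk of 1/k. The bound only holds for the attributes that were actually treated as quasi-identifiers.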
In addition, in the ML-guided anonymization paper (https://arxiv.org/abs/2007.13086) we show that models produced with this method are protected against membership inference attacks to a degree similar to that achieved by more robust methods, such as those based on differential privacy, and we even demonstrate a reduced risk of attribute inference.
### 4. Why not just use differential privacy instead?
### 5. Do I really need an already trained model to use ML-guided anonymization?
### 6. What kind of models does ML-guided anonymization work for?