Table of Contents
- 1. Why do ML models need privacy protection?
- 2. What do you mean when you say anonymization?
- 3. Wasn't k-anonymity declared a failure?
- 4. Why not just use differential privacy instead?
- 5. Do I really need an already trained model to use ML-guided anonymization?
- 6. What kind of models does ML-guided anonymization work for?
1. Why do ML models need privacy protection?
Recent studies show that a malicious third party with access to a trained ML model, even without access to the training data itself, can still reveal sensitive, personal information about the people whose data was used to train the model. For example, it may be possible to reveal whether or not a person’s data is part of the model’s training set (membership inference), or even to infer sensitive attributes about them, such as their salary (attribute inference). For more information see: https://github.com/IBM/ai-privacy-toolkit/wiki/Relevant-papers#privacy-attacks-on-ml-models
2. What do you mean when you say anonymization?
The ML-guided anonymization method implemented in the anonymization module of this toolkit is based on a long-known construct called k-anonymity, which was proposed by L. Sweeney in 2002 to address the problem of releasing personal data while preserving individual privacy. This is a method to reduce the likelihood of any single person being identified when the dataset is linked with other, external data sources. The approach is based on generalizing attributes and possibly deleting records until each record becomes indistinguishable from at least k − 1 other records. This generalization is applied only to those attributes that can be linked with other data sources containing identifiers, called quasi-identifiers (QI).
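To make the generalization idea concrete, here is a minimal, self-contained sketch of k-anonymity over a single numeric quasi-identifier ("age"). It is purely illustrative and is not the toolkit's actual API; real anonymizers handle multiple QIs, categorical hierarchies, and record suppression.

```python
# Illustrative sketch only: k-anonymity by generalizing one numeric
# quasi-identifier into ranges, then checking group sizes.
from collections import Counter

def generalize_ages(ages, bin_width):
    """Generalize each exact age to a range, e.g. 34 -> '30-39'."""
    out = []
    for age in ages:
        lo = (age // bin_width) * bin_width
        out.append(f"{lo}-{lo + bin_width - 1}")
    return out

def is_k_anonymous(qi_values, k):
    """Every generalized QI value must be shared by at least k records."""
    return all(count >= k for count in Counter(qi_values).values())

ages = [23, 27, 25, 41, 44, 48, 62, 66, 69]
generalized = generalize_ages(ages, bin_width=10)
print(generalized)                        # ['20-29', '20-29', '20-29', ...]
print(is_k_anonymous(generalized, k=3))   # True: each decade holds 3 records
```

With k = 3, each record's age range is shared by two other records, so linking on age alone can narrow an individual down to a group of three at best.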
The difference between "classic" k-anonymity methods and the method implemented in this library is that we use an existing machine learning model to guide the anonymization process, which results in a model with higher accuracy for the same level of privacy.
3. Wasn't k-anonymity declared a failure?
There has been some criticism of anonymization methods in general, and of k-anonymity in particular, mostly revolving around possible re-identification even after a dataset has been anonymized. Examples such as the Netflix recommendation contest dataset, which was linked with an IMDb dataset to enable re-identification of customers (https://arxiv.org/abs/cs/0610105), or the ability to re-identify individuals from credit card records despite the resolution of the data being decreased on several axes (https://science.sciencemag.org/content/347/6221/536), have been used to justify the need for new, more robust methods.
However, a deeper analysis of these cases reveals that this typically occurs when poor anonymization techniques were applied (https://arxiv.org/abs/1808.01113) or when the chosen list of quasi-identifiers was not exhaustive, leaving some attributes that could potentially be linked to external datasets unaccounted for. The success of k-anonymity is highly dependent on the choice of quasi-identifiers, with the risk of re-identification decreasing as the number of QIs increases. When correctly applied, the re-identification risk in a k-anonymized dataset is at most 1/k (https://doi.org/10.1197/jamia.M2716).
In addition, in the ML-guided anonymization paper (https://arxiv.org/abs/2007.13086) we show that this method provides protection against membership inference attacks on the resulting models comparable to that of more robust methods such as those based on differential privacy, and even demonstrate a reduced risk of attribute inference.
4. Why not just use differential privacy instead?
Differential privacy has several big advantages. First and foremost, it provides a robust mathematical privacy guarantee, as opposed to k-anonymity, which is considered a syntactic privacy construct. Differential privacy also provides future-proof privacy, i.e., no matter which datasets may become available in the future, the re-identification risk will not increase. Differential privacy is also suitable for high-dimensional and non-tabular data, such as images or text.
However, it also suffers from a few drawbacks. It is much more invasive and complex to implement and use, and requires the involvement of data scientists, since the original training algorithm must be replaced with a new one. Moreover, each type of ML model, and in some cases each architecture or other internal implementation detail, requires a different differentially private implementation. This makes it much more difficult to adopt in large organizations with many diverse types of models. Finally, differentially private training is typically much more resource-intensive than non-private, highly optimized training algorithms.
On the other hand, ML-guided anonymization sits “outside” of the training process, which does not need to be replaced or change in any way, and it is actually model-agnostic. The existing training algorithms, architectures and even hyper-parameters can be reused. This makes it much easier to integrate into existing ML pipelines. Since it does not rely on making modifications to the training process, it can be applied in a wide variety of use cases, including machine learning as a service. However, it only works for tabular, relatively low-dimensional data.
5. Do I really need an already trained model to use ML-guided anonymization?
Working with an existing model enables the highest level of tailoring, and will likely yield the highest accuracy results. The original model’s predictions are used to guide the anonymization process, i.e., the creation of the groups of k records that will be generalized together. The initial model used to generate these predictions may be a simple, representative model, trained on a subset of the data or a pre-trained model performing a similar classification task as the target model. However, if such a model is not available, the true class labels may be used instead of the model's predictions.
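The choice between model predictions and true labels can be sketched as follows. This is a hedged illustration using scikit-learn, not the toolkit's actual API: the dataset, the `guide_model` name, and the decision-tree choice are all assumptions made for the example.

```python
# Illustrative sketch: deriving the "guiding labels" for ML-guided
# anonymization from a simple representative model's predictions,
# with the true class labels as the fallback when no model exists.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# A simple representative model, trained on a subset of the data
# (hypothetical stand-in for the original or a pre-trained model).
X_sub, _, y_sub, _ = train_test_split(X, y, train_size=0.3, random_state=0)
guide_model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_sub, y_sub)

have_model = True
if have_model:
    # Predictions guide how records are grouped into sets of k
    # that will be generalized together.
    guiding_labels = guide_model.predict(X)
else:
    # No suitable model available: fall back to the true labels.
    guiding_labels = y
```

The anonymization process would then form groups of at least k records that share the same guiding label wherever possible, so that generalization blurs the quasi-identifiers while preserving the label structure the target model needs.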
6. What kind of models does ML-guided anonymization work for?
This method is model-agnostic and does not require any changes to the training algorithm. So far we have only tested it on classification models in the supervised learning domain. However, we believe it may be applicable to many more use cases, such as regression models and perhaps even unsupervised learning models.