Add README

Signed-off-by: Maya Anderson <mayaa@il.ibm.com>
2026-07-23 17:01:03 +02:00 · 2023-03-17 00:20:18 +02:00 · 2023-03-17 00:20:18 +02:00 · d77cdf0da3
commit d77cdf0da3
parent 39dc8026e6
1 changed files with 96 additions and 0 deletions
--- a/apt/risk/data_assessment/README.md
+++ b/apt/risk/data_assessment/README.md
@@ -0,0 +1,96 @@
+# Privacy Assessment of Datasets for AI Models
+
+This module implements a tool for privacy assessment of synthetic datasets that are to be used in AI models.
+
+The main interface, ``DatasetAttack``, exposes the main method ``assess_privacy()`` and assumes that the training data,
+holdout data and synthetic data are all available at the time of the privacy evaluation.
+Concrete assessment methods implement this interface and can run the assessment either per record
+or on the whole dataset.
+The method ``assess_privacy()`` returns a ``DatasetAttackScore``, which contains a ``risk_score`` and,
+optionally, a ``DatasetAttackResult``. Each specific attack can implement its own ``DatasetAttackScore``, which would
+contain additional fields.
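+
+As a rough illustration of this contract, the sketch below mirrors the shape described above with hypothetical
+stand-in classes. The names ``SketchDatasetAttack``, ``SketchAttackScore``, ``SketchAttackResult`` and
+``ExactCopyAttack`` are assumptions made for this example and are not part of the module; the real classes may
+differ in their constructors and fields.
+
+```python
+from abc import ABC, abstractmethod
+from dataclasses import dataclass
+from typing import Optional
+
+
+@dataclass
+class SketchAttackResult:
+    """Hypothetical stand-in for ``DatasetAttackResult``."""
+    details: dict
+
+
+@dataclass
+class SketchAttackScore:
+    """Hypothetical stand-in for ``DatasetAttackScore``."""
+    dataset_name: str
+    risk_score: float
+    result: Optional[SketchAttackResult] = None
+
+
+class SketchDatasetAttack(ABC):
+    """Hypothetical stand-in for the ``DatasetAttack`` interface."""
+
+    def __init__(self, members, non_members, synthetic, dataset_name):
+        self.members = members
+        self.non_members = non_members
+        self.synthetic = synthetic
+        self.dataset_name = dataset_name
+
+    @abstractmethod
+    def assess_privacy(self) -> SketchAttackScore:
+        """Run the attack and return its score."""
+
+
+class ExactCopyAttack(SketchDatasetAttack):
+    """Toy whole-dataset attack: share of synthetic records copied from training."""
+
+    def assess_privacy(self) -> SketchAttackScore:
+        training = set(self.members)
+        copies = sum(record in training for record in self.synthetic)
+        return SketchAttackScore(self.dataset_name, copies / len(self.synthetic))
+
+
+# One of the two synthetic records is an exact copy of a training record.
+score = ExactCopyAttack([(1, 2), (3, 4)], [(5, 6)], [(1, 2), (7, 8)], "toy").assess_privacy()
+```
+
+The toy attack leaves the optional result empty; an attack that wants to expose per-record details would fill it.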
+
+The abstract class ``DatasetAttackMembership`` implements the ``DatasetAttack`` interface and adds the result
+of the membership inference attack, so that the final score contains both the calculated score and the
+membership inference attack result for further analysis.
+
+
+``DatasetAssessmentManager`` provides convenience methods to run multiple attacks and persist the result reports.
+
+Attack Implementations
+-----------------------
+
+One implementation is based on the paper [^1] and its implementation in [^2]. It is a black-box membership inference
+attack (MIA) that uses the distances of members (training set) and non-members (holdout set) from their nearest
+neighbors in the synthetic dataset.
+By default, the Euclidean distance is used (L2 norm), but another ``compute_distance()`` method can be provided in
+configuration instead.
+The area under the receiver operating characteristic curve (AUC ROC) gives the privacy risk measure.
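+
+The idea can be sketched with the standard library alone. The code below is a minimal illustration under stated
+assumptions, not this module's implementation: the function names are hypothetical, records are plain tuples, and
+the AUC ROC is computed by brute-force pairwise comparison, which is only practical for small datasets.
+
+```python
+import math
+import random
+
+
+def nearest_distance(record, dataset):
+    """Euclidean (L2) distance from record to its nearest neighbor in dataset."""
+    return min(math.dist(record, other) for other in dataset)
+
+
+def distance_attack_auc(members, non_members, synthetic):
+    """AUC ROC of the rule 'closer to the synthetic data means member'."""
+    member_d = [nearest_distance(r, synthetic) for r in members]
+    non_member_d = [nearest_distance(r, synthetic) for r in non_members]
+    # The AUC equals the probability that a random member lies closer to the
+    # synthetic data than a random non-member (ties count as one half).
+    wins = sum((m < n) + 0.5 * (m == n) for m in member_d for n in non_member_d)
+    return wins / (len(member_d) * len(non_member_d))
+
+
+def sample(n, dim=3):
+    """Draw n random records from a standard normal distribution."""
+    return [tuple(random.gauss(0, 1) for _ in range(dim)) for _ in range(n)]
+
+
+random.seed(0)
+train, holdout = sample(50), sample(50)
+# Hypothetical leaky generator: synthetic records are near-copies of training records.
+leaky_synth = [tuple(v + random.gauss(0, 0.01) for v in r) for r in train]
+auc = distance_attack_auc(train, holdout, leaky_synth)
+```
+
+An AUC near 0.5 means the distances do not separate members from non-members, while an AUC near 1.0, as this
+deliberately leaky generator should produce, indicates a high membership inference risk.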
+
+Another implementation is based on the papers [^3] and [^4], and on a variation of the reference implementation in [^5].
+It is based on distances of synthetic data records from members (training set) and non-members (holdout set).
+The privacy risk measure is the share of synthetic records that are closer to the training dataset than to the
+holdout dataset.
+By default, the Euclidean distance is used (L2 norm), but another ``compute_distance()`` method can be provided in
+configuration instead.
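+
+A minimal sketch of this measure follows. It is an illustration only, not this module's implementation: the
+function name is hypothetical, records are plain tuples, and ties are counted as "not closer", which is an
+assumption of this example.
+
+```python
+import math
+
+
+def share_closer_to_training(members, non_members, synthetic):
+    """Share of synthetic records whose nearest training record is closer
+    than their nearest holdout record (Euclidean distance)."""
+    closer = 0
+    for record in synthetic:
+        d_train = min(math.dist(record, m) for m in members)
+        d_holdout = min(math.dist(record, h) for h in non_members)
+        closer += d_train < d_holdout  # ties are counted as not closer
+    return closer / len(synthetic)
+
+
+members = [(0.0, 0.0), (1.0, 1.0)]
+non_members = [(5.0, 5.0), (6.0, 6.0)]
+synthetic = [(0.1, 0.0), (0.9, 1.1), (5.2, 5.0), (2.0, 2.0)]
+share = share_closer_to_training(members, non_members, synthetic)  # 3 of 4 records
+```
+
+When the training and holdout sets have the same size and come from the same distribution, a share of roughly 0.5
+would be expected from a generator that does not memorize; a share well above that suggests the synthetic records
+sit closer to the training data than chance would explain.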
+
+Usage
+-----
+This module provides an interface for running privacy attacks that assess the risk of synthetic datasets to be
+used in AI models.
+The original data members (training data), the non-members (holdout data) and the synthetic data created from the
+original members must all be available.
+For reliable results, all datasets should be preprocessed and normalized.
+The following example runs all the attacks and persists the results in files.
+
+```python
+from apt.risk.data_assessment.dataset_assessment_manager import DatasetAssessmentManager, \
+    DatasetAssessmentManagerConfig
+from apt.utils.datasets import ArrayDataset
+
+dataset_assessment_manager = DatasetAssessmentManager(
+    DatasetAssessmentManagerConfig(persist_reports=True, generate_plots=False))
+
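+# x_synth, y_synth, x_train, y_train, x_test and y_test are assumed to be array-like features and labels
+# that were loaded and preprocessed beforehand
+synthetic_data = ArrayDataset(x_synth, y_synth)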
+original_data_members = ArrayDataset(x_train, y_train)
+original_data_non_members = ArrayDataset(x_test, y_test)
+
+dataset_name = 'my_dataset'
+[score_gl, score_h] = dataset_assessment_manager.assess(
+    original_data_members, original_data_non_members, synthetic_data, dataset_name)
+dataset_assessment_manager.dump_all_scores_to_files()
+```
+
+Alternatively, each attack can be run separately, for instance:
+
+```python
+from apt.risk.data_assessment.dataset_attack_membership_knn_probabilities import \
+    DatasetAttackConfigMembershipKnnProbabilities, DatasetAttackMembershipKnnProbabilities
+from apt.utils.datasets import ArrayDataset
+
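+# x_synth, y_synth, x_train, y_train, x_test and y_test are assumed to be array-like features and labels
+# that were loaded and preprocessed beforehand
+synthetic_data = ArrayDataset(x_synth, y_synth)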
+original_data_members = ArrayDataset(x_train, y_train)
+original_data_non_members = ArrayDataset(x_test, y_test)
+
+dataset_name = 'my_dataset'
+
+config_gl = DatasetAttackConfigMembershipKnnProbabilities(use_batches=False,
+                                                          generate_plot=False)
+attack_gl = DatasetAttackMembershipKnnProbabilities(original_data_members,
+                                                    original_data_non_members,
+                                                    synthetic_data,
+                                                    config_gl,
+                                                    dataset_name)
+
+score_gl = attack_gl.assess_privacy()
+```
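+
+Normalization matters because both attacks rely on Euclidean distances by default, so a feature with a large
+numeric range would otherwise dominate the result. The sketch below shows one possible approach, min-max scaling,
+using only the standard library. The helper names are hypothetical, and fitting the bounds on the training data
+only is an assumption of this example rather than a requirement of the module.
+
+```python
+def fit_min_max(records):
+    """Return the (min, max) pair of every feature column."""
+    return [(min(column), max(column)) for column in zip(*records)]
+
+
+def apply_min_max(records, bounds):
+    """Rescale every feature with the given bounds; constant features map to 0.0."""
+    return [
+        tuple((v - lo) / (hi - lo) if hi > lo else 0.0 for v, (lo, hi) in zip(record, bounds))
+        for record in records
+    ]
+
+
+# Hypothetical records with two features on very different scales: (age, income).
+train = [(20.0, 30000.0), (60.0, 90000.0)]
+holdout = [(40.0, 60000.0)]
+synth = [(30.0, 75000.0)]
+
+bounds = fit_min_max(train)  # assumption: bounds are fitted on the training data only
+train_n, holdout_n, synth_n = (apply_min_max(d, bounds) for d in (train, holdout, synth))
+```
+
+The same transformation is applied to all three datasets so that distances between them remain comparable; the
+scaled features can then be wrapped in ``ArrayDataset`` as in the examples above.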
+
+Citations
+---------
+
+  [^1]: "GAN-Leaks: A Taxonomy of Membership Inference Attacks against Generative Models" by D. Chen, N. Yu, Y. Zhang, M. Fritz published in Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security, 343–62, 2020. [https://doi.org/10.1145/3372297.3417238](https://doi.org/10.1145/3372297.3417238)
+
+  [^2]: Code for the paper "GAN-Leaks: A Taxonomy of Membership Inference Attacks against Generative Models" [https://github.com/DingfanChen/GAN-Leaks](https://github.com/DingfanChen/GAN-Leaks)
+
+  [^3]: "Data Synthesis based on Generative Adversarial Networks" by N. Park, M. Mohammadi, K. Gorde, S. Jajodia, H. Park, and Y. Kim published in International Conference on Very Large Data Bases (VLDB), 2018.
+
+  [^4]: "Holdout-Based Fidelity and Privacy Assessment of Mixed-Type Synthetic Data" by M. Platzer and T. Reutterer.
+
+  [^5]: Code for the paper "Holdout-Based Fidelity and Privacy Assessment of Mixed-Type Synthetic Data" [https://github.com/mostly-ai/paper-fidelity-accuracy](https://github.com/mostly-ai/paper-fidelity-accuracy)