Fix README references, and those in other comments

Signed-off-by: Maya Anderson <mayaa@il.ibm.com>
2026-07-23 17:01:03 +02:00 · 2023-03-17 23:16:51 +02:00
commit 4c7cad86df
parent ab42e064a4
5 changed files with 30 additions and 21 deletions
--- a/apt/risk/data_assessment/README.md
+++ b/apt/risk/data_assessment/README.md
@@ -1,6 +1,6 @@
 # Privacy Assessment of Datasets for AI Models

-This module implements a tool for privacy assessment of synthetic datasets that are to be used in AI models.
+This module implements a tool for privacy assessment of synthetic datasets that are to be used in AI model training.

 The main interface, ``DatasetAttack``, with the ``assess_privacy()`` main method assumes the availability of the
 training data, holdout data and synthetic data at the time of the privacy evaluation.
@@ -20,13 +20,16 @@ for further analysis and the calculated score.
 Attack Implementations
 -----------------------

-One implementation is based on the paper [^1] and its implementation in [^2]. It is based on Black-Box MIA attack using
+One implementation is based on the paper "GAN-Leaks: A Taxonomy of Membership Inference Attacks against Generative
+Models"[^1] and its implementation[^2]. It is based on Black-Box MIA attack using
 distances of members (training set) and non-members (holdout set) from their nearest neighbors in the synthetic dataset.
 By default, the Euclidean distance is used (L2 norm), but another ``compute_distance()`` method can be provided in
 configuration instead.
 The area under the receiver operating characteristic curve (AUC ROC) gives the privacy risk measure.

-Another implementation is based on the papers [^3] and [^4], and on a variation of its reference implementation in [^5].
+Another implementation is based on the papers "Data Synthesis based on Generative Adversarial Networks"[^3] and
+"Holdout-Based Fidelity and Privacy Assessment of Mixed-Type Synthetic Data"[^4], and on a variation of its reference
+implementation[^5].
 It is based on distances of synthetic data records from members (training set) and non-members (holdout set).
 The privacy risk measure is the share of synthetic records closer to the training than the holdout dataset.
 By default, the Euclidean distance is used (L2 norm), but another ``compute_distance()`` method can be provided in
@@ -34,11 +37,15 @@ configuration instead.

 Usage
 -----
-The interface for performing privacy attack for risk assessment for synthetic datasets to be used in AI models.
+An implementation of the ``DatasetAttack`` interface is used for performing a privacy attack for risk assessment of
+synthetic datasets to be used in AI model training.
 The original data members (training data), non-members (the holdout data) and the synthetic data created from the
 original members should be available.
 For reliability, all the datasets should be preprocessed and normalized.
-The following example persists runs all the attacks and persists the result in files.
+
+The following example runs all the attacks and persists the results in files, using ``DatasetAssessmentManager``.
+It assumes that you provide it with the pairs ``(x_train, y_train)``, ``(x_test, y_test)`` and ``(x_synth, y_synth)``
+for members, non-members and the synthetic datasets, respectively.

 ```python
 from apt.risk.data_assessment.dataset_assessment_manager import DatasetAssessmentManager, \
@@ -69,15 +76,12 @@ synthetic_data = ArrayDataset(x_synth, y_synth)
 original_data_members = ArrayDataset(x_train, y_train)
 original_data_non_members = ArrayDataset(x_test, y_test)

-dataset_name = 'my_dataset'
-
 config_gl = DatasetAttackConfigMembershipKnnProbabilities(use_batches=False,
                                                          generate_plot=False)
 attack_gl = DatasetAttackMembershipKnnProbabilities(original_data_members,
                                                    original_data_non_members,
                                                    synthetic_data,
-                                                    config_gl,
-                                                    dataset_name)
+                                                    config_gl)

 score_gl = attack_gl.assess_privacy()
 ```
@@ -85,12 +89,17 @@ score_gl = attack_gl.assess_privacy()
 Citations
 ---------

-  [^1]: "GAN-Leaks: A Taxonomy of Membership Inference Attacks against Generative Models" by D. Chen, N. Yu, Y. Zhang, M. Fritz published in Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security, 343–62, 2020. [https://doi.org/10.1145/3372297.3417238](https://doi.org/10.1145/3372297.3417238)
+  [^1]: "GAN-Leaks: A Taxonomy of Membership Inference Attacks against Generative Models" by D. Chen, N. Yu, Y. Zhang,
+  M. Fritz published in Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security, 343–62,
+  2020. [https://doi.org/10.1145/3372297.3417238](https://doi.org/10.1145/3372297.3417238)

-  [^2]: Code for the paper "GAN-Leaks: A Taxonomy of Membership Inference Attacks against Generative Models" [https://github.com/DingfanChen/GAN-Leaks](https://github.com/DingfanChen/GAN-Leaks)
+  [^2]: Code for the paper "GAN-Leaks: A Taxonomy of Membership Inference Attacks against Generative Models"
+  [https://github.com/DingfanChen/GAN-Leaks](https://github.com/DingfanChen/GAN-Leaks)

-  [^3]: "Data Synthesis based on Generative Adversarial Networks." by N. Park, M. Mohammadi, K. Gorde, S. Jajodia, H. Park, and Y. Kim in International Conference on Very Large Data Bases (VLDB), 2018.
+  [^3]: "Data Synthesis based on Generative Adversarial Networks." by N. Park, M. Mohammadi, K. Gorde, S. Jajodia,
+  H. Park, and Y. Kim in International Conference on Very Large Data Bases (VLDB), 2018.

  [^4]: "Holdout-Based Fidelity and Privacy Assessment of Mixed-Type Synthetic Data" by M. Platzer and T. Reutterer.

-  [^5]: Code for the paper "Holdout-Based Fidelity and Privacy Assessment of Mixed-Type Synthetic Data" [https://github.com/mostly-ai/paper-fidelity-accuracy](https://github.com/mostly-ai/paper-fidelity-accuracy)
+  [^5]: Code for the paper "Holdout-Based Fidelity and Privacy Assessment of Mixed-Type Synthetic Data"
+  [https://github.com/mostly-ai/paper-fidelity-accuracy](https://github.com/mostly-ai/paper-fidelity-accuracy)
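
Review note: the second measure described in the README above (the share of synthetic records closer to the training set than to the holdout set) can be illustrated with a minimal, standard-library-only sketch. This is not the module's implementation. The names `min_distance` and `share_closer_to_training`, the `distance` parameter, and the even split of ties are assumptions made for illustration only.

```python
import math


def min_distance(record, dataset, distance=math.dist):
    """Distance from `record` to its nearest neighbour in `dataset`."""
    return min(distance(record, other) for other in dataset)


def share_closer_to_training(train, holdout, synth, distance=math.dist):
    """Fraction of synthetic records whose nearest training record is closer
    than their nearest holdout record.

    Ties are counted as half (an assumption of this sketch).
    """
    closer = 0.0
    for record in synth:
        d_train = min_distance(record, train, distance)
        d_holdout = min_distance(record, holdout, distance)
        if d_train < d_holdout:
            closer += 1.0
        elif d_train == d_holdout:
            closer += 0.5
    return closer / len(synth)


# Toy, already-normalized data (hypothetical values).
train = [(0.0, 0.0), (1.0, 1.0)]
holdout = [(5.0, 5.0), (6.0, 6.0)]
synth = [(0.1, 0.0), (5.1, 5.0), (0.9, 1.0), (1.2, 1.0)]

share = share_closer_to_training(train, holdout, synth)
print(share)
```

In this sketch a share near 0.5 would mean the synthetic records sit about as close to the holdout data as to the training data, while a share well above 0.5 would suggest they track the training records more closely.
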
--- a/apt/risk/data_assessment/__init__.py
+++ b/apt/risk/data_assessment/__init__.py
@@ -1,7 +1,7 @@
 """
 Module providing privacy risk assessment for synthetic data.

-The main interface, ``DatasetAttack``, with the assess_privacy() main method assumes the availability of the
+The main interface, ``DatasetAttack``, with the ``assess_privacy()`` main method assumes the availability of the
 training data, holdout data and synthetic data at the time of the privacy evaluation.
 It is to be implemented by concrete assessment methods, which can run the assessment on a per-record level,
 or on the whole dataset.
--- a/apt/risk/data_assessment/dataset_assessment_manager.py
+++ b/apt/risk/data_assessment/dataset_assessment_manager.py
@@ -35,7 +35,7 @@ class DatasetAssessmentManager:
    def assess(self, original_data_members: ArrayDataset, original_data_non_members: ArrayDataset,
               synthetic_data: ArrayDataset, dataset_name: str = DEFAULT_DATASET_NAME) -> list[DatasetAttackScore]:
        """
-        Do dataset assessment by running dataset attacks, and return their scores.
+        Do dataset privacy risk assessment by running dataset attacks, and return their scores.

        :param original_data_members: A container for the training original samples and labels,
            only samples are used in the assessment
--- a/apt/risk/data_assessment/dataset_attack.py
+++ b/apt/risk/data_assessment/dataset_attack.py
@@ -23,8 +23,8 @@ class Config(abc.ABC):

 class DatasetAttack(abc.ABC):
    """
-         The interface for performing privacy attack for risk assessment for synthetic datasets to be used in AI models.
-         The original data members (training data) and non-members (the holdout data) should be available.
+         The interface for performing privacy attack for risk assessment of synthetic datasets to be used in AI model
+         training. The original data members (training data) and non-members (the holdout data) should be available.
         For reliability, all the datasets should be preprocessed and normalized.
    """

--- a/apt/risk/data_assessment/dataset_attack_membership_knn_probabilities.py
+++ b/apt/risk/data_assessment/dataset_attack_membership_knn_probabilities.py
@@ -66,8 +66,8 @@ class DatasetAttackMembershipKnnProbabilities(DatasetAttackMembership):
    """
         Privacy risk assessment for synthetic datasets based on Black-Box MIA attack using distances of
         members (training set) and non-members (holdout set) from their nearest neighbors in the synthetic dataset.
-         By default, the Euclidean distance is used (L2 norm), but another compute_distance() method can be provided in
-         configuration instead.
+         By default, the Euclidean distance is used (L2 norm), but another ``compute_distance()`` method can be provided
+         in configuration instead.
         The area under the receiver operating characteristic curve (AUC ROC) gives the privacy risk measure.
    """

@@ -100,7 +100,7 @@ class DatasetAttackMembershipKnnProbabilities(DatasetAttackMembership):
        that the query sample can be generated by the generative model.
        So, if the probability that the query sample is generated by the generative model is large,
        it is more likely that the query sample was used to train the generative model. This probability is approximated
-        by the Parzen window density estimation in 'probability_per_sample()', computed from the NN distances from the
+        by the Parzen window density estimation in ``probability_per_sample()``, computed from the NN distances from the
        query samples to the synthetic data samples.

        :return:
@@ -129,7 +129,7 @@ class DatasetAttackMembershipKnnProbabilities(DatasetAttackMembership):
                                generate_plot=False) -> DatasetAttackScore:
        """
        Evaluate privacy score from the probabilities of member and non-member samples to be generated by the synthetic
-        data generator. The probabilities are computed by the 'assess_privacy()' method.
+        data generator. The probabilities are computed by the ``assess_privacy()`` method.
        :param dataset_attack_result attack result containing probabilities of member and non-member samples to be
                generated by the synthetic data generator
        :param generate_plot generate AUC ROC curve plot and persist it
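
Review note: the docstrings above describe scoring each query sample by how likely the generator is to have produced it, based on its nearest-neighbour distance to the synthetic data, and then summarizing member versus non-member scores as AUC ROC. A minimal standard-library sketch of that idea follows. It is not the code in this module: the Gaussian kernel with bandwidth 1.0, and the names `nn_distance`, `membership_scores` and `auc_roc`, are assumptions chosen for illustration.

```python
import math


def nn_distance(record, synth):
    """Euclidean distance from `record` to its nearest synthetic record."""
    return min(math.dist(record, s) for s in synth)


def membership_scores(records, synth):
    """Kernel-style score per record: smaller NN distance gives a higher score.

    The Gaussian kernel with bandwidth 1.0 is an arbitrary choice for this
    sketch; any strictly decreasing function of distance yields the same AUC.
    """
    return [math.exp(-nn_distance(r, synth) ** 2) for r in records]


def auc_roc(member_scores, non_member_scores):
    """AUC ROC as the fraction of (member, non-member) pairs in which the
    member has the higher score, counting ties as half."""
    wins = 0.0
    for m in member_scores:
        for n in non_member_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(non_member_scores))


# Toy, already-normalized data (hypothetical values).
synth = [(0.0, 0.0), (1.0, 1.0), (4.0, 4.0)]
members = [(0.1, 0.0), (3.0, 3.0)]
non_members = [(2.5, 2.5), (4.0, 4.3)]

auc = auc_roc(membership_scores(members, synth),
              membership_scores(non_members, synth))
print(auc)
```

Under this sketch an AUC near 0.5 would mean members and non-members are indistinguishable by their distance to the synthetic data, and values approaching 1.0 would indicate a higher membership-inference risk.
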