Initial version of data minimization

2026-05-02 00:02:37 +02:00 · 2021-07-12 14:59:24 +03:00 · 2021-07-12 14:59:24 +03:00 · b6ea416bcb
commit b6ea416bcb
parent 60d48ab388
7 changed files with 857 additions and 8 deletions
--- a/apt/minimization/README.md
+++ b/apt/minimization/README.md
@ -0,0 +1,110 @@
+# data minimization module
+
+The EU General Data Protection Regulation (GDPR) mandates the principle of data minimization, which requires that only 
+data necessary to fulfill a certain purpose be collected. However, it can often be difficult to determine the minimal 
+amount of data required, especially in complex machine learning models such as neural networks. 
+
+This module implements a first-of-a-kind method to help reduce the amount of personal data needed to perform 
+predictions with a machine learning model, by removing or generalizing some of the input features. The type of data 
+minimization this toolkit focuses on is the reduction of the number and/or granularity of features collected for analysis. 
+
+The generalization process basically searches for several similar records and groups them together. Then, for each 
+feature, the individual values for that feature within each group are replaced with a represenataive value that is 
+common across the whole group. This process is done while using knowledge encoded within the model to produce a 
+generalization that has little to no impact on its accuracy. 
+
+For more information about the method see: http://export.arxiv.org/pdf/2008.04113
+
+The following figure depicts the overall process:
+
+<p align="center">
+  <img src="../../docs/images/AI_Privacy_project.jpg?raw=true" width="667" title="data minimization process">
+</p>
+<br />
+
+Usage
+-----
+
+The main class, ``GeneralizeToRepresentative``, is a scikit-learn compatible ``Transformer``, that receives an existing 
+estimator and labeled training data, and learns the generalizations that can be applied to any newly collected data for 
+analysis by the original model. The ``fit()`` method learns the generalizations and the ``transform()`` method applies 
+them to new data.
+
+It is also possible to export the generalizations as feature ranges.
+
+The current implementation supports only numeric features, so any categorical features must be transformed to a numeric 
+representation before using this class.
+
+Start by training your machine learning model. In this example, we will use a ``DecisionTreeClassifier``, but any 
+scikit-learn model can be used. We will use the iris dataset in our example.
+
+.. code:: python
+
+  from sklearn import datasets
+  from sklearn.model_selection import train_test_split
+  from sklearn.tree import DecisionTreeClassifier
+
+  dataset = datasets.load_iris()
+  X_train, X_test, y_train, y_test = train_test_split(dataset.data, dataset.target, test_size=0.2)
+
+  base_est = DecisionTreeClassifier()
+  base_est.fit(X_train, y_train)
+
+Now create the ``GeneralizeToRepresentative`` transformer and train it. Supply it with the original model and the 
+desired target accuracy. The training process may receive the original labeled training data or the model's predictions 
+on the data.
+
+.. code:: python
+
+  predictions = base_est.predict(X_train)
+  gen = GeneralizeToRepresentative(base_est, target_accuracy=0.9)
+  gen.fit(X_train, predictions)
+
+Now use the transformer to transform new data, for example the test data.
+
+.. code:: python
+
+  transformed = gen.transform(X_test)
+
+The transformed data has the same columns and formats as the original data, so it can be used directly to derive 
+predictions from the original model.
+
+.. code:: python
+
+  new_predictions = base_est.predict(transformed)
+  
+To export the resulting generalizations, retrieve the ``Transformer``'s ``_generalize`` parameter.
+
+.. code:: python
+
+  generalizations = base_est._generalize
+  
+The returned object has the following structure::
+
+  {
+    ranges: 
+    {
+      list of (<feature name>: [<list of values>])
+    }, 
+    untouched: [<list of feature names>]
+  }
+  
+For example::
+
+  {
+    ranges: 
+    {
+      age: [21.5, 39.0, 51.0, 70.5], 
+      education-years: [8.0, 12.0, 14.5]
+    }, 
+    untouched: ["occupation", "marital-status"]
+  }
+  
+Where each value inside the range list represents a cutoff point. For example, for the ``age`` feature, the ranges in 
+this example are: ``<21.5, 21.5-39.0, 39.0-51.0, 51.0-70.5, >70.5``. The ``untouched`` list represents features that 
+were not generalized, i.e., their values should remain unchanged.
+
+
+
+
+