Add data minimization functionality to the ai-privacy-toolkit (#3)

* Fix directory issue when running tests for first time

* Initial version of data minimization

* Update version and documentation

* Fix documentation
abigailgold 2021-07-12 15:56:42 +03:00 committed by GitHub
parent bcc3d67ba4
commit f2e1364b43
14 changed files with 920 additions and 34 deletions

View file

@ -6,20 +6,23 @@
A toolkit for tools and techniques related to the privacy and compliance of AI models.
The first release of this toolkit contains a single module called [**anonymization**](apt/anonymization/README.md).
This module contains methods for anonymizing ML model training data, so that when
a model is retrained on the anonymized data, the model itself will also be considered
anonymous. This may help exempt the model from different obligations and restrictions
The [**anonymization**](apt/anonymization/README.md) module contains methods for anonymizing ML model
training data, so that when a model is retrained on the anonymized data, the model itself will also be
considered anonymous. This may help exempt the model from different obligations and restrictions
set out in data protection regulations such as GDPR, CCPA, etc.
The [**minimization**](apt/minimization/README.md) module contains methods to help adhere to the data
minimization principle in GDPR for ML models. It makes it possible to reduce the amount of
personal data needed to perform predictions with a machine learning model, while still enabling the model
to make accurate predictions. This is done by removing or generalizing some of the input features.
Official ai-privacy-toolkit documentation: https://ai-privacy-toolkit.readthedocs.io/en/latest/
Installation: pip install ai-privacy-toolkit
**Related toolkits:**
[ai-minimization-toolkit](https://github.com/IBM/ai-minimization-toolkit): A toolkit for
reducing the amount of personal data needed to perform predictions with a machine learning model
ai-minimization-toolkit - has been migrated into this toolkit.
[differential-privacy-library](https://github.com/IBM/differential-privacy-library): A
general-purpose library for experimenting with, investigating and developing applications in,

View file

@ -3,6 +3,7 @@ The AI Privacy Toolbox (apt).
"""
from apt import anonymization
from apt import minimization
from apt import utils
__version__ = "0.0.2"
__version__ = "0.0.3"

110
apt/minimization/README.md Normal file
View file

@ -0,0 +1,110 @@
# data minimization module
The EU General Data Protection Regulation (GDPR) mandates the principle of data minimization, which requires that only
data necessary to fulfill a certain purpose be collected. However, it can often be difficult to determine the minimal
amount of data required, especially in complex machine learning models such as neural networks.
This module implements a first-of-a-kind method to help reduce the amount of personal data needed to perform
predictions with a machine learning model, by removing or generalizing some of the input features. The type of data
minimization this toolkit focuses on is the reduction of the number and/or granularity of features collected for analysis.
The generalization process searches for several similar records and groups them together. Then, for each
feature, the individual values for that feature within each group are replaced with a representative value that is
common across the whole group. This process is done while using knowledge encoded within the model to produce a
generalization that has little to no impact on its accuracy.
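As a rough illustration of the idea only (not the toolkit's implementation, which derives its groups from a decision tree trained on the model's predictions), the sketch below generalizes a single numeric feature. The function name `generalize_feature` and the cutoff points are hypothetical; each value is replaced by the group member closest to the group's median.

```python
from bisect import bisect_left
from statistics import median


def generalize_feature(values, cutoffs):
    """Replace each value with a representative of the range it falls into.

    `cutoffs` is a sorted list of hypothetical split points. A value v belongs
    to range k, where k is the number of cutoffs strictly below v, so a value
    equal to a cutoff is assigned to the lower range.
    """
    # Group the values by the range they fall into.
    groups = {}
    for v in values:
        groups.setdefault(bisect_left(cutoffs, v), []).append(v)

    # Pick, for each group, the actual value closest to the group's median.
    representatives = {}
    for k, members in groups.items():
        mid = median(members)
        representatives[k] = min(members, key=lambda m: abs(m - mid))

    return [representatives[bisect_left(cutoffs, v)] for v in values]


ages = [19, 23, 25, 38, 40, 44, 52, 71]
print(generalize_feature(ages, [21.5, 39.0, 51.0, 70.5]))
# [19, 25, 25, 25, 40, 40, 52, 71]
```

Using an existing record's value as the representative, rather than a synthetic mean, keeps the generalized data within values the model has already seen.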
For more information about the method see: http://export.arxiv.org/pdf/2008.04113
The following figure depicts the overall process:
<p align="center">
<img src="../../docs/images/AI_Privacy_project.jpg?raw=true" width="667" title="data minimization process">
</p>
<br />
Usage
-----
The main class, ``GeneralizeToRepresentative``, is a scikit-learn compatible ``Transformer``, that receives an existing
estimator and labeled training data, and learns the generalizations that can be applied to any newly collected data for
analysis by the original model. The ``fit()`` method learns the generalizations and the ``transform()`` method applies
them to new data.
It is also possible to export the generalizations as feature ranges.
The current implementation supports only numeric features, so any categorical features must be transformed to a numeric
representation before using this class.
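The module does not prescribe a particular encoding. As one possible approach, a categorical column can be mapped to integer codes before fitting. The helper `ordinal_encode` below is a hypothetical, standard-library-only sketch; in a scikit-learn pipeline, `sklearn.preprocessing.OrdinalEncoder` or `OneHotEncoder` would be the usual choices.

```python
def ordinal_encode(column):
    """Map each distinct category to an integer code (hypothetical helper).

    Codes follow the sorted order of the category names, so the mapping is
    deterministic. Note that this imposes an artificial ordering on the
    categories, which may or may not suit the model being used.
    """
    categories = sorted(set(column))
    mapping = {category: code for code, category in enumerate(categories)}
    return [mapping[value] for value in column], mapping


workclass = ["Private", "State-gov", "Private", "Self-emp"]
codes, mapping = ordinal_encode(workclass)
print(codes)    # [0, 2, 0, 1]
print(mapping)  # {'Private': 0, 'Self-emp': 1, 'State-gov': 2}
```

Keep the returned mapping so that newly collected data can be encoded the same way before it is passed to ``transform()``.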
Start by training your machine learning model. In this example, we will use a ``DecisionTreeClassifier``, but any
scikit-learn model can be used. We will use the iris dataset in our example.
.. code:: python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
dataset = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(dataset.data, dataset.target, test_size=0.2)
base_est = DecisionTreeClassifier()
base_est.fit(X_train, y_train)
Now create the ``GeneralizeToRepresentative`` transformer and train it. Supply it with the original model and the
desired target accuracy. The training process may receive the original labeled training data or the model's predictions
on the data.
.. code:: python
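from apt.minimization import GeneralizeToRepresentative
# use the original model's predictions as the labels to learn from
predictions = base_est.predict(X_train)
gen = GeneralizeToRepresentative(base_est, target_accuracy=0.9)
gen.fit(X_train, predictions)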
Now use the transformer to transform new data, for example the test data.
.. code:: python
transformed = gen.transform(X_test)
The transformed data has the same columns and formats as the original data, so it can be used directly to derive
predictions from the original model.
.. code:: python
new_predictions = base_est.predict(transformed)
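Because the transformer is trained on the model's own predictions, one way to sanity-check the result is to measure how often the prediction on generalized data agrees with the prediction on the original data. The helper `agreement` below is a hypothetical sketch on plain lists; with the objects above, it could be called as `agreement(base_est.predict(X_test), new_predictions)`.

```python
def agreement(original_predictions, new_predictions):
    """Return the fraction of records whose predicted label did not change."""
    if len(original_predictions) != len(new_predictions):
        raise ValueError("prediction sequences must have the same length")
    if len(original_predictions) == 0:
        return 1.0
    same = sum(1 for a, b in zip(original_predictions, new_predictions) if a == b)
    return same / len(original_predictions)


before = [0, 1, 2, 2, 1, 0, 0, 2, 1, 1]
after = [0, 1, 2, 1, 1, 0, 0, 2, 1, 1]
print(agreement(before, after))  # 0.9
```

A value close to the requested ``target_accuracy`` suggests the generalization behaves on new data as it did during fitting.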
To export the resulting generalizations, retrieve the transformer's ``generalizations_`` attribute.
.. code:: python
generalizations = gen.generalizations_
The returned object has the following structure::
{
ranges:
{
list of (<feature name>: [<list of values>])
},
untouched: [<list of feature names>]
}
For example::
{
ranges:
{
age: [21.5, 39.0, 51.0, 70.5],
education-years: [8.0, 12.0, 14.5]
},
untouched: ["occupation", "marital-status"]
}
Where each value inside the range list represents a cutoff point. For example, for the ``age`` feature, the ranges in
this example are: ``<21.5, 21.5-39.0, 39.0-51.0, 51.0-70.5, >70.5``. The ``untouched`` list represents features that
were not generalized, i.e., their values should remain unchanged.
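To make the cutoff notation concrete, the sketch below expands a list of cutoff points into the range labels described above and looks up the range for a given value. The function names are hypothetical, and the assignment of a value that equals a cutoff to the lower range is an assumption.

```python
from bisect import bisect_left


def ranges_from_cutoffs(cutoffs):
    """Expand sorted cutoff points into human-readable range labels."""
    if not cutoffs:
        # No cutoffs: a single range covering every value.
        return ["*"]
    labels = [f"<{cutoffs[0]}"]
    labels += [f"{a}-{b}" for a, b in zip(cutoffs, cutoffs[1:])]
    labels.append(f">{cutoffs[-1]}")
    return labels


def range_label(value, cutoffs):
    """Return the label of the range that `value` falls into."""
    # bisect_left counts the cutoffs strictly below `value`, which is
    # exactly the index of the matching label.
    return ranges_from_cutoffs(cutoffs)[bisect_left(cutoffs, value)]


age_cutoffs = [21.5, 39.0, 51.0, 70.5]
print(ranges_from_cutoffs(age_cutoffs))
# ['<21.5', '21.5-39.0', '39.0-51.0', '51.0-70.5', '>70.5']
print(range_label(45, age_cutoffs))
# 39.0-51.0
```

A list of n cutoff points always yields n + 1 ranges, which is why the four ``age`` cutoffs in the example produce five ranges.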

View file

@ -0,0 +1,19 @@
"""
Module providing data minimization for ML.
This module implements a first-of-a-kind method to help reduce the amount of personal data needed to perform
predictions with a machine learning model, by removing or generalizing some of the input features. For more information
about the method see: http://export.arxiv.org/pdf/2008.04113
The main class, ``GeneralizeToRepresentative``, is a scikit-learn compatible ``Transformer``, that receives an existing
estimator and labeled training data, and learns the generalizations that can be applied to any newly collected data for
analysis by the original model. The ``fit()`` method learns the generalizations and the ``transform()`` method applies
them to new data.
It is also possible to export the generalizations as feature ranges.
The current implementation supports only numeric features, so any categorical features must be transformed to a numeric
representation before using this class.
"""
from apt.minimization.minimizer import GeneralizeToRepresentative

View file

@ -0,0 +1,664 @@
"""
This module implements all classes needed to perform data minimization
"""
import pandas as pd
import numpy as np
import copy
import sys
from scipy.spatial import distance
from sklearn.base import BaseEstimator, TransformerMixin, MetaEstimatorMixin
from sklearn.base import clone
from sklearn.utils.validation import check_X_y, check_array, check_is_fitted
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
class GeneralizeToRepresentative(BaseEstimator, MetaEstimatorMixin, TransformerMixin):
""" A transformer that generalizes data to representative points.
Learns data generalizations based on an original model's predictions
and a target accuracy. Once the generalizations are learned, can
receive one or more data records and transform them to representative
points based on the learned generalization.
An alternative way to use the transformer is to supply ``cells`` and
``features`` in init or set_params and those will be used to transform
data to representatives. In this case, fit must still be called but
there is no need to supply it with ``X`` and ``y``, and there is no
need to supply an existing ``estimator`` to init.
In summary, either ``estimator`` and ``target_accuracy`` should be
supplied or ``cells`` and ``features`` should be supplied.
Parameters
----------
estimator : estimator, optional
The original model for which generalization is being performed.
Should be pre-fitted.
target_accuracy : float, optional
The required accuracy when applying the base model to the
generalized data. Accuracy is measured relative to the original
accuracy of the model.
features : list of str, optional
The feature names, in the order that they appear in the data.
cells : list of object, optional
The cells used to generalize records. Each cell must define a
range or subset of categories for each feature, as well as a
representative value for each feature.
This parameter should be used when instantiating a transformer
object without first fitting it.
Attributes
----------
cells_ : list of object
The cells used to generalize records, as learned when calling fit.
ncp_ : float
The NCP (information loss) score of the resulting generalization,
as measured on the training data.
generalizations_ : object
The generalizations that were learned (actual feature ranges).
Notes
-----
"""
def __init__(self, estimator=None, target_accuracy=0.998, features=None,
cells=None):
self.estimator = estimator
self.target_accuracy = target_accuracy
self.features = features
self.cells = cells
def get_params(self, deep=True):
"""Get parameters for this estimator.
Parameters
----------
deep : boolean, optional
If True, will return the parameters for this estimator and contained
subobjects that are estimators.
Returns
-------
params : mapping of string to any
Parameter names mapped to their values.
"""
ret = {}
ret['target_accuracy'] = self.target_accuracy
if deep:
ret['features'] = copy.deepcopy(self.features)
ret['cells'] = copy.deepcopy(self.cells)
ret['estimator'] = self.estimator
else:
ret['features'] = copy.copy(self.features)
ret['cells'] = copy.copy(self.cells)
return ret
def set_params(self, **params):
"""Set the parameters of this estimator.
Returns
-------
self : object
Returns self.
"""
if 'target_accuracy' in params:
self.target_accuracy = params['target_accuracy']
if 'features' in params:
self.features = params['features']
if 'cells' in params:
self.cells = params['cells']
return self
    def fit_transform(self, X=None, y=None):
        """Learns the generalizations based on training data, and applies them to the data.

        Parameters
        ----------
        X : {array-like, sparse matrix}, shape (n_samples, n_features), optional
            The training input samples.
        y : array-like, shape (n_samples,), optional
            The target values. An array of int.
            This should contain the predictions of the original model on ``X``.

        Returns
        -------
        X_transformed : ndarray, shape (n_samples, n_features)
            The array containing the representative values to which each record in
            ``X`` is mapped.
        """
        self.fit(X, y)
        return self.transform(X)
def fit(self, X=None, y=None):
"""Learns the generalizations based on training data.
Parameters
----------
X : {array-like, sparse matrix}, shape (n_samples, n_features), optional
The training input samples.
y : array-like, shape (n_samples,), optional
The target values. An array of int.
This should contain the predictions of the original model on ``X``.
Returns
-------
self : object
Returns self.
"""
# take into account that estimator, X, y, cells, features may be None
if X is not None and y is not None:
X, y = check_X_y(X, y, accept_sparse=True)
self.n_features_ = X.shape[1]
elif self.features:
self.n_features_ = len(self.features)
else:
self.n_features_ = 0
if self.features:
self._features = self.features
# if features is None, use numbers instead of names
elif self.n_features_ != 0:
self._features = [i for i in range(self.n_features_)]
else:
self._features = None
if self.cells:
self.cells_ = self.cells
else:
self.cells_ = {}
# Going to fit
# (currently not dealing with option to fit with only X and y and no estimator)
if self.estimator and X is not None and y is not None:
# divide dataset into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
test_size = 0.4,
random_state = 18)
# collect feature data (such as min, max)
train_data = pd.DataFrame(X_train, columns=self._features)
feature_data = {}
for feature in self._features:
if feature not in feature_data:
values = list(train_data.loc[:, feature])
fd = {}
fd['min'] = min(values)
fd['max'] = max(values)
feature_data[feature] = fd
self.cells_ = {}
self.dt_ = DecisionTreeClassifier(random_state=0, min_samples_split=2,
min_samples_leaf=1)
self.dt_.fit(X_train, y_train)
self._calculate_cells()
self._modify_cells()
nodes = self._get_nodes_level(0)
self._attach_cells_representatives(X_train, y_train, nodes)
# self.cells_ currently holds the generalization created from the tree leaves
# apply generalizations to test data
generalized = self._generalize(X_test, nodes, self.cells_, self.cells_by_id_)
# check accuracy
accuracy = self.estimator.score(generalized, y_test)
print('Initial accuracy is %f' % accuracy)
# if accuracy above threshold, improve generalization
if accuracy > self.target_accuracy:
level = 1
while accuracy > self.target_accuracy:
nodes = self._get_nodes_level(level)
self._calculate_level_cells(level)
self._attach_cells_representatives(X_train, y_train, nodes)
generalized = self._generalize(X_test, nodes, self.cells_,
self.cells_by_id_)
accuracy = self.estimator.score(generalized, y_test)
print('Level: %d, accuracy: %f' % (level, accuracy))
level+=1
# if accuracy below threshold, improve accuracy by removing features from generalization
if accuracy < self.target_accuracy:
while accuracy < self.target_accuracy:
self._calculate_generalizations()
removed_feature = self._remove_feature_from_generalization(X_test,
nodes, y_test,
feature_data)
if not removed_feature:
break
generalized = self._generalize(X_test, nodes, self.cells_,
self.cells_by_id_)
accuracy = self.estimator.score(generalized, y_test)
print('Removed feature: %s, accuracy: %f' % (removed_feature, accuracy))
# self.cells_ currently holds the chosen generalization based on target accuracy
# calculate iLoss
self.ncp_ = self._calculate_ncp(X_test, self.generalizations_, feature_data)
# Return the transformer
return self
def transform(self, X):
""" Transforms data records to representative points.
Parameters
----------
X : {array-like, sparse-matrix}, shape (n_samples, n_features)
The input samples.
Returns
-------
X_transformed : ndarray, shape (n_samples, n_features)
The array containing the representative values to which each record in
``X`` is mapped.
"""
# Check if fit has been called
msg = 'This %(name)s instance is not initialized yet. ' \
'Call fit or set_params with ' \
'appropriate arguments before using this method.'
check_is_fitted(self, ['cells', 'features'], msg=msg)
# Input validation
X = check_array(X, accept_sparse=True)
if X.shape[1] != self.n_features_ and self.n_features_ != 0:
raise ValueError('Shape of input is different from what was seen '
'in `fit`')
if not self._features:
self._features = [i for i in range(X.shape[1])]
representatives = pd.DataFrame(columns=self._features) # only columns
generalized = pd.DataFrame(X, columns=self._features, copy=True) # original data
mapped = np.zeros(X.shape[0]) # to mark records we already mapped
# iterate over cells (leaves in decision tree)
for i in range(len(self.cells_)):
# Copy the representatives from the cells into another data structure:
# iterate over features in test data
for feature in self._features:
# if feature has a representative value in the cell and should not
# be left untouched, take the representative value
if feature in self.cells_[i]['representative'] and \
( 'untouched' not in self.cells_[i] \
or feature not in self.cells_[i]['untouched'] ):
representatives.loc[i, feature] = self.cells_[i]['representative'][feature]
# else, drop the feature (removes from representatives columns that
# do not have a representative value or should remain untouched)
elif feature in representatives.columns.tolist():
representatives = representatives.drop(feature, axis=1)
# get the indexes of all records that map to this cell
indexes = self._get_record_indexes_for_cell(X, self.cells_[i], mapped)
# replace the values in the representative columns with the representative
# values (leaves others untouched)
if not representatives.columns.empty:
if len(indexes) > 1:
replace = pd.concat([representatives.loc[i].to_frame().T]*len(indexes)).reset_index(drop=True)
replace.index = indexes
else:
replace = representatives.loc[i].to_frame().T
generalized.loc[indexes, representatives.columns] = replace
return generalized.to_numpy()
def _get_record_indexes_for_cell(self, X, cell, mapped):
return [i for i, x in enumerate(X) if not mapped.item(i) and
self._cell_contains(cell, x, i, mapped)]
def _cell_contains(self, cell, x, i, mapped):
for f in self._features:
if f in cell['ranges']:
if not self._cell_contains_numeric(f, cell['ranges'][f], x):
return False
else:
#TODO: exception - feature not defined
pass
# Mark as mapped
mapped.itemset(i, 1)
return True
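    def _cell_contains_numeric(self, f, feature_range, x):
        i = self._features.index(f)
        # convert x to ndarray to allow indexing
        a = np.array(x)
        value = a.item(i)
        # compare against None explicitly so that a boundary of 0 is still enforced;
        # a cell covers the half-open interval (start, end]
        if feature_range['start'] is not None:
            if value <= feature_range['start']:
                return False
        if feature_range['end'] is not None:
            if value > feature_range['end']:
                return False
        return True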
def _calculate_cells(self):
self.cells_by_id_ = {}
self.cells_ = self._calculate_cells_recursive(0)
def _calculate_cells_recursive(self, node):
feature_index = self.dt_.tree_.feature[node]
if feature_index == -2:
# this is a leaf
label = self._calculate_cell_label(node)
hist = [int(i) for i in self.dt_.tree_.value[node][0]]
cell = {'label': label, 'hist': hist, 'ranges': {}, 'id': int(node)}
return [cell]
cells = []
feature = self._features[feature_index]
threshold = self.dt_.tree_.threshold[node]
left_child = self.dt_.tree_.children_left[node]
right_child = self.dt_.tree_.children_right[node]
left_child_cells = self._calculate_cells_recursive(left_child)
for cell in left_child_cells:
if feature not in cell['ranges'].keys():
cell['ranges'][feature] = {'start': None, 'end': None}
if cell['ranges'][feature]['end'] is None:
cell['ranges'][feature]['end'] = threshold
cells.append(cell)
self.cells_by_id_[cell['id']] = cell
right_child_cells = self._calculate_cells_recursive(right_child)
for cell in right_child_cells:
if feature not in cell['ranges'].keys():
cell['ranges'][feature] = {'start': None, 'end': None}
if cell['ranges'][feature]['start'] is None:
cell['ranges'][feature]['start'] = threshold
cells.append(cell)
self.cells_by_id_[cell['id']] = cell
return cells
def _calculate_cell_label(self, node):
label_hist = self.dt_.tree_.value[node][0]
return int(self.dt_.classes_[np.argmax(label_hist)])
def _modify_cells(self):
cells = []
for cell in self.cells_:
new_cell = {'id': cell['id'], 'label': cell['label'], 'ranges': {},
'categories': {}, 'hist': cell['hist'], 'representative': None}
for feature in self._features:
if feature in cell['ranges'].keys():
new_cell['ranges'][feature] = cell['ranges'][feature]
else:
new_cell['ranges'][feature] = {'start': None, 'end': None}
cells.append(new_cell)
self.cells_by_id_[new_cell['id']] = new_cell
self.cells_ = cells
def _calculate_level_cells(self, level):
if level < 0 or level > self.dt_.get_depth():
#TODO: exception 'Illegal level %d' % level
pass
if level > 0:
new_cells = []
new_cells_by_id = {}
nodes = self._get_nodes_level(level)
for node in nodes:
if self.dt_.tree_.feature[node] == -2: # leaf node
new_cell = self.cells_by_id_[node]
else:
left_child = self.dt_.tree_.children_left[node]
right_child = self.dt_.tree_.children_right[node]
left_cell = self.cells_by_id_[left_child]
right_cell = self.cells_by_id_[right_child]
new_cell = {'id': int(node), 'ranges': {}, 'categories': {},
'label': None, 'representative': None}
for feature in left_cell['ranges'].keys():
new_cell['ranges'][feature] = {}
new_cell['ranges'][feature]['start'] = left_cell['ranges'][feature]['start']
new_cell['ranges'][feature]['end'] = right_cell['ranges'][feature]['start']
for feature in left_cell['categories'].keys():
new_cell['categories'][feature] = \
list(set(left_cell['categories'][feature]) |
set(right_cell['categories'][feature]))
self._calculate_level_cell_label(left_cell, right_cell, new_cell)
new_cells.append(new_cell)
new_cells_by_id[new_cell['id']] = new_cell
self.cells_ = new_cells
self.cells_by_id_ = new_cells_by_id
# else: nothing to do, stay with previous cells
def _calculate_level_cell_label(self, left_cell, right_cell, new_cell):
new_cell['hist'] = [x + y for x, y in zip(left_cell['hist'], right_cell['hist'])]
new_cell['label'] = int(self.dt_.classes_[np.argmax(new_cell['hist'])])
def _get_nodes_level(self, level):
# level = distance from lowest leaf
node_depth = np.zeros(shape=self.dt_.tree_.node_count, dtype=np.int64)
is_leaves = np.zeros(shape=self.dt_.tree_.node_count, dtype=bool)
stack = [(0, -1)] # seed is the root node id and its parent depth
while len(stack) > 0:
node_id, parent_depth = stack.pop()
node_depth[node_id] = parent_depth + 1
if self.dt_.tree_.children_left[node_id] != self.dt_.tree_.children_right[node_id]:
stack.append((self.dt_.tree_.children_left[node_id], parent_depth + 1))
stack.append((self.dt_.tree_.children_right[node_id], parent_depth + 1))
else:
is_leaves[node_id] = True
max_depth = max(node_depth)
depth = max_depth - level
if depth < 0:
return None
return [i for i, x in enumerate(node_depth) if x == depth or (x < depth and is_leaves[i])]
def _attach_cells_representatives(self, samples, labels, level_nodes):
samples_df = pd.DataFrame(samples, columns=self._features)
labels_df = pd.DataFrame(labels, columns=['label'])
samples_node_ids = self._find_sample_nodes(samples_df, level_nodes)
for cell in self.cells_:
cell['representative'] = {}
# get all rows in cell
indexes = [i for i, x in enumerate(samples_node_ids) if x == cell['id']]
sample_rows = samples_df.iloc[indexes]
sample_labels = labels_df.iloc[indexes]['label'].values.tolist()
# get rows with matching label
indexes = [i for i, label in enumerate(sample_labels) if label == cell['label']]
match_samples = sample_rows.iloc[indexes]
# find the "middle" of the cluster
array = match_samples.values
median = np.median(array, axis=0)
# find the record closest to the median
i = 0
min = len(array)
min_dist = float("inf")
for row in array:
dist = distance.euclidean(row, median)
if dist < min_dist:
min_dist = dist
min = i
i = i + 1
row = match_samples.iloc[min]
# use its values as the representative
for feature in cell['ranges'].keys():
cell['representative'][feature] = row[feature].item()
def _find_sample_nodes(self, samples, nodes):
paths = self.dt_.decision_path(samples).toarray()
nodeSet = set(nodes)
return [(list(set([i for i, v in enumerate(p) if v == 1]) & nodeSet))[0] for p in paths]
def _generalize(self, data, level_nodes, cells, cells_by_id):
representatives = pd.DataFrame(columns=self._features) # empty except for columns
generalized = pd.DataFrame(data, columns=self._features, copy=True) # original data
mapping_to_cells = self._map_to_cells(generalized, level_nodes, cells_by_id)
# iterate over cells (leaves in decision tree)
for i in range(len(cells)):
# This code just copies the representatives from the cells into another data structure
# iterate over features
for feature in self._features:
# if feature has a representative value in the cell and should not be left untouched,
# take the representative value
if feature in cells[i]['representative'] and ('untouched' not in cells[i] or
feature not in cells[i]['untouched']):
representatives.loc[i, feature] = cells[i]['representative'][feature]
# else, drop the feature (removes from representatives columns that do not have a
# representative value or should remain untouched)
elif feature in representatives.columns.tolist():
representatives = representatives.drop(feature, axis=1)
# get the indexes of all records that map to this cell
indexes = [j for j in range(len(mapping_to_cells)) if mapping_to_cells[j]['id'] == cells[i]['id']]
# replaces the values in the representative columns with the representative values
# (leaves others untouched)
if not representatives.columns.empty:
if len(indexes) > 1:
replace = pd.concat([representatives.loc[i].to_frame().T]*len(indexes)).reset_index(drop=True)
replace.index = indexes
else:
replace = representatives.loc[i].to_frame().T
generalized.loc[indexes, representatives.columns] = replace
return generalized.to_numpy()
def _map_to_cells(self, samples, nodes, cells_by_id):
mapping_to_cells = []
for index, row in samples.iterrows():
cell = self._find_sample_cells([row], nodes, cells_by_id)[0]
mapping_to_cells.append(cell)
return mapping_to_cells
def _find_sample_cells(self, samples, nodes, cells_by_id):
node_ids = self._find_sample_nodes(samples, nodes)
return [cells_by_id[nodeId] for nodeId in node_ids]
def _remove_feature_from_generalization(self, samples, nodes, labels, feature_data):
feature = self._get_feature_to_remove(samples, nodes, labels, feature_data)
if not feature:
return None
GeneralizeToRepresentative._remove_feature_from_cells(self.cells_, self.cells_by_id_, feature)
return feature
def _get_feature_to_remove(self, samples, nodes, labels, feature_data):
# We want to remove features with low iLoss (NCP) and high accuracy gain
# (after removing them)
ranges = self.generalizations_['ranges']
range_counts = self._find_range_count(samples, ranges)
total = samples.size
range_min = sys.float_info.max
remove_feature = None
for feature in ranges.keys():
if feature not in self.generalizations_['untouched']:
feature_ncp = self._calc_ncp_numeric(ranges[feature],
range_counts[feature],
feature_data[feature],
total)
if feature_ncp > 0:
# divide by accuracy gain
new_cells = copy.deepcopy(self.cells_)
cells_by_id = copy.deepcopy(self.cells_by_id_)
GeneralizeToRepresentative._remove_feature_from_cells(new_cells, cells_by_id, feature)
generalized = self._generalize(samples, nodes, new_cells, cells_by_id)
accuracy = self.estimator.score(generalized, labels)
feature_ncp = feature_ncp / accuracy
if feature_ncp < range_min:
range_min = feature_ncp
remove_feature = feature
print('feature to remove: ' + (remove_feature if remove_feature else ''))
return remove_feature
def _calculate_generalizations(self):
self.generalizations_ = {'ranges': GeneralizeToRepresentative._calculate_ranges(self.cells_),
'untouched': GeneralizeToRepresentative._calculate_untouched(self.cells_)}
def _find_range_count(self, samples, ranges):
samples_df = pd.DataFrame(samples, columns=self._features)
range_counts = {}
last_value = None
for r in ranges.keys():
range_counts[r] = []
# if empty list, all samples should be counted
if not ranges[r]:
range_counts[r].append(samples_df.shape[0])
else:
for value in ranges[r]:
range_counts[r].append(len(samples_df.loc[samples_df[r] <= value]))
last_value = value
range_counts[r].append(len(samples_df.loc[samples_df[r] > last_value]))
return range_counts
def _calculate_ncp(self, samples, generalizations, feature_data):
# suppressed features are already taken care of within _calc_ncp_numeric
ranges = generalizations['ranges']
range_counts = self._find_range_count(samples, ranges)
total = samples.shape[0]
total_ncp = 0
total_features = len(generalizations['untouched'])
for feature in ranges.keys():
feature_ncp = GeneralizeToRepresentative._calc_ncp_numeric(ranges[feature], range_counts[feature], feature_data[feature], total)
total_ncp = total_ncp + feature_ncp
total_features += 1
if total_features == 0:
return 0
return total_ncp / total_features
@staticmethod
def _calculate_ranges(cells):
ranges = {}
for cell in cells:
for feature in [key for key in cell['ranges'].keys() if
'untouched' not in cell or key not in cell['untouched']]:
if feature not in ranges.keys():
ranges[feature] = []
if cell['ranges'][feature]['start'] is not None:
ranges[feature].append(cell['ranges'][feature]['start'])
if cell['ranges'][feature]['end'] is not None:
ranges[feature].append(cell['ranges'][feature]['end'])
for feature in ranges.keys():
ranges[feature] = list(set(ranges[feature]))
ranges[feature].sort()
return ranges
@staticmethod
def _calculate_untouched(cells):
untouched_lists = [cell['untouched'] if 'untouched' in cell else [] for cell in cells]
untouched = set(untouched_lists[0])
untouched = untouched.intersection(*untouched_lists)
return list(untouched)
@staticmethod
def _calc_ncp_numeric(feature_range, range_count, feature_data, total):
# if there are no ranges, feature is suppressed and iLoss is 1
if not feature_range:
return 1
# range only contains the split values, need to add min and max value of feature
# to enable computing sizes of all ranges
new_range = [feature_data['min']] + feature_range + [feature_data['max']]
range_sizes = [b - a for a, b in zip(new_range[::1], new_range[1::1])]
normalized_range_sizes = [s * n / total for s, n in zip(range_sizes, range_count)]
average_range_size = sum(normalized_range_sizes) / len(normalized_range_sizes)
return average_range_size / (feature_data['max'] - feature_data['min'])
@staticmethod
def _remove_feature_from_cells(cells, cells_by_id, feature):
for cell in cells:
if 'untouched' not in cell:
cell['untouched'] = []
if feature in cell['ranges'].keys():
del cell['ranges'][feature]
else:
del cell['categories'][feature]
cell['untouched'].append(feature)
cells_by_id[cell['id']] = cell.copy()

View file

@ -2,7 +2,7 @@ from sklearn import datasets, model_selection
import sklearn.preprocessing
import pandas as pd
import ssl
from os import path
from os import path, mkdir
from six.moves.urllib.request import urlretrieve
@ -40,9 +40,13 @@ def get_adult_dataset():
'label']
train_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'
test_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test'
data_dir = '../datasets/adult'
train_file = '../datasets/adult/train'
test_file = '../datasets/adult/test'
if not path.exists(data_dir):
mkdir(data_dir)
ssl._create_default_https_context = ssl._create_unverified_context
if not path.exists(train_file):
urlretrieve(train_url, train_file)
@ -139,8 +143,12 @@ def get_nursery_dataset(raw: bool = True, test_set: float = 0.2, transform_socia
:return: Dataset and labels as pandas dataframes.
"""
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/nursery/nursery.data'
data_dir = '../datasets/nursery'
data_file = '../datasets/nursery/data'
if not path.exists(data_dir):
mkdir(data_dir)
ssl._create_default_https_context = ssl._create_unverified_context
if not path.exists(data_file):
urlretrieve(url, data_file)

4
datasets/.gitignore vendored Normal file
View file

@ -0,0 +1,4 @@
# Ignore everything in this directory
*
# Except this file
!.gitignore

View file

@ -22,8 +22,9 @@ copyright = '2021, IBM'
author = 'Abigail Goldsteen'
# The full version, including alpha/beta/rc tags
release = '0.0.1'
release = '0.0.3'
master_doc = 'index'
# -- General configuration ---------------------------------------------------

View file

@ -8,12 +8,16 @@ Welcome to ai-privacy-toolkit's documentation!
This project provides tools for assessing and improving the privacy and compliance of AI models.
The first release of this toolkit contains a single module called anonymization. This
module contains methods for anonymizing ML model training data, so that when
a model is retrained on the anonymized data, the model itself will also be considered
anonymous. This may help exempt the model from different obligations and restrictions
The anonymization module contains methods for anonymizing ML model
training data, so that when a model is retrained on the anonymized data, the model itself will also be
considered anonymous. This may help exempt the model from different obligations and restrictions
set out in data protection regulations such as GDPR, CCPA, etc.
The minimization module contains methods to help adhere to the data
minimization principle in GDPR for ML models. It enables reducing the amount of
personal data needed to perform predictions with a machine learning model, while still enabling the model
to make accurate predictions. This is done by removing or generalizing some of the input features.
.. toctree::
:maxdepth: 2
:caption: Getting Started:

View file

@ -8,15 +8,15 @@ apt.anonymization.anonymizer module
-----------------------------------
.. automodule:: apt.anonymization.anonymizer
:members:
:undoc-members:
:show-inheritance:
:members:
:undoc-members:
:show-inheritance:
Module contents
---------------
.. automodule:: apt.anonymization
:members:
:undoc-members:
:show-inheritance:
:members:
:undoc-members:
:show-inheritance:

View file

@ -5,9 +5,9 @@ Subpackages
-----------
.. toctree::
:maxdepth: 4
apt.anonymization
apt.anonymization
apt.minimization
Submodules
----------
@ -16,15 +16,15 @@ apt.utils module
----------------
.. automodule:: apt.utils
:members:
:undoc-members:
:show-inheritance:
:members:
:undoc-members:
:show-inheritance:
Module contents
---------------
.. automodule:: apt
:members:
:undoc-members:
:show-inheritance:
:members:
:undoc-members:
:show-inheritance:

View file

@ -8,15 +8,23 @@ tests.test\_anonymizer module
-----------------------------
.. automodule:: tests.test_anonymizer
:members:
:undoc-members:
:show-inheritance:
:members:
:undoc-members:
:show-inheritance:
tests.test\_minimizer module
----------------------------
.. automodule:: tests.test_minimizer
:members:
:undoc-members:
:show-inheritance:
Module contents
---------------
.. automodule:: tests
:members:
:undoc-members:
:show-inheritance:
:members:
:undoc-members:
:show-inheritance:

View file

@ -1,4 +1,4 @@
numpy==1.19.0
numpy==1.9.0
pandas==1.1.0
scipy==1.4.1
scikit-learn==0.22.2

tests/test_minimizer.py

@ -0,0 +1,64 @@
import pytest
import numpy as np

from sklearn.datasets import load_boston

from apt.minimization import GeneralizeToRepresentative
from sklearn.tree import DecisionTreeClassifier


@pytest.fixture
def data():
    return load_boston(return_X_y=True)


def test_minimizer_params(data):
    # Assume two features, age and height, and boolean label
    cells = [{"id": 1, "ranges": {"age": {"start": None, "end": 38}, "height": {"start": None, "end": 170}}, "label": 0,
              "representative": {"age": 26, "height": 149}},
             {"id": 2, "ranges": {"age": {"start": 39, "end": None}, "height": {"start": None, "end": 170}}, "label": 1,
              "representative": {"age": 58, "height": 163}},
             {"id": 3, "ranges": {"age": {"start": None, "end": 38}, "height": {"start": 171, "end": None}}, "label": 0,
              "representative": {"age": 31, "height": 184}},
             {"id": 4, "ranges": {"age": {"start": 39, "end": None}, "height": {"start": 171, "end": None}}, "label": 1,
              "representative": {"age": 45, "height": 176}}
             ]
    features = ['age', 'height']
    X = np.array([[23, 165],
                  [45, 158],
                  [18, 190]])
    print(X.dtype)
    y = [1,1,0]
    base_est = DecisionTreeClassifier()
    base_est.fit(X, y)

    gen = GeneralizeToRepresentative(base_est, features=features, cells=cells)
    gen.fit()
    transformed = gen.transform(X)
    print(transformed)


def test_minimizer_fit(data):
    features = ['age', 'height']
    X = np.array([[23, 165],
                  [45, 158],
                  [56, 123],
                  [67, 154],
                  [45, 149],
                  [42, 166],
                  [73, 172],
                  [94, 168],
                  [69, 175],
                  [24, 181],
                  [18, 190]])
    print(X)
    y = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0]
    base_est = DecisionTreeClassifier()
    base_est.fit(X, y)
    predictions = base_est.predict(X)

    gen = GeneralizeToRepresentative(base_est, features=features, target_accuracy=0.5)
    gen.fit(X, predictions)
    transformed = gen.transform(X)
    print(X)
    print(transformed)
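
The `cells` structure in `test_minimizer_params` pairs per-feature ranges with a representative value. The sketch below illustrates how such cells could map rows to representatives using NumPy alone. It is not the toolkit's implementation: `generalize_rows` is a hypothetical helper, and it assumes that a bound of `None` means unbounded and that both bounds are inclusive.

```python
import numpy as np


def generalize_rows(X, features, cells):
    """Replace each row with the representative of the first cell that contains it.

    Hypothetical helper for illustration; not part of apt.minimization.
    """
    def in_range(value, bounds):
        start, end = bounds["start"], bounds["end"]
        return (start is None or value >= start) and (end is None or value <= end)

    out = np.array(X, copy=True)
    for i, row in enumerate(X):
        for cell in cells:
            if all(in_range(row[features.index(f)], r) for f, r in cell["ranges"].items()):
                out[i] = [cell["representative"][f] for f in features]
                break  # rows that match no cell are left unchanged
    return out


features = ["age", "height"]
cells = [
    {"id": 1, "ranges": {"age": {"start": None, "end": 38}, "height": {"start": None, "end": 170}},
     "label": 0, "representative": {"age": 26, "height": 149}},
    {"id": 2, "ranges": {"age": {"start": 39, "end": None}, "height": {"start": None, "end": 170}},
     "label": 1, "representative": {"age": 58, "height": 163}},
    {"id": 3, "ranges": {"age": {"start": None, "end": 38}, "height": {"start": 171, "end": None}},
     "label": 0, "representative": {"age": 31, "height": 184}},
    {"id": 4, "ranges": {"age": {"start": 39, "end": None}, "height": {"start": 171, "end": None}},
     "label": 1, "representative": {"age": 45, "height": 176}},
]
X = np.array([[23, 165],
              [45, 158],
              [18, 190]])
transformed = generalize_rows(X, features, cells)
print(transformed)  # rows become [26, 149], [58, 163], [31, 184]
```

Under these assumptions the three sample rows fall into cells 1, 2 and 3 respectively, so a test could assert on the transformed values instead of only printing them.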