ai-privacy-toolkit/notebooks/attribute_inference_anonymization_nursery.ipynb

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Using ML anonymization to defend against attribute inference attacks"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this tutorial we will show how to anonymize models using the ML anonymization module. \n",
"\n",
"We will demonstrate running inference attacks both on a vanilla model, and then on different anonymized versions of the model. We will run both black-box and white-box attribute inference attacks using ART's inference module (https://github.com/Trusted-AI/adversarial-robustness-toolbox/tree/main/art/attacks/inference). \n",
"\n",
"This will be demonstrated using the Nursery dataset (the original dataset can be found here: https://archive.ics.uci.edu/ml/datasets/nursery). \n",
"\n",
"The sensitive feature we are trying to infer is the 'social' feature, after turning it into a binary feature (the original value 'problematic' receives the new value 1 and the rest 0). We also preprocess the data such that all categorical features are one-hot encoded."
]
},
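The preprocessing described above can be sketched directly with pandas and scikit-learn. This is an illustrative sketch with made-up rows following the Nursery schema; `get_nursery_dataset` may implement it differently.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy rows following the Nursery schema (illustrative values only)
df = pd.DataFrame({
    "parents": ["usual", "pretentious"],
    "social": ["problematic", "nonprob"],
    "health": ["not_recom", "priority"],
})

# Binarize the sensitive feature: 'problematic' -> 1, everything else -> 0
df["social"] = (df["social"] == "problematic").astype(int)

# One-hot encode the remaining categorical features
encoded = OneHotEncoder().fit_transform(df[["parents", "health"]]).toarray()
print(df["social"].tolist())  # [1, 0]
print(encoded.shape)          # (2, 4)
```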
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load data"
]
},
{
"cell_type": "code",
"execution_count": 136,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": " parents has_nurs form children housing finance \\\n8450 pretentious very_crit foster 1 less_conv convenient \n12147 great_pret very_crit complete 1 critical inconv \n2780 usual critical complete 4 less_conv convenient \n11924 great_pret critical foster 1 critical convenient \n59 usual proper complete 2 convenient convenient \n... ... ... ... ... ... ... \n5193 pretentious less_proper complete 1 convenient inconv \n1375 usual less_proper incomplete 2 less_conv convenient \n10318 great_pret less_proper foster 4 convenient convenient \n6396 pretentious improper completed 3 less_conv convenient \n485 usual proper incomplete 1 critical inconv \n\n social health \n8450 1 not_recom \n12147 1 recommended \n2780 1 not_recom \n11924 1 not_recom \n59 0 not_recom \n... ... ... \n5193 0 recommended \n1375 1 priority \n10318 0 priority \n6396 1 recommended \n485 1 not_recom \n\n[10366 rows x 8 columns]",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>parents</th>\n <th>has_nurs</th>\n <th>form</th>\n <th>children</th>\n <th>housing</th>\n <th>finance</th>\n <th>social</th>\n <th>health</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>8450</th>\n <td>pretentious</td>\n <td>very_crit</td>\n <td>foster</td>\n <td>1</td>\n <td>less_conv</td>\n <td>convenient</td>\n <td>1</td>\n <td>not_recom</td>\n </tr>\n <tr>\n <th>12147</th>\n <td>great_pret</td>\n <td>very_crit</td>\n <td>complete</td>\n <td>1</td>\n <td>critical</td>\n <td>inconv</td>\n <td>1</td>\n <td>recommended</td>\n </tr>\n <tr>\n <th>2780</th>\n <td>usual</td>\n <td>critical</td>\n <td>complete</td>\n <td>4</td>\n <td>less_conv</td>\n <td>convenient</td>\n <td>1</td>\n <td>not_recom</td>\n </tr>\n <tr>\n <th>11924</th>\n <td>great_pret</td>\n <td>critical</td>\n <td>foster</td>\n <td>1</td>\n <td>critical</td>\n <td>convenient</td>\n <td>1</td>\n <td>not_recom</td>\n </tr>\n <tr>\n <th>59</th>\n <td>usual</td>\n <td>proper</td>\n <td>complete</td>\n <td>2</td>\n <td>convenient</td>\n <td>convenient</td>\n <td>0</td>\n <td>not_recom</td>\n </tr>\n <tr>\n <th>...</th>\n <td>...</td>\n <td>...</td>\n <td>...</td>\n <td>...</td>\n <td>...</td>\n <td>...</td>\n <td>...</td>\n <td>...</td>\n </tr>\n <tr>\n <th>5193</th>\n <td>pretentious</td>\n <td>less_proper</td>\n <td>complete</td>\n <td>1</td>\n <td>convenient</td>\n <td>inconv</td>\n <td>0</td>\n <td>recommended</td>\n </tr>\n <tr>\n <th>1375</th>\n <td>usual</td>\n <td>less_proper</td>\n <td>incomplete</td>\n <td>2</td>\n <td>less_conv</td>\n <td>convenient</td>\n <td>1</td>\n <td>priority</td>\n </tr>\n <tr>\n <th>10318</th>\n <td>great_pret</td>\n <td>less_proper</td>\n 
<td>foster</td>\n <td>4</td>\n <td>convenient</td>\n <td>convenient</td>\n <td>0</td>\n <td>priority</td>\n </tr>\n <tr>\n <th>6396</th>\n <td>pretentious</td>\n <td>improper</td>\n <td>completed</td>\n <td>3</td>\n <td>less_conv</td>\n <td>convenient</td>\n <td>1</td>\n <td>recommended</td>\n </tr>\n <tr>\n <th>485</th>\n <td>usual</td>\n <td>proper</td>\n <td>incomplete</td>\n <td>1</td>\n <td>critical</td>\n <td>inconv</td>\n <td>1</td>\n <td>not_recom</td>\n </tr>\n </tbody>\n</table>\n<p>10366 rows × 8 columns</p>\n</div>"
},
"execution_count": 136,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import os\n",
"import sys\n",
"sys.path.insert(0, os.path.abspath('..'))\n",
"\n",
"from apt.utils.dataset_utils import get_nursery_dataset\n",
"\n",
"(x_train, y_train), (x_test, y_test) = get_nursery_dataset(transform_social=True)\n",
"\n",
"x_train"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Train decision tree model"
]
},
{
"cell_type": "code",
"execution_count": 137,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Base model accuracy: 0.9969135802469136\n"
]
}
],
"source": [
"from sklearn.tree import DecisionTreeClassifier\n",
"from art.estimators.classification.scikitlearn import ScikitlearnDecisionTreeClassifier\n",
"from sklearn.preprocessing import OneHotEncoder\n",
"\n",
"x_train_str = x_train.astype(str)\n",
"encoder = OneHotEncoder(sparse=False)\n",
"train_encoded = encoder.fit_transform(x_train_str)\n",
"x_test_str = x_test.astype(str)\n",
"# reuse the encoder fitted on the training data so train/test columns match\n",
"test_encoded = encoder.transform(x_test_str)\n",
" \n",
"model = DecisionTreeClassifier()\n",
"model.fit(train_encoded, y_train)\n",
"\n",
"art_classifier = ScikitlearnDecisionTreeClassifier(model)\n",
"\n",
"print('Base model accuracy: ', model.score(test_encoded, y_test))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Attack\n",
"### Black-box attack\n",
"The black-box attack trains an additional classifier (called the attack model) to predict the attacked feature's value from the remaining n-1 features together with the original (attacked) model's predictions.\n",
"#### Train attack model"
]
},
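Independent of ART's implementation, the attack-model idea can be sketched on synthetic data: train a second classifier whose inputs are the known features plus the target model's prediction, and whose label is the attacked feature. All names and the toy data below are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 5)).astype(float)  # toy binary features
y = (X.sum(axis=1) > 2).astype(int)                  # toy labels
attack_feature = 2                                   # feature the attacker wants

target_model = DecisionTreeClassifier().fit(X, y)

# Attack-model inputs: the n-1 known features + the target model's prediction
X_known = np.delete(X, attack_feature, axis=1)
preds = target_model.predict(X).reshape(-1, 1)
attack_X = np.concatenate([X_known, preds], axis=1)
attack_y = X[:, attack_feature]  # the value the attacker wants to infer

attack_model = DecisionTreeClassifier().fit(attack_X, attack_y)
inferred = attack_model.predict(attack_X)
print("attack accuracy:", (inferred == attack_y).mean())
```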
{
"cell_type": "code",
"execution_count": 138,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from art.attacks.inference.attribute_inference import AttributeInferenceBlackBox\n",
"\n",
"attack_feature = 20  # index of the attacked (sensitive) feature in the encoded data\n",
"\n",
"# training data without attacked feature\n",
"x_train_for_attack = np.delete(train_encoded, attack_feature, 1)\n",
"# only attacked feature\n",
"x_train_feature = train_encoded[:, attack_feature].copy().reshape(-1, 1)\n",
"\n",
"bb_attack = AttributeInferenceBlackBox(art_classifier, attack_feature=attack_feature)\n",
"\n",
"# get original model's predictions\n",
"x_train_predictions = np.array([np.argmax(arr) for arr in art_classifier.predict(train_encoded)]).reshape(-1,1)\n",
"\n",
"# use half of training set for training the attack\n",
"attack_train_ratio = 0.5\n",
"attack_train_size = int(len(train_encoded) * attack_train_ratio)\n",
"\n",
"# train attack model\n",
"bb_attack.fit(train_encoded[:attack_train_size])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Infer sensitive feature and check accuracy"
]
},
{
"cell_type": "code",
"execution_count": 139,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1.0\n"
]
}
],
"source": [
"# get inferred values\n",
"values=[0, 1]\n",
"\n",
"inferred_train_bb = bb_attack.infer(x_train_for_attack[attack_train_size:], x_train_predictions[attack_train_size:], values=values)\n",
"# check accuracy\n",
"train_acc = np.sum(inferred_train_bb == np.around(x_train_feature[attack_train_size:], decimals=8).reshape(1,-1)) / len(inferred_train_bb)\n",
"print(train_acc)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This means that the attacked feature is inferred correctly for 100% of the attacked training samples using this attack (the printed accuracy above)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## White-box attack\n",
"This attack does not train any additional model; instead, it uses additional information encoded within the attacked decision tree model to compute the probability of each value of the attacked feature, and outputs the value with the highest probability."
]
},
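A simplified sketch of this idea (not ART's actual code): for each candidate value, complete the record, find the tree leaf the completed record lands in, and score the candidate by that leaf's training-sample count weighted by the value's prior. The helper `infer_feature` and the toy data are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 4)).astype(float)
y = (X[:, 0] + X[:, 3] > 1).astype(int)
tree = DecisionTreeClassifier(max_depth=4).fit(X, y)

def infer_feature(tree, x_known, attack_feature, values, priors):
    """Score each candidate value by leaf sample count * prior (simplified)."""
    scores = []
    for v, prior in zip(values, priors):
        candidate = np.insert(x_known, attack_feature, v).reshape(1, -1)
        leaf = tree.apply(candidate)[0]
        # number of training samples that ended up in this leaf
        phi = tree.tree_.n_node_samples[leaf]
        scores.append(phi * prior)
    return values[int(np.argmax(scores))]

attack_feature = 3
x_known = np.delete(X[0], attack_feature)
guess = infer_feature(tree, x_known, attack_feature, values=[0, 1], priors=[0.5, 0.5])
print(guess)
```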
{
"cell_type": "code",
"execution_count": 140,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.5076210688790276\n"
]
}
],
"source": [
"from art.attacks.inference.attribute_inference import AttributeInferenceWhiteBoxDecisionTree\n",
"\n",
"priors = [6925 / 10366, 3441 / 10366]  # empirical distribution of the two feature values in the training set\n",
"\n",
"wb2_attack = AttributeInferenceWhiteBoxDecisionTree(art_classifier, attack_feature=attack_feature)\n",
"\n",
"# get inferred values\n",
"inferred_train_wb2 = wb2_attack.infer(x_train_for_attack, x_train_predictions, values=values, priors=priors)\n",
"\n",
"# check accuracy\n",
"train_acc = np.sum(inferred_train_wb2 == np.around(x_train_feature, decimals=8).reshape(1,-1)) / len(inferred_train_wb2)\n",
"print(train_acc)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The white-box attack is able to correctly infer the attacked feature value for about 51% of the training set. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Anonymized data\n",
"## k=100\n",
"\n",
"Now we will apply the same attacks on an anonymized version of the same dataset (k=100). The data is anonymized on the quasi-identifiers: finance, social, health.\n",
"\n",
"k=100 means that each record in the anonymized dataset is identical to at least 99 others on the quasi-identifier values (i.e., when looking only at those 3 features, the records are indistinguishable)."
]
},
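The k-anonymity property itself is easy to verify with pandas: group by the quasi-identifier columns and check that every group contains at least k rows. A minimal sketch (not the toolkit's API; the toy data is illustrative):

```python
import pandas as pd

def is_k_anonymous(df, quasi_identifiers, k):
    """True if every combination of QI values occurs at least k times."""
    group_sizes = df.groupby(quasi_identifiers).size()
    return bool((group_sizes >= k).all())

df = pd.DataFrame({
    "finance": ["convenient"] * 4 + ["inconv"] * 4,
    "social":  [0, 0, 0, 0, 1, 1, 1, 1],
    "health":  ["not_recom"] * 8,
    "parents": ["usual", "pretentious"] * 4,  # not a QI, may vary freely
})
print(is_k_anonymous(df, ["finance", "social", "health"], k=4))  # True
print(is_k_anonymous(df, ["finance", "social", "health"], k=5))  # False
```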
{
"cell_type": "code",
"execution_count": 141,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": " parents has_nurs form children housing finance \\\n0 pretentious very_crit foster 1 less_conv convenient \n1 great_pret very_crit complete 1 critical inconv \n2 usual critical complete 4 less_conv convenient \n3 great_pret critical foster 1 critical convenient \n4 usual proper complete 2 convenient convenient \n... ... ... ... ... ... ... \n10361 pretentious less_proper complete 1 convenient inconv \n10362 usual less_proper incomplete 2 less_conv convenient \n10363 great_pret less_proper foster 4 convenient convenient \n10364 pretentious improper completed 3 less_conv convenient \n10365 usual proper incomplete 1 critical convenient \n\n social health \n0 0 not_recom \n1 1 recommended \n2 0 not_recom \n3 0 not_recom \n4 0 not_recom \n... ... ... \n10361 0 recommended \n10362 1 priority \n10363 0 priority \n10364 1 recommended \n10365 0 not_recom \n\n[10366 rows x 8 columns]",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>parents</th>\n <th>has_nurs</th>\n <th>form</th>\n <th>children</th>\n <th>housing</th>\n <th>finance</th>\n <th>social</th>\n <th>health</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>pretentious</td>\n <td>very_crit</td>\n <td>foster</td>\n <td>1</td>\n <td>less_conv</td>\n <td>convenient</td>\n <td>0</td>\n <td>not_recom</td>\n </tr>\n <tr>\n <th>1</th>\n <td>great_pret</td>\n <td>very_crit</td>\n <td>complete</td>\n <td>1</td>\n <td>critical</td>\n <td>inconv</td>\n <td>1</td>\n <td>recommended</td>\n </tr>\n <tr>\n <th>2</th>\n <td>usual</td>\n <td>critical</td>\n <td>complete</td>\n <td>4</td>\n <td>less_conv</td>\n <td>convenient</td>\n <td>0</td>\n <td>not_recom</td>\n </tr>\n <tr>\n <th>3</th>\n <td>great_pret</td>\n <td>critical</td>\n <td>foster</td>\n <td>1</td>\n <td>critical</td>\n <td>convenient</td>\n <td>0</td>\n <td>not_recom</td>\n </tr>\n <tr>\n <th>4</th>\n <td>usual</td>\n <td>proper</td>\n <td>complete</td>\n <td>2</td>\n <td>convenient</td>\n <td>convenient</td>\n <td>0</td>\n <td>not_recom</td>\n </tr>\n <tr>\n <th>...</th>\n <td>...</td>\n <td>...</td>\n <td>...</td>\n <td>...</td>\n <td>...</td>\n <td>...</td>\n <td>...</td>\n <td>...</td>\n </tr>\n <tr>\n <th>10361</th>\n <td>pretentious</td>\n <td>less_proper</td>\n <td>complete</td>\n <td>1</td>\n <td>convenient</td>\n <td>inconv</td>\n <td>0</td>\n <td>recommended</td>\n </tr>\n <tr>\n <th>10362</th>\n <td>usual</td>\n <td>less_proper</td>\n <td>incomplete</td>\n <td>2</td>\n <td>less_conv</td>\n <td>convenient</td>\n <td>1</td>\n <td>priority</td>\n </tr>\n <tr>\n <th>10363</th>\n <td>great_pret</td>\n <td>less_proper</td>\n <td>foster</td>\n 
<td>4</td>\n <td>convenient</td>\n <td>convenient</td>\n <td>0</td>\n <td>priority</td>\n </tr>\n <tr>\n <th>10364</th>\n <td>pretentious</td>\n <td>improper</td>\n <td>completed</td>\n <td>3</td>\n <td>less_conv</td>\n <td>convenient</td>\n <td>1</td>\n <td>recommended</td>\n </tr>\n <tr>\n <th>10365</th>\n <td>usual</td>\n <td>proper</td>\n <td>incomplete</td>\n <td>1</td>\n <td>critical</td>\n <td>convenient</td>\n <td>0</td>\n <td>not_recom</td>\n </tr>\n </tbody>\n</table>\n<p>10366 rows × 8 columns</p>\n</div>"
},
"execution_count": 141,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from apt.utils.datasets import ArrayDataset\n",
"from apt.anonymization import Anonymize\n",
"\n",
"features = x_train.columns\n",
"QI = [\"finance\", \"social\", \"health\"]\n",
"categorical_features = [\"parents\", \"has_nurs\", \"form\", \"housing\", \"finance\", \"health\", 'children']\n",
"QI_indexes = [i for i, v in enumerate(features) if v in QI]\n",
"categorical_features_indexes = [i for i, v in enumerate(features) if v in categorical_features]\n",
"anonymizer = Anonymize(100, QI_indexes, categorical_features=categorical_features_indexes)\n",
"anon = anonymizer.anonymize(ArrayDataset(x_train, x_train_predictions))\n",
"anon\n"
]
},
{
"cell_type": "code",
"execution_count": 142,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": "7585"
},
"execution_count": 142,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# number of distinct rows in original data\n",
"len(x_train.drop_duplicates())"
]
},
{
"cell_type": "code",
"execution_count": 143,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": "5766"
},
"execution_count": 143,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# number of distinct rows in anonymized data\n",
"len(anon.drop_duplicates())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Train decision tree model"
]
},
{
"cell_type": "code",
"execution_count": 144,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Anonymized model accuracy: 0.9976851851851852\n"
]
}
],
"source": [
"anon_str = anon.astype(str)\n",
"anon_encoded = OneHotEncoder(sparse=False).fit_transform(anon_str)\n",
"\n",
"anon_model = DecisionTreeClassifier()\n",
"anon_model.fit(anon_encoded, y_train)\n",
"\n",
"anon_art_classifier = ScikitlearnDecisionTreeClassifier(anon_model)\n",
"\n",
"print('Anonymized model accuracy: ', anon_model.score(test_encoded, y_test))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Attack\n",
"### Black-box attack"
]
},
{
"cell_type": "code",
"execution_count": 145,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1.0\n"
]
}
],
"source": [
"anon_bb_attack = AttributeInferenceBlackBox(anon_art_classifier, attack_feature=attack_feature)\n",
"\n",
"# get original model's predictions\n",
"anon_x_train_predictions = np.array([np.argmax(arr) for arr in anon_art_classifier.predict(train_encoded)]).reshape(-1,1)\n",
"\n",
"# train attack model\n",
"anon_bb_attack.fit(train_encoded[:attack_train_size])\n",
"\n",
"# get inferred values\n",
"inferred_train_anon_bb = anon_bb_attack.infer(x_train_for_attack[attack_train_size:], anon_x_train_predictions[attack_train_size:], values=values)\n",
"# check accuracy\n",
"train_acc = np.sum(inferred_train_anon_bb == np.around(x_train_feature[attack_train_size:], decimals=8).reshape(1,-1)) / len(inferred_train_anon_bb)\n",
"print(train_acc)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### White-box attack"
]
},
{
"cell_type": "code",
"execution_count": 146,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.5218985143739148\n"
]
}
],
"source": [
"anon_wb2_attack = AttributeInferenceWhiteBoxDecisionTree(anon_art_classifier, attack_feature=attack_feature)\n",
"\n",
"# get inferred values\n",
"inferred_train_anon_wb2 = anon_wb2_attack.infer(x_train_for_attack, anon_x_train_predictions, values=values, priors=priors)\n",
"\n",
"# check accuracy\n",
"anon_train_acc = np.sum(inferred_train_anon_wb2 == np.around(x_train_feature, decimals=8).reshape(1,-1)) / len(inferred_train_anon_wb2)\n",
"print(anon_train_acc)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The accuracy of the attacks remains more or less the same. Let's check the precision and recall for each case:"
]
},
{
"cell_type": "code",
"execution_count": 147,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(0.49415432579890883, 0.48976438779451525)\n",
"(0.49415432579890883, 0.48976438779451525)\n"
]
}
],
"source": [
"def calc_precision_recall(predicted, actual, positive_value=1):\n",
" score = 0 # both predicted and actual are positive\n",
" num_positive_predicted = 0 # predicted positive\n",
" num_positive_actual = 0 # actual positive\n",
" for i in range(len(predicted)):\n",
" if predicted[i] == positive_value:\n",
" num_positive_predicted += 1\n",
" if actual[i] == positive_value:\n",
" num_positive_actual += 1\n",
" if predicted[i] == actual[i]:\n",
" if predicted[i] == positive_value:\n",
" score += 1\n",
" \n",
" if num_positive_predicted == 0:\n",
" precision = 1\n",
" else:\n",
" precision = score / num_positive_predicted # the fraction of predicted “Yes” responses that are correct\n",
" if num_positive_actual == 0:\n",
" recall = 1\n",
" else:\n",
" recall = score / num_positive_actual # the fraction of “Yes” responses that are predicted correctly\n",
"\n",
" return precision, recall\n",
" \n",
"# black-box regular\n",
"print(calc_precision_recall(inferred_train_bb, x_train_feature))\n",
"# black-box anonymized\n",
"print(calc_precision_recall(inferred_train_anon_bb, x_train_feature))"
]
},
{
"cell_type": "code",
"execution_count": 148,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(0.9322033898305084, 0.01066925315227934)\n",
"(0.9806763285024155, 0.03937924345295829)\n"
]
}
],
"source": [
"# white-box regular\n",
"print(calc_precision_recall(inferred_train_wb2, x_train_feature))\n",
"# white-box anonymized\n",
"print(calc_precision_recall(inferred_train_anon_wb2, x_train_feature))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Precision and recall remain almost the same, sometimes even slightly increasing.\n",
"\n",
"Now let's see what happens when we increase k to 1000."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## k=1000\n",
"\n",
"Now we apply the attacks on an anonymized version of the same dataset (k=1000). The data has been anonymized on the quasi-identifiers: finance, social, health."
]
},
{
"cell_type": "code",
"execution_count": 149,
"metadata": {},
"outputs": [],
"source": [
"anonymizer2 = Anonymize(1000, QI_indexes, categorical_features=categorical_features_indexes)\n",
"anon2 = anonymizer2.anonymize(ArrayDataset(x_train, x_train_predictions))"
]
},
{
"cell_type": "code",
"execution_count": 150,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": "4226"
},
"execution_count": 150,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# number of distinct rows in anonymized data\n",
"len(anon2.drop_duplicates())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Train decision tree model"
]
},
{
"cell_type": "code",
"execution_count": 151,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Anonymized model accuracy: 0.9930555555555556\n"
]
}
],
"source": [
"anon2_str = anon2.astype(str)\n",
"anon2_encoded = OneHotEncoder(sparse=False).fit_transform(anon2_str)\n",
"\n",
"anon2_model = DecisionTreeClassifier()\n",
"anon2_model.fit(anon2_encoded, y_train)\n",
"\n",
"anon2_art_classifier = ScikitlearnDecisionTreeClassifier(anon2_model)\n",
"\n",
"print('Anonymized model accuracy: ', anon2_model.score(test_encoded, y_test))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Attack\n",
"### Black-box attack"
]
},
{
"cell_type": "code",
"execution_count": 152,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1.0\n"
]
}
],
"source": [
"anon2_bb_attack = AttributeInferenceBlackBox(anon2_art_classifier, attack_feature=attack_feature)\n",
"\n",
"# get original model's predictions\n",
"anon2_x_train_predictions = np.array([np.argmax(arr) for arr in anon2_art_classifier.predict(train_encoded)]).reshape(-1,1)\n",
"\n",
"# train attack model\n",
"anon2_bb_attack.fit(train_encoded[:attack_train_size])\n",
"\n",
"# get inferred values\n",
"inferred_train_anon2_bb = anon2_bb_attack.infer(x_train_for_attack[attack_train_size:], anon2_x_train_predictions[attack_train_size:], values=values)\n",
"# check accuracy\n",
"train_acc = np.sum(inferred_train_anon2_bb == np.around(x_train_feature[attack_train_size:], decimals=8).reshape(1,-1)) / len(inferred_train_anon2_bb)\n",
"print(train_acc)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### White-box attack"
]
},
{
"cell_type": "code",
"execution_count": 153,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.5184256222265098\n"
]
}
],
"source": [
"anon2_wb2_attack = AttributeInferenceWhiteBoxDecisionTree(anon2_art_classifier, attack_feature=attack_feature)\n",
"\n",
"# get inferred values\n",
"inferred_train_anon2_wb2 = anon2_wb2_attack.infer(x_train_for_attack, anon2_x_train_predictions, values=values, priors=priors)\n",
"\n",
"# check accuracy\n",
"train_acc = np.sum(inferred_train_anon2_wb2 == np.around(x_train_feature, decimals=8).reshape(1,-1)) / len(inferred_train_anon2_wb2)\n",
"print(train_acc)"
]
},
{
"cell_type": "code",
"execution_count": 154,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(0.49415432579890883, 0.48976438779451525)\n",
"(0.49415432579890883, 0.48976438779451525)\n",
"(0.9322033898305084, 0.01066925315227934)\n",
"(1.0, 0.03161978661493695)\n"
]
}
],
"source": [
"# black-box regular\n",
"print(calc_precision_recall(inferred_train_bb, x_train_feature))\n",
"# black-box anonymized\n",
"print(calc_precision_recall(inferred_train_anon2_bb, x_train_feature))\n",
"\n",
"# white-box regular\n",
"print(calc_precision_recall(inferred_train_wb2, x_train_feature))\n",
"# white-box anonymized\n",
"print(calc_precision_recall(inferred_train_anon2_wb2, x_train_feature))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With k=1000 the black-box attack results remain unchanged, while the white-box attack's precision and recall stay close to those of the original model, with recall remaining very low in both cases."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## k=100, all QI\n",
"Now let's see what happens if we define all 8 features in the Nursery dataset as quasi-identifiers."
]
},
{
"cell_type": "code",
"execution_count": 155,
"metadata": {},
"outputs": [
{
"ename": "TypeError",
"evalue": "argument must be a string or number",
"output_type": "error",
"traceback": [
"\u001B[0;31m---------------------------------------------------------------------------\u001B[0m",
"\u001B[0;31mTypeError\u001B[0m Traceback (most recent call last)",
"File \u001B[0;32m~/PycharmProjects/ai-privacy-toolkit-internal/venv/lib/python3.8/site-packages/sklearn/preprocessing/_label.py:112\u001B[0m, in \u001B[0;36m_encode\u001B[0;34m(values, uniques, encode, check_unknown)\u001B[0m\n\u001B[1;32m 111\u001B[0m \u001B[38;5;28;01mtry\u001B[39;00m:\n\u001B[0;32m--> 112\u001B[0m res \u001B[38;5;241m=\u001B[39m \u001B[43m_encode_python\u001B[49m\u001B[43m(\u001B[49m\u001B[43mvalues\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43muniques\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mencode\u001B[49m\u001B[43m)\u001B[49m\n\u001B[1;32m 113\u001B[0m \u001B[38;5;28;01mexcept\u001B[39;00m \u001B[38;5;167;01mTypeError\u001B[39;00m:\n",
"File \u001B[0;32m~/PycharmProjects/ai-privacy-toolkit-internal/venv/lib/python3.8/site-packages/sklearn/preprocessing/_label.py:60\u001B[0m, in \u001B[0;36m_encode_python\u001B[0;34m(values, uniques, encode)\u001B[0m\n\u001B[1;32m 59\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m uniques \u001B[38;5;129;01mis\u001B[39;00m \u001B[38;5;28;01mNone\u001B[39;00m:\n\u001B[0;32m---> 60\u001B[0m uniques \u001B[38;5;241m=\u001B[39m \u001B[38;5;28;43msorted\u001B[39;49m\u001B[43m(\u001B[49m\u001B[38;5;28;43mset\u001B[39;49m\u001B[43m(\u001B[49m\u001B[43mvalues\u001B[49m\u001B[43m)\u001B[49m\u001B[43m)\u001B[49m\n\u001B[1;32m 61\u001B[0m uniques \u001B[38;5;241m=\u001B[39m np\u001B[38;5;241m.\u001B[39marray(uniques, dtype\u001B[38;5;241m=\u001B[39mvalues\u001B[38;5;241m.\u001B[39mdtype)\n",
"\u001B[0;31mTypeError\u001B[0m: '<' not supported between instances of 'int' and 'str'",
"\nDuring handling of the above exception, another exception occurred:\n",
"\u001B[0;31mTypeError\u001B[0m Traceback (most recent call last)",
"Input \u001B[0;32mIn [155]\u001B[0m, in \u001B[0;36m<cell line: 4>\u001B[0;34m()\u001B[0m\n\u001B[1;32m 2\u001B[0m QI2_indexes \u001B[38;5;241m=\u001B[39m [i \u001B[38;5;28;01mfor\u001B[39;00m i, v \u001B[38;5;129;01min\u001B[39;00m \u001B[38;5;28menumerate\u001B[39m(features) \u001B[38;5;28;01mif\u001B[39;00m v \u001B[38;5;129;01min\u001B[39;00m QI2]\n\u001B[1;32m 3\u001B[0m anonymizer3 \u001B[38;5;241m=\u001B[39m Anonymize(\u001B[38;5;241m100\u001B[39m, QI2_indexes, categorical_features\u001B[38;5;241m=\u001B[39mcategorical_features_indexes)\n\u001B[0;32m----> 4\u001B[0m anon3 \u001B[38;5;241m=\u001B[39m \u001B[43manonymizer3\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43manonymize\u001B[49m\u001B[43m(\u001B[49m\u001B[43mArrayDataset\u001B[49m\u001B[43m(\u001B[49m\u001B[43mx_train\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mx_train_predictions\u001B[49m\u001B[43m)\u001B[49m\u001B[43m)\u001B[49m\n",
"File \u001B[0;32m~/PycharmProjects/ai-privacy-toolkit-internal/apt/anonymization/anonymizer.py:55\u001B[0m, in \u001B[0;36mAnonymize.anonymize\u001B[0;34m(self, dataset)\u001B[0m\n\u001B[1;32m 52\u001B[0m \u001B[38;5;28;01melse\u001B[39;00m:\n\u001B[1;32m 53\u001B[0m \u001B[38;5;28;01mraise\u001B[39;00m \u001B[38;5;167;01mValueError\u001B[39;00m(\u001B[38;5;124m'\u001B[39m\u001B[38;5;124mNo data provided\u001B[39m\u001B[38;5;124m'\u001B[39m)\n\u001B[0;32m---> 55\u001B[0m transformed \u001B[38;5;241m=\u001B[39m \u001B[38;5;28;43mself\u001B[39;49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43m_anonymize\u001B[49m\u001B[43m(\u001B[49m\u001B[43mdataset\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mget_samples\u001B[49m\u001B[43m(\u001B[49m\u001B[43m)\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mcopy\u001B[49m\u001B[43m(\u001B[49m\u001B[43m)\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mdataset\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mget_labels\u001B[49m\u001B[43m(\u001B[49m\u001B[43m)\u001B[49m\u001B[43m)\u001B[49m\n\u001B[1;32m 56\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m dataset\u001B[38;5;241m.\u001B[39mis_pandas:\n\u001B[1;32m 57\u001B[0m \u001B[38;5;28;01mreturn\u001B[39;00m pd\u001B[38;5;241m.\u001B[39mDataFrame(transformed, columns\u001B[38;5;241m=\u001B[39m\u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39m_features)\n",
"File \u001B[0;32m~/PycharmProjects/ai-privacy-toolkit-internal/apt/anonymization/anonymizer.py:68\u001B[0m, in \u001B[0;36mAnonymize._anonymize\u001B[0;34m(self, x, y)\u001B[0m\n\u001B[1;32m 66\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m \u001B[38;5;129;01mnot\u001B[39;00m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39mcategorical_features:\n\u001B[1;32m 67\u001B[0m \u001B[38;5;28;01mraise\u001B[39;00m \u001B[38;5;167;01mValueError\u001B[39;00m(\u001B[38;5;124m'\u001B[39m\u001B[38;5;124mwhen supplying an array with non-numeric data, categorical_features must be defined\u001B[39m\u001B[38;5;124m'\u001B[39m)\n\u001B[0;32m---> 68\u001B[0m x_prepared \u001B[38;5;241m=\u001B[39m \u001B[38;5;28;43mself\u001B[39;49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43m_modify_categorical_features\u001B[49m\u001B[43m(\u001B[49m\u001B[43mx_anonymizer_train\u001B[49m\u001B[43m)\u001B[49m\n\u001B[1;32m 69\u001B[0m \u001B[38;5;28;01melse\u001B[39;00m:\n\u001B[1;32m 70\u001B[0m x_prepared \u001B[38;5;241m=\u001B[39m x_anonymizer_train\n",
"File \u001B[0;32m~/PycharmProjects/ai-privacy-toolkit-internal/apt/anonymization/anonymizer.py:144\u001B[0m, in \u001B[0;36mAnonymize._modify_categorical_features\u001B[0;34m(self, x)\u001B[0m\n\u001B[1;32m 142\u001B[0m \u001B[38;5;28;01mdef\u001B[39;00m \u001B[38;5;21m_modify_categorical_features\u001B[39m(\u001B[38;5;28mself\u001B[39m, x):\n\u001B[1;32m 143\u001B[0m encoder \u001B[38;5;241m=\u001B[39m OneHotEncoder()\n\u001B[0;32m--> 144\u001B[0m one_hot_encoded \u001B[38;5;241m=\u001B[39m \u001B[43mencoder\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mfit_transform\u001B[49m\u001B[43m(\u001B[49m\u001B[43mx\u001B[49m\u001B[43m)\u001B[49m\n\u001B[1;32m 145\u001B[0m \u001B[38;5;28;01mreturn\u001B[39;00m one_hot_encoded\n",
"File \u001B[0;32m~/PycharmProjects/ai-privacy-toolkit-internal/venv/lib/python3.8/site-packages/sklearn/preprocessing/_encoders.py:372\u001B[0m, in \u001B[0;36mOneHotEncoder.fit_transform\u001B[0;34m(self, X, y)\u001B[0m\n\u001B[1;32m 352\u001B[0m \u001B[38;5;124;03m\"\"\"\u001B[39;00m\n\u001B[1;32m 353\u001B[0m \u001B[38;5;124;03mFit OneHotEncoder to X, then transform X.\u001B[39;00m\n\u001B[1;32m 354\u001B[0m \n\u001B[0;32m (...)\u001B[0m\n\u001B[1;32m 369\u001B[0m \u001B[38;5;124;03m Transformed input.\u001B[39;00m\n\u001B[1;32m 370\u001B[0m \u001B[38;5;124;03m\"\"\"\u001B[39;00m\n\u001B[1;32m 371\u001B[0m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39m_validate_keywords()\n\u001B[0;32m--> 372\u001B[0m \u001B[38;5;28;01mreturn\u001B[39;00m \u001B[38;5;28;43msuper\u001B[39;49m\u001B[43m(\u001B[49m\u001B[43m)\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mfit_transform\u001B[49m\u001B[43m(\u001B[49m\u001B[43mX\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43my\u001B[49m\u001B[43m)\u001B[49m\n",
"File \u001B[0;32m~/PycharmProjects/ai-privacy-toolkit-internal/venv/lib/python3.8/site-packages/sklearn/base.py:571\u001B[0m, in \u001B[0;36mTransformerMixin.fit_transform\u001B[0;34m(self, X, y, **fit_params)\u001B[0m\n\u001B[1;32m 567\u001B[0m \u001B[38;5;66;03m# non-optimized default implementation; override when a better\u001B[39;00m\n\u001B[1;32m 568\u001B[0m \u001B[38;5;66;03m# method is possible for a given clustering algorithm\u001B[39;00m\n\u001B[1;32m 569\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m y \u001B[38;5;129;01mis\u001B[39;00m \u001B[38;5;28;01mNone\u001B[39;00m:\n\u001B[1;32m 570\u001B[0m \u001B[38;5;66;03m# fit method of arity 1 (unsupervised transformation)\u001B[39;00m\n\u001B[0;32m--> 571\u001B[0m \u001B[38;5;28;01mreturn\u001B[39;00m \u001B[38;5;28;43mself\u001B[39;49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mfit\u001B[49m\u001B[43m(\u001B[49m\u001B[43mX\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[38;5;241;43m*\u001B[39;49m\u001B[38;5;241;43m*\u001B[39;49m\u001B[43mfit_params\u001B[49m\u001B[43m)\u001B[49m\u001B[38;5;241m.\u001B[39mtransform(X)\n\u001B[1;32m 572\u001B[0m \u001B[38;5;28;01melse\u001B[39;00m:\n\u001B[1;32m 573\u001B[0m \u001B[38;5;66;03m# fit method of arity 2 (supervised transformation)\u001B[39;00m\n\u001B[1;32m 574\u001B[0m \u001B[38;5;28;01mreturn\u001B[39;00m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39mfit(X, y, \u001B[38;5;241m*\u001B[39m\u001B[38;5;241m*\u001B[39mfit_params)\u001B[38;5;241m.\u001B[39mtransform(X)\n",
"File \u001B[0;32m~/PycharmProjects/ai-privacy-toolkit-internal/venv/lib/python3.8/site-packages/sklearn/preprocessing/_encoders.py:347\u001B[0m, in \u001B[0;36mOneHotEncoder.fit\u001B[0;34m(self, X, y)\u001B[0m\n\u001B[1;32m 330\u001B[0m \u001B[38;5;124;03m\"\"\"\u001B[39;00m\n\u001B[1;32m 331\u001B[0m \u001B[38;5;124;03mFit OneHotEncoder to X.\u001B[39;00m\n\u001B[1;32m 332\u001B[0m \n\u001B[0;32m (...)\u001B[0m\n\u001B[1;32m 344\u001B[0m \u001B[38;5;124;03mself\u001B[39;00m\n\u001B[1;32m 345\u001B[0m \u001B[38;5;124;03m\"\"\"\u001B[39;00m\n\u001B[1;32m 346\u001B[0m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39m_validate_keywords()\n\u001B[0;32m--> 347\u001B[0m \u001B[38;5;28;43mself\u001B[39;49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43m_fit\u001B[49m\u001B[43m(\u001B[49m\u001B[43mX\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mhandle_unknown\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[38;5;28;43mself\u001B[39;49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mhandle_unknown\u001B[49m\u001B[43m)\u001B[49m\n\u001B[1;32m 348\u001B[0m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39mdrop_idx_ \u001B[38;5;241m=\u001B[39m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39m_compute_drop_idx()\n\u001B[1;32m 349\u001B[0m \u001B[38;5;28;01mreturn\u001B[39;00m \u001B[38;5;28mself\u001B[39m\n",
"File \u001B[0;32m~/PycharmProjects/ai-privacy-toolkit-internal/venv/lib/python3.8/site-packages/sklearn/preprocessing/_encoders.py:86\u001B[0m, in \u001B[0;36m_BaseEncoder._fit\u001B[0;34m(self, X, handle_unknown)\u001B[0m\n\u001B[1;32m 84\u001B[0m Xi \u001B[38;5;241m=\u001B[39m X_list[i]\n\u001B[1;32m 85\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39mcategories \u001B[38;5;241m==\u001B[39m \u001B[38;5;124m'\u001B[39m\u001B[38;5;124mauto\u001B[39m\u001B[38;5;124m'\u001B[39m:\n\u001B[0;32m---> 86\u001B[0m cats \u001B[38;5;241m=\u001B[39m \u001B[43m_encode\u001B[49m\u001B[43m(\u001B[49m\u001B[43mXi\u001B[49m\u001B[43m)\u001B[49m\n\u001B[1;32m 87\u001B[0m \u001B[38;5;28;01melse\u001B[39;00m:\n\u001B[1;32m 88\u001B[0m cats \u001B[38;5;241m=\u001B[39m np\u001B[38;5;241m.\u001B[39marray(\u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39mcategories[i], dtype\u001B[38;5;241m=\u001B[39mXi\u001B[38;5;241m.\u001B[39mdtype)\n",
"File \u001B[0;32m~/PycharmProjects/ai-privacy-toolkit-internal/venv/lib/python3.8/site-packages/sklearn/preprocessing/_label.py:114\u001B[0m, in \u001B[0;36m_encode\u001B[0;34m(values, uniques, encode, check_unknown)\u001B[0m\n\u001B[1;32m 112\u001B[0m res \u001B[38;5;241m=\u001B[39m _encode_python(values, uniques, encode)\n\u001B[1;32m 113\u001B[0m \u001B[38;5;28;01mexcept\u001B[39;00m \u001B[38;5;167;01mTypeError\u001B[39;00m:\n\u001B[0;32m--> 114\u001B[0m \u001B[38;5;28;01mraise\u001B[39;00m \u001B[38;5;167;01mTypeError\u001B[39;00m(\u001B[38;5;124m\"\u001B[39m\u001B[38;5;124margument must be a string or number\u001B[39m\u001B[38;5;124m\"\u001B[39m)\n\u001B[1;32m 115\u001B[0m \u001B[38;5;28;01mreturn\u001B[39;00m res\n\u001B[1;32m 116\u001B[0m \u001B[38;5;28;01melse\u001B[39;00m:\n",
"\u001B[0;31mTypeError\u001B[0m: argument must be a string or number"
]
}
],
"source": [
"QI2 = [\"parents\", \"has_nurs\", \"form\", \"children\", \"housing\", \"finance\", \"social\", \"health\"]\n",
"QI2_indexes = [i for i, v in enumerate(features) if v in QI2]\n",
"anonymizer3 = Anonymize(100, QI2_indexes, categorical_features=categorical_features_indexes)\n",
"anon3 = anonymizer3.anonymize(ArrayDataset(x_train, x_train_predictions))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# number of distinct rows in anonymized data\n",
"len(anon3.drop_duplicates())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"anon3_str = anon3.astype(str)\n",
"anon3_encoded = OneHotEncoder(sparse=False).fit_transform(anon3_str)\n",
2021-04-28 14:00:19 +03:00
"\n",
"anon3_model = DecisionTreeClassifier()\n",
"anon3_model.fit(anon3_encoded, y_train)\n",
"\n",
"anon3_art_classifier = ScikitlearnDecisionTreeClassifier(anon3_model)\n",
"\n",
"print('Anonymized model accuracy: ', anon3_model.score(test_encoded, y_test))\n",
"\n",
"anon3_bb_attack = AttributeInferenceBlackBox(anon3_art_classifier, attack_feature=attack_feature)\n",
"\n",
"# get the anonymized model's predictions\n",
"anon3_x_train_predictions = np.array([np.argmax(arr) for arr in anon3_art_classifier.predict(train_encoded)]).reshape(-1,1)\n",
"\n",
"# train attack model\n",
"anon3_bb_attack.fit(train_encoded[:attack_train_size])\n",
"\n",
"# get inferred values\n",
"inferred_train_anon3_bb = anon3_bb_attack.infer(x_train_for_attack[attack_train_size:], anon3_x_train_predictions[attack_train_size:], values=values)\n",
"# check accuracy\n",
"train_acc = np.sum(inferred_train_anon3_bb == np.around(x_train_feature[attack_train_size:], decimals=8).reshape(1,-1)) / len(inferred_train_anon3_bb)\n",
"print('BB attack accuracy: ', train_acc)\n",
"\n",
"anon3_wb2_attack = AttributeInferenceWhiteBoxDecisionTree(anon3_art_classifier, attack_feature=attack_feature)\n",
"\n",
"# get inferred values\n",
"inferred_train_anon3_wb2 = anon3_wb2_attack.infer(x_train_for_attack, anon3_x_train_predictions, values=values, priors=priors)\n",
"\n",
"# check accuracy\n",
"train_acc = np.sum(inferred_train_anon3_wb2 == np.around(x_train_feature, decimals=8).reshape(1,-1)) / len(inferred_train_anon3_wb2)\n",
"print('WB attack accuracy: ', train_acc)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# black-box regular\n",
"print(calc_precision_recall(inferred_train_bb, x_train_feature))\n",
"# black-box anonymized\n",
"print(calc_precision_recall(inferred_train_anon3_bb, x_train_feature))\n",
"\n",
"# white-box regular\n",
"print(calc_precision_recall(inferred_train_wb2, x_train_feature))\n",
"# white-box anonymized\n",
"print(calc_precision_recall(inferred_train_anon3_wb2, x_train_feature))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The accuracy of both attacks has decreased compared to the attacks on the vanilla model, while precision and recall remain roughly the same in the black-box case. \n",
"\n",
"*In the white-box attack on the anonymized model, no records were predicted as having the positive value for the attacked feature."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}