Support for one-hot encoded features in minimization (#87)

* Initial version with first working test
* Make sure representative values in generalizations for 1-hot encoded features are consistent.
* Updated notebooks for one-hot encoded data
* Review comments

Signed-off-by: abigailt <abigailt@il.ibm.com>
This commit is contained in:
abigailgold 2023-12-24 18:18:18 -05:00 committed by GitHub
parent 5dce961092
commit 6d81cd8ed4
No known key found for this signature in database
GPG key ID: 4AEE18F83AFDEB23
4 changed files with 26703 additions and 48 deletions

View file

@ -11,7 +11,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"In this tutorial we will show how to anonymize models using the ML anonymization module, specifically when the inout data is already one-hot encoded. \n",
"In this tutorial we will show how to anonymize models using the ML anonymization module, specifically when the input data is already one-hot encoded. \n",
"\n",
"This will be demonstarted using the Adult dataset (original dataset can be found here: https://archive.ics.uci.edu/ml/datasets/adult). "
]
@ -25,7 +25,7 @@
},
{
"cell_type": "code",
"execution_count": 21,
"execution_count": 29,
"metadata": {},
"outputs": [
{
@ -81,7 +81,7 @@
},
{
"cell_type": "code",
"execution_count": 22,
"execution_count": 30,
"metadata": {},
"outputs": [
{
@ -123,14 +123,14 @@
},
{
"cell_type": "code",
"execution_count": 23,
"execution_count": 31,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Base model accuracy: 0.814446287083103\n"
"Base model accuracy: 0.8143234445058657\n"
]
},
{
@ -168,7 +168,7 @@
},
{
"cell_type": "code",
"execution_count": 25,
"execution_count": 32,
"metadata": {},
"outputs": [
{
@ -194,14 +194,14 @@
"# QI = (race, sex)\n",
"QI = [53, 52, 51, 50, 49, 48, 47]\n",
"QI_slices = [[47, 48, 49, 50, 51], [52, 53]]\n",
"anonymizer = Anonymize(100, QI)\n",
"anonymizer = Anonymize(100, QI, quasi_identifer_slices=QI_slices)\n",
"anon = anonymizer.anonymize(ArrayDataset(x_train, x_train_predictions))\n",
"print(anon)"
]
},
{
"cell_type": "code",
"execution_count": 26,
"execution_count": 33,
"metadata": {},
"outputs": [
{
@ -210,7 +210,7 @@
"2711"
]
},
"execution_count": 26,
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
@ -222,7 +222,7 @@
},
{
"cell_type": "code",
"execution_count": 27,
"execution_count": 34,
"metadata": {},
"outputs": [
{
@ -231,7 +231,7 @@
"2476"
]
},
"execution_count": 27,
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
@ -250,14 +250,14 @@
},
{
"cell_type": "code",
"execution_count": 28,
"execution_count": 35,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Anonymized model accuracy: 0.8135863890424421\n"
"Anonymized model accuracy: 0.8124808058473066\n"
]
},
{

File diff suppressed because it is too large Load diff