K-Anonymity With Python

Simple Practice
python
technique
data
analysis
Published

December 10, 2022

What is K-Anonymity?

Data privacy is a problem that users and institutions increasingly care about. Neither users nor companies want to reveal, or be liable for, some of the data they collect and the conclusions that can be drawn from it. One popular method for reducing the risk that individuals in a dataset are de-anonymized is K-Anonymity.

The idea is that knowing some of the features recorded about a person should not be enough to single that person out in the data. K-Anonymity is a property of a dataset: for a chosen set of columns (the quasi-identifiers), every combination of values that appears in the data must be shared by at least k individuals, so each person is indistinguishable from at least k-1 others.
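
To make the definition concrete, here is a minimal sketch in plain Python, independent of pandas. The helper name `k_of` and the toy records are my own illustration and not part of any library. It counts how often each combination of quasi-identifier values occurs and returns the smallest count, which is the largest k for which the table is k-anonymous.

```python
from collections import Counter

def k_of(records, quasi_identifiers):
    """Return the size of the smallest group of records that share
    the same values for every quasi-identifier column."""
    groups = Counter(
        tuple(record[col] for col in quasi_identifiers)
        for record in records
    )
    # An empty table has no groups, so report k = 0 for it.
    return min(groups.values(), default=0)

# Hypothetical toy records, invented for this illustration.
toy = [
    {"age_band": "20-29", "state": "Kerala", "disease": "Cancer"},
    {"age_band": "20-29", "state": "Kerala", "disease": "Viral infection"},
    {"age_band": "30-39", "state": "Kerala", "disease": "Cancer"},
]

print(k_of(toy, ["state"]))              # 3: all three share a state
print(k_of(toy, ["age_band", "state"]))  # 1: the 30-39 record is alone
```

Adding a column to the quasi-identifiers can only split groups further, so k never goes up as more columns are considered.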

We’ll use the data from the Wikipedia page to illustrate this idea.

import pandas as pd

url = 'https://en.wikipedia.org/wiki/K-anonymity'
# read_html returns a list with one DataFrame per table on the page
site = pd.read_html(url)
rawData = site[0]
rawData
Name Age Gender Height Weight State of domicile Religion Disease
0 Ramsha 30 Female 165 cm 72 kg Tamil Nadu Hindu Cancer
1 Yadu 24 Female 162 cm 70 kg Kerala Hindu Viral infection
2 Salima 28 Female 170 cm 68 kg Tamil Nadu Muslim Tuberculosis
3 Sunny 27 Male 170 cm 75 kg Karnataka Parsi No illness
4 Joan 24 Female 165 cm 71 kg Kerala Christian Heart-related
5 Bahuksana 23 Male 160 cm 69 kg Karnataka Buddhist Tuberculosis
6 Rambha 19 Male 167 cm 85 kg Kerala Hindu Cancer
7 Kishor 29 Male 180 cm 81 kg Karnataka Hindu Heart-related
8 Johnson 17 Male 175 cm 79 kg Kerala Christian Heart-related
9 John 19 Male 169 cm 82 kg Kerala Christian Viral infection

OK, now that we have this information, we can see there are Sensitive Attributes which should not show up in public data. With the data in this form, a single groupby() call is enough to pick out an individual. Obviously, a name identifies a person directly, so we should drop or Suppress that column.

rawData['Name'] = '*'
rawData
Name Age Gender Height Weight State of domicile Religion Disease
0 * 30 Female 165 cm 72 kg Tamil Nadu Hindu Cancer
1 * 24 Female 162 cm 70 kg Kerala Hindu Viral infection
2 * 28 Female 170 cm 68 kg Tamil Nadu Muslim Tuberculosis
3 * 27 Male 170 cm 75 kg Karnataka Parsi No illness
4 * 24 Female 165 cm 71 kg Kerala Christian Heart-related
5 * 23 Male 160 cm 69 kg Karnataka Buddhist Tuberculosis
6 * 19 Male 167 cm 85 kg Kerala Hindu Cancer
7 * 29 Male 180 cm 81 kg Karnataka Hindu Heart-related
8 * 17 Male 175 cm 79 kg Kerala Christian Heart-related
9 * 19 Male 169 cm 82 kg Kerala Christian Viral infection

But also, if I know your age and your religion, I can still quickly identify you.

rawData.groupby(['Age','Religion']).size().reset_index(name='Count')
Age Religion Count
0 17 Christian 1
1 19 Christian 1
2 19 Hindu 1
3 23 Buddhist 1
4 24 Christian 1
5 24 Hindu 1
6 27 Parsi 1
7 28 Muslim 1
8 29 Hindu 1
9 30 Hindu 1
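
Every count is 1, so every person is unique on these two columns. As a small sketch of how one might flag the affected rows directly, the block below retypes the Age and Religion columns by hand; the names `people`, `group_size`, and `at_risk` are mine and only for illustration.

```python
import pandas as pd

# Ages and religions copied from the table above.
people = pd.DataFrame({
    "Age": [30, 24, 28, 27, 24, 23, 19, 29, 17, 19],
    "Religion": ["Hindu", "Hindu", "Muslim", "Parsi", "Christian",
                 "Buddhist", "Hindu", "Hindu", "Christian", "Christian"],
})

quasi_identifiers = ["Age", "Religion"]

# transform("size") gives each row the size of the group it belongs to.
group_size = people.groupby(quasi_identifiers)["Age"].transform("size")

# Rows whose group has a single member can be re-identified.
at_risk = people[group_size == 1]
print(f"{len(at_risk)} of {len(people)} rows are unique on {quasi_identifiers}")
```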

Since we want some representation of age in our analysis, we can Generalize the Age values by binning them. Choosing the number of bins looks to be trial and error; I guess you are expected to use your own judgement. As my own guideline, I'm going to start with the standard deviation of Age, truncated to an integer, as the number of bins and step down by one until every bin holds at least k people.

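# int() truncates the standard deviation (about 4.5 here) to 4 bins
int(rawData['Age'].std())
rawData['AgeGroup'] = pd.cut(rawData['Age'], bins=int(rawData['Age'].std()))
# observed=False keeps empty groups in the output and silences the pandas FutureWarning
rawData.groupby(['AgeGroup'], observed=False).size().reset_index(name='Count')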
AgeGroup Count
0 (16.987, 20.25] 3
1 (20.25, 23.5] 1
2 (23.5, 26.75] 2
3 (26.75, 30.0] 4

Not enough yet, since one bin holds only a single person:

rawData['AgeGroup'] = pd.cut(rawData['Age'], bins=int(rawData['Age'].std())-1)
rawData.groupby(['AgeGroup'], observed=False).size().reset_index(name='Count')
AgeGroup Count
0 (16.987, 21.333] 3
1 (21.333, 25.667] 3
2 (25.667, 30.0] 4
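
The step-down I did by hand can be written as a loop. This is only a sketch of my own guideline, not an established rule: `bins_for_k` is a name I made up, and the ages are retyped from the table above so the block runs on its own.

```python
import pandas as pd

ages = pd.Series([30, 24, 28, 27, 24, 23, 19, 29, 17, 19], name="Age")

def bins_for_k(values, k, start_bins):
    """Step the bin count down from start_bins until every non-empty
    bin holds at least k values; return (bin_count, binned_values)."""
    for n_bins in range(start_bins, 0, -1):
        binned = pd.cut(values, bins=n_bins)
        counts = binned.value_counts()
        # Ignore empty bins: they contain nobody to re-identify.
        if counts[counts > 0].min() >= k:
            return n_bins, binned
    raise ValueError("no bin count satisfies the requested k")

n_bins, age_group = bins_for_k(ages, k=3, start_bins=int(ages.std()))
print(n_bins)  # 3
print(age_group.value_counts().sort_index())
```

Starting from 4 bins with k=3, the loop rejects 4 bins (one bin has a single person) and settles on 3, the same result as the manual steps.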

That looks good: every age group now has at least three people. But what if we take it in combination with another column?

# Pair AgeGroup with every other column and count the members of each group
for col in rawData.columns.to_list():
    if col not in ['AgeGroup', 'Age']:
        print(rawData.groupby(['AgeGroup', col], observed=False).size().reset_index(name='Count'))
        print()
           AgeGroup Name  Count
0  (16.987, 21.333]    *      3
1  (21.333, 25.667]    *      3
2    (25.667, 30.0]    *      4

           AgeGroup  Gender  Count
0  (16.987, 21.333]  Female      0
1  (16.987, 21.333]    Male      3
2  (21.333, 25.667]  Female      2
3  (21.333, 25.667]    Male      1
4    (25.667, 30.0]  Female      2
5    (25.667, 30.0]    Male      2

            AgeGroup  Height  Count
0   (16.987, 21.333]  160 cm      0
1   (16.987, 21.333]  162 cm      0
2   (16.987, 21.333]  165 cm      0
3   (16.987, 21.333]  167 cm      1
4   (16.987, 21.333]  169 cm      1
5   (16.987, 21.333]  170 cm      0
6   (16.987, 21.333]  175 cm      1
7   (16.987, 21.333]  180 cm      0
8   (21.333, 25.667]  160 cm      1
9   (21.333, 25.667]  162 cm      1
10  (21.333, 25.667]  165 cm      1
11  (21.333, 25.667]  167 cm      0
12  (21.333, 25.667]  169 cm      0
13  (21.333, 25.667]  170 cm      0
14  (21.333, 25.667]  175 cm      0
15  (21.333, 25.667]  180 cm      0
16    (25.667, 30.0]  160 cm      0
17    (25.667, 30.0]  162 cm      0
18    (25.667, 30.0]  165 cm      1
19    (25.667, 30.0]  167 cm      0
20    (25.667, 30.0]  169 cm      0
21    (25.667, 30.0]  170 cm      2
22    (25.667, 30.0]  175 cm      0
23    (25.667, 30.0]  180 cm      1

            AgeGroup Weight  Count
0   (16.987, 21.333]  68 kg      0
1   (16.987, 21.333]  69 kg      0
2   (16.987, 21.333]  70 kg      0
3   (16.987, 21.333]  71 kg      0
4   (16.987, 21.333]  72 kg      0
5   (16.987, 21.333]  75 kg      0
6   (16.987, 21.333]  79 kg      1
7   (16.987, 21.333]  81 kg      0
8   (16.987, 21.333]  82 kg      1
9   (16.987, 21.333]  85 kg      1
10  (21.333, 25.667]  68 kg      0
11  (21.333, 25.667]  69 kg      1
12  (21.333, 25.667]  70 kg      1
13  (21.333, 25.667]  71 kg      1
14  (21.333, 25.667]  72 kg      0
15  (21.333, 25.667]  75 kg      0
16  (21.333, 25.667]  79 kg      0
17  (21.333, 25.667]  81 kg      0
18  (21.333, 25.667]  82 kg      0
19  (21.333, 25.667]  85 kg      0
20    (25.667, 30.0]  68 kg      1
21    (25.667, 30.0]  69 kg      0
22    (25.667, 30.0]  70 kg      0
23    (25.667, 30.0]  71 kg      0
24    (25.667, 30.0]  72 kg      1
25    (25.667, 30.0]  75 kg      1
26    (25.667, 30.0]  79 kg      0
27    (25.667, 30.0]  81 kg      1
28    (25.667, 30.0]  82 kg      0
29    (25.667, 30.0]  85 kg      0

           AgeGroup State of domicile  Count
0  (16.987, 21.333]         Karnataka      0
1  (16.987, 21.333]            Kerala      3
2  (16.987, 21.333]        Tamil Nadu      0
3  (21.333, 25.667]         Karnataka      1
4  (21.333, 25.667]            Kerala      2
5  (21.333, 25.667]        Tamil Nadu      0
6    (25.667, 30.0]         Karnataka      2
7    (25.667, 30.0]            Kerala      0
8    (25.667, 30.0]        Tamil Nadu      2

            AgeGroup   Religion  Count
0   (16.987, 21.333]   Buddhist      0
1   (16.987, 21.333]  Christian      2
2   (16.987, 21.333]      Hindu      1
3   (16.987, 21.333]     Muslim      0
4   (16.987, 21.333]      Parsi      0
5   (21.333, 25.667]   Buddhist      1
6   (21.333, 25.667]  Christian      1
7   (21.333, 25.667]      Hindu      1
8   (21.333, 25.667]     Muslim      0
9   (21.333, 25.667]      Parsi      0
10    (25.667, 30.0]   Buddhist      0
11    (25.667, 30.0]  Christian      0
12    (25.667, 30.0]      Hindu      2
13    (25.667, 30.0]     Muslim      1
14    (25.667, 30.0]      Parsi      1

            AgeGroup          Disease  Count
0   (16.987, 21.333]           Cancer      1
1   (16.987, 21.333]    Heart-related      1
2   (16.987, 21.333]       No illness      0
3   (16.987, 21.333]     Tuberculosis      0
4   (16.987, 21.333]  Viral infection      1
5   (21.333, 25.667]           Cancer      0
6   (21.333, 25.667]    Heart-related      1
7   (21.333, 25.667]       No illness      0
8   (21.333, 25.667]     Tuberculosis      1
9   (21.333, 25.667]  Viral infection      1
10    (25.667, 30.0]           Cancer      1
11    (25.667, 30.0]    Heart-related      1
12    (25.667, 30.0]       No illness      1
13    (25.667, 30.0]     Tuberculosis      1
14    (25.667, 30.0]  Viral infection      0
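
Reading seven tables by eye gets tedious. As a rough sketch, the same check can be condensed to one number per column: the size of the smallest non-empty group. The frame below is a hand-typed subset of the columns above, and `smallest` is just a name I picked; `observed=True` tells pandas to skip combinations that never occur.

```python
import pandas as pd

# A subset of the columns above, typed in by hand.
df = pd.DataFrame({
    "Age": [30, 24, 28, 27, 24, 23, 19, 29, 17, 19],
    "Gender": ["Female", "Female", "Female", "Male", "Female",
               "Male", "Male", "Male", "Male", "Male"],
    "State of domicile": ["Tamil Nadu", "Kerala", "Tamil Nadu", "Karnataka",
                          "Kerala", "Karnataka", "Kerala", "Karnataka",
                          "Kerala", "Kerala"],
})
df["AgeGroup"] = pd.cut(df["Age"], bins=3)

# observed=True skips empty combinations, so the minimum reflects
# only groups that actually contain someone.
smallest = {
    col: int(df.groupby(["AgeGroup", col], observed=True).size().min())
    for col in ["Gender", "State of domicile"]
}
print(smallest)  # {'Gender': 1, 'State of domicile': 1}
```

A value of 1 means at least one person is still unique for that pairing.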

OK, looking at these results it is clear that Religion and Disease are still problems (Gender, Height, Weight and State of domicile also still have groups of one). This is where we need to start asking ourselves whether we have enough data and what the goals of our analysis are. If we were trying to explore which features are related to these diseases, then we would need more data, since no matter how we bin Age the single No illness record will always be alone. In this instance we could drop that value:

# Drop the single 'No illness' row
filteredData = rawData.loc[~rawData['Disease'].str.contains('illness')].copy()
filteredData.groupby(['AgeGroup', 'Disease'], observed=False).size().reset_index(name='Count')
AgeGroup Disease Count
0 (16.987, 21.333] Cancer 1
1 (16.987, 21.333] Heart-related 1
2 (16.987, 21.333] Tuberculosis 0
3 (16.987, 21.333] Viral infection 1
4 (21.333, 25.667] Cancer 0
5 (21.333, 25.667] Heart-related 1
6 (21.333, 25.667] Tuberculosis 1
7 (21.333, 25.667] Viral infection 1
8 (25.667, 30.0] Cancer 1
9 (25.667, 30.0] Heart-related 1
10 (25.667, 30.0] Tuberculosis 1
11 (25.667, 30.0] Viral infection 0

Not quite there. Another technique that will help is generalizing the Disease values. We’ll use a substitution hierarchy to replace some values in the data with slightly less specific ones.

heirarchy = {
    "Cancer": "Body",
    'Viral infection': 'Body',
    'Tuberculosis': 'Chest',
    'Heart-related': 'Chest'
}

# Look up each disease in the hierarchy; values without an entry are kept as-is
filteredData['heirarchy'] = filteredData['Disease'].map(lambda x: heirarchy.get(x, x))
filteredData.groupby(['AgeGroup','heirarchy'], observed=False).size().reset_index(name='count')
AgeGroup heirarchy count
0 (16.987, 21.333] Body 2
1 (16.987, 21.333] Chest 1
2 (21.333, 25.667] Body 1
3 (21.333, 25.667] Chest 2
4 (25.667, 30.0] Body 1
5 (25.667, 30.0] Chest 2

Almost there. Let’s re-examine our age bins and step down once more, to two wider bins.

# rawData['Age'] aligns to filteredData by index, so the dropped row is ignored
filteredData['AgeGroup'] = pd.cut(rawData['Age'], bins=int(rawData['Age'].std())-2)
filteredData.groupby(['AgeGroup'], observed=False).size().reset_index(name='Count')
AgeGroup Count
0 (16.987, 23.5] 4
1 (23.5, 30.0] 5

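And there we go! With respect to Age alone, this is 4-anonymity: the smallest age group has four people. Combined with the generalized Disease column, the smallest group has two people (each age group contains two Body cases), so with respect to Age and Disease together the table is 2-anonymous. Whether that is enough to consider using the data in a public data set or algorithm depends on the k you are aiming for. This is a really small dataset, but the same process is how you would anonymize larger datasets as well.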
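
As a sanity check on the final table, the sketch below rebuilds the Age and Disease columns by hand, repeats the same steps, and reports the smallest group size for the age bins alone and for the age bins combined with the generalized disease. The names `DiseaseGroup` and `smallest_group` are mine, chosen for this illustration.

```python
import pandas as pd

# Age and Disease columns from the table at the top, typed in by hand.
df = pd.DataFrame({
    "Age": [30, 24, 28, 27, 24, 23, 19, 29, 17, 19],
    "Disease": ["Cancer", "Viral infection", "Tuberculosis", "No illness",
                "Heart-related", "Tuberculosis", "Cancer", "Heart-related",
                "Heart-related", "Viral infection"],
})

# Same steps as in the post: drop the lone "No illness" row,
# generalize the diseases, and cut age into two bins.
df = df[df["Disease"] != "No illness"].copy()
generalized = {"Cancer": "Body", "Viral infection": "Body",
               "Tuberculosis": "Chest", "Heart-related": "Chest"}
df["DiseaseGroup"] = df["Disease"].map(generalized)
df["AgeGroup"] = pd.cut(df["Age"], bins=2)

def smallest_group(frame, columns):
    """Size of the smallest non-empty group over the given columns."""
    return int(frame.groupby(columns, observed=True).size().min())

print(smallest_group(df, ["AgeGroup"]))                  # 4
print(smallest_group(df, ["AgeGroup", "DiseaseGroup"]))  # 2
```

The combined grouping is the one that matters when both columns are released together, which is why it is worth checking k on the full set of columns and not one column at a time.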