K-Anonymity With Python

Simple Practice
python
technique
data
analysis
Published

December 10, 2022

What is K-Anonymity?

Data privacy is a problem that users and institutions increasingly care about. Neither users nor companies want to reveal, or be liable for, some of the data they collect and the conclusions that can be drawn from it. One popular method for reducing the risk that individuals in a dataset can be re-identified is K-Anonymity.

The idea is that knowing the values of certain features for a person should not be enough to single that person out in the data. A dataset is k-anonymous with respect to a chosen set of columns if every combination of values that occurs in those columns is shared by at least k rows, so each individual is indistinguishable from at least k-1 others.
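To make the definition concrete, here is a minimal sketch of how k could be measured. The helper name `k_anonymity` and the toy table are assumptions made up for illustration, not part of pandas or any other library: k is simply the size of the smallest group of rows that share the same values in the chosen columns.

```python
import pandas as pd

def k_anonymity(df: pd.DataFrame, quasi_identifiers: list[str]) -> int:
    """Return the size of the smallest group of rows sharing the same
    values in the given columns (hypothetical helper)."""
    # observed=True skips category combinations that never occur in the data
    group_sizes = df.groupby(quasi_identifiers, observed=True).size()
    return int(group_sizes.min())

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "AgeBand": ["20-29", "20-29", "30-39", "30-39", "30-39"],
    "State": ["Kerala", "Kerala", "Kerala", "Kerala", "Kerala"],
})
print(k_anonymity(toy, ["AgeBand", "State"]))  # 2
```

The smallest group here is the two rows in the 20-29 band, so the toy table is 2-anonymous with respect to those two columns.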

We’ll use the data from the Wikipedia page to illustrate this idea.

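import pandas as pd

# read_html returns a list with every table found on the page; we take the first one
url = 'https://en.wikipedia.org/wiki/K-anonymity'
site = pd.read_html(url)
rawData = site[0]
rawData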
        Name  Age  Gender  Height  Weight  State of domicile   Religion          Disease
0     Ramsha   30  Female   165cm    72kg         Tamil Nadu      Hindu           Cancer
1       Yadu   24  Female   162cm    70kg             Kerala      Hindu  Viral infection
2     Salima   28  Female   170cm    68kg         Tamil Nadu     Muslim     Tuberculosis
3      Sunny   27    Male   170cm    75kg          Karnataka      Parsi       No illness
4       Joan   24  Female   165cm    71kg             Kerala  Christian    Heart-related
5  Bahuksana   23    Male   160cm    69kg          Karnataka   Buddhist     Tuberculosis
6     Rambha   19    Male   167cm    85kg             Kerala      Hindu           Cancer
7     Kishor   29    Male   180cm    81kg          Karnataka      Hindu    Heart-related
8    Johnson   17    Male   175cm    79kg             Kerala  Christian    Heart-related
9       John   19    Male   169cm    82kg             Kerala  Christian  Viral infection

Now that we have this information, we can see that it contains Sensitive Attributes, which should not show up in public data. With the data in this form, a single groupby() call is enough to identify an individual user. Obviously, if the data contains your name, then we should drop or Suppress that column.

rawData['Name'] = '*'
rawData
   Name  Age  Gender  Height  Weight  State of domicile   Religion          Disease
0     *   30  Female   165cm    72kg         Tamil Nadu      Hindu           Cancer
1     *   24  Female   162cm    70kg             Kerala      Hindu  Viral infection
2     *   28  Female   170cm    68kg         Tamil Nadu     Muslim     Tuberculosis
3     *   27    Male   170cm    75kg          Karnataka      Parsi       No illness
4     *   24  Female   165cm    71kg             Kerala  Christian    Heart-related
5     *   23    Male   160cm    69kg          Karnataka   Buddhist     Tuberculosis
6     *   19    Male   167cm    85kg             Kerala      Hindu           Cancer
7     *   29    Male   180cm    81kg          Karnataka      Hindu    Heart-related
8     *   17    Male   175cm    79kg             Kerala  Christian    Heart-related
9     *   19    Male   169cm    82kg             Kerala  Christian  Viral infection

But also, if I know your age and I know your religion then I can quickly identify you.

rawData.groupby(['Age','Religion']).size().reset_index(name='Count')
Age Religion Count
0 17 Christian 1
1 19 Christian 1
2 19 Hindu 1
3 23 Buddhist 1
4 24 Christian 1
5 24 Hindu 1
6 27 Parsi 1
7 28 Muslim 1
8 29 Hindu 1
9 30 Hindu 1
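Rather than reading the counts by eye, the at-risk rows can be pulled out directly. The sketch below is one possible way to do it; the helper name `rows_below_k` is hypothetical, and the small frame only copies Age and Religion from the first four rows of the table above.

```python
import pandas as pd

def rows_below_k(df: pd.DataFrame, columns: list[str], k: int) -> pd.DataFrame:
    """Return the rows whose combination of values in `columns`
    is shared by fewer than k rows (hypothetical helper)."""
    # transform("size") broadcasts each group's size back onto its rows
    group_size = df.groupby(columns)[columns[0]].transform("size")
    return df[group_size < k]

# Age and Religion copied from the first four rows of the table above
sample = pd.DataFrame({
    "Age": [30, 24, 28, 27],
    "Religion": ["Hindu", "Hindu", "Muslim", "Parsi"],
})
at_risk = rows_below_k(sample, ["Age", "Religion"], k=2)
print(len(at_risk))  # 4: every row is unique on (Age, Religion)
```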

Since we want some representation of age in our analysis, we can Generalize the Age values by binning them. Guidance on the number of bins appears to come down to trial and error, so you are expected to use your own judgement. As my own guideline, I’m going to start with the standard deviation of Age truncated to an integer (4 here) as the number of bins, and reduce the number of bins by one until the k threshold has been met.

int(rawData['Age'].std())
rawData['AgeGroup'] = pd.cut(rawData['Age'], bins=int(rawData['Age'].std()))
rawData.groupby(['AgeGroup']).size().reset_index(name='Count')
AgeGroup Count
0 (16.987, 20.25] 3
1 (20.25, 23.5] 1
2 (23.5, 26.75] 2
3 (26.75, 30.0] 4

Not enough yet, since the (20.25, 23.5] bin contains only one person:

rawData['AgeGroup'] = pd.cut(rawData['Age'], bins=int(rawData['Age'].std())-1)
rawData.groupby(['AgeGroup']).size().reset_index(name='Count')
AgeGroup Count
0 (16.987, 21.333] 3
1 (21.333, 25.667] 3
2 (25.667, 30.0] 4
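The manual step-down can also be written as a loop. This is only a sketch under stated assumptions: the helper name `bin_until_k` is made up, and the threshold k=3 is my reading of what the two attempts above accept, not something the walkthrough fixes explicitly.

```python
import pandas as pd

def bin_until_k(values: pd.Series, k: int, start_bins: int) -> pd.Series:
    """Cut `values` into equal-width bins, reducing the number of bins
    until every non-empty bin has at least k members (hypothetical helper)."""
    for n_bins in range(start_bins, 0, -1):
        binned = pd.cut(values, bins=n_bins)
        counts = binned.value_counts()
        # Empty bins identify nobody, so only non-empty bins are checked
        if counts[counts > 0].min() >= k:
            return binned
    raise ValueError("no number of bins satisfies the requested k")

# Age column copied from the table above
ages = pd.Series([30, 24, 28, 27, 24, 23, 19, 29, 17, 19])
start = int(ages.std())  # truncated standard deviation, 4 here
age_group = bin_until_k(ages, k=3, start_bins=start)
print(len(age_group.cat.categories))  # 3
```

Starting from 4 bins, the loop rejects the first attempt because one bin holds a single person and stops at 3 bins, matching the result above.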

That looks good. But what if we take it in combination with another column?

for col in rawData.columns.to_list():
    if col not in ['AgeGroup', 'Age']:
        print(rawData.groupby(['AgeGroup', col]).size().reset_index(name='Count'))
        print()
           AgeGroup Name  Count
0  (16.987, 21.333]    *      3
1  (21.333, 25.667]    *      3
2    (25.667, 30.0]    *      4

           AgeGroup  Gender  Count
0  (16.987, 21.333]  Female      0
1  (16.987, 21.333]    Male      3
2  (21.333, 25.667]  Female      2
3  (21.333, 25.667]    Male      1
4    (25.667, 30.0]  Female      2
5    (25.667, 30.0]    Male      2

            AgeGroup Height  Count
0   (16.987, 21.333]  160cm      0
1   (16.987, 21.333]  162cm      0
2   (16.987, 21.333]  165cm      0
3   (16.987, 21.333]  167cm      1
4   (16.987, 21.333]  169cm      1
5   (16.987, 21.333]  170cm      0
6   (16.987, 21.333]  175cm      1
7   (16.987, 21.333]  180cm      0
8   (21.333, 25.667]  160cm      1
9   (21.333, 25.667]  162cm      1
10  (21.333, 25.667]  165cm      1
11  (21.333, 25.667]  167cm      0
12  (21.333, 25.667]  169cm      0
13  (21.333, 25.667]  170cm      0
14  (21.333, 25.667]  175cm      0
15  (21.333, 25.667]  180cm      0
16    (25.667, 30.0]  160cm      0
17    (25.667, 30.0]  162cm      0
18    (25.667, 30.0]  165cm      1
19    (25.667, 30.0]  167cm      0
20    (25.667, 30.0]  169cm      0
21    (25.667, 30.0]  170cm      2
22    (25.667, 30.0]  175cm      0
23    (25.667, 30.0]  180cm      1

            AgeGroup Weight  Count
0   (16.987, 21.333]   68kg      0
1   (16.987, 21.333]   69kg      0
2   (16.987, 21.333]   70kg      0
3   (16.987, 21.333]   71kg      0
4   (16.987, 21.333]   72kg      0
5   (16.987, 21.333]   75kg      0
6   (16.987, 21.333]   79kg      1
7   (16.987, 21.333]   81kg      0
8   (16.987, 21.333]   82kg      1
9   (16.987, 21.333]   85kg      1
10  (21.333, 25.667]   68kg      0
11  (21.333, 25.667]   69kg      1
12  (21.333, 25.667]   70kg      1
13  (21.333, 25.667]   71kg      1
14  (21.333, 25.667]   72kg      0
15  (21.333, 25.667]   75kg      0
16  (21.333, 25.667]   79kg      0
17  (21.333, 25.667]   81kg      0
18  (21.333, 25.667]   82kg      0
19  (21.333, 25.667]   85kg      0
20    (25.667, 30.0]   68kg      1
21    (25.667, 30.0]   69kg      0
22    (25.667, 30.0]   70kg      0
23    (25.667, 30.0]   71kg      0
24    (25.667, 30.0]   72kg      1
25    (25.667, 30.0]   75kg      1
26    (25.667, 30.0]   79kg      0
27    (25.667, 30.0]   81kg      1
28    (25.667, 30.0]   82kg      0
29    (25.667, 30.0]   85kg      0

           AgeGroup State of domicile  Count
0  (16.987, 21.333]         Karnataka      0
1  (16.987, 21.333]            Kerala      3
2  (16.987, 21.333]        Tamil Nadu      0
3  (21.333, 25.667]         Karnataka      1
4  (21.333, 25.667]            Kerala      2
5  (21.333, 25.667]        Tamil Nadu      0
6    (25.667, 30.0]         Karnataka      2
7    (25.667, 30.0]            Kerala      0
8    (25.667, 30.0]        Tamil Nadu      2

            AgeGroup   Religion  Count
0   (16.987, 21.333]   Buddhist      0
1   (16.987, 21.333]  Christian      2
2   (16.987, 21.333]      Hindu      1
3   (16.987, 21.333]     Muslim      0
4   (16.987, 21.333]      Parsi      0
5   (21.333, 25.667]   Buddhist      1
6   (21.333, 25.667]  Christian      1
7   (21.333, 25.667]      Hindu      1
8   (21.333, 25.667]     Muslim      0
9   (21.333, 25.667]      Parsi      0
10    (25.667, 30.0]   Buddhist      0
11    (25.667, 30.0]  Christian      0
12    (25.667, 30.0]      Hindu      2
13    (25.667, 30.0]     Muslim      1
14    (25.667, 30.0]      Parsi      1

            AgeGroup          Disease  Count
0   (16.987, 21.333]           Cancer      1
1   (16.987, 21.333]    Heart-related      1
2   (16.987, 21.333]       No illness      0
3   (16.987, 21.333]     Tuberculosis      0
4   (16.987, 21.333]  Viral infection      1
5   (21.333, 25.667]           Cancer      0
6   (21.333, 25.667]    Heart-related      1
7   (21.333, 25.667]       No illness      0
8   (21.333, 25.667]     Tuberculosis      1
9   (21.333, 25.667]  Viral infection      1
10    (25.667, 30.0]           Cancer      1
11    (25.667, 30.0]    Heart-related      1
12    (25.667, 30.0]       No illness      1
13    (25.667, 30.0]     Tuberculosis      1
14    (25.667, 30.0]  Viral infection      0
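Long printouts like these can be condensed into one number per column: the smallest non-empty group. The sketch below assumes a hypothetical helper `smallest_groups` and rebuilds only three columns from the table above; passing `observed=True` drops the zero-count rows that appear because AgeGroup is categorical.

```python
import pandas as pd

def smallest_groups(df: pd.DataFrame, base_col: str, other_cols: list[str]) -> pd.Series:
    """For each column in other_cols, return the smallest non-empty group
    size when grouping by base_col plus that column (hypothetical helper)."""
    result = {}
    for col in other_cols:
        sizes = df.groupby([base_col, col], observed=True).size()
        result[col] = int(sizes.min())
    return pd.Series(result, name="min_group_size")

# Age, Gender and State of domicile copied from the table above
data = pd.DataFrame({
    "Age": [30, 24, 28, 27, 24, 23, 19, 29, 17, 19],
    "Gender": ["Female", "Female", "Female", "Male", "Female",
               "Male", "Male", "Male", "Male", "Male"],
    "State of domicile": ["Tamil Nadu", "Kerala", "Tamil Nadu", "Karnataka", "Kerala",
                          "Karnataka", "Kerala", "Karnataka", "Kerala", "Kerala"],
})
data["AgeGroup"] = pd.cut(data["Age"], bins=3)
summary = smallest_groups(data, "AgeGroup", ["Gender", "State of domicile"])
print(summary)
```

A value of 1 for a column means that at least one person is unique on AgeGroup plus that column.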

Looking at these results, it is clear that Religion and Disease are still problems. This is where we need to ask ourselves whether we have enough data and what the goals of our analysis are. If we were trying to explore which features are related to these Diseases, then we would need more data, since no matter how we bin the ages the single No illness record will always be alone in its group. In this instance we could drop that value:

filteredData = rawData.loc[ ~rawData.Disease.str.contains('illness')].copy()
filteredData.groupby(['AgeGroup', 'Disease']).size().reset_index(name='Count')
            AgeGroup          Disease  Count
0   (16.987, 21.333]           Cancer      1
1   (16.987, 21.333]    Heart-related      1
2   (16.987, 21.333]     Tuberculosis      0
3   (16.987, 21.333]  Viral infection      1
4   (21.333, 25.667]           Cancer      0
5   (21.333, 25.667]    Heart-related      1
6   (21.333, 25.667]     Tuberculosis      1
7   (21.333, 25.667]  Viral infection      1
8     (25.667, 30.0]           Cancer      1
9     (25.667, 30.0]    Heart-related      1
10    (25.667, 30.0]     Tuberculosis      1
11    (25.667, 30.0]  Viral infection      0

Not quite there. Another technique that will help is generalizing the Disease values. We’ll use a substitution hierarchy to replace some values in the data with slightly less specific ones.

heirarchy = {
    "Cancer": "Body",
    'Viral infection': 'Body',
    'Tuberculosis': 'Chest',
    'Heart-related': 'Chest'
}

filteredData['heirarchy'] = filteredData['Disease'].apply(lambda x: heirarchy.get(x, x))
filteredData.groupby(['AgeGroup','heirarchy']).size().reset_index(name='count')
AgeGroup heirarchy count
0 (16.987, 21.333] Body 2
1 (16.987, 21.333] Chest 1
2 (21.333, 25.667] Body 1
3 (21.333, 25.667] Chest 2
4 (25.667, 30.0] Body 1
5 (25.667, 30.0] Chest 2

Almost there. Let’s re-examine our age bins and reduce their number by one more, which makes each bin wider.

filteredData['AgeGroup'] = pd.cut(rawData['Age'], bins=int(rawData['Age'].std())-2)
filteredData.groupby(['AgeGroup']).size().reset_index(name='Count')
AgeGroup Count
0 (16.987, 23.5] 4
1 (23.5, 30.0] 5

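And, there we go! Each age bin now holds at least 4 people, so this is 4-anonymity with respect to AgeGroup alone. Combined with the generalized Disease column, the smallest group holds 2 people, so the data is 2-anonymous with respect to AgeGroup and the disease hierarchy together. The remaining columns, such as Height, Weight and Religion, would still need to be suppressed or generalized in the same way before the data could be considered for a public dataset or algorithm. This is a really small dataset, but the same process is how you would anonymize larger datasets as well.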
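As a final check, the smallest group sizes can be recomputed directly. The sketch below rebuilds only the Age and Disease columns from the table above, without the No illness row; the column name `DiseaseGroup` is my own label for the generalized disease. On these nine rows it gives 4 for the age bins alone and 2 once the generalized disease is added as a second grouping column.

```python
import pandas as pd

# Age and Disease copied from the table above, without the "No illness" row
final = pd.DataFrame({
    "Age": [30, 24, 28, 24, 23, 19, 29, 17, 19],
    "Disease": ["Cancer", "Viral infection", "Tuberculosis", "Heart-related",
                "Tuberculosis", "Cancer", "Heart-related", "Heart-related",
                "Viral infection"],
})
# Same substitution as the walkthrough, under a different variable name
generalized = {"Cancer": "Body", "Viral infection": "Body",
               "Tuberculosis": "Chest", "Heart-related": "Chest"}
final["AgeGroup"] = pd.cut(final["Age"], bins=2)
final["DiseaseGroup"] = final["Disease"].map(generalized)

# Smallest group when grouping by the age bin alone, then by both columns
k_age = int(final.groupby("AgeGroup", observed=True).size().min())
k_both = int(final.groupby(["AgeGroup", "DiseaseGroup"], observed=True).size().min())
print(k_age, k_both)  # 4 2
```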