What is K-Anonymity?

Data privacy is a concern that both users and institutions increasingly care about. Users do not want their data, or the conclusions drawn from it, revealed, and companies do not want to be liable for it. One popular method for reducing the risk of re-identifying individuals in a dataset is K-Anonymity.

The idea is that knowing some of the features recorded about a person should not be enough to single that person out in the data. K-Anonymity is a property of a dataset: for a chosen set of columns, every combination of values that appears must be shared by at least k individuals, so each person is indistinguishable from at least k-1 others. In other words, no grouping on those columns may contain fewer than k individuals.
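To make the definition concrete, here is a minimal sketch that measures k for a list of records. The function name `k_anonymity` and the toy records are made up for illustration, and the chosen columns are what the privacy literature usually calls quasi-identifiers.

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Return the size of the smallest group of records that share
    the same values for every quasi-identifier column."""
    groups = Counter(
        tuple(record[col] for col in quasi_identifiers) for record in records
    )
    return min(groups.values())

# Hypothetical toy records, made up for illustration only.
records = [
    {"age_group": "20-29", "city": "A", "diagnosis": "flu"},
    {"age_group": "20-29", "city": "A", "diagnosis": "cold"},
    {"age_group": "30-39", "city": "B", "diagnosis": "flu"},
    {"age_group": "30-39", "city": "B", "diagnosis": "asthma"},
    {"age_group": "30-39", "city": "B", "diagnosis": "cold"},
]

k = k_anonymity(records, ["age_group", "city"])
print(k)  # 2: the smallest (age_group, city) group has two records
```

Adding more columns to the list can only keep k the same or make it smaller, which is why the choice of columns matters as much as the value of k.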

We’ll use the data from the Wikipedia page to illustrate this idea.

import pandas as pd

# read_html returns a list with one DataFrame per table found on the page
url = 'https://en.wikipedia.org/wiki/K-anonymity'
site = pd.read_html(url)
rawData = site[0]
rawData

	Name	Age	Gender	Height	Weight	State of domicile	Religion	Disease
0	Ramsha	30	Female	165 cm	72 kg	Tamil Nadu	Hindu	Cancer
1	Yadu	24	Female	162 cm	70 kg	Kerala	Hindu	Viral infection
2	Salima	28	Female	170 cm	68 kg	Tamil Nadu	Muslim	Tuberculosis
3	Sunny	27	Male	170 cm	75 kg	Karnataka	Parsi	No illness
4	Joan	24	Female	165 cm	71 kg	Kerala	Christian	Heart-related
5	Bahuksana	23	Male	160 cm	69 kg	Karnataka	Buddhist	Tuberculosis
6	Rambha	19	Male	167 cm	85 kg	Kerala	Hindu	Cancer
7	Kishor	29	Male	180 cm	81 kg	Karnataka	Hindu	Heart-related
8	Johnson	17	Male	175 cm	79 kg	Kerala	Christian	Heart-related
9	John	19	Male	169 cm	82 kg	Kerala	Christian	Viral infection

Ok, so now that we have this data we can see it contains Sensitive Attributes which should not be linkable to a person in public data. As it stands, a single groupby() call is enough to pick out an individual. The most obvious problem is the Name column: a direct identifier like that should be dropped or Suppressed.

rawData['Name'] = '*'
rawData

	Name	Age	Gender	Height	Weight	State of domicile	Religion	Disease
0	*	30	Female	165 cm	72 kg	Tamil Nadu	Hindu	Cancer
1	*	24	Female	162 cm	70 kg	Kerala	Hindu	Viral infection
2	*	28	Female	170 cm	68 kg	Tamil Nadu	Muslim	Tuberculosis
3	*	27	Male	170 cm	75 kg	Karnataka	Parsi	No illness
4	*	24	Female	165 cm	71 kg	Kerala	Christian	Heart-related
5	*	23	Male	160 cm	69 kg	Karnataka	Buddhist	Tuberculosis
6	*	19	Male	167 cm	85 kg	Kerala	Hindu	Cancer
7	*	29	Male	180 cm	81 kg	Karnataka	Hindu	Heart-related
8	*	17	Male	175 cm	79 kg	Kerala	Christian	Heart-related
9	*	19	Male	169 cm	82 kg	Kerala	Christian	Viral infection

But even without the name, if I know your age and your religion then I can quickly identify you.

rawData.groupby(['Age','Religion']).size().reset_index(name='Count')

	Age	Religion	Count
0	17	Christian	1
1	19	Christian	1
2	19	Hindu	1
3	23	Buddhist	1
4	24	Christian	1
5	24	Hindu	1
6	27	Parsi	1
7	28	Muslim	1
8	29	Hindu	1
9	30	Hindu	1
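The same check can be turned around to list the rows that are exposed, rather than just counting groups. This is a rough sketch: the helper name `rows_below_k` is hypothetical, and the two columns are copied from the table above.

```python
import pandas as pd

# Age and Religion columns copied from the table above.
df = pd.DataFrame({
    "Age": [30, 24, 28, 27, 24, 23, 19, 29, 17, 19],
    "Religion": ["Hindu", "Hindu", "Muslim", "Parsi", "Christian",
                 "Buddhist", "Hindu", "Hindu", "Christian", "Christian"],
})

def rows_below_k(data, columns, k):
    """Return the rows whose combination of `columns` is shared by fewer than k rows."""
    # transform("size") gives every row the size of the group it belongs to
    group_size = data.groupby(columns)[columns[0]].transform("size")
    return data[group_size < k]

at_risk = rows_below_k(df, ["Age", "Religion"], k=2)
print(len(at_risk), "of", len(df), "rows are unique on Age + Religion")
```

With these ten rows every record is unique on Age and Religion, so all ten come back as at risk.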

Since we still want some representation of age in our analysis, we can Generalize the Age values by binning them. Guidance on bin sizes appears to come down to trial and error and your own judgement. As my own guideline, I’m going to use the standard deviation of Age, rounded down, as the number of bins and then step down one bin at a time until every bin meets the k threshold.

int(rawData['Age'].std())
rawData['AgeGroup'] = pd.cut(rawData['Age'], bins=int(rawData['Age'].std()))
rawData.groupby(['AgeGroup']).size().reset_index(name='Count')

	AgeGroup	Count
0	(16.987, 20.25]	3
1	(20.25, 23.5]	1
2	(23.5, 26.75]	2
3	(26.75, 30.0]	4

Not enough yet: the (20.25, 23.5] bin holds only one person. Let’s try one bin fewer:

rawData['AgeGroup'] = pd.cut(rawData['Age'], bins=int(rawData['Age'].std())-1)
rawData.groupby(['AgeGroup']).size().reset_index(name='Count')

	AgeGroup	Count
0	(16.987, 21.333]	3
1	(21.333, 25.667]	3
2	(25.667, 30.0]	4
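The step-down procedure used above can be automated. The following is a sketch under the same heuristic (start at the standard deviation rounded down, then remove one bin at a time). The function name `bins_for_k` is an assumption, and the ages are copied from the table.

```python
import pandas as pd

# Ages copied from the table above.
ages = pd.Series([30, 24, 28, 27, 24, 23, 19, 29, 17, 19], name="Age")

def bins_for_k(values, k):
    """Start at int(std) equal-width bins and remove one bin at a time
    until every non-empty bin holds at least k values."""
    for n_bins in range(int(values.std()), 0, -1):
        counts = pd.cut(values, bins=n_bins).value_counts()
        counts = counts[counts > 0]  # an empty bin exposes nobody
        if counts.min() >= k:
            return n_bins
    return None  # no bin count tried satisfied the threshold

n_bins = bins_for_k(ages, k=3)
print(n_bins)  # 3 for this data, matching the manual steps above
```

Equal-width bins are only one option. Quantile-based bins (for example `pd.qcut`) would balance group sizes more evenly, at the cost of less readable bin edges.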

That looks good: every bin now holds at least three people. But what happens if we take AgeGroup in combination with another column?

for col in rawData.columns.to_list():
    if col not in ['AgeGroup', 'Age']:
        print(rawData.groupby(['AgeGroup', col]).size().reset_index(name='Count'))
        print()

           AgeGroup Name  Count
0  (16.987, 21.333]    *      3
1  (21.333, 25.667]    *      3
2    (25.667, 30.0]    *      4

           AgeGroup  Gender  Count
0  (16.987, 21.333]  Female      0
1  (16.987, 21.333]    Male      3
2  (21.333, 25.667]  Female      2
3  (21.333, 25.667]    Male      1
4    (25.667, 30.0]  Female      2
5    (25.667, 30.0]    Male      2

            AgeGroup  Height  Count
0   (16.987, 21.333]  160 cm      0
1   (16.987, 21.333]  162 cm      0
2   (16.987, 21.333]  165 cm      0
3   (16.987, 21.333]  167 cm      1
4   (16.987, 21.333]  169 cm      1
5   (16.987, 21.333]  170 cm      0
6   (16.987, 21.333]  175 cm      1
7   (16.987, 21.333]  180 cm      0
8   (21.333, 25.667]  160 cm      1
9   (21.333, 25.667]  162 cm      1
10  (21.333, 25.667]  165 cm      1
11  (21.333, 25.667]  167 cm      0
12  (21.333, 25.667]  169 cm      0
13  (21.333, 25.667]  170 cm      0
14  (21.333, 25.667]  175 cm      0
15  (21.333, 25.667]  180 cm      0
16    (25.667, 30.0]  160 cm      0
17    (25.667, 30.0]  162 cm      0
18    (25.667, 30.0]  165 cm      1
19    (25.667, 30.0]  167 cm      0
20    (25.667, 30.0]  169 cm      0
21    (25.667, 30.0]  170 cm      2
22    (25.667, 30.0]  175 cm      0
23    (25.667, 30.0]  180 cm      1

            AgeGroup Weight  Count
0   (16.987, 21.333]  68 kg      0
1   (16.987, 21.333]  69 kg      0
2   (16.987, 21.333]  70 kg      0
3   (16.987, 21.333]  71 kg      0
4   (16.987, 21.333]  72 kg      0
5   (16.987, 21.333]  75 kg      0
6   (16.987, 21.333]  79 kg      1
7   (16.987, 21.333]  81 kg      0
8   (16.987, 21.333]  82 kg      1
9   (16.987, 21.333]  85 kg      1
10  (21.333, 25.667]  68 kg      0
11  (21.333, 25.667]  69 kg      1
12  (21.333, 25.667]  70 kg      1
13  (21.333, 25.667]  71 kg      1
14  (21.333, 25.667]  72 kg      0
15  (21.333, 25.667]  75 kg      0
16  (21.333, 25.667]  79 kg      0
17  (21.333, 25.667]  81 kg      0
18  (21.333, 25.667]  82 kg      0
19  (21.333, 25.667]  85 kg      0
20    (25.667, 30.0]  68 kg      1
21    (25.667, 30.0]  69 kg      0
22    (25.667, 30.0]  70 kg      0
23    (25.667, 30.0]  71 kg      0
24    (25.667, 30.0]  72 kg      1
25    (25.667, 30.0]  75 kg      1
26    (25.667, 30.0]  79 kg      0
27    (25.667, 30.0]  81 kg      1
28    (25.667, 30.0]  82 kg      0
29    (25.667, 30.0]  85 kg      0

           AgeGroup State of domicile  Count
0  (16.987, 21.333]         Karnataka      0
1  (16.987, 21.333]            Kerala      3
2  (16.987, 21.333]        Tamil Nadu      0
3  (21.333, 25.667]         Karnataka      1
4  (21.333, 25.667]            Kerala      2
5  (21.333, 25.667]        Tamil Nadu      0
6    (25.667, 30.0]         Karnataka      2
7    (25.667, 30.0]            Kerala      0
8    (25.667, 30.0]        Tamil Nadu      2

            AgeGroup   Religion  Count
0   (16.987, 21.333]   Buddhist      0
1   (16.987, 21.333]  Christian      2
2   (16.987, 21.333]      Hindu      1
3   (16.987, 21.333]     Muslim      0
4   (16.987, 21.333]      Parsi      0
5   (21.333, 25.667]   Buddhist      1
6   (21.333, 25.667]  Christian      1
7   (21.333, 25.667]      Hindu      1
8   (21.333, 25.667]     Muslim      0
9   (21.333, 25.667]      Parsi      0
10    (25.667, 30.0]   Buddhist      0
11    (25.667, 30.0]  Christian      0
12    (25.667, 30.0]      Hindu      2
13    (25.667, 30.0]     Muslim      1
14    (25.667, 30.0]      Parsi      1

            AgeGroup          Disease  Count
0   (16.987, 21.333]           Cancer      1
1   (16.987, 21.333]    Heart-related      1
2   (16.987, 21.333]       No illness      0
3   (16.987, 21.333]     Tuberculosis      0
4   (16.987, 21.333]  Viral infection      1
5   (21.333, 25.667]           Cancer      0
6   (21.333, 25.667]    Heart-related      1
7   (21.333, 25.667]       No illness      0
8   (21.333, 25.667]     Tuberculosis      1
9   (21.333, 25.667]  Viral infection      1
10    (25.667, 30.0]           Cancer      1
11    (25.667, 30.0]    Heart-related      1
12    (25.667, 30.0]       No illness      1
13    (25.667, 30.0]     Tuberculosis      1
14    (25.667, 30.0]  Viral infection      0
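The printed tables are long, so a compact summary can be easier to scan. Here is a sketch that reports only the smallest non-empty group for each pairing. The helper name `smallest_groups` is hypothetical, the data is copied from the table above, and Height and Weight are left out for brevity.

```python
import pandas as pd

# Columns copied from the table above (Name, Height and Weight omitted).
df = pd.DataFrame({
    "Age": [30, 24, 28, 27, 24, 23, 19, 29, 17, 19],
    "Gender": ["Female", "Female", "Female", "Male", "Female",
               "Male", "Male", "Male", "Male", "Male"],
    "State of domicile": ["Tamil Nadu", "Kerala", "Tamil Nadu", "Karnataka", "Kerala",
                          "Karnataka", "Kerala", "Karnataka", "Kerala", "Kerala"],
    "Religion": ["Hindu", "Hindu", "Muslim", "Parsi", "Christian",
                 "Buddhist", "Hindu", "Hindu", "Christian", "Christian"],
    "Disease": ["Cancer", "Viral infection", "Tuberculosis", "No illness", "Heart-related",
                "Tuberculosis", "Cancer", "Heart-related", "Heart-related", "Viral infection"],
})
df["AgeGroup"] = pd.cut(df["Age"], bins=3)

def smallest_groups(data, anchor, others):
    """For each column in `others`, report the smallest non-empty group
    formed together with the `anchor` column."""
    result = {}
    for col in others:
        # observed=True skips bin/value combinations that never occur
        sizes = data.groupby([anchor, col], observed=True).size()
        result[col] = int(sizes.min())
    return pd.Series(result, name="SmallestGroup")

summary = smallest_groups(
    df, "AgeGroup", ["Gender", "State of domicile", "Religion", "Disease"]
)
print(summary)
```

Any column whose smallest group is 1 still lets someone be singled out when that column is combined with the age bin.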

Ok, looking at these results, most of the pairings still contain groups of a single person, and Religion and Disease are clearly still problems. This is where we need to ask ourselves whether we have enough data and what the goals of our analysis are. If we were trying to explore which features are related to these Diseases then we would need more data, since only one record has No illness and it will always be alone no matter how we bin. In this instance we could drop that value:

filteredData = rawData.loc[ ~rawData.Disease.str.contains('illness')].copy()
filteredData.groupby(['AgeGroup', 'Disease']).size().reset_index(name='Count')

	AgeGroup	Disease	Count
0	(16.987, 21.333]	Cancer	1
1	(16.987, 21.333]	Heart-related	1
2	(16.987, 21.333]	Tuberculosis	0
3	(16.987, 21.333]	Viral infection	1
4	(21.333, 25.667]	Cancer	0
5	(21.333, 25.667]	Heart-related	1
6	(21.333, 25.667]	Tuberculosis	1
7	(21.333, 25.667]	Viral infection	1
8	(25.667, 30.0]	Cancer	1
9	(25.667, 30.0]	Heart-related	1
10	(25.667, 30.0]	Tuberculosis	1
11	(25.667, 30.0]	Viral infection	0

Not quite there. Another technique that will help is generalizing the Disease values. We’ll use a substitution hierarchy to replace some values in the data with slightly less specific values.

# Map each specific disease to a broader category
heirarchy = {
    'Cancer': 'Body',
    'Viral infection': 'Body',
    'Tuberculosis': 'Chest',
    'Heart-related': 'Chest'
}

# Keep the original value when a disease has no broader category
filteredData['heirarchy'] = filteredData['Disease'].apply(lambda x: heirarchy.get(x, x))
filteredData.groupby(['AgeGroup','heirarchy']).size().reset_index(name='count')

	AgeGroup	heirarchy	count
0	(16.987, 21.333]	Body	2
1	(16.987, 21.333]	Chest	1
2	(21.333, 25.667]	Body	1
3	(21.333, 25.667]	Chest	2
4	(25.667, 30.0]	Body	1
5	(25.667, 30.0]	Chest	2

Almost there. Let’s re-examine our age bins and drop one more bin, which makes each bin wider.

filteredData['AgeGroup'] = pd.cut(filteredData['Age'], bins=int(filteredData['Age'].std())-2)
filteredData.groupby(['AgeGroup']).size().reset_index(name='Count')

	AgeGroup	Count
0	(16.987, 23.5]	4
1	(23.5, 30.0]	5

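And there we go! Every age group now holds at least 4 people, so this would be 4-anonymity with respect to AgeGroup alone. Combined with the generalized Disease values, the smallest group holds 2 people, so that pair of columns is only 2-anonymous. Whether that is enough to consider the data for a public dataset or algorithm depends on the k you need. This is a really small dataset, but this process is how you would anonymize larger datasets as well.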
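As a quick check on the final table, the smallest group can be recomputed for AgeGroup alone and for AgeGroup together with the generalized Disease values. This is a sketch: the nine rows are the table above with the No illness record removed, and the names `smallest_group` and `DiseaseGroup` are made up for illustration.

```python
import pandas as pd

# Age and Disease for the nine rows left after dropping "No illness".
df = pd.DataFrame({
    "Age": [30, 24, 28, 24, 23, 19, 29, 17, 19],
    "Disease": ["Cancer", "Viral infection", "Tuberculosis", "Heart-related",
                "Tuberculosis", "Cancer", "Heart-related", "Heart-related",
                "Viral infection"],
})

# Same substitution table as in the walkthrough above.
hierarchy = {
    "Cancer": "Body",
    "Viral infection": "Body",
    "Tuberculosis": "Chest",
    "Heart-related": "Chest",
}

df["AgeGroup"] = pd.cut(df["Age"], bins=2)
df["DiseaseGroup"] = df["Disease"].map(hierarchy)

def smallest_group(data, columns):
    """Size of the smallest non-empty group for the given columns."""
    return int(data.groupby(columns, observed=True).size().min())

k_age = smallest_group(df, ["AgeGroup"])
k_age_disease = smallest_group(df, ["AgeGroup", "DiseaseGroup"])
print(k_age, k_age_disease)  # 4 2
```

The k reached always depends on which columns are grouped together, so it is worth stating the column set whenever a k value is reported.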