Pseudonymization, with K-anonymity

These pseudonymized tables are generated by applying K-anonymity to columns that contain Personally Identifying Information (PII).

K-anonymization was done using the ARX anonymization tool. This tool provides a number of static anonymization methods as well as tools for evaluating risk. By static anonymization, we mean that the tool operates once on the data, producing a new anonymized or de-identified table.

To produce the pseudonymized table, columns were configured as being one of quasi-identifying, identifying, or sensitive.

Identifying columns are those that, taken individually, can identify a user. These are completely masked (the values are replaced with an asterisk symbol). For all tables, we configured uid, ssn, and email as identifying. For the banking tables, we additionally configured account_id as identifying.
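As a minimal sketch of what complete masking amounts to (this is not ARX's implementation; representing rows as Python dictionaries and the function name mask_identifying are assumptions made for illustration):

```python
# Column names come from the text above; the dictionary row layout is an assumption.
IDENTIFYING = {"uid", "ssn", "email"}


def mask_identifying(rows, identifying=IDENTIFYING, mask="*"):
    """Return copies of the rows with every identifying column replaced by the mask."""
    return [
        {col: (mask if col in identifying else val) for col, val in row.items()}
        for row in rows
    ]


# Made-up example row, not taken from any of the real tables.
rows = [
    {"uid": 1, "ssn": "123-45-6789", "email": "a@example.com", "zip": "12345"},
]
masked = mask_identifying(rows)
```

Here the zip value is left untouched because it is not in the identifying set; in the actual tables it is handled as a quasi-identifying column instead.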

Quasi-identifying columns are those whose values taken individually are unlikely to identify a user, but taken together may do so. For instance, first name, last name, and zip code taken together may identify a user. K-anonymity ensures that any combination of values in the quasi-identifying columns pertains to at least K users. In all cases, we configured lastname, gender, street, and zip as quasi-identifying. (Note that we also tried to configure firstname and birthdate as quasi-identifying, but for a reason we never resolved, ARX prevented us from configuring more than four quasi-identifying columns. We therefore configured firstname and birthdate as identifying. We don’t think this affected the result, because in any event the quasi-identifying columns were completely masked.)
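The K-anonymity property described above can be checked with a short sketch. This is an illustration of the definition only, not the procedure ARX uses to anonymize data; the function name is_k_anonymous and the sample rows are assumptions.

```python
from collections import Counter

# Quasi-identifying columns as configured in the text above.
QUASI_IDENTIFIERS = ("lastname", "gender", "street", "zip")


def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs in at least k rows."""
    counts = Counter(tuple(row[c] for c in quasi_identifiers) for row in rows)
    return all(n >= k for n in counts.values())


# Made-up rows: two share all quasi-identifier values, one is unique.
sample = [
    {"lastname": "Smith", "gender": "F", "street": "Main St", "zip": "12345"},
    {"lastname": "Smith", "gender": "F", "street": "Main St", "zip": "12345"},
    {"lastname": "Jones", "gender": "M", "street": "Oak Ave", "zip": "54321"},
]

# Fully masked rows all fall into a single group.
fully_masked = [{c: "*" for c in QUASI_IDENTIFIERS} for _ in sample]

sample_ok = is_k_anonymous(sample, QUASI_IDENTIFIERS, 2)
masked_ok = is_k_anonymous(fully_masked, QUASI_IDENTIFIERS, 2)
```

The unique "Jones" row makes the sample fail for K=2, whereas the fully masked rows pass for any K up to the number of rows, since they form one group.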

Sensitive columns are not modified. ARX required that we assign a t-closeness value to sensitive columns, so we used the recommended value of t=0.001.

We produced two sets of tables, one for K=2 and one for K=5.

The result of all this is that pseudonymization with K-anonymity produced effectively the same result as pseudonymization by column suppression: the quasi-identifying columns (containing PII) were completely masked, which is equivalent to suppressing the columns. This is the case for both K=2 and K=5. We believe that with some effort we could have retained some of the quasi-identifying information, so these particular datasets should be considered as “low-effort” pseudonymization through K-anonymization.

The data can be explored with an SQL client at https://db001.gda-score.org/. The database names follow the pattern k_anon_K_table_partial, where K is either 2 or 5, and table is one of banking, taxi, census, or scihub.
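The naming pattern can be expanded into the full list of database names with a short sketch. The function name database_names is hypothetical; only the pattern and the values of K and table come from the text, and whether every combination actually exists on the server is not confirmed here.

```python
from itertools import product

K_VALUES = (2, 5)
TABLES = ("banking", "taxi", "census", "scihub")


def database_names():
    """Expand the k_anon_K_table_partial pattern for every K and table."""
    return [f"k_anon_{k}_{table}_partial" for k, table in product(K_VALUES, TABLES)]


names = database_names()
```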

Other databases

The GDA Score project offers a number of real databases that can be used to test and measure anonymization methods.

Raw USA Census Database

This database is taken from the US Census of 2013. 


Raw Czech Banking Data

This dataset contains a set of banking transactions and other data from a Czech bank.


Raw Scihub Database

This database contains one week’s worth of downloads from the Sci-Hub scientific papers free download system. 
