K-anonymization

The K-anonymized tables are generated by applying K-anonymity to all columns.

K-anonymization was done using the ARX anonymization tool. This tool provides a number of static anonymization methods as well as tools for evaluating risk. By static anonymization, we mean that the tool operates once on the data, producing a new anonymized or de-identified table.

To produce the anonymized table, columns were configured as either quasi-identifying or identifying.

Identifying columns are those that, taken individually, can identify a user. These are completely masked (each value is replaced with an asterisk). For all tables, we configured uid, ssn, and email as identifying. For the banking tables, we additionally configured account_id as identifying.
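The masking of identifying columns can be sketched as follows. This is a minimal illustration in plain Python, not ARX's implementation: the function name mask_identifying and the sample row are hypothetical, while the column names are the ones listed above.

```python
# Columns configured as identifying for all tables (taken from the text above).
IDENTIFYING = {"uid", "ssn", "email"}
# The banking tables additionally treat account_id as identifying.
BANKING_IDENTIFYING = IDENTIFYING | {"account_id"}


def mask_identifying(rows, identifying=IDENTIFYING, mask="*"):
    """Return copies of the rows with every identifying column set to the mask."""
    return [
        {col: (mask if col in identifying else val) for col, val in row.items()}
        for row in rows
    ]


# Hypothetical sample row, for illustration only.
sample = [{"uid": 17, "ssn": "123-45-6789", "email": "a@example.org", "gender": "F"}]
masked = mask_identifying(sample)
```

Only the identifying columns are overwritten; every other column passes through unchanged, and the input rows are not modified.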

In principle, all other columns would then be quasi-identifying, but we could not get ARX to work with more than four columns labeled as quasi-identifying. We therefore configured lastname, gender, street, and zip as quasi-identifying. With this configuration, the output retained gender and, for instance, the first couple of characters of street. The other columns were masked out. From this we assume that any additional quasi-identifying columns would also have been masked out, so as an expedient we configured all remaining columns as identifying.
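The shape of the resulting rows can be sketched as below. This only mimics the outcome described above and is not how ARX computes it: the function name generalize_row is hypothetical, the prefix length of 2 is an assumption standing in for "the first couple of characters", and the sample row is invented.

```python
def generalize_row(row, street_prefix_len=2, mask="*"):
    """Keep gender, keep a short prefix of street, and mask lastname and zip.

    street_prefix_len=2 is an assumed value, not one reported by the source.
    """
    return {
        "lastname": mask,
        "gender": row["gender"],
        "street": row["street"][:street_prefix_len],
        "zip": mask,
    }


# Hypothetical input row, for illustration only.
generalized = generalize_row(
    {"lastname": "Novak", "gender": "F", "street": "Main Street", "zip": "12345"}
)
```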

We produced two sets of tables, one for K=2 and one for K=5.
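What the two values of K require can be made concrete with a small check: a table is K-anonymous when every combination of quasi-identifier values is shared by at least K rows. The sketch below is illustrative plain Python and not part of ARX. The function name is_k_anonymous and the sample rows are assumptions; the quasi-identifying columns are the four named above.

```python
from collections import Counter

# Quasi-identifying columns as configured in the text above.
QUASI_IDENTIFYING = ("lastname", "gender", "street", "zip")


def is_k_anonymous(rows, k, quasi=QUASI_IDENTIFYING):
    """True if every quasi-identifier combination occurs in at least k rows."""
    groups = Counter(tuple(row[col] for col in quasi) for row in rows)
    return all(size >= k for size in groups.values())


# Hypothetical generalized rows: lastname and zip masked, street truncated.
rows = [
    {"lastname": "*", "gender": "F", "street": "Ma", "zip": "*"},
    {"lastname": "*", "gender": "F", "street": "Ma", "zip": "*"},
    {"lastname": "*", "gender": "M", "street": "Oa", "zip": "*"},
    {"lastname": "*", "gender": "M", "street": "Oa", "zip": "*"},
    {"lastname": "*", "gender": "M", "street": "Oa", "zip": "*"},
]

satisfies_k2 = is_k_anonymous(rows, 2)
satisfies_k5 = is_k_anonymous(rows, 5)
```

In this invented sample the smallest group has two rows, so the rows satisfy K=2 but not K=5. Reaching K=5 would require coarser values or more suppression, which is why a larger K tends to destroy more data.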

The result is that all data is destroyed except the gender column and a couple of characters of another column. This is the case for both K=2 and K=5. We believe that with some effort (that is, manually categorizing data into hierarchies) we could have retained slightly more quasi-identifying information, but for the most part, attempting full K-anonymization destroys all but the simplest data.

The data can be explored with an SQL client at https://db001.gda-score.org/. The database names follow the pattern k_anon_K_table_full, where K is either 2 or 5, and table is one of banking, taxi, census, or scihub.
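The naming pattern can be expanded programmatically, for example to loop over the databases from a script. The sketch below is plain Python; the helper name database_name is hypothetical, and it assumes that a database exists for every combination of K and table, which the text implies but does not state outright.

```python
from itertools import product

K_VALUES = (2, 5)
TABLES = ("banking", "taxi", "census", "scihub")


def database_name(k, table):
    """Build a database name following the k_anon_K_table_full pattern."""
    return f"k_anon_{k}_{table}_full"


# All K/table combinations, e.g. k_anon_2_banking_full.
names = [database_name(k, table) for k, table in product(K_VALUES, TABLES)]
```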

Other databases

The GDA Score project offers a number of real databases that can be used to test and measure anonymization methods.

Raw Czech Banking Data

This dataset contains a set of banking transactions and other data from a Czech bank.


Pseudonymization, Column Suppression

The column-suppressed pseudonymized tables are generated by simply deleting columns that contain Personally Identifying Information (PII).


Raw USA Census Database

This database is taken from the US Census of 2013. 
