Pseudonymization, Column Suppression

The column-suppressed pseudonymized tables are generated by deleting the columns that contain Personally Identifiable Information (PII).

Each of the “raw” tables (raw_taxi, raw_banking, raw_census, and raw_scihub) has a pseudonymized version. These are called pseudo_taxi, pseudo_banking, pseudo_census, and pseudo_scihub respectively.

All of the tables have certain common columns. In each pseudonymized table, the following common columns were deleted:

lastname, firstname, birthdate, ssn, email, and street.

In addition, the account_id and birth_number columns were deleted from all of the banking tables.
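
The suppression step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the project's actual tooling: the function name suppress_columns, the sample row, and the uid and amount columns are hypothetical and invented for the example. Only the names of the deleted columns come from the text above.

```python
# Common PII columns deleted from every pseudonymized table (from the text above).
COMMON_PII_COLUMNS = {"lastname", "firstname", "birthdate", "ssn", "email", "street"}
# Additional columns deleted from the banking tables only (from the text above).
BANKING_EXTRA_COLUMNS = {"account_id", "birth_number"}


def suppress_columns(rows, columns_to_drop):
    """Return copies of the rows with the named columns removed.

    Hypothetical helper: the input rows are left unmodified.
    """
    return [
        {name: value for name, value in row.items() if name not in columns_to_drop}
        for row in rows
    ]


# Hypothetical sample row; all values and the uid/amount columns are invented.
raw_rows = [
    {
        "uid": 1,
        "lastname": "Doe",
        "firstname": "Jane",
        "birthdate": "1980-01-01",
        "ssn": "000-00-0000",
        "email": "jane@example.com",
        "street": "1 Main St",
        "account_id": 42,
        "birth_number": 800101,
        "amount": 100.0,
    }
]

# For a banking table, drop both the common and the banking-specific columns.
pseudo_rows = suppress_columns(raw_rows, COMMON_PII_COLUMNS | BANKING_EXTRA_COLUMNS)
print(pseudo_rows)  # [{'uid': 1, 'amount': 100.0}]
```

Note that suppression only removes the listed columns. Any remaining columns are left exactly as they were in the raw table.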

The data can be explored with an SQL client at https://db001.gda-score.org/.

Other databases

The GDA Score project offers a number of real databases that can be used to test and measure anonymization methods.

K-anonymization

The K-anonymized tables are generated by applying K-anonymity to all columns.


Raw NYC Taxi Database

This database contains four hours of New York City taxi rides, from 8 AM to noon on January 8, 2013.


Raw USA Census Database

This database is taken from the US Census of 2013. 
