The Open GDA Score Project uses a number of publicly available datasets for producing anonymity measures. This short article discusses how we adapt those datasets for the project.
Each database was publicly released as micro-data, meaning that each row in the database pertains to one individual. Each database was at least pseudonymized before public release. Some databases had additional de-identification applied. The databases nevertheless originate from real data. We refer to these publicly-released databases as raw databases.
We added columns to the raw databases that re-introduce some of the personally identifying information lost in the pseudonymization that took place prior to public release. Specifically, we added the following columns: firstname, lastname, street, zip, ssn, email, gender, and birthdate. These columns were synthetically generated using the GenerateData tool at https://www.generatedata.com/. Note that no attempt was made to generate sensible values: the birthdates are not distributed as they would be in a real population, women's first names are given to men and vice versa, and the zip codes derive from different countries' formats. The goal of these additional columns is not to emulate real data, but to provide columns that can be attacked.
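The project generated these columns with the GenerateData web tool, but the idea can be sketched in a few lines of Python. The value pools, helper names, and formats below are all hypothetical illustrations, not the project's actual data; like the real columns, they make no attempt at realism (names and genders are uncorrelated, and the identifiers are random strings):

```python
import random
import string

# Hypothetical value pools; the project used https://www.generatedata.com/,
# not this script.
FIRST_NAMES = ["Alice", "Bob", "Carol", "Dave"]
LAST_NAMES = ["Smith", "Jones", "Lee", "Khan"]

def random_ssn():
    """Return an SSN-shaped random string (purely synthetic, not valid)."""
    return "%03d-%02d-%04d" % (random.randint(0, 999),
                               random.randint(0, 99),
                               random.randint(0, 9999))

def add_identity_columns(row):
    """Attach synthetic PII columns to one row (a dict) of a raw database.

    As in the project's datasets, no attempt is made to produce sensible
    values: gender is independent of the name, and birthdates are uniform
    rather than realistically distributed.
    """
    first = random.choice(FIRST_NAMES)
    last = random.choice(LAST_NAMES)
    row.update({
        "firstname": first,
        "lastname": last,
        "street": "%d Main St" % random.randint(1, 9999),
        "zip": "".join(random.choices(string.digits, k=5)),
        "ssn": random_ssn(),
        "email": ("%s.%s@example.com" % (first, last)).lower(),
        "gender": random.choice(["M", "F"]),
        "birthdate": "%04d-%02d-%02d" % (random.randint(1930, 2005),
                                         random.randint(1, 12),
                                         random.randint(1, 28)),
    })
    return row
```

The point of such columns is only that an attack has something concrete to target; their statistical properties are irrelevant.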
We also either designated an existing column as a unique identifier or created a new column to serve as one. This column is necessary for producing the GDA defense score: it allows the system to measure whether anonymity has been compromised.
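When no suitable column exists, creating one is straightforward. A minimal sketch, with a hypothetical column name `uid` (the project's actual column name may differ):

```python
def add_uid_column(rows):
    """Assign a sequential unique identifier to each row of a database,
    represented here as a list of dicts. Any scheme works so long as the
    values are unique, since the identifier is only used to check whether
    a specific individual's anonymity was compromised."""
    for uid, row in enumerate(rows, start=1):
        row["uid"] = uid
    return rows
```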
Finally, in most cases we reduced the size of the public database, either by limiting the time period from which the data is taken or by randomly selecting users. We do this as a convenience so that defense and utility measures run faster, since these measures involve hundreds of queries.
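A sketch of the user-sampling variant, under the assumption that the data is micro-data with one user-identifying column (the function and parameter names are illustrative, not the project's):

```python
import random

def sample_users(rows, user_col, n_users, seed=0):
    """Shrink a database by keeping every row belonging to a random
    subset of users. Sampling whole users rather than individual rows
    preserves each retained user's complete record set, which matters
    when later queries aggregate per user."""
    users = sorted({row[user_col] for row in rows})
    rng = random.Random(seed)  # fixed seed for a reproducible subset
    keep = set(rng.sample(users, min(n_users, len(users))))
    return [row for row in rows if row[user_col] in keep]
```

Limiting the time period instead is just a filter on a date column; the user-sampling form is shown because it is the less obvious of the two.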
We apply a variety of static de-identification methods to the raw databases, for instance pseudonymization and (soon) k-anonymity. Additional static de-identification methods such as Differential Privacy and synthetic data are planned. All of the resulting databases are made available through a PostgreSQL query interface. We also provide query interfaces for dynamic de-identification methods, such as Diffix and (under development) Differential Privacy.