Author: Paul Francis, MPI-SWS
The Open GDA Score Project provides a software toolkit for building attacks and automatically computing a GDA Score. Example attacks can be found at https://github.com/gda-score/attacks. This article describes how to use the toolkit to build attacks.
Recommended Background Reading
This article describes the design of the GDA score: https://www.gda-score.org/what-is-a-gda-score/
This article shows how to read a GDA score diagram: https://www.gda-score.org/a-brief-guide-score-diagrams/
Existing attacks may be found at: https://github.com/gda-score/attacks. See the README for how to install.
API documentation may be found at: https://gda-score.github.io
Basic Attack Score Components
As described in https://www.gda-score.org/what-is-a-gda-score/, the GDA Attack Score has 5 components:
Susceptibility: how susceptible the various dataset attributes are to attack (how much of the data can be attacked),
Confidence Improvement and Claim Probability: the accuracy of what is learned relative to how much of the susceptible data the attacker chooses to attack,
Prior Knowledge: the amount and type of prior knowledge needed by the attacker (what the attacker knows about the protected data or related external data), and
Work: the amount of “work” needed to do the attack (i.e. the number of queries).
All of these components can be computed by the class gdaAttack(), but the API must be used properly for the scores to be correct. Note in particular that gdaAttack() does not enforce correct usage: an attack designer can manipulate the API into producing any score he or she wishes.
General Exploration of Data
The gdaAttack() API allows one to explore the data without this exploration affecting the score. Most generally this is done through the askExplore()/getExplore() API calls, which execute SQL queries and return their answers. This interface is useful for any program that wishes to explore data, and is for instance used by the software that measures utility.
The gdaAttack() API also provides a number of convenience functions for data exploration. Often in the context of an attack, it is reasonable to assume that there is general public knowledge of certain aspects of a dataset, and that this knowledge should not be regarded as special prior knowledge associated with the attack.
For instance, it is reasonable to assume that the table name, the column names, and the column types are all commonly known. The API provides the following helper functions to obtain this information:
getAttackTableName()
getColNames()
getColNamesAndTypes()
getTableNames()
It is also reasonable to assume common public knowledge of certain values in a column, specifically values that are likely to pertain to many users. The gdaAttack() API provides a function that returns these values for the named column:
Public Database for Linkability Attacks
Linkability attacks assume that there is a public database that contains data that may be linked to a protected database. This public database can be explored with the askExplore()/getExplore() interface, or browsed with the SQL client at https://db001.gda-score.org/.
Prior Knowledge
If an attack requires specific prior knowledge (beyond the assumed common public knowledge just mentioned), then that knowledge should be obtained using either the getPriorKnowledge() interface (preferred) or the askKnowledge()/getKnowledge() interface. The getPriorKnowledge() interface allows prior knowledge to be requested according to a number of criteria, for instance whether the known rows or users are randomly selected or are selected according to specific values of a given column. The getPriorKnowledge() interface is required by the Diffix bounty program. By contrast, the askKnowledge()/getKnowledge() interface returns prior knowledge according to a SQL query.
Both interfaces record the number of cells (column values) that are returned by all of the calls. Note that if the same data is requested more than once, the interface will over-count.
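The over-counting behavior can be illustrated with a minimal sketch. The CellCounter class below is hypothetical and is not part of the gdaAttack() API; it only assumes that every returned row contributes one cell per column value, and that nothing is deduplicated across calls.

```python
class CellCounter:
    """Hypothetical tally of cells (column values) across returned answers."""

    def __init__(self):
        self.cells = 0

    def record(self, rows):
        # Each row contributes one cell per column value it contains.
        self.cells += sum(len(row) for row in rows)
        return rows


counter = CellCounter()
answer = [("alice", 1990), ("bob", 1985)]  # 2 rows x 2 columns = 4 cells

counter.record(answer)
counter.record(answer)  # same data requested again, so it is counted again
print(counter.cells)
```

Under these assumptions, an attack that wants an accurate prior knowledge score should request each piece of knowledge once and keep the result locally.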
Work
Work is computed through the askAttack()/getAttack() interface. As with the ask/get interfaces mentioned above, askAttack()/getAttack() executes the supplied SQL query on the anonymous database (as specified by the anonDb parameter) and provides the answer. askAttack()/getAttack() records the number of cells (column values) that are returned by all of the queries.
Confidence Improvement and Claim Probability
Confidence improvement and claim probability are computed with the askClaim()/getClaim() interface.
Recall from https://www.gda-score.org/what-is-a-gda-score/ that there are three criteria for measuring anonymity: singling out, inference, and linkability. Each of these criteria has an associated claim:
Singling-out: There is a single user with attributes A, B, C, …
Linkability: A given set of one or more users in a known dataset are also in the protected dataset
Inference: All users with attributes A, B, C, … also have attribute X
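To make the three claims concrete, here is a small sketch that evaluates each of them against a toy table held in memory. The table, its column names, and the helper functions are invented for illustration and are not how gdaAttack() checks claims. Treating an inference claim with no matching users as incorrect is a choice made here.

```python
# Toy "protected" table; columns and values are invented for illustration.
protected = [
    {"uid": 1, "zip": "12345", "age": 30, "gender": "M"},
    {"uid": 2, "zip": "12345", "age": 41, "gender": "F"},
    {"uid": 3, "zip": "54321", "age": 41, "gender": "F"},
]


def matching(rows, attrs):
    """Return the rows whose columns equal every value in attrs."""
    return [r for r in rows if all(r[c] == v for c, v in attrs.items())]


def singling_out_holds(rows, attrs):
    # Correct only if exactly one user has all of the claimed attributes.
    return len({r["uid"] for r in matching(rows, attrs)}) == 1


def inference_holds(rows, known, guess):
    # Correct only if some user matches and every match has the guessed attribute.
    hits = matching(rows, known)
    return bool(hits) and all(r[c] == v for r in hits for c, v in guess.items())


def linkability_holds(rows, known_uids):
    # Correct only if every claimed user is present in the protected table.
    return set(known_uids) <= {r["uid"] for r in rows}


print(singling_out_holds(protected, {"zip": "12345", "age": 30}))
print(inference_holds(protected, {"age": 41}, {"gender": "F"}))
print(linkability_holds(protected, [1, 3]))
```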
gdaAttack() only supports equality claims, for instance column = value. gdaAttack() does not support non-equality claims (column != value) or inequality claims (column <= value or column BETWEEN val1 AND val2). These will be added in the future as required.
For all three criteria, the askClaim()/getClaim() interface requires a set of column = value equalities (where each such equality is an attribute). The equalities are conveyed in the spec data structure, which is a list of
gdaAttack() uses this spec to generate the appropriate queries needed to determine if the claim is correct or not.
The equalities in the spec are labeled as being either known or guess. For the inference claim, attributes A, B, C, … are the known attributes (equalities), while X would be the guessed attribute. askClaim() queries the raw database for attributes A, B, C, …, and then checks the answer to see if attribute X holds for all returned rows.
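The inference check described above can be sketched with an in-memory SQLite database standing in for the raw database. The spec layout used here (a dictionary with known and guess lists of col/val pairs), the accounts table, and the check_inference() helper are all assumptions made for illustration. Consult the API documentation for the actual spec format and behavior.

```python
import sqlite3

# Assumed layout: 'known' and 'guess' each hold a list of column/value pairs.
spec = {
    "known": [{"col": "zip", "val": "12345"}, {"col": "age", "val": 41}],
    "guess": [{"col": "gender", "val": "F"}],
}


def check_inference(conn, table, spec):
    """Query rows matching the known attributes, then test the guessed ones."""
    known, guess = spec["known"], spec["guess"]
    select_cols = ", ".join(g["col"] for g in guess)
    where = " AND ".join(f"{k['col']} = ?" for k in known)
    # Values are bound as parameters; table and column names are trusted here.
    sql = f"SELECT {select_cols} FROM {table} WHERE {where}"
    rows = conn.execute(sql, [k["val"] for k in known]).fetchall()
    expected = tuple(g["val"] for g in guess)
    # The claim holds only if some row matches and every match has the guess.
    return bool(rows) and all(row == expected for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (uid INTEGER, zip TEXT, age INTEGER, gender TEXT)")
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?, ?, ?)",
    [(1, "12345", 30, "M"), (2, "12345", 41, "F"), (3, "54321", 41, "F")],
)
print(check_inference(conn, "accounts", spec))
```

The sketch expects at least one known equality, since an empty known list would produce an empty WHERE clause.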
The spec labels are also used to compute confidence improvement for the singling-out and inference criteria. For instance, suppose the attacker is making a claim that the gender of a user is male. If 50% of all users are male, then confidence improvement is the extent to which an analyst improves over a statistical guess of 50%. In order to compute confidence improvement, gdaAttack() needs to know that the attacker is trying to guess that gender is male so that it may measure the statistical probability of gender being male. In this case, the gender equality is labeled as guess, and any other equalities in the claim are labeled as known.
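As a worked version of the gender example, the sketch below computes an improvement over the statistical baseline. The formula (C - S) / (1 - S), where C is the attacker's confidence and S is the statistical probability of the guessed value, is assumed here and is not stated in this article. The function name is hypothetical.

```python
def confidence_improvement(confidence, statistical):
    """Improvement of attack confidence over a statistical baseline.

    Assumed formula: (C - S) / (1 - S), so 0 means no better than a
    statistical guess and 1 means always correct.
    """
    if statistical >= 1.0:
        return 0.0  # nothing left to improve when the baseline is certain
    return (confidence - statistical) / (1.0 - statistical)


# Gender example: 50% of users are male, and 90% of the "male" claims are correct.
print(confidence_improvement(0.9, 0.5))
```

With these numbers the attacker closes 80% of the gap between a statistical guess and certainty. A confidence equal to the baseline gives an improvement of zero.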
Claim probability is also computed using the askClaim()/getClaim() interface. Recall from https://www.gda-score.org/what-is-a-gda-score/ that an attacker can improve confidence by only making claims where the attacker has high confidence of being correct. While this increases confidence for the attacker, it also means that the attacker learns about fewer users (lower claim probability). To measure claim probability, the askClaim()/getClaim() interface requires an indication as to whether the claim should be counted towards the confidence improvement score or not. This is done by setting claim = True or claim = False.
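The trade-off between confidence and claim probability can be sketched as follows. The accounting shown is an assumed reading of the description above, not the internal logic of gdaAttack(): claim probability is taken as the fraction of candidate claims that were actually made, and confidence is measured only over those made claims. The summarize_claims() helper is hypothetical.

```python
def summarize_claims(attempts):
    """attempts: list of (claim, correct) pairs, one per candidate claim.

    claim   -- True if the attacker chose to make the claim
    correct -- True if the claim was (or would have been) correct
    """
    made = [correct for claim, correct in attempts if claim]
    claim_probability = len(made) / len(attempts) if attempts else 0.0
    # Confidence is measured only over the claims actually made.
    confidence = sum(made) / len(made) if made else 0.0
    return claim_probability, confidence


attempts = [(True, True), (True, True), (True, False), (False, False), (False, True)]
print(summarize_claims(attempts))
```

Here the attacker makes three of five possible claims and is right on two of them, so declining the uncertain claims lowers claim probability while leaving confidence to be judged only on the claims made.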
Susceptibility
gdaAttack() does not automatically measure susceptibility. In principle it could do so by forcing the attack to attack all susceptible columns, and then measuring which columns were attacked, but currently we don’t do that.
By default all columns are labeled as fully susceptible to attack. If this is not the case, then the assignColumnSusceptibility() call in the gdaScores() class can be used to assign a different column susceptibility score.
Viewing Score Graphs
Once you’ve created a score (JSON file), you can view a graphical representation of the score by dragging and dropping the file onto https://www.gda-score.org/preview-graph/. The credentials for this service are username: gdascore, password: previewgraph.