Tutorial: Privacy preservation with synthetic data

Image credits: Unsplash

Given an original dataset, it is possible to generate a synthetic dataset that preserves the original statistical properties and can serve a variety of use cases. If the purpose is to augment or improve data quality, the only thing that matters is the utility, or fidelity, of the synthetic data with respect to the original data.

But if the goal is to facilitate privacy-sensitive data sharing, we need to consider all the relevant aspects around data protection. We would like to have synthetic data that is statistically very similar but at the same time, individual record-wise, very different from the original data.

Wait, what?

Yes, you read that right: synthetic data could actually solve the massive utility vs privacy conundrum.

Synthetic data, if generated correctly, protects against privacy risks. To prove this, we need metrics that help us to quantify and manage these risks.

This article revolves around the analysis of privacy risks with synthetic data from an engineering point of view. We'll get our hands dirty and go straight to the methods that we can implement. We will try to understand what it means to protect privacy with synthetic data and go into the details of some metrics useful to assess risks.

When does it make sense to talk about data privacy?

According to the GDPR and related regulations, privacy measures are directed at personal data, i.e., information that pertains to living individuals, also known as natural persons. This may sound trivial, but it is good to clarify: if the original dataset does not contain direct or indirect personal data about individuals, there is no privacy risk; otherwise, data privacy regulations apply.

There are several technical, organisational and policy measures to be adopted to avoid disclosure of personal data. In this post we discuss the technical measures in terms of anonymization or pseudonymization techniques, beyond the obvious cyber security measures that one needs to consider for privacy preservation.

Of course, data anonymization techniques have been available for decades, long before the rise of synthetic data generation. The risks we describe here apply to more traditional protection techniques as well. With those techniques there’s always a huge trade-off between privacy and data utility.

Synthetic data generation can better deal with this trade-off: synthetic data, if generated properly, is private, realistic and retains the original statistical properties. Assessing the privacy risks can help make the synthetic generation solution even more complete.

The starting scenario of our analysis

This is the starting scenario on which we will base our analysis: we have an original tabular dataset O containing sensitive data about individuals. We train a generative model that gives us a synthetic dataset S with a certain degree of guarantees in terms of de-identification. A hypothetical attacker (or adversary) A tries to derive some sensitive information from S. To measure the privacy risks, we need practical metrics that we can calculate.

Image credits: Clearbox AI

As we move forward in the analysis we will modify these assumptions slightly, but it is good to have a common reference framework right from the start.

How do we identify and manage privacy risks?

With good synthetic data generation, where new data records are generated from a distribution learnt by a (machine learning) model, there is by construction no unique mapping between the synthetic records and the original ones. This is true in general, but we should not underestimate the risks. In some cases, for example, the generative model may overfit on the original data and produce synthetic instances that are too close to the real ones. And even when the synthetic data seems anonymous, sophisticated attacks could still lead to re-identification of individuals in the original dataset.

Identity disclosure

Let’s say s is a record in our private synthetic dataset S. If an attacker is able, with or without any background knowledge, to assign a real identity to that record, we call this an identity disclosure.

Only correct identity assignments matter. If the attacker assigns a wrong identity to a synthetic record or assigns an identity of a real person who was not present in the original training dataset, there is no identity disclosure.

When assessing the risk and relevance of an identity disclosure, we must take into account the attacker's background knowledge and the resulting information gain. Did the attacker learn something new about the identified subject? If the attacker is able to assign a real identity to a synthetic record but already knew all the attributes of that individual, nothing new is learnt from the assignment: the information gain is zero. We should be particularly concerned only with identity disclosures where the information gain is greater than zero.

Good synthetic data generation protects against these meaningful identity disclosures.

Inferential disclosure

Direct identity disclosure is highly unlikely by construction with fully synthetic data, but an attacker could use data analysis to derive information about a particular group of individuals without assigning an identity to any specific synthetic record. Let's say the original dataset contains sensitive medical information and the generated synthetic dataset preserves the original statistical properties. From the synthetic dataset the attacker could derive that a certain group of people with similar characteristics has a certain risk of contracting a disease; this could be done with basic statistical techniques or, in a more sophisticated way, with a machine learning model. At that point, if the attacker knows another individual (who may not even be part of the original dataset) with the same characteristics as that group, they can conclude that this person has the same risk of contracting the disease. The adversary learnt something new about an individual without any actual identity disclosure. This is called inferential disclosure.

This is not a disclosure that synthetic data can protect us from, and for this very reason we should be aware of it and quantify it in order to make decisions based on the proportionality principle. Let me explain in more detail.

Deriving inferences from data is the essence of statistics and data analysis. A dataset is useful only if it is possible to identify some relationship or pattern within it and derive additional information; otherwise it's just noise. Synthetic data with high utility will retain the original relationships in the data. Therefore, no matter the privacy level of the synthetic data, if it preserves the statistical properties of the original data, you should be aware that someone can potentially learn something new about (a group of) individuals from it. This can be harmful to those individuals, but the same holds for any kind of information derived from data. Data synthesis cannot help here; these concerns have to be handled in other ways.

Membership Inference Attacks

Membership inference attacks (MIAs) try to infer whether an individual was part of the original dataset from which the synthetic dataset was generated. For example, if the original dataset used to train the generative model consists of records of people with cancer (the cancer status is not a column of the dataset, but the data is shared as a clinical study of cancer patients), being able to infer from the synthetic data that an individual was included in the training set reveals that that individual has cancer.

As in the case of inferential disclosure, with MIAs an adversary could learn something new about an individual without an actual identity disclosure, but the focus here is the membership of a known individual (background knowledge) in the original training set.

This is not an exhaustive list of all possible risks related to privacy preservation with synthetic data, but it should give a clear idea of the issue. We like to be thorough about all the possible risks, but we need to be aware that, when it comes to data protection, risk and remedy are based on the proportionality of risk and impact. In any case, synthetic data is a great contender as a safeguard mechanism in managing risks.

How to quantify privacy risks?

Now that we have introduced the data privacy problem and defined some specific risks to take into account, let's delve into the technical details and present some methodologies to actually quantify the risks.

Let’s go back to the previous scenario. Starting from an original dataset O we generate a synthetic dataset S with a certain degree of protection in terms of privacy. We want to prevent re-identification of any kind. We therefore want to prevent a possible attacker A, having S at their disposal, from being able to derive sensitive information about real individuals in O. So, first of all, we have to measure how different, or distant, synthetic individuals are from real individuals. The more distant they are, the more difficult re-identification is. We need a reasonable measure of distance between each pair of records (s, o).

Distance between individuals

We are talking about personal data in tabular form that may contain qualitative (categorical features) and quantitative (ordinal features) information. Here is an extract from the popular UCI Adult dataset that well represents this type of data:


Image credits: Clearbox AI


How can we measure the distance between two rows in this dataset? We need a similarity coefficient that combines the differences between both types of features, and we want the distance to always be between 0 (identical individuals) and 1 (individuals at the maximum distance). For this purpose, the Gower distance is a great solution. Given two individuals i and j, each consisting of p features (the columns of the dataset), we can define their Gower distance as

d_Gower(i, j) = [ Σ_k w_ijk · d_ijk ] / [ Σ_k w_ijk ]

where d_ijk is the distance between individuals i and j on the k-th feature and w_ijk is an (optional) weight assigned to the k-th feature. To make things simpler we can set w_ijk = 1 for all features, i.e. each feature contributes equally to the overall distance, and thus obtain the following simplified formula:

d_Gower(i, j) = (1/p) · Σ_k d_ijk

The distance d_ijk is defined differently for categorical and ordinal features:

  • For categorical features: d_ijk = 0 if the two values are equal, 1 otherwise;

  • For ordinal features: d_ijk = |x_ik − x_jk| / R_k,

where x_ik and x_jk are the values of the k-th feature for individuals i and j, and R_k is the range of the k-th feature.
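To make this concrete, here is a minimal Python sketch of the simplified Gower distance (all weights equal to 1). It uses pandas; the column names, the toy records and the helper name gower_distance are illustrative assumptions rather than part of any specific library, and in practice the ranges R_k should be computed on the full original dataset.

```python
import pandas as pd

def gower_distance(a, b, categorical_cols, numerical_cols, ranges):
    """Simplified Gower distance between two records a and b (all weights = 1).

    `ranges` maps each numerical column to its range R_k, computed on the
    full dataset, so that every per-feature distance stays in [0, 1].
    """
    distances = []
    for col in categorical_cols:
        # Categorical features: 0 if the values are equal, 1 otherwise
        distances.append(0.0 if a[col] == b[col] else 1.0)
    for col in numerical_cols:
        # Numerical/ordinal features: absolute difference scaled by the range
        r = ranges[col]
        distances.append(abs(a[col] - b[col]) / r if r > 0 else 0.0)
    # Average contribution of the p features
    return sum(distances) / len(distances)

# Toy example with two records loosely inspired by the UCI Adult dataset
df = pd.DataFrame({
    "age": [39, 50],
    "hours-per-week": [40, 13],
    "workclass": ["State-gov", "Self-emp-not-inc"],
    "education": ["Bachelors", "Bachelors"],
})
cat_cols = ["workclass", "education"]
num_cols = ["age", "hours-per-week"]
ranges = {c: df[c].max() - df[c].min() for c in num_cols}
print(gower_distance(df.iloc[0], df.iloc[1], cat_cols, num_cols, ranges))  # 0.75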

Now that we have a way to calculate the distance between two individuals, we can use it to construct our first privacy metric.

Distance to Closest Record

We aim to verify that the synthetic individuals in S are not simple copies, or the result of a simple perturbation (addition of noise), of the real individuals in O. We can define the Distance to Closest Record for a given individual s in S as the minimum distance between s and any original individual o in O:

DCR(s) = min d_Gower(s, o) over all o ∈ O

DCR(s) = 0 means that s is an identical copy (clone) of at least one real individual in the original dataset O. We calculate the DCR for each synthetic record s in S and plot the resulting values as a histogram to observe their distribution. From such a chart we can quickly obtain a first insight into the overall privacy level of our synthetic dataset.
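As an illustration, here is a minimal sketch of how the DCR values could be computed and plotted, reusing the gower_distance function and the column lists from the previous sketch. O and S are assumed to be pandas DataFrames holding the original and synthetic datasets; the brute-force double loop is only meant for small datasets.

```python
import numpy as np
import matplotlib.pyplot as plt

def dcr(records, reference, categorical_cols, numerical_cols, ranges):
    """DCR of every row in `records` with respect to the `reference` dataset."""
    values = []
    for _, s in records.iterrows():
        # Minimum Gower distance between s and every record in the reference set
        values.append(min(gower_distance(s, o, categorical_cols, numerical_cols, ranges)
                          for _, o in reference.iterrows()))
    return np.array(values)

dcr_values = dcr(S, O, cat_cols, num_cols, ranges)

plt.hist(dcr_values, bins=50)
plt.xlabel("DCR")
plt.ylabel("Number of synthetic records")
plt.title("Distance to Closest Record: synthetic vs original")
plt.show()

# A spike at 0 means some synthetic records are exact copies of real individuals
print("Share of exact clones:", (dcr_values == 0).mean())
```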

Image credits: Clearbox AI

See the initial spike at 0? It means that many synthetic instances match at least one real individual exactly, i.e. many real individuals are copied identically into the synthetic dataset. Not exactly what we would expect from a synthetic dataset that is supposed to prevent re-identification of the original individuals.

If we get such a histogram, it is wise to stop and examine the situation more closely. The risk of re-identification could be high, and the synthetic data generation process may need to be reconsidered; the generative model may have overfitted on the original dataset, for example.

However, this situation does not always indicate a high risk of re-identification: in some datasets the cardinality “covered” by the features might be so low that a sufficiently diverse generation is simply not possible. Let me explain this with an extreme example: if the original dataset has exactly 4 categorical features/columns and each has only two possible values, there are exactly 2^4 = 16 possible distinct individuals; consequently, any sufficiently large synthetic dataset will contain instances identical to the original ones.

On the other hand, here’s what a good (in terms of privacy risk) DCR histogram looks like:

Image credits: Clearbox AI

The distribution of values is sufficiently far from 0 and, notably, no synthetic record has a DCR exactly equal to 0, i.e. no real individual is identically reproduced in the synthetic dataset. First test passed; we can carry on with our analysis.

Synthetic vs Holdout

Let’s slightly modify the initial scenario and assume that we have split the original dataset in two sets: one part, the training set T, is used for synthetic generation while the second part, the holdout set H, is kept aside without being used by the generative model. If you have developed machine learning models in the past, this split should be familiar to you.

Image credits: Clearbox AI

This way, in addition to calculating the DCR between the generated synthetic dataset and the training set as we did before, we can also calculate the DCR between the synthetic dataset and the holdout dataset. Ideally, we would like to check that the synthetic individuals are not systematically closer to those in the training set than to those in the holdout set. We can again represent these distributions with a histogram, as shown in the sketch below.
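Here is a minimal sketch of this comparison, reusing the dcr helper and the matplotlib import from the previous sketch; S, T and H are assumed to be pandas DataFrames holding the synthetic, training and holdout sets.

```python
# DCR of the synthetic records against the training set and against the holdout set
dcr_train = dcr(S, T, cat_cols, num_cols, ranges)
dcr_holdout = dcr(S, H, cat_cols, num_cols, ranges)

plt.hist(dcr_train, bins=50, alpha=0.5, label="DCR synthetic vs training")
plt.hist(dcr_holdout, bins=50, alpha=0.5, label="DCR synthetic vs holdout")
plt.xlabel("DCR")
plt.ylabel("Number of synthetic records")
plt.legend()
plt.show()
```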

Image credits: Clearbox AI

In this example, we immediately notice that the synthetic dataset is much closer to the training set than to the holdout set. This is not good in terms of privacy: it means that we were not able to generate synthetic instances sufficiently different from the training ones, and this results in a high risk of re-identification.

Image credits: Clearbox AI

Here instead is a satisfactory histogram. The distribution of distances between synthetic and training instances is very similar to (or at least not systematically smaller than) the distribution of distances between synthetic and holdout instances. With T, S and H available, it is virtually impossible to tell which of T and H was actually used to train the generative model. This is a good sign in terms of privacy risk.

We can summarise this metric into a single value that tells us our level of privacy at a glance. Having DCR values for all synthetic instances against both the training set and the holdout set, we can calculate the percentage of synthetic instances that are closer to a training set instance than to a holdout set instance. In the best cases we obtain a percentage close to (or below) 50%: this means that the synthetic dataset does not provide any information that could lead an attacker to assume whether a certain individual was actually present in the training set.
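A minimal sketch of this summary value, computed from the dcr_train and dcr_holdout arrays of the previous sketch (ties are counted as "not closer to training" here; handling them differently is a matter of convention):

```python
# Share of synthetic records strictly closer to the training set than to the holdout set;
# values around (or below) 50% suggest the synthetic data does not reveal membership
share_closer_to_train = (dcr_train < dcr_holdout).mean()
print(f"Synthetic records closer to training than to holdout: {share_closer_to_train:.1%}")
```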

Focus: Membership Inference Attack

We are now going to focus on a specific risk, the Membership Inference Attack, and illustrate a method for quantifying it and protecting against it.

Let's go back to the scenario we previously defined: starting from an original dataset O we generate a synthetic dataset S with a certain degree of protection in terms of privacy. We want to prevent re-identification of any kind. We therefore want to stop a possible attacker A, having S at their disposal, from being able to derive sensitive information about real individuals in O.

We define a Membership Inference Attack as an attempt by a hypothetical attacker A to derive the membership of a specific individual in the original dataset O from which the synthetic dataset S was generated. In this case, the attacker does not need to identify the row in S that refers to a real individual; it is enough to verify that a specific individual was contained in the original dataset. This is a considerable risk that should not be underestimated. Let's assume that the original dataset contains sensitive data on the customers of a bank, for instance all customers with a large debt to the bank. If the attacker is able to deduce that a particular individual belongs to the original dataset, he may conclude that that person is a debtor of the bank. We must try to avoid this risk as much as possible.

Let us simulate this type of attack by assuming the scenario we have already illustrated. We have split the original dataset into two sets: one part, the training set T, is used for synthetic generation while the second part, the holdout set H, is kept aside without being used by the generative model.

Image credits: Clearbox AI

Let’s assume that attacker A has access to, in addition to the synthetic dataset S, an additional dataset K. This dataset K is a subset of the original dataset O: half of its instances belong to the training dataset T and half belong to the holdout dataset H, so half of the instances in K were actually used for the generation of S, while the other half were not. It is not important to know how the attacker got hold of K: he may have obtained it through another attack, or maybe the data in K are public.

The attacker has at his disposal a subset of individuals that may or may not belong to the training dataset on which the generative model for S was trained. The question we ask is: having K and S available, how difficult is it for attacker A to discover for each k in K whether k belongs to the training dataset T?

The attack consists of the following steps for each individual k in K:

  1. The attacker identifies the individual s in S “closest” to k by means of the Distance to Closest Record (DCR) function we illustrated earlier;
  2. The attacker establishes that k belongs to the training dataset T if the DCR value found in the previous step is below a certain fixed threshold.

Following this procedure, the adversary will have determined which and how many k individuals belong to the training dataset used for synthetic generation. To quantify this risk, let’s assess the success rate of this strategy.

Let’s keep in mind that in our scenario we know exactly, for each k in K, whether it was extracted from T (so it was used for the generation) or from H (so it did not take part in the training of the generative model). For each k in K we can then tell whether the adversary has correctly established its membership in T, which gives us true positives, true negatives, false positives and false negatives. We can proceed to calculate the precision of the attack strategy: the fraction of instances the attacker claims as members of T that actually belong to T.

Since 50% of the instances in K come from the training set and the other 50% come from the holdout set, we ideally aim for a precision of 0.5 or less. A precision of 0.5 means that the attacker is essentially making a random choice when determining whether or not a known instance k belongs to the training dataset: despite the availability of the synthetic dataset S, he is not able to determine the membership of a known instance k in T with a higher probability than random guessing. As the precision increases, the attacker's ability to identify membership in the training dataset increases, and with it the level of disclosure risk.
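Putting the pieces together, here is a minimal sketch of the simulated attack, reusing the dcr helper from the earlier sketches. T, H and S are assumed pandas DataFrames, the sample size and the DCR threshold are hypothetical values (in practice one would sweep the threshold over a range), and the ground-truth membership labels are only available because we built K ourselves.

```python
import numpy as np
import pandas as pd

# Attacker's dataset K: half sampled from the training set, half from the holdout set
n = 500  # hypothetical number of known individuals per half
K = pd.concat([T.sample(n, random_state=0), H.sample(n, random_state=0)], ignore_index=True)
is_member = np.array([True] * n + [False] * n)  # ground truth, known only to us

# Step 1: for each k in K, distance to the closest synthetic record in S
dcr_k = dcr(K, S, cat_cols, num_cols, ranges)

# Step 2: the attacker claims membership in T when the DCR falls below a threshold
threshold = 0.05  # hypothetical value
claimed_member = dcr_k < threshold

# Precision of the attack: share of claimed members that really belong to T
true_positives = (claimed_member & is_member).sum()
precision = true_positives / claimed_member.sum() if claimed_member.sum() > 0 else 0.0
print(f"Attack precision: {precision:.2f}  (around 0.5 means no better than random guessing)")
```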

Image credits: Clearbox AI

Proportionality of risk and impact

This concludes our analysis of privacy risks with synthetic data from an engineering point of view. We explained what it means to protect privacy with synthetic data and went into the details of some metrics useful for assessing risks.

You should not consider this an exhaustive list of all possible risks related to data privacy, but it should give a clear idea of the issue. Keep in mind that risk and remedy, when it comes to data protection, are based on the proportionality of risk and impact. Synthetic data is a great contender as a safeguard mechanism in managing risks.






