Anonymization and Pseudonymization Explained

Article by David K.

Many companies and organizations hold personal data that they would like to analyze or process further. Data protection regulations make this difficult, because personal data must be handled with the utmost care. Personal data is any data that can identify a person. The obvious examples are first name and surname, perhaps combined with date of birth, telephone number, email address and postal address. Other details can also identify a person uniquely, for example a reference to a pediatrician in a small town that has only one pediatrician.

Only if this personal data is sufficiently anonymized and/or pseudonymized may it be forwarded to third parties and processed and evaluated by them. The legal basis for this is the General Data Protection Regulation (GDPR).

What is meant by the terms anonymize and pseudonymize?

Anonymization is the alteration of personal data in such a way that the individual details of personal or factual circumstances can no longer be attributed to an identified or identifiable natural person, or only with a disproportionate amount of time, cost and effort. (BDSG old version § 3 para. 6)

If data is completely anonymized, the General Data Protection Regulation does not apply, because the data no longer relates to a person and re-identification is almost impossible. A data set counts as anonymized only when every possible combination of its values matches two or more persons.
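The "two or more persons" condition can be checked mechanically. The following is a minimal sketch, assuming the records are plain dictionaries; the function names and the example columns (`age_range`, `zip_prefix`, `diagnosis`) are hypothetical and only serve as illustration.

```python
from collections import Counter

def smallest_group_size(records, columns):
    """Size of the smallest group of records that share the same values in `columns`."""
    groups = Counter(tuple(r[c] for c in columns) for r in records)
    return min(groups.values(), default=0)

def every_combination_shared(records, columns, k=2):
    """True if each combination of values in `columns` occurs at least k times."""
    return smallest_group_size(records, columns) >= k

# Hypothetical example data, not taken from a real data set.
records = [
    {"age_range": "30-39", "zip_prefix": "101", "diagnosis": "flu"},
    {"age_range": "30-39", "zip_prefix": "101", "diagnosis": "asthma"},
    {"age_range": "40-49", "zip_prefix": "102", "diagnosis": "flu"},
]

# The third record is the only one with ("40-49", "102"), so the check fails.
print(every_combination_shared(records, ["age_range", "zip_prefix"]))  # False
```

Grouping by fewer columns can only merge groups, so passing all columns of the data set covers every smaller combination as well.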

Pseudonymization is the replacement of the name and other identification features with an identifier for the purpose of excluding or significantly complicating the identification of the data subject. (BDSG old version § 3 para. 6a)

Anonymization:

Anonymization must guarantee two main points: the alteration of the data must be irreversible, and it must be impossible to assign the data unambiguously to a person.

In general, a distinction is made between three types of anonymization, which are described below: absolute, formal and factual anonymization.

Absolute anonymization:

This is the strongest type of anonymization: all personal details are removed so that identification is impossible. Absolutely anonymized data can be made publicly available for all data analyses, but the anonymization often distorts the data so heavily that little benefit can be drawn from it.

Formal anonymization:

This is the simplest form of anonymization. In this case, only the direct identifiers of a person are removed, such as name, telephone number and address.

Factual or relative anonymization:

Anonymization is carried out to such an extent that assigning the data to a person is almost impossible or requires a disproportionate amount of effort, while enough information remains to analyze the data with regard to non-personal content. This data may not be made generally available, but may only be used for scientific projects in accordance with the Federal Statistics Act.

Whether the data is sufficiently anonymized depends on the information contained in the data set, but also on the conditions and techniques used for anonymization. For example, it matters which additional information, such as keys, is available and whether the data is used externally or internally. On this basis it can be decided to what extent the information should or must be anonymized. The effort required for anonymization and the benefit of the data should be weighed for each analysis.

Various anonymization techniques can be used to achieve factual (de facto) anonymization. The GDPR does not specify which anonymization techniques are to be used. To avoid violating the GDPR, it is advisable to involve a data protection officer.

Some anonymization techniques are presented below:

Removal of identifiers:

This involves completely deleting data that can identify a person from a data set. Examples are name, address, date of birth, account data, social security number and photograph, but also sensitive attributes such as illnesses or a very old age.
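As a minimal sketch of this technique, assuming each record is a dictionary, the identifying fields can be dropped by name. The field names in `DIRECT_IDENTIFIERS` are hypothetical; a real data set needs its own reviewed list.

```python
# Hypothetical field names; adapt the list to the actual data set.
DIRECT_IDENTIFIERS = {
    "name", "address", "date_of_birth",
    "account_number", "social_security_number", "photo",
}

def remove_identifiers(record, identifiers=DIRECT_IDENTIFIERS):
    """Return a copy of `record` without the identifying fields."""
    return {key: value for key, value in record.items() if key not in identifiers}

patient = {
    "name": "Jane Doe",
    "address": "1 Example Street",
    "date_of_birth": "1990-01-01",
    "visits": 3,
}
print(remove_identifiers(patient))  # {'visits': 3}
```

Removing fields by name only covers identifiers that sit in their own column. Identifying details inside free text or in rare attribute values need separate treatment.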

Randomization:

Various techniques are used to disrupt the data to such an extent that the link between data and people is broken. These techniques include:

Swapping of data: the values of one person are randomly or pseudo-randomly swapped with the values of another person. Care must be taken that the swapping does not coincidentally reproduce the original data of a person.

Synthetic data generation: artificial data sets are created using the characteristics of the original data set.

Perturbation: data is replaced with artificially generated values so that the statistical characteristics of the original data set remain.
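Swapping and perturbation can be sketched in a few lines. This is an illustration under stated assumptions, not a vetted anonymization routine: the function names and the `salary` column are hypothetical, the swap moves values along one random cycle so that no record keeps its own entry, and perturbation is shown in one common form, adding zero-mean Gaussian noise.

```python
import random

def swap_column(records, column, rng):
    """Move the values of `column` along one random cycle across the records.

    With two or more records, every record receives the value of a different
    record. Two people who share the same value can still end up unchanged.
    """
    order = list(range(len(records)))
    rng.shuffle(order)
    swapped = [dict(r) for r in records]
    for pos, idx in enumerate(order):
        donor = order[(pos + 1) % len(order)]
        swapped[idx][column] = records[donor][column]
    return swapped

def perturb_column(records, column, scale, rng):
    """Add zero-mean Gaussian noise to a numeric column (one form of perturbation)."""
    return [{**r, column: r[column] + rng.gauss(0, scale)} for r in records]

# Hypothetical example data.
people = [
    {"id": 1, "salary": 1000},
    {"id": 2, "salary": 2000},
    {"id": 3, "salary": 3000},
    {"id": 4, "salary": 4000},
]
rng = random.Random(0)  # fixed seed only to make the example repeatable
swapped = swap_column(people, "salary", rng)
noisy = perturb_column(people, "salary", scale=50.0, rng=rng)
```

Swapping keeps the exact set of values in the column, so column-level statistics are unchanged. Noise addition keeps the mean only approximately.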

Aggregation:

Different approaches are available to generalize the data set. For example, numerical data, such as age, is categorized into intervals, or a woman's name is replaced with 'woman' or a job title with 'occupation'. It should be determined at the beginning to what extent a data set or individual details should be generalized.
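The interval approach for numerical data can be sketched as follows. The function name and the default interval width of ten years are assumptions for illustration; the right width depends on the data set.

```python
def age_to_interval(age, width=10):
    """Replace an exact age with the interval it falls into, e.g. 34 -> '30-39'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"

print(age_to_interval(34))           # 30-39
print(age_to_interval(34, width=5))  # 30-34
```

Wider intervals protect individuals better but leave less detail for the analysis, which is why the degree of generalization should be fixed before the data is processed.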

Pseudonymization:

Pseudonymization plays an important role alongside anonymization, but a few points need to be considered when using it. In pseudonymization, the link between a person and the specified values is not completely removed. Instead, placeholders are used that can be traced back to the person using a key. If the key is not sent along with the pseudonymized data record, the data record is anonymized from the recipient's point of view.

In general, however, pseudonymized data is still personal data, as it can be assigned to a person using a key.

In order to make this data available for analysis, special care must therefore be taken to ensure that the necessary key is stored securely and is not lost.
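A minimal sketch of this idea, assuming the "key" is a lookup table held by the data owner: each identity is replaced with a random placeholder, and only the table allows the way back. The class name `Pseudonymizer` and the `P-` prefix are hypothetical choices for this example.

```python
import secrets

class Pseudonymizer:
    """Replaces identities with random placeholders and keeps the lookup table.

    The two dictionaries together are the key. They must be stored securely
    and must not be handed over with the pseudonymized data.
    """

    def __init__(self):
        self._to_pseudonym = {}
        self._to_identity = {}

    def pseudonymize(self, identity):
        """Return the placeholder for `identity`, creating one on first use."""
        if identity not in self._to_pseudonym:
            token = "P-" + secrets.token_hex(8)
            self._to_pseudonym[identity] = token
            self._to_identity[token] = identity
        return self._to_pseudonym[identity]

    def reidentify(self, pseudonym):
        """Trace a placeholder back to the person. Requires access to the key."""
        return self._to_identity[pseudonym]

pseudonymizer = Pseudonymizer()
token = pseudonymizer.pseudonymize("Jane Doe")
shared_record = {"person": token, "diagnosis": "flu"}  # what the recipient sees
```

Because the same person always receives the same placeholder, records can still be linked for analysis without revealing who they belong to.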

If pseudonymization is used instead of anonymization, it is important to take a close look at the Data Protection Act or involve a data protection officer to ensure that no violations occur.

A combination of anonymization and pseudonymization can also be applied to a data set so that only the data that absolutely cannot be anonymized is replaced by pseudonyms. This increases the certainty that no breaches of data protection law will occur.

The following infographic shows the differences between anonymization and pseudonymization.

Artificial intelligence:

Advances in the field of artificial intelligence also enable new approaches to anonymized data analysis. Several possibilities can be considered, two of which are described below:

Create synthetic values or datasets:

This approach is used to create artificial data that has similar statistical properties to the original dataset, for example. This can be used to create a dataset whose data is anonymized but has sufficient data for analysis.
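As a deliberately simple, non-AI sketch of the principle, the mean and standard deviation of each numeric column can be estimated and new values drawn from a normal distribution with those parameters. The function names and the example columns are hypothetical. Sampling each column independently is an assumption of this sketch that discards correlations between columns, which learned generative models try to preserve.

```python
import random
import statistics

def fit_column(values):
    """Estimate mean and standard deviation of one numeric column."""
    return statistics.mean(values), statistics.stdev(values)

def synthesize(records, numeric_columns, n, rng):
    """Draw n artificial records, sampling each column independently."""
    params = {c: fit_column([r[c] for r in records]) for c in numeric_columns}
    return [
        {c: rng.gauss(mu, sigma) for c, (mu, sigma) in params.items()}
        for _ in range(n)
    ]

# Hypothetical example data.
original = [
    {"age": 34, "income": 42000},
    {"age": 51, "income": 58000},
    {"age": 29, "income": 39000},
    {"age": 45, "income": 61000},
]
synthetic = synthesize(original, ["age", "income"], n=5, rng=random.Random(42))
```

Synthetic records are not automatically anonymous: a generator fitted too closely to a small data set can reproduce values of real individuals, so the result should still be checked.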

Federated learning:

The idea here is that data sets are not copied to a central server to carry out the analysis, but that the training takes place on each individual user's computer. The models created in the process are then collected on a central server and aggregated into a model. The original data therefore remains on the user's computer and never comes into the hands of the analyst. The big advantage here is that the amount of data does not have to be reduced.
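The training loop described above can be sketched with a toy model. This is a minimal illustration, assuming a one-parameter linear model y = w * x and weighted averaging of the client models on the server; all names and the example data are hypothetical, and real systems add secure communication and further safeguards.

```python
def local_update(weight, data, lr=0.01, epochs=5):
    """Train on one client's private data with a few gradient descent steps."""
    for _ in range(epochs):
        grad = sum(2 * (weight * x - y) * x for x, y in data) / len(data)
        weight -= lr * grad
    return weight

def federated_round(global_weight, clients):
    """Collect the locally trained weights and average them by data set size."""
    updates = [(local_update(global_weight, data), len(data)) for data in clients]
    total = sum(n for _, n in updates)
    return sum(w * n for w, n in updates) / total

# Each inner list stands for data that never leaves one user's computer.
clients = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0), (4.0, 8.0), (5.0, 10.0)],
]

weight = 0.0
for _ in range(50):
    weight = federated_round(weight, clients)
print(round(weight, 3))  # 2.0, since every example follows y = 2 * x
```

Only the model weights reach the server in this sketch. The raw `(x, y)` pairs stay inside `local_update`, which mirrors the separation between user and analyst described above.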

Furthermore, artificial intelligence can be used to formally anonymize or pseudonymize data records automatically. In texts, the direct identifiers of persons, such as name, address and birthday, are found and deleted. In photos, license plates, faces and similar features are recognized and obscured with noise. In audio, names, addresses and similar details are recognized and masked with noise.
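For text, the find-and-delete step can be illustrated with a rule-based stand-in. The sketch below uses regular expressions instead of a trained model, which is an assumption made to keep the example self-contained. The patterns are simplified and hypothetical, and they only cover email addresses and phone numbers. Names and postal addresses typically require a trained recognition model.

```python
import re

# Simplified, hypothetical patterns; real identifiers vary far more than this.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\+?\d[\d\s()/-]{6,}\d")

def redact(text):
    """Replace email addresses and phone numbers with neutral placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

print(redact("Contact jane.doe@example.com or +49 30 1234567."))
# Contact [EMAIL] or [PHONE].
```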

Applications of anonymization:

All data that identifies a person must be anonymized before data analysis is carried out or the data is passed on for another purpose. This does not only apply to personal data in texts, but also to data in photos or audio files that identify a person.

For example, surveys recorded on audio must be anonymized before analysis, for instance by masking personal information with noise. If photos showing people or cars are forwarded or edited, faces and license plates, among other things, must be made unrecognizable.

Medical patient data, which contains a lot of personal data, is particularly sensitive. It must be examined in detail which data has to be anonymized or pseudonymized, and is therefore not available for analysis in its original form, so that meaningful information can still be obtained from the data.

Conclusion and Recommendations

Ensuring the privacy and security of personal data through anonymization and pseudonymization is vital for compliance with the General Data Protection Regulation (GDPR). As we've discussed, understanding the nuances between different techniques and their applications is essential for effective data management.

At deepsight, we offer comprehensive solutions to help businesses navigate these complexities and maintain data integrity. Our expert team is here to support you in implementing the best practices for data protection, leveraging advanced AI technologies for superior results.

Interested in learning more? Reach out to us at deepsight and discover how we can help safeguard your data while unlocking its full potential.
