D = For data
#AutomationABC


Let’s face it: we need data for everything we do as testers. And if this data holds value to us, we consider it information. In a time of extensive data collection and increasing concern for privacy, data is a hot topic.

As automation engineers we work with data on a daily basis. We:

- Generate test data

- Anonymize large datasets

- Ensure data is protected

Types of data

If you asked me to explain the difference between synthetic, masked and anonymous test data, I probably couldn’t give you a straight answer right away. So examining this is of great value to me as well.

Let’s see what we can find:

Personal data[1]

Data by itself is not inherently personal. But when it directly or indirectly refers or relates to an individual, it becomes personal data. So a date of birth on its own is not personal data, but when it is combined with other fields, such as “07-07-1977” together with the name “John Doe”, it becomes personal. So what about a common name, like John Smith in the US or Piet de Vries in the Netherlands? Is this considered personal data? It is if additional context narrows it down. For example, Piet de Vries living at Herengracht 44 is personal data.

Anonymous data[2]

So what is anonymous data? According to the European Union’s data protection laws, in particular the General Data Protection Regulation (GDPR), anonymous data is “information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable”.

Let’s break this down: “information which does not relate to an identified or identifiable natural person”. In this context:

Identified means someone who can be directly identified from the data.

Identifiable means someone who can be recognized indirectly from the available data.

Identified:

- Name

- Date of birth

- Social security number

Identifiable:

- IP address

- MAC address

- Geolocation
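The two categories above can be sketched as a simple masking step. This is a minimal illustration, not a complete anonymization routine: the field names, the `REDACTED` placeholder and the `mask_identifiers` helper are all assumptions made up for this example.

```python
# Hypothetical field names, chosen to mirror the two lists above.
DIRECT_IDENTIFIERS = {"name", "date_of_birth", "social_security_number"}
INDIRECT_IDENTIFIERS = {"ip_address", "mac_address", "geolocation"}


def mask_identifiers(record, placeholder="REDACTED"):
    """Return a copy of the record with identified and identifiable fields masked."""
    sensitive = DIRECT_IDENTIFIERS | INDIRECT_IDENTIFIERS
    return {
        field: (placeholder if field in sensitive else value)
        for field, value in record.items()
    }


customer = {
    "name": "John Doe",
    "date_of_birth": "07-07-1977",
    "ip_address": "192.0.2.10",
    "order_total": 42,
}
masked = mask_identifiers(customer)
print(masked)
```

Masking only the first set would leave the IP address in place, which is exactly the "identifiable" data that can still point to a person.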

Now for the second part: “personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable”. So when anonymizing test data, it’s important that the subject is no longer identifiable. If you anonymize the directly identifying data but forget the identifiable data, it’s still possible to identify someone. Also don’t forget the other ways data can be used to draw conclusions: “You work at the traffic fine department as a tester. All personal data has been anonymized and street addresses are static. However, the number of traffic fines is not altered.” In such a situation you can still work out that your next-door neighbor received 5 traffic fines in 3 months, even though he is known in the test data by a different name and probably also drives a different type of car.
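One way to spot this kind of leak is to count how often each combination of the remaining, unaltered values occurs. A combination that occurs only once singles out one record. The sketch below assumes made-up records and hypothetical field names (`street`, `fines_last_3_months`); it is a rough check in the spirit of k-anonymity, not a full privacy assessment.

```python
from collections import Counter


def unique_combinations(records, fields):
    """Return the value combinations of `fields` that occur exactly once."""
    counts = Counter(tuple(record[field] for field in fields) for record in records)
    return [combo for combo, count in counts.items() if count == 1]


# Made-up "anonymized" test data: names replaced, fine counts left untouched.
anonymized_fines = [
    {"name": "Person A", "street": "Street 1", "fines_last_3_months": 0},
    {"name": "Person B", "street": "Street 1", "fines_last_3_months": 1},
    {"name": "Person C", "street": "Street 1", "fines_last_3_months": 1},
    {"name": "Person D", "street": "Street 1", "fines_last_3_months": 5},
    {"name": "Person E", "street": "Street 1", "fines_last_3_months": 0},
]

risky = unique_combinations(anonymized_fines, ["street", "fines_last_3_months"])
print(risky)  # [('Street 1', 5)]
```

The record with 5 fines is the only one of its kind, so anyone who knows about the neighbor's fines can pick it out despite the changed name.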

Synthetic data

So next up: what is synthetic data? Again, the European Data Protection Supervisor helped us with wisdom[3]. In my opinion it gives a very clear definition: “Synthetic data is artificial data that is generated from original data and a model that is trained to reproduce the characteristics and structure of the original data.” The data is generated by using existing data. So does this mean synthetic data guarantees anonymity? Well…[4] According to this very clear article written by Marina Anagnostaki, it really depends on the feasibility and possibility of re-identification.
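To make the definition a bit more concrete, here is a deliberately simple sketch. The "model" is nothing more than the value frequencies per column of the original data, and new rows are sampled from those frequencies. Real synthetic data tools use far richer models; the function names and sample records here are assumptions for illustration only.

```python
import random
from collections import Counter


def fit_marginals(records):
    """'Train' a toy model: count how often each value occurs per column."""
    columns = records[0].keys()
    return {column: Counter(record[column] for record in records) for column in columns}


def sample_synthetic(model, n, seed=0):
    """Generate n artificial rows by sampling every column independently."""
    rng = random.Random(seed)  # seeded so test runs are reproducible
    rows = []
    for _ in range(n):
        row = {}
        for column, counts in model.items():
            values = list(counts.keys())
            weights = list(counts.values())
            row[column] = rng.choices(values, weights=weights, k=1)[0]
        rows.append(row)
    return rows


# Made-up original records.
original = [
    {"city": "Amsterdam", "car_type": "hatchback", "fines": 0},
    {"city": "Amsterdam", "car_type": "suv", "fines": 1},
    {"city": "Utrecht", "car_type": "hatchback", "fines": 0},
    {"city": "Utrecht", "car_type": "sedan", "fines": 5},
]

model = fit_marginals(original)
synthetic = sample_synthetic(model, n=10)
print(synthetic[:3])
```

Note that even this toy generator can produce a row that happens to match a real one by chance, which illustrates why synthetic data does not automatically guarantee anonymity.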

Pseudonymous data[5]

So what is pseudonymous data? And how does it differ from synthetic data? The main difference is that pseudonymous data still consists of real records in which the identifiers have been replaced, so it carries a higher risk of re-identification. So for testing, synthetic data would be the weapon of choice.
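A small sketch shows why pseudonymous data remains linkable. Here a keyed hash (HMAC-SHA256 from the Python standard library) replaces a name with a stable pseudonym. The `pseudonymize` helper, the `user_` prefix and the key are assumptions for this example, and keyed hashing is only one of several possible pseudonymization techniques.

```python
import hashlib
import hmac


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Replace a value with a stable pseudonym derived from a secret key."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return "user_" + digest[:12]


key = b"example-key-keep-this-out-of-test-environments"  # hypothetical key

first = pseudonymize("Piet de Vries", key)
second = pseudonymize("Piet de Vries", key)
other = pseudonymize("John Smith", key)

print(first == second)  # True: the same person always gets the same pseudonym
print(first == other)   # False: different people get different pseudonyms
```

Because the mapping is stable, all records of one person stay connected, and anyone holding the key can recompute the pseudonym for a known name. The rest of the record is still real data, which is where the re-identification risk comes from.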

Use cases where test data matters most

Large amounts of test data are usually required for the more intensive tests, such as E2E or performance tests, and for testing database migrations.

Conclusion

For testing, synthetic data is your friend. As a tester you work risk-based, and working with sensitive data introduces additional risks. To get data that matches the requirements of your test scenarios, it’s best to discuss this with a (senior) data engineer who can synthesize the data for you. In addition, there are several tools available, such as DATPROF, that offer out-of-the-box solutions.




[1] https://geo-data-support.sites.uu.nl/personal-data/personal-vs-anonymous-data/

[2] https://www.edps.europa.eu/system/files/2021-04/21-04-27_aepd-edps_anonymisation_en_5.pdf

[3] https://www.edps.europa.eu/press-publications/publications/techsonar/synthetic-data_en

[4] https://www.datenschutz-notizen.de/synthetic-data-anonymized-data-or-pseudonymized-data-3541386/

[5] https://www.dataprotection.ie/en/dpc-guidance/anonymisation-pseudonymisation
