GDPR - Data anonymisation

I had an opportunity to work on a GDPR (General Data Protection Regulation) live data compliance project and would like to share my knowledge and experience from working on it.

The key objective of this project is to anonymise the PII data available in non-production environments.

This is not a one-time change, as a regular refresh of production data into non-production environments is required for the various SDLC processes of the financial applications. The data stores are held in Oracle Database, Netezza and MongoDB.

As my first compliance-based project, it was a new experience: understanding the compliance requirements, analysing anonymisation techniques and taking a holistic perspective when implementing the solution. The company was already using Talend and selected it as the implementation tool.

As a Data Engineer, the challenges I faced were: understanding the applications that hold or process sensitive data; understanding the entity relationships and database objects such as constraints, indexes and triggers; implementing anonymisation techniques using Talend; designing jobs that meet non-functional requirements across different data volumes; analysing the functional relationships between entities; rolling out the solution across different database versions; and automating the execution process.

Before diving into how the requirement was implemented, here is an overview of GDPR:

Before GDPR, companies followed the Data Protection Act 1998, which provided guidance on how personal data could be used. But as technology and business evolved towards a data-driven culture, the Act was no longer sufficient and left room for organisations to use personal data for their own analytical purposes.

The Cambridge Analytica data scandal is one example. Hence there arose a need to modernise the Data Protection Act 1998 with the consumer as the core interest, which led to the GDPR and the Data Protection Act 2018. Under this legislation, companies cannot use personal data without the consent of the individuals concerned.

As compliance is mandatory, organisations started implementing the act through principles, policies and procedures. Since protecting individuals' data is the prime motive, any sensitive data a company holds has to be encrypted in production (at rest and in motion), while in non-production environments the data must be rendered insensitive.

What is meant by sensitive data?

Any information that helps to identify a natural person is called sensitive data. This sensitive information is classified as personally identifiable information (PII), financially identifiable information (FII) and classified information (SPII):

- Personally identifiable information - name, address, date of birth, SSN, biometrics, signature.

- Financially identifiable information - bank accounts, card details, credit information.

- Classified information - health, religion, race/ethnic origin.

How to transform sensitive data into insensitive data:

Two different approaches to protect data:

- Anonymisation - transformation of data such that the transformed data can no longer identify a particular person, even when combined with additional knowledge.

- Pseudonymisation - partial transformation of data, where the person can still be identified with the help of additional knowledge.

Choice of tools: Though there are several tools available in the market, each has its own pluses and minuses. With respect to Talend, the positives are:

- Written in Java, and provides components for writing customised code.

- Supports both relational and non-relational databases, as well as ETL and ELT processing.

- Provides some standard anonymisation techniques and was already in the organisation's technology stack.

- Easy to maintain, with good overall performance.

Cons of Talend:

- It is not a complete anonymisation tool. It does not support all anonymisation techniques, and a few customised ones have to be developed.

- Post-anonymisation, the output can be inconsistent.

- Mapping and transforming relational data with dynamic schemas to JSON or XML format is not supported.

About Talend: A drag-and-drop tool written in Java that uses Maven for builds and supports Git. It has two editions - standard and enterprise. Data Quality components such as tDataMasking and tDataShuffling, used for anonymisation, are only available in the enterprise edition. The tool has a code editor that exposes the generated Java code, which helps in analysing the code behind a job (application), and it also provides debug options.

Community support is available and good. A job is exported as an executable JAR and can be run on any operating system with Java 8 or later.

Anonymisation techniques: The main anonymisation techniques used in this project were:

- Masking - modify the data partially with a constant value like '*' or 'x'. Example: masking part of an account number with 'xxxx'. This technique is used where partially exposing data is acceptable.

- Shuffling or swapping - re-arranging the data as a single group, or distributing the data into several groups and re-arranging within each group. Example: forenames can be divided into male and female groups and re-arranged within each group.

This technique maintains the cardinality and attribute characteristics, and is used where the post-anonymisation data should still look realistic (synthetic).

- Randomisation - replacing the data with a randomly generated (seeded) value. Example: randomising the phone number. This technique is used where data accuracy is not critical.

- Suppression - removing or clearing the column. Example: clearing free-format text fields that might contain sensitive data. This technique is used when the data cannot be anonymised by other techniques.

- Hashing - substituting the data in a column with a fake value chosen from a list by hashing the original value. Example: a surname can be replaced with a surname that maps to the same hash value. This technique is used when the replacement value should be consistent and synthetic.

- Scrambling - mixing and rearranging the characters so thoroughly that the original data cannot be determined. Example: reference fields that might contain sensitive data are completely rearranged.

This technique is used where data accuracy is not critical. These techniques were chosen based on the usability of the data post-anonymisation and the degree of anonymisation expected; a minimal sketch of a few of them follows below.
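To make the techniques concrete, here is a minimal Java sketch (not the Talend components actually used in the project) illustrating masking, hash-based substitution and scrambling on plain string values; the account number, fake surname list and reference format are made-up examples:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class AnonymisationSketch {

    // Masking: keep the last four characters visible, replace the rest with 'x'.
    static String mask(String accountNumber) {
        int visible = Math.min(4, accountNumber.length());
        char[] chars = accountNumber.toCharArray();
        for (int i = 0; i < chars.length - visible; i++) {
            chars[i] = 'x';
        }
        return new String(chars);
    }

    // Hashing: deterministically map a surname to one from a list of fake surnames,
    // so the same input always receives the same replacement.
    static String hashSubstitute(String surname, List<String> fakeSurnames) {
        int index = Math.floorMod(surname.hashCode(), fakeSurnames.size());
        return fakeSurnames.get(index);
    }

    // Scrambling: shuffle the characters so the original value cannot be determined.
    static String scramble(String value, Random random) {
        List<Character> chars = new ArrayList<>();
        for (char c : value.toCharArray()) {
            chars.add(c);
        }
        Collections.shuffle(chars, random);
        StringBuilder sb = new StringBuilder(value.length());
        for (char c : chars) {
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(mask("12345678901234"));                   // xxxxxxxxxx1234
        System.out.println(hashSubstitute("Smith",
                Arrays.asList("Taylor", "Brown", "Wilson")));         // consistent fake surname
        System.out.println(scramble("REF-2021-SENSITIVE", new Random(42)));
    }
}
```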

Anonymisation strategy: As this is a compliance requirement, anonymisation was included in the organisational security clearance checklist, i.e. whenever a request to refresh a non-prod environment from prod is raised, the request has to go through security clearance approval, where one of the checklist items is that the data is anonymised.

This process was implemented in two stages:

Stage 1: a one-off process where all non-prod environments (active and inactive) are anonymised.

Stage 2: build a standard process wherein any non-prod refresh request is handled as a BAU activity. For both stages, a generic anonymisation pipeline was designed.

Analysing, designing and building the anonymisation pipeline:

Challenge 1: Our initial challenge was to understand the entire landscape of applications, databases and environments. This knowledge was shared by the SMEs of the respective applications, and we transformed it into a mapping sheet that listed tables and columns per application, along with their referential integrity and functional relationships.

Each column was identified as either sensitive or insensitive, and each sensitive column was mapped to one of the anonymisation techniques above. This was a key task, as it determines how the data looks post-anonymisation.

Security management and application SMEs played a key role in mapping the anonymisation techniques to individual columns or groups of columns. The referential integrity and functional relationships were used to design the job execution sequence.

Challenge 2: The next challenge was designing jobs that enforce functionality, performance, transaction integrity, maintainability, data quality and completeness.

Two basic job templates were prepared: one to execute as a single JVM process, and another as multiple JVM processes running in parallel. The single-JVM design was used for tables with fewer than 40 million rows, and the multi-JVM design for tables with more than 40 million rows.

Talend has an orchestration component that allows a job to run as parent and child processes, where each child runs as an independent JVM process and can be run in parallel.
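The real parent/child orchestration was built with Talend's components; purely as an illustration of the idea, the hypothetical Java sketch below splits a large table's key range into partitions and runs one worker per partition in parallel. The key range, partition count and worker body are assumptions:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelPartitionSketch {

    // Hypothetical worker: in the project each child was an independent
    // Talend-generated JVM process anonymising one slice of the table.
    static void anonymisePartition(long fromId, long toId) {
        System.out.println("Anonymising rows with id between " + fromId + " and " + toId);
        // ... read, transform and update only this id range ...
    }

    public static void main(String[] args) throws InterruptedException {
        long minId = 1L;                  // assumed lowest key of the large table
        long maxId = 80_000_000L;         // assumed highest key (> 40M rows => multi-process design)
        int partitions = 8;               // one child process per partition
        long chunk = (maxId - minId + 1) / partitions;

        ExecutorService pool = Executors.newFixedThreadPool(partitions);
        for (int i = 0; i < partitions; i++) {
            long from = minId + i * chunk;
            long to = (i == partitions - 1) ? maxId : from + chunk - 1;
            pool.submit(() -> anonymisePartition(from, to));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}
```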

Talend also provides both ETL and ELT components, but the anonymisation techniques can only be applied with ETL components and are not available in ELT, while ELT is faster than ETL because the transformation is pushed down to the database. So ETL was used only where a table required independent anonymisation, and ELT was used where already-anonymised data could be propagated to dependent tables, as in the sketch below.
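As an illustration of the ELT-style propagation (with hypothetical customer and orders tables and an email column), a single in-database statement copies the already-anonymised value into the dependent table without pulling rows through the Talend server:

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

public class EltPropagationSketch {

    // Hypothetical example: 'customer' has already been anonymised via ETL;
    // the dependent 'orders' table copies the anonymised value entirely in-database,
    // avoiding a round trip through the Talend server.
    static int propagateAnonymisedEmail(Connection conn) throws SQLException {
        String sql =
            "UPDATE orders o "
          + "SET o.contact_email = (SELECT c.email FROM customer c WHERE c.customer_id = o.customer_id) "
          + "WHERE EXISTS (SELECT 1 FROM customer c WHERE c.customer_id = o.customer_id)";
        try (Statement st = conn.createStatement()) {
            return st.executeUpdate(sql);   // rows updated inside the database (ELT style)
        }
    }
}
```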

This mixed ETL/ELT approach improved the overall performance of the job flow. The job design was file based, i.e. checkpoint/restartability, job execution/completion status and logs were all maintained in files rather than tables. The file-based approach proved more reliable than keeping job execution details in tables.
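The exact file layout we used is not described here, so the following is only an assumed sketch of how a file-based checkpoint might work: the job appends each table name to a checkpoint file once its run commits, and a restarted run skips anything already listed.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

public class FileCheckpointSketch {

    private final Path checkpointFile;
    private final Set<String> completed = new HashSet<>();

    FileCheckpointSketch(Path checkpointFile) throws IOException {
        this.checkpointFile = checkpointFile;
        if (Files.exists(checkpointFile)) {
            // Each line records a table whose anonymisation already committed.
            completed.addAll(Files.readAllLines(checkpointFile, StandardCharsets.UTF_8));
        }
    }

    boolean alreadyDone(String tableName) {
        return completed.contains(tableName);
    }

    // Append the table name once its anonymisation run has committed,
    // so a restarted run can skip it.
    void markDone(String tableName) throws IOException {
        Files.write(checkpointFile, Collections.singletonList(tableName), StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        completed.add(tableName);
    }
}
```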

Transaction integrity was enforced by having a single commit point at the end of a successful anonymisation run; in case of failure, the entire run is rolled back. Data quality was enforced through the chosen techniques, and completeness was ensured by validating the processed record counts in the logs.
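The commit/rollback pattern itself was implemented through Talend's database components; the JDBC sketch below only illustrates the same idea, with a hypothetical customer table and email column, and the per-row read/transform step elided:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class TransactionSketch {

    // Hypothetical in-situ update of a 'customer' table: the read/transform step is elided,
    // only the single-commit-per-run / rollback-on-failure pattern is shown.
    static void anonymiseEmails(String jdbcUrl, String user, String password) throws SQLException {
        try (Connection conn = DriverManager.getConnection(jdbcUrl, user, password)) {
            conn.setAutoCommit(false);                      // one transaction per run
            try (PreparedStatement ps = conn.prepareStatement(
                    "UPDATE customer SET email = ? WHERE customer_id = ?")) {
                long processed = 0;
                // ... for each source row: ps.setString(1, anonymisedEmail);
                //     ps.setLong(2, customerId); ps.addBatch(); ...
                int[] results = ps.executeBatch();
                processed += results.length;
                conn.commit();                              // commit only after a fully successful run
                System.out.println("Anonymised rows: " + processed);   // count validated against the logs
            } catch (SQLException e) {
                conn.rollback();                            // any failure rolls back the entire run
                throw e;
            }
        }
    }
}
```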

Challenge 3: Another performance improvement was to drop constraints and triggers and deactivate non-primary indexes, and to create any indexes needed to improve the anonymisation query performance. A SQL script then checked for constraint and index violations so that the necessary corrective actions could be taken.
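The actual pre/post scripts are not reproduced here; the sketch below only illustrates the kind of Oracle statements involved, with hypothetical constraint, trigger, index and column names:

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class PrePostAnonSteps {

    // Hypothetical Oracle-style pre-steps: relax objects that slow down bulk updates
    // and add an index that helps the anonymisation query.
    static void preAnonymisation(Connection conn) throws SQLException {
        try (Statement st = conn.createStatement()) {
            st.execute("ALTER TABLE customer DISABLE CONSTRAINT fk_customer_address");
            st.execute("ALTER TRIGGER trg_customer_audit DISABLE");
            st.execute("ALTER INDEX idx_customer_surname UNUSABLE");
            st.execute("CREATE INDEX idx_customer_anon ON customer (last_refresh_date)");
        }
    }

    // Post-step validation: report anything still disabled before re-enabling and rebuilding.
    static void validate(Connection conn) throws SQLException {
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(
                     "SELECT constraint_name FROM user_constraints WHERE status = 'DISABLED'")) {
            while (rs.next()) {
                System.out.println("Still disabled: " + rs.getString(1));
            }
        }
    }
}
```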

Challenge 4: To reduce additional disk space, the job was designed to operate in-situ, i.e. read the rows from a table, transform them in Talend and update the same table, all within a single transaction. This also reduced the additional effort required from the DBA.

In situations where the primary key of a table contained sensitive data, insert operations were used instead of updates, and the DBA had to allocate additional database space for staging tables.

Challenge 5: The initial plan was to run the jobs from the Talend scheduler (TAC). However, it has some limitations, as below:

1. TAC is not efficient as a scheduler. If a job is killed from the scheduler, it does not kill the corresponding Oracle process running in the background; the DBA has to kill the DB process.

2. Debugging is difficult in TAC.

3. Since TAC runs on the Talend servers, there is a lot of I/O involved in transferring data from the database server to the Talend server for transformation, which reduced the throughput of the job.

4. In the Talend architecture, the job server and the runtime server are different. This led to preparing the job execution environment on one server while the logs were maintained on another.

5. Git integration with Talend is not efficient. There were a couple of times when I lost a job and had to re-create it while working in offline mode. In Talend, offline mode is faster than online mode and is the recommended way to work in a development environment.

6. Setting up the database connection takes time, and sometimes the request times out.

Hence the decision was changed to run the jobs as batch jobs from the database server. A shell script was developed to execute the jobs in parallel batches, with the jobs within each batch running in sequence, as sketched below.
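The real implementation was a shell script on the database server; the sketch below expresses the same batching idea in Java, with hypothetical JAR names: batches run in parallel, and the jobs inside each batch run one after another.

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class BatchRunnerSketch {

    // Run the jobs of one batch sequentially; each job is an exported Talend JAR.
    static void runBatch(List<String> jobJars) {
        for (String jar : jobJars) {
            try {
                Process p = new ProcessBuilder("java", "-jar", jar)
                        .redirectErrorStream(true)
                        .redirectOutput(new File(jar + ".log"))   // per-job log kept in a file
                        .start();
                if (p.waitFor() != 0) {
                    System.err.println("Job failed, stopping this batch: " + jar);
                    return;                                       // skip the remaining jobs of this batch
                }
            } catch (IOException e) {
                System.err.println("Could not start job " + jar + ": " + e.getMessage());
                return;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Hypothetical batches: batches run in parallel, jobs within a batch run in sequence.
        List<List<String>> batches = Arrays.asList(
                Arrays.asList("anon_customer.jar", "anon_address.jar"),
                Arrays.asList("anon_accounts.jar"),
                Arrays.asList("anon_transactions.jar", "anon_cards.jar"));

        ExecutorService pool = Executors.newFixedThreadPool(batches.size());
        for (List<String> batch : batches) {
            pool.submit(() -> runBatch(batch));
        }
        pool.shutdown();
        pool.awaitTermination(8, TimeUnit.HOURS);
    }
}
```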

Challenge 6: Other challenges we encountered during performance testing were the throughput of the jobs and frequent deadlocks. Apart from the Talend job design, several factors contribute to the overall performance of a job: database design, disk throughput, memory, networking and CPU.

a) Database design - data manipulation operations are slower on large unsegmented tables that scan across the disk. Deadlock errors occurred very frequently when parallel operations hit rows residing in the same block, as block-level locks were enforced.

In such situations, avoid multiple processes and use a single-process job.

b) Disk throughput - solid-state drives (SSDs) proved to increase throughput compared to traditional hard drives (HDDs).

c) Memory - memory was increased to hold the undo logs and reduce deadlocks. Having more memory available to the system for the page cache also improved anonymisation performance.

d) Network and CPU - increasing network bandwidth and CPU did not show any significant improvement, so these were not considered primary factors when selecting hardware.

Challenge 7: Reducing application downtime while rolling out anonymised data. This was achieved by cloning the database, or restoring database snapshots, to a standalone environment and running the anonymisation jobs there.

Once the database is anonymised, the anonymised data is rolled out to the non-prod environment by bringing that environment down, so the downtime covers only the rollout and not the entire anonymisation process.

Any data prepared between the snapshot time and the rollout time is lost; this was accepted.

Challenge 8: As a pre-step, additional storage space had to be requested to maintain a gold copy of the original database (in case of backout, with a retention period) and the anonymised databases (for re-usability).

Key takeaways from this project: 1. Take a holistic approach when you want to improve performance.

2. Understand the importance of the columns being anonymised from the users' (business, testers, developers) perspective.

3. Deadlocks can arise for several reasons other than the general interlocking case.

4. Look for alternative approaches to mitigate the limitations of the software.

I enjoyed and learned a lot from working on a time-sensitive project that impacts multiple stakeholders, alongside a wonderful team.
