How to replicate and anonymize data in a simple and easy to maintain way


What are we going to discuss in this article?

We want to show, with a few simple examples, how to replicate and anonymize data using just a few data governance rules.

  • We will use Atlas to set a classification.
  • We will rely on Ranger to create some masking policies.
  • And we will use Replication Manager to replicate data to other environments.



Introduction:

Some years ago, when we were organizing some informative talks at the university, I met a professor who specialized in photonics. Later that day we had the chance to speak a bit and he told me something that changed the way I see things.

"When I am reviewing a paper and a formula pops up with many constants and a huge number of operations, I know the authors were not able to find what they were looking for. Did you realize that all formulas are quite simple? People tend to overcomplicate things when there are much simpler solutions that will even work better."

It may not always be like that, but I have always followed that rule: try to keep it simple.

Today I will try to apply this idea to replicating and anonymizing data.

1- Why?

It's fairly common to copy data from a production environment to another environment in order to run tests without risk.

But what happens when that data carries sensitive information? It should not be open to just anyone. Nobody wants their personal data exposed to strangers: cell phone numbers, addresses and health records should be kept secure.

What can be done? Easy: anonymize the data, transforming sensitive data into something different so that privacy is not at risk.


But traditionally this has been a time-consuming task:


  1. First, you have to analyze and select which data will be anonymized.
  2. Then, you create an ad-hoc process that copies the data and anonymizes it while making that copy.
  3. Finally, you are able to replicate that data to the target environment.
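
To make the maintenance problem concrete, here is a minimal sketch of what such an ad-hoc copy process often boils down to. It is only an illustration: the `SENSITIVE_COLUMNS` list, the function names and the choice of SHA-256 are our own assumptions, not part of any Cloudera tool.

```python
import hashlib

# Hypothetical hard-coded list: every schema change means editing this code.
SENSITIVE_COLUMNS = {"address"}


def anonymize_value(value: str) -> str:
    """Replace a value with its SHA-256 hex digest (illustrative choice)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def copy_and_anonymize(rows: list[dict]) -> list[dict]:
    """Copy rows, hashing only the columns listed in SENSITIVE_COLUMNS."""
    return [
        {
            col: anonymize_value(str(val)) if col in SENSITIVE_COLUMNS else val
            for col, val in row.items()
        }
        for row in rows
    ]


source = [
    {"id": 1, "address": "calle de la piruleta número 7", "name": "pepe"},
]
copied = copy_and_anonymize(source)
print(copied)
```

Note that a new sensitive column (say, a phone number) would be copied in clear text until somebody remembers to update the list in the code.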

This procedure is valid, but it can be improved if we address its weak points:

  • It is a procedure with some complexity that we have to schedule and monitor.
  • When something changes, like adding new fields, we have to make some changes on the procedure.
  • Governance has no way to be sure which data is being anonymized, and therefore some inconsistencies may happen.


2- How can we improve this?

Easy answer: Not having to take care of this.

Sounds too good to be true, so let me summarize the procedure:

  • Governance will set some classifications to sensitive data.
  • We will create some masking policies to anonymize that data.
  • We will use some SQL statements to copy the data, which will be anonymized automatically.
  • We will replicate this anonymized data to other environments.

And we will rely on a solution provided by Cloudera that unifies security, governance and lineage.

We will get the following using this solution:

  • Governance will be able to see through Atlas which data is going to be anonymized.
  • Lineage integration will propagate the classifications to other entities, and the same anonymization rules will be followed.
  • Need to add or remove fields from anonymization? No problem, you can just set or remove the classification for a field in Atlas. Governance can manage this autonomously, with no need to notify anyone.
  • No code to maintain, no need to compile anything after an update.
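
The idea of classifications travelling along lineage can be pictured with a small sketch. This is a simplified model written for this article, not how Atlas implements propagation; the lineage edges and the `reports.customer_address` column are hypothetical.

```python
from collections import deque

# Hypothetical lineage edges: source column -> columns derived from it.
LINEAGE = {
    "t1.address": ["t1_copia.address"],
    "t1_copia.address": ["reports.customer_address"],
}


def propagate(tagged: dict, lineage: dict) -> dict:
    """Return the classifications of every column after following lineage."""
    result = {col: set(tags) for col, tags in tagged.items()}
    queue = deque(tagged)
    while queue:
        col = queue.popleft()
        for child in lineage.get(col, []):
            before = len(result.setdefault(child, set()))
            result[child] |= result[col]
            # Only revisit a column when it actually gained a classification.
            if len(result[child]) > before:
                queue.append(child)
    return result


tags = propagate({"t1.address": {"SensitiveData"}}, LINEAGE)
for column, classifications in sorted(tags.items()):
    print(column, sorted(classifications))
```

Tagging a single source column is enough: every column derived from it ends up with the same classification, so the same masking rule applies to it.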

3- Step by step procedure

For this example we created a simple table as shown below:

create external table t1 (id int, idontknowwhat int, address string, name string);        

And we insert some data:

insert into t1 (id, idontknowwhat, address, name) values (1, 1, 'calle de la piruleta número 7', 'pepe');
insert into t1 (id, idontknowwhat, address, name) values (2, 2, 'Avenida de la gloria 12', 'susana');        

This is how it looks:

Step 1 - Choose the sensitive fields:

A new classification was created in Atlas: "SensitiveData". We assign this classification to anything we want to anonymize. For our tiny example we chose the field "address". It is quite easy to do.
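
The same step could probably be scripted instead of done through the UI. The sketch below only builds the request payloads and does not call anything. It is based on our understanding of the Atlas v2 REST API (`types/typedefs` and `entity/guid/{guid}/classifications`), so please check it against the documentation of your Atlas version. The host name and the GUID are placeholders.

```python
import json

# Assumptions: placeholder host and placeholder entity GUID.
ATLAS_URL = "https://atlas-host.example.com/api/atlas/v2"
COLUMN_GUID = "<guid-of-t1.address>"


def classification_typedef(name: str) -> dict:
    """Payload that defines a new, attribute-less classification."""
    return {
        "classificationDefs": [
            {
                "name": name,
                "description": "Data that must be anonymized",
                "superTypes": [],
                "attributeDefs": [],
            }
        ]
    }


def classification_assignment(name: str) -> list:
    """Payload that attaches the classification to one entity."""
    return [{"typeName": name, "propagate": True}]


create_request = (
    "POST",
    f"{ATLAS_URL}/types/typedefs",
    classification_typedef("SensitiveData"),
)
assign_request = (
    "POST",
    f"{ATLAS_URL}/entity/guid/{COLUMN_GUID}/classifications",
    classification_assignment("SensitiveData"),
)

for method, url, body in (create_request, assign_request):
    print(method, url)
    print(json.dumps(body, indent=2))
```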

Step 2 - Create a masking policy in Ranger

A masking policy must be created in Ranger to mask everything tagged as "SensitiveData".

As an example, we create a masking policy that applies only to the user devel1 and hashes the data tagged with the SensitiveData classification.
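
For readers who prefer to keep policies as code, this is roughly what such a tag-based masking policy might look like as JSON. Treat it as a sketch under assumptions: the field names follow our understanding of the Ranger policy model, while the tag service name `cm_tag`, the policy name and the URL are placeholders to adapt to your cluster. Nothing is sent to Ranger here.

```python
import json

# Placeholder endpoint; the policy would be POSTed here by an admin user.
RANGER_POLICY_URL = "https://ranger-host.example.com/service/public/v2/api/policy"


def tag_masking_policy(tag: str, user: str, service: str = "cm_tag") -> dict:
    """Build a masking policy that hashes everything carrying `tag` for `user`."""
    return {
        "service": service,  # assumed name of the tag-based service
        "name": f"mask-{tag}-for-{user}",
        "policyType": 1,  # 1 = data masking policy
        "isEnabled": True,
        "resources": {
            "tag": {"values": [tag], "isExcludes": False, "isRecursive": False}
        },
        "dataMaskPolicyItems": [
            {
                "users": [user],
                "accesses": [{"type": "hive:select", "isAllowed": True}],
                "dataMaskInfo": {"dataMaskType": "hive:MASK_HASH"},
            }
        ],
    }


policy = tag_masking_policy("SensitiveData", "devel1")
print(json.dumps(policy, indent=2))
```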

We can test our policy for the user devel1, and make sure it is working:
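
To give an idea of what the test should show, here is a tiny simulation of the policy in plain Python. It is not Ranger code: the column tags and the policy are modelled as dictionaries, and SHA-256 is assumed as the hash function (the function actually used depends on your Hive version).

```python
import hashlib

# Simplified stand-ins for the Atlas classification and the Ranger policy.
COLUMN_TAGS = {"address": {"SensitiveData"}}
MASKING_POLICY = {"tag": "SensitiveData", "users": {"devel1"}}


def mask_hash(value) -> str:
    """Hash a value; SHA-256 is an assumption made for this sketch."""
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()


def read_row(user: str, row: dict) -> dict:
    """Return the row as `user` would see it under the masking policy."""
    visible = {}
    for col, val in row.items():
        applies = (
            user in MASKING_POLICY["users"]
            and MASKING_POLICY["tag"] in COLUMN_TAGS.get(col, set())
        )
        visible[col] = mask_hash(val) if applies else val
    return visible


row = {"id": 2, "idontknowwhat": 2, "address": "Avenida de la gloria 12", "name": "susana"}
print(read_row("devel1", row))      # address is replaced by a hash
print(read_row("other_user", row))  # policy does not apply to this user
```

In this model only the tagged column changes for devel1, while every other column and every other user is left untouched.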

Step 3 - Copy the data to anonymize it

We run the copy operation as the user devel1 to copy the data to a new table. Because the masking policy applies to devel1, the copy stores the masked values. In this example we are copying the full table, but we could use another approach, such as copying only the new records.

 create external table t1_copia as select * from t1;        

This procedure can be scheduled using Oozie for example.
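
The "only the new records" variant can be sketched with a simple high-water mark on `id`. The example below uses SQLite from Python purely so that it can be run anywhere; it is not Hive, there is no masking involved, and it assumes that `id` only grows. In the real setup the `SELECT` would run in Hive as devel1, so the copied values would already be masked.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (id INTEGER, idontknowwhat INTEGER, address TEXT, name TEXT);
    CREATE TABLE t1_copia (id INTEGER, idontknowwhat INTEGER, address TEXT, name TEXT);
    INSERT INTO t1 VALUES (1, 1, 'calle de la piruleta número 7', 'pepe');
""")

# Copy only rows whose id is above the highest id already copied.
INCREMENTAL_COPY = """
    INSERT INTO t1_copia
    SELECT * FROM t1
    WHERE id > (SELECT COALESCE(MAX(id), 0) FROM t1_copia)
"""


def copy_new_rows(connection) -> int:
    """Run the incremental copy and return the row count of the copy table."""
    connection.execute(INCREMENTAL_COPY)
    connection.commit()
    return connection.execute("SELECT COUNT(*) FROM t1_copia").fetchone()[0]


after_first_run = copy_new_rows(conn)
conn.execute("INSERT INTO t1 VALUES (2, 2, 'Avenida de la gloria 12', 'susana')")
after_second_run = copy_new_rows(conn)
after_third_run = copy_new_rows(conn)  # nothing new, so nothing is copied
print(after_first_run, after_second_run, after_third_run)
```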

Step 4 - Replication Manager will copy data to the target environment

We just need to create a replication policy for that table.

Step 5 - Review that it works

We can connect to the target environment and read data to confirm the procedure works as expected:
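
If we wanted to automate this check, a small comparison between source and target rows would be enough. The sketch below is our own illustration: it assumes both sides have been read into lists of dictionaries in the same order, and the `leaked_values` helper is hypothetical.

```python
import hashlib


def leaked_values(source_rows, target_rows, sensitive_columns):
    """Return (row index, column) pairs where the target still holds clear text."""
    leaks = []
    for index, (src, tgt) in enumerate(zip(source_rows, target_rows)):
        for col in sensitive_columns:
            if src[col] == tgt[col]:
                leaks.append((index, col))
    return leaks


source_rows = [
    {"id": 1, "address": "calle de la piruleta número 7", "name": "pepe"},
    {"id": 2, "address": "Avenida de la gloria 12", "name": "susana"},
]
# Simulated target: addresses hashed, everything else copied as-is.
target_rows = [
    {**row, "address": hashlib.sha256(row["address"].encode("utf-8")).hexdigest()}
    for row in source_rows
]

print(leaked_values(source_rows, target_rows, ["address"]))  # no leaks expected
print(leaked_values(source_rows, source_rows, ["address"]))  # every row leaks
```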

Conclusions

As you can see, we can rely on Atlas and Ranger to create a simple but effective replication procedure that anonymizes data. We also get some nice extras: maintenance cost is reduced, data governance can manage the procedure, and data lineage can be used to avoid leaks.
