How to replicate and anonymize data in a simple, easy-to-maintain way
What are we going to discuss in this article?
We want to share, with some simple examples, how to replicate and anonymize data using just a few data governance rules.
Introduction:
Some years ago, when we were organizing some informative talks at the university, I met a professor specializing in photonics. Later that day we had the chance to talk for a while, and he told me something that changed the way I see things.
"When I am reviewing a paper and a formula pops up with many constants and a huge number of operations, I know the authors were not able to find what they were looking for. Did you realize that all formulas are quite simple? People tend to overcomplicate things when there are much simpler solutions that would work even better."
It may not always be like that, but I have always followed that rule: try to keep it simple.
Today I will try to apply this idea to replicating and anonymizing data.
1- Why?:
It's fairly common to copy data from a production environment to another one so that tests can be run without risk.
But what happens when that data carries sensitive information? It should not be open to everyone. Nobody wants their personal data exposed to strangers: cell phone numbers, addresses and health records should be kept secure.
What can be done? Easy: anonymize the data, transforming sensitive data into something different so that privacy is not at risk.
But traditionally this has been a time-consuming task:
This procedure was valid, but it can be improved if we address its weak points:
2- How can we improve this?
Easy answer: by not having to take care of it.
That sounds too good to be true, so let me summarize the procedure:
And we will rely on a solution provided by Cloudera that unifies security, governance and lineage:
We will get the following using this solution:
3- Step by step procedure
For this example we created a simple table as shown below:
create external table t1 (id int, idontknowwhat int, address string, name string);
And we insert some data:
insert into t1 (id, idontknowwhat, address, name) values (1, 1, 'calle de la piruleta número 7', 'pepe');
insert into t1 (id, idontknowwhat, address, name) values (2, 2, 'Avenida de la gloria 12', 'susana');
This is how it looks:
Step 1 - Choose the sensitive fields:
A new classification, "SensitiveData", was created in Atlas. We assign this classification to anything we want to anonymize. For our tiny example we chose the field "address". As you can see, it is quite easy to do:
Step 2 - Create a masking policy in Ranger
A masking policy must be created in Ranger to mask everything tagged as "SensitiveData".
As an example, we create a masking policy that applies only to the user devel1 and hashes the data tagged with the SensitiveData classification.
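To make the effect of such a policy concrete, here is a minimal Python sketch of the idea: columns carrying the tag are hashed, but only for the users the policy covers. It does not call Atlas or Ranger. The dictionaries, the function names, the user "admin" and the choice of SHA-256 are assumptions made for illustration; the hash actually applied depends on the Ranger and Hive versions in use.

```python
import hashlib

# Hypothetical stand-ins for the Atlas classification and the Ranger policy;
# none of these names belong to the Atlas or Ranger APIs.
COLUMN_TAGS = {"address": {"SensitiveData"}}   # column -> classifications
MASKED_TAGS = {"SensitiveData"}                # tags covered by the policy
MASKED_USERS = {"devel1"}                      # users the policy applies to


def hash_value(value: str) -> str:
    """Return the SHA-256 hex digest of a value (assumed hash function)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def read_row(user: str, row: dict) -> dict:
    """Return the row as the given user would see it."""
    if user not in MASKED_USERS:
        return dict(row)
    masked = {}
    for column, value in row.items():
        tags = COLUMN_TAGS.get(column, set())
        if tags & MASKED_TAGS and value is not None:
            masked[column] = hash_value(str(value))
        else:
            masked[column] = value
    return masked


row = {"id": 1, "idontknowwhat": 1,
       "address": "calle de la piruleta número 7", "name": "pepe"}
print(read_row("devel1", row))  # address replaced by a 64-character hex digest
print(read_row("admin", row))   # hypothetical unmasked user: row unchanged
```

The point of the design is that the rule is attached to the tag, not to the column: tagging one more field as SensitiveData is enough for it to be masked, with no change to the policy.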
We can test our policy for the user devel1, and make sure it is working:
Step 3 - Copy the data to anonymize it
We run the copy operation as the user devel1 to copy the data to a new table. Because the masking policy applies to devel1, the copy contains the masked values. In this example we copy the full table, but other approaches are possible, such as copying only the new records.
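The "copy only the new records" variant can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in this article: an in-memory SQLite database stands in for Hive so the snippet can run anywhere, the target table is created empty beforehand, the helper copy_new_rows is hypothetical, and id is assumed to be positive and always increasing so it can act as a high-water mark. No masking happens in this sketch; in the real setup the select would run as devel1 and return masked values.

```python
import sqlite3

# In-memory SQLite stands in for Hive here (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("create table t1 (id int, idontknowwhat int, address text, name text)")
conn.execute("create table t1_copia (id int, idontknowwhat int, address text, name text)")
conn.executemany(
    "insert into t1 values (?, ?, ?, ?)",
    [(1, 1, "calle de la piruleta número 7", "pepe"),
     (2, 2, "Avenida de la gloria 12", "susana")],
)


def count_rows(conn: sqlite3.Connection, table: str) -> int:
    """Return the number of rows in one of the two known tables."""
    return conn.execute(f"select count(*) from {table}").fetchone()[0]


def copy_new_rows(conn: sqlite3.Connection) -> int:
    """Copy rows whose id is above the highest id already in t1_copia."""
    before = count_rows(conn, "t1_copia")
    conn.execute(
        "insert into t1_copia "
        "select * from t1 "
        "where id > (select coalesce(max(id), 0) from t1_copia)"
    )
    conn.commit()
    return count_rows(conn, "t1_copia") - before


print(copy_new_rows(conn))  # first run copies both rows
print(copy_new_rows(conn))  # second run finds nothing new
```

A high-water mark on id only captures inserts. If rows in t1 can be updated or deleted, copying the full table as in the example remains the simpler option.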
create external table t1_copia as select * from t1;
This procedure can be scheduled using Oozie for example.
Step 4 - Replication Manager will copy data to the target environment
We just need to create a replication policy for that table.
Step 5 - Verify that it works
We can connect to the target environment and read data to confirm the procedure works as expected:
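This check can also be automated. The sketch below is one possible way to do it, assuming the mask produces a 64-character lowercase hex string (a SHA-256 digest); adjust the pattern if your Ranger and Hive versions hash differently. The function unmasked_rows and the sample rows are hypothetical, and the rows are built by hand here instead of being read from the target environment.

```python
import hashlib
import re

# Assumed shape of a masked value: a SHA-256 hex digest.
HEX_DIGEST = re.compile(r"[0-9a-f]{64}")


def unmasked_rows(rows, column="address"):
    """Return the rows whose column does not look like a hex digest."""
    return [
        row for row in rows
        if row[column] is not None and not HEX_DIGEST.fullmatch(row[column])
    ]


# Hand-built sample standing in for rows read from the target table.
sample = [
    {"id": 1, "address": hashlib.sha256("some address".encode("utf-8")).hexdigest()},
    {"id": 2, "address": "Avenida de la gloria 12"},  # leaked plaintext
]

leaks = unmasked_rows(sample)
print([row["id"] for row in leaks])  # ids of rows that still hold plaintext
```

An empty result means every value in the checked column has the expected masked shape. It does not prove that the right policy was applied, only that no plaintext is visible in that column.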
Conclusions
As you can see, we can rely on Atlas and Ranger to create a simple but effective replication procedure that anonymizes data. We also get other benefits: maintenance cost is reduced, data governance can manage the procedure, and data lineage can be used to avoid leaks.