Is Guessing Enough for Your GDPR Project?

Tomas Kratky

Founder of Manta, an IBM company

Published: June 28, 2018

I will tell you one thing: I am tired of GDPR buzz. Don’t get me wrong, I value privacy and data protection very much, but I hate how almost every vendor uses it to sell their goods and services, so much so that the original idea is almost lost.

It is similar to BCBS or other past data-oriented regulations. Consulting companies, legal firms, data governance, security and metadata vendors: we are all the same. Buy our “thing” and you will be OK, or at least safer with us! Every second book out there tells us that every change is an opportunity to improve, to evolve. So, what is the improvement here with GDPR? If I look around, I see a lot of legal work being done, adding tons of words (in very small letters, as always) to already long Terms & Conditions. And you know what? I don’t think there is any real improvement in it.

But things are not always so bad. There is also a lot of good stuff going on with one important goal: to better understand and govern data and its lifecycle in a company. And there is one challenging but critical part I want to discuss today: data lineage, meaning how data moves around your organization. You must understand that a customer’s email address or credit card number is not just in your CRM but is spread all over your company in tens or even hundreds of systems: your ERP, data warehouse, reporting, new data lake with analytics, customer portal, numerous Excel sheets and even external systems. The path of the data you collect can be very complex, and if you think about all the possible ways you can move and transform data in your company, one thing should be clear: your data lineage has to be automated as much as possible.

Different Approaches to Data Lineage

That being said, I feel it is important to talk about the different approaches to data lineage used by data governance vendors today. When you talk about metadata, you very often think about simple things: tables, columns, reports. But data lineage is more about logic, meaning programming code in any form. It can be an SQL script, a PL/SQL stored procedure, a Java program or a complex macro in your Excel sheet. It can literally be anything that somehow moves your data from one place to another, transforms it or modifies it. So, what are your options for understanding that logic?

Option 1: Ignore it! (aka data similarity lineage)

No, I am not crazy! There are products that build lineage information without actually touching your code. They read metadata about tables, columns, reports, etc. They profile the data in your tables too. And then they use all that information to create lineage based on similarities, such as tables and columns with similar names or columns with very similar data values. If they find a lot of similarities between two columns, they link them together in the data lineage diagram. And to make it sound even cooler, vendors usually call it AI (another buzzword I really hate). There is one great thing about this approach: if you watch data only, and not algorithms, you do not have to worry about technologies, and it is no big deal if the customer uses Teradata, Oracle or MongoDB with Java on top of it. On the other hand, this approach is not very accurate, the performance impact can be significant (you work with data) and data privacy is at risk (you work with data). There are also a lot of details missing (transformation logic, for example, which customers very often request), and the lineage is limited to the database world, ignoring the application part of your environment.
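To make the idea concrete, here is a minimal sketch of similarity-based linking. It is not how any particular product works: the function names, the equal weighting of name and value similarity, the 0.6 threshold and the sample columns are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Ratio in [0, 1] of how alike two column names look."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def value_overlap(a: list, b: list) -> float:
    """Jaccard overlap of the distinct values profiled from two columns."""
    set_a, set_b = set(a), set(b)
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def guess_links(columns: dict, threshold: float = 0.6) -> list:
    """Link every pair of columns whose combined score reaches the threshold.

    `columns` maps a qualified column name to a sample of its values.
    The 50/50 weighting and the threshold are arbitrary assumptions.
    """
    names = sorted(columns)
    links = []
    for i, left in enumerate(names):
        for right in names[i + 1:]:
            score = (
                0.5 * name_similarity(left.split(".")[-1], right.split(".")[-1])
                + 0.5 * value_overlap(columns[left], columns[right])
            )
            if score >= threshold:
                links.append((left, right, round(score, 2)))
    return links


# Hypothetical profiled samples from three systems.
sample = {
    "crm.customer.email": ["a@x.com", "b@x.com", "c@x.com"],
    "dwh.dim_customer.email_addr": ["a@x.com", "b@x.com", "c@x.com"],
    "erp.invoice.amount": [10, 20, 30],
}
links = guess_links(sample)
```

Note what the result contains: a link between two columns and a score, but no direction and no transformation logic. Nothing in the code knows which column feeds the other, and two unrelated columns that happen to share values would be linked just as confidently.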

Option 2: Do the “business” lineage manually

This approach usually starts from the top by mapping and documenting the knowledge in people’s heads. Talking to application owners, data stewards and data integration specialists should give you fair but often contradictory information about the movement of data in your organization. And if you miss talking to someone you simply don’t know about, a piece of the flow is missing! This often results in a dangerous situation where you have lineage but are unable to use it in real-world scenarios: not only can you not trust your data, you cannot trust the lineage either.

Option 3: Do the technical lineage manually

I will get straight to the point here: trying to analyze technical flows manually is simply destined to fail. With the volume of code you have, its complexity and its rate of change, there is no way to keep up. Reverse engineering existing code is extremely time consuming, and sooner or later manually managed lineage falls out of sync with the actual data transfers in the environment. You end up with lineage you cannot actually trust.

Now that we know that automation is key, let’s take a look at some less labor-intensive and less error-prone approaches.

Option 4: Trace it! (aka data tagging lineage)

Do you know the story of Theseus and the Minotaur? The Minotaur lives in a labyrinth and so does Ariadne who is in charge of the labyrinth. Ariadne gives Theseus a ball of thread to help him navigate the labyrinth by being able to retrace his path.

And this approach is a bit similar. The whole idea is that each piece of data that is being moved or transformed is tagged/labeled by a transformation engine which then tracks that label the whole way from start to finish. It is like Theseus. This approach looks great, but it only works well as long as the transformation engine controls the data’s every movement. A good example is a controlled environment like Cloudera. If anything happens outside its walls, the lineage is broken. It is also important to realize that the lineage is only there if the transformation logic is executed. But think about all the exceptions and rules that apply only once every couple of years. You will not see them in your lineage till they are executed. That is not exactly healthy for your data governance, especially if some of those pieces are critical to your organization.
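A toy version of tagging shows both the appeal and the blind spots. This is only a sketch under assumptions: `Tagged`, `read`, `combine` and `build_contact` are hypothetical names, not the API of any real engine, and the leap-day rule stands in for any logic that runs rarely.

```python
from dataclasses import dataclass, field


@dataclass
class Tagged:
    """A value plus the set of source labels it was derived from."""
    value: object
    tags: frozenset = field(default_factory=frozenset)


def read(source: str, value) -> Tagged:
    """Engine-controlled read: label the value with where it came from."""
    return Tagged(value, frozenset({source}))


def combine(func, *inputs: Tagged) -> Tagged:
    """Engine-controlled transformation: the output inherits all input tags."""
    merged = frozenset().union(*(item.tags for item in inputs))
    return Tagged(func(*(item.value for item in inputs)), merged)


def build_contact(email: Tagged, backup_email: Tagged, is_leap_day: bool) -> Tagged:
    # The rare branch only contributes lineage on the day it actually runs.
    if is_leap_day:
        return combine(lambda a, b: a or b, email, backup_email)
    return combine(lambda a: a.lower(), email)


email = read("crm.customer.email", "Ann@Example.com")
backup = read("portal.user.alt_email", "ann@backup.example")

normal_run = build_contact(email, backup, is_leap_day=False)
rare_run = build_contact(email, backup, is_leap_day=True)

# A step outside the engine (say, a manual export) drops the labels.
exported = Tagged(normal_run.value.upper())
```

On a normal run, the output carries only the `crm.customer.email` tag, so the dependency on the backup address is invisible until the rare branch executes. The `exported` value has no tags at all, which is the thread being cut the moment data leaves the engine.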

Option 5: Control it! (aka self-lineage)

The whole idea here is that you have an all-in-one environment that gives you everything you need: you can define your logic there, track lineage, manage master data and metadata easily, etc. There are several tools like this, especially with the new big data and data lake hype. If you have a software product of this kind, everything happens under its control: every data movement, every change in data. And so it is easy for such a tool to track lineage. But here you have the very same issue as in the previous case with data tagging. Everything that happens outside the controlled environment is invisible, especially when you consider long-term manageability. Over time, as new needs appear and new tools are acquired to address them, gaps in the lineage start to appear.

Option 6: Decode it! (aka decoded lineage)

OK, so now we know that logic can be ignored, traced with tags and controlled. But all those approaches fall short in most real-life scenarios. Why? Simply because the world is complex, heterogeneous, wild and, most importantly, constantly evolving. But there is still another way: to read all the logic, to understand it and to reverse engineer it. That literally means understanding every programming language used in your organization for data transformations and movements. And by programming language I mean really everything, including the graphical and XML-based languages used by ETL tools and reports. And that is the challenging part. It is not easy to develop sufficient support for one language, let alone the tens of them you need in most cases to cover the basics of your environment. Another challenge is dynamic code, where expressions are built on the fly based on program inputs, data in tables, environment variables, etc. But there are ways to handle such situations. On the other hand, this approach is the most accurate and complete, as every single piece of logic is processed. It also guarantees the most detailed lineage of all.
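To show what reading the logic means in the simplest possible case, here is a sketch that derives column-level lineage from one SQL statement without executing it. It is a toy, assuming a single-table `INSERT INTO ... (cols) SELECT ... FROM ...` with no aliases, joins, subqueries or string literals. A real decoder needs a full parser for each language, and the names `decode_lineage` and `split_top_level` are made up for this example.

```python
import re

INSERT_SELECT = re.compile(
    r"INSERT\s+INTO\s+(?P<target>[\w.]+)\s*\((?P<cols>[^)]*)\)\s*"
    r"SELECT\s+(?P<exprs>.+?)\s+FROM\s+(?P<source>[\w.]+)",
    re.IGNORECASE | re.DOTALL,
)


def split_top_level(text: str) -> list:
    """Split on commas that are not nested inside parentheses."""
    parts, depth, current = [], 0, []
    for char in text:
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
        if char == "," and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(char)
    parts.append("".join(current).strip())
    return parts


def decode_lineage(sql: str) -> list:
    """Return (source_column, target_column, expression) triples for one statement."""
    match = INSERT_SELECT.search(sql)
    if match is None:
        return []
    target, source = match["target"], match["source"]
    targets = [col.strip() for col in match["cols"].split(",")]
    edges = []
    for target_col, expr in zip(targets, split_top_level(match["exprs"])):
        # Treat every identifier that is not followed by "(" as a source column.
        for name in re.findall(r"\b[A-Za-z_]\w*\b(?!\s*\()", expr):
            edges.append((f"{source}.{name}", f"{target}.{target_col}", expr))
    return edges


# Hypothetical statement; it is analyzed as text and never run.
sql = """
INSERT INTO dwh.dim_customer (email_addr, full_name)
SELECT LOWER(email), CONCAT(first_name, last_name)
FROM crm.customer
"""
edges = decode_lineage(sql)
```

Each edge has a direction and carries the expression that produced the target column, and the statement did not have to run for the lineage to exist. The price is visible too: even this narrow slice of one SQL dialect needs careful handling, which is why covering tens of languages is the hard part.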

And that’s it. This was not meant to be a scientific article, but I wanted to show you the pros and cons of several popular data lineage approaches. Which leads me back to my first GDPR paragraph. I see enterprises investing a lot of money in data governance solutions with insufficient data lineage capabilities, offering tricks like data similarity, data tagging and even self-lineage. But that is just guesswork, nothing more. Guesswork with a lot of issues and manual labor to correct the lineage.

So, I am asking you once again — is guessing good enough for your GDPR project?

This article, together with the presentation from the DGIQ 2018 conference, was also published on MANTA's Blog.
