The Dark Side of the Metadata & Data Lineage World

You wouldn't believe it, but there is a dark side to the metadata & data lineage world as well. I would like to dig deep and explain how you can get into trouble. 

It has been a wonderful spring this year, hasn't it? The first months of 2017 were hot for us. Data governance, metadata, and data lineage are everywhere. Everyone is talking about them, everyone is looking for a solution. It's an amazing time. But there is also the other side, the dark side.

The Reality of Metadata Solutions

As we meet more and more large companies, industry experts and analysts, investors, and data professionals, we see a huge gap between their perception of reality and reality itself. What am I talking about? About the end-to-end data lineage ghost. With data being used to make decisions every single day, with regulators like FINRA, the Fed, the SEC, the FCC, and the ECB requesting reports, and with initiatives like BCBS 239 or GDPR (the new European data protection regulation), proper governance and a detailed understanding of the data environment are a must for every enterprise. And E2E (end-to-end) data lineage has become a great symbol of this need. Every metadata/data governance player on the market is talking about it, and their marketing is full of wonderful promises (in the end, that is the main purpose of every marketing leaflet, isn't it?). But what's the reality?

The Automated Baby Beast

The truth is that E2E data lineage is a very tough beast to tame. Just imagine how many systems and data sources you have in your organization, how much data processing logic, how many ETL jobs, how many stored procedures, how many lines of programming code, how many reports, how many ad-hoc Excel sheets, etc. It is huge. Overwhelming!

If your goal is to track every single piece of data and to record every single processing step, every "hop" of data flow through your organization, you have a lot of work to do. And even if you split your big task into smaller ones and start with selected data sets (so-called "critical data elements") one by one, it can still be so exhausting that you will never finish or even really start. And now data governance players have come in with gorgeous promises packaged in one single word - AUTOMATION.

The promise itself is quite simple to explain - their solutions will analyze all data sources and systems, every single piece of logic, extract metadata from them (so-called metadata harvesting), link it all together (so-called metadata stitching), store it, and make it accessible to analysts, architects, and other users through best-in-class, award-winning user interfaces. And all of this through automation. No manual work necessary, or just a little bit. It is so tempting that you open yourself up to it; you want to believe. And so you buy the tool. And then the fun part starts.
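To make the promise a bit more concrete, here is a minimal sketch (in Python) of what harvesting and stitching mean conceptually. All system names, table names, and the naming convention are assumptions invented for this example; real tools track columns, jobs, transformations, and versions in far more detail.

```python
from collections import defaultdict

# A minimal, illustrative sketch of "harvesting" and "stitching".
# Each scanner reports the data flows it can see inside "its" technology;
# real tools also harvest the assets themselves (tables, columns, reports).
# All system and table names below are made up for the example.
harvested_flows = {
    "powercenter": [("ORACLE::SALES.ORDERS", "TERADATA::DWH.F_ORDERS")],
    "cognos":      [("TERADATA::DWH.F_ORDERS", "COGNOS::Orders by Region")],
}

# Stitching: merge flows from all scanners into one cross-system graph,
# linking references that point to the same physical object (here simply
# by matching the fully qualified name).
graph = defaultdict(set)
for scanner, flows in harvested_flows.items():
    for source, target in flows:
        graph[source].add(target)

def downstream(node, seen=None):
    """Walk the stitched graph: 'where does this data end up?'"""
    seen = seen if seen is not None else set()
    for nxt in graph.get(node, ()):
        if nxt not in seen:
            seen.add(nxt)
            downstream(nxt, seen)
    return seen

print(downstream("ORACLE::SALES.ORDERS"))
# expected: {'TERADATA::DWH.F_ORDERS', 'COGNOS::Orders by Region'}
```

The hard part is not walking such a graph - it is filling it in the first place, which is exactly where the trouble described below begins.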

The Machine Built to Fail

Almost nothing works as expected. But somehow you progress with the help of hired (and usually overpriced) experienced consultants. Databases (tables, columns) are there, your nice graphically designed ETL jobs are there, your first simple reports too, but hey! Something is missing! Why? Simply because you used a nasty, complex SQL statement in your beautiful Cognos report. And you used another one when you were not satisfied with the performance of one Informatica PowerCenter job. And hey! Here, lineage is completely broken! Why is THAT? Hmmm, it seems that you decided to write some logic inside stored procedures instead of drawing a terrifying ETL workflow, simply because it was so much easier with all those advanced Oracle features. OK, I believe you've got it. Different kinds of SQL code (and not just SQL, but also Java, C, Python, and many others) are everywhere in your BI environment. Usually, there are millions and millions of lines of code. And unfortunately (at least for all metadata vendors), programming code is super tough to analyze and extract the necessary metadata from. But without it, there is no E2E data lineage.
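To see why, consider what automated extraction actually has to do. The toy sketch below pulls table-level lineage out of a single, well-behaved INSERT ... SELECT statement with a regular expression; the statement and the approach are purely illustrative. Real code - nested subqueries, views, CTEs, dynamic SQL, procedural logic, vendor-specific dialects - defeats anything this naive almost immediately, which is why proper extraction requires full parsers for every language and dialect in your environment.

```python
import re

# Toy illustration only: derive table-level lineage from one simple
# INSERT ... SELECT statement. Real-world SQL (subqueries, views, CTEs,
# dynamic SQL, vendor-specific syntax) breaks this kind of approach
# almost immediately.
sql = """
INSERT INTO dwh.f_orders (order_id, amount)
SELECT o.order_id, o.amount
FROM stage.orders o
JOIN stage.customers c ON c.id = o.customer_id
"""

target = re.search(r"INSERT\s+INTO\s+([\w.]+)", sql, re.IGNORECASE).group(1)
sources = re.findall(r"(?:FROM|JOIN)\s+([\w.]+)", sql, re.IGNORECASE)

for source in sources:
    print(f"{source} -> {target}")
# stage.orders -> dwh.f_orders
# stage.customers -> dwh.f_orders
```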

At this moment, marketing leaflets hit the wall of reality. As of today, we have met a lot of enterprises but only very few solutions capable of automated metadata extraction from programming code. So what do most big vendors (or the big system integrators implementing their solutions) usually do in this situation? Simply finish the rest of the work manually. Yes, you heard me! No automation anymore. Just good old manual labor. But you know what - it can be quite expensive. For example, a year ago we helped one of our customers reduce the time needed to "finish" their metadata project from four months to just one week! They were ready to invest the time of five smart guys, four months per person, to manually analyze hundreds and hundreds of BTEQ scripts, extract metadata from them, and store it in their existing metadata tool. In the United States, we typically meet clients with several hundred thousand database scripts and stored procedures. That's sooo many! Who is going to pay for that? The vendor? The system integrator? No, you know the answer. In most cases, the customer is the one who has to pay for it.
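For a sense of what replacing that manual labor amounts to, here is a hypothetical sketch that applies the same extraction idea from above to a whole directory of scripts and collects candidate lineage edges for review. The directory name, file extension, and extraction rule are assumptions made for illustration; a production-grade harvester needs real parsers, not regular expressions.

```python
import re
from pathlib import Path

# Hypothetical sketch: scan a directory of scripts in bulk and collect
# candidate table-level lineage edges for human review, instead of
# reading every script by hand. Paths and patterns are illustrative only.
def edges_from_script(text):
    edges = set()
    for stmt in text.split(";"):
        target = re.search(r"INSERT\s+INTO\s+([\w.]+)", stmt, re.IGNORECASE)
        sources = re.findall(r"(?:FROM|JOIN)\s+([\w.]+)", stmt, re.IGNORECASE)
        if target:
            edges.update((src, target.group(1)) for src in sources)
    return edges

all_edges = set()
for script in Path("bteq_scripts").glob("**/*.bteq"):
    all_edges |= edges_from_script(script.read_text(errors="ignore"))

print(f"{len(all_edges)} candidate lineage edges extracted")
```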

Know Your Limits

I have been traveling a lot over the last few weeks and have met a lot of people, mostly investors and industry analysts, but also a few professionals. And I was amazed by how little they know about the real capabilities and limitations of existing solutions. Don't get me wrong, I think those big guys do a great job. You can't imagine how hard it is to provide a really easy-to-use metadata or data governance solution. There are so many different stakeholders, needs, and requirements. I admire those big ones. But that should not mean we close our eyes and pretend that those solutions have no limitations. They have limitations, and fortunately the big guys, or at least some of them, have finally realized that it is much better to provide an open API and allow third parties like Manta to integrate and fill the existing gaps. I love the way IBM and Collibra have opened their platforms, and I feel that others will soon follow.

How can you protect yourself as a customer? Simply conduct proper testing before you buy. Programming code in BI is ignored so often, maybe because it is very low-level and typically not the main topic of discussion among C-level guys. (But there are exceptions - just recently I met a wonderful CDO of a huge US bank who knows everything about the scripts and code they have inside their BI. It was so enlightening, after quite a long time.) It is also very hard to choose a reasonable subset of the data environment for testing. But you must do it properly if you want to be sure about the software you are going to buy. With proper testing, you will realize much sooner that there are limitations, and you will start looking for a way to address them up front, not in the middle of your project, already behind schedule and with a C-level guy breathing down your neck.
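What can "proper testing" look like in practice? One simple, tool-agnostic approach is to hand-curate the lineage edges you know are true for a small but representative slice of your environment and compare them with whatever the candidate tool exports. The sketch below assumes two CSV files with "source" and "target" columns; the file names and layout are illustrative only.

```python
import csv

# Minimal sketch of pre-purchase testing: compare a hand-curated set of
# known-true lineage edges against what the candidate tool reports for
# the same subset. File names and CSV layout are assumptions.
def load_edges(path):
    with open(path, newline="") as f:
        return {(row["source"], row["target"]) for row in csv.DictReader(f)}

expected = load_edges("expected_edges.csv")  # ground truth for the test subset
reported = load_edges("tool_export.csv")     # the vendor tool's output

covered = expected & reported
print(f"coverage: {len(covered)}/{len(expected)} expected edges found")
for source, target in sorted(expected - reported):
    print(f"MISSING: {source} -> {target}")
```

Even a crude comparison like this, run on a deliberately chosen subset with plenty of hand-written code in it, will surface the gaps long before they surface themselves in the middle of your project.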

It is time to admit that marketing leaflets lie in most cases (oh, sorry, they just add a little bit of color to the truth), and you must properly test every piece of software you want to buy. Your life will be much easier, and success stories of well-implemented metadata projects won't be so scarce.

Also published on MANTA's blog.


Sam Benedict

Award-winning technical software sales leader, IT leader, with deep experience in data governance, data quality, and data management practices.

7y

It all starts with documentation, and most of the developers I have worked with do not like documentation, nor do they do it well. Tools can help with that - creating easier and more automated ways of deriving documentation, which can then dynamically depict lineage and impact analysis, automatically scan metadata, and update the documentation (including the lineage, mappings, and impacts) based on changes to the source and target. It is not completely flawless, but it is light years ahead of manual processes and human intervention as a means of maintaining it. $.02

Jason Williscroft

President & Founder at John Galt Services

7y

Well... you aren't wrong. Plus the ideal of complete lineage transparency is a receding horizon:
- Over time the automation becomes more capable of extracting lineage from code of a given complexity, but...
- Over time the tooling supports the expression of increasingly complex code.
At the end of the day we have to ask ourselves what is more important: to have a 100% accurate, real-time picture of the logical lineage of every data element? Or to have a sufficiently accurate, sufficiently real-time picture of the shape of the data flow? The former is probably impossible, at least in a steady state where new development occurs regularly. As my former platoon sergeant used to say: want in one hand and, uh, SPIT in the other and just see which one fills up first. The latter is completely doable. While that solution might not answer every question directly, it will always be able to tell you precisely where to go to find the answer, which is good enough far more often than it is not.

Nigel Higgs

Data Architecture Strategist & Mentor | Data Modelling | Metadata

7y

Excellent article Tomas Kratky
