The Dark Side of the Metadata & Data Lineage World

You wouldn't believe it, but there is a dark side to the metadata & data lineage world as well. I would like to dig deep and explain how you can get into trouble. 

It has been a wonderful spring this year, hasn't it? The first months of 2017 were hot for us. Data governance, metadata, and data lineage are everywhere. Everyone is talking about them, everyone is looking for a solution. It's an amazing time. But there is also the other side, the dark side.

The Reality of Metadata Solutions

As we meet more and more large companies, industry experts and analysts, investors, and data professionals, we see a huge gap between their perception of reality and reality itself. What am I talking about? About the end-to-end data lineage ghost. With data being used to make decisions every single day, with regulators like FINRA, the Fed, the SEC, the FCC, and the ECB requesting reports, and with initiatives like BCBS 239 or GDPR (the new European data protection regulation), proper governance and a detailed understanding of the data environment are a must for every enterprise. And E2E (end-to-end) data lineage has become a great symbol of this need. Every metadata/data governance player on the market is talking about it, and their marketing is full of wonderful promises (in the end, that is the main purpose of every marketing leaflet, isn't it?). But what's the reality?

The Automated Baby Beast

The truth is that E2E data lineage is a very tough beast to tame. Just imagine how many systems and data sources you have in your organization, how much data processing logic, how many ETL jobs, how many stored procedures, how many lines of programming code, how many reports, how many ad-hoc Excel sheets, etc. It is huge. Overwhelming!

If your goal is to track every single piece of data and to record every single processing step, every "hop" of data flow through your organization, you have a lot of work to do. And even if you split your big task into smaller ones and start with selected data sets (so-called "critical data elements") one by one, it can still be so exhausting that you will never finish or even really start. And now data governance players have come in with gorgeous promises packaged in one single word - AUTOMATION.

The promise itself is quite simple to explain - their solutions will analyze all data sources and systems, every single piece of logic, extract metadata from them (so-called metadata harvesting), link it all together (so-called metadata stitching), store it, and make it accessible to analysts, architects, and other users through best-in-class, award-winning user interfaces. And all of this through automation. No manual work necessary, or just a little bit. It is so tempting that you open yourself up to it; you want to believe. And so you buy the tool. And then the fun part starts.
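To make the promise a bit more concrete, here is a minimal sketch (in Python) of what harvesting and stitching mean conceptually. All system names, table names, and the naming convention are assumptions invented for this example; real tools track columns, jobs, transformations, and versions in far more detail.

```python
from collections import defaultdict

# A minimal, illustrative sketch of "harvesting" and "stitching".
# Each scanner reports the data flows it can see inside "its" technology;
# real tools also harvest the assets themselves (tables, columns, reports).
# All system and table names below are made up for the example.
harvested_flows = {
    "powercenter": [("ORACLE::SALES.ORDERS", "TERADATA::DWH.F_ORDERS")],
    "cognos":      [("TERADATA::DWH.F_ORDERS", "COGNOS::Orders by Region")],
}

# Stitching: merge flows from all scanners into one cross-system graph,
# linking references that point to the same physical object (here simply
# by matching the fully qualified name).
graph = defaultdict(set)
for scanner, flows in harvested_flows.items():
    for source, target in flows:
        graph[source].add(target)

def downstream(node, seen=None):
    """Walk the stitched graph: 'where does this data end up?'"""
    seen = seen if seen is not None else set()
    for nxt in graph.get(node, ()):
        if nxt not in seen:
            seen.add(nxt)
            downstream(nxt, seen)
    return seen

print(downstream("ORACLE::SALES.ORDERS"))
# expected: {'TERADATA::DWH.F_ORDERS', 'COGNOS::Orders by Region'}
```

The hard part is not walking such a graph - it is filling it in the first place, which is exactly where the trouble described below begins.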

The Machine Built to Fail

Almost nothing works as expected. But somehow you progress with the help of hired (and usually overpriced) experienced consultants. Databases (tables, columns) are there, your nice graphically designed ETL jobs are there, your first simple reports too, but hey! Something is missing! Why? Simply because you used a nasty, complex SQL statement in your beautiful Cognos report. And you used another one when you were not satisfied with the performance of one Informatica PowerCenter job. And hey! Here, lineage is completely broken! Why is THAT? Hmmm, it seems that you decided to write some logic inside stored procedures instead of drawing a terrifying ETL workflow, simply because it was so much easier with all those advanced Oracle features. OK, I believe you've got it. Different kinds of SQL code (and not just SQL, but also Java, C, Python, and many others) are everywhere in your BI environment. Usually, there are millions and millions of lines of code. And unfortunately (at least for all metadata vendors), programming code is super tough to analyze and extract the necessary metadata from. But without it, there is no E2E data lineage.
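To see why, consider what automated extraction actually has to do. The toy sketch below pulls table-level lineage out of a single, well-behaved INSERT ... SELECT statement with a regular expression; the statement and the approach are purely illustrative. Real code - nested subqueries, views, CTEs, dynamic SQL, procedural logic, vendor-specific dialects - defeats anything this naive almost immediately, which is why proper extraction requires full parsers for every language and dialect in your environment.

```python
import re

# Toy illustration only: derive table-level lineage from one simple
# INSERT ... SELECT statement. Real-world SQL (subqueries, views, CTEs,
# dynamic SQL, vendor-specific syntax) breaks this kind of approach
# almost immediately.
sql = """
INSERT INTO dwh.f_orders (order_id, amount)
SELECT o.order_id, o.amount
FROM stage.orders o
JOIN stage.customers c ON c.id = o.customer_id
"""

target = re.search(r"INSERT\s+INTO\s+([\w.]+)", sql, re.IGNORECASE).group(1)
sources = re.findall(r"(?:FROM|JOIN)\s+([\w.]+)", sql, re.IGNORECASE)

for source in sources:
    print(f"{source} -> {target}")
# stage.orders -> dwh.f_orders
# stage.customers -> dwh.f_orders
```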

At this moment, marketing leaflets hit the wall of reality. As of today, we have met a lot of enterprises but only very few solutions capable of automated metadata extraction from programming code. So what do most big vendors (or the big system integrators implementing their solutions) usually do in this situation? Simply finish the rest of the work manually. Yes, you heard me! No automation anymore. Just good old manual labor. But you know what - it can be quite expensive. For example, a year ago we helped one of our customers reduce the time needed to "finish" their metadata project from four months to just one week! They were ready to invest the time of five smart guys, four months per person, to manually analyze hundreds and hundreds of BTEQ scripts, extract metadata from them, and store it in their existing metadata tool. In the United States, we typically meet clients with several hundred thousand database scripts and stored procedures. That's sooo many! Who is going to pay for that? The vendor? The system integrator? No, you know the answer. In most cases, the customer is the one who has to pay for it.
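For a sense of what replacing that manual labor amounts to, here is a hypothetical sketch that applies the same extraction idea from above to a whole directory of scripts and collects candidate lineage edges for review. The directory name, file extension, and extraction rule are assumptions made for illustration; a production-grade harvester needs real parsers, not regular expressions.

```python
import re
from pathlib import Path

# Hypothetical sketch: scan a directory of scripts in bulk and collect
# candidate table-level lineage edges for human review, instead of
# reading every script by hand. Paths and patterns are illustrative only.
def edges_from_script(text):
    edges = set()
    for stmt in text.split(";"):
        target = re.search(r"INSERT\s+INTO\s+([\w.]+)", stmt, re.IGNORECASE)
        sources = re.findall(r"(?:FROM|JOIN)\s+([\w.]+)", stmt, re.IGNORECASE)
        if target:
            edges.update((src, target.group(1)) for src in sources)
    return edges

all_edges = set()
for script in Path("bteq_scripts").glob("**/*.bteq"):
    all_edges |= edges_from_script(script.read_text(errors="ignore"))

print(f"{len(all_edges)} candidate lineage edges extracted")
```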

Know Your Limits

I have been traveling a lot over the last few weeks and have met a lot of people, mostly investors and industry analysts, but also a few professionals. And I was amazed by how little they know about the real capabilities and limitations of existing solutions. Don't get me wrong, I think those big guys do a great job. You can't imagine how hard it is to provide a really easy-to-use metadata or data governance solution. There are so many different stakeholders, needs, and requirements. I admire those big ones. But that should not mean we close our eyes and pretend that those solutions have no limitations. They have limitations, and fortunately the big guys, or at least some of them, have finally realized that it is much better to provide an open API and allow third parties like Manta to integrate and fill the existing gaps. I love the way IBM and Collibra have opened their platforms, and I feel that others will soon follow.

How can you protect yourself as a customer? Simply conduct proper testing before you buy. Programming code in BI is ignored so often, maybe because it is very low-level and typically not the main topic of discussion among C-level guys. (But there are exceptions - just recently I met a wonderful CDO of a huge US bank who knows everything about the scripts and code they have inside their BI. It was so enlightening, after quite a long time.) It is also very hard to choose a reasonable subset of the data environment for testing. But you must do it properly if you want to be sure about the software you are going to buy. With proper testing, you will realize much sooner that there are limitations, and you will start looking for a way to address them up front, not in the middle of your project, already behind schedule and with a C-level guy breathing down your neck.
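What can "proper testing" look like in practice? One simple, tool-agnostic approach is to hand-curate the lineage edges you know are true for a small but representative slice of your environment and compare them with whatever the candidate tool exports. The sketch below assumes two CSV files with "source" and "target" columns; the file names and layout are illustrative only.

```python
import csv

# Minimal sketch of pre-purchase testing: compare a hand-curated set of
# known-true lineage edges against what the candidate tool reports for
# the same subset. File names and CSV layout are assumptions.
def load_edges(path):
    with open(path, newline="") as f:
        return {(row["source"], row["target"]) for row in csv.DictReader(f)}

expected = load_edges("expected_edges.csv")  # ground truth for the test subset
reported = load_edges("tool_export.csv")     # the vendor tool's output

covered = expected & reported
print(f"coverage: {len(covered)}/{len(expected)} expected edges found")
for source, target in sorted(expected - reported):
    print(f"MISSING: {source} -> {target}")
```

Even a crude comparison like this, run on a deliberately chosen subset with plenty of hand-written code in it, will surface the gaps long before they surface themselves in the middle of your project.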

It is time to admit that marketing leaflets lie in most cases (oh, sorry, they just add a little bit of color to the truth), and you must properly test every piece of software you want to buy. Your life will be much easier, and success stories of well-implemented metadata projects won't be so scarce.

Also published on MANTA's blog.


Sam Benedict

Award-winning technical software sales leader, IT leader, with deep experience in data governance, data quality, and data management practices.

7y

It all starts with documentation, and most of the developers I have worked with do not like documentation, nor do they do it well. Tools can help with that - creating easier and more automated ways of deriving documentation, which can then dynamically depict lineage and impact analysis, automatically scan metadata, and update the documentation (including the lineage, mappings, and impacts) based on changes to the source and target. It is not completely flawless, but it is light years ahead of manual processes and human intervention as a means of maintaining it. $.02

Jason Williscroft

President & Founder at John Galt Services

7y

Well... you aren't wrong. Plus the ideal of complete lineage transparency is a receding horizon:
- Over time the automation becomes more capable of extracting lineage from code of a given complexity, but...
- Over time the tooling supports the expression of increasingly complex code.
At the end of the day we have to ask ourselves what is more important: to have a 100% accurate, real-time picture of the logical lineage of every data element? Or to have a sufficiently accurate, sufficiently real-time picture of the shape of the data flow? The former is probably impossible, at least in a steady state where new development occurs regularly. As my former platoon sergeant used to say: want in one hand and, uh, SPIT in the other and just see which one fills up first. The latter is completely doable. While that solution might not answer every question directly, it will always be able to tell you precisely where to go to find the answer, which is good enough far more often than it is not.

Nigel Higgs

Data Architecture Strategist & Mentor | Data Modelling | Metadata

7y

Excellent article Tomas Kratky
