Special edition: 19 gotchas to look out for when evaluating data lineage
Prukalpa
Co-Founder at Atlan – Home for Data Teams | Forbes30 & Fortune40 lists | TED Speaker
Lineage, lineage, lineage...
The holy grail is the end-to-end lineage that connects your source systems (databases, SaaS tools, etc.) and maps data flows to your final usage layer, BI tools. Achieve this and you’ll finally know how data flows across your ecosystem, one of the most impactful tools a data team can have for problems like impact analysis, data observability, root cause analysis, and cost optimization.
Yet, in my opinion, creating data lineage is one of the most complex implementation problems in the data stack.
Why? Because data lineage is an edge-case problem.
Fivetran wrote an extremely nuanced blog in 2021 titled “How we built the most reliable data pipeline ever”.
“Here’s the central lesson we’ve learned: You can’t build a data replication solution once and expect it to work reliably forever, because source systems are too complex. APIs break or work unexpectedly, and there are so many edge cases that only time — and a lot of customers — can help you find and address them. On top of that, source systems continuously evolve and challenge us to adjust to those changes to make sure the replication works quickly and reliably.”
Data lineage suffers from the same challenges as data replication — complex source systems, APIs that break or work unexpectedly, warehouses with slight variations in SQL logic, and BI tools that model data assets differently (think LookML). Not to mention the edge cases in how every team writes code or models their data pipelines.
We’ve found that for lineage, the devil is in the details. It’s not enough to get a “lineage tool”. Do you need…
This holiday week, we’re devoting the issue to the thorny, complex challenge of lineage. Keep reading for our brand new lineage ebook, lots of links, and a lineage-driven “Metadata in Action” video.
Spotlight: 19 questions and gotchas to look for when evaluating lineage
For the past few months, Mark Pavletich and Swami Kumar from our team have been reviewing data from all our work with hundreds of data teams. They’ve identified the 19 questions and gotchas that anybody evaluating data lineage should know, bundled in what is IMHO the most comprehensive guide to evaluating data lineage.
Keep reading for a snippet of 7 of those questions and a link to the full ebook.
1. Which types of SQL statements are supported?
Most lineage tools include automated SQL parsing, which ensures that your lineage graph includes data from systems without a lineage API.
Most SQL parsers support SQL CREATE and, in some cases, MERGE statements. However, many don’t support INSERT INTO and UPDATE statements. These account for most transformations in data warehouses, so they are important for full lineage coverage.
Look for lineage tools that can also parse MERGE, INSERT INTO, and UPDATE statements.
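To make the coverage question concrete, here is a minimal Python sketch of table-level lineage extraction across the four statement types above. It is a toy built on regular expressions, not how any real lineage tool parses SQL, and the table names and the table_lineage helper are hypothetical. It also assumes simple, single-statement SQL; the UPDATE ... FROM form it handles is dialect-specific.

```python
import re

# Toy patterns (an assumption for illustration, not a real SQL parser):
# each statement type maps to a regex that captures the target table.
TARGET_PATTERNS = {
    "CREATE": re.compile(r"^\s*CREATE\s+(?:OR\s+REPLACE\s+)?TABLE\s+([\w.]+)", re.I),
    "INSERT INTO": re.compile(r"^\s*INSERT\s+INTO\s+([\w.]+)", re.I),
    "MERGE": re.compile(r"^\s*MERGE\s+INTO\s+([\w.]+)", re.I),
    "UPDATE": re.compile(r"^\s*UPDATE\s+([\w.]+)", re.I),
}

# Source tables follow FROM, JOIN, or (for MERGE) USING.
SOURCE_PATTERN = re.compile(r"\b(?:FROM|JOIN|USING)\s+([\w.]+)", re.I)


def table_lineage(sql):
    """Return (statement_type, target, sorted_sources), or None if unsupported."""
    for kind, pattern in TARGET_PATTERNS.items():
        match = pattern.match(sql)
        if match:
            target = match.group(1)
            # Drop the target itself so self-references don't show up as sources.
            sources = sorted(set(SOURCE_PATTERN.findall(sql)) - {target})
            return kind, target, sources
    return None


# Hypothetical statements, one per statement type discussed above.
statements = [
    "CREATE TABLE analytics.orders_daily AS SELECT * FROM raw.orders",
    "INSERT INTO analytics.orders_daily SELECT o.id FROM raw.orders o "
    "JOIN raw.customers c ON o.customer_id = c.id",
    "MERGE INTO dim.customers t USING staging.customers s ON t.id = s.id "
    "WHEN MATCHED THEN UPDATE SET name = s.name",
    "UPDATE dim.customers SET tier = s.tier FROM staging.tiers s "
    "WHERE dim.customers.id = s.id",
]

for sql in statements:
    print(table_lineage(sql))
```

A tool that only recognised the first pattern would silently drop the other three statements from the graph, which is exactly the gap to test for during an evaluation.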
2. Does it offer lineage down to the column level?
Table-level lineage is considered “table stakes”, but column-level lineage should be too. It’s crucial for a range of use cases:
Data engineers and analysts may miss key depth during their investigations without the ability to dive into granular columns or field lineage.
Look for a native column-level experience in the UI, including viewing graph linkages at the column level.
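As a rough illustration of what column-level lineage buys you, the Python sketch below stores lineage as edges between (table, column) pairs and walks upstream from a single column. The tables, columns, and the upstream_columns helper are all hypothetical, and real tools derive these edges from parsed SQL rather than a hand-written dictionary.

```python
from collections import deque

# Hypothetical column-level edges: each column maps to the columns it is derived from.
COLUMN_PARENTS = {
    ("mart.revenue", "net_revenue"): [("stg.orders", "amount"), ("stg.refunds", "amount")],
    ("mart.revenue", "order_date"): [("stg.orders", "created_at")],
    ("stg.orders", "amount"): [("raw.orders", "amount_cents")],
    ("stg.orders", "created_at"): [("raw.orders", "created_at")],
    ("stg.refunds", "amount"): [("raw.refunds", "amount_cents")],
}


def upstream_columns(column):
    """Breadth-first walk returning every column that feeds into `column`."""
    seen = set()
    queue = deque([column])
    while queue:
        current = queue.popleft()
        for parent in COLUMN_PARENTS.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


# Investigating a suspicious net_revenue value: only the amount columns are implicated.
for table, column in sorted(upstream_columns(("mart.revenue", "net_revenue"))):
    print(f"{table}.{column}")
```

Table-level lineage would only say that mart.revenue depends on stg.orders. The column-level walk narrows the investigation to the amount columns and leaves created_at out of it.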
3. Does it support field-level lineage for BI dashboards?
Anyone doing root cause analysis needs to dive into an incorrect field (i.e. dimension, measure, calculated field, etc.) in the dashboard, and work backward to zero in on the upstream fields or columns that are broken. This is only possible with field-level lineage for the BI tool.
Field-level lineage is also important for impact analysis. If a data engineer is trying to make a schema change, they need to understand the specific downstream columns and fields that will be affected, not just which dashboards will be affected in some unspecified way.
Some platforms support lineage for a few fields but don’t go deep with BI fields that are crucial for this type of analysis.
Look for two key features:
4. Does it incorporate other types of metadata to give additional context for assets in the lineage graph?
In isolation, lineage only tells part of the story and, therefore, only provides part of the value. Lineage becomes actionable when it’s combined with key metadata and context:
Tools often provide lineage graphs as a siloed view. Without the other metadata for these assets, it can be hard to put lineage in context.
Look for three key features:
5. Can it be used not just to investigate issues, but also to drive action programmatically?
In addition to enabling data people’s work, lineage can also enable automated system actions and workflows.
For example, if an upstream table has data quality issues, it’s important to automatically add announcements to downstream BI dashboards. This keeps business users from creating “Garbage In, Garbage Out” analysis, and saves data analysts and engineers from manually sending alerts or warnings.
Some platforms don’t have the underlying architecture and scalability to perform automated actions based on lineage.
Look for open APIs, the ability to build or customize automated workflows, and the ability to read metadata-change events and trigger changes in linked assets across the lineage graph.
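The announcement example above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Atlan’s API or any vendor’s: the DOWNSTREAM graph, the "dashboard:" naming convention, and both helper functions are hypothetical, and a real implementation would read the graph and write announcements through the platform’s open APIs in response to a metadata-change event.

```python
from collections import deque

# Hypothetical lineage graph: each asset maps to its direct downstream assets.
DOWNSTREAM = {
    "raw.orders": ["stg.orders"],
    "stg.orders": ["mart.revenue", "mart.churn"],
    "mart.revenue": ["dashboard:Revenue Overview"],
    "mart.churn": ["dashboard:Churn Report"],
}


def downstream_assets(asset):
    """Breadth-first walk returning every asset downstream of `asset`."""
    seen = set()
    queue = deque([asset])
    while queue:
        current = queue.popleft()
        for child in DOWNSTREAM.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def announce_on_dashboards(failed_asset, message, announcements):
    """Attach `message` to every dashboard downstream of `failed_asset`."""
    for asset in sorted(downstream_assets(failed_asset)):
        # Assumption: dashboards are identified by a "dashboard:" prefix.
        if asset.startswith("dashboard:"):
            announcements.setdefault(asset, []).append(message)
    return announcements


# Simulate a data quality event on an upstream table.
warnings = announce_on_dashboards(
    "stg.orders", "Upstream table stg.orders failed a data quality check", {}
)
for dashboard, messages in warnings.items():
    print(dashboard, messages)
```

The design point is that the alert logic never names a dashboard. It relies entirely on the lineage graph, so new downstream dashboards are covered as soon as they appear in lineage.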
Metadata in action: Using data lineage to drive root cause analysis
ICYMI: Last week, we introduced a new section to Metadata Weekly: Metadata in Action, where we highlight real use cases of active metadata.
As anyone who works with data knows, answering “That number doesn’t look right” is far from easy. While I guess you could do root cause analysis without lineage, you don’t want to! Lineage lets you zoom into all the key data, context, and changes across a diverse set of tools and systems. One company reduced their six-hour RCA process to just 10 minutes with Atlan’s lineage.
For those who missed this video from last week, learn how great data lineage is the key to faster, easier root cause analysis. Stay tuned next week for a brand new Metadata in Action video!
More from my reading list
Learn more about data lineage with our favorite recent lineage links:
Wishing all of you a week full of happiness and all the pie you can eat.
P.S. Liked reading this edition of the newsletter? Check out the archive here.