
Special edition: 19 gotchas to look out for when evaluating data lineage

Prukalpa

Co-Founder at Atlan – Home for Data Teams | Forbes30 & Fortune40 lists | TED Speaker

Published: November 26, 2022

Lineage, lineage, lineage...

The holy grail is end-to-end lineage that connects your source systems (databases, SaaS tools, etc.) and maps data flows all the way to your final usage layer: BI tools. Achieve this and you’ll finally know how data flows across your ecosystem, which makes lineage one of the most impactful tools a data team can have for problems like impact analysis, data observability, root cause analysis, and cost optimization.

Yet, in my opinion, creating data lineage is one of the most complex implementation problems in the data stack.

Why? Because data lineage is an edge-case problem.

Fivetran wrote an extremely nuanced blog in 2021 titled “How we built the most reliable data pipeline ever”.

“Here’s the central lesson we’ve learned: You can’t build a data replication solution once and expect it to work reliably forever, because source systems are too complex. APIs break or work unexpectedly, and there are so many edge cases that only time — and a lot of customers — can help you find and address them. On top of that, source systems continuously evolve and challenge us to adjust to those changes to make sure the replication works quickly and reliably.”

Data lineage suffers from the same challenges as data replication — complex source systems, APIs that break or work unexpectedly, warehouses with slight variations in SQL logic, and BI tools that model data assets differently (think LookML). Not to mention the edge cases in how every team writes code or models their data pipelines.

We’ve found that for lineage, the devil is in the details. It’s not enough to get a “lineage tool”. Do you need…

column-level lineage or table-level lineage?
cross-system lineage or lineage that’s isolated in the data warehouse?
SQL parsing for generating lineage?
parser support for MERGE, INSERT INTO, and UPDATE statements, in addition to the usual CREATE statements?

On this holiday week, we’re devoting the issue to the thorny, complex challenge of lineage. Keep reading for our brand new lineage ebook, lots of links, and a lineage-driven “Metadata in Action” video.

Spotlight: 19 questions and gotchas to look for when evaluating lineage

For the past few months, Mark Pavletich and Swami Kumar from our team have been reviewing data from our work with hundreds of data teams. They’ve identified the 19 questions and gotchas that anybody evaluating data lineage should know, bundled in what is IMHO the most comprehensive guide to evaluating data lineage.

Keep reading for a snippet of 7 of those questions and a link to the full ebook.

1. Which types of SQL statements are supported?

Most lineage tools include automated SQL parsing, which ensures that your lineage graph includes data from systems without a lineage API.

Most SQL parsers support SQL CREATE and, in some cases, MERGE statements. However, many don’t support INSERT INTO and UPDATE statements. These account for most transformations in data warehouses, so they are important for full lineage coverage.

Look for lineage tools that can also parse MERGE, INSERT INTO, and UPDATE statements.
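
One rough way to gauge how much this matters for your own warehouse is to count statement types in your query history. Here is a minimal Python sketch of that idea. The sample QUERY_LOG, the table names, and the classify and coverage helpers are all hypothetical, and the check only looks at the leading keyword, so statements that start with a comment or a CTE would land in "OTHER".

```python
import re
from collections import Counter

# Hypothetical sample of a warehouse query log; in practice this would come
# from your warehouse's query history.
QUERY_LOG = [
    "CREATE TABLE analytics.orders_clean AS SELECT * FROM raw.orders",
    "INSERT INTO analytics.daily_revenue SELECT order_date, SUM(amount) "
    "FROM analytics.orders_clean GROUP BY order_date",
    "UPDATE analytics.customers SET tier = 'gold' WHERE lifetime_value > 1000",
    "MERGE INTO analytics.customers AS t USING staging.customers AS s "
    "ON t.id = s.id WHEN MATCHED THEN UPDATE SET t.email = s.email",
    "SELECT COUNT(*) FROM analytics.orders_clean",
]

# Statement types that write data and therefore create lineage edges.
WRITE_PATTERNS = {
    "CREATE": re.compile(r"^\s*CREATE\b", re.IGNORECASE),
    "INSERT INTO": re.compile(r"^\s*INSERT\s+INTO\b", re.IGNORECASE),
    "UPDATE": re.compile(r"^\s*UPDATE\b", re.IGNORECASE),
    "MERGE": re.compile(r"^\s*MERGE\b", re.IGNORECASE),
}


def classify(statement: str) -> str:
    """Label a statement by its leading keyword, or "OTHER" if it is not a write."""
    for name, pattern in WRITE_PATTERNS.items():
        if pattern.match(statement):
            return name
    return "OTHER"


def coverage(statements, supported) -> float:
    """Share of write statements whose type appears in `supported`."""
    counts = Counter(classify(s) for s in statements)
    writes = sum(n for kind, n in counts.items() if kind != "OTHER")
    covered = sum(
        n for kind, n in counts.items() if kind != "OTHER" and kind in supported
    )
    return covered / writes if writes else 1.0


if __name__ == "__main__":
    print(coverage(QUERY_LOG, {"CREATE"}))           # parser that only handles CREATE
    print(coverage(QUERY_LOG, {"CREATE", "MERGE"}))  # CREATE plus MERGE
```

On this made-up log, a CREATE-only parser would see one of the four write statements. Your real ratio will differ, which is exactly why it is worth measuring before you evaluate a tool.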

2. Does it offer lineage down to the column level?

Table-level lineage is considered “table stakes”, but column-level lineage should be too. It’s crucial for a range of use cases:

Tracing sensitive data classifications for transformed PII data
Impact analysis from things like schema changes
Root cause analysis, e.g. investigating why a dashboard looks off by tracing a BI field to upstream columns in the data warehouse

Without the ability to dive into granular column or field lineage, data engineers and analysts may miss key details during their investigations.

Look for a native column-level experience in the UI, including viewing graph linkages at the column level.
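
To make the root cause use case concrete, here is a small Python sketch of tracing a BI field back to the warehouse columns it depends on. The UPSTREAM mapping, the column names, and the trace_upstream helper are illustrative assumptions, not how any particular lineage tool stores its graph.

```python
from collections import deque

# Hypothetical column-level edges: each key is a column or BI field,
# each value lists the columns it is derived from.
UPSTREAM = {
    "dashboard.revenue_by_region.total_revenue": ["analytics.daily_revenue.revenue"],
    "analytics.daily_revenue.revenue": ["analytics.orders_clean.amount"],
    "analytics.orders_clean.amount": ["raw.orders.amount", "raw.orders.currency"],
}


def trace_upstream(node, edges):
    """Return every column upstream of `node`, in breadth-first order."""
    seen = {node}
    order = []
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for parent in edges.get(current, []):
            if parent not in seen:  # guard against cycles and shared parents
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


if __name__ == "__main__":
    for column in trace_upstream("dashboard.revenue_by_region.total_revenue", UPSTREAM):
        print(column)
```

With table-level lineage alone, the same question would only return raw.orders, with no hint that both amount and currency feed the number on the dashboard.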

3. Does it support field-level lineage for BI dashboards?

Anyone doing root cause analysis needs to dive into an incorrect field (e.g. a dimension, measure, or calculated field) in the dashboard, and work backward to zero in on the upstream fields or columns that are broken. This is only possible with field-level lineage for the BI tool.

Field-level lineage is also important for impact analysis. If a data engineer is trying to make a schema change, they need to understand the specific downstream columns and fields that will be affected, not just which dashboards will be affected in some unspecified way.

Some platforms support lineage for a few fields but don’t go deep with BI fields that are crucial for this type of analysis.

Look for two key features:

Coverage of both column-level lineage for SQL sources and BI field-level lineage.
Which BI objects are supported and exposed in the lineage for your BI tool. (E.g. in Looker, will lineage cover all the fields/objects you care about, such as Dashboards, Looks, Explores, Tiles, Fields, and Views?)
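
Impact analysis is the same idea walked in the other direction. The Python sketch below assumes a hypothetical list of (upstream, downstream) edges that spans warehouse columns and BI fields; the names and the impacted helper are made up for illustration.

```python
from collections import defaultdict

# Hypothetical edges as (upstream column, downstream column or BI field).
EDGES = [
    ("warehouse.orders.amount", "looker.sales_view.total_revenue"),
    ("warehouse.orders.region", "looker.sales_view.region"),
    ("looker.sales_view.total_revenue", "looker.exec_dashboard.revenue_tile"),
    ("looker.sales_view.region", "looker.regional_dashboard.region_filter"),
]


def build_downstream(edges):
    """Index edges by their upstream end for fast downstream lookups."""
    downstream = defaultdict(list)
    for source, target in edges:
        downstream[source].append(target)
    return downstream


def impacted(column, edges):
    """Return every downstream column or field reachable from `column`, sorted."""
    downstream = build_downstream(edges)
    found = set()
    stack = [column]
    while stack:
        for child in downstream.get(stack.pop(), []):
            if child not in found:
                found.add(child)
                stack.append(child)
    return sorted(found)


if __name__ == "__main__":
    # Which fields break if we change warehouse.orders.amount?
    for field_name in impacted("warehouse.orders.amount", EDGES):
        print(field_name)
```

In this toy graph, changing the amount column touches the revenue field and the exec dashboard tile, while the regional dashboard is left alone. That distinction is what you lose when lineage stops at the table or dashboard level.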

4. Does it incorporate other types of metadata to give additional context for assets in the lineage graph?

In isolation, lineage only tells part of the story and, therefore, only provides part of the value. Lineage becomes actionable when it’s combined with key metadata and context:

Operational metadata: How and when were assets orchestrated?
Quality and anomaly metadata: What state are the assets in? Are they reliable?
Business/semantic metadata: How do the assets link to key business terms or KPIs?
Owner and expert metadata: Who should you contact or collaborate with during troubleshooting?
Social metadata: What is the human context for this asset, e.g. relevant Slack discussions or Jira tickets about the asset? This is what machines alone will miss.

Tools often provide lineage graphs as a siloed view. Without the other metadata for these assets, it can be hard to put lineage in context.

Look for three key features:

Openness: An “open by design”, extensible platform where you can harvest data and metadata from any source via APIs (including custom-built connectors).
Flexibility: Support for a wide range of technical, operational, anomaly/quality, and business/semantic metadata from these sources.
Personalization: A personalized data experience, where each persona sees the metadata that is right for them, rather than drowning in all the metadata.
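
As a rough illustration of why this context matters, the Python sketch below attaches a few of the metadata types listed above to assets on a lineage path, then uses them to answer "where does it first go wrong, and who do I talk to?". The AssetContext fields, asset names, team names, and timestamps are all invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class AssetContext:
    """Hypothetical metadata attached to one asset in the lineage graph."""
    owner: str            # owner/expert metadata
    quality_status: str   # quality metadata, e.g. "passing" or "failing"
    last_run: str         # operational metadata, e.g. last orchestration run
    glossary_terms: list = field(default_factory=list)  # business metadata


# One upstream-to-downstream path through the lineage graph.
LINEAGE_PATH = ["raw.orders", "analytics.orders_clean", "dashboard.revenue"]

CONTEXT = {
    "raw.orders": AssetContext("ingest-team", "passing", "2022-11-25T02:00:00Z"),
    "analytics.orders_clean": AssetContext(
        "analytics-eng", "failing", "2022-11-25T03:00:00Z", ["Revenue"]
    ),
    "dashboard.revenue": AssetContext(
        "bi-team", "passing", "2022-11-25T04:00:00Z", ["Revenue"]
    ),
}


def first_failing(path, context):
    """Return (asset, owner) for the first failing asset on the path, else None."""
    for asset in path:
        info = context.get(asset)
        if info is not None and info.quality_status == "failing":
            return asset, info.owner
    return None


if __name__ == "__main__":
    print(first_failing(LINEAGE_PATH, CONTEXT))
```

The lineage path alone only says which assets are connected. It is the quality and owner metadata layered on top that turns the path into a next step.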

5. Can it be used not just to investigate issues, but also to drive action programmatically?

In addition to enabling data people’s work, lineage can also enable automated system actions and workflows.

For example, if an upstream table has data quality issues, it’s important to automatically add announcements to downstream BI dashboards. This keeps business users from creating “Garbage In, Garbage Out” analysis, and saves data analysts and engineers from manually sending alerts or warnings.

Some platforms don’t have the underlying architecture and scalability to perform automated actions based on lineage.

Look for open APIs, the ability to build or customize automated workflows, and the ability to read metadata-change events and trigger changes in linked assets across the lineage graph.
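
Here is a minimal Python sketch of the announcement example, assuming a hypothetical metadata-change event shaped like {"type": ..., "asset": ...} and an in-memory dict standing in for wherever announcements would actually be written. None of these names come from a real platform API.

```python
# Hypothetical table-level lineage: asset -> direct downstream assets.
DOWNSTREAM = {
    "warehouse.orders": ["warehouse.daily_revenue"],
    "warehouse.daily_revenue": ["bi.revenue_dashboard", "bi.finance_dashboard"],
}

# Hypothetical asset types, used to pick out BI dashboards.
ASSET_TYPES = {
    "warehouse.orders": "table",
    "warehouse.daily_revenue": "table",
    "bi.revenue_dashboard": "dashboard",
    "bi.finance_dashboard": "dashboard",
}


def downstream_dashboards(asset, downstream, asset_types):
    """Walk the lineage graph and return all downstream dashboards, sorted."""
    found = []
    seen = {asset}
    stack = [asset]
    while stack:
        for child in downstream.get(stack.pop(), []):
            if child in seen:
                continue
            seen.add(child)
            stack.append(child)
            if asset_types.get(child) == "dashboard":
                found.append(child)
    return sorted(found)


def handle_event(event, downstream, asset_types, announcements):
    """On a failed quality check, add a warning to every downstream dashboard."""
    if event.get("type") != "quality_check_failed":
        return []
    message = f"Upstream data quality issue in {event['asset']}"
    targets = downstream_dashboards(event["asset"], downstream, asset_types)
    for dashboard in targets:
        announcements.setdefault(dashboard, []).append(message)
    return targets


if __name__ == "__main__":
    announcements = {}
    event = {"type": "quality_check_failed", "asset": "warehouse.orders"}
    print(handle_event(event, DOWNSTREAM, ASSET_TYPES, announcements))
    print(announcements)
```

In a real setup the event would arrive from the platform's event stream and the announcement would be written back through its API, which is why open APIs and metadata-change events are the things to look for.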

Read the full ebook with all 19 questions and lots more detail.

Metadata in action: Using data lineage to drive root cause analysis

ICYMI: Last week, we introduced a new section to Metadata Weekly: Metadata in Action, where we highlight real use cases of active metadata.

As anyone who works with data knows, answering “That number doesn’t look right” is far from easy. While I guess you could do root cause analysis without lineage, you don’t want to! Lineage lets you zoom into all the key data, context, and changes across a diverse set of tools and systems. One company reduced their six-hour RCA process to just 10 minutes with Atlan’s lineage.

For those who missed this video from last week, learn how great data lineage is the key to faster, easier root cause analysis. Stay tuned next week for a brand new Metadata in Action video!

More from my reading list

Learn more about data lineage with our favorite recent lineage links:

The many layers of data lineage by Borja Vazquez
Untapped potential of data lineage by Petr Janda
Building and scaling data lineage at Netflix to improve data infrastructure reliability, and efficiency by Di Lin, Girish Lingappa, and Jitender Aswani
Data lineage, the lost child of data science by Bernard Willer
Data lineage: State-of-the-art and implementation challenges by Dion Ricky

Wishing all of you a week full of happiness and all the pie you can eat!

P.S. Liked reading this edition of the newsletter? Check out the archive here.
