Tips for building an advanced data platform #data #building : #2/10

Tip #2: Use an asset-based approach to data instead of pipelines

Compared with tip #1 (https://www.dhirubhai.net/pulse/tips-building-advanced-data-platform-110-dr-rvs-praveen-ph-d), this is, in my opinion, the most impactful thing you can do for yourself as a data platform engineer or manager.

Traditionally, when data engineers create data pipelines, they write instructions (code) that specify which operations to execute. Imagine a daily data pipeline in Airflow, or a cron-scheduled shell script, that gets data from source A, transforms it into file B, loads that file into table C, aggregates it into table D, and triggers a refresh via an API call to server E. We have a clearly defined process consisting of five operations and an expected output of that process.
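To make the imperative style concrete, here is a minimal Python sketch of such a pipeline. The helper names and the A-to-E placeholders are hypothetical and their bodies are stubs; the point is that the script encodes an ordered list of operations rather than the data assets themselves.

```python
# Minimal sketch of the imperative pipeline described above. Helper names and
# the A-to-E placeholders are hypothetical; bodies are stubs for illustration.

def extract_from_source_a() -> list[dict]:
    # Step A: pull raw records from the source system (stubbed).
    return [{"id": 1, "value": 10}, {"id": 2, "value": 20}]

def transform_to_file_b(records: list[dict], path: str = "file_b.csv") -> str:
    # Step B: transform the records and write them to a file.
    with open(path, "w") as f:
        f.write("id,value\n")
        for r in records:
            f.write(f"{r['id']},{r['value']}\n")
    return path

def load_into_table_c(path: str) -> None:
    # Step C: load the file into table C (a real job would use a DB client).
    print(f"COPY table_c FROM '{path}'")

def aggregate_into_table_d() -> None:
    # Step D: aggregate table C into table D (a real job would run SQL).
    print("INSERT INTO table_d SELECT ... FROM table_c GROUP BY ...")

def refresh_server_e() -> None:
    # Step E: trigger a refresh on server E via its API (a real job would POST).
    print("POST https://server-e.example/refresh")

def run_daily_pipeline() -> None:
    # The order of operations lives here; if step D fails, the usual fix is
    # to repair the code and re-run the whole thing from the top.
    load_into_table_c(transform_to_file_b(extract_from_source_a()))
    aggregate_into_table_d()
    refresh_server_e()

if __name__ == "__main__":
    run_daily_pipeline()
```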


  • What if this pipeline fails at step D? We need to fix the process and re-run it from the start.
  • What if one pipeline depends on an output from another? Now we need to build a mechanism that connects those two pipelines. And if one of the outputs in the second pipeline doesn't actually depend on the result of the first, we have created an unnecessary blocker for when the first pipeline fails.

As many data engineers know, dealing with such pipelines is cumbersome: it costs a lot of time and effort and makes data engineering unnecessarily hard.

What if instead we put our focus not on what operations to execute, but on what we want to exist? What if we embrace a declarative approach to data and start declaring data assets instead of pipelines?

[Figure: Defining assets & dependencies]

Much like Terraform’s approach to infrastructure resources, we can simply define what data asset (a table in a data warehouse, a file, an ML model, etc.) we want to create, how to create it, and what dependencies it has on other data assets. This gives each data asset its own identity, so we can work with them independently. When we want to materialize several data assets at once, we just specify which ones to materialize, without worrying about their relationships. Since dependencies are already defined, each data asset knows which other data assets to pull data from for processing, and which data assets to push processed data to.

[Figure: Each block is a separate data asset]

Moreover, declared dependencies enable automatic data lineage and an operational data catalog without any extra effort. The operational data catalog is the one the data platform team uses to work with data assets and their lifecycle.
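To show how ordering and lineage both fall out of declared dependencies, here is a small framework-free Python sketch. The asset registry, the @asset decorator, and the three example assets are invented for illustration and are not part of any particular tool.

```python
from graphlib import TopologicalSorter

# Toy asset registry: each asset declares its dependencies, and both the
# materialization order and the lineage graph are derived from them.
ASSETS: dict[str, dict] = {}

def asset(name: str, deps: tuple[str, ...] = ()):
    def register(fn):
        ASSETS[name] = {"deps": deps, "fn": fn}
        return fn
    return register

@asset("raw_orders")
def raw_orders() -> list[dict]:
    return [{"order_id": 1, "amount": 30}, {"order_id": 2, "amount": 70}]

@asset("orders_table", deps=("raw_orders",))
def orders_table(raw: list[dict]) -> list[dict]:
    return [r for r in raw if r["amount"] > 0]

@asset("daily_revenue", deps=("orders_table",))
def daily_revenue(orders: list[dict]) -> int:
    return sum(r["amount"] for r in orders)

def materialize(selection: list[str]) -> dict:
    # Expand the selection with all upstream dependencies, then run in order.
    needed, stack = set(), list(selection)
    while stack:
        name = stack.pop()
        if name not in needed:
            needed.add(name)
            stack.extend(ASSETS[name]["deps"])
    graph = {name: set(ASSETS[name]["deps"]) for name in needed}
    results: dict = {}
    for name in TopologicalSorter(graph).static_order():
        spec = ASSETS[name]
        results[name] = spec["fn"](*(results[d] for d in spec["deps"]))
    return results

print(materialize(["daily_revenue"]))  # upstream assets run automatically, in order
print({name: spec["deps"] for name, spec in ASSETS.items()})  # lineage for free
```

Asking for daily_revenue pulls in its upstream assets automatically, and the dependency map doubles as a minimal lineage view; this is the mechanism that asset-oriented tools build their catalogs on.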

dbt is one popular tool that enables an asset-based approach to the data warehouse. Instead of creating data pipelines, you just write a SELECT SQL query for the table you want to create, give that table (model) a name, add an optional YAML configuration, and specify which other tables it depends on. Then schedule your jobs and let dbt do the rest. It is simple to create and maintain, and you get data lineage out of the box.

But I believe the real breakthrough in this field is Dagster with its Software-Defined Assets approach. The general idea is similar to dbt’s, but it can be applied to any data asset. Whether it’s a table in a database, a data extract from an API, a file in a cloud storage bucket, or a machine learning model, you just define it in Python code and specify its dependencies. Create a job with a selection of assets you want to materialize and add a schedule or a sensor. Dagster will automatically materialize the selected assets while preserving the order of dependencies. Moreover, you can import data assets from other tools, including dbt!
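As a rough sketch of what this looks like with Dagster's software-defined assets (the asset names, values, and schedule below are made up, and real assets would write to actual storage): dependencies are declared simply by naming upstream assets as function parameters, and a job plus a schedule materializes the selection in dependency order.

```python
from dagster import AssetSelection, Definitions, ScheduleDefinition, asset, define_asset_job

@asset
def raw_orders() -> list[dict]:
    # An extract asset; a real version would call the source system here.
    return [{"order_id": 1, "amount": 30}, {"order_id": 2, "amount": 70}]

@asset
def orders_table(raw_orders: list[dict]) -> list[dict]:
    # Depends on raw_orders simply by naming it as a parameter.
    return [r for r in raw_orders if r["amount"] > 0]

@asset
def daily_revenue(orders_table: list[dict]) -> float:
    # Downstream aggregate asset.
    return float(sum(r["amount"] for r in orders_table))

# A job that materializes a selection of assets, on a daily schedule;
# Dagster resolves the run order from the declared dependencies.
daily_job = define_asset_job("daily_refresh", selection=AssetSelection.all())
daily_schedule = ScheduleDefinition(job=daily_job, cron_schedule="0 6 * * *")

defs = Definitions(
    assets=[raw_orders, orders_table, daily_revenue],
    jobs=[daily_job],
    schedules=[daily_schedule],
)
```

The same declarations drive the asset graph shown in Dagster's UI, which is where the automatic lineage and operational catalog mentioned above come from.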

Dagster and dbt are probably the two most powerful tools you can bring into your data platform ("Euclid", as referred to in my previous tip #1). They can replace complex and fragile Airflow pipelines, custom shell scripts, and a couple of detached data tools. You also get clear observability over your data assets thanks to the automated operational data catalog, your daily data operations become much easier, and you can visibly increase the reliability of your new data platform.

I really recommend spending some time studying Dagster and understanding the asset-based approach to data to see how it can improve your current data workflows. It was such a huge positive upgrade for our data engineers and managers that it’s hard to imagine going back to the previous ways of doing data engineering.
