Tips for building an advanced data platform #data #building : #2/10

Tip #2: Use an asset-based approach to data instead of pipelines

Compared with tip #1 (https://www.dhirubhai.net/pulse/tips-building-advanced-data-platform-110-dr-rvs-praveen-ph-d), this is, in my opinion, the most impactful thing you can do for yourself as a data platform engineer or manager.

Traditionally, when data engineers create data pipelines, they write instructions (code) that specify which operations to execute. Imagine a daily data pipeline in Airflow, or a cron-scheduled shell script, that gets data from source A, transforms it into file B, loads that file into table C, aggregates it into table D, and triggers a refresh via an API call to server E. We have a clearly defined process consisting of five operations and an expected output of that process.
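To make the imperative style concrete, here is a minimal Python sketch of such a pipeline. The helper names and the A-to-E placeholders are hypothetical and their bodies are stubs; the point is that the script encodes an ordered list of operations rather than the data assets themselves.

```python
# Minimal sketch of the imperative pipeline described above. Helper names and
# the A-to-E placeholders are hypothetical; bodies are stubs for illustration.

def extract_from_source_a() -> list[dict]:
    # Step A: pull raw records from the source system (stubbed).
    return [{"id": 1, "value": 10}, {"id": 2, "value": 20}]

def transform_to_file_b(records: list[dict], path: str = "file_b.csv") -> str:
    # Step B: transform the records and write them to a file.
    with open(path, "w") as f:
        f.write("id,value\n")
        for r in records:
            f.write(f"{r['id']},{r['value']}\n")
    return path

def load_into_table_c(path: str) -> None:
    # Step C: load the file into table C (a real job would use a DB client).
    print(f"COPY table_c FROM '{path}'")

def aggregate_into_table_d() -> None:
    # Step D: aggregate table C into table D (a real job would run SQL).
    print("INSERT INTO table_d SELECT ... FROM table_c GROUP BY ...")

def refresh_server_e() -> None:
    # Step E: trigger a refresh on server E via its API (a real job would POST).
    print("POST https://server-e.example/refresh")

def run_daily_pipeline() -> None:
    # The order of operations lives here; if step D fails, the usual fix is
    # to repair the code and re-run the whole thing from the top.
    load_into_table_c(transform_to_file_b(extract_from_source_a()))
    aggregate_into_table_d()
    refresh_server_e()

if __name__ == "__main__":
    run_daily_pipeline()
```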


  • What if this pipeline fails at step D? We need to fix the process and re-run it from the start.
  • What if one pipeline depends on an output from another? Now we need to build a mechanism that connects those two pipelines. And if one of the outputs in the second pipeline doesn't actually depend on the result of the first, we have created an unnecessary blocker for when the first pipeline fails.

As many data engineers know, dealing with such pipelines is cumbersome: it costs a lot of time and effort and makes data engineering unnecessarily hard.

What if instead we put our focus not on what operations to execute, but on what we want to exist? What if we embrace a declarative approach to data and start declaring data assets instead of pipelines?

[Figure: Defining assets & dependencies]

Much like Terraform’s approach to infrastructure resources, we can simply define what data asset (a table in a data warehouse, a file, an ML model, etc.) we want to create, how to create it, and what dependencies it has on other data assets. This gives each data asset its own identity, so we can work with them independently. When we want to materialize several data assets at once, we just specify which ones to materialize, without worrying about their relationships. Since dependencies are already defined, each data asset knows which other data assets to pull data from for processing, and which data assets to push processed data to.

[Figure: Each block is a separate data asset]

Moreover, declared dependencies enable automatic data lineage and an operational data catalog without any extra effort. The operational data catalog is the one the data platform team uses to work with data assets and their lifecycle.
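To show how ordering and lineage both fall out of declared dependencies, here is a small framework-free Python sketch. The asset registry, the @asset decorator, and the three example assets are invented for illustration and are not part of any particular tool.

```python
from graphlib import TopologicalSorter

# Toy asset registry: each asset declares its dependencies, and both the
# materialization order and the lineage graph are derived from them.
ASSETS: dict[str, dict] = {}

def asset(name: str, deps: tuple[str, ...] = ()):
    def register(fn):
        ASSETS[name] = {"deps": deps, "fn": fn}
        return fn
    return register

@asset("raw_orders")
def raw_orders() -> list[dict]:
    return [{"order_id": 1, "amount": 30}, {"order_id": 2, "amount": 70}]

@asset("orders_table", deps=("raw_orders",))
def orders_table(raw: list[dict]) -> list[dict]:
    return [r for r in raw if r["amount"] > 0]

@asset("daily_revenue", deps=("orders_table",))
def daily_revenue(orders: list[dict]) -> int:
    return sum(r["amount"] for r in orders)

def materialize(selection: list[str]) -> dict:
    # Expand the selection with all upstream dependencies, then run in order.
    needed, stack = set(), list(selection)
    while stack:
        name = stack.pop()
        if name not in needed:
            needed.add(name)
            stack.extend(ASSETS[name]["deps"])
    graph = {name: set(ASSETS[name]["deps"]) for name in needed}
    results: dict = {}
    for name in TopologicalSorter(graph).static_order():
        spec = ASSETS[name]
        results[name] = spec["fn"](*(results[d] for d in spec["deps"]))
    return results

print(materialize(["daily_revenue"]))  # upstream assets run automatically, in order
print({name: spec["deps"] for name, spec in ASSETS.items()})  # lineage for free
```

Asking for daily_revenue pulls in its upstream assets automatically, and the dependency map doubles as a minimal lineage view; this is the mechanism that asset-oriented tools build their catalogs on.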

dbt is one popular tool that enables an asset-based approach to the data warehouse. Instead of creating data pipelines, you just write a SELECT SQL query for the table you want to create, give that table (model) a name, add an optional YAML configuration, and specify which other tables it depends on. Then schedule your jobs and let dbt do the rest. It is simple to create and maintain, and you get data lineage out of the box.

But I believe the real breakthrough in this field is Dagster with its Software-Defined Assets approach. The general idea is similar to dbt’s, but it can be applied to any data asset. Whether it’s a table in a database, a data extract from an API, a file in a cloud storage bucket, or a machine learning model, you just define it in Python code and specify its dependencies. Create a job with a selection of assets you want to materialize and add a schedule or a sensor. Dagster will automatically materialize the selected assets while preserving the order of dependencies. Moreover, you can import data assets from other tools, including dbt!
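As a rough sketch of what this looks like with Dagster's software-defined assets (the asset names, values, and schedule below are made up, and real assets would write to actual storage): dependencies are declared simply by naming upstream assets as function parameters, and a job plus a schedule materializes the selection in dependency order.

```python
from dagster import AssetSelection, Definitions, ScheduleDefinition, asset, define_asset_job

@asset
def raw_orders() -> list[dict]:
    # An extract asset; a real version would call the source system here.
    return [{"order_id": 1, "amount": 30}, {"order_id": 2, "amount": 70}]

@asset
def orders_table(raw_orders: list[dict]) -> list[dict]:
    # Depends on raw_orders simply by naming it as a parameter.
    return [r for r in raw_orders if r["amount"] > 0]

@asset
def daily_revenue(orders_table: list[dict]) -> float:
    # Downstream aggregate asset.
    return float(sum(r["amount"] for r in orders_table))

# A job that materializes a selection of assets, on a daily schedule;
# Dagster resolves the run order from the declared dependencies.
daily_job = define_asset_job("daily_refresh", selection=AssetSelection.all())
daily_schedule = ScheduleDefinition(job=daily_job, cron_schedule="0 6 * * *")

defs = Definitions(
    assets=[raw_orders, orders_table, daily_revenue],
    jobs=[daily_job],
    schedules=[daily_schedule],
)
```

The same declarations drive the asset graph shown in Dagster's UI, which is where the automatic lineage and operational catalog mentioned above come from.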

Dagster and dbt are probably the two most powerful tools you can bring into your data platform ("Euclid", as referred to in my previous tip #1). They can replace complex and fragile Airflow pipelines, custom shell scripts, and a couple of detached data tools. You also get clear observability over your data assets thanks to the automated operational data catalog, your daily data operations become much easier, and you can visibly increase the reliability of your new data platform.

I really recommend spending some time studying Dagster and understanding the asset-based approach to data to see how it can improve your current data workflows. It was such a huge positive upgrade for our data engineers and managers that it’s hard to imagine going back to the previous ways of doing data engineering.
