How can Modern Data Stack help in data democratization?
In the first part of this blog series, we established that data democratization is the cornerstone of a data-driven culture, but that a centralized data delivery model becomes a bottleneck to it. We then argued that a hybrid data delivery model, as shown in the image below, can help create scale.
This blog post discusses how to establish this hybrid data delivery model using the modern data stack.
The most significant change in the modern data stack, and the essential message of this blog, is the separation of the data pipeline's Replication and Transformation layers. The block diagram below shows the high-level architecture of this stack, with icons of the most dominant players in their respective technology domains.
*The logos in the block diagram are used only as examples. Please do not consider them vendor recommendations.
Reading the block diagram from left to right, we start with modern cloud-based SaaS data sources accessed through APIs. These applications provide standard services and therefore expose standard data objects, irrespective of who uses them. Their popularity and standardization open up the opportunity for Automated E/L tools.
Automated E/L tools are integration tools with source-aware connectors for the most popular SaaS applications. They understand the source's metadata and replicate data incrementally and in near real time. All of this works out of the box, without any custom logic. These capabilities make it possible to build replication pipelines that are quick to set up, scalable, and low maintenance.
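To make the idea concrete, the incremental replication these tools automate is conceptually similar to a warehouse-side upsert keyed on a change cursor. The sketch below is only illustrative; the table names (raw.crm_contacts, staging.crm_contacts_increment), the key, and the cursor column are hypothetical assumptions, and the exact syntax varies by warehouse.

```sql
-- Hypothetical sketch of the incremental upsert that automated E/L tools
-- perform behind the scenes. All names are illustrative only.
MERGE INTO raw.crm_contacts AS target
USING staging.crm_contacts_increment AS source   -- rows changed since the last sync cursor
    ON target.contact_id = source.contact_id
WHEN MATCHED THEN
    UPDATE SET
        email      = source.email,
        updated_at = source.updated_at
WHEN NOT MATCHED THEN
    INSERT (contact_id, email, updated_at)
    VALUES (source.contact_id, source.email, source.updated_at);
```

In practice, the automated E/L tool manages the change cursor, schema drift, and retries for you, so none of this SQL needs to be written or maintained by hand.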
These tools replicate the source data to a centralized storage layer, which could be a data warehouse, a data lake, or a lakehouse. This data is now available to the data analyst community, and this is where the central data team's managed services end.
Moving to the top of the diagram, DBT is the second crucial technology, alongside automated E/L tools, that enables the hybrid data delivery model. Data analysts can use its self-service transformation capabilities to apply business logic to a copy of the centrally available data. DBT then pushes the transformed and augmented data back to the central data storage, as sketched below.
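For illustration, a self-service transformation in DBT is essentially a SQL SELECT saved as a model, which DBT materializes back into the warehouse. This is a minimal sketch; the model and column names (stg_customers, stg_orders, lifetime_revenue) are hypothetical.

```sql
-- models/marts/customer_revenue.sql (hypothetical DBT model)
-- DBT compiles this SELECT and materializes the result as a table
-- in the central data storage.
{{ config(materialized='table') }}

select
    c.customer_id,
    c.customer_name,
    sum(o.order_amount) as lifetime_revenue
from {{ ref('stg_customers') }} as c      -- reference to another model
left join {{ ref('stg_orders') }} as o
    on o.customer_id = c.customer_id
group by 1, 2
```

Running `dbt run` compiles the Jinja, resolves the ref() dependencies, and creates or refreshes the resulting table in the warehouse, so the transformed data sits right next to the replicated raw data.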
Data science and reporting teams can now leverage this golden data, or it can be written back to the transactional apps using reverse ETL tools.
Why is the separation of EL and T such a paradigm shift in Data Engineering?
What are the most significant features of the Automated EL tools?
How can DBT help in decentralizing data transformation?
The second crucial leg of EL-T enablement is the decentralization of the Transformation portion of the data pipelines. Transformations can now be decentralized because of the open-source tool called Data Build Tool or, more popularly, DBT. DBT can be thought of as SQL on steroids: it can do everything SQL can do, plus more, by bringing software-engineering practices such as modular models, dependency management, version control, testing, and documentation to SQL-based transformations.
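As a hedged illustration of what "SQL on steroids" means in practice, the hypothetical model below combines plain SQL with DBT's Jinja templating and incremental materialization; the table and column names are assumptions, not a prescribed design.

```sql
-- models/marts/daily_orders.sql (hypothetical DBT model)
-- Jinja templating on top of SQL: the same model runs as a full rebuild
-- the first time and as an incremental load on subsequent runs.
{{ config(materialized='incremental', unique_key='order_id') }}

select
    order_id,
    customer_id,
    order_date,
    order_amount
from {{ ref('stg_orders') }}

{% if is_incremental() %}
  -- only process rows newer than what is already in the target table
  where order_date > (select max(order_date) from {{ this }})
{% endif %}
```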
Balancing Act
While the modern data stack provides many benefits over the traditional architecture, the ecosystem is still in its nascent stage. The sheer number of logos in the block diagram above indicates an evolving space that could be ripe for consolidation in the near future.
Hence, I suggest performing a balancing act here and being cautious before introducing these technologies into mission-critical applications. My advice is to follow a crawl-walk-run strategy while modernizing the tech stack.
What matters more than the specific tools is the concept of decoupling the EL and T components of the data pipelines. For example, data engineers can use their existing ETL tools for only the EL part of the pipelines. Similarly, the transformation component can be opened up to end users by leveraging database-native technologies such as materialized views, or by pushing the logic into the reporting tool itself.
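For example, a simple database-native alternative to a dedicated transformation tool is a materialized view. The sketch below is illustrative only; the schema, table, and column names are hypothetical, and the exact syntax and refresh options differ across database engines.

```sql
-- Hypothetical example: exposing the "T" step with a database-native
-- materialized view instead of a separate transformation tool.
CREATE MATERIALIZED VIEW analytics.monthly_sales AS
SELECT
    date_trunc('month', order_date) AS order_month,
    region,
    sum(order_amount)               AS total_sales
FROM raw.orders
GROUP BY 1, 2;
```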
Conclusion
If there is only one takeaway from this blog, it ought to be this:
Separate the EL and T components of the data pipelines to achieve scale in data engineering, leading to far better data democratization in the organization.
The modern data stack makes this end-user enablement a reality. Data engineering teams should embrace these technologies and delivery models to help establish a data-driven culture in their organizations.
Please reach out to me on LinkedIn for further conversations.
Here is the link to the first part of the series for reference.
If you enjoyed reading this article, don’t forget to comment, share, and follow me. Feedback is welcome!
Many thanks!