How can Modern Data Stack help in data democratization?
Scale Data Engineering by moving from service to enablement

In the first part of this blog series, we established that data democratization is the cornerstone of a data-driven culture. However, the centralized data delivery model bottlenecks data democratization. We argued that a hybrid data delivery model, as shown in the image below, could help create scale.

[Image: hybrid data delivery model]

This blog post discusses how to establish this hybrid data delivery model using the modern tech stack.

The most significant change in the modern tech stack, and the essential message of this blog, is the separation of the data pipeline's Replication and Transformation layers. The block diagram below provides the high-level architecture of this tech stack, with the icons of the most dominant players in their respective technology domains.

*The logos are used as examples in the block diagram. Please don't consider them vendor recommendations.

[Image: block diagram of the modern data stack]

Reading the block diagram from left to right, we start with the modern cloud-based SaaS data sources accessed through APIs. Because these applications provide standard services, they also have standard data objects, irrespective of who uses them. The popularity and standardization of these apps open up the opportunity for Automated E/L tools.
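
These standard objects are typically exposed through paginated API endpoints, so extraction reduces to walking pages until the cursor runs out. A minimal Python sketch of the idea (`fetch_page` is a hypothetical stand-in for a vendor's API call, not any real product's interface):

```python
def extract_object(fetch_page, object_name):
    """Pull every record of a standard SaaS object (e.g. 'contacts'), page by page.

    `fetch_page(object_name, cursor)` is assumed to return a tuple of
    (records, next_cursor), with next_cursor=None once the last page is reached.
    """
    records, cursor = [], None
    while True:
        page, cursor = fetch_page(object_name, cursor)
        records.extend(page)
        if cursor is None:
            return records
```

A real connector would add authentication, rate limiting, and retries around the same loop, but the shape stays the same.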

Automated E/L tools are integration tools with source-aware connectors for the various popular SaaS applications. These tools understand the metadata of the source and replicate data incrementally and in near-real-time. All of this works right out of the box, without requiring any logic to be written. These features help create replication pipelines that are quick, scalable, and low-maintenance.

These tools replicate the source data to a centralized storage layer, which could be a data warehouse, a data lake, or a lakehouse. This data is now available to the data analyst community, and this is where the central data team's managed services end.

Moving to the top of the diagram, DBT is the second most crucial technology, after the automated E/L tools, for creating a hybrid data delivery model. Data analysts can leverage its self-service transformation functionality to apply business logic to a copy of the centrally available data. DBT then pushes the transformed and augmented data back to central storage.

Data Science and Reporting teams can now leverage this golden data, or it can be written back to the transactional apps using reverse ETL tools.
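
Conceptually, reverse ETL is the mirror image of E/L: read the golden rows from the warehouse and push each one back to the transactional app. A toy sketch, using Python's sqlite3 as a stand-in warehouse and a `send` callable standing in for the vendor's API call (both names are illustrative assumptions):

```python
import sqlite3

def reverse_etl(warehouse: sqlite3.Connection, table: str, send) -> int:
    """Push each row of a warehouse table to a transactional app via `send`."""
    # Column names come from the warehouse's own metadata.
    cols = [r[1] for r in warehouse.execute(f"PRAGMA table_info({table})")]
    sent = 0
    for row in warehouse.execute(f"SELECT * FROM {table}"):
        send(dict(zip(cols, row)))  # in real life: an HTTP POST to the CRM's API
        sent += 1
    return sent
```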

Why is the separation of EL and T such a paradigm shift in Data Engineering?

  1. Decoupling — Decoupling upstream and downstream changes brings down the maintenance costs of these pipelines significantly.
  2. Unboxing the black box — Decentralizing the business logic aids both the speed and the governance of that logic. Analysts can now self-serve the business logic while also having visibility into why the data looks the way it does.
  3. Intelligent integration — By standardizing the EL part of the pipelines to do just replication, we enable this section to become far more intelligent and automated than custom-made pipelines.

What are the most significant features of the Automated EL tools?

  1. Hundreds of native source connectors — These tools come packaged with hundreds of highly source-aware connectors. These connectors understand the data objects to expect in the source, along with the source's metadata and data model. This intelligence saves considerable developer effort while increasing the quality of the pipeline.
  2. Automated metadata sync — The tool's awareness of metadata helps it sync not only the data but also the metadata. For example, if the source introduces a new column or changes a column name or data type, no developer involvement is needed to replicate these changes in the target. The tool automatically detects the metadata changes and starts syncing the new columns and data to the target. This frees up developer time and makes the changes available to the end user in minutes rather than weeks.
  3. Incremental data refresh — The tool automatically takes care of incremental refresh without needing embedded Change Data Capture (CDC) logic. This keeps the pipelines lightweight, reducing network costs while improving refresh timings.
  4. Near-real-time sync — With the help of the above features, these tools can replicate data in near-real-time. Better refresh speeds improve the speed of analytics and enhance the user's trust in the data.
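
Features 2 and 3 can be made concrete with a toy replication step in Python, using sqlite3 as a stand-in for both the source and the target warehouse: the function diffs the column lists to sync metadata, then copies only the rows past the last replicated cursor value. The table and column handling here are illustrative, not any vendor's implementation.

```python
import sqlite3

def sync_table(source: sqlite3.Connection, target: sqlite3.Connection,
               table: str, cursor_col: str) -> int:
    """Replicate new rows from source to target, syncing new columns first."""
    src_cols = [r[1] for r in source.execute(f"PRAGMA table_info({table})")]
    tgt_cols = [r[1] for r in target.execute(f"PRAGMA table_info({table})")]
    # Automated metadata sync: add any column the source introduced.
    for col in src_cols:
        if col not in tgt_cols:
            target.execute(f"ALTER TABLE {table} ADD COLUMN {col}")
    # Incremental refresh: pull only rows beyond the last replicated cursor value.
    last = target.execute(
        f"SELECT COALESCE(MAX({cursor_col}), 0) FROM {table}").fetchone()[0]
    rows = source.execute(
        f"SELECT {', '.join(src_cols)} FROM {table} WHERE {cursor_col} > ?",
        (last,)).fetchall()
    placeholders = ", ".join("?" for _ in src_cols)
    target.executemany(
        f"INSERT INTO {table} ({', '.join(src_cols)}) VALUES ({placeholders})", rows)
    target.commit()
    return len(rows)
```

Run on a schedule (or triggered by the source), repeated calls move only the delta, which is what makes the pipeline lightweight and near-real-time.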

How can DBT help in decentralizing data transformation?

The second crucial leg of EL-T enablement is the decentralization of the Transformation portion of the data pipelines. Transformations can now be decentralized thanks to the advent of the open-source Data Build Tool, more popularly known as DBT. DBT can be thought of as SQL on steroids: it can do everything SQL can do, and more. It has the following distinct features:

  1. SQL-like coding language — SQL can be considered the English of the data world. Since DBT uses a SQL-like language for coding, it considerably reduces the learning curve for new users.
  2. Reuse the processing power of your warehouse — DBT acts only as an abstraction layer over the data warehouse. It provides a window for developers to write their code while the actual data processing happens in the warehouse itself. This optimizes resources and cost through reuse, unlike ETL tools, which need their own separate processing resources.
  3. Software engineering best practices — DBT brings the best practices of software engineering to data engineering. Collaborative coding practices such as inline documentation, annotation, version control, macros, and CI/CD are available in DBT, so SQL code can also be modularized, reused, and kept easy to maintain.
  4. Automated online catalog and lineage — DBT can create lineage and a data catalog automatically, without a third-party tool or manual effort. This feature becomes vital for data governance as we pursue widespread data democratization.
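
To make the last three features tangible, here is a toy sketch in the spirit of DBT (not its actual implementation): models are plain SELECT statements, a ref() placeholder wires them together, lineage is parsed straight from the SQL, and materialization happens inside the warehouse itself, with sqlite3 standing in for the warehouse.

```python
import re
import sqlite3

MODELS = {
    # Each "model" is just a SELECT; {{ ref('x') }} points at another model.
    "stg_orders": "SELECT id, amount FROM raw_orders WHERE amount IS NOT NULL",
    "fct_revenue": "SELECT SUM(amount) AS revenue FROM {{ ref('stg_orders') }}",
}

def lineage(model: str) -> list:
    """Upstream dependencies, parsed straight from the model's SQL."""
    return re.findall(r"\{\{\s*ref\('(\w+)'\)\s*\}\}", MODELS[model])

def run_model(conn: sqlite3.Connection, model: str) -> None:
    """Materialize a model inside the warehouse, building upstream refs first."""
    sql = MODELS[model]
    for dep in lineage(model):
        run_model(conn, dep)  # dependencies first, in the spirit of `dbt run`
        sql = sql.replace("{{ ref('%s') }}" % dep, dep)
    conn.execute(f"DROP TABLE IF EXISTS {model}")
    conn.execute(f"CREATE TABLE {model} AS {sql}")
```

Because the dependency graph lives in the SQL itself, the same parse that orders the builds also yields the lineage and catalog for free, which is the design insight behind DBT's automated documentation.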

Balancing Act

While the modern data stack provides many benefits over the traditional architecture, the ecosystem is still in its nascent stage. The number of logos in the block diagram above indicates that this is an evolving space, one that could be ripe for consolidation in the near future.

Hence, I suggest performing a balancing act here and being cautious before introducing these technologies into mission-critical applications. My advice is to follow a crawl-walk-run strategy when modernizing the tech stack.

What is more crucial to imbibe from this modern tech stack than the actual tools is the concept of decoupling the EL and T components of the data pipelines. For example, data engineers can use their existing ETL tools for just the EL part of the pipelines. Similarly, the transformation component can be unboxed to end users by leveraging database-native technologies such as materialized views, or by exposing the logic in the reporting tool itself.
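
For instance, even without DBT, the T layer can live as a plain database view: the EL pipeline lands only the raw table, while the business logic sits in SQL that analysts can read, audit, and change. A minimal sqlite3 sketch (SQLite offers ordinary views rather than materialized ones, which is enough to show the idea; the table and view names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The EL layer only lands the raw table, untouched.
conn.execute("CREATE TABLE raw_orders (id INTEGER, amount REAL, status TEXT)")
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)",
                 [(1, 10.0, "paid"), (2, 20.0, "refunded"), (3, 5.0, "paid")])

# The business logic is unboxed into a view that end users own and can inspect.
conn.execute("""
    CREATE VIEW paid_revenue AS
    SELECT SUM(amount) AS revenue FROM raw_orders WHERE status = 'paid'
""")
print(conn.execute("SELECT revenue FROM paid_revenue").fetchone()[0])  # prints 15.0
```

Changing the revenue definition now means editing one visible SQL statement rather than reworking an opaque pipeline.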

Conclusion

If there can be only one takeaway from this blog, it ought to be this:

Separate the EL and T components of the data pipelines to achieve scale in data engineering, leading to far better data democratization across the organization.

The modern data stack makes this end-user enablement a reality. Data engineering teams should embrace these technologies and delivery models to help establish a data-driven culture in their organizations.

Please reach out to me on LinkedIn for further conversations.

Here is the link to the first part of the series for reference.

If you enjoyed reading this article, don’t forget to comment, share, and follow me. Feedback is welcome!

Many thanks!
