How can Modern Data Stack help in data democratization?
Scale Data Engineering by moving from service to enablement

In the first part of this blog series, we established that data democratization is the cornerstone of a data-driven culture. However, the centralized data delivery model bottlenecks data democratization. We argued that a hybrid data delivery model, as shown in the image below, could help create scale.

[Image: hybrid data delivery model]

This blog post discusses how to establish this hybrid data delivery model using the modern tech stack.

The most significant change in the modern tech stack, and the essential message of this blog, is the separation of the data pipeline's Replication and Transformation layers. The block diagram below provides the high-level architecture of this tech stack, with the icons of the most dominant players in their respective technology domains.

*The logos are used as examples in the block diagram. Please don't consider them vendor recommendations.

[Image: block diagram of the modern data stack]

Reading the block diagram from left to right, we start with the modern cloud-based SaaS data sources accessed through APIs. Because these applications provide standard services, they also have standard data objects, irrespective of who uses them. The popularity and standardization of these apps open up the opportunity for Automated E/L tools.
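
These standard objects are typically exposed through paginated API endpoints, so extraction reduces to walking pages until the cursor runs out. A minimal Python sketch of the idea (`fetch_page` is a hypothetical stand-in for a vendor's API call, not any real product's interface):

```python
def extract_object(fetch_page, object_name):
    """Pull every record of a standard SaaS object (e.g. 'contacts'), page by page.

    `fetch_page(object_name, cursor)` is assumed to return a tuple of
    (records, next_cursor), with next_cursor=None once the last page is reached.
    """
    records, cursor = [], None
    while True:
        page, cursor = fetch_page(object_name, cursor)
        records.extend(page)
        if cursor is None:
            return records
```

A real connector would add authentication, rate limiting, and retries around the same loop, but the shape stays the same.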

Automated E/L tools are integration tools with source-aware connectors for the various popular SaaS applications. These tools understand the metadata of the source and replicate data incrementally and in near-real-time. All of this works right out of the box, without requiring any logic to be written. These features help create replication pipelines that are quick, scalable, and low-maintenance.

These tools replicate the source data to a centralized storage layer, which could be a data warehouse, a data lake, or a lakehouse. This data is now available to the data analyst community, and this is where the central data team's managed services end.

Moving to the top of the diagram, DBT is the second most crucial technology, after the automated E/L tools, for creating a hybrid data delivery model. Data analysts can leverage its self-service transformation functionality to apply business logic to a copy of the centrally available data. DBT then pushes the transformed and augmented data back to central storage.

Data Science and Reporting teams can now leverage this golden data, or it can be written back to the transactional apps using reverse ETL tools.
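
Conceptually, reverse ETL is the mirror image of E/L: read the golden rows from the warehouse and push each one back to the transactional app. A toy sketch, using Python's sqlite3 as a stand-in warehouse and a `send` callable standing in for the vendor's API call (both names are illustrative assumptions):

```python
import sqlite3

def reverse_etl(warehouse: sqlite3.Connection, table: str, send) -> int:
    """Push each row of a warehouse table to a transactional app via `send`."""
    # Column names come from the warehouse's own metadata.
    cols = [r[1] for r in warehouse.execute(f"PRAGMA table_info({table})")]
    sent = 0
    for row in warehouse.execute(f"SELECT * FROM {table}"):
        send(dict(zip(cols, row)))  # in real life: an HTTP POST to the CRM's API
        sent += 1
    return sent
```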

Why is the separation of EL and T such a paradigm shift in Data Engineering?

  1. Decoupling — Decoupling upstream and downstream changes brings down the maintenance costs of these pipelines significantly.
  2. Unboxing the black box — Decentralizing the business logic aids both the speed and the governance of that logic. Analysts can now self-serve the business logic while also having visibility into why the data looks the way it does.
  3. Intelligent integration — By standardizing the EL part of the pipelines to do just replication, we enable this section to become far more intelligent and automated than custom-made pipelines.

What are the most significant features of the Automated EL tools?

  1. Hundreds of native source connectors — These tools come packaged with hundreds of highly source-aware connectors. These connectors understand the data objects to expect in the source, along with the source's metadata and data model. This intelligence saves considerable developer effort while increasing the quality of the pipeline.
  2. Automated metadata sync — The tool's awareness of metadata helps it sync not only the data but also the metadata. For example, if the source introduces a new column or changes a column name or data type, no developer involvement is needed to replicate these changes in the target. The tool automatically detects the metadata changes and starts syncing the new columns and data to the target. This frees up developer time and makes the changes available to the end user in minutes rather than weeks.
  3. Incremental data refresh — The tool automatically takes care of incremental refresh without needing embedded Change Data Capture (CDC) logic. This keeps the pipelines lightweight, reducing network costs while improving refresh timings.
  4. Near-real-time sync — With the help of the above features, these tools can replicate data in near-real-time. Better refresh speeds improve the speed of analytics and enhance the user's trust in the data.
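
Features 2 and 3 can be made concrete with a toy replication step in Python, using sqlite3 as a stand-in for both the source and the target warehouse: the function diffs the column lists to sync metadata, then copies only the rows past the last replicated cursor value. The table and column handling here are illustrative, not any vendor's implementation.

```python
import sqlite3

def sync_table(source: sqlite3.Connection, target: sqlite3.Connection,
               table: str, cursor_col: str) -> int:
    """Replicate new rows from source to target, syncing new columns first."""
    src_cols = [r[1] for r in source.execute(f"PRAGMA table_info({table})")]
    tgt_cols = [r[1] for r in target.execute(f"PRAGMA table_info({table})")]
    # Automated metadata sync: add any column the source introduced.
    for col in src_cols:
        if col not in tgt_cols:
            target.execute(f"ALTER TABLE {table} ADD COLUMN {col}")
    # Incremental refresh: pull only rows beyond the last replicated cursor value.
    last = target.execute(
        f"SELECT COALESCE(MAX({cursor_col}), 0) FROM {table}").fetchone()[0]
    rows = source.execute(
        f"SELECT {', '.join(src_cols)} FROM {table} WHERE {cursor_col} > ?",
        (last,)).fetchall()
    placeholders = ", ".join("?" for _ in src_cols)
    target.executemany(
        f"INSERT INTO {table} ({', '.join(src_cols)}) VALUES ({placeholders})", rows)
    target.commit()
    return len(rows)
```

Run on a schedule (or triggered by the source), repeated calls move only the delta, which is what makes the pipeline lightweight and near-real-time.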

How can DBT help in decentralizing data transformation?

The second crucial leg of EL-T enablement is the decentralization of the Transformation portion of the data pipelines. Transformations can now be decentralized thanks to the advent of the open-source Data Build Tool, more popularly known as DBT. DBT can be thought of as SQL on steroids: it can do everything SQL can do, and more. It has the following distinct features:

  1. SQL-like coding language — SQL can be considered the English of the data world. Since DBT uses a SQL-like language for coding, it considerably reduces the learning curve for new users.
  2. Reuse the processing power of your warehouse — DBT acts only as an abstraction layer over the data warehouse. It provides a window for developers to write their code while the actual data processing happens in the warehouse itself. This optimizes resources and cost through reuse, unlike ETL tools, which need their own separate processing resources.
  3. Software engineering best practices — DBT brings the best practices of software engineering to data engineering. Collaborative coding practices such as inline documentation, annotation, version control, macros, and CI/CD are available in DBT, so SQL code can also be modularized, reused, and kept easy to maintain.
  4. Automated online catalog and lineage — DBT can create lineage and a data catalog automatically, without a third-party tool or manual effort. This feature becomes vital for data governance as we pursue widespread data democratization.
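
To make the last three features tangible, here is a toy sketch in the spirit of DBT (not its actual implementation): models are plain SELECT statements, a ref() placeholder wires them together, lineage is parsed straight from the SQL, and materialization happens inside the warehouse itself, with sqlite3 standing in for the warehouse.

```python
import re
import sqlite3

MODELS = {
    # Each "model" is just a SELECT; {{ ref('x') }} points at another model.
    "stg_orders": "SELECT id, amount FROM raw_orders WHERE amount IS NOT NULL",
    "fct_revenue": "SELECT SUM(amount) AS revenue FROM {{ ref('stg_orders') }}",
}

def lineage(model: str) -> list:
    """Upstream dependencies, parsed straight from the model's SQL."""
    return re.findall(r"\{\{\s*ref\('(\w+)'\)\s*\}\}", MODELS[model])

def run_model(conn: sqlite3.Connection, model: str) -> None:
    """Materialize a model inside the warehouse, building upstream refs first."""
    sql = MODELS[model]
    for dep in lineage(model):
        run_model(conn, dep)  # dependencies first, in the spirit of `dbt run`
        sql = sql.replace("{{ ref('%s') }}" % dep, dep)
    conn.execute(f"DROP TABLE IF EXISTS {model}")
    conn.execute(f"CREATE TABLE {model} AS {sql}")
```

Because the dependency graph lives in the SQL itself, the same parse that orders the builds also yields the lineage and catalog for free, which is the design insight behind DBT's automated documentation.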

Balancing Act

While the modern data stack provides many benefits over the traditional architecture, the ecosystem is still in its nascent stage. The number of logos in the block diagram above indicates that this is an evolving space, one that could be ripe for consolidation in the near future.

Hence, I suggest performing a balancing act here and being cautious before introducing these technologies into mission-critical applications. My advice is to follow a crawl-walk-run strategy when modernizing the tech stack.

What is more crucial to imbibe from this modern tech stack than the actual tools is the concept of decoupling the EL and T components of the data pipelines. For example, data engineers can use their existing ETL tools for just the EL part of the pipelines. Similarly, the transformation component can be unboxed to end users by leveraging database-native technologies such as materialized views, or by exposing the logic in the reporting tool itself.
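
For instance, even without DBT, the T layer can live as a plain database view: the EL pipeline lands only the raw table, while the business logic sits in SQL that analysts can read, audit, and change. A minimal sqlite3 sketch (SQLite offers ordinary views rather than materialized ones, which is enough to show the idea; the table and view names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The EL layer only lands the raw table, untouched.
conn.execute("CREATE TABLE raw_orders (id INTEGER, amount REAL, status TEXT)")
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)",
                 [(1, 10.0, "paid"), (2, 20.0, "refunded"), (3, 5.0, "paid")])

# The business logic is unboxed into a view that end users own and can inspect.
conn.execute("""
    CREATE VIEW paid_revenue AS
    SELECT SUM(amount) AS revenue FROM raw_orders WHERE status = 'paid'
""")
print(conn.execute("SELECT revenue FROM paid_revenue").fetchone()[0])  # prints 15.0
```

Changing the revenue definition now means editing one visible SQL statement rather than reworking an opaque pipeline.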

Conclusion

If there can be only one takeaway from this blog, it ought to be this:

Separate the EL and T components of the data pipelines to achieve scale in data engineering, leading to far better data democratization across the organization.

The modern data stack makes this end-user enablement a reality. Data engineering teams should embrace these technologies and delivery models to help establish a data-driven culture in their organizations.

Please reach out to me on LinkedIn for further conversations.

Here is the link to the first part of the series for reference.

If you enjoyed reading this article, don’t forget to comment, share, and follow me. Feedback is welcome!

Many thanks!
