Modern Data Stack - Part 1
Sankha Mitra
Data Architecture | Data Warehousing | Business Intelligence | Cloud Architecture and Solutions | Program Management | Practice Management
What is the modern data stack?
The birth of cloud data warehouses, with their massively parallel processing (MPP) capabilities and first-class SQL support, has made processing large volumes of data faster and cheaper. This has also led to the development of many cloud-native data tools that are easy to integrate, scalable, and economical. These tools and technologies are collectively referred to as the Modern Data Stack (MDS).
Modern data stack and modern data platform - Differences
A data platform is the set of components through which data flows, while a data stack is the set of tools that serve these components.
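The distinction can be sketched as a simple mapping from platform components to the tools that serve them. This is only an illustration: the component names and the tool chosen for each are assumptions made for the example, not a recommendation from this article, and `unserved_components` is a hypothetical helper.

```python
# Hypothetical platform: the components through which data flows.
PLATFORM_COMPONENTS = ["ingestion", "storage", "transformation", "analytics"]

# One possible stack: a concrete tool serving each component.
# The tool choices are illustrative assumptions only.
EXAMPLE_STACK = {
    "ingestion": "Fivetran",
    "storage": "Snowflake",
    "transformation": "dbt",
    "analytics": "Looker",
}


def unserved_components(components, stack):
    """Return the platform components that no tool in the stack covers."""
    return [name for name in components if name not in stack]


if __name__ == "__main__":
    for component in PLATFORM_COMPONENTS:
        print(f"{component}: {EXAMPLE_STACK.get(component, '(no tool yet)')}")
```

The platform (the list of components) stays the same even when a tool in the stack is replaced, which is the sense in which the stack "serves" the platform.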
Key characteristics
1. Cloud-first approach
2. Built around cloud data warehouse/lake
3. Focus on solving one problem at a time
4. Offered as SaaS or open-core
5. Low entry barrier
6. Actively supported by communities
1. Cloud-first - Modern public cloud vendors have enabled MDS tools to become highly elastic and scalable. This makes it easy for organizations to integrate them into their existing cloud infrastructure.
2. Built around cloud data warehouse/lake - Modern data stack tools recognize that a central cloud data warehouse/lake is what fuels robust data analytics. So they are designed to integrate seamlessly with the prominent cloud data warehouses (such as Redshift, BigQuery, Snowflake, and Databricks) and take full advantage of their features.
3. Focus on solving one specific problem at a time - The modern data stack is a maze of tools connected by the different stages of the data pipeline. Each tool focuses on one specific aspect of data processing or management. This enables modern data stack tools to fit into a variety of architectures and plug into an existing stack with few or no changes.
4. Offered as SaaS or open-core - Modern data stack tools are mostly offered as SaaS (Software as a Service). In some cases, the core components are open-source and come with paid add-on features like end-to-end hosting and professional support.
5. Low entry barrier - Modern data stack tools are packaged in easy pay-as-you-go and usage-based pricing models. Data practitioners can explore new tools, their features, and their utility before making big commitments. This saves money and time. Also, many MDS tools are designed to be low-code or even no-code. Tool setup can often be completed in a few hours and does not require deep technical expertise or a large time investment.
6. Actively supported by communities - Modern data stack solution providers invest considerable time and effort in community building.
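The "one problem at a time" characteristic can be sketched as a pipeline of single-purpose stages that share a common interface. This is a minimal sketch under stated assumptions: the stage names, the record layout, and `run_pipeline` are all hypothetical and do not describe any particular MDS tool.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]
Stage = Callable[[List[Record]], List[Record]]


def extract(_: List[Record]) -> List[Record]:
    """Ingestion stage: produce raw records (hard-coded sample data)."""
    return [
        {"user": "a", "amount": "10"},
        {"user": "b", "amount": "x"},  # malformed amount
        {"user": "a", "amount": "5"},
    ]


def clean(rows: List[Record]) -> List[Record]:
    """Cleaning stage: drop rows whose amount is not a whole number."""
    return [
        {**r, "amount": int(str(r["amount"]))}
        for r in rows
        if str(r["amount"]).isdigit()
    ]


def clean_default_zero(rows: List[Record]) -> List[Record]:
    """Alternative cleaning stage: keep malformed rows with amount 0."""
    return [
        {**r, "amount": int(str(r["amount"])) if str(r["amount"]).isdigit() else 0}
        for r in rows
    ]


def aggregate(rows: List[Record]) -> List[Record]:
    """Modelling stage: total amount per user, sorted by user."""
    totals: Dict[str, int] = {}
    for r in rows:
        totals[str(r["user"])] = totals.get(str(r["user"]), 0) + int(r["amount"])
    return [{"user": u, "total": t} for u, t in sorted(totals.items())]


def run_pipeline(stages: Iterable[Stage]) -> List[Record]:
    """Run each stage on the output of the previous one."""
    rows: List[Record] = []
    for stage in stages:
        rows = stage(rows)
    return rows


if __name__ == "__main__":
    print(run_pipeline([extract, clean, aggregate]))
    print(run_pipeline([extract, clean_default_zero, aggregate]))
```

Because every stage takes and returns the same shape of data, swapping `clean` for `clean_default_zero` requires no change to the other stages, which is the property that lets single-purpose tools slot into an existing stack.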
What was the need for a modern data stack?
Three major factors
The emergence of Hadoop and the public cloud
Prior to Hadoop, scaling infrastructure largely meant scaling vertically, so data processing demanded a large upfront investment. With the emergence of Hadoop, it became possible to horizontally scale storage and compute on commodity hardware. But even then, the user experience was clunky (MapReduce programming), and only large organizations could invest in the specialized skills required to make it work well.
But then as public cloud platforms became inexpensive and accessible, even smaller companies could afford storage and compute on the cloud.
The launch of Amazon Redshift
Meanwhile, the microservices architecture had popularized NoSQL and other non-relational databases. When loaded into a Hadoop cluster for analytics, this non-relational data was hard to process using SQL. This forced data teams to use programming languages like Java, Scala, and Python to process data. Organizations came to depend on expensive engineering resources and highly specialized skills.
Data democracy took a severe hit. Amazon Redshift changed all that.
Launched in 2012, Redshift is widely regarded as the first cloud data warehouse. It allowed large volumes of data to be stored on horizontally scalable infrastructure, and also made it possible to query the data using plain SQL.
Today, Amazon Redshift also offers decoupled compute and storage.
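To make the contrast between MapReduce-style processing and plain SQL concrete, here is a minimal sketch that computes the same per-customer totals both ways. It is an illustration under stated assumptions: the `orders` data is made up, the hand-written map/shuffle/reduce functions only mimic the programming model rather than using Hadoop, and Python's built-in `sqlite3` stands in for a cloud data warehouse such as Redshift.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (customer, order amount).
orders = [("alice", 30.0), ("bob", 12.5), ("alice", 7.5)]


# --- MapReduce style: every phase is written by hand ---
def map_phase(rows):
    """Emit one (key, value) pair per input row."""
    for customer, amount in rows:
        yield customer, amount


def shuffle_phase(pairs):
    """Group the emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each group of values into a single total."""
    return {key: sum(values) for key, values in groups.items()}


mapreduce_totals = reduce_phase(shuffle_phase(map_phase(orders)))

# --- SQL style: one declarative statement ---
conn = sqlite3.connect(":memory:")  # in-memory stand-in for a warehouse
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", orders)
sql_totals = dict(
    conn.execute("SELECT customer, SUM(amount) FROM orders GROUP BY customer")
)
conn.close()

if __name__ == "__main__":
    print(mapreduce_totals)
    print(sql_totals)
```

Both approaches produce the same totals, but the SQL version is a single statement that an analyst can write without engineering support, which is the accessibility gain described above.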
A growing need for better tooling
In the following years, data warehouse solution providers were able to further improve the architecture, separate storage from compute, and offer better price points and scalability. But transforming, modelling, cleaning, and converting data into actionable insights remained cumbersome and error-prone.
Fast-growing businesses became unhappy with what they were getting in return for their large infrastructure investments. Their data had grown in volume, variety and complexity, but the ecosystem still did not have the tools that could manage it well.
Privacy, too, was becoming a serious matter, and governments across the world wanted to protect their citizens' personal data as information systems became ever more digitized. This led to stringent regulatory frameworks such as the EU’s GDPR and California’s CCPA.
As the basic building blocks of the analytical data platform matured and stabilized, better data management and observability became critically important. The stage was set for the development of a better set of tools that could address these challenges. Investors and entrepreneurs became interested, and the modern data stack became the focus of attention and innovation.