Balancing Decoupling: Finding the Right Boundaries for Storage, Compute, and the Modern Data Stack.

Data engineers enthusiastically embraced the separation of storage and compute over keeping them tightly coupled, adopting then-emerging technologies such as BigQuery (2010), Snowflake (2012), and Databricks (2013) to achieve this decoupling.

The advantages of this approach were remarkable from both cost and scalability perspectives compared to traditional on-premises databases. A data engineering manager from a Fortune 500 company expressed the challenges they faced with on-prem limitations:

"Our analysts couldn't run queries when needed, as our data warehouse was regularly taken offline for data transformations and loading. The process was painfully disruptive."

Over the following decade, the data management industry has seen considerable innovation centered around how various data platforms couple or decouple storage and compute, as well as bundle or unbundle related data services, such as data ingestion, transformation, governance, and monitoring. These aspects are closely related, and it is essential for data leaders to take notice.

The underlying connection and integration of these services often lie in the metadata of table formats (storage) and query/job logs (compute). How these elements are managed within a data platform significantly impacts its performance, cost, ease of use, partner ecosystem, and future viability.
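To make the storage-side half of this concrete, here is a minimal sketch that inspects an open table format's metadata with nothing but the standard library; the file path is hypothetical, and the keys shown (current-snapshot-id, snapshots, schemas) follow the Apache Iceberg table-format spec. The compute-side counterpart would be the query history or job logs your engine already keeps.

```python
import json

# Hypothetical path to an Iceberg table's current metadata file
# (in practice this lives in the table's warehouse location, e.g. on S3).
METADATA_PATH = "warehouse/analytics/orders/metadata/v3.metadata.json"

with open(METADATA_PATH) as f:
    meta = json.load(f)

# This metadata is what a query engine (compute) reads in order to
# plan work against the underlying data files (storage).
print("current snapshot:", meta.get("current-snapshot-id"))
print("number of snapshots:", len(meta.get("snapshots", [])))
print("schema fields:", [field["name"]
                         for schema in meta.get("schemas", [])
                         for field in schema.get("fields", [])])
```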

Choosing the right type of data platform and the appropriate level of decoupling is akin to deciding how to format your SQL code: it depends on personal preference and professional requirements. Still, a small range of options will satisfy most needs.

In line with Aristotle's golden mean, I believe that the majority of users will find the best fit in options positioned in the middle of the spectrum. Operating at either extreme will be suitable for the select few with very specialized use cases.

Before delving into the reasons behind this, let's first explore the current landscape and recent developments.

The spectrum of data platforms encompassing storage and compute components.

A vocal minority has garnered attention with their "cloud is expensive, let's return to on-premises server racks" movement. However, this strategy remains rare and has not been widely adopted.

Only a few weeks ago, the Pragmatic Engineer brought attention to Twitter's rate throttling and considerable user experience problems. These issues were likely caused by the decision to shift their machine learning models away from Google Cloud Platform (GCP) and rely solely on their three data centers.

The capability to independently scale and utilize storage and compute proves to be more cost-effective and efficient. However, there are advantages to having these functions integrated within the same data platform as well.

Data platforms that have been fine-tuned to work effectively out of the box usually execute average unoptimized SQL queries faster, especially when catering to ad hoc analytics requests. On the other hand, a decoupled architecture, separating compute and storage at the platform level, can be cost-effective for handling heavy workloads, assuming there is a highly skilled team to optimize these workloads.

Combined but decoupled storage and compute in data platforms also offer a more comprehensive and integrated user experience for essential data operations tasks. For instance, in the context of data governance, these platforms provide a centralized mechanism for access control, unlike decoupled architectures that require role federation across multiple query engines, which can be a challenging task.
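As a small illustration of what centralized access control looks like in practice, here is a minimal sketch using the Snowflake Python connector; the account, role, and object names are hypothetical. In a fully decoupled architecture, an equivalent policy would have to be restated in every query engine that can reach the same files.

```python
import snowflake.connector

# Hypothetical connection details.
conn = snowflake.connector.connect(
    account="my_account",
    user="governance_admin",
    password="***",
    role="SECURITYADMIN",
)

cur = conn.cursor()
# One grant, enforced for every workload that touches this table,
# because storage and compute share the same access-control layer.
cur.execute("GRANT SELECT ON TABLE analytics.public.orders TO ROLE analyst")
cur.close()
conn.close()
```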

This decoupled but combined approach has earned accolades for platforms like Snowflake, often praised for its seamless performance. Notably, Snowflake recently reinforced its capabilities with Unistore for transactional workloads and introduced Snowpark to support Python and other data science-related workloads.
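For a sense of what the Snowpark side looks like, here is a minimal sketch with hypothetical connection parameters and table names; the point is that the DataFrame operations are pushed down to the warehouse's compute rather than executed in the Python client.

```python
from snowflake.snowpark import Session

# Hypothetical connection parameters.
connection_parameters = {
    "account": "my_account",
    "user": "data_scientist",
    "password": "***",
    "warehouse": "ANALYTICS_WH",
    "database": "ANALYTICS",
    "schema": "PUBLIC",
}

session = Session.builder.configs(connection_parameters).create()

# The aggregation is pushed down to Snowflake's compute,
# so the client never pulls the full table.
orders = session.table("ORDERS")
orders.group_by("REGION").count().show()

session.close()
```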

Databricks experienced remarkable growth due to its emphasis on the Spark processing framework. Additionally, its decision to incorporate metadata and ACID-like transactions within Delta tables, along with governance features within Unity Catalog, unlocked new levels of growth. Recently, Databricks took further steps toward compatibility by writing metadata readable by Delta Lake, Apache Iceberg, and Apache Hudi when writing to a Delta table (storage).
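A rough sketch of how that looks with Delta Lake's UniForm feature from PySpark is shown below, assuming a Spark session with Delta Lake already configured; the table and column names are hypothetical, and the exact table properties differ across Delta Lake and Databricks Runtime versions, so treat this as an outline rather than a recipe.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical table; the property asks Delta to also write Iceberg-readable
# metadata alongside its own, so external engines can read the same files.
spark.sql("""
    CREATE TABLE IF NOT EXISTS analytics.orders (
        order_id BIGINT,
        region   STRING,
        amount   DOUBLE
    )
    USING DELTA
    TBLPROPERTIES ('delta.universalFormat.enabledFormats' = 'iceberg')
""")
```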

Emerging data platforms.

It's intriguing to observe the emerging data engineering technologies that are beginning to separate storage and compute at the vendor level. For instance, Tabular positions itself as a "headless data warehouse" or a comprehensive solution for everything in a data warehouse except compute.

Furthermore, certain organizations are opting to migrate to Apache Iceberg tables within a data lake, managing the backend infrastructure themselves and utilizing a separate query engine like Trino. This shift is often driven by customer-facing use cases that demand highly performant and cost-effective interactive queries.
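Here is a minimal sketch of that pattern using the Trino Python client, assuming a hypothetical Trino cluster that already has an Iceberg catalog configured; the host, schema, and table names are illustrative.

```python
from trino.dbapi import connect

# Hypothetical Trino coordinator with an Iceberg catalog configured.
conn = connect(
    host="trino.internal.example.com",
    port=8080,
    user="analytics_service",
    catalog="iceberg",
    schema="analytics",
)

cur = conn.cursor()
# The Iceberg tables live in the data lake (storage); Trino supplies the
# interactive query engine (compute), scaled and paid for separately.
cur.execute("""
    SELECT region, count(*) AS orders
    FROM orders
    WHERE order_date >= DATE '2023-01-01'
    GROUP BY region
""")
for region, order_count in cur.fetchall():
    print(region, order_count)
```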

On the other hand, DuckDB combines storage and compute, prioritizing developer simplicity and reduced cost over the near-infinite compute capacity of modern data stacks.
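A short sketch of that workflow, assuming a hypothetical local Parquet export; everything runs in a single process, so storage and compute sit on the same machine by design.

```python
import duckdb

# A single local file serves as both the storage and the catalog.
con = duckdb.connect("local_analytics.duckdb")

# Hypothetical Parquet export; DuckDB can query it in place.
con.execute("""
    CREATE TABLE IF NOT EXISTS events AS
    SELECT * FROM 'events_2023.parquet'
""")

con.sql("""
    SELECT event_type, count(*) AS n
    FROM events
    GROUP BY event_type
    ORDER BY n DESC
""").show()

con.close()
```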

The question remains whether these innovations will replace established cloud-native data platforms. The answer will largely depend on individual needs and preferences. While DuckDB is immensely popular among data analysts, it might not serve as the foundational backbone for building an entire data platform. Ultimately, we are likely to see a distribution that encompasses a variety of solutions.

I will elucidate the reasons by examining various dimensions of the modern data stack and data platform types.

The extent and intention of consolidation.

B2B vendors often praise the concept of a "single pane of glass," but its value depends on the quality and alignment of each service with your specific needs. The true value lies in unifying disparate information and actions into a coherent narrative or streamlined workflow. Microsoft 365 serves as an example of this approach, where integrating video and email within their Teams collaboration application enhances meeting scheduling and video conferencing processes. However, the value of other integrated apps, like Sway, depends on individual requirements, such as interactive reporting needs.

In the data universe, compute and storage play a crucial role in creating a unified dataops story, addressing aspects like cost, quality, and access management. Platforms that excel in this integration often boast robust partner ecosystems and seamless integrations. For many, this will be a key criterion when choosing a data platform, similar to how preferences for different operating systems exist among phone users.

We all understand the significance of a tightly integrated partner ecosystem surrounding an organisation's data platform. The convenience of an up-to-date data stack that requires minimal patching allows customer-facing teams to focus on building exceptional customer experiences.

Nevertheless, exceptions to this approach exist, and certain use cases at a large scale may call for more complex platforms like true data lakes or headless warehouses.

When it comes to bundling semantic layers, data quality, access control, catalog, BI, transformation, and ingestion tools within the same platform, there are valid perspectives across the spectrum. Ultimately, most data teams will opt for a collection of tools that best align with their specific requirements.

Key Points:

The majority of data leaders will prioritize a data platform that integrates both compute and storage services to enable a cohesive "single story" and foster a diverse partner ecosystem.

Balancing performance and ease of use.

In general, the more customizable a platform, the greater its potential performance across various use cases, but it also becomes more challenging to use. This tradeoff is inevitable when separating storage and compute services across different vendors.

Considering the "ease of use" of a data platform involves not only day-to-day usage but also the simplicity of administration and customization.

Many teams tend to prioritize platform performance, comparing platforms like cars and focusing on horsepower for specific workloads. While an optimized data platform can lead to significant cost savings, the expenses associated with managing complex configurations or lengthy onboarding projects for new business aspects should not be overlooked.

A similar decision-making pattern emerges with open-source solutions, where the upfront cost may be low, but the time and effort required to maintain the infrastructure can be substantial.

Solution costs and engineering salary costs should not be treated as the same, as this mistaken equivalence can lead to problems in the future. There are two primary reasons for this distinction (a rough cost sketch follows the list below):

  • Assuming your usage remains constant (a crucial consideration), solution costs tend to remain stable while efficiency improves with SaaS vendors continuously introducing new features. Conversely, the efficiency of a more manual implementation is likely to decline over time due to turnover, as key team members may leave, and new ones require onboarding.
  • When most of your time is spent on infrastructure maintenance, your data team may lose focus on maximizing business value, and maintaining peak performance becomes the primary goal. Meetings start revolving around infrastructure, and niche infrastructure skills gain disproportionate importance, with these specialists becoming more prominent within the organization. Organizational culture is significantly influenced by the primary tasks and challenges the team is tackling.
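Here is a back-of-the-envelope sketch of that comparison; every figure in it is hypothetical and exists only to show why platform fees and engineering time deserve separate line items.

```python
# All figures are hypothetical, for illustration only.
YEARS = 3

# Managed platform: stable subscription, little maintenance effort.
managed_platform_fee = 120_000          # per year
managed_maintenance_hours = 200         # per year

# Self-managed stack: lower fees, but sustained engineering effort
# that tends to grow with turnover and onboarding.
self_managed_infra_fee = 40_000         # per year
self_managed_maintenance_hours = 1_500  # per year

engineer_hourly_cost = 90               # fully loaded, per hour

def total_cost(fee, hours):
    """Total cost of ownership over the chosen horizon."""
    return YEARS * (fee + hours * engineer_hourly_cost)

print("managed platform :", total_cost(managed_platform_fee, managed_maintenance_hours))
print("self-managed     :", total_cost(self_managed_infra_fee, self_managed_maintenance_hours))
```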

Many data leaders from large companies I have interacted with have emphasized two central issues that their data teams needed to address. First, they required a data stack that could centralize all data from different parts of the company into a stable source of truth accessible to everyone. Second, they wanted enough time to focus on insights rather than solely on managing the data infrastructure.

While premium performance may be necessary for certain business use cases, such as a credit card fraud data product requiring low latency or a customer-facing app needing high responsiveness, in most cases, a data warehouse or managed data lakehouse will scale effectively. It is crucial to double-check any specific requirements that suggest otherwise.

Key Points:

Finding a balance between ease of use and performance is essential, and most data leaders tend to prioritize ease of use due to hidden maintenance and culture costs. Rather than focusing on maintaining complex infrastructure, the competitive advantage often lies in enriching and applying first-party data effectively.

Supporting the case for the MDS.

I understand that it's trendy to criticize the modern data stack (and you might not necessarily need it to achieve your goals), but despite its flaws, it remains the best choice for the majority of data teams. It strikes a balance between providing quick value generation and ensuring a long-term, future-proof investment.

Many emerging technologies hold significant value, even though their applications might be more specialized. Observing how these technologies evolve and influence data engineering practices will be exciting.

However, while it's essential for compute and storage to operate and scale separately, having these services and corresponding metadata within the same platform offers immense power and numerous advantages that cannot be overlooked.
