Balancing Decoupling: Finding the Right Boundaries for Storage, Compute, and the Modern Data Stack.

Data engineers enthusiastically embraced the separation of storage and compute over keeping them tightly coupled, adopting then-emerging technologies such as BigQuery (2010), Snowflake (2012), and Databricks (2013) to achieve this decoupling.

The advantages of this approach were remarkable from both cost and scalability perspectives compared to traditional on-premises databases. A data engineering manager from a Fortune 500 company expressed the challenges they faced with on-prem limitations:

"Our analysts couldn't run queries when needed, as our data warehouse was regularly taken offline for data transformations and loading. The process was painfully disruptive."

Over the following decade, the data management industry has seen considerable innovation centered around how various data platforms couple or decouple storage and compute, as well as bundle or unbundle related data services, such as data ingestion, transformation, governance, and monitoring. These aspects are closely related, and it is essential for data leaders to take notice.

The underlying connection and integration of these services often lie in the metadata of table formats (storage) and query/job logs (compute). How these elements are managed within a data platform significantly impacts its performance, cost, ease of use, partner ecosystem, and future viability.
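To make the storage-side half of this concrete, here is a minimal sketch that inspects an open table format's metadata with nothing but the standard library; the file path is hypothetical, and the keys shown (current-snapshot-id, snapshots, schemas) follow the Apache Iceberg table-format spec. The compute-side counterpart would be the query history or job logs your engine already keeps.

```python
import json

# Hypothetical path to an Iceberg table's current metadata file
# (in practice this lives in the table's warehouse location, e.g. on S3).
METADATA_PATH = "warehouse/analytics/orders/metadata/v3.metadata.json"

with open(METADATA_PATH) as f:
    meta = json.load(f)

# This metadata is what a query engine (compute) reads in order to
# plan work against the underlying data files (storage).
print("current snapshot:", meta.get("current-snapshot-id"))
print("number of snapshots:", len(meta.get("snapshots", [])))
print("schema fields:", [field["name"]
                         for schema in meta.get("schemas", [])
                         for field in schema.get("fields", [])])
```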

Choosing the right type of data platform and the appropriate level of decoupling is akin to deciding how to format your SQL code: it depends on personal preference and professional requirements. Still, a small range of options will satisfy most needs.

In line with Aristotle's golden mean, I believe that the majority of users will find the best fit in options positioned in the middle of the spectrum. Operating at either extreme will be suitable for the select few with very specialized use cases.

Before delving into the reasons behind this, let's first explore the current landscape and recent developments.

The spectrum of data platforms encompassing storage and compute components.

A vocal minority has garnered attention with their "cloud is expensive, let's return to on-premises server racks" movement. However, this strategy remains rare and has not been widely adopted.

Only a few weeks ago, the Pragmatic Engineer brought attention to Twitter's rate throttling and considerable user experience problems. These issues were likely caused by the decision to shift their machine learning models away from Google Cloud Platform (GCP) and rely solely on their three data centers.

The capability to independently scale and utilize storage and compute proves to be more cost-effective and efficient. However, there are advantages to having these functions integrated within the same data platform as well.

Data platforms that have been fine-tuned to work effectively out of the box usually execute average unoptimized SQL queries faster, especially when catering to ad hoc analytics requests. On the other hand, a decoupled architecture, separating compute and storage at the platform level, can be cost-effective for handling heavy workloads, assuming there is a highly skilled team to optimize these workloads.

Combined but decoupled storage and compute in data platforms also offer a more comprehensive and integrated user experience for essential data operations tasks. For instance, in the context of data governance, these platforms provide a centralized mechanism for access control, unlike decoupled architectures that require role federation across multiple query engines, which can be a challenging task.
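As a small illustration of what centralized access control looks like in practice, here is a minimal sketch using the Snowflake Python connector; the account, role, and object names are hypothetical. In a fully decoupled architecture, an equivalent policy would have to be restated in every query engine that can reach the same files.

```python
import snowflake.connector

# Hypothetical connection details.
conn = snowflake.connector.connect(
    account="my_account",
    user="governance_admin",
    password="***",
    role="SECURITYADMIN",
)

cur = conn.cursor()
# One grant, enforced for every workload that touches this table,
# because storage and compute share the same access-control layer.
cur.execute("GRANT SELECT ON TABLE analytics.public.orders TO ROLE analyst")
cur.close()
conn.close()
```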

This decoupled but combined approach has earned accolades for platforms like Snowflake, often praised for its seamless performance. Notably, Snowflake recently reinforced its capabilities with Unistore for transactional workloads and introduced Snowpark to support Python and other data science-related workloads.
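For a sense of what the Snowpark side looks like, here is a minimal sketch with hypothetical connection parameters and table names; the point is that the DataFrame operations are pushed down to the warehouse's compute rather than executed in the Python client.

```python
from snowflake.snowpark import Session

# Hypothetical connection parameters.
connection_parameters = {
    "account": "my_account",
    "user": "data_scientist",
    "password": "***",
    "warehouse": "ANALYTICS_WH",
    "database": "ANALYTICS",
    "schema": "PUBLIC",
}

session = Session.builder.configs(connection_parameters).create()

# The aggregation is pushed down to Snowflake's compute,
# so the client never pulls the full table.
orders = session.table("ORDERS")
orders.group_by("REGION").count().show()

session.close()
```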

Databricks experienced remarkable growth due to its emphasis on the Spark processing framework. Additionally, its decision to incorporate metadata and ACID-like transactions within Delta tables, along with governance features within Unity Catalog, unlocked new levels of growth. Recently, Databricks took further steps toward compatibility by writing metadata readable by Delta Lake, Apache Iceberg, and Apache Hudi when writing to a Delta table (storage).
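A rough sketch of how that looks with Delta Lake's UniForm feature from PySpark is shown below, assuming a Spark session with Delta Lake already configured; the table and column names are hypothetical, and the exact table properties differ across Delta Lake and Databricks Runtime versions, so treat this as an outline rather than a recipe.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical table; the property asks Delta to also write Iceberg-readable
# metadata alongside its own, so external engines can read the same files.
spark.sql("""
    CREATE TABLE IF NOT EXISTS analytics.orders (
        order_id BIGINT,
        region   STRING,
        amount   DOUBLE
    )
    USING DELTA
    TBLPROPERTIES ('delta.universalFormat.enabledFormats' = 'iceberg')
""")
```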

Emerging data platforms.

It's intriguing to observe the emerging data engineering technologies that are beginning to separate storage and compute at the vendor level. For instance, Tabular positions itself as a "headless data warehouse" or a comprehensive solution for everything in a data warehouse except compute.

Furthermore, certain organizations are opting to migrate to Apache Iceberg tables within a data lake, managing the backend infrastructure themselves and utilizing a separate query engine like Trino. This shift is often driven by customer-facing use cases that demand highly performant and cost-effective interactive queries.
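Here is a minimal sketch of that pattern using the Trino Python client, assuming a hypothetical Trino cluster that already has an Iceberg catalog configured; the host, schema, and table names are illustrative.

```python
from trino.dbapi import connect

# Hypothetical Trino coordinator with an Iceberg catalog configured.
conn = connect(
    host="trino.internal.example.com",
    port=8080,
    user="analytics_service",
    catalog="iceberg",
    schema="analytics",
)

cur = conn.cursor()
# The Iceberg tables live in the data lake (storage); Trino supplies the
# interactive query engine (compute), scaled and paid for separately.
cur.execute("""
    SELECT region, count(*) AS orders
    FROM orders
    WHERE order_date >= DATE '2023-01-01'
    GROUP BY region
""")
for region, order_count in cur.fetchall():
    print(region, order_count)
```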

On the other hand, DuckDB combines storage and compute, prioritizing developer simplicity and reduced cost over the near-infinite compute capacity of modern data stacks.
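A short sketch of that workflow, assuming a hypothetical local Parquet export; everything runs in a single process, so storage and compute sit on the same machine by design.

```python
import duckdb

# A single local file serves as both the storage and the catalog.
con = duckdb.connect("local_analytics.duckdb")

# Hypothetical Parquet export; DuckDB can query it in place.
con.execute("""
    CREATE TABLE IF NOT EXISTS events AS
    SELECT * FROM 'events_2023.parquet'
""")

con.sql("""
    SELECT event_type, count(*) AS n
    FROM events
    GROUP BY event_type
    ORDER BY n DESC
""").show()

con.close()
```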

The question remains whether these innovations will replace established cloud-native data platforms. The answer will largely depend on individual needs and preferences. While DuckDB is immensely popular among data analysts, it might not serve as the foundational backbone for building an entire data platform. Ultimately, we are likely to see a distribution that encompasses a variety of solutions.

I will elucidate the reasons by examining various dimensions of the modern data stack and data platform types.

The extent and intention of consolidation.

B2B vendors often praise the concept of a "single pane of glass," but its value depends on the quality and alignment of each service with your specific needs. The true value lies in unifying disparate information and actions into a coherent narrative or streamlined workflow. Microsoft 365 serves as an example of this approach, where integrating video and email within their Teams collaboration application enhances meeting scheduling and video conferencing processes. However, the value of other integrated apps, like Sway, depends on individual requirements, such as interactive reporting needs.

In the data universe, compute and storage play a crucial role in creating a unified dataops story, addressing aspects like cost, quality, and access management. Platforms that excel in this integration often boast robust partner ecosystems and seamless integrations. For many, this will be a key criterion when choosing a data platform, similar to how preferences for different operating systems exist among phone users.

We all understand the significance of a tightly integrated partner ecosystem surrounding an organisation's data platform. The convenience of an up-to-date data stack that requires minimal patching allows customer-facing teams to focus on building exceptional customer experiences.

Nevertheless, exceptions to this approach exist, and certain use cases at a large scale may call for more complex platforms like true data lakes or headless warehouses.

When it comes to bundling semantic layers, data quality, access control, catalog, BI, transformation, and ingestion tools within the same platform, there are valid perspectives across the spectrum. Ultimately, most data teams will opt for a collection of tools that best align with their specific requirements.

Key Points:

The majority of data leaders will prioritize a data platform that integrates both compute and storage services to enable a cohesive "single story" and foster a diverse partner ecosystem.

Balancing performance and ease of use.

In general, the more customizable a platform, the greater its potential performance across various use cases, but it also becomes more challenging to use. This tradeoff is inevitable when separating storage and compute services across different vendors.

Considering the "ease of use" of a data platform involves not only day-to-day usage but also the simplicity of administration and customization.

Many teams tend to prioritize platform performance, comparing platforms like cars and focusing on horsepower for specific workloads. While an optimized data platform can lead to significant cost savings, the expenses associated with managing complex configurations or lengthy onboarding projects for new business aspects should not be overlooked.

A similar decision-making pattern emerges with open-source solutions, where the upfront cost may be low, but the time and effort required to maintain the infrastructure can be substantial.

Solution costs and engineering salary costs should not be treated as the same, as this mistaken equivalence can lead to problems in the future. There are two primary reasons for this distinction (a rough cost sketch follows the list below):

  • Assuming your usage remains constant (a crucial consideration), solution costs tend to remain stable while efficiency improves with SaaS vendors continuously introducing new features. Conversely, the efficiency of a more manual implementation is likely to decline over time due to turnover, as key team members may leave, and new ones require onboarding.
  • When most of your time is spent on infrastructure maintenance, your data team may lose focus on maximizing business value, and maintaining peak performance becomes the primary goal. Meetings start revolving around infrastructure, and niche infrastructure skills gain disproportionate importance, with these specialists becoming more prominent within the organization. Organizational culture is significantly influenced by the primary tasks and challenges the team is tackling.
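Here is a back-of-the-envelope sketch of that comparison; every figure in it is hypothetical and exists only to show why platform fees and engineering time deserve separate line items.

```python
# All figures are hypothetical, for illustration only.
YEARS = 3

# Managed platform: stable subscription, little maintenance effort.
managed_platform_fee = 120_000          # per year
managed_maintenance_hours = 200         # per year

# Self-managed stack: lower fees, but sustained engineering effort
# that tends to grow with turnover and onboarding.
self_managed_infra_fee = 40_000         # per year
self_managed_maintenance_hours = 1_500  # per year

engineer_hourly_cost = 90               # fully loaded, per hour

def total_cost(fee, hours):
    """Total cost of ownership over the chosen horizon."""
    return YEARS * (fee + hours * engineer_hourly_cost)

print("managed platform :", total_cost(managed_platform_fee, managed_maintenance_hours))
print("self-managed     :", total_cost(self_managed_infra_fee, self_managed_maintenance_hours))
```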

Many data leaders from large companies I have interacted with have emphasized two central issues that their data teams needed to address. First, they required a data stack that could centralize all data from different parts of the company into a stable source of truth accessible to everyone. Second, they wanted enough time to focus on insights rather than solely on managing the data infrastructure.

While premium performance may be necessary for certain business use cases, such as a credit card fraud data product requiring low latency or a customer-facing app needing high responsiveness, in most cases, a data warehouse or managed data lakehouse will scale effectively. It is crucial to double-check any specific requirements that suggest otherwise.

Key Points:

Finding a balance between ease of use and performance is essential, and most data leaders tend to prioritize ease of use due to hidden maintenance and culture costs. Rather than focusing on maintaining complex infrastructure, the competitive advantage often lies in enriching and applying first-party data effectively.

Supporting the case for the MDS.

I understand that it's trendy to criticize the modern data stack (and you might not necessarily need it to achieve your goals), but despite its flaws, it remains the best choice for the majority of data teams. It strikes a balance between providing quick value generation and ensuring a long-term, future-proof investment.

Many emerging technologies hold significant value, even though their applications might be more specialized. Observing how these technologies evolve and influence data engineering practices will be exciting.

However, while it's essential for compute and storage to operate and scale separately, having these services and corresponding metadata within the same platform offers immense power and numerous advantages that cannot be overlooked.
