Data Pipelines | To merge, or not to merge

In recent years, data has shifted towards a more streaming-centric nature. Online transactions, website clicks, TikTok likes, and even your car's real-time energy consumption now contribute to an ecosystem where each data point represents a new, immutable event. This evolution has led to the rise of incremental data pipelines, which process data as it arrives, in contrast to traditional batch processing of large historical datasets.

Most online transaction processing (OLTP) systems have adapted by generating streams of events through techniques like Change Data Capture (CDC), where every row-level change in a database is tracked and delivered in real time. SQL operations, such as MERGE INTO, enable these events to be seamlessly integrated into an existing data table without the need to overwrite its entire content, making it a more efficient and targeted way to update data.

However, not all systems can adopt this approach. Many legacy or simpler systems lack the ability to stream changes, leaving data engineers with no choice but to take periodic "snapshots" or "dumps" of their data.

So, if you're building a data pipeline that relies on snapshots, what are your options for ingestion and processing? And more importantly, how does this affect performance?

Data Flows

Before diving into the details, let’s explore the various options for ingesting data and propagating changes to downstream tables.

Input Data Structures

Whether you’re using a push or pull configuration for your data ingestion, data producers typically follow one of these three structures (a small sketch after the list illustrates each with sample records):

  • Incremental: Only new data points (or events) are submitted. This structure is often immutable and is common in scenarios like telemetry data, IoT systems, and website monitoring, where data is constantly streaming in.
  • Change Data Capture (CDC): This structure captures a list of changes within a system—whether it's a new record, an update to an existing one, or even a deletion. CDC is frequently used in transactional and other OLTP systems to track and reflect real-time changes.
  • Snapshots: In this case, the entire state of a data system is submitted at fixed intervals. For systems such as file-based data warehouses (e.g., Parquet), SQLite, MySQL, or even Excel spreadsheets, snapshots are often the only available option due to their static nature.
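To make these shapes concrete, here is a minimal sketch of what one batch from each structure might look like. The field names (id, ts, value, op) are purely illustrative assumptions, not taken from any particular producer:

    # Illustrative records for each input structure (field names are hypothetical).

    # Incremental: only new, immutable events arrive.
    incremental_batch = [
        {"id": 101, "ts": "2025-01-01T00:00:00", "value": 42},
        {"id": 102, "ts": "2025-01-01T00:01:00", "value": 37},
    ]

    # CDC: every record describes a change in the source system.
    cdc_batch = [
        {"op": "INSERT", "id": 103, "value": 12},
        {"op": "UPDATE", "id": 42, "value": 99},
        {"op": "DELETE", "id": 7, "value": None},
    ]

    # Snapshot: the full state of the source at a point in time.
    snapshot_batch = [
        {"id": 1, "value": 10},
        {"id": 2, "value": 20},
        # ... every row currently present in the source
    ]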

Write Operations

Having access to input data is just the beginning; you also need to determine how to propagate this information to downstream tables. In this article, we focus on Spark as the primary engine, though most modern SQL engines support similar types of operations.

  • Append: This operation adds new rows to a table by writing them to a new file, without overwriting any existing data on disk.
  • Overwrite: In this case, the entire table is deleted and replaced with a new dataset, effectively overwriting all previous data.
  • Merge: A more advanced operation where the engine writes new files that overwrite specific rows or marks certain rows as deleted, without actually removing the existing data. This allows for efficient updates and deletions within the table without a full rewrite.

While the incremental data structure aligns perfectly with the append operation, and CDC pairs naturally with the merge operation, a full snapshot can be used with either the overwrite or merge operations, depending on the requirements. The diagram below illustrates how these relationships work:

Data Flows

Performance Analysis

When building a data platform, the performance of each data flow is crucial and must be balanced with the cost of configuring producers to generate specific data structures.

Source Code

While a deep dive into the implementation of each data flow is beyond the scope of this article, the following code snippets highlight the key methods used and provide a sense of the complexity involved in each flow.

Increment Append
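A minimal PySpark sketch of the incremental append flow, assuming a Delta target table named incremental_target and a batch of new rows read from a staging path (both names are illustrative, not the notebook's actual code):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Batch containing only the newly arrived events.
    new_rows = spark.read.format("parquet").load("/tmp/increment_batch")

    # Append writes the new rows as additional files;
    # nothing already on disk is rewritten.
    (
        new_rows.write
        .format("delta")
        .mode("append")
        .saveAsTable("incremental_target")
    )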


CDC Merge
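A sketch of the CDC merge flow using the Delta Lake Python API. The join key id, the payload column value, and the operation flag op are assumptions; the actual notebook may use different column names and match conditions:

    from delta.tables import DeltaTable
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Batch of CDC events, each carrying an operation flag (hypothetical schema).
    cdc_events = spark.read.format("parquet").load("/tmp/cdc_batch")

    target = DeltaTable.forName(spark, "cdc_target")
    (
        target.alias("t")
        .merge(cdc_events.alias("s"), "t.id = s.id")
        # Rows flagged as deletions in the CDC feed are removed from the target.
        .whenMatchedDelete(condition="s.op = 'DELETE'")
        # Matched rows flagged as updates get the latest values.
        .whenMatchedUpdate(condition="s.op = 'UPDATE'", set={"value": "s.value"})
        # Rows not yet present in the target are inserted.
        .whenNotMatchedInsert(values={"id": "s.id", "value": "s.value"})
        .execute()
    )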


Snapshot Merge
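A sketch of the snapshot merge flow: the incoming snapshot is joined to the target on a key, matched rows are updated, new rows are inserted, and rows that have disappeared from the snapshot are deleted. The id key is again an assumption, and whenNotMatchedBySourceDelete requires a reasonably recent Delta Lake release:

    from delta.tables import DeltaTable
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Full snapshot of the source system (assumed to share the target's schema).
    snapshot = spark.read.format("parquet").load("/tmp/snapshot")

    target = DeltaTable.forName(spark, "snapshot_target")
    (
        target.alias("t")
        .merge(snapshot.alias("s"), "t.id = s.id")
        # Bring existing rows in line with the latest snapshot values.
        .whenMatchedUpdateAll()
        # Add rows that appeared since the previous snapshot.
        .whenNotMatchedInsertAll()
        # Remove rows that no longer exist in the source system.
        .whenNotMatchedBySourceDelete()
        .execute()
    )

In practice you may also want a condition on the update clause so that unchanged rows are not rewritten.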


Snapshot Overwrite
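A sketch of the snapshot overwrite flow, which simply replaces the whole table with the latest snapshot (table and path names are illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Full snapshot of the source system.
    snapshot = spark.read.format("parquet").load("/tmp/snapshot")

    # Overwrite replaces the entire content of the target table.
    (
        snapshot.write
        .format("delta")
        .mode("overwrite")
        .saveAsTable("snapshot_target")
    )

With Delta tables, the overwrite is recorded as a new table version, so the previous files remain available for time travel until they are vacuumed.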


For the complete code and an executable notebook, refer here.

Test Bench

Many factors can influence the performance of each data flow, including cluster type/size, storage type, merge strategy complexity and partitioning strategy. For simplicity, we’ll focus on the most impactful variable: target table size. Here are some details:

  • Target tables ranging from 1 thousand to 1 billion rows
  • Each test was run 3 times to obtain an accurate average duration (see the timing sketch after this list)
  • Incremental input consists of 1,440 new rows
  • CDC input consists of 1,440 new rows plus updates to 1% of the target table
  • Executed in Azure Databricks with 4 workers of type Standard_D4ds_v5
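As a rough illustration, the timing loop could look like the sketch below; the flow functions and row counts are placeholders, not the actual notebook code:

    import time

    def time_flow(flow_fn, runs=3):
        """Run a data flow several times and return its average duration in seconds."""
        durations = []
        for _ in range(runs):
            start = time.perf_counter()
            flow_fn()  # e.g. a function wrapping one of the four write flows
            durations.append(time.perf_counter() - start)
        return sum(durations) / len(durations)

    # Hypothetical usage:
    # results = {
    #     "incremental_append": time_flow(lambda: incremental_append(target_rows)),
    #     "cdc_merge": time_flow(lambda: cdc_merge(target_rows)),
    #     "snapshot_merge": time_flow(lambda: snapshot_merge(target_rows)),
    #     "snapshot_overwrite": time_flow(lambda: snapshot_overwrite(target_rows)),
    # }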

Results

Small Tables: No clear winner

For smaller datasets (fewer than 1 million rows), all four data flows perform similarly, with only slight differences in execution times. This is expected, as the overhead of distributed systems like Spark doesn't offer much advantage when dealing with smaller data volumes.

Absolute Data Flow Durations vs Number of rows


Big Tables: Overwrite overrules merge for snapshots

As table sizes increase beyond 1 million rows, the append operation for incremental data and the merge for CDC inputs are the most efficient flows. This aligns with expectations since appending only requires writing new rows, and CDC merge operations involve similar tasks but add filtering and calculations.

The more interesting comparison lies between snapshot overwrite and snapshot merge. Essentially, it comes down to whether it's more efficient to compare two dataframes and only write the differences, or to simply overwrite the old dataframe with the new one. Overwrite seems to be the clear winner.

Absolute Data Flow Durations vs Number of rows

Very Big Tables: Merge closes the gap

When normalizing the duration of each data flow relative to snapshot overwrite, we see some interesting trends. Although overwrite remains faster than merge at 1 billion rows, the gap narrows significantly. While overwrite has a 5x advantage at 10 million rows, this shrinks to 2x at 1 billion rows. This suggests that compute and write operations may not scale at the same rate as data grows, and at some point, the merge operation could outperform overwrite.

Normalized Data Flow Durations vs Number of rows


Simplicity: Overwrite wins again

Another advantage of the overwrite operation is its simplicity. Though harder to quantify, simpler code translates to reduced development time and better data quality. Given similar (or even slightly worse) performance, I would prefer overwrite over merge any day when dealing with snapshot inputs.

Summary

We explored four types of data flows:

  • Incremental input with append
  • Change Data Capture (CDC) input with merge
  • Snapshot input with merge
  • Snapshot input with overwrite

And found out that:

  • For small data tables (fewer than 1 million rows), all data flows offer similar performance. Choose the simplest one that aligns with your data producer or pipeline strategy.
  • For tables exceeding 1 million rows, favor incremental and CDC inputs, as they reduce operational costs and enable streaming pipelines.
  • If your data producer can only generate snapshots, an overwrite operation is usually the best choice—unless your data size exceeds 10 billion rows. However, if your system handles more than 10 billion rows without supporting incremental or CDC updates, you may have a bigger issue at hand.

Again, these results are very specific to the test bench described above, but I believe the same trends would be observed in different conditions. I encourage you to use the source code provided here and experiment with your own setup.

What about you? Have you ever had to choose between a merge and an overwrite operation when handling downstream snapshot inputs? I'd love to hear about your experience—share your thoughts in the comments below!


And don't forget to follow me for more great content about data and ML engineering.

