Data Pipelines | To merge, or not to merge

In recent years, data has shifted towards a more streaming-centric nature. Online transactions, website clicks, TikTok likes, and even your car's real-time energy consumption now contribute to an ecosystem where each data point represents a new, immutable event. This evolution has led to the rise of incremental data pipelines, which process data as it arrives, in contrast to traditional batch processing of large historical datasets.

Most online transaction processing (OLTP) systems have adapted by generating streams of events through techniques like Change Data Capture (CDC), where every row-level change in a database is tracked and delivered in real time. SQL operations, such as MERGE INTO, enable these events to be seamlessly integrated into an existing data table without the need to overwrite its entire content, making it a more efficient and targeted way to update data.

However, not all systems can adopt this approach. Many legacy or simpler systems lack the ability to stream changes, leaving data engineers with no choice but to take periodic "snapshots" or "dumps" of their data.

So, if you're building a data pipeline that relies on snapshots, what are your options for ingestion and processing? And more importantly, how does this affect performance?

Data Flows

Before diving into the details, let’s explore the various options for ingesting data and propagating changes to downstream tables.

Input Data Structures

Whether you’re using a push or pull configuration for your data ingestion, data producers typically follow one of these three structures (a small sketch after the list illustrates each with sample records):

  • Incremental: Only new data points (or events) are submitted. This structure is often immutable and is common in scenarios like telemetry data, IoT systems, and website monitoring, where data is constantly streaming in.
  • Change Data Capture (CDC): This structure captures a list of changes within a system—whether it's a new record, an update to an existing one, or even a deletion. CDC is frequently used in transactional and other OLTP systems to track and reflect real-time changes.
  • Snapshots: In this case, the entire state of a data system is submitted at fixed intervals. For systems such as file-based data warehouses (e.g., Parquet), SQLite, MySQL, or even Excel spreadsheets, snapshots are often the only available option due to their static nature.
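To make these shapes concrete, here is a minimal sketch of what one batch from each structure might look like. The field names (id, ts, value, op) are purely illustrative assumptions, not taken from any particular producer:

    # Illustrative records for each input structure (field names are hypothetical).

    # Incremental: only new, immutable events arrive.
    incremental_batch = [
        {"id": 101, "ts": "2025-01-01T00:00:00", "value": 42},
        {"id": 102, "ts": "2025-01-01T00:01:00", "value": 37},
    ]

    # CDC: every record describes a change in the source system.
    cdc_batch = [
        {"op": "INSERT", "id": 103, "value": 12},
        {"op": "UPDATE", "id": 42, "value": 99},
        {"op": "DELETE", "id": 7, "value": None},
    ]

    # Snapshot: the full state of the source at a point in time.
    snapshot_batch = [
        {"id": 1, "value": 10},
        {"id": 2, "value": 20},
        # ... every row currently present in the source
    ]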

Write Operations

Having access to input data is just the beginning; you also need to determine how to propagate this information to downstream tables. In this article, we focus on Spark as the primary engine, though most modern SQL engines support similar types of operations.

  • Append: This operation adds new rows to a table by writing them to a new file, without overwriting any existing data on disk.
  • Overwrite: In this case, the entire table is deleted and replaced with a new dataset, effectively overwriting all previous data.
  • Merge: A more advanced operation where the engine writes new files that overwrite specific rows or marks certain rows as deleted, without actually removing the existing data. This allows for efficient updates and deletions within the table without a full rewrite.

While the incremental data structure aligns perfectly with the append operation, and CDC pairs naturally with the merge operation, a full snapshot can be used with either the overwrite or merge operations, depending on the requirements. The diagram below illustrates how these relationships work:

Data Flows

Performance Analysis

When building a data platform, the performance of each data flow is crucial and must be balanced with the cost of configuring producers to generate specific data structures.

Source Code

While a deep dive into the implementation of each data flow is beyond the scope of this article, the following code snippets highlight the key methods used and provide a sense of the complexity involved in each flow.

Increment Append
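A minimal PySpark sketch of the incremental append flow, assuming a Delta target table named incremental_target and a batch of new rows read from a staging path (both names are illustrative, not the notebook's actual code):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Batch containing only the newly arrived events.
    new_rows = spark.read.format("parquet").load("/tmp/increment_batch")

    # Append writes the new rows as additional files;
    # nothing already on disk is rewritten.
    (
        new_rows.write
        .format("delta")
        .mode("append")
        .saveAsTable("incremental_target")
    )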


CDC Merge
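A sketch of the CDC merge flow using the Delta Lake Python API. The join key id, the payload column value, and the operation flag op are assumptions; the actual notebook may use different column names and match conditions:

    from delta.tables import DeltaTable
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Batch of CDC events, each carrying an operation flag (hypothetical schema).
    cdc_events = spark.read.format("parquet").load("/tmp/cdc_batch")

    target = DeltaTable.forName(spark, "cdc_target")
    (
        target.alias("t")
        .merge(cdc_events.alias("s"), "t.id = s.id")
        # Rows flagged as deletions in the CDC feed are removed from the target.
        .whenMatchedDelete(condition="s.op = 'DELETE'")
        # Matched rows flagged as updates get the latest values.
        .whenMatchedUpdate(condition="s.op = 'UPDATE'", set={"value": "s.value"})
        # Rows not yet present in the target are inserted.
        .whenNotMatchedInsert(values={"id": "s.id", "value": "s.value"})
        .execute()
    )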


Snapshot Merge
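A sketch of the snapshot merge flow: the incoming snapshot is joined to the target on a key, matched rows are updated, new rows are inserted, and rows that have disappeared from the snapshot are deleted. The id key is again an assumption, and whenNotMatchedBySourceDelete requires a reasonably recent Delta Lake release:

    from delta.tables import DeltaTable
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Full snapshot of the source system (assumed to share the target's schema).
    snapshot = spark.read.format("parquet").load("/tmp/snapshot")

    target = DeltaTable.forName(spark, "snapshot_target")
    (
        target.alias("t")
        .merge(snapshot.alias("s"), "t.id = s.id")
        # Bring existing rows in line with the latest snapshot values.
        .whenMatchedUpdateAll()
        # Add rows that appeared since the previous snapshot.
        .whenNotMatchedInsertAll()
        # Remove rows that no longer exist in the source system.
        .whenNotMatchedBySourceDelete()
        .execute()
    )

In practice you may also want a condition on the update clause so that unchanged rows are not rewritten.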


Snapshot Overwrite
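A sketch of the snapshot overwrite flow, which simply replaces the whole table with the latest snapshot (table and path names are illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Full snapshot of the source system.
    snapshot = spark.read.format("parquet").load("/tmp/snapshot")

    # Overwrite replaces the entire content of the target table.
    (
        snapshot.write
        .format("delta")
        .mode("overwrite")
        .saveAsTable("snapshot_target")
    )

With Delta tables, the overwrite is recorded as a new table version, so the previous files remain available for time travel until they are vacuumed.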


For the complete code and an executable notebook, refer here.

Test Bench

Many factors can influence the performance of each data flow, including cluster type/size, storage type, merge strategy complexity and partitioning strategy. For simplicity, we’ll focus on the most impactful variable: target table size. Here are some details:

  • Target tables ranging from 1 thousand to 1 billion rows
  • Each test was run 3 times to obtain an accurate average duration (see the timing sketch after this list)
  • Incremental input consists of 1,440 new rows
  • CDC input consists of 1,440 new rows plus updates to 1% of the target table
  • Executed in Azure Databricks with 4 workers of type Standard_D4ds_v5
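As a rough illustration, the timing loop could look like the sketch below; the flow functions and row counts are placeholders, not the actual notebook code:

    import time

    def time_flow(flow_fn, runs=3):
        """Run a data flow several times and return its average duration in seconds."""
        durations = []
        for _ in range(runs):
            start = time.perf_counter()
            flow_fn()  # e.g. a function wrapping one of the four write flows
            durations.append(time.perf_counter() - start)
        return sum(durations) / len(durations)

    # Hypothetical usage:
    # results = {
    #     "incremental_append": time_flow(lambda: incremental_append(target_rows)),
    #     "cdc_merge": time_flow(lambda: cdc_merge(target_rows)),
    #     "snapshot_merge": time_flow(lambda: snapshot_merge(target_rows)),
    #     "snapshot_overwrite": time_flow(lambda: snapshot_overwrite(target_rows)),
    # }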

Results

Small Tables: No clear winner

For smaller datasets (fewer than 1 million rows), all four data flows perform similarly, with only slight differences in execution times. This is expected, as the overhead of distributed systems like Spark doesn't offer much advantage when dealing with smaller data volumes.

Absolute Data Flow Durations vs Number of rows


Big Tables: Overwrite overrules merge for snapshots

As table sizes increase beyond 1 million rows, the append operation for incremental data and the merge for CDC inputs are the most efficient flows. This aligns with expectations since appending only requires writing new rows, and CDC merge operations involve similar tasks but add filtering and calculations.

The more interesting comparison lies between snapshot overwrite and snapshot merge. Essentially, it comes down to whether it's more efficient to compare two dataframes and only write the differences, or to simply overwrite the old dataframe with the new one. Overwrite seems to be the clear winner.

Absolute Data Flow Durations vs Number of rows

Very Big Tables: Merge closes the gap

When normalizing the duration of each data flow relative to snapshot overwrite, we see some interesting trends. Although overwrite remains faster than merge at 1 billion rows, the gap narrows significantly. While overwrite has a 5x advantage at 10 million rows, this shrinks to 2x at 1 billion rows. This suggests that compute and write operations may not scale at the same rate as data grows, and at some point, the merge operation could outperform overwrite.

Normalized Data Flow Durations vs Number of rows


Simplicity: Overwrite wins again

Another advantage of the overwrite operation is its simplicity. Though harder to quantify, simpler code translates to reduced development time and better data quality. Given similar (or even slightly worse) performance, I would prefer overwrite over merge any day when dealing with snapshot inputs.

Summary

We explored four types of data flows:

  • Incremental input with append
  • Change Data Capture (CDC) input with merge
  • Snapshot input with merge
  • Snapshot input with overwrite

And found out that:

  • For small data tables (fewer than 1 million rows), all data flows offer similar performance. Choose the simplest one that aligns with your data producer or pipeline strategy.
  • For tables exceeding 1 million rows, favor incremental and CDC inputs, as they reduce operational costs and enable streaming pipelines.
  • If your data producer can only generate snapshots, an overwrite operation is usually the best choice—unless your data size exceeds 10 billion rows. However, if your system handles more than 10 billion rows without supporting incremental or CDC updates, you may have a bigger issue at hand.

Again, these results are very specific to the test bench described above, but I believe the same trends would be observed in different conditions. I encourage you to use the source code provided here and experiment with your own setup.

What about you? Have you ever had to choose between a merge and an overwrite operation when handling downstream snapshot inputs? I'd love to hear about your experience—share your thoughts in the comments below!


And don't forget to follow me for more great content about data and ML engineering.

