Data Pipelines | To merge, or not to merge
Olivier Soucy
Founder @ okube.ai | Fractional Data Platform Engineer | Open-source Developer | Databricks Partner
In recent years, data has shifted towards a more streaming-centric nature. Online transactions, website clicks, TikTok likes, and even your car's real-time energy consumption now contribute to an ecosystem where each data point represents a new, immutable event. This evolution has led to the rise of incremental data pipelines, which process data as it arrives, in contrast to traditional batch processing of large historical datasets.
Most online transaction processing (OLTP) systems have adapted by generating streams of events through techniques like Change Data Capture (CDC), where every row-level change in a database is tracked and delivered in real time. SQL operations such as MERGE INTO allow these events to be integrated into an existing table without rewriting its entire content, making updates more targeted and efficient.
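As a rough illustration (table, column, and operation names below are invented for the example, not taken from any particular system), applying a CDC feed to a Delta table with MERGE INTO could look like this in Spark:

```python
# Hypothetical example: apply a CDC feed (customers_cdc) to a target table
# (customers), keyed on customer_id. All names are illustrative.
spark.sql("""
    MERGE INTO customers AS t
    USING customers_cdc AS s
    ON t.customer_id = s.customer_id
    WHEN MATCHED AND s.operation = 'DELETE' THEN DELETE
    WHEN MATCHED THEN UPDATE SET t.email = s.email
    WHEN NOT MATCHED AND s.operation != 'DELETE' THEN
        INSERT (customer_id, email) VALUES (s.customer_id, s.email)
""")
```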
However, not all systems can adopt this approach. Many legacy or simpler systems lack the ability to stream changes, leaving data engineers with no choice but to take periodic "snapshots" or "dumps" of their data.
So, if you're building a data pipeline that relies on snapshots, what are your options for ingestion and processing? And more importantly, how does this affect performance?
Data Flows
Before diving into the details, let’s explore the various options for ingesting data and propagating changes to downstream tables.
Input Data Structures
Whether you're using a push or pull configuration for your data ingestion, data producers typically follow one of these three structures:
- Incremental: only the new, immutable events (or rows) produced since the last run
- Change Data Capture (CDC): row-level changes (inserts, updates, deletes), each tagged with the operation that produced it
- Full snapshot: a complete copy of the source table's current state
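To make these structures concrete, here is a toy sketch of what each one might look like for a hypothetical customers source; schemas and values are invented for illustration:

```python
# Toy illustration of the three input structures for a hypothetical
# "customers" source. Schemas and values are made up.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# 1. Incremental: only the new, immutable events produced since the last run
incremental = spark.createDataFrame(
    [(101, 7, "signup", "2024-05-01 10:02:00")],
    ["event_id", "customer_id", "event_type", "event_ts"],
)

# 2. CDC: row-level changes, each tagged with the operation that produced it
cdc = spark.createDataFrame(
    [(7, "UPDATE", "alice@new.com"), (8, "DELETE", None)],
    ["customer_id", "operation", "email"],
)

# 3. Full snapshot: the complete current state of the source table
snapshot = spark.createDataFrame(
    [(7, "alice@new.com"), (9, "bob@mail.com")],
    ["customer_id", "email"],
)
```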
Write Operations
Having access to input data is just the beginning; you also need to determine how to propagate this information to downstream tables. In this article, we focus on Spark as the primary engine, though most modern SQL engines support similar types of operations.
While the incremental data structure aligns perfectly with the append operation, and CDC pairs naturally with the merge operation, a full snapshot can be used with either the overwrite or the merge operation, depending on the requirements. The diagram below illustrates how these relationships work:
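As a simplified sketch (not the article's exact implementation), here is what each write operation looks like with Spark and Delta Lake, assuming `df` holds the new input, `main.demo.customers` is the target table, and `customer_id` is the key:

```python
# Simplified sketch of the three write operations against a Delta table.
# `spark`, `df`, the table name, and the key column are all assumed.
from delta.tables import DeltaTable

# Append: incremental inputs simply add new rows to the target
df.write.format("delta").mode("append").saveAsTable("main.demo.customers")

# Overwrite: a full snapshot replaces the entire content of the target
df.write.format("delta").mode("overwrite").saveAsTable("main.demo.customers")

# Merge: upsert rows by key (the natural fit for CDC, an option for snapshots)
target = DeltaTable.forName(spark, "main.demo.customers")
(
    target.alias("t")
    .merge(df.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```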
Performance Analysis
When building a data platform, the performance of each data flow is crucial and must be balanced with the cost of configuring producers to generate specific data structures.
Source Code
While a deep dive into the implementation of each data flow is beyond the scope of this article, the following code snippets highlight the key methods used and provide a sense of the complexity involved in each flow.
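For a flavour of the most involved flow, the snapshot merge, a hedged sketch might look like the following. This is not the notebook's actual code: it assumes a single key column and uses Delta Lake's whenNotMatchedBySourceDelete clause (available in recent Delta releases) to drop rows that no longer appear in the snapshot:

```python
# Hedged sketch of the snapshot-merge flow: upsert rows from the full
# snapshot and delete target rows that disappeared from it.
# Table name and key column are illustrative assumptions.
from delta.tables import DeltaTable

def merge_snapshot(spark, snapshot_df, table_name="main.demo.customers", key="customer_id"):
    target = DeltaTable.forName(spark, table_name)
    (
        target.alias("t")
        .merge(snapshot_df.alias("s"), f"t.{key} = s.{key}")
        .whenMatchedUpdateAll()          # existing rows: refresh with snapshot values
        .whenNotMatchedInsertAll()       # new rows: insert
        .whenNotMatchedBySourceDelete()  # rows missing from the snapshot: delete
        .execute()
    )
```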
For the complete code and an executable notebook, refer here.
Test Bench
Many factors can influence the performance of each data flow, including cluster type and size, storage type, merge strategy complexity, and partitioning strategy. For simplicity, we'll focus on the most impactful variable: target table size. Here are some details:
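As one way to set up such a benchmark (row counts, schema, and table names below are illustrative assumptions, not the article's exact test bench), the target tables of increasing size could be generated roughly like this:

```python
# Illustrative only: generate target tables of increasing size to benchmark
# each data flow. Assumes an active `spark` session (e.g. a Databricks notebook);
# row counts, schema, and table names are made-up assumptions.
from pyspark.sql import functions as F

for n_rows in [10_000, 100_000, 1_000_000, 10_000_000]:
    df = (
        spark.range(n_rows)
        .withColumn("customer_id", F.col("id"))
        .withColumn(
            "email",
            F.concat(F.lit("user"), F.col("id").cast("string"), F.lit("@mail.com")),
        )
        .drop("id")
    )
    df.write.format("delta").mode("overwrite").saveAsTable(f"bench.customers_{n_rows}")
```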
Results
Small Tables: No clear winner
For smaller datasets (fewer than 1 million rows), all four data flows perform similarly, with only slight differences in execution times. This is expected: at these volumes, the fixed overhead of a distributed engine like Spark dominates, leaving little room for any one flow to stand out.
Big Tables: Overwrite overrules merge for snapshots
As table sizes increase beyond 1 million rows, the append operation for incremental data and the merge for CDC inputs are the most efficient flows. This aligns with expectations since appending only requires writing new rows, and CDC merge operations involve similar tasks but add filtering and calculations.
The more interesting comparison lies between snapshot overwrite and snapshot merge. Essentially, it comes down to whether it's more efficient to compare two dataframes and only write the differences, or to simply overwrite the old dataframe with the new one. Overwrite seems to be the clear winner.
Very Big Tables: Merge closes the gap
When normalizing the duration of each data flow relative to snapshot overwrite, we see some interesting trends. Although overwrite remains faster than merge at 1 billion rows, the gap narrows significantly. While overwrite has a 5x advantage at 10 million rows, this shrinks to 2x at 1 billion rows. This suggests that compute and write operations may not scale at the same rate as data grows, and at some point, the merge operation could outperform overwrite.
Simplicity: Overwrite wins again
Another advantage of the overwrite operation is its simplicity. Though harder to quantify, simpler code translates into reduced development time and better data quality. Given similar (or even slightly worse) performance, I would still prefer overwrite over merge any day when dealing with snapshot inputs.
Summary
We explored four types of data flows:
- Incremental input with an append operation
- CDC input with a merge operation
- Full snapshot input with an overwrite operation
- Full snapshot input with a merge operation

And found that:
- Below roughly 1 million rows, all four flows perform similarly.
- At larger sizes, append (incremental) and merge (CDC) are the most efficient flows.
- For snapshot inputs, overwrite clearly beats merge, although the gap narrows as tables approach 1 billion rows.
- Overwrite is also the simpler flow to build and maintain, which makes it my default choice for snapshot inputs.
Again, these results are very specific to the test bench described above, but I believe the same trends would be observed in different conditions. I encourage you to use the source code provided here and experiment with your own setup.
What about you? Have you ever had to choose between a merge and an overwrite operation when handling downstream snapshot inputs? I'd love to hear about your experience—share your thoughts in the comments below!
And don't forget to follow me for more great content about data and ML engineering.