Mastering Streaming Data Pipelines with Kappa Architecture

These days, experience with streaming data is a common requirement in most data engineering job postings. It seems that every business has a need, or at least an appetite, for streaming data. So, what’s all the fuss about? How do we build pipelines that support this type of data flow?

To illustrate the various concepts, we will build a pipeline that processes cryptocurrency prices in near real time and share some latency metrics. The code to deploy the pipeline and review the data is available on GitHub.

Terminology

First things first, let's clarify some terminology. There are many terms used to describe streaming data, some of which are interchangeable in certain contexts. Let's try to demystify the common ones.

Streaming Data

Streaming data is continuously generated and delivered in real-time or near real-time from various sources, such as sensors, social media, or financial transactions. It is typically characterized by low latency, high frequency, and high volume, often requiring immediate processing for timely insights and actions.

Full Reprocessing

This approach involves processing the entire dataset from scratch with each run, without considering any previous processing or partial updates.

Incremental Processing

Incremental processing handles only newly arrived data points on each run, allowing for more efficient and timely updates without reprocessing the entire dataset. This helps maintain low latency and keeps compute effort proportional to the volume of new data.

Batch Processing

Batch processing involves collecting data over a period of time (generally a few hours) and processing it all at once in a single batch, typically at scheduled intervals. This method is suitable for tasks that do not require immediate results and can tolerate delays, such as monthly financial reports or daily data backups.

Micro-batch Processing

Micro-batch processing collects data in small, discrete chunks and processes them at short intervals, combining the advantages of both batch and stream processing. This approach balances real-time processing with reduced overhead compared to processing each data point individually.

Stream Processing

Stream processing continuously ingests, processes, and analyzes data streams in real-time, enabling immediate insights and actions as the data arrives. It is essential for applications requiring low-latency responses, such as real-time analytics and fraud detection.
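
To make these distinctions more concrete, here is a minimal PySpark sketch showing how the same streaming read can be triggered as a one-off, batch-like run or as recurring micro-batches in Spark Structured Streaming. Paths, schema, and checkpoint locations are placeholders, not taken from the article's repository.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder source: JSON events landing in cloud storage
events = (
    spark.readStream.format("json")
    .schema("symbol STRING, price DOUBLE, created_at TIMESTAMP")
    .load("/Volumes/dev/landing/events/")
)

query = (
    events.writeStream.format("delta")
    .option("checkpointLocation", "/tmp/checkpoints/events")  # placeholder
    # .trigger(availableNow=True)          # batch-like: drain available data, then stop
    .trigger(processingTime="1 minute")    # micro-batches: process new files every minute
    .start("/tmp/tables/brz_events")       # placeholder output path
)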

The Fuss

Most modern data sources produce streaming data. Transactional systems, IoT devices, and SaaS activity monitoring all fall into this category, typically generating high volumes of data. This makes full reprocessing inconvenient and costly.

What businesses are (or should be) really interested in is incremental processing. To ensure scalability and keep costs down, they should aim to avoid reprocessing the same data points during normal operations. Batch, micro-batch, and stream processing can all address this need, but they offer different balances of cost, latency, and complexity. Generally, micro-batches offer the best compromise.

As a data engineer or architect, it is generally your role to highlight the pros and cons of the different approaches. A continuously running streaming pipeline will be far more costly than a well-designed micro-batch process running every hour. And most of the time, the former is not even required by the business.

Even as more attention is given to streaming data to support live applications, historical data remains critical and still needs to be stored and made analytics-ready for classic reporting and other analytical workloads.

Architecture

As streaming data became more prevalent, it created the need for an architecture that supports both near real-time analytics for applications and historical analytics for classic reporting. Two main architectures emerged to meet this need: the Lambda and Kappa architectures.

Lambda

Lambda architecture is designed to handle massive quantities of data by utilizing both batch and stream-processing methods. It consists of three layers: the batch layer, which manages historical data and computes results from the entire dataset; the speed layer, which processes real-time data to provide immediate insights; and the serving layer, which merges the outputs from the batch and speed layers to deliver a comprehensive view. This approach ensures data accuracy and low-latency query responses. However, maintaining two separate codebases for batch and real-time processing can increase complexity and development effort.

Kappa

Kappa architecture is designed to simplify the processing of streaming data by using a single stream-processing engine. Unlike Lambda architecture, Kappa eliminates the batch layer, relying entirely on real-time data processing. It continuously processes data as it arrives, allowing for immediate insights and simplified data management. This architecture reduces complexity by maintaining a single codebase and infrastructure for processing. Kappa is particularly suited for use cases where real-time data processing is a priority and batch processing can be avoided.

Building a Pipeline

This is all nice, but how do we put these concepts together and build an actual pipeline for streaming data? Let's find out!

The source code is available here: https://github.com/okube-ai/okube-samples/tree/main/000-streaming-data

Use Case

Imagine you're building a cryptocurrency trading bot. For your algorithm to make a buy, hold, or sell decision, it needs access to Bitcoin's near real-time open, high, low, and close (OHLC) prices for every minute.

Architecture

Our pipeline will be built using the Kappa architecture with continuous micro-batch processing. For simplicity, the market prices data stream will be stored as JSON files in a storage container instead of being published by a streaming service like Kafka.

Stack

There are multiple technologies and frameworks that can be used to build a pipeline meeting the requirements defined in the architecture. In our example, we will use:

  • A continuously running Python process will generate cryptocurrency price events every 5 seconds (to avoid overloading the API) and store them in an Azure storage container.
  • Spark Structured Streaming will be our main processing engine, as it natively supports micro-batches. It is currently one of the most popular options for such tasks.
  • Our pipeline will be built using Databricks Delta Live Tables, a declarative ETL framework that sits on top of Spark and Spark Structured Streaming.
  • Pipeline configuration and data transformations will be defined using Laktory, a DataOps framework that makes it quick to configure and deploy DLT pipelines.

Ingestion Job

The ingestion process involves a simple Python script that uses the Crypto Compare API to fetch Bitcoin prices every 5 seconds and write them to a Unity Catalog Volume. Notice how it uses Laktory's DataEvent model to simplify creating a data event object and saving it to the volume.

https://github.com/okube-ai/okube-samples/blob/main/000-streaming-data/notebooks/sample-000/ingest.py

ingestion script
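
For readers who cannot open the repository, here is a stripped-down sketch of the ingestion loop. It omits Laktory's DataEvent model and writes plain JSON files with the standard library; the response shape of the CryptoCompare price endpoint and the landing path are assumptions for illustration, not taken from the actual script.

import json
import time
from datetime import datetime, timezone

import requests

LANDING_DIR = "/Volumes/dev/sources/landing/asset_prices"  # hypothetical volume path
URL = "https://min-api.cryptocompare.com/data/price"       # CryptoCompare price endpoint

for _ in range(720):  # roughly one hour at one request every 5 seconds
    created_at = datetime.now(timezone.utc)
    resp = requests.get(URL, params={"fsym": "BTC", "tsyms": "USD"}, timeout=10)
    event = {
        "created_at": created_at.isoformat(),
        "symbol": "BTC",
        "data": resp.json(),  # e.g. {"USD": 64210.55}
    }
    filename = f"{LANDING_DIR}/btc_{created_at.strftime('%Y%m%dT%H%M%S')}.json"
    with open(filename, "w") as fp:
        json.dump(event, fp)
    time.sleep(5)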

The result is a collection of event JSON files, stored in the landing volume:

Data events in Databricks volume

Each file represents a single price for a given timestamp.

Event JSON file

Once launched, the script runs for about an hour before terminating on its own.

Pipeline Configuration

The pipeline and the associated data transformations are defined in a YAML configuration file used by Laktory to deploy the required components in the Databricks workspace. The first section defines the pipeline properties, such as the cluster size and the notebooks used.

https://github.com/okube-ai/okube-samples/blob/main/000-streaming-data/resources/pl-sample-000.yaml

pipeline definition

Each table is declared individually. It is generally defined by a source (such as the JSON files mentioned above or another table from the pipeline) and a Spark Chain, a series of Spark-based transformations. Each data source (and its corresponding output) will be treated as a stream because we set read_as_stream = True.
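
Conceptually, read_as_stream toggles between a static Spark read and a streaming read. The sketch below illustrates the difference; the schema and path are placeholders, not Laktory internals.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

schema = "created_at TIMESTAMP, symbol STRING, data STRUCT<USD: DOUBLE>"  # assumed event schema
landing = "/Volumes/dev/sources/landing/asset_prices"                     # hypothetical path

# read_as_stream = False: static read, the full dataset is reprocessed on every run
df_batch = spark.read.format("json").schema(schema).load(landing)

# read_as_stream = True: streaming read, only files not yet processed are picked up
df_stream = spark.readStream.format("json").schema(schema).load(landing)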

Bronze: brz_asset_prices

bronze table definition

This table is a direct representation of the data events and shares the same schema as the JSON files.

Silver: slv_asset_prices

silver table definition

This table selects only the desired columns, specifies their types, and renames them to standard names.
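
In plain PySpark, the silver step would look roughly like the sketch below. The column names and their mapping to the raw event fields are assumptions; in the project, this logic is declared in the YAML Spark Chain rather than written by hand.

import pyspark.sql.functions as F

# brz_asset_prices is assumed to be the bronze DataFrame read in the previous step
slv_asset_prices = brz_asset_prices.select(
    F.col("created_at").cast("timestamp").alias("created_at"),
    F.col("symbol").cast("string").alias("symbol"),
    F.col("data.USD").cast("double").alias("price"),  # hypothetical mapping of the raw price field
)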

Gold: gld_asset_holc

gold table definition

This is the final aggregation we were aiming for. The data is aggregated by asset and by 1-minute time windows. For each group, we compute the open, high, low, and close prices of Bitcoin.

Note the watermark option added to the table source. It tells Spark Structured Streaming how late data can arrive and still be assigned to a given window, which lets the engine finalize windows and discard old state. This feature is crucial for keeping state, and therefore latency, under control. More information is available here: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
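
Expressed directly in PySpark, the aggregation with its watermark would look roughly as follows. The one-minute watermark duration, the column names, and the use of min_by/max_by (Spark 3.3+) to pick the first and last price of each window are assumptions; the project's actual expressions live in the YAML configuration and may differ.

import pyspark.sql.functions as F

# slv_asset_prices is assumed to be the silver DataFrame from the previous step
gld_asset_holc = (
    slv_asset_prices
    .withWatermark("created_at", "1 minute")                 # bound how late an event can be
    .groupBy(F.window("created_at", "1 minute"), "symbol")   # 1-minute windows per asset
    .agg(
        F.min_by("price", "created_at").alias("open"),       # earliest price in the window
        F.max("price").alias("high"),
        F.min("price").alias("low"),
        F.max_by("price", "created_at").alias("close"),      # latest price in the window
    )
)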

With the Laktory CLI, deploying this pipeline and its transformations is as simple as running deploy -e dev.

Results

Given we selected a Kappa architecture, we should be able to demonstrate that this pipeline can handle both streaming data and batch processing. That's exactly what we will do.

Batch Processing

First, let's deploy the pipeline with the read_as_stream option set to False, so that the data is processed entirely at each run instead of incrementally. This is the resulting pipeline:

batch processing

The 582 data points were processed at once, fed into the silver table, and eventually aggregated in the gold table. Notice how the tables are flagged as materialized views. This is how Databricks identifies tables that are static (as opposed to streaming).

Let's explore the resulting silver table:

batch processing - silver stock prices

As expected, each data point has its corresponding row in the table. The created_at timestamp indicates when the price was requested from the Crypto Compare API. The bronze_at and silver_at columns are timestamps automatically added by Laktory to identify when exactly a row of a given layer was written to the table.

Subtracting the created_at from the silver_at allows us to compute the ingestion and processing latency:

batch processing - latency

Because the entire table was written in a single operation, each row has the same value for silver_at. The latency in this case is not of particular interest as the only information it provides is how much data was accumulated before running the pipeline.
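
The latency figures can be reproduced with a simple query on the silver table, along these lines. This is a sketch assuming the column names described above; depending on your workspace, the table name may need its full catalog.schema prefix.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

latency = (
    spark.table("slv_asset_prices")  # may require a catalog.schema prefix
    .withColumn("latency_s", F.col("silver_at").cast("long") - F.col("created_at").cast("long"))
)

latency.agg(
    F.min("latency_s").alias("min_s"),
    F.avg("latency_s").alias("avg_s"),
    F.max("latency_s").alias("max_s"),
).show()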

Micro-Batch Processing

Next, we delete the tables, change read_as_stream to True in the YAML configuration and rerun the pipeline.

micro batch processing

We get a very similar result to the batch processing, with one notable exception: the tables are now flagged as Streaming tables, meaning their content is incrementally updated each time the pipeline runs and new unprocessed data is available. To showcase this behavior, let's shut down the ingestion script and rerun the pipeline:

Because no new data points were available, the pipeline did not insert any new rows or update existing ones. This is what we would expect from incremental processing. Let's ingest 5 additional data points and rerun the pipeline:

Great! We now have our 5 new data points added to the existing tables. Notice that even the gold table only had 1 row updated. Thanks to the watermark we defined earlier, we have an upper bound on the possible lateness of an event, which prevents having to update all existing rows.

At this point, we could schedule the pipeline to run at a fixed interval (hourly, for example) depending on our latency requirements. This approach is both time and cost-effective, as processing only new data points reduces computation time and the required cluster size.

We could stop here, but DLT has one last party trick worth exploring. Remember that continuous option we set to False in our pipeline?

Switching it to True results in a pipeline that, as you guessed, is now continuously running. Anytime a new JSON file appears in the cloud storage, it will trigger an update and insert the new rows.


This continuous mode is not pure stream processing (see the definition above) as it still operates using micro-batches. However, it offers very low latencies while supporting very high throughput. The obvious drawback compared to scheduled runs is the increased cost, as a cluster needs to be kept alive continuously. Speaking of latencies, let's revisit the latency graph and see the impact of running incremental loads.

As expected, the results for the initial run of the streaming tables are very similar to the batch processing, as a big chunk of rows was written at once. However, on subsequent runs, the maximum latency is much lower as only a small number of data points need to be processed. Let's take a closer look at the flat regions of the graph.

These two plateaus are when the pipeline was running continuously. Because the data was processed as it was received, the latency was reduced to about 8 seconds.

This is great! But for many applications, 8 seconds would still be considered very high. Why aren't we reaching sub-second figures? There are a few reasons for this. First, by definition, our latency includes the latency of the API we are using to fetch Bitcoin prices, which we have no control over. Second, for simplicity, we are writing these events to a storage container instead of using a real streaming service like Kafka, which probably adds a second or two. Lastly, our pipeline not only processes, cleans, and aggregates the data, but also writes each intermediate result (bronze, silver, gold) to a data table. The impact of these disk operations on the latency can be easily demonstrated by changing the bronze table to a view.

With this configuration, bronze data is not written to disk, and we observe a drop in the latency from 8 seconds down to 4 seconds.

In other words, for truly low latency requirements, writing intermediate tables is not a viable option. Your final pipeline output should instead be another streaming service connected to your application. Does that prevent us from adhering to the Kappa architecture? No, because we can deploy two different pipelines using the same compute engine and the same transformations. One pipeline runs continuously and does not write tables, while the other is scheduled and writes the desired tables.

Conclusions

Through the ingestion and processing of Bitcoin prices, we demonstrated how the theory of streaming data pipelines applies in a real-life scenario. A few takeaways:

  1. Challenge stakeholders asking for true stream processing or sub-second latencies. In most cases, what they need is a scalable, incremental processing framework, typically implemented through micro-batches.
  2. The Kappa architecture, which uses a single processing engine for both streaming and batch operations, is generally the best option when using modern technologies as it greatly simplifies the system.
  3. Building a streaming data pipeline can be straightforward if the right tools are used. Leveraging Databricks DLT and Laktory's DataOps approach, we built and deployed a pipeline in under an hour.
  4. Designing a streaming data pipeline is always about balancing cost and latency. The closer you need to be to near real-time, the more expensive it will be to run. Which brings us back to takeaway #1: always challenge why low latency is required.


What about you? Have you tried to implement streaming data pipelines in real-life scenarios? Share your experiences below!

And don't forget to follow me for more great content about data engineering.
