Apache Spark Structured Streaming

Introduction

Apache Spark Structured Streaming is a scalable, fault-tolerant stream processing engine built on top of Spark SQL. Unlike traditional stream processing frameworks, it treats streams as incremental tables and processes data using declarative SQL-style queries.

Why Spark Structured Streaming?

  • End-to-end exactly-once processing (no duplicate records).
  • Scalable: Handles millions of events per second.
  • Integrated with batch & ML pipelines.
  • Supports stateful aggregations & windowing functions.


Input Sources:

  • Kafka, Kinesis, Socket, Files, Delta Lake, S3, etc.

Sink Targets:

  • Kafka, S3, Delta, MySQL, PostgreSQL, ElasticSearch, etc.


Operation:

The input data stream flows into Spark and is treated as an unbounded table: each time new data arrives, it is appended as new rows to that table. Spark incrementally runs the query over the input table as it grows, producing a result table that is written to the sink according to the configured output mode.
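A minimal sketch of this flow, assuming a Kafka broker at localhost:9092, a hypothetical sensor_data topic, and the spark-sql-kafka connector on the classpath; the console sink is used purely for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("structured-streaming-demo").getOrCreate()

# Each new Kafka record is appended as a row to the unbounded input table.
df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "sensor_data") \
    .load()

# The query runs incrementally over the input table; the result table is
# written to the sink (console here) in the chosen output mode.
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") \
    .writeStream \
    .format("console") \
    .start() \
    .awaitTermination()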


Output Modes in Structured Streaming

Append: This is the default output mode. Only rows added since the last trigger are written to the sink. Suited for queries without aggregations.

Update: Only rows that have changed since the last trigger (newly added or updated aggregate values) are written to the sink.

Complete: The entire result table is rewritten to the sink on every trigger. Requires an aggregation in the query.
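A sketch of setting the output mode on the writer, assuming df is a streaming DataFrame with a hypothetical device_id column; the console sink is used for illustration:

from pyspark.sql.functions import col

# Complete mode: the full aggregated result table is emitted on every trigger.
# outputMode("update") would instead emit only the counts that changed in this trigger.
df.groupBy(col("device_id")).count() \
    .writeStream \
    .outputMode("complete") \
    .format("console") \
    .start()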


Fault Tolerance: Checkpoint Directory

A checkpoint directory is a storage location (e.g., HDFS, S3, DBFS) where Spark persists metadata and intermediate states for a streaming job. If a failure occurs, Spark can restart from the last successful state rather than reprocessing all data from the beginning.

Key Components Stored in the Checkpoint Directory

  1. Metadata Logs – Stores information about the executed queries.
  2. Offsets – Keeps track of processed data offsets (e.g., Kafka topic partitions).
  3. State Store – Stores intermediate aggregation results for stateful operations like windowed aggregations.
  4. Committed Files – Tracks successfully written files to prevent duplication.

# df is assumed to contain a string/binary 'value' column (and optionally 'key'),
# as required by the Kafka sink; the sink topic is set with the 'topic' option.
df.writeStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("topic", "sensor_data") \
    .option("checkpointLocation", "hdfs:///user/spark/checkpoints/") \
    .start()


Triggers

Triggers in Spark Structured Streaming define how frequently the streaming query processes new data. They control the batching behavior of micro-batches, influencing latency and throughput.

1. Default Trigger (Micro-Batch Mode)

  • Spark automatically starts the next micro-batch as soon as the previous one completes.
  • It runs as fast as possible with minimum latency.
  • This is the default behavior if no trigger is specified.
  • Best for: Low-latency applications where throughput is not a major concern.

df.writeStream \
    .format("console") \
    .start()

2. Processing Time Trigger (Trigger.ProcessingTime)

  • Runs the query at a fixed interval, e.g., every 10 seconds.
  • Reduces resource usage by avoiding continuous execution.
  • Ensures periodic data processing.
  • Suited to use cases where a slight processing delay is acceptable (e.g., dashboard updates).
  • Helps control resource usage in cost-sensitive environments.

# In PySpark the trigger is set with a keyword argument
# (Trigger.ProcessingTime is the Scala/Java API).
df.writeStream \
    .format("console") \
    .trigger(processingTime="10 seconds") \
    .start()

3. One-Time Trigger (Trigger.Once)

  • The query processes all available data only once and then stops.
  • Useful for batch-like processing of streaming data.
  • Works well when using structured streaming for ETL.
  • Best for: scheduled batch jobs that run periodically.
  • Best for: streaming data ingestion pipelines into a data lake.

# trigger(once=True) is the PySpark equivalent of Trigger.Once;
# newer Spark versions also offer trigger(availableNow=True).
df.writeStream \
    .format("parquet") \
    .option("path", "s3://output-data/") \
    .trigger(once=True) \
    .start()

4. Continuous Processing (Trigger.Continuous)

  • Enables low-latency processing (milliseconds-level).
  • No micro-batches; each record is processed as it arrives.
  • Useful for real-time applications like fraud detection.
  • Best for: Ultra-low latency use cases (e.g., real-time fraud detection, IoT streaming). Applications where delays of seconds are not acceptable.
  • Limitations: supports only map-like operations such as select and filter (no aggregations), provides at-least-once guarantees, and works with a limited set of sources and sinks (e.g., Kafka).

Example: Continuous Processing (Low Latency Mode)

# continuous="1 second" is the PySpark equivalent of Trigger.Continuous;
# the interval is the checkpoint interval, not a batching interval.
df.writeStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("topic", "alerts") \
    .trigger(continuous="1 second") \
    .start()

Watermarking & Late Data Handling

Watermarking allows Structured Streaming to handle late events while maintaining performance and accuracy.

from pyspark.sql.functions import col, window

df.withWatermark("timestamp", "10 minutes") \
    .groupBy(window(col("timestamp"), "5 minutes")) \
    .count()

How It Works

  • If an event arrives within 10 minutes of the maximum event time seen so far → it is included in its window.
  • If it arrives more than 10 minutes late → it is considered too late and may be dropped from the result.


Stateful Aggregations & Windowing in Streaming

Structured Streaming allows stateful processing, maintaining aggregates over a time window.

Tumbling Windows (Fixed Intervals)

  • Tumbling windows are a series of fixed-sized, non-overlapping, contiguous time intervals.
  • Each input record belongs to exactly one window.
df.groupBy(window(col("timestamp"), "5 minutes")).count()        

Sliding Windows (Overlapping Intervals)

  • Sliding windows are fixed-sized like tumbling windows, but they overlap when the slide duration is shorter than the window duration, so a single input record can belong to multiple windows.

  • In the example below, 5-minute windows are computed every 2 minutes, giving more frequent updates over recent data.

df.groupBy(window(col("timestamp"), "5 minutes", "2 minutes")).count()        

Session Windows (User Activity-Based Windows)

  • A session window has a dynamic length that depends on the input: it starts with the first input and extends itself whenever another input arrives within the gap duration. With a static gap duration, the session window closes once no input is received within the gap after the latest input.

  • Windows are therefore created dynamically per key based on activity (e.g., user sessions).

from pyspark.sql.functions import session_window

df.groupBy(session_window(col("timestamp"), "10 minutes")).count()


Join Operations in Structured Streaming

Structured Streaming supports stream-static and stream-stream joins.

Stream-Static Joins

  • Useful when you need to enrich streaming data with a lookup table.

static_df = spark.read.format("jdbc").option("url", "jdbc:mysql://db").load()
df.join(static_df, "id", "left")

Stream-Stream Joins

  • Requires watermarking for state management.

df1.join(df2, expr("df1.id = df2.id AND df1.timestamp >= df2.timestamp - interval 10 minutes"))


Optimizations in Spark Structured Streaming

RocksDB-based Stateful Processing

  • Spark 3.x introduced RocksDB for better memory efficiency in stateful processing.
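A minimal sketch of switching the state store to RocksDB (available since Spark 3.2); the configuration should be set before the streaming query starts:

# Use RocksDB instead of the default in-memory, HDFS-backed state store.
spark.conf.set(
    "spark.sql.streaming.stateStore.providerClass",
    "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider"
)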


Kafka Source Partitioning

  • Spark maps each Kafka topic partition to a separate task, so input ingestion scales with the number of partitions.

# startingOffsets is a Kafka source option, set on the stream reader:
spark.readStream.format("kafka").option("startingOffsets", "earliest")


Adaptive Query Execution (AQE)

  • Dynamically optimizes join strategies & partitioning based on runtime stats.

spark.conf.set("spark.sql.adaptive.enabled", "true")        


Conclusion

Apache Spark Structured Streaming bridges the gap between batch and streaming processing, offering a declarative, SQL-driven approach. With support for exactly-once semantics, stateful aggregations, and integrations with Kafka, Delta Lake, and cloud storage, it is one of the most powerful streaming engines available today.
