Apache Spark Structured Streaming
Introduction
Apache Spark Structured Streaming is a scalable, fault-tolerant stream processing engine built on top of Spark SQL. Unlike traditional stream processing frameworks, it treats streams as incremental tables and processes data using declarative SQL-style queries.
Why Spark Structured Streaming?
Input Sources: file sources (CSV, JSON, Parquet, ORC), Apache Kafka, and the socket and rate sources (for testing and benchmarking).
Sink Targets: file sinks, Kafka, console and memory sinks (for debugging), and foreach/foreachBatch for custom write logic.
Operation:
Input data flows into Spark and is treated as an unbounded table: each new batch of records is appended as new rows. Spark incrementally runs the query over this table whenever new data arrives, producing a result table that is written to the sink according to the configured output mode.
Output Modes in Structured Streaming
Append (default): only rows newly added to the result table are written to the sink. Suited for queries without aggregations.
Update: only rows that were updated (or newly added) in the result table since the last trigger are written.
Complete: the entire result table is written to the sink after every trigger. Requires an aggregation.
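The unbounded-table model and the three output modes can be illustrated with a plain-Python sketch (a toy model, not the Spark API): each micro-batch appends rows, the query recomputes a count per key, and the mode decides which result rows reach the sink.

```python
from collections import Counter

def run_batches(batches, mode):
    """Toy model of Structured Streaming output modes.
    Each batch is a list of keys; the query is a count per key."""
    result = Counter()          # the "result table"
    emitted = []                # what the sink receives per trigger
    for batch in batches:
        previous = dict(result)
        result.update(batch)    # incrementally update the result table
        if mode == "complete":
            emitted.append(dict(result))                 # whole table
        elif mode == "update":
            emitted.append({k: v for k, v in result.items()
                            if previous.get(k) != v})    # changed rows only
        elif mode == "append":
            emitted.append({k: result[k] for k in result
                            if k not in previous})       # brand-new rows only
    return emitted

batches = [["a", "b"], ["a", "c"]]
print(run_batches(batches, "update"))   # [{'a': 1, 'b': 1}, {'a': 2, 'c': 1}]
print(run_batches(batches, "append"))   # [{'a': 1, 'b': 1}, {'c': 1}]
```

Note that real append mode disallows non-windowed aggregations; this sketch only shows which rows each mode selects from the result table.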
Fault Tolerance: Checkpoint Directory
A checkpoint directory is a storage location (e.g., HDFS, S3, DBFS) where Spark persists metadata and intermediate states for a streaming job. If a failure occurs, Spark can restart from the last successful state rather than reprocessing all data from the beginning.
Key Components Stored in the Checkpoint Directory
Offsets: the range of input offsets planned for each micro-batch.
Commits: markers recording which micro-batches completed successfully.
Metadata: the unique identifier of the streaming query.
State: intermediate state for stateful operations such as aggregations and joins.
df.writeStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("topic", "sensor_data") \
.option("checkpointLocation", "hdfs:///user/spark/checkpoints/") \
.start()
Triggers
Triggers in Spark Structured Streaming define how frequently the streaming query processes new data. They control the batching behavior of micro-batches, influencing latency and throughput.
1. Default Trigger (Micro-Batch Mode)
If no trigger is specified, Spark runs in micro-batch mode, starting each new batch as soon as the previous one completes.
df.writeStream \
.format("console") \
.start()
2. Processing Time Trigger (Trigger.ProcessingTime)
Runs a micro-batch at a fixed interval. In PySpark the trigger is passed as a keyword argument (Trigger.ProcessingTime is the Scala/Java API):
df.writeStream \
.format("console") \
.trigger(processingTime="10 seconds") \
.start()
3. One-Time Trigger (Trigger.Once)
Processes all available data in a single micro-batch and then stops (in PySpark: trigger(once=True); Spark 3.3+ prefers trigger(availableNow=True)):
df.writeStream \
.format("parquet") \
.option("path", "s3://output-data/") \
.trigger(once=True) \
.start()
4. Continuous Processing (Trigger.Continuous)
An experimental low-latency mode (millisecond-scale) that supports only map-like operations (e.g., select, filter) with Kafka sources and sinks.
Example: Continuous Processing (Low Latency Mode)
df.writeStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("topic", "alerts") \
.trigger(continuous="1 second") \
.start()
Watermarking & Late Data Handling
Watermarking allows Structured Streaming to handle late events while maintaining performance and accuracy.
df.withWatermark("timestamp", "10 minutes") \
.groupBy(window(col("timestamp"), "5 minutes")) \
.count()
How It Works
Spark tracks the maximum event time observed so far and sets the watermark to that maximum minus the delay threshold (10 minutes above). Windows older than the watermark are finalized and their state is dropped; events that arrive behind the watermark are ignored.
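Spark's watermark bookkeeping can be sketched in plain Python (a toy model, not Spark's implementation): track the maximum event time seen, derive the watermark from it, and reject events that arrive behind it.

```python
from datetime import datetime, timedelta

class WatermarkTracker:
    """Toy watermark: max event time seen minus an allowed lateness."""
    def __init__(self, delay: timedelta):
        self.delay = delay
        self.max_event_time = None

    def accept(self, event_time: datetime) -> bool:
        """Return True if the event is on time, False if it is too late."""
        if self.max_event_time is not None:
            on_time = event_time >= self.max_event_time - self.delay
        else:
            on_time = True                     # first event is always on time
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time   # watermark only moves forward
        return on_time

wm = WatermarkTracker(timedelta(minutes=10))
t0 = datetime(2024, 1, 1, 12, 0)
print(wm.accept(t0))                          # True  — first event
print(wm.accept(t0 - timedelta(minutes=5)))   # True  — late, but within 10 min
print(wm.accept(t0 - timedelta(minutes=15)))  # False — beyond the watermark
```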
Stateful Aggregations & Windowing in Streaming
Structured Streaming allows stateful processing, maintaining aggregates over a time window.
Tumbling Windows (Fixed Intervals)
df.groupBy(window(col("timestamp"), "5 minutes")).count()
Sliding Windows (Overlapping Intervals)
df.groupBy(window(col("timestamp"), "5 minutes", "2 minutes")).count()
Session Windows (User Activity-Based Windows)
df.groupBy(session_window(col("timestamp"), "10 minutes")).count()
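To build intuition for how the first two window types bucket events, here is a plain-Python sketch (illustrative, not Spark's implementation) that computes which tumbling and sliding windows a timestamp falls into, with times as epoch seconds.

```python
def tumbling_window(ts: int, size: int) -> tuple:
    """The single fixed, non-overlapping window of `size` containing ts."""
    start = ts - (ts % size)
    return (start, start + size)

def sliding_windows(ts: int, size: int, slide: int) -> list:
    """All overlapping windows of `size`, starting every `slide`, containing ts."""
    windows = []
    start = ts - (ts % slide)       # latest window start at or before ts
    while start + size > ts:        # walk back while the window still covers ts
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)

print(tumbling_window(370, 300))        # (300, 600) — the 5-minute bucket
print(sliding_windows(370, 300, 120))   # [(120, 420), (240, 540), (360, 660)]
```

An event belongs to exactly one tumbling window but to roughly size/slide sliding windows, which is why sliding windows keep more state.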
Join Operations in Structured Streaming
Structured Streaming supports stream-static and stream-stream joins.
Stream-Static Joins
static_df = spark.read.format("jdbc").option("url", "jdbc:mysql://db").load()
df.join(static_df, "id", "left")
Stream-Stream Joins
df1.join(df2, expr("df1.id = df2.id AND df1.timestamp >= df2.timestamp - interval 10 minutes"))
Watermarks should be defined on both streams so Spark can bound the buffered join state.
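The time condition in a join like the one above can be sketched in plain Python (a toy, not Spark's implementation): match rows on the key and keep only pairs whose left-side timestamp is no more than the allowed lag behind the right side.

```python
from datetime import datetime, timedelta

def interval_join(left, right, max_lag=timedelta(minutes=10)):
    """Toy stream-stream join: rows are (id, timestamp) tuples.
    Keeps pairs with matching id where left.ts >= right.ts - max_lag."""
    return [(l, r) for l in left for r in right
            if l[0] == r[0] and l[1] >= r[1] - max_lag]

t = datetime(2024, 1, 1, 12, 0)
left_rows  = [("a", t), ("b", t - timedelta(minutes=20))]
right_rows = [("a", t), ("b", t)]
print(interval_join(left_rows, right_rows))
# only the "a" pair survives; "b" is 20 minutes older than its match
```

The time bound is what lets Spark eventually discard buffered rows; without it (and without watermarks) the join state would grow forever.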
Optimizations in Spark Structured Streaming
RocksDB-based Stateful Processing
From Spark 3.2, the state store can be backed by RocksDB, keeping large state off the JVM heap and reducing GC pressure:
spark.conf.set("spark.sql.streaming.stateStore.providerClass", "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")
Kafka Source Partitioning
Spark maps Kafka topic partitions to tasks for parallel reads; source options such as startingOffsets control where consumption begins:
spark.readStream.format("kafka").option("startingOffsets", "earliest")
Adaptive Query Execution (AQE)
spark.conf.set("spark.sql.adaptive.enabled", "true")
Conclusion
Apache Spark Structured Streaming bridges the gap between batch and streaming processing, offering a declarative, SQL-driven approach. With support for exactly-once semantics, stateful aggregations, and integrations with Kafka, Delta Lake, and cloud storage, it is one of the most powerful streaming engines available today.