Day 5: Windowing and Triggers in Apache Beam
Mohsin Shaikh
Hi all! So far, we've covered core transformations, grouping, and combining data. Today, we take a significant step toward real-time data processing by introducing Windowing and Triggers. These concepts allow you to process unbounded datasets (like streaming data) by breaking them into manageable chunks.
Let’s get started!
What is Windowing in Apache Beam?
In data streaming, you don’t always have the luxury of waiting for all your data to arrive before processing it. Windowing solves this problem by dividing the incoming data into logical windows based on time, so you can process elements incrementally.
In Beam, windowing enables you to:
- Group elements by event time rather than by arrival order
- Apply aggregations (counts, sums, averages) over bounded slices of an unbounded stream
- Reason about the completeness of results using watermarks
Types of Windows
Beam supports several types of windows, each suited to a different streaming scenario:
- Fixed (tumbling) windows: non-overlapping windows of a constant size, e.g. one window per minute
- Sliding windows: fixed-size windows that overlap, e.g. a 60-second window that starts every 30 seconds
- Session windows: windows of activity separated by a minimum gap of inactivity, useful for per-user sessions
- Global window: a single window covering all data (the default for bounded sources)
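As a quick illustration, here is a minimal sketch of how each window type is attached to a PCollection with beam.WindowInto. It assumes an existing PCollection named events, and the durations are arbitrary values chosen for the example:

import apache_beam as beam
from apache_beam.transforms.window import (
    FixedWindows, SlidingWindows, Sessions, GlobalWindows)

# Non-overlapping 60-second windows
fixed = events | 'Fixed' >> beam.WindowInto(FixedWindows(60))

# 60-second windows starting every 30 seconds (an element can fall in two windows)
sliding = events | 'Sliding' >> beam.WindowInto(SlidingWindows(60, 30))

# Windows of activity separated by gaps of at least 10 minutes of inactivity
sessions = events | 'Sessions' >> beam.WindowInto(Sessions(10 * 60))

# One window for the whole stream (the default)
global_w = events | 'Global' >> beam.WindowInto(GlobalWindows())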
Example: Fixed Window in Apache Beam
Let’s create an example using a fixed window where we group and count elements over 1-minute windows.
import apache_beam as beam
from apache_beam.transforms.window import FixedWindows
from datetime import datetime

def parse_time(time_str):
    # Convert the string to a datetime object and return a Unix timestamp
    dt = datetime.strptime(time_str, '%Y-%m-%d %H:%M:%S')
    return dt.timestamp()

with beam.Pipeline() as p:
    events = p | 'Create Events' >> beam.Create([
        ('event1', '2024-10-06 00:01:00'),
        ('event2', '2024-10-06 00:01:30'),
        ('event3', '2024-10-06 00:02:00'),
        ('event4', '2024-10-06 00:02:30')
    ])

    # Attach event-time timestamps, then assign events to 1-minute fixed windows
    windowed_events = (events
        | 'Assign Timestamps' >> beam.Map(lambda x: beam.window.TimestampedValue(x, parse_time(x[1])))
        | 'Fixed Window' >> beam.WindowInto(FixedWindows(60)))

    # Count events in each window
    event_counts = (windowed_events
        | 'Extract Event' >> beam.Map(lambda x: ('event', 1))
        | 'Count Per Window' >> beam.CombinePerKey(sum))

    event_counts | 'Print Results' >> beam.Map(print)
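With two events falling into each 1-minute window, this should print ('event', 2) twice, once per window. The plain Map(print) hides which window each count belongs to, though; if you want to see the window bounds too, a small DoFn like the hypothetical sketch below can replace the final step, using Beam's DoFn.WindowParam to access the current window:

class PrintWithWindow(beam.DoFn):
    def process(self, element, window=beam.DoFn.WindowParam):
        # window is the IntervalWindow this element was assigned to
        print(f'[{window.start.to_utc_datetime()} - {window.end.to_utc_datetime()}] {element}')
        yield element

# Replace the last step of the pipeline with:
# event_counts | 'Print Results' >> beam.ParDo(PrintWithWindow())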
Triggers: Controlling When Results Are Emitted
In streaming scenarios, Beam needs to know when to emit results. This is where triggers come into play. Triggers let you control when Beam outputs the results of a windowed computation. Without an explicit trigger, Beam uses a default trigger that emits a window's results only once the watermark passes the end of the window.
Triggers let you:
- Emit early (speculative) results before a window closes
- Emit updated results when late data arrives after the watermark
- Control, via the accumulation mode, whether each firing includes previously emitted elements or only new ones
Types of Triggers
Beam's most commonly used triggers include:
- AfterWatermark: the default event-time trigger; fires when the watermark passes the end of the window, optionally with early and late firings
- AfterProcessingTime: fires after a given amount of processing time has elapsed since the first element in the window
- AfterCount: a data-driven trigger that fires after a given number of elements arrive
- Composite triggers such as Repeatedly, AfterAny, and AfterAll, which combine the triggers above
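As a minimal sketch of how these are wired up (assuming events is an existing PCollection, with arbitrary window sizes and counts for illustration), a trigger and an accumulation mode are passed to beam.WindowInto together:

import apache_beam as beam
from apache_beam.transforms.window import FixedWindows
from apache_beam.transforms.trigger import (
    AfterWatermark, AfterProcessingTime, AfterCount, Repeatedly, AccumulationMode)

# The default behaviour, written explicitly: fire once the
# watermark passes the end of the window
windowed = events | 'Default' >> beam.WindowInto(
    FixedWindows(60),
    trigger=AfterWatermark(),
    accumulation_mode=AccumulationMode.DISCARDING)

# A data-driven trigger: fire repeatedly after every 5 elements
windowed = events | 'EveryFive' >> beam.WindowInto(
    FixedWindows(60),
    trigger=Repeatedly(AfterCount(5)),
    accumulation_mode=AccumulationMode.ACCUMULATING)

# A processing-time trigger: fire 30 seconds after the first element arrives
windowed = events | 'Every30s' >> beam.WindowInto(
    FixedWindows(60),
    trigger=Repeatedly(AfterProcessingTime(30)),
    accumulation_mode=AccumulationMode.DISCARDING)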
Example: Using Event Time Trigger with Late Data
Let's extend the previous example to fire early, speculative results and to handle late data that arrives after the window has closed.
import apache_beam as beam
from apache_beam.transforms.window import FixedWindows
from apache_beam.transforms.trigger import (
    AfterWatermark, AfterProcessingTime, AfterCount, AccumulationMode)
from datetime import datetime

def parse_time(time_str):
    # Convert the string to a datetime object and return a Unix timestamp
    dt = datetime.strptime(time_str, '%Y-%m-%d %H:%M:%S')
    return dt.timestamp()

with beam.Pipeline() as p:
    events = p | 'Create Events' >> beam.Create([
        ('event1', '2024-10-06 00:01:00'),
        ('event2', '2024-10-06 00:01:30'),
        ('event3', '2024-10-06 00:02:00'),
        ('late_event', '2024-10-06 00:01:15')
    ])

    # Assign timestamps and apply windowing
    windowed_events = (events
        | 'Assign Timestamps' >> beam.Map(lambda x: beam.window.TimestampedValue(x, parse_time(x[1])))
        # Apply a 1-minute fixed window with early and late firings
        | 'Apply Fixed Window' >> beam.WindowInto(
            FixedWindows(60),
            trigger=AfterWatermark(
                early=AfterProcessingTime(10),  # speculative firing every 10s of processing time
                late=AfterCount(1)),            # re-fire for each late element
            accumulation_mode=AccumulationMode.DISCARDING,  # each firing emits only new elements
            allowed_lateness=60))               # accept data up to 60s past the end of the window

    # Count events in each window, including late data
    event_counts = (windowed_events
        | 'Extract Event' >> beam.Map(lambda x: ('event', 1))
        | 'Count Per Window' >> beam.CombinePerKey(sum))

    event_counts | 'Print Results' >> beam.Map(print)
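Note that with AccumulationMode.DISCARDING, each firing reports only the elements that arrived since the previous firing; switching to AccumulationMode.ACCUMULATING would make every firing report the full running count for the window. Also, on a bounded in-memory source like beam.Create, the DirectRunner typically advances the watermark only after all elements have been emitted, so the early and late firings here are easiest to observe against a real streaming source such as Pub/Sub.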
Windowing and Triggers in Real-World Use Cases
Windowing and triggers are vital for real-time processing in industries such as:
- Finance: detecting fraudulent transactions within short time windows
- E-commerce: computing per-session user behavior and live conversion metrics
- IoT: aggregating sensor readings into rolling averages and alerting on anomalies
- AdTech: counting impressions and clicks per minute for real-time bidding and billing
Goal: Mastering Windowing and Triggers
With Windowing and Triggers, we have unlocked the ability to handle unbounded data streams and build powerful real-time data pipelines. These tools let you break large streams into manageable pieces and control when and how results are emitted.
In the next post, we’ll dive into Side Inputs and Side Outputs, critical features for more complex pipeline needs. Stay tuned!