Apache Beam: Windowing

Thoughts and logs can be lost in time. When I began quarantining, it was March. During a recent #WomenInTech meeting, I realized we were already approving events for June! If time can escape me that quickly, imagine what happens at the scale of a nanosecond.

This was the next step of my journey: with the Stream Pipeline in place, I wanted to manipulate and analyze the data on the fly.

Event Comes in -> Processed -> Fin.

What if I want to look for habits, trends, or anomalies?

If we look at human nature, humans are habitual creatures. Habits are activities done at a regular frequency, such as walking your dog every day or working out three times a week.

But how do we determine a habit?

You examine the actions that a person does every day; anything outside of that daily activity at that specific time would be considered an anomaly, something out of the ordinary.

Here's where my journey led me: Apache Beam's windowing functions.

Windowing

Let's take a step back. A stream is a constant flow of data, and a PCollection represents that flowing data inside a pipeline. How can data be analyzed if it is constantly flowing?

Windowing subdivides the elements of a PCollection into "windows" based on their timestamps. Timestamps can be assigned by the source that produces the data or added to the elements of a PCollection explicitly. [2]
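
If your elements don't arrive with event timestamps already attached, you can assign them yourself before windowing. Here is a minimal sketch using the Java SDK's WithTimestamps transform; the PCollection named lines and the assumption that each element is a comma-separated String whose first field is an epoch-millis value are hypothetical, purely for illustration.

// Minimal sketch: assign event timestamps before windowing.
// Assumes each element is a comma-separated String whose first field
// is an epoch-millis value (a hypothetical format for illustration).
PCollection<String> timestampedLines = lines.apply(
  WithTimestamps.of((String line) ->
    new org.joda.time.Instant(Long.parseLong(line.split(",")[0]))));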

If you're following this series, you will know that a transform applies a function to the elements of a PCollection. With windowing, that same transform is applied window by window, to the elements whose timestamps fall within each configured interval. A full PCollection could be unbounded, so each PCollection is "processed as a succession of multiple, finite windows." [1]

A PCollection can be bounded or unbounded. For an unbounded PCollection, windowing is what makes aggregation practical, because it gives you finite chunks of the stream to work with.

Triggers

For unbounded data, a business use case such as anomaly detection may require results to be aggregated as the data arrives and emitted as soon as they are ready.

The concept of triggers supports these additional flows for the data. Triggers let you refine the windowing strategy of a PCollection by controlling when the aggregated results of each window are emitted: they can fire early, before all of a window's data has arrived, or fire again to handle data that arrives late.
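
To make this concrete, here is a minimal sketch of a triggering configuration in the Java SDK, assuming a hypothetical PCollection<String> named lines and 60-second fixed windows; the 30-second early-firing delay and 2-minute allowed lateness are illustrative placeholders, not recommendations.

PCollection<String> triggered = lines
  .apply(Window.<String>into(FixedWindows.of(Duration.standardSeconds(60)))
    .triggering(AfterWatermark.pastEndOfWindow()
      // Emit an early (speculative) pane 30 seconds after the first element arrives
      .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
        .plusDelayOf(Duration.standardSeconds(30)))
      // Re-emit the window whenever a late element shows up
      .withLateFirings(AfterPane.elementCountAtLeast(1)))
    // Accept elements up to 2 minutes behind the watermark
    .withAllowedLateness(Duration.standardMinutes(2))
    // Each firing includes everything seen so far in the window
    .accumulatingFiredPanes());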

Keep on the lookout for future articles on triggering examples.

Constraints

Windowing does have its own constraints. After a windowing function has been set, the elements' windows are not actually used until the next grouping transform, i.e. on an as-needed basis. Each element is assigned to a window, but GroupByKey and Combine then aggregate on a per-window (and per-key) basis rather than across the whole PCollection. Note that the interaction between windowing and grouping transforms can have side effects on your pipeline.
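
As a small illustration of that per-window behavior, the sketch below counts element occurrences with Count.perElement() (a Combine under the hood) after applying fixed windows; each count covers a single 60-second window rather than the entire stream. The PCollection named lines and the 60-second duration are placeholders for illustration.

// Counts are produced per element value and per 60-second window,
// not across the whole unbounded stream
PCollection<KV<String, Long>> countsPerWindow = lines
  .apply(Window.<String>into(FixedWindows.of(Duration.standardSeconds(60))))
  .apply(Count.perElement());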

Window Types

  • Fixed Time Windows
  • Sliding Time Windows
  • Per-Session Windows
  • Single Global Window

A custom windowing structure can be defined by implementing WindowFn. This will be covered later in the series.

There are also calendar-based windows; however, this window type is not yet supported by the Beam SDK for Python.
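
For reference, here is a minimal sketch of how the other window types listed above are configured in the Java SDK. The PCollection<String> named items and the chosen durations are placeholders for illustration.

// Sliding windows: 60 seconds long, with a new window starting every 30 seconds
items.apply(Window.<String>into(
  SlidingWindows.of(Duration.standardSeconds(60)).every(Duration.standardSeconds(30))));

// Per-session windows: a session closes after a 10-minute gap in activity
items.apply(Window.<String>into(Sessions.withGapDuration(Duration.standardMinutes(10))));

// Single global window: all elements in one window (the default for a new PCollection)
items.apply(Window.<String>into(new GlobalWindows()));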

Fixed Window

A fixed time window represents a consistent-duration, non-overlapping time interval in the data stream. [1]

A fixed time window is the most basic form of windowing. As the unbounded PCollection is continuously updated, each window captures every element whose timestamp falls within a given 60-second interval. The example below shows just this.

// Read from Pub/Sub, analyze each element, then divide the PCollection
// into fixed, non-overlapping 60-second windows.
// AnalyzeElements is assumed to be a DoFn<String, String> defined elsewhere.
PCollection<String> windowedLines = p
  .apply(PubsubIO.readStrings().fromTopic(options.getInputTopic()))
  .apply(ParDo.of(new AnalyzeElements()))
  .apply(Window.<String>into(FixedWindows.of(Duration.standardSeconds(60))));

// Write the windowed elements back out to Pub/Sub
windowedLines.apply(PubsubIO.writeStrings().to(options.getOutputTopic()));

Conclusion

There are other windowing options available with Apache Beam. Keep on the lookout for future articles reviewing these and other topics!

If you enjoyed this article or have suggestions for others, feel free to drop a comment!

References

  1. https://beam.apache.org/documentation/programming-guide/#windowing
  2. https://beam.apache.org/documentation/programming-guide/#adding-timestamps-to-a-pcollections-elements
  3. https://beam.apache.org/releases/javadoc/2.6.0/org/apache/beam/sdk/transforms/windowing/Window.html
