Stream Pipeline with Apache Beam and Google Cloud Pub/Sub

A basic pipeline can solve some business scenarios. However, why adopt a new technology only to feed it a high-maintenance input such as a text file? If you are hoping to step up your Apache Beam game, it may be time to use Cloud Pub/Sub for your I/O.

As part of my exploration, I decided to leverage cloud services. This industry trend is not only cost efficient but also time efficient. Imagine deploying each of your services individually! Let's not forget configuring them to communicate and authenticate with one another. These are nightmares from my past, when I began my development career almost a decade ago, with supervisor processes, nginx configs, pip installs, local virtualenvs, and vim. Every feature release was a fingers-crossed affair.

1. Getting started with Pub/Sub

To get started with Apache Beam itself, head over to Stream Pipeline with Apache Beam.

Engineers once relied on text files for data analysis. Systems generated log files, and a secondary process would read them, modify them, and create an output file, or kick off an additional process when a threshold or trigger was hit. Maintaining the log files, the individual processes, and the triggers/thresholds became overwhelming for organizations.

As technology grew, queueing services such as RabbitMQ became available to handle message queues for applications. In today's day and age, cloud providers have stepped up their game, notably with Google's Pub/Sub messaging service.

What is Pub/Sub?

"Pub/Sub is an asynchronous messaging service that decouples services that produce events from services that process events. " - Cloud Pub/Sub Doc [1]

Pub/Sub can be used as message-oriented middleware, for ingestion of events or system logs, and for streaming analytics pipelines, thanks to its storage and real-time message delivery. Pub/Sub servers are available in all Google Cloud regions and offer consistent, scalable performance and high availability.

Scenarios

Before taking the time to explore a new technology, make sure your use cases fit it. Cloud Pub/Sub benefits scenarios such as:

  1. Workload balancing within network clusters.
  2. Asynchronous workflows.
  3. Event notifications distribution.
  4. Stream log collection for analysis.
  5. Data streaming from various processes, such as window events, or devices, such as network traffic.
  6. System reliability improvements.

Concepts

  • Publisher - an application that creates and sends messages to a topic. Messages could be system logs and audits, network traffic, and session logs.
  • Topic - A named resource to which publishers send messages.
  • Subscription - A stream of messages received from a single, specific topic, awaiting delivery to subscribed applications. Applications call a subscription to pull its messages.
  • Message - A publisher combines data and optional attributes to create a message. The message is sent to a topic and eventually delivered to its appropriate subscribers.
  • Message attribute - A publisher can define additional attributes for a message. These attributes are key-value pairs in a message; both data and attributes appear in the publisher sketch after this list.
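
To make the concepts concrete, here is a minimal sketch of a publisher using the Pub/Sub Java client library. The project name, topic name, payload, and attribute are placeholders, and error handling is omitted for brevity:

import com.google.cloud.pubsub.v1.Publisher;
import com.google.protobuf.ByteString;
import com.google.pubsub.v1.PubsubMessage;
import com.google.pubsub.v1.TopicName;

public class PublishExample {
  public static void main(String[] args) throws Exception {
    // Hypothetical project and topic names.
    TopicName topic = TopicName.of("my-project", "my-topic");
    Publisher publisher = Publisher.newBuilder(topic).build();

    // A message combines a data payload with optional key-value attributes.
    PubsubMessage message = PubsubMessage.newBuilder()
        .setData(ByteString.copyFromUtf8("session started"))
        .putAttributes("origin", "auth-service")
        .build();

    publisher.publish(message); // returns an ApiFuture<String> holding the message ID
    publisher.shutdown();
  }
}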

Setup

To use Pub/Sub for development, there are prerequisites for setup:

  1. A system (e.g., Windows, macOS, or Linux) with Python and Git installed and configured.
  2. A Google account.
  3. On Google Cloud Console, activate your cloud project, if you haven't already.
  4. Create or select an existing Google project. (Ensure you have proper permissions to that project)
  5. Enable the API within the Pub/Sub section of Cloud Console.
  6. Create your topic and provide it a name.
  7. Select the topic created and create your subscription.
  8. Now let's create your service account via the service account section of the console.
  9. Select the project and create a service account.
  10. A service account requires publish and subscribe permissions. After selecting Create, use the "Select a role" dropdown and add the Pub/Sub Publisher role.
  11. Add another role and select the Pub/Sub Subscriber role, then continue with creation.
  12. A key will be required by the client library in order to access the API. Create an access key and select JSON.
  13. Save the key file as JSON in your project repo, e.g., ~/Projects/pubsub_repo/cred.json (a sketch of loading this key follows this list).
  14. Installation and configuration of the Cloud SDK will also be required. More details can be found in the Cloud SDK documentation.
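
A quick note on credentials: the client libraries pick the key up automatically if you set the GOOGLE_APPLICATION_CREDENTIALS environment variable to its path. As a minimal sketch, assuming the cred.json path from step 13, you can also load it explicitly in Java:

import com.google.auth.oauth2.GoogleCredentials;
import java.io.FileInputStream;

// Load the service account key saved in step 13 (the path is an example).
GoogleCredentials credentials =
    GoogleCredentials.fromStream(new FileInputStream("cred.json"));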

There are additional steps to configure the Cloud SDK with the credentials and the Python environment. Those are out of the scope of this article. For more details, check out the Google Cloud Pub/Sub quickstart for your specific system.

2. Create your pipeline

A. Create a `Pipeline` object.

We created our first basic pipeline in Stream Pipeline with Apache Beam.

How different would this be for Pub/Sub? As shown below, Pub/Sub requires pipeline options to be configured.

// Pub/Sub options

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

...

PubSubToGCSOptions options = PipelineOptionsFactory
  .fromArgs(args)
  .withValidation()
  .as(PubSubToGCSOptions.class);

Pipeline p = Pipeline.create(options);
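
The `PubSubToGCSOptions` interface itself is not part of the Beam SDK; it is a custom options interface you define. A minimal sketch, assuming it only needs the input and output topics used in the steps below:

import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.StreamingOptions;
import org.apache.beam.sdk.options.Validation;

public interface PubSubToGCSOptions extends StreamingOptions {
  @Description("The Cloud Pub/Sub topic to read from.")
  @Validation.Required
  String getInputTopic();
  void setInputTopic(String value);

  @Description("The Cloud Pub/Sub topic to write to.")
  @Validation.Required
  String getOutputTopic();
  void setOutputTopic(String value);
}

Extending StreamingOptions keeps the --streaming flag available, which is convenient because Pub/Sub is an unbounded source.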

B. Reading data into your pipeline.

So we created our pipeline. It's time to tell it where to retrieve the data.

Our Pub/Sub pipeline will read from the defined input topic. This option was set when we created our pipeline options with `PipelineOptionsFactory` in step A.

PCollection<String> lines = p.apply(
  PubsubIO.readStrings().fromTopic(options.getInputTopic()));
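
Note that when Beam reads from a topic, it creates its own temporary subscription to that topic. If you would rather consume from the subscription you created during setup, PubsubIO can read from it directly. A sketch, assuming a hypothetical getInputSubscription() option added to the interface:

// Read from an existing subscription instead of a topic.
PCollection<String> lines = p.apply(
  PubsubIO.readStrings().fromSubscription(options.getInputSubscription()));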

C. Applying Transforms to Process Pipeline Data

You can manipulate your data with various transforms provided by the Beam SDKs. Similar to the read transform, you apply transforms to your pipeline's PCollection by calling `apply`, here with a `ParDo` transform.

Using Pub/Sub, we can see that not much has changed; only the method called has changed to represent the transform being applied.

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

// The AnalyzeElements DoFn to perform on each element in the input PCollection
static class AnalyzeElements extends DoFn<String, String> {
  @ProcessElement
  public void processElement(@Element String element, OutputReceiver<String> out) {
    // ... analyze the element here, then emit the (possibly modified) result
    out.output(element);
  }
}

PCollection<String> lines = p
  .apply(PubsubIO.readStrings().fromTopic(options.getInputTopic()))
  .apply(ParDo.of(new AnalyzeElements()));
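
For simple element-wise logic, a lambda-based transform can replace a full DoFn class. A sketch using Beam's MapElements, where the uppercasing is just a stand-in for real analysis:

import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptors;

PCollection<String> lines = p
  .apply(PubsubIO.readStrings().fromTopic(options.getInputTopic()))
  .apply(MapElements.into(TypeDescriptors.strings())
      .via((String s) -> s.toUpperCase()));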

D. Writing or Outputting your Final Pipeline Data

We've pulled our stream, analyzed the data, and now we are ready to write it out.

Our Pub/Sub pipeline will write to its defined output topic. This option was also set when we created our pipeline options in step A. (Note that applying the write transform completes the pipeline, so there is no PCollection left to assign.)

p.apply(PubsubIO.readStrings().fromTopic(options.getInputTopic()))
  .apply(ParDo.of(new AnalyzeElements()))
  .apply(PubsubIO.writeStrings().to(options.getOutputTopic()));
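
One detail worth calling out: the topic strings passed to PubsubIO must be full resource paths, not bare topic names. A sketch with placeholder project and topic names:

// Full Pub/Sub resource paths, e.g. passed via --inputTopic / --outputTopic flags:
String inputTopic = "projects/my-project/topics/input-topic";
String outputTopic = "projects/my-project/topics/output-topic";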

E. Running your pipeline

To execute the pipeline you built, call its `run()` method. Pipelines run asynchronously.

p.run();

Because runs are asynchronous, if you'd like to block until execution completes instead, append the `waitUntilFinish` method to your pipeline:

p.run().waitUntilFinish();
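
Keep in mind that for a streaming pipeline reading from Pub/Sub, `waitUntilFinish` blocks until the job is cancelled or fails. If you want to inspect the outcome, `run()` returns a PipelineResult you can query; a small sketch:

import org.apache.beam.sdk.PipelineResult;

PipelineResult result = p.run();
// Blocks until the job terminates; returns the final state (e.g. DONE, CANCELLED, FAILED).
PipelineResult.State state = result.waitUntilFinish();
System.out.println("Pipeline finished in state: " + state);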

Conclusion

That's all there is to creating a pipeline with Pub/Sub, folks!

// Pub/Sub options

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

...

PubSubToGCSOptions options = PipelineOptionsFactory
  .fromArgs(args)
  .withValidation()
  .as(PubSubToGCSOptions.class);

Pipeline p = Pipeline.create(options);

...

// The AnalyzeElements DoFn to perform on each element in the input PCollection
static class AnalyzeElements extends DoFn<String, String> {
  ...
}

p.apply(PubsubIO.readStrings().fromTopic(options.getInputTopic()))
  .apply(ParDo.of(new AnalyzeElements()))
  .apply(PubsubIO.writeStrings().to(options.getOutputTopic()));

p.run().waitUntilFinish();

We are in a sweet spot. Let's add more advanced features to this!

Other Topics:

  1. Apache Beam Windowing

References:

  1. Cloud Pub/Sub
  2. Cloud Pub/Sub Quickstart for macOS
