Concurrency is not just a Computer Science problem
Engineered Beam of Timber - https://bit.ly/2FVqeAq

This article and its contents are my own; they do not represent the views of my current, prior (or future) employer(s). However, full disclosure: I do work for Google at the time of writing this article and so may (do) express biased views!

When I got married, there were two functions, the Nikah ceremony in South Africa and the Walima in the UK. In case you're wondering, we didn't have that many people at my Walima, about 1,000 (!)

The thing is, 1,000 people, I guess it's kinda awesome right? Well, in the past few weeks I've been teaching myself Apache Beam and it kept reminding me of my Walima. Bear with me, I know you are thinking, Suf - What!!

"Can we get some more naan, the freshest naan you've got please?"

Pretty much all (or close-as all) 1,000 guests were served food simultaneously, well we didn't want the ceremony lasting all night and then the next day right? Unfortunately, there was only one kitchen/dining area, a handful of servers and an impatient hall full of hungry guests. There was NO WAY we could simply serve everyone one at a time, basically this could only work if serving happened concurrently. So, what was the strategy?

Strategy 1 - divide servers into sub-teams, allocate them areas or quadrants of the hall to serve.

Problem - Uncle Khan sat at a table that loved naan breads, and he would grab the attention of every server he could get hold of: "Can we get some more naan, the freshest naan you've got please?". Naturally, he paid no attention to whether the server he was talking to was the 'designated' server for his part of the hall. In fact, the only designation he cared about was naan.

Nothing against uncle, but he was there to enjoy himself, so why should he pay attention to the details of which naan server was the best one to give his nan some naan? Any server he saw, he would run after them with not a sign of all those naan carbs holding him back. He did like to remind us he used to be a mean fast bowler in his heyday.

Strategy 2 - dish as much food as possible onto serving plates really, really fast and just send them out quickly so that all tables got food, i.e. no real strategy at all.

Problem - food flowed out quickly to begin with, but our Great Aunt from Leicester was very chatty. She wanted to know about every server, were they married, were they looking into getting married, had they met her nephew, niece, oh and did they hear that her son was a doctor.

Since our servers were polite, each time they got near her table they would be stuck at table 47 far longer than they should have been (as well as at table 35, where the terrible cousins sat, the ones who like painting tablecloths using paint, err, I mean curry). Consequently, serving plates weren't returning to the kitchen quickly enough, so food couldn't get out of the kitchen and so ... instead of moving to the next course, the kitchen found themselves serving refills of earlier courses.

Strategy 3 - Serve only one or two sections of the dining hall at a time and make everyone else wait their turn.

Problem - well have you been to an Indian wedding or Walima? "Oh, so I see they served that family first huh, we just aren't good enough are we? Well I don't need to be here at this walima, let's go kids"

I'm by no means a Beam expert!

This is what made me think of my Walima while playing with Beam. As a former Java developer it first took me a few days to properly get my head round the framework, but then it just clicked into place. I just had to write the steps I wanted and let the Beam runner (I was using Dataflow) work out the correct ordering, when to introduce concurrency or not, when to scale up more workers, when not to, where to schedule the work. I just got on and wrote my data processing pipeline, nice :)

It's worth mentioning, I'm by no means a Beam expert! In many ways I'm still a novice, so my code is probably not the finest example, but I found it a great experience. My challenge was to write a pipeline which could process my personal banking transactions, determine which category of spending each transaction fell into, filter out bad transactions and then write the output to BigQuery for analytics as well as to Cloud Storage (just for fun).
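To make the shape of that pipeline concrete, here is a minimal sketch in plain Python rather than in Beam itself. It only mirrors the stages described above (parse, drop bad records, categorise, write to two outputs). The record layout, the keyword rules and the two in-memory "sinks" are all assumptions made up for illustration; they stand in for the real inputs, BigQuery and Cloud Storage.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional

# Hypothetical keyword-to-category rules; not taken from the real pipeline.
CATEGORY_KEYWORDS = {
    "tesco": "groceries",
    "shell": "fuel",
    "netflix": "entertainment",
}


@dataclass(frozen=True)
class Transaction:
    date: str
    description: str
    amount: float
    category: str = "uncategorised"


def parse(line: str) -> Optional[Transaction]:
    """Parse 'date,description,amount'; return None for a bad record."""
    parts = [p.strip() for p in line.split(",")]
    if len(parts) != 3:
        return None
    try:
        return Transaction(parts[0], parts[1], float(parts[2]))
    except ValueError:
        return None


def categorise(txn: Transaction) -> Transaction:
    """Attach the first category whose keyword appears in the description."""
    lowered = txn.description.lower()
    for keyword, category in CATEGORY_KEYWORDS.items():
        if keyword in lowered:
            return Transaction(txn.date, txn.description, txn.amount, category)
    return txn


def run_pipeline(lines: Iterable[str]) -> Iterator[Transaction]:
    """Parse, drop bad records, then categorise what is left."""
    parsed = (parse(line) for line in lines)
    good = (txn for txn in parsed if txn is not None)
    return (categorise(txn) for txn in good)


raw_lines = [
    "2019-01-03,TESCO STORES 1234,42.10",
    "2019-01-04,SHELL GARAGE,55.00",
    "this line is garbage",
    "2019-01-05,NETFLIX.COM,not-a-number",
    "2019-01-06,Corner shop,3.50",
]

results = list(run_pipeline(raw_lines))

# Two stand-ins for the real sinks (an analytics table and a file store).
analytics_rows = [(t.date, t.category, t.amount) for t in results]
archive_lines = [
    f"{t.date},{t.description},{t.amount:.2f},{t.category}" for t in results
]
```

Each stage only looks at one record at a time, which is the property that lets a runner spread the work across many workers without the stages needing to know about each other.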

What I personally liked about Beam is that:

  • I could let the runner work out all the operational characteristics for me, it was almost like using a PaaS for data processing
  • I could construct the 'map' parts of my code as classes independent of the main pipeline, so I could amend or replace them in the future without upsetting or modifying the entire pipeline. In fact, if I wanted to, I could swap in someone else's implementations at some point in the future if they were better. This kinda reminded me of a microservices approach.
  • Although I wrote my pipeline to handle large batch inputs, I knew it would also handle real-time streaming with the same code if I needed it to.
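The second point, keeping each 'map' step in its own class so it can be swapped out, can be sketched without Beam at all. This is plain Python, not the Beam API, and the class names, record fields and rules below are hypothetical. The only idea it borrows is that the pipeline knows a step by its interface, not by its implementation.

```python
from typing import Callable, Iterable, List

# A "step" takes one record and returns a transformed record.
Step = Callable[[dict], dict]


class KeywordCategoriser:
    """One possible implementation: match keywords in the description."""

    def __init__(self, rules: dict):
        self.rules = rules

    def __call__(self, record: dict) -> dict:
        text = record["description"].lower()
        category = next(
            (cat for key, cat in self.rules.items() if key in text), "other"
        )
        return {**record, "category": category}


class AmountCategoriser:
    """A drop-in alternative: categorise purely by size of spend."""

    def __init__(self, threshold: float):
        self.threshold = threshold

    def __call__(self, record: dict) -> dict:
        label = "large" if record["amount"] >= self.threshold else "small"
        return {**record, "category": label}


def run(records: Iterable[dict], steps: List[Step]) -> List[dict]:
    """Apply each step in order to every record."""
    out = []
    for record in records:
        for step in steps:
            record = step(record)
        out.append(record)
    return out


records = [
    {"description": "TESCO STORES", "amount": 42.10},
    {"description": "Train ticket", "amount": 120.00},
]

# Same pipeline function, two interchangeable categorisers.
by_keyword = run(records, [KeywordCategoriser({"tesco": "groceries"})])
by_amount = run(records, [AmountCategoriser(threshold=100.0)])
```

Swapping the categoriser changes one element of the `steps` list and nothing in `run`, which is roughly the microservices-like feel described above.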

Going forward there is still a lot to learn, including, amongst many other things, templating, protobufs, adding grouping functions and testing with streamed data, so I'll work on these with time. However, overall I'm quite pleased with the new skill. It's also worth mentioning that I did get some great help and coaching along the way, so thanks to @Asa Harland as well as @Fred Tsang and the team at Flumaion.

Using Dataflow solved many of the problems we could have had at my Walima too. As I tested my pipeline with more and more data going back several years, more compute was needed for some parts, so Dataflow scaled workers up and down as needed. If at times some workers were taking too long, Dataflow snuck in, divided the processing chunk up and re-balanced it across workers. Dataflow also logged everything for me, so I could see which step did what, how many records each step processed and in how much time, and it kept an audit of my previous runs.
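To see why re-balancing matters, here is a small, deterministic simulation in plain Python. It is not how Dataflow is implemented (Dataflow can also split work that is already in progress, which this does not model); it simply compares fixed up-front allocation, like Strategy 1, against letting whichever server is free take the next table. The serving times are made-up numbers, with two slow tables sitting next to each other.

```python
import heapq
from typing import List


def static_makespan(costs: List[int], workers: int) -> int:
    """Each worker gets a fixed, contiguous share of the tasks up front."""
    chunk = -(-len(costs) // workers)  # ceiling division
    shares = [costs[i:i + chunk] for i in range(0, len(costs), chunk)]
    return max(sum(share) for share in shares)


def dynamic_makespan(costs: List[int], workers: int) -> int:
    """Whichever worker frees up first takes the next task from a shared queue."""
    free_at = [0] * workers  # min-heap of the time each worker becomes free
    heapq.heapify(free_at)
    for cost in costs:
        earliest = heapq.heappop(free_at)
        heapq.heappush(free_at, earliest + cost)
    return max(free_at)


# Hypothetical serving time per table; the first two are the chatty ones.
table_costs = [9, 8, 1, 1, 1, 1, 1, 1]

static_time = static_makespan(table_costs, workers=4)
dynamic_time = dynamic_makespan(table_costs, workers=4)
```

With these numbers the fixed allocation finishes at time 17, because one server is stuck with both slow tables, while the shared queue finishes at time 9.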

Anyway, if you want to see my handiwork, do check out this repo.

If you'd like to know more about Beam and/or Dataflow, or how to use Beam with other runners such as Spark, Flink etc., do read here and here.

Finally, if you'd like to know how this could be used to solve a problem for a really big dataset, e.g. for a large financial company, do watch this video and this one.

Many thanks for reading :)

@Sufyaan_Kazi
