The Push for a New Beginning

“We have to change our stack.” That was one of my remarks after a crisis post-mortem. We had launched our service with great fanfare, grown our user base rapidly, and spread throughout the country to become a major service provider. But that’s now. Let’s rewind to late 2021, when we had one of our first tech crises. We were running a monolithic architecture, which served us well for the first six months of our operation. It worked well until it didn’t.

We were hit with several capacity issues, driven by four factors.

  1. Database Capacity

Since our monolith ran on just one database, that database needed to be highly performant, as many ACID transactions were happening simultaneously at any given time. These operations included search, booking, authentication, notifications, and payment transactions, to name a few.

We saw a spike in operational compute when we launched promo codes across the entire user base. Although there were safeguards, like monitoring alerts and load management for the promo code notifications, the spikes were instantaneous because of the sheer load placed on the system at once. The eventual effect was a sharp peak in CPU and memory usage, which drove our database into a fatal locking state.

  2. Increase in vehicle capacity and user base

As we grew over the initial months after launch, our user base spiked dramatically. We were also adding new vehicles almost daily across multiple locations throughout the island. This increased both load complexity and database compute across many different app functions. Depending on one main database to serve a whole application is tenable only for so long.

  3. Scheduled jobs

As part of our service, we sent multiple friendly reminders about each booking, including when it was time to start or end it. Each booking had an average of 6 notification reminders, spread throughout the lifetime of the booking. Booking statuses also changed over time. We also had an auto-extension feature, which extended a user’s booking to prevent them from incurring a late charge (this was in 2021). All of these ran concurrently for each booking within the same database. In addition, the job scheduler activated and deactivated promo codes as they reached their respective start and end times.

  4. Mass messages and announcements

As our numbers scaled, things became more demanding. We launched new campaigns and scheduled mass messages multiple times to inform our user base of our services and exciting deals. Back then, we had no MarTech services plugged in and handled all of this in-house through our own systems.
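To make the idea of load management for mass sends a little more concrete, here is a minimal Python sketch of sending to a large user base in paced batches rather than all at once. It is an illustration only, not the code we ran: the function names, the batch size, and the pause length are all assumptions made for the example.

```python
import time
from typing import Callable, Iterable, Iterator, List


def batched(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most `size` items."""
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def send_in_batches(
    user_ids: Iterable[str],
    send: Callable[[List[str]], None],
    batch_size: int = 500,        # hypothetical value, tune to what the database tolerates
    pause_seconds: float = 1.0,   # hypothetical value
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call `send` once per batch, pausing between batches.

    Returns the number of batches sent.
    """
    count = 0
    for batch in batched(user_ids, batch_size):
        if count:
            sleep(pause_seconds)  # pause between batches, not before the first one
        send(batch)
        count += 1
    return count
```

Pacing like this flattens the peak, but it does not remove the underlying contention when the sends still read from and write to the same database as everything else.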

After our initial crisis, it dawned on me that this would soon become unscalable. Our stopgap measure was to increase our DB capacity and launch more Fargate instances to support the concurrent load and compute. Of course, we could only go so far, as there is an upper limit to RDS capacity. Hence, I raised my hand in that meeting and said, “We need to split these up.” By “these”, I meant splitting our services into functional domains, essentially forming a microservice architecture. As I drew up the initial “napkin” plans, I was ready to embark on this challenge with my team of five engineers, knowing that the task was huge. I proposed a solution that made sense to us as a business (mind you, we were still in the early stages), and it came on top of the need to expand our current feature set. But it was still a challenge!

So, did we do it? Not quite!

The impetus to change was not high in the initial stages. We created an advanced monitoring system to alert us and established better SOPs to predict loads that would demand more compute from the server. That carried us only so far! As the user base grew, so did the need for the freedom to drive. It eventually erupted one day when a batch of promo codes for the entire user base was uploaded to the system. Although we had a way to manage this at the systems level, an unintended, repeated user action was the straw that broke the camel’s back and sent the system into overload. We took three weeks to bring the system back to normal. Everything else was halted, and all efforts went into mitigating the impact on users' bookings so that the business was not affected, even late at night. The whole team was dedicated to unwinding the system back to its working state.

To resolve the problem, we had to re-architect the system at its most tactical scale. I jumped at this opportunity almost wholeheartedly, as it meant I could prove a concept and then use it to push for the rest of the system. The solution was very small, yet it gave me the impetus to push forward and be very bullish on this front once it realised its potential. And truth be told, it did! We separated a part of the service and carved out a mini microservice from the monolith. It was clunky, but the area of effect in terms of code complexity was tiny. It was a simple cut-and-paste job, plus realigning pipelines to swing data entry over to a new database dedicated to processing job schedules.
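For readers who want a picture of what a scheduler with its own database can look like, here is a minimal Python sketch using the standard-library `sqlite3` module. It is a simplified illustration and not our production code: the `jobs` table, the function names, and the two reminder offsets are assumptions invented for the example, and SQLite stands in for whatever database engine the real service would use.

```python
import sqlite3
from datetime import datetime, timedelta
from typing import List, Tuple


def open_scheduler_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the scheduler's own database, separate from the main app database."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS jobs (
               id INTEGER PRIMARY KEY,
               booking_id TEXT NOT NULL,
               kind TEXT NOT NULL,
               run_at TEXT NOT NULL,            -- ISO 8601 timestamp
               done INTEGER NOT NULL DEFAULT 0
           )"""
    )
    return conn


def schedule_reminders(
    conn: sqlite3.Connection, booking_id: str, start: datetime, end: datetime
) -> None:
    """Queue reminder jobs for one booking (offsets are hypothetical)."""
    jobs = [
        ("start_reminder", start - timedelta(minutes=30)),
        ("end_reminder", end - timedelta(minutes=15)),
    ]
    conn.executemany(
        "INSERT INTO jobs (booking_id, kind, run_at) VALUES (?, ?, ?)",
        [(booking_id, kind, when.isoformat()) for kind, when in jobs],
    )
    conn.commit()


def claim_due_jobs(conn: sqlite3.Connection, now: datetime) -> List[Tuple[str, str]]:
    """Mark jobs due at or before `now` as done and return (booking_id, kind) pairs."""
    rows = conn.execute(
        "SELECT id, booking_id, kind FROM jobs "
        "WHERE done = 0 AND run_at <= ? ORDER BY run_at",
        (now.isoformat(),),
    ).fetchall()
    conn.executemany("UPDATE jobs SET done = 1 WHERE id = ?", [(r[0],) for r in rows])
    conn.commit()
    return [(r[1], r[2]) for r in rows]
```

The point of the separation is that polling and updating the `jobs` table happens in a database the booking, search, and payment queries never touch, so a burst of scheduled work cannot lock up the main one.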

It worked beautifully, and there was no more worry about job scheduling, as it performed within its own capacity limits. We also saw some savings from scaling down our main database, so it was a great win! This formed the drive behind the need to change and the push for a new beginning.

Hong Sen Teo

Data & AI | IBM

10 months ago

Well written and clear! Change management always benefits from a little push

Nicolas M.

CTO / head of Eng. / Product Manager - Technology Leadership with people at the core. Singapore Permanent Resident

10 months ago

Great sharing Malik Badaruddin!

Teik Seong LAU

Developer at Avexsoft

10 months ago

cool, what's the new stack like?

Joey Cheng

Cyber Security | GRC, OffSec, Managed Services, Strategy & Architecture, Technology Platform and Engineering | @ Sekuro

10 months ago

onward to more growth!!

Michal Lihocky

Engineering @ foodpanda

10 months ago

+1 for carving out the microservices out of the monolith only once there's an actual need and use case for it. All the best guys!
