Crisis Mode

At release, the GetGo app worked seamlessly, even with its monolithic architecture. It's worth noting that monolithic architectures can often deliver high-quality products; success depends on factors like load management, code architecture, product complexity, and a multitude of unforeseen challenges. However, some architectural weak point was bound to disrupt operations eventually, and we certainly learned valuable lessons from those early days.

Lesson 1: Aggregates Aren't Ideal

Our backend administrative system initially featured a resource-intensive dashboard that aggregated headline numbers, such as the total number of active cars and daily revenue. This feature, while impressive, strained our resources significantly: the database had to perform substantial calculations on every page load to generate and display these figures on the front end. Although aesthetically pleasing, their practical value was limited, as they mainly offered a snapshot of the current operational status.

These aggregates were computed through stored procedures. At first, I believed this was a viable solution, having previously worked with stored procedures over large data sets. However, I underestimated the cost of running them on every page load, every refresh, and even during user journeys that defaulted to the system's main homepage. This marked the beginning of our first few crises.
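For illustration, here is a minimal sketch of the direction that avoids this trap: serve the dashboard from a short-lived cache (or a pre-aggregated table) so the expensive query runs occasionally rather than on every page load. The driver, the cache, and the table and column names below are assumptions, not our actual setup.

```python
import json

import psycopg2   # assumed database driver; any SQL database works the same way
import redis      # assumed cache; an in-process TTL cache would also do

CACHE_TTL_SECONDS = 60  # dashboard numbers only need to be roughly current

cache = redis.Redis()
db = psycopg2.connect("dbname=getgo_admin")  # hypothetical DSN
db.autocommit = True  # read-only dashboard queries; don't hold transactions open

def dashboard_counters() -> dict:
    """Return headline numbers for the admin dashboard.

    The expensive aggregate query runs at most once per CACHE_TTL_SECONDS,
    no matter how many admin pages are loaded in between.
    """
    cached = cache.get("dashboard:counters")
    if cached:
        return json.loads(cached)

    with db.cursor() as cur:
        # Hypothetical schema: cars(status), bookings(created_at, amount)
        cur.execute(
            """
            SELECT
              (SELECT COUNT(*) FROM cars WHERE status = 'active') AS active_cars,
              (SELECT COALESCE(SUM(amount), 0) FROM bookings
                WHERE created_at >= CURRENT_DATE)                 AS daily_revenue
            """
        )
        active_cars, daily_revenue = cur.fetchone()

    result = {"active_cars": active_cars, "daily_revenue": float(daily_revenue)}
    cache.set("dashboard:counters", json.dumps(result), ex=CACHE_TTL_SECONDS)
    return result
```

With a cache like this, a hundred admin page loads cost one aggregate query instead of a hundred; the trade-off is that the numbers can be up to a minute stale, which is fine for an at-a-glance operational snapshot.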

One Saturday afternoon, our systems experienced a sudden slowdown, preventing our customer service team from accessing internal systems due to significantly delayed page loads. There were no feature updates or code changes, but one factor contributed to the issue: our inventory (active cars) reached a critical point, overwhelming the database processes. User screens failed to load on a weekend when our service usage peaked.

This marked our first encounter with a full-blown crisis. We swiftly adapted, managed the situation, and provided customer support to the best of our ability. We adopted an all-hands-on-deck approach; our total headcount was just over ten. We restarted and upgraded databases to restore system uptime quickly. At that very moment, I realized we needed to address the system architecture (a first hunch).

Lesson 2: Time-Based Notifications Don't Belong in a Monolith

This was a difficult lesson, and it led me to the following conclusion: regardless of your company's size or your product's current utilisation and revenue, if you're building a system with time-based notifications, avoid integrating them within your monolith. Trust me, this will prevent countless headaches in the future. Here's why:

We utilize notifications throughout the user journey to provide reminders, gentle nudges, and other informational updates. This was integrated into the same monolithic system that handled all other functionalities. Several containers ran this monolith, and its architecture was intended to handle the load effectively.

Interestingly, the system exhibited warning signs in the months leading up to what I would call the most challenging week in our system's history: our notifications began to experience delays. Due to our large customer base, notifications took an extended period to reach everyone; sending all messages took nearly four hours. We hadn't yet established an internal development team and relied primarily on vendors for support. We addressed the slowness and assumed the worst was behind us.

A few weeks later, we would discover the root cause the hard way. By then, we had hired our first few engineers and some slightly more experienced teammates to implement modifications and plan future feature enhancements. One evening, we intended to release a batch of promo codes to our entire user base, and everything abruptly ground to a halt.

The problem surfaced innocently: our time-based booking statuses failed to transition from "Confirmed" to "Ongoing," the transition that marks the start of a booking based on the current time. Our "Connect to Car" feature, which relies on this transition, failed to appear in-app, preventing customers from accessing their vehicles and starting their bookings.

Panic ensued. I had to manually patch statuses to keep bookings moving, flipping and patching database values every 15 minutes. For security reasons, none of the developers had direct access, and I held the sole key to this capability. We frantically investigated the issue while continuing to flip statuses. Time slipped away, and before we knew it, it was past midnight with no resolution in sight. At this point, we learned that the server responsible for these updates was unresponsive, but we had no idea why.

It was a gruelling week-long ordeal to resolve this issue. One of our resourceful engineers created a quick cron job to stand in for the server, executing the status changes and relieving the pressure of manual patching. In short, living life in 15-minute blocks was far from enjoyable.
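For readers who want to picture the stopgap, here is a rough sketch of such a cron-driven transition job. It is not our actual script; the table and column names are hypothetical.

```python
#!/usr/bin/env python3
"""Stopgap status-transition job, run every few minutes via cron, e.g.

    */5 * * * * /usr/bin/python3 /opt/getgo/flip_statuses.py
"""
import psycopg2  # assumed database driver

def flip_confirmed_to_ongoing() -> int:
    """Move bookings whose start time has passed from Confirmed to Ongoing."""
    conn = psycopg2.connect("dbname=getgo")  # hypothetical DSN
    try:
        with conn, conn.cursor() as cur:  # 'with conn' commits on success
            cur.execute(
                """
                UPDATE bookings
                   SET status = 'Ongoing', updated_at = NOW()
                 WHERE status = 'Confirmed'
                   AND start_time <= NOW()
                """
            )
            return cur.rowcount  # number of bookings transitioned
    finally:
        conn.close()

if __name__ == "__main__":
    print(f"Transitioned {flip_confirmed_to_ongoing()} bookings")
```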

We eventually discovered that the server (we called it the "job server," as it had one job) had frozen. It wasn't dead; it had simply ceased functioning. Further investigation revealed that the initial fault lay in the promo code upload. The lack of UI feedback to indicate backend processing led our teammate to unknowingly click the submit button multiple times. With multiple containers running concurrently, several of them processed these duplicate upload requests, writing duplicate records into the job server's data. In essence, it was a fatal data desynchronization: once the job server encountered a conflict in its data structure, it stopped working, and no further data points could be processed.
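One common guard against this class of failure is to make the submit endpoint idempotent, so duplicate clicks (and duplicate containers) collapse into a single unit of work. Below is a minimal sketch of that idea under assumed tooling; it is not the fix we shipped, and the framework, shared store, and endpoint path are illustrative.

```python
import hashlib

import redis                                # assumed shared store for dedup keys
from flask import Flask, jsonify, request   # assumed web framework

app = Flask(__name__)
dedup = redis.Redis()  # shared across all containers, so duplicates collapse everywhere

@app.route("/admin/promo-codes/upload", methods=["POST"])
def upload_promo_codes():
    payload = request.get_data()
    # Derive an idempotency key from the upload itself; a client-supplied
    # header works just as well.
    key = "promo-upload:" + hashlib.sha256(payload).hexdigest()

    # SET ... NX succeeds only for the first request with this key; later
    # duplicates, from double clicks or other containers, are rejected.
    if not dedup.set(key, "in-progress", nx=True, ex=3600):
        return jsonify({"status": "duplicate", "detail": "upload already in progress"}), 409

    enqueue_promo_upload(payload)  # hypothetical hand-off to a single background worker
    return jsonify({"status": "accepted"}), 202

def enqueue_promo_upload(payload: bytes) -> None:
    """Placeholder for handing the upload to a background worker."""
    ...
```

Disabling the submit button and showing a spinner helps, but server-side deduplication is what actually prevents two containers from doing the same work twice.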

This experience highlighted how illogical it was to combine notifications with operationally sensitive functions in the same service. As part of the solution, we separated the job server from the primary database to decouple some of the critical functions. This validated my earlier hunch: rather than pursuing fancy solutions, we needed to deliberately separate our functional use cases and begin decoupling our services.
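As a sketch of what that decoupling means in practice, the pattern looks roughly like this: the web application only enqueues work, and a dedicated worker owns notifications and other time-based jobs. The queue library and broker below are assumptions for illustration, not the stack we actually chose.

```python
# The web app only enqueues work; a separate worker process owns notifications
# and other time-based jobs. Celery and Redis are assumptions here, not the
# stack we actually chose.
from celery import Celery

jobs = Celery("getgo_jobs", broker="redis://localhost:6379/0")

@jobs.task
def send_booking_reminder(booking_id: int) -> None:
    """Runs in the worker; a backlog or failure here cannot block page loads
    or booking status transitions in the web application."""
    # look up the booking and push the reminder through the notification provider
    ...

# Inside the monolith's request handler, the only cost is enqueueing:
#   send_booking_reminder.delay(booking_id)
```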

A sigh of relief washed over us when we restored the system to a working state. We came out of it stronger, having been forced to innovate rapidly to keep the system stable.

Lesson 3: Your Team is Your Most Valuable Asset

This crisis underscored how proud I was of the team. Despite its small size, with fewer than ten members at that stage, everyone contributed, from senior staffers to juniors and new hires. For the first time, it felt incredible to have people who would step up in times of need, putting everything aside to achieve a common goal. It was the first indication that we had a great team, and I was immensely proud of how we rallied across the company to restore service, with customers barely noticing the chaos we were managing.

Throughout the crisis, I prioritized maintaining a work/rest cycle. All new developments were paused. Shifts were implemented, and we operated on a 24-hour cycle to manage and resolve the situation. When we finally succeeded on day three (or four?) and normalcy returned, I insisted on a day off for everyone to recuperate. We emerged even stronger and more unified as a team. Remarkably, almost all of them are still with us today.

Lesson 4: Have a Crisis Playbook

If there is only one lesson you take from this entry, let it be this one. I can't emphasize enough how important it is to have a playbook. What do I mean by playbook? At the very least, you must have a basic set of SQL data-pull scripts to provide data to your customer service teammates. These were so invaluable that I started building a whole library of them within the first few weeks of gaining access to the system. I iterated over them multiple times and even built in variables to manage time brackets, specific statuses, and so on. This is the single most essential thing to have whenever you build or gain access to a system.
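To make this concrete, here is a minimal sketch of one such data pull, with the time bracket and statuses as variables rather than hard-coded values. The schema and driver are assumptions for illustration only.

```python
"""Parameterised data pull for customer service: bookings by status and time bracket."""
import argparse
import csv
import sys

import psycopg2  # assumed database driver

QUERY = """
SELECT b.id, b.status, b.start_time, b.end_time, u.email
  FROM bookings b
  JOIN users u ON u.id = b.user_id
 WHERE b.status = ANY(%(statuses)s)
   AND b.start_time BETWEEN %(from_ts)s AND %(to_ts)s
 ORDER BY b.start_time
"""

def main() -> None:
    parser = argparse.ArgumentParser(description="Pull bookings for customer service")
    parser.add_argument("--from-ts", required=True, help="e.g. '2021-06-05 00:00'")
    parser.add_argument("--to-ts", required=True, help="e.g. '2021-06-05 23:59'")
    parser.add_argument("--statuses", nargs="+", default=["Confirmed", "Ongoing"])
    args = parser.parse_args()

    with psycopg2.connect("dbname=getgo") as conn, conn.cursor() as cur:
        cur.execute(QUERY, {
            "statuses": args.statuses,
            "from_ts": args.from_ts,
            "to_ts": args.to_ts,
        })
        writer = csv.writer(sys.stdout)
        writer.writerow([col[0] for col in cur.description])  # column headers
        writer.writerows(cur.fetchall())

if __name__ == "__main__":
    main()
```

A handful of scripts like this, kept current and shared with the team, means that during a crisis you can hand customer service the data they need in seconds instead of writing queries under pressure.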

The second most important thing is to have a chain of command, not for your day-to-day job, but for crisis activity. This is crucial because, in a crisis, there are countless things happening all at once: customers coming at you, your customer service team crying for help, and your system alarms blaring non-stop. You need a means of putting actions into place, and the only way to do this well is to have a hierarchy of command that the company has built to drive action in an orderly manner. Who or how doesn't matter; what matters most is getting the system back into a working state. Everything else can be looked into after the fact.

We were lucky enough to set this piece up very quickly, with my CEO spearheading the chain of command and the core team driving the rest of the actions down to the individuals responsible for each unit of activity: engineering, marketing, ops, and so on. We are nowhere near perfect, but we learn and grow stronger with every crisis. Today, we run drills and have created multiple playbooks across departments to ensure that we are ready to respond at any time of day when a crisis happens. It is not yet Disaster Recovery at the full technology scale, but the actions are similar.

Final Thoughts

A crisis is never fun. No good company or individual ever wishes for a crisis, but it will happen, and we all must be ready. As improbable as it may seem, all it takes is one wrong submit request, and everything can go awry.

But with every crisis, the mettle of the team and the company is tested, and when it happens, we should never let a good crisis go to waste.

