Our First Outage

This wasn’t the post I was planning on writing today. I had one planned called “The Year of The Product”, but that’ll have to wait until next time. This is because we had our first outage this week.

I’m writing about this for three reasons. First, I want to be fully transparent. Second, I want to share what we did to prevent it from happening again. And lastly, maybe this can inspire others not to make the same mistakes we did.

On Tuesday evening at 9:15 p.m., one of the primary database servers became unavailable, rendering Herodesk unavailable for all customers. This continued until the next morning at approx. 4:45 a.m., when I got up and spotted it myself.

An outage like this is totally unacceptable. Fortunately (if I may be that bold), it happened during hours when few were using our product, so it affected very few people (fewer than ten, the logs show). Nevertheless, it’s unacceptable and frankly embarrassing.

So, if you were one of the few affected by this: I’m sorry!

Before we move on, let me just emphasise that no data was lost, no messages from customers were lost, nothing like that.


The Root Cause

On Tuesday afternoon, the database server started using more and more memory. This was new behaviour. At 9:15 p.m., the OOM (out-of-memory) killer shut the database server down because there wasn’t enough memory in the physical server to keep it running.

Three learnings from this:

  1. Keep tight surveillance of resource usage. We should have spotted this well in advance.
  2. The database server was misconfigured, allowing it to use more memory than was available in the server.
  3. The database should have auto-restarted when it was shut down (see the sketch below).
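
To illustrate points 2 and 3: on a Linux server managed by systemd, a small drop-in file can both cap the database service’s memory and restart it automatically if it is ever killed. This is a generic sketch rather than our exact configuration – the service name, file path, and memory limit are placeholders:

    # /etc/systemd/system/postgresql.service.d/override.conf  (service name is a placeholder)
    [Service]
    # Restart the database automatically if the process dies, e.g. after an OOM kill
    Restart=on-failure
    RestartSec=5s
    # Hard memory ceiling for the service, kept below the host's physical RAM
    MemoryMax=6G

On top of that, the database’s own memory settings (buffers, caches, per-connection memory) should of course be sized so they add up to less than what the machine actually has.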

The big question is of course: How could we not have spotted this for such a long time? Outages happen. Systems stop working. I mean, it literally happens to everyone from time to time. But when it happens, it’s usually fixed quickly, because the people responsible know about it.

The answer: Human error (read: me) in configuring the surveillance system, meaning it didn’t alert us the way it should have.


How to prevent it from happening again

On Wednesday morning, we had a planned maintenance window. We do this regularly to keep things up to date. Because we have to reboot things during those updates, we plan it in advance in so-called “maintenance windows” (meaning you can expect some things not to respond during that time while we’re working – which is why we do it at 5 a.m.).

This time, we also upgraded a number of services, including the database server above, which was also reconfigured so it can no longer use more memory than is available.

Most of Wednesday was also spent updating our surveillance system, ensuring with 100% certainty that Herodesk won’t be able to sneak a fart without us instantly knowing it!

Part of the new surveillance is resource consumption (CPU, memory, etc.). We’ve added a lot of new customers during the past weeks, which is also part of the “problem”: the strain on the servers has grown. By constantly keeping an eye on our servers’ resource consumption, we’ll be able to spot when we need to add more capacity, in good time before it becomes an issue.

So, the concrete initiatives to prevent this from happening again:

  1. Upgraded our external surveillance systems, which constantly keep an eye on our public-facing services (meaning the Herodesk product) and alert us if anything happens. This is a critical function that has been triple-tested and will be tested regularly in the future.
  2. Upgraded our internal surveillance systems, which monitor resource consumption, etc., and alert us if anything looks “out of the ordinary”.
  3. Put in place new SOPs (standard operating procedures) to ensure correct configuration of critical services, such as database servers.


The last part of this post is going to be quite technical and is meant to inspire others as to how they can build a robust surveillance system themselves (one that’s startup-budget-friendly).

If you’re not into the tech stuff, it’s fine to stop reading here. In that case, I want to wrap up by saying that we’re sorry! I’ve been running mission-critical systems and software for 15 years. This should never happen, and honestly, it’s embarrassing that it did. Lesson learned, and I hope that the above gives you the same confidence that I have that this chain of events will never be able to repeat itself and cause an outage like we saw this week.


Takeaways and technical solutions

In this part, I’ll touch on three things: internal monitoring, external monitoring, and notifications – and how each is now set up at Herodesk.

We’ll start with the internal monitoring.

We’ve been running our own ELK-stack since day one. Logging, monitoring – the lot. What we’ve added now is Rules and Alerts. All internal servers are already enrolled as Fleet agents, so the performance metrics, etc., are available and already being used in various dashboards. Now, we’re also using them for Rules and Alerts.

For example:

WHEN Average OF system.cpu.total.pct IS ABOVE 85% FOR THE LAST 3 Minutes        

This will trigger an alert, which writes a new document to the alerts-* index. Because we’re running the open-source version of ELK, it’s not possible for ELK to notify external systems by itself using webhooks, email, or anything like that. You can basically choose between writing to an index or writing to syslog. So, it just writes a new document to an index on alert and on recovery.

Then we have another job, running every 30 seconds, that checks if there are any new documents in the alerts-* index. This is just a super simple PHP script that makes a cURL request to the ELK-stack to check for new documents in the index. If there are any, then, depending on the criticality of the event, it either sends an e-mail or a text message to notify whoever needs to be notified.
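
As a rough sketch of that idea (not our actual script – the hostname, credentials, index name, field names, and recipients are all placeholders), it could look something like this:

    <?php
    // Minimal sketch: poll Elasticsearch for alert documents created since the last run.
    $elasticUrl = 'https://elk.internal.example:9200/alerts-*/_search';

    // Only look at documents from the last 30 seconds, matching the job interval.
    $query = json_encode([
        'query' => [
            'range' => ['@timestamp' => ['gte' => 'now-30s']],
        ],
    ]);

    $ch = curl_init($elasticUrl);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => $query,
        CURLOPT_HTTPHEADER     => ['Content-Type: application/json'],
        CURLOPT_USERPWD        => 'alert_reader:secret', // placeholder credentials
    ]);
    $response = curl_exec($ch);
    curl_close($ch);

    $hits = json_decode((string) $response, true)['hits']['hits'] ?? [];

    foreach ($hits as $hit) {
        $alert = $hit['_source'];
        // Placeholder routing: the severity field and the recipients are illustrative only.
        if (($alert['severity'] ?? '') === 'critical') {
            // send an SMS here, e.g. via the Compaya API
        } else {
            mail('oncall@example.com', 'ELK alert', json_encode($alert));
        }
    }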

Text messages are sent using Compaya.dk and their API - pay as you go, easy and cheap.

To make it run every 30 seconds, we chose the “easy solution”: two cronjobs that run every minute, with one of them having a sleep 30 before the script is executed. Simple, easy, robust.
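
In crontab terms, that could look something like this (the script path is just a placeholder):

    # Runs at the top of every minute
    * * * * * /usr/bin/php /opt/herodesk/check_alerts.php
    # Same script again, delayed 30 seconds, giving an effective 30-second interval
    * * * * * sleep 30 && /usr/bin/php /opt/herodesk/check_alerts.php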

“But what if all your internal systems are down?!”

That leads us to the external monitoring. We’re using Betterstack.com, which does external HTTP checks on all public-facing critical services (including domain and SSL certificate expiry).

It also does heartbeat checks: every time the internal surveillance job runs, it sends a heartbeat to Betterstack. If Betterstack misses two heartbeats in a row, or if one of the external HTTP checks fails, it sends a notification.
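
The heartbeat itself is just an HTTP request to a unique URL that Betterstack generates for the monitor, so a few lines at the end of the alert-checking script cover it. A sketch, with a placeholder URL:

    // Placeholder URL – the real one comes from the Betterstack dashboard.
    $heartbeatUrl = 'https://uptime.betterstack.com/api/v1/heartbeat/XXXXXXXX';

    $ch = curl_init($heartbeatUrl);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_exec($ch); // a simple request to the URL registers the heartbeat
    curl_close($ch);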

As for notifications, Betterstack is configured so that it first sends a critical push message (meaning it overrides “do-not-disturb” and “silence” settings) to the person on call. If the issue isn’t acknowledged, it starts calling the person on call (yes, this service can call you) and later others, until the issue is acknowledged.

Then, we know for certain that any internal or external abnormality is spotted, reported, and acknowledged instantly, ensuring we can solve the problem quickly.

The total price? Betterstack is $25/mo. The ELK server is €10/mo on a Hetzner Cloud Server. I find that very reasonable, as we can sleep well at night knowing that nothing can happen without us instantly knowing it.


None of this is necessary in the best of worlds because things “just stay online”. But as I said before, everyone (and I do mean everyone!) experiences outages from time to time. And when they happen, what really matters is how they’re handled and how you communicate about them with your customers. After this experience, I’m confident in our approach going forward.

