Our First Outage

This wasn’t the post I was planning on writing today. I had one planned called “The Year of The Product”, but that’ll have to wait until next time. This is because we had our first outage this week.

I’m writing about this for three reasons. First, I want to be fully transparent. Second, I want to share what we did to prevent it from happening again. And lastly, maybe this can inspire others not to make the same mistakes we did.

On Tuesday evening at 9:15 p.m., one of the primary database servers became unavailable, rendering Herodesk unavailable for all customers. This continued until the next morning at approx. 4:45 a.m., when I got up and spotted it myself.

An outage like this is totally unacceptable. Fortunately (if I may be that bold), it happened during hours when few were using our product, so it affected very few people (fewer than ten, the logs show). Nevertheless, it’s unacceptable and frankly embarrassing.

So, if you were one of the few affected by this: I’m sorry!

Before we move on, let me just emphasise that no data was lost, no messages from customers were lost, nothing like that.


The Root Cause

On Tuesday afternoon, the database server started using more and more memory. This was new behaviour. At 9:15 p.m., the OOM (out-of-memory) killer shut the database server down because there wasn’t enough memory in the physical server to keep it running.

Three learnings from this:

  1. Keep tight surveillance of resource usage. We should have spotted this well in advance.
  2. The database server was misconfigured, allowing it to use more memory than was available in the server.
  3. The database should have auto-restarted when it was shut down (see the sketch below).
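
To illustrate points 2 and 3: on a Linux server managed by systemd, a small drop-in file can both cap the database service’s memory and restart it automatically if it is ever killed. This is a generic sketch rather than our exact configuration – the service name, file path, and memory limit are placeholders:

    # /etc/systemd/system/postgresql.service.d/override.conf  (service name is a placeholder)
    [Service]
    # Restart the database automatically if the process dies, e.g. after an OOM kill
    Restart=on-failure
    RestartSec=5s
    # Hard memory ceiling for the service, kept below the host's physical RAM
    MemoryMax=6G

On top of that, the database’s own memory settings (buffers, caches, per-connection memory) should of course be sized so they add up to less than what the machine actually has.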

The big question is of course: How could we not have spotted this for such a long time? Outages happen. Systems stop working. I mean, it literally happens to everyone from time to time. But when it happens, it’s usually fixed quickly, because the people responsible know about it.

The answer: Human error (read: me) in configuring the surveillance system, meaning it didn’t alert us the way it should have.


How to prevent it from happening again

On Wednesday morning, we had a planned maintenance window. We do this regularly to keep things up to date. Because we have to reboot things during those updates, we plan it in advance in so-called “maintenance windows” (meaning you can expect some things not to respond during that time while we’re working – which is why we do it at 5 a.m.).

This time, we also upgraded a number of services, including the database server above, which was also reconfigured so it can no longer use more memory than is available.

Most of Wednesday was also spent updating our surveillance system, ensuring with 100% certainty that Herodesk won’t be able to sneak a fart without us instantly knowing it!

Part of the new surveillance is resource consumption (CPU, memory, etc.). We’ve added a lot of new customers during the past weeks, which is also part of the “problem”: the strain on the servers has grown. By constantly keeping an eye on our servers’ resource consumption, we’ll be able to spot when we need to add more capacity, in good time before it becomes an issue.

So, the concrete initiatives to prevent this from happening again:

  1. Upgraded our external surveillance systems, which constantly keep an eye on our public-facing services (meaning the Herodesk product) and alert us if anything happens. This is a critical function that has been triple-tested and will be tested regularly in the future.
  2. Upgraded our internal surveillance systems, which monitor resource consumption, etc., and alert us if anything looks “out of the ordinary”.
  3. Put in place new SOPs (standard operating procedures) to ensure correct configuration of critical services, such as database servers.


The last part of this post is going to be quite technical and is meant to inspire others as to how they can build a robust surveillance system themselves (one that’s startup-budget-friendly).

If you’re not into the tech stuff, it’s fine to stop reading here. In that case, I want to wrap up by saying that we’re sorry! I’ve been running mission-critical systems and software for 15 years. This should never happen, and honestly, it’s embarrassing that it did. Lesson learned, and I hope that the above gives you the same confidence that I have that this chain of events will never be able to repeat itself and cause an outage like we saw this week.


Takeaways and technical solutions

In this part, I’ll touch on three things: internal monitoring, external monitoring, and notifications – and how each is now set up at Herodesk.

We’ll start with the internal monitoring.

We’ve been running our own ELK-stack since day one. Logging, monitoring – the lot. What we’ve added now is Rules and Alerts. All internal servers are already enrolled as Fleet agents, so the performance metrics, etc., are available and already being used in various dashboards. Now, we’re also using them for Rules and Alerts.

For example:

WHEN Average OF system.cpu.total.pct IS ABOVE 85% FOR THE LAST 3 Minutes        

This will trigger an alert, which writes a new document to the alerts-* index. Because we’re running the open-source version of ELK, it’s not possible for ELK to notify external systems by itself using webhooks, email, or anything like that. You can basically choose between writing to an index or writing to syslog. So, it just writes a new document to an index on alert and on recovery.

Then we have another job, running every 30 seconds, that checks if there are any new documents in the alerts-* index. This is just a super simple PHP script that makes a cURL request to the ELK-stack to check for new documents in the index. If there are any, then, depending on the criticality of the event, it either sends an e-mail or a text message to notify whoever needs to be notified.
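
As a rough sketch of that idea (not our actual script – the hostname, credentials, index name, field names, and recipients are all placeholders), it could look something like this:

    <?php
    // Minimal sketch: poll Elasticsearch for alert documents created since the last run.
    $elasticUrl = 'https://elk.internal.example:9200/alerts-*/_search';

    // Only look at documents from the last 30 seconds, matching the job interval.
    $query = json_encode([
        'query' => [
            'range' => ['@timestamp' => ['gte' => 'now-30s']],
        ],
    ]);

    $ch = curl_init($elasticUrl);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => $query,
        CURLOPT_HTTPHEADER     => ['Content-Type: application/json'],
        CURLOPT_USERPWD        => 'alert_reader:secret', // placeholder credentials
    ]);
    $response = curl_exec($ch);
    curl_close($ch);

    $hits = json_decode((string) $response, true)['hits']['hits'] ?? [];

    foreach ($hits as $hit) {
        $alert = $hit['_source'];
        // Placeholder routing: the severity field and the recipients are illustrative only.
        if (($alert['severity'] ?? '') === 'critical') {
            // send an SMS here, e.g. via the Compaya API
        } else {
            mail('oncall@example.com', 'ELK alert', json_encode($alert));
        }
    }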

Text messages are sent using Compaya.dk and their API - pay as you go, easy and cheap.

To make it run every 30 seconds, we chose the “easy solution”: two cronjobs that run every minute, with one of them having a sleep 30 before the script is executed. Simple, easy, robust.
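
In crontab terms, that could look something like this (the script path is just a placeholder):

    # Runs at the top of every minute
    * * * * * /usr/bin/php /opt/herodesk/check_alerts.php
    # Same script again, delayed 30 seconds, giving an effective 30-second interval
    * * * * * sleep 30 && /usr/bin/php /opt/herodesk/check_alerts.php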

“But what if all your internal systems are down?!”

That leads us to the external monitoring. We’re using Betterstack.com, which does external HTTP checks on all public-facing critical services (including domain and SSL certificate expiry).

It also does heartbeat checks: every time the internal surveillance job runs, it sends a heartbeat to Betterstack. If Betterstack misses two heartbeats in a row, or if one of the external HTTP checks fails, it sends a notification.
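
The heartbeat itself is just an HTTP request to a unique URL that Betterstack generates for the monitor, so a few lines at the end of the alert-checking script cover it. A sketch, with a placeholder URL:

    // Placeholder URL – the real one comes from the Betterstack dashboard.
    $heartbeatUrl = 'https://uptime.betterstack.com/api/v1/heartbeat/XXXXXXXX';

    $ch = curl_init($heartbeatUrl);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_exec($ch); // a simple request to the URL registers the heartbeat
    curl_close($ch);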

As for notifications, Betterstack is configured so that it first sends a critical push message (meaning it overrides “do-not-disturb” and “silence” settings) to the person on call. If the issue isn’t acknowledged, it starts calling the person on call (yes, this service can call you) and later others, until the issue is acknowledged.

Then, we know for certain that any internal or external abnormality is spotted, reported, and acknowledged instantly, ensuring we can solve the problem quickly.

The total price? Betterstack is $25/mo. The ELK server is €10/mo on a Hetzner Cloud Server. I find that very reasonable, as we can sleep well at night knowing that nothing can happen without us instantly knowing it.


None of this is necessary in the best of worlds because things “just stay online”. But as I said before, everyone (and I do mean everyone!) experiences outages from time to time. And when they happen, what really matters is how they’re handled and how you communicate about them with your customers. After this experience, I’m confident in our approach going forward.

