Blameless Postmortem - Sample
https://www.dhirubhai.net/in/ajimatidavid

It was 2:35 am, Saturday morning. You had just lain down beside your angel, ready to goof around (#winks).

(Phone rings) - oops, it's your boss.

Boss: Hey Josh, we don't know what's wrong, but Twitter is on fire. Our customers can't find the products on their dashboard and they want their money back.

Production has broken, and you're the lead DevOps engineer. You're in trouble.

This failure might have been caused by a red-headed junior, who just got employed and wanted to do everything to push his first code to production. Alas, he has broken the whole thing.

Or maybe it was the mistake of a senior engineer who became too confident of his 'legendary-wizardry-stinking' programming prowess and refused to write unit tests that examined edge cases before he pushed the code to production.

Of course, he is the lead software engineer, how dare you question him?

Long story short, production has sunk, and customers are crying "I can't see my bank balance", "All the products I paid for are no longer on my dashboard", "They scammed me".

Yeah, let's assume you've fixed it and it now works well, but there's one more issue.

Your managers, bosses, fat-bellied investors, and the "give-me-back-my-money" customers need to know why there was such chaos over the weekend.

It's time to submit a postmortem: a statement of the process you undertook to find the cause of your software failure and how to prevent it from happening (ever again).

As hot-blooded, bald-headed humans (#winks), we're prone to place blame on the culprit, but no, that's never the best way to handle things.

atlassian.com says:

In a blameless postmortem, it's assumed that every team and employee acted with the best intentions based on the information they had at the time. Instead of identifying—and punishing—whoever screwed up, blameless postmortems focus on improving performance moving forward.

So, to show what that might look like, below is a sample blameless postmortem. Yeah, you're licensed to copy and paste/edit it without the written permission of the almighty publisher. Me. (lol).


WEBSITE DOWNTIME - POSTMORTEM REPORT

ISSUE SUMMARY

  • Duration: This issue lasted for 2 hours and 5 minutes
  • Impact: The company website was inaccessible and 43.4% of users who tried to use our services were unable to access product information.
  • Root cause: Nginx was unable to listen for HTTP requests on port 80.

TIMELINE

  • The issue was detected at 13:52:04 on 25th May 2023 (and lasted until 15:09:55 the same day).
  • It was detected when a customer complained that they could not access their dashboard or track the progress of their orders.
  • Actions taken: Nginx’s default ‘sites-enabled’ file was inspected and found to contain errors that prevented Nginx from listening on port 80.
  • Misleading investigation/debugging paths taken: the sysadmin team started by inspecting the Nginx configuration file, then checked the server for other services that might have hijacked port 80 from the Nginx web server.
  • The issue was escalated to David M. Ajimati, the head of the engineering team.
  • The incident was resolved by deleting the default “sites-enabled” file, then creating a symbolic link in ‘sites-enabled’ that points to the ‘sites-available’ file.
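The first check in the timeline, whether the enabled config actually asks Nginx to listen on port 80, can be sketched as a small helper. This is a minimal sketch rather than the team's actual tooling: the function name `listens_on_80` is hypothetical, and the pattern only covers the common `listen` forms such as `listen 80;` or `listen [::]:80 default_server;`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if the given nginx config file contains an
# uncommented `listen` directive for port 80.
listens_on_80() {
  local conf="$1"
  # Matches "listen 80;", "listen 0.0.0.0:80;", "listen [::]:80 default_server;"
  # but not "listen 8080;" or lines that start with "#".
  grep -Eq '^[[:space:]]*listen[[:space:]]+([^;[:space:]]*:)?80([[:space:]][^;]*)?;' "$conf"
}
```

On a live server it might be called as `listens_on_80 /etc/nginx/sites-enabled/default || echo "no listen directive for port 80"`. For the other two paths, `sudo nginx -t` validates the whole configuration, and `sudo ss -ltnp` lists the processes currently holding listening TCP ports.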

ROOT CAUSE AND RESOLUTION

  • ERROR DETAILS:

The sites-enabled configuration file pointed to pages on our website that no longer exist (they had been replaced by updated pages).

The sites-available configuration file pointed to the updated pages, which hold the up-to-date information on our products. Because sites-enabled was pointing to the wrong (nonexistent) pages, users accessing the product info website got a 500 error.

  • FIXING DETAILS:

The error was fixed by creating a symbolic link from the default configuration file in the sites-available directory to the sites-enabled directory for NGINX. By doing so, the default configuration becomes enabled and active for NGINX to use.

This also lets us enable or disable specific site configurations by simply creating or removing symbolic links, which should help keep such issues from recurring.
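Enabling and disabling a site through symbolic links might look like the sketch below. The helper names `enable_site` and `disable_site` and the `NGINX_ROOT` variable are hypothetical, introduced only for illustration; the directory layout is the Debian/Ubuntu one this report already assumes.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; NGINX_ROOT defaults to the layout used in this
# report and can be overridden, e.g. to try the functions in a scratch dir.
NGINX_ROOT="${NGINX_ROOT:-/etc/nginx}"

# Enable a site by linking its sites-available file into sites-enabled.
enable_site() {
  local name="$1"
  local src="$NGINX_ROOT/sites-available/$name"
  # Refuse to create a dangling link if the source config does not exist.
  [ -f "$src" ] || { echo "no such site: $name" >&2; return 1; }
  # -f replaces a stale file or link; -n avoids descending into an old link.
  ln -sfn "$src" "$NGINX_ROOT/sites-enabled/$name"
}

# Disable a site by removing only its link; sites-available is untouched.
disable_site() {
  rm -f "$NGINX_ROOT/sites-enabled/$1"
}
```

On a real server these would need root privileges, and a change only takes effect once Nginx re-reads its configuration, for example with `sudo nginx -t && sudo service nginx reload`.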

CORRECTIVE AND PREVENTATIVE MEASURES

  • What can be improved: how we update the available pages on our website, and how we ensure the NGINX web server picks up these changes immediately, to minimize downtime and related issues.
  • Below is the script we used to solve this problem:

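#!/usr/bin/env bash
# Reconfigure nginx to listen on port 80 by re-linking the default site.
set -euo pipefail
# Remove the stale file, then link the maintained config into sites-enabled.
sudo rm -f /etc/nginx/sites-enabled/default
sudo ln -sf /etc/nginx/sites-available/default /etc/nginx/sites-enabled/default
# Validate the configuration first so a bad config stops the script here.
sudo nginx -t
sudo service nginx restart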

  • Task list to address this issue:
      • Install a web-server monitoring application
      • Compare the sites-enabled and sites-available configuration files
      • Check whether there is a symbolic link between the two
      • Delete the stale sites-enabled file
      • Create a symbolic link in sites-enabled that points to the sites-available file
      • Check that no other service is using or blocking port 80
      • Restart nginx
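The compare-and-check items in the task list could be automated along these lines. This is a sketch under assumptions: the function name `site_link_status` and its five status words are made up for illustration, and `readlink -f` is assumed to be available (it is in GNU coreutils).

```shell
#!/usr/bin/env bash
# Hypothetical check: report how a sites-enabled entry relates to its
# sites-available counterpart.
# Prints one of: linked, mislinked, copy, drifted, missing.
site_link_status() {
  local enabled="$1" available="$2"
  if [ -L "$enabled" ]; then
    # A symlink is healthy only if it resolves to the maintained file.
    if [ -e "$enabled" ] && [ "$(readlink -f "$enabled")" = "$(readlink -f "$available")" ]; then
      echo linked
    else
      echo mislinked
    fi
  elif [ -f "$enabled" ]; then
    # A regular file is a copy that can silently fall out of date.
    if cmp -s "$enabled" "$available"; then echo copy; else echo drifted; fi
  else
    echo missing
  fi
}
```

Called as `site_link_status /etc/nginx/sites-enabled/default /etc/nginx/sites-available/default`, the situation described in this report would show up as `drifted`: a regular file in sites-enabled whose contents no longer match sites-available. Anything other than `linked` is worth a look before the next deploy.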


There are many postmortem formats, but they all serve the same two purposes:

  • To give the rest of the company’s employees easy access to information detailing the cause of the outage. Outages can have a huge impact on a company, so managers and executives have to understand what happened and how it will affect their work.
  • To ensure that the root cause(s) of the outage have been discovered and that measures are taken to fix them.


If that helped you in any way, or you have other ideas, you can share them in the comment section below. And of course, like and share it.

Else, I'll break your phone when next you try to goof around.
