Blameless Postmortem - Sample
It was 2:35 am on a Saturday. You had just lain down beside your angel, ready to goof around (#winks).
(Phone rings) - oops, it's your boss.
Boss: Hey Josh, we don't know what's wrong, but Twitter is on fire. Our customers can't find their products on their dashboards and they want their money back.
Production has broken, and you're the lead DevOps engineer. You're in trouble.
This failure might have been caused by a red-headed junior, who just got employed and wanted to do everything to push his first code to production. Alas, he has broken the whole thing.
Or maybe it was the mistake of a senior engineer who became too confident of his 'legendary-wizardry-stinking' programming prowess and refused to write unit tests that examined edge cases before he pushed the code to production.
Of course, he is the lead software engineer, how dare you question him?
Long story short, production has sunk, and customers are crying "I can't see my bank balance", "All the products I paid for are no longer on my dashboard", "They scammed me".
Yea, let's assume you've fixed it, it now works well, but there's one more issue.
Your managers, bosses, fat-bellied investors, and the "give-me-back-my-money" customers need to know why there was such chaos over the weekend.
It's time to submit a postmortem: a statement of the process you undertook to find the cause of your software failure and how to prevent it from happening (ever again).
As hot-blooded, bald-headed humans (#winks), we're prone to place blame on the culprit, but no, that's never the best way to handle things.
atlassian.com said:
In a blameless postmortem, it's assumed that every team and employee acted with the best intentions based on the information they had at the time. Instead of identifying, and punishing, whoever screwed up, blameless postmortems focus on improving performance moving forward.
So, to show what that might look like, below is a sample blameless postmortem. Yea, you're licensed to copy and paste/edit it without the written permission of the almighty publisher. Me. (lol).
WEBSITE DOWNTIME - POSTMORTEM REPORT
ISSUE SUMMARY
TIMELINE
ROOT CAUSE AND RESOLUTION
The configuration in sites-enabled pointed to pages on our website that are no longer accessible (they had since been moved to new locations).
The configuration in sites-available referenced the updated pages, which carry the up-to-date information on our products. Because sites-enabled was pointing to the wrong (nonexistent) pages, users accessing the product info website got a 500 error.
The error was fixed by creating a symbolic link from the default configuration file in the sites-available directory into the sites-enabled directory for NGINX. This makes the default configuration enabled and active for NGINX to use.
Going forward, we can enable or disable specific site configurations simply by creating or removing symbolic links, which helps ensure such issues do not recur.
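To make the "create or remove a symbolic link" idea concrete, here is a minimal sketch of two helper functions. The names NGINX_ROOT, enable_site and disable_site are made up for this example (they are not part of NGINX), and the layout assumed is the sites-available/sites-enabled one described above.

```shell
#!/usr/bin/env bash
# Sketch only: enable or disable a site by managing its symlink.
# NGINX_ROOT, enable_site and disable_site are hypothetical names for this example.
NGINX_ROOT="${NGINX_ROOT:-/etc/nginx}"

enable_site() {
  local site="$1"
  local src="$NGINX_ROOT/sites-available/$site"
  # Refuse to link to a config file that does not exist,
  # so we never create a dangling link in sites-enabled.
  if [ ! -f "$src" ]; then
    echo "no such site config: $src" >&2
    return 1
  fi
  ln -sf "$src" "$NGINX_ROOT/sites-enabled/$site"
}

disable_site() {
  local site="$1"
  # Removes only the link; the file in sites-available is left untouched.
  rm -f "$NGINX_ROOT/sites-enabled/$site"
}
```

On a real server you would likely run these as root, for example enable_site default, then check the configuration with nginx -t before reloading NGINX.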
CORRECTIVE AND PREVENTATIVE MEASURES
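#!/usr/bin/env bash
# This script reconfigures NGINX to listen on port 80 by re-enabling the default site.
# Remove any stale link first (-f: do not fail if it is already gone).
sudo rm -f /etc/nginx/sites-enabled/default
# Link the default config from sites-available into sites-enabled.
sudo ln -sf /etc/nginx/sites-available/default /etc/nginx/sites-enabled/default
# Restart NGINX so it picks up the re-linked configuration.
sudo service nginx restart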
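As one possible preventative measure, the sketch below lists symbolic links whose targets no longer exist, so a stale link in sites-enabled can be spotted before users hit it. The function name find_broken_links is hypothetical, and this is just one way to do the check, not the fix that was actually applied.

```shell
#!/usr/bin/env bash
# Sketch only: print every dangling symlink found directly inside a directory.
# find_broken_links is a hypothetical helper name for this example.
find_broken_links() {
  local dir="$1" link
  for link in "$dir"/*; do
    # -L: the entry is a symlink; ! -e: the target it points to does not exist.
    if [ -L "$link" ] && [ ! -e "$link" ]; then
      echo "$link"
    fi
  done
}
```

For example, find_broken_links /etc/nginx/sites-enabled prints nothing when every link is healthy, so it could be run before each restart or from a scheduled job.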
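There are many postmortem formats, but they all do essentially this: summarize the issue, lay out the timeline, explain the root cause and resolution, and list corrective and preventative measures.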
If that helped you in any way, or you have other ideas, you can share them in the comment section below. And of course, like and share it.
Else, I'll break your phone when next you try to goof around.