Post-mortem (Incident Report)

Friday, June 28, 2019

By the DevOps Team

Earlier this week we experienced a network outage on our Consolidatedalliance website. This incident report provides details of the nature of the outage and our response.

The outage occurred on Friday, June 28, 2019. We know this outage and downtime issue has impacted our valued developers and users, and we apologize to everyone who was affected.

Issue Summary

From 6:26 PM to 7:58 PM WAT, most requests to the home page of consolidatedalliance.ng returned 404 error responses. Access to other parts of the website was also affected, including the login page and the booking manager page. The issue affected 80% of traffic to this API infrastructure. Users could still reach the parts of the website that did not require going through the home page. The root cause of this outage was an invalid href attribute configuration change in the header file that exposed a bug in the home page. The href attribute specifies the base URL for all relative URLs on a page; it was wrongly set to localhost instead of consolidatedalliance.ng.
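To illustrate the effect of the misconfiguration, the short Python sketch below shows how the same relative link resolves under the correct base URL versus the broken localhost one (the "login" path is used here purely as an example):

from urllib.parse import urljoin

# With the correct base URL, a relative link such as "login" resolves
# to the production site.
print(urljoin("https://consolidatedalliance.ng/", "login"))
# -> https://consolidatedalliance.ng/login

# With the misconfigured base URL, the same relative link resolves to
# localhost, which is unreachable from a visitor's browser.
print(urljoin("http://localhost/", "login"))
# -> http://localhost/login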

Timeline (all times West African Time (WAT))

6:19 PM: Header file push begins

6:26 PM: Outage begins

6:26 PM: A customer alerted the support team

7:15 PM: Successful header file configuration change rollback

7:19 PM: Server restarts begin

7:58 PM: 100% of traffic back online

Root Cause

At 6:19 PM WAT, an invalid href attribute in the header file was inadvertently released to the production environment without first being tested. The change specified an invalid URL for the home page, which made the page point to localhost instead of consolidatedalliance.ng. As a result, the home page became inaccessible, which also blocked access to other parts of the site, and the downtime began.

Resolution and recovery

At 6:26 PM WAT, the support team informed the DevOps team that a customer had called in to complain that he could not access the login page for his profile. By 6:40 PM, the developers had identified the problem and set to work immediately.

At 7:15 PM, we rolled back the problematic configuration change. The rollback was successful because we had already pinpointed the faulty change.

At 7:19 PM, we restarted the server. By 7:58 PM, 100% of traffic had been recovered and everything was back to normal.
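For future incidents, a quick smoke check of the pages affected here (home, login, booking manager) can confirm that traffic is fully restored after a rollback and restart. The following is a minimal Python sketch; the exact paths are assumptions and should be adjusted to the real site layout:

from urllib.request import urlopen
from urllib.error import HTTPError, URLError

BASE_URL = "https://consolidatedalliance.ng"   # production site
PATHS = ["/", "/login", "/booking-manager"]    # hypothetical paths for the affected pages

def check(path):
    # Request the page and report whether it returns HTTP 200.
    url = BASE_URL + path
    try:
        with urlopen(url, timeout=10) as resp:
            ok = resp.status == 200
    except (HTTPError, URLError) as exc:
        print(f"{url}: FAILED ({exc})")
        return False
    print(f"{url}: {'OK' if ok else 'unexpected status'}")
    return ok

if __name__ == "__main__":
    results = [check(p) for p in PATHS]
    print("All pages healthy" if all(results) else "Some pages still failing")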

Corrective and Preventative Measures

We’ve conducted an internal review and analysis of the outage. The following are actions we are taking to address the causes of the issue and to help prevent recurrence and improve response times:

- We will put measures in place to make sure our DevOps team is alerted first, before customers discover any bugs in the future.

- We agreed to implement DataDog APIs to alert the team of any issues.

- We put measures in place to test every change, however small, before it is released to production, to avoid such occurrences; a sketch of one such check follows this list.
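As one concrete form of that testing, a pre-deployment check could parse the header file and fail the build if its base href does not point at the production domain. This is only a minimal sketch: the file name header.html and the expected production URL are assumptions to be adapted to the real repository.

import sys
from html.parser import HTMLParser

EXPECTED_BASE = "https://consolidatedalliance.ng"   # assumed production base URL
HEADER_FILE = "header.html"                         # hypothetical header file name

class BaseHrefParser(HTMLParser):
    """Collects the href of the first <base> tag in the document."""
    def __init__(self):
        super().__init__()
        self.base_href = None

    def handle_starttag(self, tag, attrs):
        if tag == "base" and self.base_href is None:
            self.base_href = dict(attrs).get("href")

def main():
    parser = BaseHrefParser()
    with open(HEADER_FILE, encoding="utf-8") as f:
        parser.feed(f.read())

    if parser.base_href is None:
        print("FAIL: no <base href> found in the header file")
        return 1
    if not parser.base_href.startswith(EXPECTED_BASE):
        print(f"FAIL: base href is {parser.base_href!r}, expected it to start with {EXPECTED_BASE}")
        return 1
    print("OK: base href points at the production domain")
    return 0

if __name__ == "__main__":
    sys.exit(main())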

We appreciate your patience while we reviewed and fixed the issue. Again, we apologize for the inconvenience and for the impact on you, your users, and your business. We thank you for your business with us.

Sincerely,

The DevOps Team
