Another day another AWS outage

Another day, another significant AWS outage. According to DownDetector, DoorDash, Twitch, PSN, Hulu, and more went offline for an hour and a half.

It all started around 7 AM PT, when a wave of user complaints began showing up on DownDetector.

User1: My Oregon EC2 Instance is down. Website cannot be accessed.

User2: Amazon Connect is down and we use it for work!

User3: Still down in Texas.

AWS went down again, suffering another major outage. Although the outage lasted just over an hour and a half, a large number of popular websites were impacted. Two major AWS regions, US-WEST-1 and US-WEST-2, both suffered "internet connectivity" issues.

Thirty minutes after the first reports of the outage, AWS said it was "investigating Internet connectivity issues to the US-WEST-1 and US-WEST-2 Regions."

8:10 AM PST We have resolved the issue affecting Internet connectivity to the US-WEST-1 Region. Connectivity within the region was not affected by this event. The issue has been resolved and the service is operating normally.

8:14 AM PST We have resolved the issue affecting Internet connectivity to the US-WEST-2 Region. Connectivity within the region was not affected by this event. The issue has been resolved and the service is operating normally.

As an incident response platform company built for DevOps and SREs, we at Fylamynt started thinking about this incident and wanted to add one more relevant question to an already long list of headaches: "What could we have done to avoid this problem?"

The Real Problem

As much as every company dreams of cloud operations running perfectly all the time, even junior operations people know the reality: there are issues, things break, and they have to be dealt with constantly.

No cloud service is perfect, and neither are the teams that work tirelessly to keep them running. They need modern tools and processes to respond quickly when things go wrong, and help ensure your customers continue to receive the value you provide.

Why Fylamynt?

At Fylamynt, we pride ourselves on providing the most state-of-the-art incident response and remediation platform on the market.

Fylamynt is the kind of tool you need when you receive an outage alert.

Fylamynt can trigger runbooks from your favorite monitoring and alerting tools; spin up a Slack channel, Zoom session, and ticket for the response team; pull all the relevant data from tools like AWS Health; and then automate the repetitive, tedious, and time-consuming parts of the runbook to get the system operating at full capacity as quickly as possible. By the time the SRE woken up at 3:00 AM has logged in, everything they need to make decisions in the critical part of the process is already in front of them.

The Solution

Fylamynt's AWS Health integration lets you start collaborating on and remediating your application incidents as soon as an AWS outage is reported. Let's configure this solution.

Before we can create the workflow, we need to configure and authorize the AWS Health integration.

Navigate to Settings -> Integrations -> AWS Health

Detailed steps to set up an EventBridge rule are available here
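
For orientation, an EventBridge rule for AWS Health is driven by an event pattern. The snippet below is a minimal sketch of such a pattern, not Fylamynt's documented configuration: the helper name `build_health_event_pattern` and the choice of filtering on the `EC2` service are assumptions made for illustration. The `aws.health` source and the `AWS Health Event` detail type are the values AWS Health uses when it publishes to EventBridge.

```python
import json


def build_health_event_pattern(services=None, categories=("issue",)):
    """Build an EventBridge event pattern that matches AWS Health events.

    services:   optional list of AWS Health service codes to filter on
                (illustrative; omit to match every service).
    categories: AWS Health event type categories to match.
    """
    detail = {"eventTypeCategory": list(categories)}
    if services:
        detail["service"] = list(services)
    return {
        "source": ["aws.health"],
        "detail-type": ["AWS Health Event"],
        "detail": detail,
    }


# Hypothetical example: only match operational issues reported for EC2.
pattern = build_health_event_pattern(services=["EC2"])
print(json.dumps(pattern, indent=2))
```

The resulting JSON is what would be supplied as the rule's event pattern. The target the rule forwards to depends on how the integration is authorized, so it is left out here.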

After the integration is authorized, we can create our workflow.

Navigate to Workflows -> New Workflow

Our workflow consists of the AWS Health trigger, plus action nodes that extract the impacted service and region and send a Slack notification to a centralized team. This is a simple example, but a workflow can also contain remediation steps specific to the service being monitored.
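
To make the extraction step concrete, here is a rough Python sketch of what those action nodes do conceptually. It is not Fylamynt's implementation: the function names and the sample event are hypothetical, and the field names (`region`, `detail.service`, `detail.eventTypeCode`) assume the AWS Health event shape delivered through EventBridge. The `{"text": ...}` payload is the basic format accepted by Slack incoming webhooks.

```python
import json


def summarize_health_event(event):
    """Pull the impacted service, region, and event type out of an event."""
    detail = event.get("detail", {})
    return {
        "service": detail.get("service", "UNKNOWN"),
        "region": event.get("region", "UNKNOWN"),
        "event_type": detail.get("eventTypeCode", "UNKNOWN"),
    }


def build_slack_payload(summary):
    """Format the summary as a simple Slack incoming-webhook payload."""
    text = (
        f"AWS Health alert: {summary['service']} in {summary['region']} "
        f"({summary['event_type']})"
    )
    return {"text": text}


# Hypothetical sample event, trimmed to the fields used above.
sample_event = {
    "source": "aws.health",
    "detail-type": "AWS Health Event",
    "region": "us-west-2",
    "detail": {
        "service": "EC2",
        "eventTypeCode": "AWS_EC2_OPERATIONAL_ISSUE",
        "eventTypeCategory": "issue",
    },
}

payload = build_slack_payload(summarize_health_event(sample_event))
print(json.dumps(payload))
```

Missing fields fall back to `UNKNOWN` so that a malformed event still produces a notification rather than an error.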

To execute the workflow automatically when an AWS Health alert is triggered, we need to configure an Incident Type and an Incident Type assignment.

Navigate to Settings -> Incident Types -> New Type

Provide the Incident Type details by selecting the priority and workflow to launch.

The last step is to associate the Incident Type with an AWS Health service. You can create multiple types and assignments for different services, each with its own priority and workflow to run.

Navigate to Settings -> Integrations -> AWS Health -> Incident Type Assignments -> New Assignments

Provide the assignment details. In this case, we are monitoring the AWS Internet Connectivity outage experienced today.
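
Conceptually, an assignment is a lookup from an AWS Health service to an Incident Type that carries a priority and a workflow. The sketch below models that idea in plain Python. Everything in it is hypothetical: the class, the service codes used as keys, the priorities, and the workflow names are placeholders for illustration, not values taken from Fylamynt.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IncidentType:
    name: str
    priority: str   # placeholder priority label
    workflow: str   # placeholder name of the workflow to launch


# Hypothetical assignments: one Incident Type per AWS Health service code.
ASSIGNMENTS = {
    "INTERNETCONNECTIVITY": IncidentType(
        "AWS Connectivity Outage", "P1", "aws-health-notify"
    ),
    "EC2": IncidentType("EC2 Degradation", "P2", "ec2-remediation"),
}


def incident_type_for(service: str) -> Optional[IncidentType]:
    """Return the Incident Type assigned to a service, or None if unassigned."""
    return ASSIGNMENTS.get(service.upper())


match = incident_type_for("ec2")
print(match)
```

A service with no assignment returns `None`, which mirrors the idea that only assigned services create incidents.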

Finally, we need to configure the collaboration channels that are opened as soon as an AWS outage triggers an incident.

Navigate to Settings -> Collaboration

For each Incident Priority, a new private Slack channel is created, and you can specify which users to add to it.

A Zoom session can also be created automatically, and a new Jira ticket opened.
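
One way to picture these settings is as a per-priority plan. The sketch below is only an illustration of that idea: the priority labels, member handles, channel naming scheme, and function names are all assumptions, and it describes the steps rather than calling the Slack, Zoom, or Jira APIs.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CollaborationPlan:
    slack_members: List[str] = field(default_factory=list)
    open_zoom: bool = False
    open_jira: bool = False


# Hypothetical plans keyed by incident priority.
PLANS = {
    "P1": CollaborationPlan(["@oncall-sre", "@eng-manager"], True, True),
    "P2": CollaborationPlan(["@oncall-sre"], False, True),
}


def collaboration_steps(priority: str, incident_id: int) -> List[str]:
    """List the collaboration steps configured for a given priority."""
    plan = PLANS.get(priority, CollaborationPlan())
    steps = []
    if plan.slack_members:
        members = ", ".join(plan.slack_members)
        steps.append(
            f"create private Slack channel #inc-{incident_id} with {members}"
        )
    if plan.open_zoom:
        steps.append("start Zoom session")
    if plan.open_jira:
        steps.append("open Jira ticket")
    return steps


for step in collaboration_steps("P1", 42):
    print(step)
```

Keeping the plan as data makes it easy to give lower priorities a lighter response, for example a ticket without a Zoom session.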

When AWS changes the state of the assigned AWS Health service event, Fylamynt automatically creates an Incident.

On the Incident Details page, each of the configured collaboration channels is available to open.

Try us out today or reach out below and we’ll set up some time to talk.

Helping SREs stay asleep.

Subrahmanya Kattamuri, PMP

Director and Senior IT Leader | Portfolio and Program Management | Digital and Data Transformation | Enterprise Data Management | Regulatory and Risk Management | MarTech Ops and Product Management| IT Strategic Planning

3 years ago

Amazon has lot of incompetent people. Value of management and consulting sucks in amazon.

Morgan L.

5+ Year Cloud Engineer & Solutions Architect w/ AWS & Azure Projects @ runtcpip.com | I Can Help - Just Ask | Writing blogs and books

3 years ago

I wrote about the one last week...didn't think it would happen again so soon, but everything still stands - The AWS team is working hard; https://www.runtcpip.com/2021/12/business-bonus-aws-outage-1272021.html

More articles by Prasen Shelar
