Another day, another AWS outage
Another day, another significant AWS outage. According to DownDetector, DoorDash, Twitch, PSN, Hulu and more went offline for an hour and a half.
It all started around 7 AM PT. Complaints poured in, with the major reports tracked on DownDetector.
User1: My Oregon EC2 Instance is down. Website cannot be accessed.
User2: Amazon Connect is down and we use it for work!
User3: Still down in Texas.
AWS went down again with another major outage. Although it lasted just over an hour and a half, a large number of popular websites were impacted. Two major AWS regions, US-WEST-1 and US-WEST-2, both suffered "internet connectivity" issues.
30 minutes after the first reports of the outage, AWS said it was "investigating Internet connectivity issues to the US-WEST-1 and US-WEST-2 Regions."
8:10 AM PST We have resolved the issue affecting Internet connectivity to the US-WEST-1 Region. Connectivity within the region was not affected by this event. The issue has been resolved and the service is operating normally.
8:14 AM PST We have resolved the issue affecting Internet connectivity to the US-WEST-2 Region. Connectivity within the region was not affected by this event. The issue has been resolved and the service is operating normally.
As an incident response platform company built for DevOps and SREs, we at Fylamynt started thinking about this incident and wanted to add one more relevant question to an already long list of headaches: “What could we have done to avoid this problem?”
The Real Problem
As much as every company dreams of cloud operations running perfectly all the time, even junior operations people know the reality: there are issues, things break, and they have to be dealt with constantly.
No cloud service is perfect, and neither are the teams that work tirelessly to keep them running. They need modern tools and processes to respond quickly when things go wrong and to help ensure your customers continue to receive the value you provide.
Why Fylamynt?
At Fylamynt, we pride ourselves on providing the most state-of-the-art incident response and remediation platform on the market.
Fylamynt is the kind of tool you need when you receive an outage alert.
Fylamynt can trigger runbooks from your favorite monitoring and alerting tools; spin up a Slack channel, Zoom session, and ticket for the response team; pull all the relevant data from tools like AWS Health; and then automate the repetitive, tedious, and time-consuming parts of the runbook to get the system operating at full capacity as quickly as possible. The SRE woken up at 3:00 AM will see everything they need to make decisions in the critical part of the process by the time they log in.
The Solution
Fylamynt’s AWS Health integration lets you start collaborating on and remediating application incidents caused by AWS outages right away. Let’s configure this solution.
Before we can create the workflow, we need to configure and authorize the AWS Health integration.
Navigate to Settings -> Integrations -> AWS Health
Detailed steps to set up an EventBridge rule are available here.
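To make the EventBridge step more concrete, here is a minimal sketch of the kind of event pattern such a rule could use. The `aws.health` source and `AWS Health Event` detail type are the values AWS documents for Health events; the `issue` category filter, the `matches` helper, and the sample event are illustrative assumptions, not Fylamynt's actual configuration.

```python
import json

# Event pattern for AWS Health events delivered through EventBridge.
# The "issue" category filter is an assumption made for this sketch.
HEALTH_EVENT_PATTERN = {
    "source": ["aws.health"],
    "detail-type": ["AWS Health Event"],
    "detail": {"eventTypeCategory": ["issue"]},
}


def matches(pattern, event):
    """Simplified local check of an event against a pattern.

    Handles only exact-value lists and nested dicts, not the full
    EventBridge pattern syntax (prefix, numeric, anything-but, ...).
    """
    for key, expected in pattern.items():
        actual = event.get(key)
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


# Hypothetical, trimmed-down Health event used only for illustration.
sample_event = {
    "source": "aws.health",
    "detail-type": "AWS Health Event",
    "region": "us-west-2",
    "detail": {
        "service": "EC2",
        "eventTypeCode": "AWS_EC2_OPERATIONAL_ISSUE",
        "eventTypeCategory": "issue",
    },
}

if __name__ == "__main__":
    # The printed JSON is what would be pasted into the rule's event pattern.
    print(json.dumps(HEALTH_EVENT_PATTERN, indent=2))
    print("sample matches:", matches(HEALTH_EVENT_PATTERN, sample_event))
```

Checking the pattern locally against a saved event is a cheap way to catch a typo before the rule is deployed, although only EventBridge itself is authoritative on what matches.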
After the integration is authorized, we can create our workflow.
Navigate to Workflows -> New Workflow
Our workflow consists of the AWS Health trigger, followed by action nodes that extract the impacted service and region and send a Slack notification to a centralized team. This is a simple example, but a workflow can also contain remediation steps for the specific service being monitored.
To execute the workflow automatically when an AWS Health alert is triggered, we need to configure an Incident Type and an Incident Type assignment.
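As a rough illustration of what those action nodes do, the sketch below pulls the service and region out of an AWS Health event and builds a Slack message from them. The function names and the sample event are hypothetical, and this is not Fylamynt's implementation; the only external conventions relied on are the `region` and `detail.service` fields of a Health event and the `text` field accepted by Slack incoming webhooks.

```python
import json


def extract_impact(event):
    """Return the impacted service, region, and event type from a Health event."""
    detail = event.get("detail", {})
    return {
        "service": detail.get("service", "UNKNOWN"),
        "region": event.get("region", "UNKNOWN"),
        "event_type": detail.get("eventTypeCode", "UNKNOWN"),
    }


def build_slack_message(impact):
    """Build a payload in the shape Slack incoming webhooks accept."""
    text = (
        f"AWS Health alert: {impact['service']} impacted in "
        f"{impact['region']} (event type: {impact['event_type']})"
    )
    return {"text": text}


# Hypothetical, trimmed-down Health event used only for illustration.
sample_event = {
    "source": "aws.health",
    "detail-type": "AWS Health Event",
    "region": "us-west-2",
    "detail": {
        "service": "EC2",
        "eventTypeCode": "AWS_EC2_OPERATIONAL_ISSUE",
        "eventTypeCategory": "issue",
    },
}

if __name__ == "__main__":
    payload = build_slack_message(extract_impact(sample_event))
    # In a real workflow this JSON would be POSTed to a Slack webhook URL.
    print(json.dumps(payload))
```

Keeping extraction and message formatting separate means the same impact data could also feed a remediation step without touching the notification code.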
Navigate to Settings -> Incident Types -> New Type
Provide the Incident Type details by selecting the priority and workflow to launch.
The last step is to associate the Incident Type with an AWS Health service. You can create multiple types and assignments for different services, each with its own priority and workflow to run.
Navigate to Settings -> Integrations -> AWS Health -> Incident Type Assignments -> New Assignments
Provide the assignment details. In this case, we are monitoring the AWS Internet Connectivity outage experienced today.
Finally, we need to configure the collaboration channels that are created as soon as an AWS outage triggers an incident.
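Conceptually, an Incident Type assignment is a lookup from an AWS Health service to the priority and workflow that should run. The sketch below models that idea in a few lines. Every name in it (the `IncidentType` class, the service codes, priorities, and workflow names) is a hypothetical placeholder chosen for illustration and does not reflect Fylamynt's data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IncidentType:
    name: str
    priority: str   # hypothetical priority label, e.g. "P1"
    workflow: str   # hypothetical name of the workflow to launch


# Hypothetical assignments: one Incident Type per AWS Health service code.
ASSIGNMENTS = {
    "EC2": IncidentType("EC2 outage", "P1", "ec2-remediation"),
    "CONNECT": IncidentType("Amazon Connect outage", "P2", "notify-support-team"),
}


def resolve_incident_type(service: str) -> Optional[IncidentType]:
    """Return the Incident Type assigned to a service, or None if unassigned."""
    return ASSIGNMENTS.get(service.upper())


if __name__ == "__main__":
    for service in ("ec2", "S3"):
        print(service, "->", resolve_incident_type(service))
```

Returning `None` for an unassigned service keeps the decision about unmonitored services explicit instead of silently applying a default priority.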
Navigate to Settings -> Collaboration
Based on each Incident Priority, a new private Slack channel is created, and you can specify the users to add to it.
A Zoom session and a new Jira ticket can also be created automatically.
When AWS changes the state of the assigned AWS Health service event, Fylamynt automatically creates an Incident.
On the Incident Details page, each of the configured collaboration channels is available to open.
Try us out today or reach out below and we’ll set up some time to talk.
Helping SREs stay asleep.