Another day another AWS outage

Prasen Shelar

Co-Founder & CEO – cooking up something epic

Published: December 15, 2021

Another day, another significant AWS outage. According to DownDetector, DoorDash, Twitch, PSN, Hulu and more went offline for an hour and a half.

It all started around 7 AM PT. Complaints poured in, and DownDetector tracked all the major reports.

User1: My Oregon EC2 Instance is down. Website cannot be accessed.

User2: Amazon Connect is down and we use it for work!

User3: Still down in Texas.

AWS went down again with another major outage. Although it lasted just over an hour and a half, a large number of popular websites were impacted. Two major AWS regions, US-WEST-1 and US-WEST-2, both suffered "internet connectivity" issues.

Thirty minutes after the first reports of the outage, AWS said it was "investigating Internet connectivity issues to the US-WEST-1 and US-WEST-2 Regions."

8:10 AM PST We have resolved the issue affecting Internet connectivity to the US-WEST-1 Region. Connectivity within the region was not affected by this event. The issue has been resolved and the service is operating normally.

8:14 AM PST We have resolved the issue affecting Internet connectivity to the US-WEST-2 Region. Connectivity within the region was not affected by this event. The issue has been resolved and the service is operating normally.

As an incident response platform company built for DevOps and SREs, we at Fylamynt started thinking about this incident and wanted to add one more relevant question to an already long list of headaches: “What could we have done to avoid this problem?”

The Real Problem

As much as every company dreams of cloud operations running perfectly all the time, even junior operations people know the reality: there are issues, things break, and they have to be dealt with constantly.

No cloud service is perfect, and neither are the teams that work tirelessly to keep them running. They need modern tools and processes to respond quickly when things go wrong, and help ensure your customers continue to receive the value you provide.

Why Fylamynt?

At Fylamynt, we pride ourselves on providing a state-of-the-art incident response and remediation platform.

Fylamynt is the kind of tool you need when you receive an outage alert.

Fylamynt can trigger runbooks from your favorite monitoring and alerting tools, spin up a Slack channel, a Zoom session, and a ticket for the response team, pull all the relevant data from tools like AWS Health, and then automate the repetitive, tedious, and time-consuming parts of the runbook to get the system operating at full capacity as quickly as possible. By the time the SRE who was woken up at 3:00 AM has logged in, everything they need to make decisions in the critical part of the process is already in front of them.

The Solution

Fylamynt’s AWS Health integration provides the ability to instantly start collaborating as well as remediating your application incidents for AWS Outages. Let’s configure this solution.

Before we can create the workflow we need to configure and authorize the AWS Health integration.

Navigate to Settings -> Integrations -> AWS Health

Detailed steps to set up an EventBridge rule are available here
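For a sense of what such a rule matches, here is a minimal sketch of an EventBridge event pattern for AWS Health events. The `aws.health` source and the `AWS Health Event` detail type are what AWS Health emits. The rule name, the service filter, and the helper names are assumptions made for illustration and are not taken from Fylamynt's setup guide.

```python
import json


def health_event_pattern(services=None):
    """Build an EventBridge event pattern that matches AWS Health events.

    `services` optionally narrows the match to specific AWS service codes
    (for example ["EC2"]); the values used here are illustrative.
    """
    pattern = {
        "source": ["aws.health"],
        "detail-type": ["AWS Health Event"],
    }
    if services:
        pattern["detail"] = {"service": list(services)}
    return pattern


def put_health_rule(events_client, rule_name, services=None):
    """Create or update the rule using a boto3 EventBridge ("events") client.

    The client is passed in so this sketch can be imported without AWS
    credentials; `rule_name` is a hypothetical name.
    """
    return events_client.put_rule(
        Name=rule_name,
        EventPattern=json.dumps(health_event_pattern(services)),
        State="ENABLED",
    )


if __name__ == "__main__":
    print(json.dumps(health_event_pattern(["EC2"]), indent=2))
```

With boto3 installed, `put_health_rule(boto3.client("events"), "my-health-rule")` would create the rule. The rule still needs a target, which should be whatever the linked setup steps specify.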

After the integration is authorized, we can create our workflow.

Navigate to Workflows -> New Workflow

Our workflow consists of the AWS Health trigger, with additional action nodes that extract the impacted service and region and send a Slack notification to a centralized team. This is a simple workflow example, but it can also contain your remediation steps based on the specific service being monitored.
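Outside of the visual workflow editor, the same two steps (extract the service and region, then notify Slack) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample event is made up, though shaped like an AWS Health event delivered through EventBridge, and the function names and the use of a Slack incoming webhook are our own choices rather than how Fylamynt's action nodes are implemented.

```python
import json
import urllib.request

# Illustrative event, shaped like an AWS Health event from EventBridge.
# All values are made up for this sketch.
SAMPLE_EVENT = {
    "source": "aws.health",
    "detail-type": "AWS Health Event",
    "region": "us-west-2",
    "detail": {
        "service": "EC2",
        "eventTypeCategory": "issue",
        "statusCode": "open",
    },
}


def extract_impact(event):
    """Return the affected service, region, and status from a Health event."""
    detail = event.get("detail", {})
    return {
        "service": detail.get("service", "UNKNOWN"),
        "region": event.get("region", "UNKNOWN"),
        "status": detail.get("statusCode", "UNKNOWN"),
    }


def slack_payload(impact):
    """Build the JSON body for a Slack incoming webhook."""
    text = (
        f"AWS Health alert: {impact['service']} in {impact['region']} "
        f"(status: {impact['status']})"
    )
    return {"text": text}


def notify_slack(webhook_url, payload):
    """POST the payload to a Slack incoming webhook URL supplied by the caller."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


if __name__ == "__main__":
    print(slack_payload(extract_impact(SAMPLE_EVENT))["text"])
```

Keeping extraction and message building separate from the network call makes the first two functions easy to test without a Slack workspace.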

To execute the workflow automatically when an AWS Health alert is triggered, we need to configure our Incident Type and Incident Type assignment.

Navigate to Settings -> Incident Types -> New Type

Provide the Incident Type details by selecting the priority and workflow to launch.

The last step is to associate the Incident Type with an AWS Health service. You can create multiple types and associations for different services, each of which might have a different priority and workflow to run.

Navigate to Settings -> Integrations -> AWS Health -> Incident Type Assignments -> New Assignments

Provide the assignment details. In this case, we are monitoring the AWS Internet Connectivity outage experienced today.
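Conceptually, an Incident Type assignment is a lookup from an AWS Health service to a priority and a workflow. The sketch below models that idea in Python. Every name in it (the `IncidentType` class, the service codes, the priority labels, and the workflow names) is hypothetical and only illustrates the mapping; it is not Fylamynt's data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IncidentType:
    name: str
    priority: str  # e.g. "P1"; the priority labels are assumptions
    workflow: str  # name of the workflow to launch (hypothetical)


# Hypothetical assignments: AWS Health service code -> Incident Type.
ASSIGNMENTS = {
    "INTERNETCONNECTIVITY": IncidentType(
        "AWS connectivity outage", "P1", "aws-health-notify"
    ),
    "EC2": IncidentType("EC2 degradation", "P2", "ec2-triage"),
}


def incident_type_for(service: str) -> Optional[IncidentType]:
    """Look up the Incident Type assigned to a service, if any."""
    return ASSIGNMENTS.get(service.upper())


if __name__ == "__main__":
    print(incident_type_for("ec2"))
```

A service with no assignment returns `None`, which mirrors the idea that only the services you explicitly assign will launch a workflow.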

Finally, we need to configure the collaboration channels that are created as soon as an AWS outage incident is triggered.

Navigate to Settings -> Collaboration

Based on each Incident Priority, a new private Slack channel will be created, and you can specify the users to add to the new channel.

A Zoom session can also be created automatically, and a new Jira ticket can be opened.

When AWS changes the state of the assigned AWS Health service event, Fylamynt automatically creates an Incident.
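One way to picture this behavior is a small tracker that remembers the last status seen for each Health event and opens an incident only when that status changes. This is a sketch of the idea under our own assumptions (the class name, the in-memory storage, and the incident fields are all hypothetical), not a description of Fylamynt's internals.

```python
class IncidentTracker:
    """Opens an incident whenever a tracked Health event changes status."""

    def __init__(self):
        self._last_status = {}  # event ARN -> last seen status code
        self.incidents = []     # incidents opened so far, as plain dicts

    def handle(self, event_arn, status):
        """Record the status; return a new incident if it changed, else None."""
        if self._last_status.get(event_arn) == status:
            return None  # repeated delivery, nothing changed
        self._last_status[event_arn] = status
        incident = {
            "id": len(self.incidents) + 1,
            "event_arn": event_arn,
            "status": status,
        }
        self.incidents.append(incident)
        return incident


if __name__ == "__main__":
    tracker = IncidentTracker()
    print(tracker.handle("arn:example:event/1", "open"))
    print(tracker.handle("arn:example:event/1", "open"))
```

Ignoring repeated deliveries of the same status keeps a noisy event stream from opening duplicate incidents.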

On the Incident Details page, each of the configured collaboration mediums is available to open.

Try us out today or reach out and we’ll set up some time to talk.

Helping SREs stay asleep.
