From Crisis to Cloud: A 3-Day Journey to AWS Resilience

In DevOps, we all know that “everything’s fine” can go sideways in a matter of hours.

That’s exactly what happened to me recently at Cars2You, when our production environment decided to go dark for three long days. It wasn’t just an outage—it was CHAOS. But out of that chaos came something better: a new infrastructure on AWS, built under pressure, with Terraform, Ansible, and a fair amount of coffee-fueled determination.

Here’s the breakdown of how it happened, the challenges we faced, and the lessons I learned along the way.

The Incident: When the Server Just... Disappeared

It started with users complaining that they couldn’t connect. Pings to the server were spotty at best. Monitoring alerts were firing, and we couldn’t SSH in to figure out what was wrong.

Naturally, my mind went straight to the usual suspects:

DDoS Attack? Nope—traffic patterns were normal.

Network Issues? Maybe, but traces suggested it was more than a connectivity blip.

Misconfiguration? Unlikely—no recent deployments or updates.

Meanwhile, our hosting provider, Contabo, wasn’t much help. Their responses were vague and delayed, leaving us to troubleshoot blindly.

The Root Cause: A Migration Mishap

After hours of back and forth, we found the culprit. Contabo had migrated our server to a new region without notifying us. And to make matters worse, they forgot to properly configure the root and backup disks. The server was technically “there”, but completely unusable.

We were looking at an extended outage with no clear timeline for resolution. That’s when I made the call: stop waiting, start building. We’d already been planning to migrate to AWS—this was the push we needed to fast-track the move.

The Plan: Rebuilding on AWS—Fast

The outage put us in a tricky spot. We needed to rebuild everything—backend, frontend, database, and all the supporting infrastructure—in record time. Thankfully, some groundwork was already in place for the AWS migration. Here’s how I approached it.

Step 1: Spin Up Infrastructure with Terraform

First up: getting the basic building blocks in place. Using Terraform, I quickly defined the following (a rough HCL sketch comes right after the list):

  • EC2 Instances for our Dockerized backend and frontend services.
  • RDS for a managed, reliable database.
  • S3 Buckets for storing backups and assets.
  • IAM Roles to lock down permissions securely.
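
For illustration only, here's roughly what a couple of those resources look like in HCL. Nothing below is our actual configuration: the region, AMI ID, names, sizes, and the choice of a MySQL-family engine for RDS are all placeholders or assumptions.

    # Illustrative only, not our actual configuration.
    provider "aws" {
      region = "eu-central-1"   # assumption: the article doesn't name the region
    }

    variable "db_username" {
      type = string
    }

    variable "db_password" {
      type      = string
      sensitive = true
    }

    resource "aws_instance" "app" {
      ami           = "ami-0123456789abcdef0"   # placeholder AMI
      instance_type = "t3.medium"               # placeholder size

      tags = {
        Name = "cars2you-app"                   # hypothetical name
      }
    }

    resource "aws_db_instance" "main" {
      identifier        = "cars2you-db"         # hypothetical identifier
      engine            = "mysql"               # assumption: MySQL-family engine
      instance_class    = "db.t3.medium"        # placeholder size
      allocated_storage = 20
      username          = var.db_username
      password          = var.db_password
    }

    resource "aws_s3_bucket" "backups" {
      bucket = "cars2you-backups-example"       # placeholder bucket name
    }

    resource "aws_iam_role" "app_role" {
      name = "cars2you-app-role"                # hypothetical role name

      assume_role_policy = jsonencode({
        Version = "2012-10-17"
        Statement = [{
          Effect    = "Allow"
          Action    = "sts:AssumeRole"
          Principal = { Service = "ec2.amazonaws.com" }
        }]
      })
    }

From there, terraform plan shows exactly what will be created before anything goes live.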

The Terraform project is organized based on the approach outlined in this article: Optimizing Infrastructure Deployment with Terraform Across Multiple Organizations and Environments.
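
I won't reproduce that article's exact layout here, but the general shape of a multi-organization, multi-environment Terraform project is roughly this (directory and module names are illustrative, not our real ones):

    terraform/
    ├── modules/                 # reusable building blocks (EC2 app, RDS, S3, IAM)
    │   ├── ec2-app/
    │   ├── rds/
    │   └── s3-backups/
    └── environments/
        ├── cars2you/            # one folder per organization
        │   ├── staging/         # one state and variable set per environment
        │   └── production/
        └── ...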

(Screenshot in the original post: the Terraform project files.)

The beauty of Terraform is that you can version-control your infrastructure. With some tweaks to what I’d already been working on, the setup went live in no time. The consistency was a lifesaver under pressure.

Step 2: Automate Configurations with Ansible

Next came the heavy lifting: setting up the environment on those shiny new EC2 instances. This is where Ansible shined. It handled:

Docker Installation: Installing Docker and bringing up containers for our legacy services.

Environment Variables: Injecting secrets and configuration details seamlessly.

Security Hardening: Applying basic protections—because a rushed setup is no excuse for cutting corners on security.

With Ansible playbooks, everything was repeatable and predictable. If an instance failed or needed tweaking, I could redeploy it in minutes.
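
As a sketch only (this isn't the real playbook), a stripped-down play covering those three jobs might look like the following. It assumes Debian/Ubuntu hosts, an app_servers inventory group, and hypothetical file paths:

    # Illustrative playbook, not the actual one.
    - name: Configure application hosts
      hosts: app_servers                   # assumed inventory group
      become: true

      vars:
        app_env_file: /opt/app/.env        # hypothetical path

      tasks:
        - name: Install Docker
          ansible.builtin.apt:
            name: docker.io
            state: present
            update_cache: true

        - name: Inject environment variables and secrets
          ansible.builtin.template:
            src: templates/app.env.j2      # hypothetical template
            dest: "{{ app_env_file }}"
            mode: "0600"

        - name: Basic hardening (disable SSH password authentication)
          ansible.builtin.lineinfile:
            path: /etc/ssh/sshd_config
            regexp: '^#?PasswordAuthentication'
            line: 'PasswordAuthentication no'
          notify: Restart sshd

      handlers:
        - name: Restart sshd
          ansible.builtin.service:
            name: ssh
            state: restarted

Rebuilding a fresh instance then comes down to one command, something like ansible-playbook -i inventory.ini site.yml (file names hypothetical).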

I also created some Bash scripts to automate the management of Docker services:

https://github.com/renanfenrich/docker-composer-easy-setup
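
The repo above has the actual scripts; the rough idea is a thin wrapper around Docker Compose so that day-to-day operations become a single command. A simplified sketch of that kind of wrapper (not the repo's code):

    #!/usr/bin/env bash
    # Simplified sketch of a Docker Compose wrapper, not the repo's actual scripts.
    set -euo pipefail

    COMPOSE_FILE="${COMPOSE_FILE:-docker-compose.yml}"   # hypothetical default

    cmd="${1:-help}"
    shift || true

    case "$cmd" in
      up)      docker compose -f "$COMPOSE_FILE" up -d "$@" ;;
      down)    docker compose -f "$COMPOSE_FILE" down ;;
      restart) docker compose -f "$COMPOSE_FILE" restart "$@" ;;
      logs)    docker compose -f "$COMPOSE_FILE" logs -f --tail=100 "$@" ;;
      status)  docker compose -f "$COMPOSE_FILE" ps ;;
      *)       echo "Usage: $0 {up|down|restart|logs|status} [service...]"; exit 1 ;;
    esac

With something like that in place, restarting a flaky service is just ./dc.sh restart backend (hypothetical file and service names).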

Step 3: Rescuing and Migrating Data

The database was the messiest part. Our backups were incomplete, and what we had wasn’t well-organized (lesson learned the hard way). Still, I managed to recover enough to rebuild the critical pieces.

Additionally, because the legacy database had been misconfigured, the character sets and collations in the backup were a complete mess, and it took hours of debugging and conversion to untangle them.
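
For context, here's a heavily simplified sketch of the kind of conversion involved. It assumes a MySQL/MariaDB database (the article doesn't name the engine) and hypothetical database, table, and connection details; real-world encoding mix-ups usually need more careful, case-by-case handling.

    # Re-declare the dump's character set so the data is interpreted correctly on import
    # (assumes the dump declared latin1 while the data was effectively UTF-8).
    sed -i 's/SET NAMES latin1/SET NAMES utf8mb4/g; s/CHARSET=latin1/CHARSET=utf8mb4/g' backup.sql

    # After importing, normalise charset and collation at the database and table level
    # (hypothetical database/table names and credentials).
    mysql -h "$DB_HOST" -u "$DB_USER" -p"$DB_PASS" \
      -e "ALTER DATABASE app_db CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
          ALTER TABLE app_db.users CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;"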

Once restored, the data was loaded into RDS, which took the headache out of managing backups and failovers moving forward. Redirecting services to the new database was straightforward once the recovery was complete.
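
Concretely, the cut-over boiled down to commands along these lines, where the RDS endpoint, database name, and variable names are placeholders:

    # Import the cleaned dump into the new RDS instance (placeholder endpoint and names).
    mysql -h cars2you-db.xxxxxxxx.eu-central-1.rds.amazonaws.com \
          -u "$DB_USER" -p"$DB_PASS" app_db < backup.sql

    # "Redirecting services" then just means swapping the connection string they read
    # at startup, e.g. a hypothetical entry in the .env file used by Docker Compose:
    # DATABASE_URL=mysql://app_user:<secret>@cars2you-db.xxxxxxxx.eu-central-1.rds.amazonaws.com:3306/app_db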

Step 4: Testing and Going Live

Even in a rush, you can't skip testing. I ran several checks to ensure the following (a few example spot-check commands are sketched after the list):

  • All Docker services were up and talking to each other
  • Service environment variables and secrets were correctly configured
  • The database connections were stable and performed efficiently
  • User-facing functionality was back to normal
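
A few of those spot-checks can be as simple as the commands below; the service name, health endpoint, and credentials are hypothetical:

    # Containers up?
    docker compose ps

    # Variables and secrets wired into a service? (hypothetical service name; values masked)
    docker compose exec backend env | grep -E 'DB_|API_' | sed 's/=.*/=***/'

    # Database reachable and responsive from the app host?
    mysql -h "$DB_HOST" -u "$DB_USER" -p"$DB_PASS" -e 'SELECT 1;'

    # User-facing functionality responding? (hypothetical health endpoint)
    curl -fsS https://app.example.com/health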

When everything passed, it was time to flip the DNS and bring users back online.

The Challenges: What Slowed Us Down

It wasn’t all smooth sailing. Some of the big hurdles included:

Missing Documentation: Recreating legacy services without clear instructions was like playing a guessing game.

Partial Backups: Critical data was scattered across old systems, making recovery tedious.

Legacy Dependencies: Some services depended on outdated tools and configurations that didn’t play nicely with AWS.

Still, automation tools like Terraform and Ansible saved me from drowning in manual work. Without them, the rebuild would’ve taken far longer.

Lessons Learned: What I’d Do Differently Next Time

Every outage is a learning opportunity, and this one was no exception. Here’s what I took away:

1. Document Everything: Infrastructure, application setups, and recovery procedures should all be clear and accessible.

2. Test Disaster Recovery Plans: Having backups isn’t enough—you need to practice restoring them regularly.

3. Vet Your Cloud Providers: Reliability and support matter more than low costs.

4. Be Proactive About Migration: Because we’d already started planning the move to AWS, I had a head start. Without that, this would’ve been an even bigger nightmare.

The Outcome: Stronger, Faster, More Reliable

By the end of day three, we were fully operational on AWS. The migration wasn’t perfect, but it brought some major improvements:

Reliability: No more worrying about unexpected disk misconfigurations or migrations.

Scalability: The new setup can grow as we do.

Disaster Recovery: RDS and S3 make backups and failovers a breeze compared to the old setup.
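
On the Terraform side, a lot of that comes down to a handful of attributes. An illustrative sketch that builds on the Step 1 resources (values are placeholders, not our settings):

    # Versioning on the backup bucket from the Step 1 sketch (illustrative).
    resource "aws_s3_bucket_versioning" "backups" {
      bucket = aws_s3_bucket.backups.id

      versioning_configuration {
        status = "Enabled"
      }
    }

    # On the aws_db_instance from Step 1, automated backups and a standby for
    # failover come down to attributes like these:
    #   backup_retention_period = 7      # days of automated backups
    #   multi_az                = true   # synchronous standby in another AZ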

Wrapping Up: Turning Chaos Into Opportunity

This incident was a rollercoaster, but it reminded me why I love DevOps: the challenges are hard, but the solutions are rewarding. Building something resilient under pressure is one of the most satisfying parts of the job.

If you’re ever in a similar situation, here’s my advice:

Start automating—tools like Terraform and Ansible are your best friends in a crisis.

Always plan for failure, even when everything’s working fine.

And above all, don’t panic. Even a total meltdown can be an opportunity to rebuild stronger.

Got a similar story? I’d love to hear how you tackled it!

After all, you only find out who and what is really good when a crisis like this hits!
