From Crisis to Cloud: A 3-Day Journey to AWS Resilience

In DevOps, we all know that “everything’s fine” can go sideways in a matter of hours.

That’s exactly what happened to me recently at Cars2You, when our production environment decided to go dark for three long days. It wasn’t just an outage—it was CHAOS. But out of that chaos came something better: a new infrastructure on AWS, built under pressure, with Terraform, Ansible, and a fair amount of coffee-fueled determination.

Here’s the breakdown of how it happened, the challenges we faced, and the lessons I learned along the way.

The Incident: When the Server Just... Disappeared

It started with users complaining that they couldn’t connect. Pings to the server were spotty at best. Monitoring alerts were firing, and we couldn’t SSH in to figure out what was wrong.

Naturally, my mind went straight to the usual suspects:

DDoS Attack? Nope—traffic patterns were normal.

Network Issues? Maybe, but traces suggested it was more than a connectivity blip.

Misconfiguration? Unlikely—no recent deployments or updates.

Meanwhile, our hosting provider, Contabo, wasn’t much help. Their responses were vague and delayed, leaving us to troubleshoot blindly.

The Root Cause: A Migration Mishap

After hours of back and forth, we found the culprit. Contabo had migrated our server to a new region without notifying us. And to make matters worse, they forgot to properly configure the root and backup disks. The server was technically “there”, but completely unusable.

We were looking at an extended outage with no clear timeline for resolution. That’s when I made the call: stop waiting, start building. We’d already been planning to migrate to AWS—this was the push we needed to fast-track the move.

The Plan: Rebuilding on AWS—Fast

The outage put us in a tricky spot. We needed to rebuild everything—backend, frontend, database, and all the supporting infrastructure—in record time. Thankfully, some groundwork was already in place for the AWS migration. Here’s how I approached it.

Step 1: Spin Up Infrastructure with Terraform

First up: getting the basic building blocks in place. Using Terraform, I quickly defined the following (a rough HCL sketch comes right after the list):

  • EC2 Instances for our Dockerized backend and frontend services.
  • RDS for a managed, reliable database.
  • S3 Buckets for storing backups and assets.
  • IAM Roles to lock down permissions securely.
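
For illustration only, here's roughly what a couple of those resources look like in HCL. Nothing below is our actual configuration: the region, AMI ID, names, sizes, and the choice of a MySQL-family engine for RDS are all placeholders or assumptions.

    # Illustrative only, not our actual configuration.
    provider "aws" {
      region = "eu-central-1"   # assumption: the article doesn't name the region
    }

    variable "db_username" {
      type = string
    }

    variable "db_password" {
      type      = string
      sensitive = true
    }

    resource "aws_instance" "app" {
      ami           = "ami-0123456789abcdef0"   # placeholder AMI
      instance_type = "t3.medium"               # placeholder size

      tags = {
        Name = "cars2you-app"                   # hypothetical name
      }
    }

    resource "aws_db_instance" "main" {
      identifier        = "cars2you-db"         # hypothetical identifier
      engine            = "mysql"               # assumption: MySQL-family engine
      instance_class    = "db.t3.medium"        # placeholder size
      allocated_storage = 20
      username          = var.db_username
      password          = var.db_password
    }

    resource "aws_s3_bucket" "backups" {
      bucket = "cars2you-backups-example"       # placeholder bucket name
    }

    resource "aws_iam_role" "app_role" {
      name = "cars2you-app-role"                # hypothetical role name

      assume_role_policy = jsonencode({
        Version = "2012-10-17"
        Statement = [{
          Effect    = "Allow"
          Action    = "sts:AssumeRole"
          Principal = { Service = "ec2.amazonaws.com" }
        }]
      })
    }

From there, terraform plan shows exactly what will be created before anything goes live.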

The Terraform project is organized based on the approach outlined in this article: Optimizing Infrastructure Deployment with Terraform Across Multiple Organizations and Environments.
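
I won't reproduce that article's exact layout here, but the general shape of a multi-organization, multi-environment Terraform project is roughly this (directory and module names are illustrative, not our real ones):

    terraform/
    ├── modules/                 # reusable building blocks (EC2 app, RDS, S3, IAM)
    │   ├── ec2-app/
    │   ├── rds/
    │   └── s3-backups/
    └── environments/
        ├── cars2you/            # one folder per organization
        │   ├── staging/         # one state and variable set per environment
        │   └── production/
        └── ...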

(Screenshot in the original post: the Terraform project files.)

The beauty of Terraform is that you can version-control your infrastructure. With some tweaks to what I’d already been working on, the setup went live in no time. The consistency was a lifesaver under pressure.

Step 2: Automate Configurations with Ansible

Next came the heavy lifting: setting up the environment on those shiny new EC2 instances. This is where Ansible shined. It handled:

Docker Installation: Installing Docker and bringing up containers for our legacy services.

Environment Variables: Injecting secrets and configuration details seamlessly.

Security Hardening: Applying basic protections—because a rushed setup is no excuse for cutting corners on security.

With Ansible playbooks, everything was repeatable and predictable. If an instance failed or needed tweaking, I could redeploy it in minutes.
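
As a sketch only (this isn't the real playbook), a stripped-down play covering those three jobs might look like the following. It assumes Debian/Ubuntu hosts, an app_servers inventory group, and hypothetical file paths:

    # Illustrative playbook, not the actual one.
    - name: Configure application hosts
      hosts: app_servers                   # assumed inventory group
      become: true

      vars:
        app_env_file: /opt/app/.env        # hypothetical path

      tasks:
        - name: Install Docker
          ansible.builtin.apt:
            name: docker.io
            state: present
            update_cache: true

        - name: Inject environment variables and secrets
          ansible.builtin.template:
            src: templates/app.env.j2      # hypothetical template
            dest: "{{ app_env_file }}"
            mode: "0600"

        - name: Basic hardening (disable SSH password authentication)
          ansible.builtin.lineinfile:
            path: /etc/ssh/sshd_config
            regexp: '^#?PasswordAuthentication'
            line: 'PasswordAuthentication no'
          notify: Restart sshd

      handlers:
        - name: Restart sshd
          ansible.builtin.service:
            name: ssh
            state: restarted

Rebuilding a fresh instance then comes down to one command, something like ansible-playbook -i inventory.ini site.yml (file names hypothetical).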

I also created some Bash scripts to automate the management of Docker services:

https://github.com/renanfenrich/docker-composer-easy-setup
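
The repo above has the actual scripts; the rough idea is a thin wrapper around Docker Compose so that day-to-day operations become a single command. A simplified sketch of that kind of wrapper (not the repo's code):

    #!/usr/bin/env bash
    # Simplified sketch of a Docker Compose wrapper, not the repo's actual scripts.
    set -euo pipefail

    COMPOSE_FILE="${COMPOSE_FILE:-docker-compose.yml}"   # hypothetical default

    cmd="${1:-help}"
    shift || true

    case "$cmd" in
      up)      docker compose -f "$COMPOSE_FILE" up -d "$@" ;;
      down)    docker compose -f "$COMPOSE_FILE" down ;;
      restart) docker compose -f "$COMPOSE_FILE" restart "$@" ;;
      logs)    docker compose -f "$COMPOSE_FILE" logs -f --tail=100 "$@" ;;
      status)  docker compose -f "$COMPOSE_FILE" ps ;;
      *)       echo "Usage: $0 {up|down|restart|logs|status} [service...]"; exit 1 ;;
    esac

With something like that in place, restarting a flaky service is just ./dc.sh restart backend (hypothetical file and service names).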

Step 3: Rescuing and Migrating Data

The database was the messiest part. Our backups were incomplete, and what we had wasn’t well-organized (lesson learned the hard way). Still, I managed to recover enough to rebuild the critical pieces.

Additionally, because the legacy database had been misconfigured, the character sets and collations in the backup were a complete mess, and it took hours of debugging and conversion to untangle them.
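
For context, here's a heavily simplified sketch of the kind of conversion involved. It assumes a MySQL/MariaDB database (the article doesn't name the engine) and hypothetical database, table, and connection details; real-world encoding mix-ups usually need more careful, case-by-case handling.

    # Re-declare the dump's character set so the data is interpreted correctly on import
    # (assumes the dump declared latin1 while the data was effectively UTF-8).
    sed -i 's/SET NAMES latin1/SET NAMES utf8mb4/g; s/CHARSET=latin1/CHARSET=utf8mb4/g' backup.sql

    # After importing, normalise charset and collation at the database and table level
    # (hypothetical database/table names and credentials).
    mysql -h "$DB_HOST" -u "$DB_USER" -p"$DB_PASS" \
      -e "ALTER DATABASE app_db CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
          ALTER TABLE app_db.users CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;"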

Once restored, the data was loaded into RDS, which took the headache out of managing backups and failovers moving forward. Redirecting services to the new database was straightforward once the recovery was complete.
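
Concretely, the cut-over boiled down to commands along these lines, where the RDS endpoint, database name, and variable names are placeholders:

    # Import the cleaned dump into the new RDS instance (placeholder endpoint and names).
    mysql -h cars2you-db.xxxxxxxx.eu-central-1.rds.amazonaws.com \
          -u "$DB_USER" -p"$DB_PASS" app_db < backup.sql

    # "Redirecting services" then just means swapping the connection string they read
    # at startup, e.g. a hypothetical entry in the .env file used by Docker Compose:
    # DATABASE_URL=mysql://app_user:<secret>@cars2you-db.xxxxxxxx.eu-central-1.rds.amazonaws.com:3306/app_db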

Step 4: Testing and Going Live

Even in a rush, you can't skip testing. I ran several checks to ensure the following (a few example spot-check commands are sketched after the list):

  • All Docker services were up and talking to each other
  • Service environment variables and secrets were correctly configured
  • The database connections were stable and performed efficiently
  • User-facing functionality was back to normal
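
A few of those spot-checks can be as simple as the commands below; the service name, health endpoint, and credentials are hypothetical:

    # Containers up?
    docker compose ps

    # Variables and secrets wired into a service? (hypothetical service name; values masked)
    docker compose exec backend env | grep -E 'DB_|API_' | sed 's/=.*/=***/'

    # Database reachable and responsive from the app host?
    mysql -h "$DB_HOST" -u "$DB_USER" -p"$DB_PASS" -e 'SELECT 1;'

    # User-facing functionality responding? (hypothetical health endpoint)
    curl -fsS https://app.example.com/health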

When everything passed, it was time to flip the DNS and bring users back online.

The Challenges: What Slowed Us Down

It wasn’t all smooth sailing. Some of the big hurdles included:

Missing Documentation: Recreating legacy services without clear instructions was like playing a guessing game.

Partial Backups: Critical data was scattered across old systems, making recovery tedious.

Legacy Dependencies: Some services depended on outdated tools and configurations that didn’t play nicely with AWS.

Still, automation tools like Terraform and Ansible saved me from drowning in manual work. Without them, the rebuild would’ve taken far longer.

Lessons Learned: What I’d Do Differently Next Time

Every outage is a learning opportunity, and this one was no exception. Here’s what I took away:

1. Document Everything: Infrastructure, application setups, and recovery procedures should all be clear and accessible.

2. Test Disaster Recovery Plans: Having backups isn’t enough—you need to practice restoring them regularly.

3. Vet Your Cloud Providers: Reliability and support matter more than low costs.

4. Be Proactive About Migration: Because we’d already started planning the move to AWS, I had a head start. Without that, this would’ve been an even bigger nightmare.

The Outcome: Stronger, Faster, More Reliable

By the end of day three, we were fully operational on AWS. The migration wasn’t perfect, but it brought some major improvements:

Reliability: No more worrying about unexpected disk misconfigurations or migrations.

Scalability: The new setup can grow as we do.

Disaster Recovery: RDS and S3 make backups and failovers a breeze compared to the old setup.
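
On the Terraform side, a lot of that comes down to a handful of attributes. An illustrative sketch that builds on the Step 1 resources (values are placeholders, not our settings):

    # Versioning on the backup bucket from the Step 1 sketch (illustrative).
    resource "aws_s3_bucket_versioning" "backups" {
      bucket = aws_s3_bucket.backups.id

      versioning_configuration {
        status = "Enabled"
      }
    }

    # On the aws_db_instance from Step 1, automated backups and a standby for
    # failover come down to attributes like these:
    #   backup_retention_period = 7      # days of automated backups
    #   multi_az                = true   # synchronous standby in another AZ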

Wrapping Up: Turning Chaos Into Opportunity

This incident was a rollercoaster, but it reminded me why I love DevOps: the challenges are hard, but the solutions are rewarding. Building something resilient under pressure is one of the most satisfying parts of the job.

If you’re ever in a similar situation, here’s my advice:

Start automating—tools like Terraform and Ansible are your best friends in a crisis.

Always plan for failure, even when everything’s working fine.

And above all, don’t panic. Even a total meltdown can be an opportunity to rebuild stronger.

Got a similar story? I’d love to hear how you tackled it!

After all, you only find out who and what is really good when a crisis like this hits!
