BOOM! Yesterday half of our cloud infrastructure disappeared
Yesterday morning around 9:30 CET, Microsoft engineers unintentionally threw us a challenge by rolling out a change on their VMs that went wrong. It unexpectedly took out 60% of the VMs running our Kubernetes clusters.
As a team we got pretty worried, because that could mean a lot of business impact across Salling Group eCommerce, apps and stores. At the same time we were flying blind, as Azure's own monitoring services were also down since they run on the same infrastructure...
We immediately assembled an incident response team with the purpose of mitigating the incident as much as possible. It was clear to us that this was an underlying Azure issue, so it was all hands on deck to get the VMs back up - if possible - while at the same time triggering a reshuffling of our workloads to limit the impact.
The business impact was assessed, and fortunately we had done our job well. Only a handful of our 50+ services were heavily affected. It took only around 20-30 minutes before all services were restored, either fully or to a degraded state, and we got to have our lunch with peace of mind.
We were saved by our anti-affinity rules and redundant setup, which ensured that the replicas of each service were spread across as many VMs as possible. This meant that only a few services were completely gone, while the vast majority of our 50+ services were running just fine or in a degraded state - but running and serving. Luckily for us, core services like Traefik and our system nodes were also only partially hit, allowing traffic to keep flowing.
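To illustrate why spreading replicas matters, here is a minimal sketch in Python. It is a toy model, not our actual cluster setup or scheduler: the function names, the round-robin placement, the 10 VMs, the 3 replicas per service and the choice of which VMs fail are all assumptions made for illustration. It compares a placement where each service's replicas land on distinct VMs (roughly what anti-affinity aims for) with one where all replicas of a service share a single VM, and counts how many services stay healthy, run degraded or go down when 60% of the VMs disappear.

```python
def place_spread(num_services, replicas, num_vms):
    """Hypothetical round-robin placement: the replicas of a service
    land on distinct VMs (requires replicas <= num_vms)."""
    return {
        s: [(s * replicas + r) % num_vms for r in range(replicas)]
        for s in range(num_services)
    }


def place_packed(num_services, replicas, num_vms):
    """Hypothetical worst case: all replicas of a service share one VM."""
    return {s: [s % num_vms] * replicas for s in range(num_services)}


def classify(placement, failed_vms):
    """Count services that are healthy (all replicas alive),
    degraded (some replicas alive) or down (no replicas alive)."""
    status = {"healthy": 0, "degraded": 0, "down": 0}
    for vms in placement.values():
        alive = sum(1 for vm in vms if vm not in failed_vms)
        if alive == len(vms):
            status["healthy"] += 1
        elif alive > 0:
            status["degraded"] += 1
        else:
            status["down"] += 1
    return status


# Assumed numbers for illustration: 50 services, 3 replicas each, 10 VMs,
# and 6 of the 10 VMs (60%) failing at once.
NUM_SERVICES, REPLICAS, NUM_VMS = 50, 3, 10
failed = set(range(6))

spread = classify(place_spread(NUM_SERVICES, REPLICAS, NUM_VMS), failed)
packed = classify(place_packed(NUM_SERVICES, REPLICAS, NUM_VMS), failed)

print("spread:", spread)
print("packed:", packed)
```

In this toy model the packed placement can never end up degraded: a service is either untouched or completely gone. Spreading the replicas turns many of those total outages into degraded-but-serving services. A real scheduler with more VMs and zone-aware spreading would behave differently in the details, so the numbers printed here should not be read as a prediction for any real cluster.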
It pays off to design for reliability and disaster recovery, and to take the time to make sure the settings and features of a platform are actually used, so that when something blows up the business impact is minimal.
The post-mortem has been done and there is more to learn. We're now assessing the learnings and how we will prepare for the next BOOM, because it will happen whether we want it or not.
Thank you, team - great work!
The Azure incident in question: "Azure customers running Canonical Ubuntu 18.04 experiencing DNS errors"