An engineering deep-dive into Atlassian's Mega Outage of April 2022
In April 2022, Atlassian suffered a major outage in which they "permanently" deleted the data of 400 of their paying cloud customers, and recovering that data will take them weeks. Let's dissect the outage and understand its nuances.
Disclaimer: I do not have any insider information and the views are pure speculation.
Insight 1: Data loss up to 5 minutes
Some customers reported losing up to 5 minutes of data from just before the incident, which suggests that the persistent backup is taken incrementally every 5 minutes. Such a backup typically happens through Change Data Capture (CDC), which runs directly on top of the database.
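To make the 5-minute window concrete, here is a minimal sketch of an incremental backup that copies only the changes made since its previous run. The ChangeLog and IncrementalBackup classes, the timestamps, and the issue names are all hypothetical; this is an illustration of the idea, not how Atlassian's pipeline is actually built.

```python
from dataclasses import dataclass, field

# Assumed from the ~5 minute loss customers reported; not a confirmed setting.
BACKUP_INTERVAL_SECONDS = 5 * 60


@dataclass
class ChangeLog:
    """Hypothetical stand-in for the database's ordered change stream."""
    events: list = field(default_factory=list)  # (timestamp_seconds, change)

    def append(self, ts, change):
        self.events.append((ts, change))


@dataclass
class IncrementalBackup:
    """Copies only the changes made since the previous backup run."""
    stored: list = field(default_factory=list)
    last_backed_up_ts: float = 0.0

    def run(self, log, now):
        new_events = [e for e in log.events if self.last_backed_up_ts < e[0] <= now]
        self.stored.extend(new_events)
        self.last_backed_up_ts = now
        return len(new_events)


log = ChangeLog()
backup = IncrementalBackup()

log.append(60, "create issue A")
log.append(250, "edit issue A")
backup.run(log, now=300)           # scheduled run at t = 5 min captures both changes

log.append(400, "create issue B")  # written after the last completed run
incident_at = 450                  # data is deleted before the next run at t = 10 min

# Anything newer than the last completed run exists only in the live database.
lost = [e for e in log.events if backup.last_backed_up_ts < e[0] <= incident_at]

print("backed up:", [change for _, change in backup.stored])
print("lost:", [change for _, change in lost])
print("worst-case loss window (s):", BACKUP_INTERVAL_SECONDS)
```

The worst case is a write that lands right after a backup run finishes: it has to wait almost a full interval for the next run, which is why the loss is bounded by the backup interval.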
Insight 2: Rolling release of products
Atlassian rolls out features to a subset of users and then incrementally rolls them out to the rest. This incremental rollout strategy gives companies and teams a chance to test the waters on a small subset before exposing the change to everyone.
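One common way to implement such a rollout is to hash each customer into a stable bucket and enable the feature for buckets below the current percentage. The sketch below assumes that approach; the function names, the "new-editor" feature, and the customer IDs are made up for illustration, and Atlassian's real mechanism is not known to me.

```python
import hashlib


def rollout_bucket(customer_id: str, feature: str) -> int:
    """Map a customer to a stable bucket in [0, 100) for a given feature."""
    digest = hashlib.sha256(f"{feature}:{customer_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def is_enabled(customer_id: str, feature: str, rollout_percent: int) -> bool:
    """A customer sees the feature once the rollout percentage passes its bucket."""
    return rollout_bucket(customer_id, feature) < rollout_percent


customers = [f"customer-{i}" for i in range(1000)]
for percent in (5, 25, 100):
    enabled = [c for c in customers if is_enabled(c, "new-editor", percent)]
    print(f"{percent}% rollout -> {len(enabled)} of {len(customers)} customers")
```

Because the bucket depends only on the feature and the customer, raising the percentage only ever adds customers; nobody who already has the feature loses it midway through the rollout.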
Insight 3: Mark vs Permanent Deletion
The script that Atlassian ran to delete the data had both options: mark for deletion and permanent deletion.
Mark for deletion is a soft delete, i.e. setting is_deleted to true. Permanent deletion is a hard delete, i.e. firing a DELETE query.
Why do companies need permanent deletion? For compliance, because GDPR gives users a Right to be Forgotten.
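The difference between the two modes is easy to see on a tiny table. This is a sketch using an in-memory SQLite database; the sites table and the helper functions are hypothetical, and only the is_deleted flag and the DELETE query come from the description above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sites ("
    "id INTEGER PRIMARY KEY, name TEXT, is_deleted INTEGER NOT NULL DEFAULT 0)"
)
conn.executemany("INSERT INTO sites (id, name) VALUES (?, ?)", [(1, "acme"), (2, "globex")])


def mark_for_deletion(conn, site_id):
    """Soft delete: the row stays in the table and is only flagged."""
    conn.execute("UPDATE sites SET is_deleted = 1 WHERE id = ?", (site_id,))


def undo_mark(conn, site_id):
    """A soft delete can be reversed by clearing the flag."""
    conn.execute("UPDATE sites SET is_deleted = 0 WHERE id = ?", (site_id,))


def delete_permanently(conn, site_id):
    """Hard delete: the row is gone from the live database."""
    conn.execute("DELETE FROM sites WHERE id = ?", (site_id,))


def visible_sites(conn):
    rows = conn.execute("SELECT name FROM sites WHERE is_deleted = 0 ORDER BY id")
    return [name for (name,) in rows]


mark_for_deletion(conn, 1)
print(visible_sites(conn))   # acme is hidden, but its row still exists

undo_mark(conn, 1)           # recoverable with a single UPDATE
delete_permanently(conn, 2)
undo_mark(conn, 2)           # no row left to update, so this changes nothing
print(visible_sites(conn))   # globex can now only come back from a backup
```

A soft delete leaves an easy way back, while a hard delete pushes recovery onto the backup and restore path, which is what makes the choice of mode in such a script so consequential.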
Insight 4: Synchronous Replication
To maintain high availability, they have synchronous standby replicas, which means that a write needs to succeed on both databases before it is acknowledged back to the user. This ensures that acknowledged data survives the crash of a single node.
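The acknowledgement rule can be modelled in a few lines. This is a toy sketch with hypothetical in-memory Replica objects; real databases implement synchronous replication by shipping their write-ahead log and waiting for the standby to confirm, not by rolling back in application code as done here.

```python
class Replica:
    """Hypothetical in-memory stand-in for a database node."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.data = {}

    def apply(self, key, value):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is unreachable")
        self.data[key] = value


def synchronous_write(primary, standby, key, value):
    """Return True (acknowledge) only if both nodes hold the write."""
    had_key, previous = key in primary.data, primary.data.get(key)
    try:
        primary.apply(key, value)
        standby.apply(key, value)
    except ConnectionError:
        # Simplification: undo the primary's change so the nodes do not diverge.
        if had_key:
            primary.data[key] = previous
        else:
            primary.data.pop(key, None)
        return False
    return True


primary, standby = Replica("primary"), Replica("standby")

print(synchronous_write(primary, standby, "issue-1", "open"))    # both nodes healthy
standby.healthy = False
print(synchronous_write(primary, standby, "issue-1", "closed"))  # standby down, not acknowledged
print(primary.data == standby.data)                              # nodes still agree
```

Note that this protects against a node crashing, not against a bad delete: a deletion that succeeds is faithfully replicated to the standby as well, so replicas are no substitute for backups.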
Insight 5: Immutable Backups
The backups are immutable and stored on S3 in some serialized format. These immutable backups allow Atlassian to recover data to any point in time while remaining very cost-efficient.
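Point-in-time recovery from immutable backups usually boils down to taking a full snapshot and replaying the recorded changes up to the moment you want. The sketch below assumes that model; the snapshot layout, the change tuples, and restore_to are hypothetical and say nothing about the actual format Atlassian stores on S3.

```python
from types import MappingProxyType

# Hypothetical backup objects, made read-only to mimic immutable storage:
# one full snapshot plus a time-ordered tuple of (timestamp, key, value) changes.
snapshot = MappingProxyType({
    "taken_at": 0,
    "rows": MappingProxyType({"issue-1": "open"}),
})
changes = (
    (100, "issue-1", "in progress"),
    (200, "issue-2", "open"),
    (300, "issue-1", None),  # None records a deletion
)


def restore_to(snapshot, changes, target_ts):
    """Rebuild the state as of target_ts without modifying the backup itself."""
    rows = dict(snapshot["rows"])          # work on a copy
    for ts, key, value in changes:
        if ts <= snapshot["taken_at"]:
            continue                       # already contained in the snapshot
        if ts > target_ts:
            break                          # changes are ordered, so we can stop
        if value is None:
            rows.pop(key, None)
        else:
            rows[key] = value
    return rows


print(restore_to(snapshot, changes, target_ts=250))  # state just before the deletion
print(restore_to(snapshot, changes, target_ts=300))  # state including the deletion
```

Since nothing is ever overwritten, the same backup can be replayed to a moment just before a bad change, which is exactly what a recovery from an accidental deletion needs.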
Insight 6: Customers share infrastructure instead of being isolated
In a fully isolated setup, every customer gets its own slice of infra, right from the DB to the brokers to the servers. But at Atlassian, multiple customers share the same infra components, which is the usual pooled form of multi-tenancy. Companies typically do this to cut down on their infrastructure cost.
Why is it taking a long time to restore?
Because the data of multiple customers resides in the same database, when the DB was backed up the data (rows and tables) was backed up as is, which means each backup also contains data from multiple customers.
Now, to restore the intermingled rows of one customer, the entire backup needs to be loaded into a database, and then the rows of that specific customer need to be picked out and restored. This process is extremely time-consuming.
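The restore path described above can be sketched with two in-memory SQLite databases. The issues table, the customer_id column, and the restore_customer helper are assumptions made for illustration; the real schema and tooling are not public to me.

```python
import sqlite3


def make_db(rows):
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE issues (id INTEGER PRIMARY KEY, customer_id TEXT, title TEXT)")
    db.executemany("INSERT INTO issues VALUES (?, ?, ?)", rows)
    return db


# Hypothetical shared backup: rows of several customers are intermingled.
backup_rows = [
    (1, "acme", "Fix login"),
    (2, "globex", "Upgrade DB"),
    (3, "acme", "Add SSO"),
    (4, "initech", "Ship v2"),
]

# Production after the incident: acme's rows are gone, the others are intact.
production = make_db([row for row in backup_rows if row[1] != "acme"])


def restore_customer(backup_rows, production, customer_id):
    # Step 1: the whole backup has to be loaded, even though one customer is needed.
    scratch = make_db(backup_rows)
    # Step 2: pick out only the affected customer's rows.
    rows = scratch.execute(
        "SELECT id, customer_id, title FROM issues WHERE customer_id = ?",
        (customer_id,),
    ).fetchall()
    # Step 3: copy them into production, leaving every other customer untouched.
    production.executemany("INSERT INTO issues VALUES (?, ?, ?)", rows)
    scratch.close()
    return len(rows)


restored = restore_customer(backup_rows, production, "acme")
print("rows restored for acme:", restored)
print(production.execute("SELECT customer_id, title FROM issues ORDER BY id").fetchall())
```

In this toy version step 1 is instant, but with a real backup it means loading data for every customer on that database just to extract a few, and the three steps repeat for each affected customer and each table they touch.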
Outage Report: https://www.atlassian.com/engineering/april-2022-outage-update
Thank you so much for reading. If you found this helpful, do spread the word about it on social media; it would mean the world to me.
Yours truly,
Arpit
Until next time, stay awesome :)