How Canva Migrated To DynamoDB To Scale To 220 Million Users

Managing a rapidly growing media platform is no easy feat, especially when you support 100 million monthly active users who upload 50 million media files every day.

This is the challenge Canva faced a few years ago, and they needed to solve it without degrading the experience of their growing user base.

Here’s the story of how Canva transitioned from MySQL to DynamoDB and the lessons they learned along the way.

How Canva used to store their media

When Canva launched back in 2013, they stored their data on an Amazon RDS MySQL database due to its simplicity.

Their microservices architecture supported operations on users, documents, folders, and media. SQL was the obvious choice since all of this data was relational.

As they started scaling, Canva increased database instance sizes and even replicated the database to multiple instances to scale out.

In Canva's own words: "Initially, we scaled the database vertically by using larger instances, and later horizontally, introducing eventually consistent reads to some services powered by MySQL read replicas." [1]

However, Canva’s user base and media uploads grew exponentially, and some cracks began to show:

  • Slow schema changes: Some large operations took days, blocking product releases in the meantime.
  • Storage limits: They hit the 16TB limit of RDS's EBS volumes and the 2TB ext3 file size limit, which caps the size of a single table.
  • Performance issues: They encountered higher I/O latency, replication bottlenecks, and downtime when upgrading their infrastructure.

Adding miles to SQL

In 2017, the number of media assets on Canva approached 1 billion and was increasing exponentially.

This forced Canva to explore and find migration solutions that would let them continue to scale beyond that.

To buy time, Canva took some steps to optimize its SQL database. These included:

  • Migrating media content metadata (the most commonly modified elements) into a JSON column.
  • Denormalizing some tables to reduce lock contention and joins (joins add a lot of extra latency).
  • Removing foreign key constraints.
  • Changing the way they imported media to reduce the number of metadata updates.

While these changes relieved some of the pressure, they introduced more complexity and did not solve the growing scalability challenge.
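To make the JSON column idea from the list above concrete, here is a minimal sketch. It is not Canva's schema: the `media` table, its columns, and the helper functions are hypothetical, and SQLite stands in for MySQL so the example runs anywhere. The point it illustrates is that new metadata attributes can be added without a schema change.

```python
import json
import sqlite3

# Hypothetical table: fixed columns for identity, plus one JSON text column
# holding the metadata fields that change often. SQLite stands in for MySQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE media (id TEXT PRIMARY KEY, owner_id TEXT, metadata TEXT)")


def put_media(media_id: str, owner_id: str, metadata: dict) -> None:
    """Insert or overwrite a media row, serializing metadata as JSON."""
    conn.execute(
        "INSERT OR REPLACE INTO media (id, owner_id, metadata) VALUES (?, ?, ?)",
        (media_id, owner_id, json.dumps(metadata)),
    )


def update_metadata(media_id: str, changes: dict) -> dict:
    """Merge changes into the JSON document; new keys need no ALTER TABLE."""
    row = conn.execute("SELECT metadata FROM media WHERE id = ?", (media_id,)).fetchone()
    if row is None:
        raise KeyError(media_id)
    merged = {**json.loads(row[0]), **changes}
    conn.execute("UPDATE media SET metadata = ? WHERE id = ?", (json.dumps(merged), media_id))
    return merged


put_media("m1", "u1", {"title": "Logo", "width": 512})
# "palette" is a brand-new attribute, added without touching the table schema.
update_metadata("m1", {"palette": ["#00c4cc"], "width": 1024})
```

The trade-off is that the database no longer validates or indexes those fields for you, which is part of the extra complexity mentioned above.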

Why move to DynamoDB?

As Canva’s media approached 1 billion assets, they chose DynamoDB because it was a managed service and one they had already prototyped with.

Their choice was also influenced by the need for a migration strategy with no impact on users and a zero-downtime cutover.

The following table is Canva’s comparison of different databases in the decision stage:

Migrating to DynamoDB

To migrate their data, Canva used a dual-write approach. They started by writing new media to both MySQL and DynamoDB.

They used Amazon SQS queues to handle updates and enable eventual consistency while prioritizing critical write operations.

They set up a worker instance that processed messages from SQS, read the current state from the MySQL database, and updated DynamoDB with that data. This allowed messages to be retried if message processing paused or slowed down.
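Here is a rough sketch of that replication loop. It is an illustration under assumptions, not Canva's code: plain dictionaries stand in for MySQL and DynamoDB, a `deque` stands in for the SQS queue, and the function names are made up. The key idea is that a message carries only the media id, so a retried message always copies the latest MySQL state.

```python
from collections import deque

mysql = {}       # stand-in for the MySQL source of truth
dynamo = {}      # stand-in for the DynamoDB table
queue = deque()  # stand-in for an SQS queue of media ids


def write_media(media_id: str, item: dict) -> None:
    """Write to MySQL first, then enqueue a message so the change is replicated."""
    mysql[media_id] = dict(item)
    queue.append(media_id)


def process_one(fail: bool = False) -> bool:
    """Handle one message: read the current MySQL state and copy it to DynamoDB.

    `fail` simulates a processing error. In that case the message goes back
    on the queue, much like SQS redelivering it after the visibility timeout.
    """
    if not queue:
        return False
    media_id = queue.popleft()
    if fail:
        queue.append(media_id)
        return False
    dynamo[media_id] = dict(mysql[media_id])
    return True


write_media("m1", {"title": "Logo"})
process_one(fail=True)                    # first attempt fails, message is retried
write_media("m1", {"title": "Logo v2"})   # a later update arrives in the meantime
while process_one():                      # drain the queue
    pass
```

Because the worker re-reads MySQL on every attempt, processing the same message twice is harmless: the second pass simply rewrites the same state.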

Additionally, Canva used a priority system to write data to DynamoDB.

So that they could serve eventually consistent reads from DynamoDB, they prioritized replicating writes over reads. Creates and updates were placed on high-priority queues, while reads were placed on a low-priority queue.

Worker instances read from the high-priority queue first and, once it was drained, read from the low-priority queue.
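The priority scheme can be sketched like this. Again, this is a simplified illustration: two in-memory `deque` objects stand in for the SQS queues, and `enqueue` and `next_message` are hypothetical helpers.

```python
from collections import deque

high_priority = deque()  # replication messages for creates and updates
low_priority = deque()   # replication messages triggered by reads


def enqueue(media_id: str, operation: str) -> None:
    """Route a message to a queue based on the operation that produced it."""
    target = high_priority if operation in ("create", "update") else low_priority
    target.append((operation, media_id))


def next_message():
    """Return the next message, draining the high-priority queue first."""
    if high_priority:
        return high_priority.popleft()
    if low_priority:
        return low_priority.popleft()
    return None


enqueue("m1", "read")
enqueue("m2", "create")
enqueue("m3", "update")

order = []
while (message := next_message()) is not None:
    order.append(message)
```

Even though the read for `m1` was enqueued first, it is processed last, after the create and the update.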

Finally, to test the migration, Canva implemented dual reads to compare results between MySQL and DynamoDB.

This allowed them to catch bugs early and address them.

They were then able to serve eventually consistent reads from DynamoDB, with a fallback to MySQL for the files that hadn’t replicated yet.
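The two read phases described above might look something like the sketch below. The dictionaries and function names are assumptions for illustration. `shadow_read` keeps serving from MySQL while comparing against DynamoDB, and `read_with_fallback` serves from DynamoDB with MySQL as the safety net.

```python
mysql = {"m1": {"title": "Logo"}, "m2": {"title": "Banner"}, "m3": {"title": "Icon"}}
dynamo = {"m1": {"title": "Logo"}, "m3": {"title": "Icno"}}  # m2 missing, m3 diverged
mismatches = []


def shadow_read(media_id):
    """Testing phase: serve from MySQL, read DynamoDB too, and record differences."""
    primary = mysql.get(media_id)
    shadow = dynamo.get(media_id)
    if shadow is not None and shadow != primary:
        mismatches.append(media_id)
    return primary


def read_with_fallback(media_id):
    """Later phase: serve from DynamoDB, fall back to MySQL if not replicated yet."""
    item = dynamo.get(media_id)
    if item is None:
        item = mysql.get(media_id)
    return item


for media_id in ("m1", "m2", "m3"):
    shadow_read(media_id)
```

In this sketch an item that has not replicated yet is not counted as a mismatch, since that case is expected during the migration; only items present in both stores with different contents are flagged.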

Switching Writes To DynamoDB

Switching all writes to DynamoDB was the riskiest part of the process. [1]

Switching writes to DynamoDB required Canva to change its code to handle the new create and update requests, which included using DynamoDB transactions and conditional writes.
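To show what a conditional write protects against, here is an in-memory sketch. The `version` attribute and the helper names are assumptions, not details from Canva's implementation. In DynamoDB itself, checks like these are expressed with a condition expression on the write request (for example `attribute_not_exists(...)` for a create).

```python
class ConditionFailed(Exception):
    """Raised when the stored item does not satisfy the write condition."""


table = {}  # stand-in for a DynamoDB table keyed by media id


def create_media(media_id, attributes):
    """Create the item only if the id is not already present."""
    if media_id in table:
        raise ConditionFailed(f"{media_id} already exists")
    table[media_id] = {**attributes, "version": 1}


def update_media(media_id, changes, expected_version):
    """Apply changes only if nobody else updated the item since it was read."""
    current = table.get(media_id)
    if current is None or current["version"] != expected_version:
        raise ConditionFailed(f"{media_id} is not at version {expected_version}")
    table[media_id] = {**current, **changes, "version": expected_version + 1}


create_media("m1", {"title": "Logo"})
update_media("m1", {"title": "Logo v2"}, expected_version=1)
```

A second writer that still holds version 1 would now get `ConditionFailed` instead of silently overwriting the newer title.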

To mitigate this risk, Canva used a few strategies:

  • Updated integration tests to validate media updates on both MySQL and DynamoDB.
  • Transitioned all integration tests to DynamoDB, running them alongside MySQL tests.
  • Conducted local and end-to-end testing of the new implementation.
  • Created a cutover runbook with flags for quick rollback to MySQL if needed.
  • Rehearsed the cutover process in development and staging environments.
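The rollback flag from the list above can be pictured with a small sketch. The flag name, its values, and the in-memory stores are all hypothetical, and a real rollback would also need to reconcile any data written to DynamoDB during the cutover window.

```python
flags = {"media_writes": "mysql"}  # hypothetical runtime flag: "mysql" or "dynamodb"

mysql = {}   # stand-in for the MySQL database
dynamo = {}  # stand-in for the DynamoDB table


def write_media(media_id, item):
    """Route the write to whichever store the flag currently selects."""
    store = dynamo if flags["media_writes"] == "dynamodb" else mysql
    store[media_id] = dict(item)
    return flags["media_writes"]


write_media("m1", {"title": "Logo"})    # before cutover: goes to MySQL
flags["media_writes"] = "dynamodb"      # cutover step from the runbook
write_media("m2", {"title": "Banner"})  # after cutover: goes to DynamoDB
flags["media_writes"] = "mysql"         # rollback is a flag flip, not a deploy
write_media("m3", {"title": "Icon"})
```

Keeping the switch in a flag means the rehearsals in development and staging exercise exactly the same code path that production will use.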

The cutover in production was seamless, with no downtime or errors, and it brought significant improvements in media service latency.

Here’s a diagram that displays this latency improvement:

Lessons Learned

Some lessons Canva learned from this migration journey:

  • Understand your access patterns: Optimize the migration by focusing on frequently accessed data.
  • Test in production: Production data reveals edge cases that test environments miss.
  • Do it live: Gather as much information upfront as possible by migrating live, identifying bugs early, and learning the new technology.

So was DynamoDB the right choice?

Canva’s monthly active users have tripled since migrating to DynamoDB. The fully managed database has scaled reliably and costs less than the RDS clusters that it replaced.

Canva did lose the ability to perform ad-hoc queries and simple schema changes. Instead, they use CDC (change data capture) for data warehousing and rely on composite GSIs (global secondary indexes) to support additional access patterns.

Conclusion

Canva’s migration to DynamoDB was a game-changer and enabled them to scale to 220 million users while reducing costs and improving performance.

Despite some tradeoffs, DynamoDB’s scalability and reliability have proven to be invaluable in supporting Canva’s incredible and continued growth.

If you are curious to learn more, Canva’s CTO, Brendan Humphreys, spoke about this migration journey at the AWS re:Invent 2024 conference, during Werner Vogels’ keynote talk:

https://youtu.be/aim5x73crbM


My name is Uriel Bitton and I hope you learned something in this edition of The Serverless Spotlight.


