Database reliability - Migrating Terabyte from self-hosted MySQL to GCP CloudSQL

Chinmay Naik

Founder and CEO at One2N | Building Cloud Native Solutions | Is your business scaling faster than tech can handle? DM me.

Published: November 8, 2023

You're a lead SRE, and the CTO asks you to manage and scale a self-managed 6-node MySQL cluster with 1.5+ TB of data in production.

You do what it takes, a few months pass, but now, it's time to move to a managed service.

You think this should be straightforward, but it's not so easy.

Context

For context, the DB receives 25k reads and 8k writes per second during peak traffic. It's an OLTP database with over 200 tables. This is the main database for the monolithic app. So far, the team has been self-managing it, but we got good cloud credits, so let's move to GCP.

You list down some requirements:

This is a transactional datastore, so downtime has to be minimal
Data consistency and integrity must be maintained
Org's SLAs have to be met during the migration period
There should be a rollback strategy in case things go wrong

Existing setup

Here's the existing setup.

Applications running on Kubernetes connect to ProxySQL. ProxySQL, in turn, splits the read and write traffic based on query rules and weightage configured. The underlying MySQL Primary handles all write traffic, and the Replicas handle read traffic.
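To make the read/write split concrete, here is a minimal sketch of rule-based routing with weighted replicas. It is not ProxySQL's implementation or configuration. The host names, weights, and the two regex rules are assumptions made up for illustration.

```python
import random
import re

# Hypothetical backend names and weights, for illustration only.
PRIMARY = "mysql-primary"
REPLICAS = {"mysql-replica-1": 3, "mysql-replica-2": 1}

# Rule 1: statements starting with SELECT are reads.
READ_ONLY = re.compile(r"^\s*SELECT\b", re.IGNORECASE)
# Rule 2: locking reads (SELECT ... FOR UPDATE) still belong on the primary.
LOCKING_READ = re.compile(r"\bFOR\s+UPDATE\b", re.IGNORECASE)


def route(query, rng=random):
    """Return the name of the backend that should serve this query."""
    if READ_ONLY.match(query) and not LOCKING_READ.search(query):
        hosts = list(REPLICAS)
        weights = [REPLICAS[h] for h in hosts]
        # Weighted pick: a replica with weight 3 gets roughly 3x the reads
        # of a replica with weight 1.
        return rng.choices(hosts, weights=weights, k=1)[0]
    # Everything else (writes, DDL, locking reads) goes to the primary.
    return PRIMARY
```

The point of the sketch is that the application only ever talks to the proxy, so changing where reads and writes land is a proxy-side change rather than an application change.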

Migration Options and Trade-offs

To migrate this database to GCP's CloudSQL, you evaluate three approaches.

Point-in-time backup and restore
GCP Data Migration Service (DMS) with support for continuous replication
CloudSQL External Replication

You compare the pros and cons of each approach.

Point-in-time backup won't work since we can't stop ongoing writes and want minimum downtime.
DMS is easy to set up, uses native binlog replication, and has good monitoring support.
CloudSQL External replication had some prerequisites that our DB didn't meet. So, no.

Final approach

You decide to go ahead with DMS. However, DMS has some downsides:

DMS provisions the CloudSQL Primary instance in replica mode and not in high-availability mode. This means reads/writes will be blocked when promoting the instance as Primary.
DMS can't reach the source MySQL primary in another cloud provider even though you have set up VPC peering. To fix this, you set up some iptables NAT rules so that the DMS service can reach the source MySQL nodes.
DMS runs large SELECT * queries on the entire table to copy the data. To avoid DB performance issues, you provision an extra read replica that DMS can replicate from. This way, the source primary isn't overloaded with many parallel table scans.

With all the planning and testing done on the staging environment, you're ready for production migration. It takes 3 days to copy all data from source MySQL to CloudSQL via DMS. Finally, the replication lag is zero, and you're ready for the cutover.
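As a rough sanity check on those numbers, the average copy rate works out as below. This is back-of-the-envelope arithmetic only. It assumes decimal terabytes, exactly 72 hours, and the lower bound of 1.5 TB, none of which the story states precisely.

```python
DATA_BYTES = 1.5e12            # 1.5 TB as decimal terabytes (assumption)
COPY_SECONDS = 3 * 24 * 3600   # "3 days" taken as exactly 72 hours (assumption)


def average_copy_rate_mb_per_s(data_bytes=DATA_BYTES, seconds=COPY_SECONDS):
    """Average throughput of the initial copy, in MB/s."""
    return data_bytes / seconds / 1e6


print(f"{average_copy_rate_mb_per_s():.1f} MB/s")  # prints "5.8 MB/s"
```

An average this low suggests the copy is paced by table scans and replication catch-up rather than by raw network bandwidth, which fits the decision to copy from a dedicated read replica.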

So your cutover plan is:

Put the web app in Maintenance mode (during low traffic time). This will stop write traffic to DB.
Make the CloudSQL node Primary
Move traffic (via ProxySQL) to CloudSQL
Remove maintenance mode and allow all writes to GCP's CloudSQL
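The plan above can be sketched as a small runbook function. This is an illustrative sketch, not the tooling used in the story. Every callable passed in is a hypothetical hook that you would wire to your own maintenance toggle, lag check, promotion step, and ProxySQL change.

```python
from typing import Callable, List


def run_cutover(
    enable_maintenance: Callable[[], None],
    replication_lag_seconds: Callable[[], float],
    promote_target: Callable[[], None],
    switch_traffic: Callable[[], None],
    disable_maintenance: Callable[[], None],
    max_lag_checks: int = 10,
) -> List[str]:
    """Run the cutover steps in order and return the steps that completed."""
    steps = []

    # Step 1: stop writes so the target can fully catch up.
    enable_maintenance()
    steps.append("maintenance_on")

    # Guard: only promote once the target has applied everything.
    # A real runner would sleep between polls; omitted to keep this short.
    for _ in range(max_lag_checks):
        if replication_lag_seconds() == 0:
            break
    else:
        # Rollback path: nothing was promoted, so writes resume on the source.
        disable_maintenance()
        steps.append("aborted_maintenance_off")
        return steps

    # Steps 2-4: promote, repoint traffic, then allow writes again.
    promote_target()
    steps.append("promoted")
    switch_traffic()
    steps.append("traffic_switched")
    disable_maintenance()
    steps.append("maintenance_off")
    return steps
```

Maintenance mode is lifted only on a clean abort before promotion or after traffic has moved. Lifting it between promotion and the traffic switch would send writes to the old primary.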

So you execute this, and it works as expected (no surprises, which is a good thing). As an SRE, you sometimes doubt things more when they work without any problem. So you double-check the details and data checksums: all good.
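The story doesn't say how the checksums were compared, so the following is only one possible sketch. It assumes you have already collected a checksum per table from both sides (for example with MySQL's CHECKSUM TABLE statement) into two dictionaries. The function and table names are hypothetical.

```python
from typing import Dict, List


def find_mismatched_tables(source: Dict[str, int], target: Dict[str, int]) -> List[str]:
    """Return tables whose checksums differ, or that exist on only one side."""
    mismatched = []
    for table in sorted(set(source) | set(target)):
        # .get() returns None for a missing table, which never equals a
        # real checksum, so missing tables are reported as mismatches too.
        if source.get(table) != target.get(table):
            mismatched.append(table)
    return mismatched


# Hypothetical checksums for illustration.
source = {"orders": 111, "users": 222, "payments": 333}
target = {"orders": 111, "users": 999}
print(find_mismatched_tables(source, target))  # prints ['payments', 'users']
```

An empty result is the "all good" outcome. With over 200 tables, a per-table report like this points straight at the table to investigate.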

Your rigorous testing on staging paid off.

You monitor the whole system during high-traffic times the next day. There are frequent replica lag spikes, so you tweak some MySQL config settings, and these issues are solved. Thankfully, since you have managed self-hosted MySQL, you know what parameters to tune.

Total downtime (time spent in maintenance mode, not a complete outage) is just 10 minutes, and that during a low-traffic window.

The CTO and the rest of the business team are pretty happy with this migration. Now that you don't have to manage the uptime for DB, you take a week's vacation.

I write such stories on software engineering.

There's no specific frequency, as I don't make these up.

If you liked this one, you might love - https://www.dhirubhai.net/pulse/database-reliability-zero-downtime-schema-migrations-chinmay-naik-2xixf/

Follow me, Chinmay Naik, for more such stuff, straight from the production oven!

Oh, by the way, if you need help with database reliability and scaling, my DMs are open. We have worked at Terabyte scale when it comes to relational and non-relational databases.

This was one of the reasons I started One2N - to help growing orgs scale sustainably.
