(GitLab) CI Pipeline Tricks: Automating Aurora Serverless v2 Cluster Restorations
Ivan Vokhmin
Lead Engineer Frontend @ moebel.de Einrichten & Wohnen GmbH | AWS, Team Leadership, Software Architecture, AI
Introduction
During my work on a CMS project using AWS Aurora Serverless v2, I faced numerous challenges related to database migration and disaster recovery. One of the biggest hurdles was ensuring that our team members with varying levels of AWS expertise could reliably restore the database in case of emergencies or failed deployments. This led me to develop an automated solution using GitLab CI, which not only simplified the restoration process but also provided peace of mind for our entire team.
Restoring a snapshot of an Aurora Serverless v2 cluster (and of most other AWS RDS setups) involves multiple manual steps if you want to avoid downtime. In an enterprise setup, such backup/restore operations should be automated to cover cases where a database migration fails or important data is lost. Ideally, by pressing a button in CI.
This article presents a solution that allows even readers with limited AWS knowledge to automate no-downtime restoration from snapshots using GitLab CI. Although the initial setup may require some expertise, once configured, the automated pipeline can be leveraged by anyone in the team, significantly reducing the complexity and time required for disaster recovery.
Our setup:
- An auto-scaling ECS service (the CMS) that uses an Aurora Serverless v2 database, with Cloudflare in front of it for heavy lifting
- An Aurora Serverless v2 database as data storage with infrequent access (Cloudflare keeps the load small)
- SSM Parameter Store to parametrize ECS tasks on deployment
- GitLab CI with code and deployment pipelines
We previously had two problems: it was hard for people with little database knowledge to reliably restore clusters, and we had to make sure snapshots existed for all required points in time. Hence, I designed a solution that:
1) Takes a snapshot prior to any database migration
2) Restores that snapshot on demand (pipeline button press)
3) Takes a snapshot after a successful deployment to create a production restore point
4) Provides a "revert to tag" button that reverts the database to its state at tag deployment time (for DIRE situations)
With these buttons, our developers with less AWS knowledge can reliably back up and restore our CMS database.
Solution Overview
The solution consists of two main scripts.
1) A backup script that creates a snapshot and waits for it to become available:
#!/bin/bash
set -e
if [ "$#" -ne 2 ]; then
  echo "Usage: $0 <snapshot-id> <aurora-cluster-arn>"
  exit 1
fi
SNAPSHOT_ID=$1
CLUSTER_ARN=$2
# Extract the cluster name from the ARN (everything after ':cluster:')
CLUSTER_NAME=$(echo "$CLUSTER_ARN" | sed 's/.*:cluster://')
echo "Preparing to snapshot $SNAPSHOT_ID $CLUSTER_ARN $CLUSTER_NAME"
# Create the snapshot
aws rds create-db-cluster-snapshot \
  --db-cluster-snapshot-identifier "$SNAPSHOT_ID" \
  --db-cluster-identifier "$CLUSTER_NAME"
# Wait for the snapshot to be available. With `set -e`, a `$?` check after
# the wait would never reach the failure branch, so test the command directly.
echo "Waiting for snapshot $SNAPSHOT_ID to be available..."
if aws rds wait db-cluster-snapshot-available \
  --db-cluster-snapshot-identifier "$SNAPSHOT_ID" \
  --db-cluster-identifier "$CLUSTER_NAME"; then
  echo "Snapshot $SNAPSHOT_ID created successfully for $CLUSTER_ARN."
else
  echo "Failed to create snapshot $SNAPSHOT_ID for $CLUSTER_ARN."
  exit 1
fi
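Assuming the script is saved as backup-cluster.sh (the file name is my choice here, adjust to your repository), a pre-migration call from a CI job could look like this:
# The GitLab-provided $CI_PIPELINE_ID makes a convenient unique snapshot ID
./backup-cluster.sh "pre-migration-$CI_PIPELINE_ID" "$AURORA_CLUSTER_ARN"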
2) A restore-from-snapshot script that does the following:
* Restores the provided snapshot to a new cluster
* Waits for the cluster to become available
* Populates the cluster with a serverless writer instance with the configured capacity
* Waits for the instance to become available
* Updates the Parameter Store with the new connection string
* Restarts the ECS service to use the new parameter (no downtime!)
* Waits for the service to stabilize
* Deletes the old Aurora cluster (taking a final snapshot for debugging purposes)
#!/bin/bash
set -e
# Note: Omitting parameter passing here, it could use SSM or args (adjust to your setup)
# ...
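# A minimal sketch of the omitted part, assuming the snapshot ID arrives as
# the first argument and the rest is read from SSM (all parameter names here
# are hypothetical):
SNAPSHOT_ID=$1
AURORA_CLUSTER_NAME=$(aws ssm get-parameter --name "/cms/$ENVIRONMENT/cluster-name" --query 'Parameter.Value' --output text)
CURRENT_CLUSTER_ARN=$(aws ssm get-parameter --name "$CURRENT_CLUSTER_ARN_SSM" --query 'Parameter.Value' --output text)
CURRENT_CLUSTER_CONNECTION_STRING=$(aws ssm get-parameter --name "$CURRENT_CLUSTER_CONNECTION_STRING_SSM" --query 'Parameter.Value' --output text)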
OLD_CLUSTER_ENDPOINT=$(aws rds describe-db-clusters \
  --db-cluster-identifier "$AURORA_CLUSTER_NAME" \
  --query 'DBClusters[0].Endpoint' \
  --output text)
# Generate new cluster identifier
NEW_AURORA_CLUSTER_NAME="cms-${ENVIRONMENT}-${MARKET}-${SNAPSHOT_ID}-$(date +%s)"
# ../serverless-v2-scaling-configuration_$ENVIRONMENT.txt contains the DB scaling config, like "MinCapacity=0.0,MaxCapacity=5.0"
SERVERLESS_CONFIGURATION=$(cat "../serverless-v2-scaling-configuration_$ENVIRONMENT.txt")
# Disable the AWS CLI pager; export it so the aws child processes see it
export AWS_PAGER=""
# Restore the snapshot to a new Serverless v2 Aurora cluster.
# Think about whether you need --publicly-accessible; a private cluster
# works too.
aws rds restore-db-cluster-from-snapshot \
  --db-cluster-identifier "$NEW_AURORA_CLUSTER_NAME" \
  --snapshot-identifier "$SNAPSHOT_ID" \
  --engine aurora-postgresql \
  --publicly-accessible \
  --serverless-v2-scaling-configuration "$SERVERLESS_CONFIGURATION" \
  --vpc-security-group-ids "$SECURITY_GROUP_ID" \
  --db-subnet-group-name "$SUBNET_GROUP_NAME" \
  --cli-read-timeout 0 \
  --cli-connect-timeout 0 \
  --no-paginate
# Wait for the new cluster to become available
echo "Waiting for cluster to settle..."
sleep 10
aws rds wait db-cluster-available --db-cluster-identifier "$NEW_AURORA_CLUSTER_NAME"
# Add a Serverless v2 instance to the cluster. Yes, you read it right,
# a serverless instance (instance class db.serverless).
# Align --publicly-accessible with the cluster setting above.
NEW_INSTANCE_NAME="i-${ENVIRONMENT}-${MARKET}-$(date +%s)"
aws rds create-db-instance \
  --db-cluster-identifier "$NEW_AURORA_CLUSTER_NAME" \
  --db-instance-identifier "$NEW_INSTANCE_NAME" \
  --db-instance-class db.serverless \
  --engine aurora-postgresql \
  --publicly-accessible \
  --no-paginate
sleep 10
aws rds wait db-instance-available --db-instance-identifier "$NEW_INSTANCE_NAME"
# Get the new cluster's endpoint
NEW_CLUSTER_ENDPOINT=$(aws rds describe-db-clusters \
  --db-cluster-identifier "$NEW_AURORA_CLUSTER_NAME" \
  --query 'DBClusters[0].Endpoint' \
  --output text)
# Swap the old endpoint and cluster name for the new ones
NEW_CLUSTER_CONNECTION_STRING=$(echo "$CURRENT_CLUSTER_CONNECTION_STRING" | sed "s/$OLD_CLUSTER_ENDPOINT/$NEW_CLUSTER_ENDPOINT/")
NEW_CLUSTER_ARN=$(echo "$CURRENT_CLUSTER_ARN" | sed "s/$AURORA_CLUSTER_NAME/$NEW_AURORA_CLUSTER_NAME/")
# Point the Parameter Store at the new Aurora cluster
aws ssm put-parameter \
  --name "$CURRENT_CLUSTER_ARN_SSM" \
  --value "$NEW_CLUSTER_ARN" \
  --type String \
  --overwrite
# Update the parameter store with the new cluster endpoint (may be needed for later reverts)
aws ssm put-parameter \
  --name "$CURRENT_CLUSTER_CONNECTION_STRING_SSM" \
  --value "$NEW_CLUSTER_CONNECTION_STRING" \
  --type String \
  --overwrite
# Here you restart your services, I omitted this part as your setup may differ.
# ....
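# A minimal sketch for a setup where ECS tasks read the connection string
# from SSM on startup: force a new deployment so fresh tasks pick up the
# updated parameter while the old tasks keep serving traffic (no downtime).
aws ecs update-service \
  --cluster "$ECS_CLUSTER_ARN" \
  --service "$ECS_SERVICE_NAME" \
  --force-new-deployment \
  --no-paginate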
echo "Waiting for ECS service to stabilize..."
aws ecs wait services-stable \
--cluster $ECS_CLUSTER_ARN \
--services $ECS_SERVICE_NAME
echo "Getting instances to delete in cluster $AURORA_CLUSTER_NAME..."
INSTANCES_TO_DELETE=$(aws rds describe-db-instances \
--filters "Name=db-cluster-id,Values=$AURORA_CLUSTER_NAME" \
--query 'DBInstances[*].DBInstanceIdentifier' \
--output text)
for instance in $INSTANCES_TO_DELETE; do
echo "Deleting instance $instance of $AURORA_CLUSTER_NAME"
aws rds delete-db-instance \
--db-instance-identifier "$instance" \
--skip-final-snapshot \
--no-paginate
echo "Waiting for instance deletion..."
aws rds wait db-instance-deleted --db-instance-identifier "$instance"
done
LAST_SNAPSHOT_ID="$AURORA_CLUSTER_NAME-final-$(date +%s)"
echo "Deleting old Aurora cluster: $AURORA_CLUSTER_NAME, final snapshot saved to $LAST_SNAPSHOT_ID"
aws rds delete-db-cluster \
  --db-cluster-identifier "$AURORA_CLUSTER_NAME" \
  --no-skip-final-snapshot \
  --final-db-snapshot-identifier "$LAST_SNAPSHOT_ID" \
  --delete-automated-backups \
  --no-paginate
echo "Aurora cluster restored and ECS service restarted successfully."
These scripts are used at the following points in the GitLab CI pipelines (a sketch of the wiring follows the list):
* Pre-migration backups, using the pipeline ID as the snapshot ID
* A manual restoration button in the pipeline that allows recovery from a failed migration
* A post-deployment backup, to allow reverting to tag deployment time
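Here is a minimal .gitlab-ci.yml sketch of that wiring; the job names, stages and script paths are my own, so adjust them to your pipeline:
backup-before-migration:
  stage: migrate
  script:
    - ./backup-cluster.sh "pre-migration-$CI_PIPELINE_ID" "$AURORA_CLUSTER_ARN"
restore-failed-migration:
  stage: migrate
  when: manual                    # the restoration "button"
  resource_group: aurora-restore  # no parallel restorations (see final thoughts)
  script:
    - ./restore-cluster.sh "pre-migration-$CI_PIPELINE_ID"
backup-after-deployment:
  stage: deploy
  script:
    - ./backup-cluster.sh "post-deploy-$CI_COMMIT_TAG" "$AURORA_CLUSTER_ARN"
revert-to-tag:
  stage: deploy
  when: manual                    # the "revert to tag" button, for DIRE situations
  resource_group: aurora-restore
  script:
    - ./restore-cluster.sh "post-deploy-$CI_COMMIT_TAG"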
Some final thoughts
1) Don't forget to add a job to clean up old snapshots; manual snapshots are retained indefinitely (a sketch follows this list)
2) You may want to use GitLab resource groups to avoid parallel restorations of the same cluster (which can have bad consequences)
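For the cleanup job, here is a minimal sketch assuming a 30-day retention and GNU date, as found in typical CI images (both are my assumptions, tune them to your needs):
#!/bin/bash
set -e
# ISO 8601 timestamps compare correctly as plain strings
CUTOFF=$(date -u -d '30 days ago' +%Y-%m-%dT%H:%M:%S)
aws rds describe-db-cluster-snapshots \
  --db-cluster-identifier "$AURORA_CLUSTER_NAME" \
  --snapshot-type manual \
  --query 'DBClusterSnapshots[*].[DBClusterSnapshotIdentifier,SnapshotCreateTime]' \
  --output text |
while read -r snapshot created; do
  if [[ "$created" < "$CUTOFF" ]]; then
    echo "Deleting old snapshot $snapshot (created $created)"
    aws rds delete-db-cluster-snapshot \
      --db-cluster-snapshot-identifier "$snapshot"
  fi
done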
Conclusion
Having reliable backup and restoration processes in place is crucial for any enterprise setup. I encourage you to consider implementing similar solutions in your own projects and customize them according to your specific needs. If you have any questions or would like to share your experiences with automated database recovery, feel free to contact me via LinkedIn.
Thank you for reading, and happy restoring!