(GitLab) CI Pipeline Tricks: Automating Aurora Serverless v2 Cluster Restorations
Ivan Vokhmin
Lead Engineer Frontend @ moebel.de Einrichten & Wohnen GmbH | AWS, Team Leadership, Software Architecture, AI
Introduction
During my work on a CMS project using AWS Aurora Serverless v2, I faced numerous challenges related to database migration and disaster recovery. One of the biggest hurdles was ensuring that our team members with varying levels of AWS expertise could reliably restore the database in case of emergencies or failed deployments. This led me to develop an automated solution using GitLab CI, which not only simplified the restoration process but also provided peace of mind for our entire team.
Restoring a snapshot of an Aurora Serverless v2 cluster (and of most other AWS RDS setups) involves multiple manual steps if you want to avoid downtime. In an enterprise setup, such backup/restore operations should be automated to cover cases where a database migration fails or important data is lost. Ideally, by pressing a button in CI.
This article presents a solution that allows even readers with limited AWS knowledge to automate no-downtime restoration from snapshots using GitLab CI. Although the initial setup may require some expertise, once configured, the automated pipeline can be leveraged by anyone in the team, significantly reducing the complexity and time required for disaster recovery.
Our setup:
- An auto-scaling ECS service (the CMS) that uses an Aurora Serverless v2 database, with Cloudflare in front of it for heavy lifting
- An Aurora Serverless v2 database as data storage with infrequent access (Cloudflare keeps the load small)
- SSM Parameter Store to parametrize ECS tasks on deployment
- GitLab CI with code and deployment pipelines
We previously had two problems: it was hard for people with little database knowledge to reliably restore clusters, and we had to make sure snapshots existed for all required points in time. Hence, I designed a solution that:
1) Takes a snapshot prior to any database migration
2) Restores that snapshot on demand (pipeline button press)
3) Takes a snapshot after a successful deployment to create a production restore point
4) Provides a "revert to tag" button that reverts the database to its state at tag deployment time (for DIRE situations)
With these buttons, our developers with less AWS knowledge can reliably back up and restore our CMS database.
Solution Overview
The solution consists of two main scripts.
1) A backup script that creates a snapshot and waits for it to become available:
#!/bin/bash
set -e
if [ "$#" -ne 2 ]; then
  echo "Usage: $0 <snapshot-id> <aurora-cluster-arn>"
  exit 1
fi
SNAPSHOT_ID=$1
CLUSTER_ARN=$2
# Extract the cluster name from the ARN (everything after ':cluster:')
CLUSTER_NAME=$(echo "$CLUSTER_ARN" | sed 's/.*:cluster://')
echo "Preparing to snapshot $SNAPSHOT_ID $CLUSTER_ARN $CLUSTER_NAME"
# Create the snapshot
aws rds create-db-cluster-snapshot \
  --db-cluster-snapshot-identifier "$SNAPSHOT_ID" \
  --db-cluster-identifier "$CLUSTER_NAME"
# Wait for the snapshot to be available. With `set -e`, a `$?` check after
# the wait would never reach the failure branch, so test the command directly.
echo "Waiting for snapshot $SNAPSHOT_ID to be available..."
if aws rds wait db-cluster-snapshot-available \
  --db-cluster-snapshot-identifier "$SNAPSHOT_ID" \
  --db-cluster-identifier "$CLUSTER_NAME"; then
  echo "Snapshot $SNAPSHOT_ID created successfully for $CLUSTER_ARN."
else
  echo "Failed to create snapshot $SNAPSHOT_ID for $CLUSTER_ARN."
  exit 1
fi
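Assuming the script is saved as backup-cluster.sh (the file name is my choice here, adjust to your repository), a pre-migration call from a CI job could look like this:
# The GitLab-provided $CI_PIPELINE_ID makes a convenient unique snapshot ID
./backup-cluster.sh "pre-migration-$CI_PIPELINE_ID" "$AURORA_CLUSTER_ARN"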
2) A restore-from-snapshot script that does the following:
* Restores the provided snapshot to a new cluster
* Waits for the cluster to become available
* Populates the cluster with a serverless writer instance with the configured capacity
* Waits for the instance to become available
* Updates the Parameter Store with the new connection string
* Restarts the ECS service to use the new parameter (no downtime!)
* Waits for the service to stabilize
* Deletes the old Aurora cluster (taking a final snapshot for debugging purposes)
#!/bin/bash
set -e
# Note: Omitting parameter passing here, it could use SSM or args (adjust to your setup)
# ...
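# A minimal sketch of the omitted part, assuming the snapshot ID arrives as
# the first argument and the rest is read from SSM (all parameter names here
# are hypothetical):
SNAPSHOT_ID=$1
AURORA_CLUSTER_NAME=$(aws ssm get-parameter --name "/cms/$ENVIRONMENT/cluster-name" --query 'Parameter.Value' --output text)
CURRENT_CLUSTER_ARN=$(aws ssm get-parameter --name "$CURRENT_CLUSTER_ARN_SSM" --query 'Parameter.Value' --output text)
CURRENT_CLUSTER_CONNECTION_STRING=$(aws ssm get-parameter --name "$CURRENT_CLUSTER_CONNECTION_STRING_SSM" --query 'Parameter.Value' --output text)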
OLD_CLUSTER_ENDPOINT=$(aws rds describe-db-clusters \
  --db-cluster-identifier "$AURORA_CLUSTER_NAME" \
  --query 'DBClusters[0].Endpoint' \
  --output text)
# Generate new cluster identifier
NEW_AURORA_CLUSTER_NAME="cms-${ENVIRONMENT}-${MARKET}-${SNAPSHOT_ID}-$(date +%s)"
# ../serverless-v2-scaling-configuration_$ENVIRONMENT.txt contains the DB scaling config, like "MinCapacity=0.0,MaxCapacity=5.0"
SERVERLESS_CONFIGURATION=$(cat "../serverless-v2-scaling-configuration_$ENVIRONMENT.txt")
# Disable the AWS CLI pager; export it so the aws child processes see it
export AWS_PAGER=""
# Restore the snapshot to a new Serverless v2 Aurora cluster.
# Think about whether you need --publicly-accessible; a private cluster
# works too.
aws rds restore-db-cluster-from-snapshot \
  --db-cluster-identifier "$NEW_AURORA_CLUSTER_NAME" \
  --snapshot-identifier "$SNAPSHOT_ID" \
  --engine aurora-postgresql \
  --publicly-accessible \
  --serverless-v2-scaling-configuration "$SERVERLESS_CONFIGURATION" \
  --vpc-security-group-ids "$SECURITY_GROUP_ID" \
  --db-subnet-group-name "$SUBNET_GROUP_NAME" \
  --cli-read-timeout 0 \
  --cli-connect-timeout 0 \
  --no-paginate
# Wait for the new cluster to become available
echo "Waiting for cluster to settle..."
sleep 10
aws rds wait db-cluster-available --db-cluster-identifier "$NEW_AURORA_CLUSTER_NAME"
# Add a Serverless v2 instance to the cluster. Yes, you read it right,
# a serverless instance (instance class db.serverless).
# Align --publicly-accessible with the cluster setting above.
NEW_INSTANCE_NAME="i-${ENVIRONMENT}-${MARKET}-$(date +%s)"
aws rds create-db-instance \
  --db-cluster-identifier "$NEW_AURORA_CLUSTER_NAME" \
  --db-instance-identifier "$NEW_INSTANCE_NAME" \
  --db-instance-class db.serverless \
  --engine aurora-postgresql \
  --publicly-accessible \
  --no-paginate
sleep 10
aws rds wait db-instance-available --db-instance-identifier "$NEW_INSTANCE_NAME"
# Get the new cluster's endpoint
NEW_CLUSTER_ENDPOINT=$(aws rds describe-db-clusters \
  --db-cluster-identifier "$NEW_AURORA_CLUSTER_NAME" \
  --query 'DBClusters[0].Endpoint' \
  --output text)
# Swap the old endpoint and cluster name for the new ones
NEW_CLUSTER_CONNECTION_STRING=$(echo "$CURRENT_CLUSTER_CONNECTION_STRING" | sed "s/$OLD_CLUSTER_ENDPOINT/$NEW_CLUSTER_ENDPOINT/")
NEW_CLUSTER_ARN=$(echo "$CURRENT_CLUSTER_ARN" | sed "s/$AURORA_CLUSTER_NAME/$NEW_AURORA_CLUSTER_NAME/")
# Point the Parameter Store at the new Aurora cluster
aws ssm put-parameter \
  --name "$CURRENT_CLUSTER_ARN_SSM" \
  --value "$NEW_CLUSTER_ARN" \
  --type String \
  --overwrite
# Update the parameter store with the new cluster endpoint (may be needed for later reverts)
aws ssm put-parameter \
  --name "$CURRENT_CLUSTER_CONNECTION_STRING_SSM" \
  --value "$NEW_CLUSTER_CONNECTION_STRING" \
  --type String \
  --overwrite
# Here you restart your services, I omitted this part as your setup may differ.
# ....
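# A minimal sketch for a setup where ECS tasks read the connection string
# from SSM on startup: force a new deployment so fresh tasks pick up the
# updated parameter while the old tasks keep serving traffic (no downtime).
aws ecs update-service \
  --cluster "$ECS_CLUSTER_ARN" \
  --service "$ECS_SERVICE_NAME" \
  --force-new-deployment \
  --no-paginate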
echo "Waiting for ECS service to stabilize..."
aws ecs wait services-stable \
--cluster $ECS_CLUSTER_ARN \
--services $ECS_SERVICE_NAME
echo "Getting instances to delete in cluster $AURORA_CLUSTER_NAME..."
INSTANCES_TO_DELETE=$(aws rds describe-db-instances \
--filters "Name=db-cluster-id,Values=$AURORA_CLUSTER_NAME" \
--query 'DBInstances[*].DBInstanceIdentifier' \
--output text)
for instance in $INSTANCES_TO_DELETE; do
echo "Deleting instance $instance of $AURORA_CLUSTER_NAME"
aws rds delete-db-instance \
--db-instance-identifier "$instance" \
--skip-final-snapshot \
--no-paginate
echo "Waiting for instance deletion..."
aws rds wait db-instance-deleted --db-instance-identifier "$instance"
done
LAST_SNAPSHOT_ID="$AURORA_CLUSTER_NAME-final-$(date +%s)"
echo "Deleting old Aurora cluster: $AURORA_CLUSTER_NAME, final snapshot saved to $LAST_SNAPSHOT_ID"
aws rds delete-db-cluster \
  --db-cluster-identifier "$AURORA_CLUSTER_NAME" \
  --no-skip-final-snapshot \
  --final-db-snapshot-identifier "$LAST_SNAPSHOT_ID" \
  --delete-automated-backups \
  --no-paginate
echo "Aurora cluster restored and ECS service restarted successfully."
These scripts are used at the following points in the GitLab CI pipelines (a sketch of the wiring follows the list):
* Pre-migration backups, using the pipeline ID as the snapshot ID
* A manual restoration button in the pipeline that allows recovery from a failed migration
* A post-deployment backup, to allow reverting to tag deployment time
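Here is a minimal .gitlab-ci.yml sketch of that wiring; the job names, stages and script paths are my own, so adjust them to your pipeline:
backup-before-migration:
  stage: migrate
  script:
    - ./backup-cluster.sh "pre-migration-$CI_PIPELINE_ID" "$AURORA_CLUSTER_ARN"
restore-failed-migration:
  stage: migrate
  when: manual                    # the restoration "button"
  resource_group: aurora-restore  # no parallel restorations (see final thoughts)
  script:
    - ./restore-cluster.sh "pre-migration-$CI_PIPELINE_ID"
backup-after-deployment:
  stage: deploy
  script:
    - ./backup-cluster.sh "post-deploy-$CI_COMMIT_TAG" "$AURORA_CLUSTER_ARN"
revert-to-tag:
  stage: deploy
  when: manual                    # the "revert to tag" button, for DIRE situations
  resource_group: aurora-restore
  script:
    - ./restore-cluster.sh "post-deploy-$CI_COMMIT_TAG"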
Some final thoughts
1) Don't forget to add a job to clean up old snapshots; manual snapshots are retained indefinitely (a sketch follows this list)
2) You may want to use GitLab resource groups to avoid parallel restorations of the same cluster (which can have bad consequences)
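For the cleanup job, here is a minimal sketch assuming a 30-day retention and GNU date, as found in typical CI images (both are my assumptions, tune them to your needs):
#!/bin/bash
set -e
# ISO 8601 timestamps compare correctly as plain strings
CUTOFF=$(date -u -d '30 days ago' +%Y-%m-%dT%H:%M:%S)
aws rds describe-db-cluster-snapshots \
  --db-cluster-identifier "$AURORA_CLUSTER_NAME" \
  --snapshot-type manual \
  --query 'DBClusterSnapshots[*].[DBClusterSnapshotIdentifier,SnapshotCreateTime]' \
  --output text |
while read -r snapshot created; do
  if [[ "$created" < "$CUTOFF" ]]; then
    echo "Deleting old snapshot $snapshot (created $created)"
    aws rds delete-db-cluster-snapshot \
      --db-cluster-snapshot-identifier "$snapshot"
  fi
done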
Conclusion
Having reliable backup and restoration processes in place is crucial for any enterprise setup. I encourage you to consider implementing similar solutions in your own projects and customize them according to your specific needs. If you have any questions or would like to share your experiences with automated database recovery, feel free to contact me via LinkedIn.
Thank you for reading, and happy restoring!