Regression and Rebuild Testing Your Infrastructure As Code
Ahhh, regression testing: a by-product of software development, often thought about but rarely implemented! What if I told you it should be one of the first tests you implement as a Cloud Engineer writing Terraform?
I'm not a fan of "applies in every case" articles, so let's break it down into some of the benefits.
Terraform and IaC challenges
Circular dependencies
If, like me, you use Terraform, you'll know it's pretty good at working out what needs to go first in order to successfully apply your Infrastructure as Code (IaC). For example:
data "aws_iam_policy_document" "assume_role_policy_taskrole" {
? statement {
? ? sid ? ?= ""
? ? effect = "Allow"
? ? actions = [
? ? ? "sts:AssumeRole",
? ? ]
? ? principals {
? ? ? type ? ? ? ?= "Service"
? ? ? identifiers = ["ecs.amazonaws.com", "ecs-tasks.amazonaws.com"]
? ? }
? }
}
resource "aws_iam_role" "task_role" {
? name ? ? ? ? ? ? ? = "task-role"
? description ? ? ? ?= "The role the container running on ECS will use"
? assume_role_policy = data.aws_iam_policy_document.assume_role_policy_taskrole.json
? tags = merge(
? ? var.common_tags,
? ? {
? ? ? "Name" = task-role")
? ? },
? )
}
Now, Terraform knows it needs to process the data source before the aws_iam_role; it's clever like that! As we add more and more resources we tend to build these things like Tetris: write, plan, apply, repeat (I should put that on a t-shirt). Even within resources we add further options and references to other resources which have already been applied. For example:
Imagine a DynamoDB table acting as the backend for an ECS service. Typically an engineer might deploy the ECS and DynamoDB resources first, and then add the IAM roles that both resources will use to talk to each other.
Since the DynamoDB table and the ECS service were created first as part of the development process, Terraform will more than happily add IAM policies which reference each other. A terraform destroy of the infrastructure would then pass, but a subsequent terraform apply from scratch would fail. Why? Circular dependencies! Terraform is unable to apply because neither policy can be deployed without the other existing first.
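As a minimal, hypothetical sketch of how such a cycle sneaks in (the role names here are invented): each side looks the other up with a data source rather than a resource reference, so Terraform's dependency graph never sees the loop while the resources already exist.
data "aws_iam_role" "dynamo_side" {
  # Resolved fine during development, because the role already existed
  name = "dynamo-access-role"
}

resource "aws_iam_role" "ecs_side" {
  name = "ecs-access-role"

  # Trusts the role looked up above; the mirror-image definition elsewhere
  # has "dynamo-access-role" trusting "ecs-access-role" the same way
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { AWS = data.aws_iam_role.dynamo_side.arn }
    }]
  })
}

# On a greenfield apply both data lookups fail, because neither role
# exists yet for the other's lookup to find.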
This risk only increases when we move to using multiple state files. For example:
Imagine you are working on "Project", which reads in some outputs from the Auth Terraform state; while you develop, auth/terraform.tfstate is likely already deployed.
Much like the case above, your Terraform code will no doubt work, reading from auth/terraform.tfstate, until you head towards production, where you may learn that you have created a dependency on auth/ being deployed first!
It gets worse! What if, as you're developing, you happened to create a Route53 entry in Project that you need to work with in Auth? It seems fine for Auth to reference it; after all, the Terraform code applies when you run it. Yet as you get to production you realise you're stuck in a circular dependency you didn't know about! Auth cannot be deployed because it needs the Route53 resource created in Project, yet Project can't be deployed because it depends on Auth!
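For reference, the cross-state read that creates this hidden ordering looks something like the below; the bucket name and the output name are placeholders:
data "terraform_remote_state" "auth" {
  backend = "s3"

  config = {
    bucket = "my-terraform-state" # placeholder bucket name
    key    = "auth/terraform.tfstate"
    region = "eu-west-1"
  }
}

# Any use of an output, e.g. data.terraform_remote_state.auth.outputs.user_pool_id,
# silently requires the Auth state to have been deployed first.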
Quotas
Now, quotas are a funny one. If you develop anything like me, you'll run your Terraform in workspaces, which give us the ability to deploy entirely separate copies of the infrastructure with no reference to other workspaces. Neat, right? Well, it does have one drawback, particularly with AWS: sometimes you hit account quotas by running multiple copies of a service at once. For example:
In Cognito, AWS by default only allows 4 custom domains per account, which means at most 4 Terraform workspaces can each have their own custom domain!
Having one extra Terraform workspace has proven time and time again to hit the quota and surface issues we would face expanding our infrastructure!
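To illustrate (the domain, certificate, and user pool references below are assumed to exist elsewhere in the code), a per-workspace custom domain like this consumes one unit of that quota for every workspace:
resource "aws_cognito_user_pool_domain" "auth" {
  # One custom domain per workspace; the fifth workspace hits the
  # default account quota of 4 and the apply fails
  domain          = "auth-${terraform.workspace}.example.com"
  certificate_arn = aws_acm_certificate.auth.arn  # assumed defined elsewhere
  user_pool_id    = aws_cognito_user_pool.this.id # assumed defined elsewhere
}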
Disaster Recovery Testing
"The cloud would never go down" Ive said it, you've said it! I've even had conversations with Disaster Recovery Managers who want me to prove that Eu-West-1 will never lose all 3 Availability Zones in 1 go, apparently it's considered antagonistic to respond with "If we have lost London we are having bigger issues"!
Anyway, one of the things I have seen alot of companies struggle to do is big bang their infrastructure, everything is a carefully managed apply onto existing states to upgrade the existing infrastructure, what this doesn't prove is if we lost everything could we rebuild it from the Terraform. In theory yes but how many people have taken a new account and tried to deploy their entire IaC setup? Id hazard a guess not many and not often is probably the answer, why would you? AWS would never lose a region right.
Regression and Rebuild To The Rescue
How?
I'll give a more technical explanation at the end of the article, but the short version is: we rebuild our entire infrastructure every night, and on demand in our pull requests.
Why?
By big-bang, greenfield deploying our entire IaC from nothing, we can identify any circular dependencies: as the code goes from nothing to everything, every state reference is checked and every cross-referencing IAM permission is validated. Once the IaC is out we can then run our acceptance tests to make sure everything works as we anticipated. If it has all passed, we can comfortably say:
This gives us the confidence that we can progress our changes towards Production.
What next?
Well, I've always had this challenge: deploying Terraform to Production is often more an upgrade than a new deployment, and we don't always test for that. I'm going to work on a process of deploying a copy of Production (with representative data) and then applying the latest version of our Terraform over the top as an upgrade, to test that path. This should help tackle those "TERRAFORM WANTS TO DESTROY WHAT!" moments.
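Nothing is built for this yet, but using the regression functions from the technical section below, the rough shape might look something like this sketch (the tag, workspace, and backend config names are purely illustrative):
## Hypothetical upgrade-path test: all names are illustrative
git checkout "v1.4.0"                                  # the version currently in Production
regression_apply "prod-upgrade-test" "prod-copy.hcl"   # build a copy of Production
git checkout "main"                                    # switch to the candidate changes
regression_apply "prod-upgrade-test" "prod-copy.hcl"   # apply over the top as an upgrade
regression_destroy "prod-upgrade-test" "prod-copy.hcl" # tidy up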
Okay The Technical Bit
So, I've used a multitude of CI/CD applications in my time, and as such I wrote the below in Bash, since it's pretty much always available in every one of them.
The below functions assume a few things:
- they run inside a git repository, with one directory per Terraform context at the repo root (here just terraform/);
- each context keeps its backend configuration files under backend_config/;
- each context has a development.tfvars variable file;
- the workspace to build is selected via the TF_WORKSPACE environment variable.
Feel free to modify the code and remove elements you don't need or like; what matters is the concept of what it's doing. I run this in GitHub Actions so I can also run it on pull requests if I want to.
#!/bin/bash
set -e
# Reverse function so contexts are destroyed in the opposite
# order to the one they were applied in
reverse() {
  tac <(echo "$@" | tr ' ' '\n') | tr '\n' ' '
}

function regression_apply {
  local workspace="$1"
  local backendconfig="$2"
  local contexts="terraform"
  local gitroot=$(git rev-parse --show-toplevel)
  export TF_WORKSPACE="$workspace"
  echo "git root set as ${gitroot}"
  echo "workspace ${workspace} is being used"
  echo "backend config ${backendconfig} is being used"
  echo -e "____________________\nTerraform Applying\n____________________"
  for context in $contexts
  do
    echo "working on ${context}"
    cd "${gitroot}/${context}/"
    ## Init Terraform
    terraform init -backend-config=./backend_config/"${backendconfig}" -input=false
    ## Create the plan and export it
    terraform plan -input=false -out="tfplan_${context}" -var-file=./development.tfvars
    ## Apply the exported plan
    terraform apply "tfplan_${context}"
    ## Revert directory for the next loop
    cd "${gitroot}"
  done
  echo -e "____________________\nTerraform Applied\n____________________"
}
function regression_destroy {
  local workspace="$1"
  local backendconfig="$2"
  local contexts="terraform"
  local gitroot=$(git rev-parse --show-toplevel)
  export TF_WORKSPACE="$workspace"
  echo "git root set as ${gitroot}"
  echo "workspace ${workspace} is being used"
  echo "backend config ${backendconfig} is being used"
  echo -e "____________________\nTerraform Destroying\n____________________"
  ## Walk the contexts in reverse order so dependencies unwind cleanly
  for context in $(reverse $contexts)
  do
    echo "working on ${context}"
    cd "${gitroot}/${context}/"
    ## Init Terraform
    terraform init -backend-config=./backend_config/"${backendconfig}" -input=false
    ## Create the destroy plan and export it
    terraform plan -destroy -input=false -out="tfplan_destroy_${context}" -var-file=./development.tfvars
    ## Apply the exported destroy plan
    terraform apply "tfplan_destroy_${context}"
    ## Revert directory for the next loop
    cd "${gitroot}"
  done
  echo -e "____________________\nTerraform Destroyed\n____________________"
}
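A nightly or pull-request job then just sources the file and calls the two functions in order; the file name, workspace, and backend config names below are placeholders:
## Hypothetical invocation: all names are placeholders
source ./regression_functions.sh

regression_apply "regression-nightly" "development.hcl"
## ... run your acceptance tests against the freshly built environment here ...
regression_destroy "regression-nightly" "development.hcl"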