Automating CI for Infrastructure as Code
In my last post, we learned how Infrastructure as Code can be tested using standard software testing techniques, and looked at a specific example of using Terratest to test Terraform code that sets up a simple containerized web application on AWS.
However, there is still a gap between testing locally from your laptop and setting up reliable, automated testing infrastructure. Let's look at some of the missing pieces.
Before we get started, just a quick reminder that you can see the full reference code on GitHub.
For continuous integration we'll use GitHub Actions (although any other platform or service can be used in its place).
Let's take a look at the workflow code for running the Terratest setup that we did in the last article.
```yaml
# terratest.yml
# ...
jobs:
  terratest:
    name: Terratest
    runs-on: ubuntu-latest
    steps:
      # Install Golang (needed by terratest)
      - name: Install Go
        uses: actions/setup-go@v2
        with:
          go-version: ^1.15
        id: go
      # Install gcc (needed by terratest)
      - name: Install gcc
        run: "sudo apt-get update && sudo apt-get install -y gcc"
      # Install gotestsum (for simplifying `go test` output)
      - name: Setup gotestsum
        uses: autero1/[email protected]
        with:
          gotestsum_version: 0.5.3
      # Install Terraform
      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v1
        with:
          terraform_wrapper: false
          terraform_version: ">= 0.13"
```
We start out by installing the prerequisites for running Terratest: Go, gcc, gotestsum (for pretty-printing test output), and of course Terraform itself. Recall that Terratest calls the system Terraform rather than an embedded Terraform engine, so it must be explicitly installed in our CI runner environment.
```yaml
# terratest.yml
# ...
      - name: Authenticate to AWS
        uses: aws-actions/configure-aws-credentials@v1
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-2
      - name: Checkout code
        uses: actions/checkout@v2
      - name: Terraform init
        run: cd terraform; terraform init
      - name: Terraform validate
        run: cd terraform; terraform validate
```
Because, as we saw in the last post, Terratest will be creating actual resources in AWS, we must pass in our AWS credentials - I'll get back to this in a bit. We also check out the source code from the repository, and run `terraform init` followed by `terraform validate` as a quick sanity check before the lengthier (and costlier) Terratest run.
At this point, hopefully you've already realized that we need to consider secret management. At the very least, we've seen that we must pass our AWS credentials in to the CI runner, and if you use Terraform Cloud, you'll need to pass a CLI credentials token to the CI runner too (there's more information in the source code about exactly how to do that). You can get away with not sharing cloud credentials if you use a self-hosted runner with cloud-provided credentials - for example, an EC2 instance with an IAM instance profile - but this won't cover Terraform Cloud, and it can also open a different sort of attack vector into your cloud infrastructure, which is beyond the scope of this beginner-level article.
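If you do use Terraform Cloud, one common way to get the CLI credentials token onto the runner is through the setup-terraform step we already have, reading it from a repository secret. Here's a minimal sketch, assuming a secret hypothetically named `TF_API_TOKEN`; see the companion repository for the exact approach used in this series.

```yaml
# terratest.yml (sketch) - passing a Terraform Cloud CLI token to the runner.
# Assumes a repository secret named TF_API_TOKEN (illustrative name).
      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v1
        with:
          terraform_wrapper: false
          terraform_version: ">= 0.13"
          cli_config_credentials_token: ${{ secrets.TF_API_TOKEN }}
```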
In our code we use GitHub's encrypted secrets to keep these values safe, but never fear: there's a plethora of options for handling this on any other CI platform. Don't skimp on finding the most secure solution that fits your needs!
Another point that we haven't really considered yet is error handling. In classic software testing, local mocks are used to avoid mangling real data. However, as we've seen, when testing IaC we explicitly modify real infrastructure. As such, we need to deal with error handling and cleanup in case of unexpected failures (which will happen from time to time).
If we consider the types of failures we need to deal with, they mostly boil down to two categories: failures due to bugs in our code, and failures due to the infrastructure itself (both our CI platform and our cloud provider).
We already saw that we built ample timeouts and retries into our assertions, to account for the gap between the Terraform run finishing and the infrastructure actually being available. In our case the gap is usually pretty small - we only need to wait for the containers to be scheduled and started on Fargate, and then a short while longer for the load balancer to register them as healthy - before we can expect our HTTP request to succeed.
But there can often be other time-consuming steps. For example, if we were to deploy the same sort of application on EKS, the test would take longer, as it takes AWS approximately 6-10 minutes just to provision an EKS control plane.
Why is this important, you may ask?
Recall that Terratest is built on Go's standard testing library, which (like many unit testing libraries) expects tests to run quickly. Go's default test timeout is 10 minutes, but we can easily extend it. In our GitHub Actions workflow, below, we extend the timeout to 30 minutes; you can tweak this to suit your testing needs.
```yaml
# terratest.yml
# ...
      - name: Run terratest
        run: "cd test; gotestsum --format standard-verbose -- -run TestIT_ -count=1 -timeout 30m"
```
Finally, we run Terratest, wrapped by `gotestsum`, on all test functions whose names start with `TestIT_` (distinguishing integration tests from unit tests, which start with `TestUT_`), with our extended timeout.
That wraps up the GitHub Actions workflow itself.
An important "gotcha" to consider is that when the `go test` timeout expires, the process terminates immediately. This means that our `defer` statements won't run, which will leave infrastructure dangling. When running locally, this can be dealt with by manually running `terraform destroy` to clean up the resources, but on an ephemeral CI runner in the cloud, we won't necessarily have access to the Terraform state files to do so.
Different test failures can, and almost certainly will, leave unintended leftovers in your cloud infrastructure. Aside from causing odd side effects and bumping into resource limits, this waste will also start ballooning your cloud infrastructure bill.
To deal with this, a best practice is to designate one specific cloud account (more is fine too, as long as it's a defined set of accounts) for all of your testing, and to ensure that these accounts are completely cleaned out on a regular basis.
To do this "spring cleaning", we'll need to use another tool. Since our example is with AWS infrastructure, we'll look at an AWS-oriented tool called aws-nuke, but similar tools exist for all cloud providers.
At Coinmama, we use a single AWS account for this and schedule it to be nuked daily - well, nightly - in a time window which we know won't interfere with testing.
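In GitHub Actions, that nightly window translates to a scheduled trigger on the aws-nuke workflow. A minimal sketch - the 02:00 UTC slot here is purely illustrative, so pick whatever window doesn't collide with your test runs:

```yaml
# awsnuke.yml (sketch) - nightly trigger for the cleanup workflow
on:
  schedule:
    - cron: "0 2 * * *"  # every night at 02:00 UTC (illustrative)
  workflow_dispatch: {}  # also allow kicking off a nuke manually
```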
For safety, aws-nuke requires you to define a blocklist of accounts it must never touch - your production account, for example - in addition to the accounts that you want to nuke and a list of AWS regions to process, including the "global" pseudo-region for things like IAM definitions that don't exist at a region level. By default, aws-nuke destroys all resources in the listed accounts and regions, but you can also define exclude or include rules.
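To make that concrete, here is a rough sketch of the top level of such a configuration. The account IDs are placeholders, and depending on your aws-nuke version the key may be spelled `account-blocklist` or `account-blacklist`; the real configuration for this series lives in the companion repository.

```yaml
# nuke-config.yml (sketch) - placeholder account IDs, not real values
regions:
  - us-east-2
  - global             # covers non-regional resources such as IAM

account-blocklist:
  - "111111111111"      # production - aws-nuke must never touch this account

accounts:
  "222222222222":       # the dedicated testing account
    filters: {}         # resources to keep across nukes; see the example below
```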
At Coinmama we work with multiple accounts within AWS and use AWS's Control Tower service to manage them. Control Tower manages a small group of resources in each managed account: IAM definitions, CloudTrail, Lambdas and messaging services, to name a few. We have rules protecting these resources to avoid disruptions or errors in the AWS Control Tower service.
Another, probably more classic, setup is simply to have a static identity for aws-nuke (in the provided example it's named "awsnuke"), consisting of an IAM user, a credential pair, and an IAM policy granting administrative access to the account.
In either case, you'll almost certainly have a few resources that you want to hold on to across account nukes - be they IAM entities, default VPCs, SSH key pairs or anything else you need.
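For example, keeping the "awsnuke" IAM user (along with its access key and admin policy attachment) and an SSH key pair out of the blast radius might look roughly like this; the resource type names and values are illustrative, so check them against the resource list of the aws-nuke version you run:

```yaml
# nuke-config.yml (sketch, continued) - filters for resources that should survive each nuke
accounts:
  "222222222222":
    filters:
      IAMUser:
        - "awsnuke"
      IAMUserAccessKey:
        - type: glob
          value: "awsnuke -> *"
      IAMUserPolicyAttachment:
        - "awsnuke -> AdministratorAccess"
      EC2KeyPair:
        - "my-testing-keypair"   # hypothetical key pair name
```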
One last point on holding resources across nukes.
Remember how, in the previous post, we talked about using the `base` Terraform module to provide supporting infrastructure before running Terraform on the actual code under test? That module is designed with aws-nuke in mind: it expects a completely empty account, without default VPCs or anything else already in place. Beyond minimizing the chance that a missing dependency causes our tests to fail, this also helps our engineers understand exactly which dependencies our Terraform modules need and plan around them.
Let's see how the aws-nuke GitHub Actions workflow works.
```yaml
# awsnuke.yml
jobs:
  awsnuke:
    runs-on: ubuntu-latest
    name: Nuke
    steps:
      - name: Authenticate to AWS
        uses: aws-actions/configure-aws-credentials@v1
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-2
      - name: Checkout
        uses: actions/checkout@v2
      - name: Execute aws-nuke
        uses: coinmama/actions-awsnuke@main
        with:
          aws_nuke_config: "nuke-config.yml"
```
This is very simple compared to the previous workflow. We authenticate to AWS and check out the source code, then download and run aws-nuke with the configuration file `nuke-config.yml` from the root of the repository. You can see how to use the aws-nuke action on its GitHub repository.
You can view the full aws-nuke config, as well as sample output of the GitHub Actions runs, on the companion GitHub repository for this article series.
In conclusion: in this three-article mini-series, we've talked about the importance of automated testing for your Infrastructure as Code, and explained the hows and whys around it.
We then took a closer look at a simple cloud deployment pattern, and saw the Terraform code to deploy that pattern inside AWS. We discussed the pros and cons around different testing patterns, and understood that the lowest-hanging fruit is to test it the same way a human would: by accessing the hosted service and asserting that it responds with the content we expected to see. We then looked at the Terratest code needed to make that happen.
Lastly, we looked at how to automate those Terratest runs in a continuous integration pipeline using GitHub Actions. We looked briefly at the workflow code needed to make that work, discussed some caveats that can affect remote testing, and addressed how to solve them. Finally, we discussed the importance of a dedicated testing cloud account and implemented a "catch-all" daily cleanup of that account.
I hope that this series has both inspired you and given you the tools needed to protect your cloud infrastructure the same way that you're able to protect your proprietary software which runs on that infrastructure.
Questions? Comments? Feel free to leave comments below!