Automating CI for Infrastructure as Code

In my last post, we learned how Infrastructure as Code can be tested using standard software testing techniques, and walked through a concrete example of using Terratest to test Terraform code that sets up a simple containerized web application on AWS.

However, there is still a bit of a gap between testing locally from your laptop and setting up reliable, automated testing infrastructure. Let's look at some of the missing pieces.

Before we get started, just a quick reminder that you can see the full reference code on GitHub.

For continuous integration we'll use GitHub Actions (although any CI platform or service can be used in its place).

Let's take a look at the workflow code for running the Terratest setup that we did in the last article.

# terratest.yml

# ...

jobs:
  terratest:
    name: Terratest
    runs-on: ubuntu-latest
    steps:
      # Install Golang (needed by terratest)
    - name: Install Go 
      uses: actions/setup-go@v2
      with:
        go-version: ^1.15
      id: go
      # Install gcc (needed by terratest)
    - name: Install gcc
      run: "sudo apt-get update && sudo apt-get install -y gcc"
      # Install gotestsum (for simplifying `go test` output)
    - name: Setup gotestsum
      uses: autero1/[email protected]
      with:
        gotestsum_version: 0.5.3
      # Install Terraform
    - name: Setup Terraform
      uses: hashicorp/setup-terraform@v1
      with:
        terraform_wrapper: false
        terraform_version: ">= 0.13"

We start out by installing the prerequisites for running Terratest: Golang, gcc, gotestsum (for pretty-printing test output), and of course Terraform itself. Recall that Terratest calls out to the system Terraform binary rather than an embedded Terraform engine, so Terraform needs to be explicitly installed in our CI runner environment.

# terratest.yml

# ...


    - name: Authenticate to AWS
      uses: aws-actions/configure-aws-credentials@v1
      with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-2
    - name: Checkout code 
      uses: actions/checkout@v2
    - name: Terraform init
      run: cd terraform; terraform init
    - name: Terraform validate
      run: cd terraform; terraform validate

Because, as we saw in the last post, we will be creating actual resources in AWS with Terratest, we must pass in our AWS credentials - I'll get back to this in a bit. We also check out the source code from the repository, and run `terraform init` followed by `terraform validate` as a quick sanity check before the lengthier (and costlier) Terratest run.

At this point, hopefully you've already realized that we need to consider secret management. At the very least, we've seen that we must pass our AWS credentials into the CI runner, and if you use Terraform Cloud, you'll need to pass a CLI credentials token to the CI runner as well (there's more information in the source code about exactly how to do that). You can get away with not sharing cloud credentials if you use a self-hosted runner with cloud-provided credentials - for example, an EC2 instance with an IAM instance profile - but this won't cover Terraform Cloud, and it can also open a different sort of attack vector against your cloud infrastructure, which is beyond the scope of this beginner-level article.
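
If you do use Terraform Cloud, one way to get that token onto the runner is through the setup-terraform action itself. Here's a minimal sketch, assuming a secret named TF_API_TOKEN (an assumed name - the reference repository documents the exact approach it uses):

# terratest.yml

# ...

    - name: Setup Terraform
      uses: hashicorp/setup-terraform@v1
      with:
        terraform_wrapper: false
        terraform_version: ">= 0.13"
        # Pass the Terraform Cloud token stored in the (assumed) TF_API_TOKEN secret
        cli_config_credentials_token: ${{ secrets.TF_API_TOKEN }}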

In our code we use GitHub Secrets to keep these values safe, but never fear: there's a plethora of other options available for managing secrets in any other CI implementation. Don't skimp on finding the most secure solution that fits your needs!

Another point that we haven't yet considered is error handling. In classic software testing, local mocks are used to avoid mangling real data. However, as we've seen, when testing IaC we explicitly modify real infrastructure. As such, we need to deal with error handling and cleanup in case of unexpected failures (which will happen from time to time).

If we consider the types of failures that we need to deal with, for the most part they boil down to two categories: failures due to bugs in our code, and failures due to the infrastructure itself (both our CI platform and our cloud provider).

We already saw that we left ample timeouts and retries on the assertions, to account for the gap between the Terraform run finishing and the infrastructure actually being available. In our case, the gap is usually pretty small - we only need to wait for the containers to be scheduled and started on Fargate, and then a short while longer for the load balancer to register them as healthy - before we can expect our HTTP request to succeed.
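
As a reminder, here's a condensed sketch of what that retrying assertion looks like in the Terratest code. The output name, expected body and retry counts are illustrative placeholders - see the previous article and the reference repository for the real test:

package test

import (
    "fmt"
    "testing"
    "time"

    http_helper "github.com/gruntwork-io/terratest/modules/http-helper"
    "github.com/gruntwork-io/terratest/modules/terraform"
)

// TestIT_WebApp is a condensed sketch of the integration test from the
// previous article; names such as "alb_dns_name" and the expected body
// are placeholders, not the exact reference code.
func TestIT_WebApp(t *testing.T) {
    terraformOptions := &terraform.Options{TerraformDir: "../terraform"}

    // Tear the infrastructure down when the test finishes (but note the
    // caveat below: if the `go test` timeout expires, this defer never runs).
    defer terraform.Destroy(t, terraformOptions)
    terraform.InitAndApply(t, terraformOptions)

    url := fmt.Sprintf("http://%s", terraform.Output(t, terraformOptions, "alb_dns_name"))

    // Retry for up to ~10 minutes so the Fargate tasks have time to start and
    // the load balancer has time to register them as healthy.
    http_helper.HttpGetWithRetry(t, url, nil, 200, "Hello, World!", 60, 10*time.Second)
}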

But often there can be other, more time-consuming delays. For example, if we were to deploy the same sort of application on EKS, the test would take considerably longer, as it takes AWS approximately 6-10 minutes just to provision an EKS control plane.

Why is this important, you may ask?

Recall that Terratest is built on Golang's standard test library, which (like many unit test libraries) expects tests to run quickly. Golang's default test timeout is 10 minutes, but we can easily extend it. In our GitHub Actions workflow, below, we extend the timeout to 30 minutes, but you can tweak this to suit your testing needs.

# terratest.yml

# ...

    - name: Run terratest
      run: "cd test; gotestsum --format standard-verbose -- -run TestIT_ -count=1 -timeout 30m"


Finally, we run Terratest, wrapped by `gotestsum`, on all test functions whose names start with `TestIT_` (to distinguish Integration Tests from Unit Tests, the latter of which start with `TestUT_`), with our extended timeout. The `-count=1` flag makes sure the tests actually run every time instead of returning cached results.

That wraps up the GitHub Actions workflow itself.

An important "gotcha" to consider is that when the `go test` timeout expires, the process terminates immediately. This means that our `defer` statements won't run, which will leave infrastructure dangling. When running locally, we could deal with that by manually running `terraform destroy` to clean up the resources, but on an ephemeral CI runner in the cloud, we won't necessarily have access to the Terraform state files to do so.

Different test failures can, and almost certainly will, leave orphaned resources in your cloud infrastructure. Aside from causing odd side effects and eating into resource limits, this unintended waste will also start ballooning your cloud bill.

To deal with this, a best practice is to dedicate one specific cloud account (more is fine too, as long as it's a defined set of accounts) to all of your testing, and to ensure that these accounts are completely cleaned out on a regular basis.

To do this "spring cleaning", we'll need another tool. Since our example is on AWS, we'll look at an AWS-oriented tool called aws-nuke, but similar tools exist for other cloud providers.

At Coinmama, we use a single AWS account and schedule the account to be nuked daily - well, nightly - in a time window that we know won't interfere with testing.
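
In GitHub Actions, that nightly run is just a cron trigger on the workflow. A minimal sketch - the exact cron expression here is an assumption, so pick whatever quiet window suits you:

# awsnuke.yml

on:
  schedule:
    - cron: "0 2 * * *"    # nightly at 02:00 UTC - adjust to your own quiet window
  workflow_dispatch: {}    # optionally allow manual runs as well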

For safety, aws-nuke requires you to define a blacklist of accounts that it must never touch - your production account, for example - in addition to the accounts that you do want to nuke and a list of AWS regions to process, including the pseudo-region "global" for things like IAM definitions that don't exist at a region level. By default, aws-nuke destroys all resources in the listed accounts and regions, but you can also define exclude or include rules.
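
To make that concrete, here's a minimal sketch of what such a configuration might look like. The account IDs are placeholders and the filter property names are assumptions that may vary between aws-nuke versions - the real configuration lives in the companion repository:

# nuke-config.yml

regions:
  - us-east-2
  - global              # account-wide resources such as IAM

account-blacklist:
  - "111111111111"      # production - aws-nuke refuses to run without this list

accounts:
  "222222222222":       # the dedicated testing account
    filters:
      IAMUser:
        - "awsnuke"     # keep the IAM user that aws-nuke itself runs as
      IAMUserAccessKey:
        - property: UserName
          value: "awsnuke"
      IAMUserPolicyAttachment:
        - property: UserName
          value: "awsnuke"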

At Coinmama we work with multiple AWS accounts and use AWS's Control Tower service to manage them. Control Tower manages a small group of resources in each managed account: IAM definitions, CloudTrail, Lambdas and messaging services, to name a few. We have rules protecting these resources to avoid disruptions or errors in the AWS Control Tower service.

Another, probably more classic, setup is simply to have a static IAM identity (in the provided example it's named "awsnuke"), consisting of an IAM user, an IAM credential pair and an IAM policy allowing administrative access to the account.

In either case, you'll almost certainly have a few resources that you want to hold on to across account nukes - be it IAM entities, default VPCs, SSH key pairs or any other resource that you need.

One last point on that last sentence. 

Remember from the previous post how we talked about using the `base` Terraform module to provide supporting infrastructure before running Terraform on the actual code under test? That was designed with aws-nuke in mind: it expects a completely empty account, without any default VPCs or anything else already in place. Beyond minimizing the chance that missing dependent resources cause our tests to fail, this also helps our engineers understand exactly which dependencies our Terraform modules need and plan around them.

Let's see what the aws-nuke GitHub Actions workflow looks like.

# awsnuke.yml


jobs:
    awsnuke:
        runs-on: ubuntu-latest
        name: Nuke
        steps:
          - name: Authenticate to AWS
            uses: aws-actions/configure-aws-credentials@v1
            with:
                aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY }}
                aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
                aws-region: us-east-2
          - name: Checkout
            uses: actions/checkout@v2
          - name: Execute aws-nuke
            uses: coinmama/actions-awsnuke@main
            with:
                aws_nuke_config: "nuke-config.yml"

This is very simple in comparison to the previous workflow. We authenticate to AWS and check out the source code. Then we download and call aws-nuke with the configuration file `nuke-config.yml` from the root of the repository. You can see how to use the aws-nuke action on its GitHub repository.

You can view the full code of the aws-nuke config, as well as view sample output of the GitHub actions on the companion GitHub repository for this article series.

In conclusion: in this three-article mini-series, we've talked about the importance of automated testing for your Infrastructure as Code, and explained the "how"s and "why"s around it.

We then took a closer look at a simple cloud deployment pattern, and saw the Terraform code to deploy that pattern inside AWS. We discussed the pros and cons around different testing patterns, and understood that the lowest-hanging fruit is to test it the same way a human would: by accessing the hosted service and asserting that it responds with the content we expected to see. We then looked at the Terratest code needed to make that happen.

Lastly, we looked at how to automate those Terratest runs in a continuous integration pipeline using GitHub Actions. We looked briefly at the workflow code to make that work. We discussed some caveats that can affect remote testing and how to address them. Finally, we discussed the importance of a dedicated testing cloud account and implemented a "catch-all" daily cleanup of that account.

I hope that this series has both inspired you and given you the tools needed to protect your cloud infrastructure the same way that you're able to protect your proprietary software which runs on that infrastructure.

Questions? Comments? Feel free to leave comments below!
