Day 58: Ways to Manage Terraform at Scale – Best Practices #90DaysofDevOps
Ayushi Tiwari
Java Software Developer | Certified Microsoft Technology Associate (MTA)
Learning Terraform can be easy. The HCL configuration language is simple enough that you can start with basic resource creation straight away.
But—
Once the code that defines your infrastructure grows in complexity and dependencies, keeping it maintainable can become an uphill battle. That is why working out an efficient deployment workflow is crucial for further development.
In this article, you will see five approaches to managing Terraform workflows at scale, their benefits and downsides, and the problems they address.
Note: New versions of Terraform are placed under the BUSL (Business Source License), but everything up to and including version 1.5.x stays open source. OpenTofu is an open-source fork of Terraform that will expand on Terraform’s existing concepts and offerings. It is a viable alternative to HashiCorp’s Terraform, having been forked from Terraform version 1.5.6. OpenTofu retains all the features and functionality that made Terraform popular among developers while also introducing improvements and enhancements. OpenTofu is the future of the Terraform ecosystem, and having a truly open-source project to support all your IaC needs is the main priority.
Local execution

A quick and easy way to start deploying resources using the Infrastructure as Code pattern is to use Terraform locally, from the same machine you’re developing the code on. The only requirements are access to the target provider and an installed Terraform binary.
This approach makes it much easier to run commands related to day-to-day operations (such as importing, moving, or removing state entries) or to read outputs than setups where you have no direct access to the underlying provider.
Even though this is the easiest way to work with Terraform, it is not a best practice: as soon as you scale and have multiple DevOps engineers working on your project, it becomes very complicated.
You can also use Terraform features such as remote state and state locking to work on the same codebase with your teammates (see the backend sketch below). If you collaborate using a version control system and do not have continuous integration in place, but want to keep your code at a decent level of quality, there are tools to help you.
One of these is pre-commit-terraform, which runs a set of selected tools to lint the code you’re creating before you even commit it. Even if you have continuous integration available, it’s a handy way to avoid wasting time waiting for results from pull request status checks.
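As a hedged illustration of the remote state and state locking mentioned above, here is a minimal backend sketch assuming an S3 bucket and a DynamoDB lock table (all names below are placeholders, not values from this article):

```hcl
terraform {
  backend "s3" {
    bucket         = "example-terraform-state"      # placeholder bucket holding the shared state
    key            = "prod/network/terraform.tfstate"
    region         = "eu-west-1"
    dynamodb_table = "terraform-state-lock"         # placeholder table used for state locking
    encrypt        = true
  }
}
```

With a configuration like this, every teammate reads and writes the same state, and the lock table prevents two applies from running against it at the same time.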
What are the drawbacks of this approach? Each action is manual and must be performed directly by the developer, which leads to several issues.
In a situation where multiple people are working on the same codebase, you could often encounter a scenario such as this one:
Person A applies their changes to the environment and follows up with a pull request. At the same time, Person B would like to deploy their changes but is blocked, as their codebase does not yet include Person A’s changes. Applying changes from that outdated state could damage the resources created in the previous deployment. In larger teams, this often results in a lot of time wasted simply waiting and rebasing before changes can be applied.
If you happen to write unit tests for Terraform modules (and you should), you’re sure to know they need to be executed manually from your computer. Running them every time you change something in your configuration can be tedious and time-consuming. If you do not yet write tests for your modules, you’d be well advised to start looking at Spacelift module tests or Terratest.
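For illustration only, here is a minimal sketch of such a test using the native test framework that ships with recent Terraform and OpenTofu releases (terraform test / tofu test); Terratest, mentioned above, is a Go-based alternative. The file, variable, and output names are hypothetical:

```hcl
# tests/defaults.tftest.hcl — hypothetical test file placed inside the module
run "plan_with_default_cidr" {
  command = plan

  variables {
    cidr_block = "10.0.0.0/16"                            # hypothetical module input
  }

  assert {
    condition     = output.cidr_block == "10.0.0.0/16"    # hypothetical module output
    error_message = "Module did not expose the expected CIDR block"
  }
}
```

In the local workflow described here, you would still have to remember to run tests like this by hand before every change.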
Privileged access to an underlying vendor is required, which can easily result in a compromised environment. It’s much harder to keep restricted access within this model, as Terraform must be allowed to create, change, and destroy resources. The same rule applies to a scenario where access to private (internal) resources is required, e.g., every developer has to set up a VPN connection to interact with them.
Moreover, every person working with the codebase needs access to the Terraform state, which creates a security concern due to how Terraform state works. Even if you’re pulling sensitive data from an external system such as Vault, it will be stored within the state file in plaintext once used in a resource. A threat actor could simply run terraform state pull to access all the secrets!
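As a hedged sketch of why this matters, consider a configuration that reads a database credential from Vault and passes it to a resource; the Vault path, keys, and resource names below are made up for illustration:

```hcl
# Read a secret from Vault (path and keys are hypothetical)
data "vault_generic_secret" "db" {
  path = "secret/app/database"
}

resource "aws_db_instance" "app" {
  identifier        = "app-db"
  engine            = "postgres"
  instance_class    = "db.t3.micro"
  allocated_storage = 20
  username          = data.vault_generic_secret.db.data["username"]
  password          = data.vault_generic_secret.db.data["password"]

  # After apply, both values are written to the state file in plaintext,
  # so anyone who can run `terraform state pull` can read them.
}
```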
Keeping all these issues in mind, let’s take a look at a more automated approach.
Homegrown automation

Implementing homegrown automation means incorporating the Terraform workflow into your continuous integration / continuous deployment (CI/CD) process. A specific example would be tied to the CI/CD tool you are using, so let’s keep the discussion as generic as possible.
As the applying actor is the CI/CD system, the hard requirement of privileged access for developers is gone. You can grant this type of access to the execution layer while developers keep read-only permissions. The same applies to private (internal) resources, provided your code is executed on an agent that resides within the target infrastructure and has the proper permissions set. In a model where everything has to go through the CI/CD pipeline, visibility of changes is also improved, as the logs produced during each operation are available in real time.
Linting, compliance checks, and automated unit tests can be moved to pull request status checks, and it’s up to you to configure these steps of the pipeline.
But—
If your CI/CD is executing jobs concurrently, you will end up with race conditions causing pull requests to fail non-deterministically.
It is not difficult to imagine a situation where two operators open a pull request within several seconds of one another. The pipeline gets triggered immediately for both, and one pull request will be marked as successful, while the other will be marked as failed. This situation will definitely happen if you are using state locking (it is a best practice and you really should be using it).
The explanation is simple: the first pull request acquired the state lock, so the second one could not, and its run failed. This issue can be resolved by configuring queued runs, so that only one pipeline runs at a time; however, this is not so simple with some CI/CD products.
The biggest issue and concern with this approach is the effort you have to put into building a complete pipeline. It might be tempting to implement every requirement yourself, but taking into account the time and effort involved, you might want to look at alternatives, such as open-source tools.
Atlantis

One of the most popular open-source tools to use with Terraform is Atlantis, which describes itself as “Terraform pull request automation”.
Atlantis provides a way for teams to collaborate on Terraform code changes by creating a pull request workflow. With Atlantis, developers can create, review, and merge Terraform code changes using familiar Git-based workflows. Atlantis also provides features such as automated plan and apply workflows, automatic branch cleanup, and integration with popular collaboration tools like Slack and GitHub.
It doesn’t offer a dedicated user interface; you will need to use the one provided by your VCS. And if you want more advanced features like policy as code or drift detection, you will need to build those integrations into your pipeline yourself, as Atlantis doesn’t provide them out of the box.
Terraform Cloud

Terraform Cloud is an application provided by HashiCorp, the company behind Terraform itself. It is available for free for up to five users, and two paid plans, “Team & Governance” and “Business”, cover larger teams. As the product is created by the Terraform authors themselves, any new feature that makes it into Terraform quickly becomes available within Terraform Cloud as well. Neat!
Advantages of Terraform Cloud
Advanced security and compliance mechanisms, available through Sentinel, allow you to enforce organization standards before the code actually ships and creates resources in the chosen provider.
Another important feature is a module registry that acts as an artifact repository for all your modules; it can be compared to JFrog Artifactory, Sonatype Nexus, or similar software. Keeping every module in a separate repository with its own versioning allows you to control code promotion in a granular fashion or quickly roll back to a previous version.
Having a centralized place to store all the pieces also enables you to quickly browse through available modules, their documentation, inputs, outputs, and used resources. It is a great single pane of glass in terms of module management. However, it can only be used within the same Terraform Cloud organization and cannot be shared. This might be troublesome for operations teams that base their separation on the organization level.
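As a hedged illustration, consuming a module from a private registry and pinning it to a release looks roughly like this (the organization, module, and input names are placeholders):

```hcl
module "network" {
  # <hostname>/<organization>/<name>/<provider> — all placeholder values
  source  = "app.terraform.io/example-org/network/aws"
  version = "2.3.1"              # pin an exact release; change this to promote or roll back

  vpc_cidr = "10.0.0.0/16"       # hypothetical module input
}
```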
There’s also the remote execution backend, a feature exclusive to Terraform Cloud. It allows you to work with Terraform on your local computer as you usually would, upload your local codebase, and run a plan against it. It is a very efficient way of quickly verifying that what you are working on actually works.
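A minimal sketch of wiring a working directory to Terraform Cloud’s remote execution, assuming a hypothetical organization and workspace name:

```hcl
terraform {
  cloud {
    organization = "example-org"        # placeholder organization name

    workspaces {
      name = "networking-prod"          # placeholder workspace name
    }
  }
}
```

After terraform login and terraform init, a local terraform plan uploads the working directory and executes remotely in that workspace, streaming the output back to your terminal.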
Disadvantages of Terraform Cloud
Along with the many benefits, there are some drawbacks as well. Any linting or module unit tests need to be implemented inside your own continuous integration system, which requires additional effort from the development team to sustain code quality.
Integrations are limited to what Run Tasks support at any given time, meaning that you cannot really own your workflow; you have to make do with what Terraform Cloud gives you.
Another disadvantage is that you cannot control the steps that happen before and after an operation (init/plan/apply). Run tasks can be added before or after a plan and before an apply, but not before an init or after an apply, which again shows a lack of flexibility. You will need to build pipelines in GitHub Actions, or whatever other CI/CD tool you prefer, just to run simple checks before your code executes. The only thing you can actually do is trigger remote Terraform command execution through the CLI, which is not that helpful overall.
The biggest drawback is that Terraform Cloud doesn’t support any products other than Terraform. Of course, this is not its purpose, but since many organizations use Kubernetes and Ansible in conjunction with Terraform, it means you need to adopt other tools like ArgoCD and Ansible Tower to manage your workflows end to end.
Best practices for managing Terraform at scale
Regardless of how you decide to manage your Terraform code base, some best practices apply: