Lessons from managing Terraform at scale
I had been meaning to write an article about my experiences managing Terraform setups. About a month ago, Slack Engineering published a post on how they use Terraform at Slack. It not only covered many of the aspects I had in mind but also gave excellent detail on their Terraform journey and how their setup has evolved and matured over time. The post also describes the implementation of Slack’s in-house tooling for managing IaC with Terraform. If you are interested in using Terraform at scale, you may find it as interesting as I did.
My article today summarizes some of the aspects the Slack post covers and offers my viewpoint on each.
Using Terraform over CloudFormation for IaC
Even though Slack runs the bulk of its workloads in AWS, it also uses DigitalOcean, NS1, and GCP. That is why it chose Terraform over CloudFormation.
I have always found that comparing Terraform and CloudFormation is like comparing apples and oranges. People tend to focus only on the cloud services aspect of the comparison. Look at this page of all the Terraform providers, and you will see 53 pages of tools and services that you can provision, manage, and configure using Terraform. It is the one ring to manage all the others.
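To make the point concrete, here is a minimal sketch of a single Terraform configuration that declares providers for several platforms side by side. It covers only a subset of the platforms mentioned above, and it is an illustration of the idea rather than anything taken from Slack's setup.

```hcl
# Illustrative only: one configuration, several unrelated platforms.
# Each provider is downloaded from the Terraform Registry on `terraform init`.
terraform {
  required_providers {
    aws = {
      source = "hashicorp/aws"
    }
    google = {
      source = "hashicorp/google"
    }
    digitalocean = {
      source = "digitalocean/digitalocean"
    }
  }
}
```

Resources from all three providers can then live in the same plan, which is something CloudFormation was never meant to do.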
AWS multi-account strategy
The post describes the usage-limit and access-control challenges that forced Slack to move from a single AWS account to a multi-account strategy. It also discusses how Slack uses dedicated AWS accounts for some services and teams.
A multi-account AWS strategy does add some complexity to your setup. However, I feel that for any decent-sized shop, a multi-account setup is the way to go. This page talks about a few of the benefits of using multiple accounts.
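One common way to target several accounts from Terraform is a provider alias per account, each assuming a role in that account. The sketch below assumes this pattern; the account IDs, role name, aliases, and topic name are all hypothetical, and the Slack post may well do this differently.

```hcl
# Hypothetical account IDs and role names, for illustration only.
provider "aws" {
  alias  = "networking"
  region = "us-east-1"

  assume_role {
    role_arn = "arn:aws:iam::111111111111:role/terraform-deploy"
  }
}

provider "aws" {
  alias  = "payments"
  region = "us-east-1"

  assume_role {
    role_arn = "arn:aws:iam::222222222222:role/terraform-deploy"
  }
}

# A resource is placed in a specific account by selecting the aliased provider.
resource "aws_sns_topic" "alerts" {
  provider = aws.payments
  name     = "payments-alerts"
}
```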
Orchestrating Terraform pipelines
The post provides some great details about how Slack designs the Jenkins pipelines and stages and how pipelines are chained for various environment deploys. It also discusses the challenges of regularly creating new pipelines and how Slack developed an in-house tool to quickly churn out new pipelines.
Having designed and developed Terraform pipelines in Jenkins, GitHub Actions, and CircleCI, I can attest that getting your first Terraform pipeline right is paramount. It is essential to settle on a standard convention for naming your environments, environment files, contexts, projects, and resources, and for how you use these names in your Terraform code. Once you have got the template right, you can copy and paste the pipeline to any repo, and it should work with minimal changes. If you are spending time creating new pipelines, you have not got your template right.
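As a rough sketch of what such a convention can look like inside the Terraform code, the snippet below validates the environment name and derives a shared name prefix and tag set from it. The variable names, the allowed environment list, and the prefix format are assumptions made for illustration.

```hcl
# Assumed convention: every root module accepts the same two inputs.
variable "environment" {
  type        = string
  description = "Deployment environment; the pipeline passes this in per stage."

  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "The environment must be one of: dev, staging, prod."
  }
}

variable "project" {
  type        = string
  description = "Short project name used as a prefix for resource names."
}

locals {
  # All resource names and tags are derived from the same two inputs.
  name_prefix = "${var.project}-${var.environment}"
  common_tags = {
    Project     = var.project
    Environment = var.environment
  }
}
```

With this in place, a pipeline stage only needs to pick the right variables file (for example via `terraform plan -var-file=...`), and the same template works unchanged across repos.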
Not having a centralized Ops team to manage all Terraform code
Slack has a central team, the Cloud Foundations team, which manages the tooling and the platform for running and managing Terraform; individual teams, however, own their Terraform code.
I believe this is how the setup should be. The DevOps team should focus on the platform aspect of things. Functional code and IaC should go hand in hand, if not in the same hand. The dev team not owning IaC and another team playing catchup with Dev to update and manage IaC is the perfect recipe for delays, even for small teams.
Designing Terraform states
The post explains how Slack designs its states and offers useful pointers on how to do this at scale. It discusses having a state file per region in each child account and a separate state file for global services like IAM and CloudFront. It also talks about keeping the number of resources managed by each state file to a minimum.
I feel there is no silver bullet here; how you design your Terraform state is unique to how you deploy and manage workloads.
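A per-account, per-region split like the one described above usually comes down to giving each root module its own state key. The sketch below assumes an S3 backend (covered in the next section); the bucket name, account name, and directory layout are hypothetical.

```hcl
# Regional root module, e.g. accounts/payments/us-east-1/ (hypothetical path).
terraform {
  backend "s3" {
    bucket = "example-org-terraform-state"
    key    = "payments/us-east-1/terraform.tfstate"
    region = "us-east-1"
  }
}

# The global root module for the same account lives in a separate directory
# and would differ only in its key, for example:
#   key = "payments/global/terraform.tfstate"
```

Because each key is a separate state, a plan in one region never has to refresh or lock resources that belong to another region or to the global services.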
Terraform backend and state file
Slack uses S3 as the backend for the state and DynamoDB to manage the state lock.
I have seen teams opt for Terraform Cloud to manage state, thinking the homegrown solution is not robust enough. This is a fine example to prove otherwise. With just a little planning and additional security in place, you can create your very own robust state management setup. If you end up creating your own state management solution, keep in mind that how you write your IaC pipeline and how it ties to the state management component is critical for you to get right.
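For reference, the building blocks of such a homegrown setup are small. The sketch below shows one way to provision a versioned, encrypted, non-public state bucket and a lock table; the resource names are hypothetical, and this is not a description of how Slack configures theirs.

```hcl
# Hypothetical names; typically applied once from a small bootstrap configuration.
resource "aws_s3_bucket" "state" {
  bucket = "example-org-terraform-state"
}

# Versioning makes it possible to recover an earlier state file.
resource "aws_s3_bucket_versioning" "state" {
  bucket = aws_s3_bucket.state.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "state" {
  bucket = aws_s3_bucket.state.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}

resource "aws_s3_bucket_public_access_block" "state" {
  bucket                  = aws_s3_bucket.state.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# The S3 backend expects a string partition key named "LockID" on the lock table.
resource "aws_dynamodb_table" "locks" {
  name         = "example-org-terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```

Each root module then points its `backend "s3"` block at this bucket, sets `dynamodb_table` to the lock table name, and sets `encrypt = true`.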
Testing Terraform Code
The post describes how Slack tests Terraform code, the challenges involved, and how they tackle them. It covers in-house tooling to review changes, manage dependencies, assess change impact, and provide on-demand boxes for test runs. It has useful pointers on all of these topics.
Here are a few of my thoughts on testing Terraform.
Using Terraform modules
The post talks in detail about how Slack develops, manages, and consumes Terraform modules. It has useful pointers on the challenges of a homegrown approach and how they can be solved.
Terraform modules often evolve over time: common patterns that are repeated often get packaged as modules. This page talks about some of the common patterns of module creation. While it is possible to refer to modules using a relative path or a Git repo link, if you are a shop that creates and uses many in-house modules, you might want to explore the private registry feature of Terraform Cloud (HashiCorp's managed service offering).
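The three ways of referencing a module mentioned above look roughly like this. The module name, repository URL, organization, and version numbers are hypothetical, and the module's own input variables are omitted for brevity.

```hcl
# 1. Relative path: simple, but every caller always gets whatever is on disk.
module "vpc_local" {
  source = "../modules/vpc"
}

# 2. Git repository: the ref pins the module to a tag, branch, or commit.
module "vpc_git" {
  source = "git::https://example.com/example-org/terraform-modules.git//vpc?ref=v1.2.0"
}

# 3. Private registry on Terraform Cloud: supports version constraints.
module "vpc_registry" {
  source  = "app.terraform.io/example-org/vpc/aws"
  version = "1.2.0"
}
```

The `version` argument is only available for registry sources, which is one practical reason a private registry becomes attractive once many teams consume the same modules.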
Challenges of upgrading Terraform versions
The post goes into considerable detail on the issues Slack has faced with Terraform and provider version upgrades and how they solved them with in-house tooling. It contains many implementation details and code snippets, which can be valuable if you are building such tooling.
Barring a few sporadic issues, I have not really faced many problems with Terraform or provider version upgrades, perhaps because I have mostly worked on greenfield setups. However, as the post stresses, having a Terraform and provider version upgrade strategy can be very useful.
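A simple starting point for such a strategy is to pin both the Terraform and provider versions explicitly in each root module, so that upgrades happen deliberately rather than by accident. The version numbers below are examples only, not recommendations.

```hcl
terraform {
  # Allow only patch releases of the chosen Terraform minor version.
  required_version = "~> 1.5.0"

  required_providers {
    aws = {
      source = "hashicorp/aws"
      # Allow minor and patch releases within major version 5.
      version = "~> 5.0"
    }
  }
}
```

The exact provider versions selected are recorded in the `.terraform.lock.hcl` dependency lock file, and `terraform init -upgrade` moves them forward within these constraints, which makes each upgrade a reviewable change.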