Lessons from managing Terraform at scale

I had been meaning to write an article about my experiences managing Terraform setups. About a month ago, Slack Engineering published a post on how they use Terraform at Slack. It not only covered many of the aspects I had in mind but also went on to provide excellent detail on their Terraform journey and how their setup has evolved and matured over time. The post also describes the implementation of Slack’s in-house tooling for managing IaC with Terraform. If you are interested in using Terraform at scale, you might find the Slack post as interesting as I did.

My article today summarizes some of the aspects the Slack post covers and gives my viewpoint on each of them.

Using Terraform over CloudFormation for IaC

Even though Slack runs the bulk of its workloads in AWS, it also uses DigitalOcean, NS1, and GCP. That is why it chose Terraform over CloudFormation.

I have always found that comparing Terraform and CloudFormation is like comparing apples and oranges. People tend to focus only on the cloud services aspect of this comparison. Look at this page of all the Terraform providers, and you will see (at the time of writing) 53 pages of tools and services that you can provision, manage, and configure using Terraform. It is the one ring to manage all the others.
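
To make the multi-provider point concrete, here is a minimal sketch of one Terraform configuration that declares providers for the four platforms mentioned above. The source addresses are the ones published on the public Terraform Registry. Declaring all four in a single configuration is my own illustration and is not taken from the Slack post.

```hcl
# One configuration, several platforms. Each entry tells Terraform
# which registry namespace to download the provider plugin from.
terraform {
  required_providers {
    aws = {
      source = "hashicorp/aws"
    }
    google = {
      source = "hashicorp/google"
    }
    digitalocean = {
      source = "digitalocean/digitalocean"
    }
    ns1 = {
      source = "ns1-terraform/ns1"
    }
  }
}
```

CloudFormation has no equivalent of this block for non-AWS services, which is why the comparison tends to break down once a second platform enters the picture.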

AWS multi-account strategy

The post describes the usage-limit and access-control challenges that forced Slack to move from a single AWS account to a multi-account strategy. It also discusses how Slack uses dedicated AWS accounts for some services and teams.

A multi-account AWS strategy does add some complexity to your setup. However, I feel that for any decent-sized shop, a multi-account setup is the way to go. This page talks about a few of the benefits of using multiple accounts.
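
One common way to work with several accounts from Terraform is a provider alias per account, each assuming a role in its target account. The sketch below is only an illustration of that pattern, not a description of Slack's setup. The account IDs, role name, alias names, and bucket name are all hypothetical.

```hcl
# Default provider: resources land in the account that the role belongs to.
provider "aws" {
  region = "us-east-1"

  assume_role {
    # Hypothetical deployment role in a hypothetical workload account.
    role_arn = "arn:aws:iam::111111111111:role/terraform-deploy"
  }
}

# Aliased provider: same region, different account.
provider "aws" {
  alias  = "logging"
  region = "us-east-1"

  assume_role {
    # Hypothetical deployment role in a hypothetical logging account.
    role_arn = "arn:aws:iam::222222222222:role/terraform-deploy"
  }
}

# A resource opts into the second account by naming the alias.
resource "aws_s3_bucket" "audit_logs" {
  provider = aws.logging
  bucket   = "example-audit-logs"
}
```

The extra complexity mentioned above shows up here as role management: every account needs a role that the pipeline is allowed to assume.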

Orchestrating Terraform pipelines

The post provides some great details about how Slack designs the Jenkins pipelines and stages and how pipelines are chained for various environment deploys. It also discusses the challenges of regularly creating new pipelines and how Slack developed an in-house tool to quickly churn out new pipelines.

Having designed and developed Terraform pipelines in Jenkins, GitHub Actions, and CircleCI, I can attest that getting your first Terraform pipeline right is paramount. It is essential to settle on a standard convention for naming your environments, environment files, contexts, projects, and resources, and for how you use these names in your Terraform code. Once you have got the template right, you can copy and paste the pipeline to any repo, and it should work with minimal changes. If you are still spending time creating new pipelines, you have not got your template right.
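
As a rough sketch of what such a convention can look like on the Terraform side, the snippet below validates the environment name and derives one prefix that every resource name reuses. The variable names, the allowed environments, and the file layout in the comments are my assumptions, not something prescribed by the Slack post.

```hcl
# Hypothetical layout, one variable file per environment:
#   env/dev.tfvars, env/test.tfvars, env/prod.tfvars

variable "project" {
  type        = string
  description = "Short project name used in every resource name."
}

variable "environment" {
  type        = string
  description = "Deployment environment."

  validation {
    # Reject anything outside the agreed list of environment names.
    condition     = contains(["dev", "test", "prod"], var.environment)
    error_message = "The environment must be one of: dev, test, prod."
  }
}

locals {
  # Single place where the naming convention is defined.
  name_prefix = "${var.project}-${var.environment}"
}

output "name_prefix" {
  value = local.name_prefix
}
```

A pipeline stage then only needs to know the environment name, for example by running `terraform plan -var-file=env/dev.tfvars`, which is what makes the pipeline template portable across repos.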

Not having a centralized Ops team to manage all Terraform code

Slack has a central team, the Cloud Foundations team, which manages the tooling and the platform for running and managing Terraform. However, individual teams own their Terraform code.

I believe this is how the setup should be. The DevOps team should focus on the platform aspect of things. Functional code and IaC should go hand in hand, if not in the same hand. A dev team that does not own its IaC, with another team playing catch-up to update and manage it, is the perfect recipe for delays, even for small teams.

Designing Terraform states

The post describes how Slack designs its states and has some useful pointers on how this can be done at scale. It discusses having a state file per region in each child account and a separate state file for global services like IAM and CloudFront. It also talks about keeping the number of resources managed by each state file to a minimum.

I feel there is no silver bullet here; how you design your Terraform state depends on how you deploy and manage your workloads.

Terraform backend and state file

Slack uses S3 as the backend for the state and DynamoDB to manage the state lock.

I have seen teams opt for Terraform Cloud to manage state, thinking the homegrown solution is not robust enough. This is a fine example to prove otherwise. With just a little planning and additional security in place, you can create your very own robust state management setup. If you end up creating your own state management solution, keep in mind that how you write your IaC pipeline and how it ties to the state management component is critical for you to get right.
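
For reference, a minimal sketch of an S3 backend with DynamoDB locking looks like the block below. The bucket name, table name, and key layout are hypothetical. The key shown simply follows the one-state-per-region idea discussed earlier, and the lock table needs a string partition key named `LockID`.

```hcl
terraform {
  backend "s3" {
    # Hypothetical bucket that holds the state files.
    bucket = "example-terraform-state"

    # Hypothetical key convention: one state per component and region.
    key = "network/us-east-1/terraform.tfstate"

    region = "us-east-1"

    # Hypothetical DynamoDB table used for state locking.
    dynamodb_table = "example-terraform-locks"

    # Encrypt the state object at rest in S3.
    encrypt = true
  }
}
```

The additional security mentioned above typically means things like bucket versioning and tight IAM policies on the bucket and table, which live outside this block.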

Testing Terraform code

The post describes how Slack tests its Terraform code, the challenges involved, and how they tackle them. It covers in-house tooling to review changes, manage dependencies, assess the impact of a change, and provide on-demand boxes for test runs. It offers some useful pointers on all of these topics.

Here are a few of my thoughts on testing Terraform:

  1. It is critical to design your workflow so that Terraform code is tested before it is applied to higher environments.
  2. You should be able to churn out new environments quickly, sometimes in the same AWS account, with minimal parameter changes. This again ties to how well you have designed your IaC pipeline.
  3. Remember, running a plan is not enough; unless you apply your change, you cannot say that your code works.
  4. Terraform changes should work for an update, destroy, and fresh provisioning scenario. If you are not testing all three scenarios, you are not testing enough.
  5. A mix of IaC and manual configuration is a sure-shot recipe for delays. If everything cannot be automated immediately, keep a backlog of future enhancements and document your manual changes well.
  6. Just because the code works in Dev or Test does not guarantee it will work in Prod. Yeah, isn’t that a bummer? Identifying these scenarios and retrofitting the fixes is vital for a robust IaC setup.
  7. The actual test of your Terraform code is how confidently you can apply it to Production without worrying about impact or downtime.
  8. Last but not least, testing IaC takes time. Sometimes significantly more time than what the actual change took. Plan for it.
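
Points 3 and 4 above can be partly exercised with Terraform's built-in test framework (`terraform test`, available from Terraform 1.6). The sketch below is an illustration under assumptions: it supposes the module under test declares `project` and `environment` variables and a `name_prefix` output, all of which are hypothetical, as is the file name.

```hcl
# tests/environment.tftest.hcl (hypothetical file name)

# Inputs shared by every run block in this file.
variables {
  project = "example"
}

# Fresh provisioning: the default command is apply, so this run
# really creates the resources instead of only planning them.
run "fresh_provision" {
  variables {
    environment = "dev"
  }

  assert {
    condition     = output.name_prefix == "example-dev"
    error_message = "Unexpected name prefix after the first apply."
  }
}

# Update: a second apply against the same state with a changed input.
run "update" {
  variables {
    environment = "test"
  }

  assert {
    condition     = output.name_prefix == "example-test"
    error_message = "Name prefix was not updated by the second apply."
  }
}
```

When the file finishes, `terraform test` destroys what the runs created, so the destroy path is exercised as well. This does not replace testing against real environments, but it catches a class of problems that a plan alone will not.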

Using Terraform modules

The post explains in detail how Slack develops, manages, and consumes Terraform modules. It has some useful pointers on the challenges of a homegrown approach and on possible solutions.

Terraform modules often evolve over time. Patterns that are repeated often get packaged as modules. This page talks about some of the common patterns of module creation. While it is possible to refer to modules using a relative path or a git repo link, if you are a shop that creates and uses many in-house modules, you might want to explore the private registry feature of Terraform Cloud (HashiCorp's managed service offering).
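
The three ways of referring to a module mentioned above look roughly like this. The paths, repository URL, organization name, and version numbers are all hypothetical.

```hcl
# 1. Relative path: the module lives in the same repository.
module "vpc_local" {
  source = "../modules/vpc"
}

# 2. Git repository link, pinned to a tag with the ref query parameter.
module "vpc_git" {
  source = "git::https://example.com/example-org/terraform-modules.git//modules/vpc?ref=v1.2.0"
}

# 3. Private registry in Terraform Cloud:
#    <hostname>/<organization>/<module name>/<provider>
module "vpc_registry" {
  source  = "app.terraform.io/example-org/vpc/aws"
  version = "~> 1.2"
}
```

Note that the `version` argument only works with registry sources. With a git source, the tag in `ref` plays that role, which is one reason a registry becomes attractive once many teams consume the same modules.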

Challenges of upgrading Terraform versions

The post goes into considerable detail on the issues Slack has faced with Terraform and provider version upgrades and how they solved them with in-house tooling. It contains many implementation details with code snippets, which can be valuable if you are building such tooling.

Barring a few sporadic issues, I have not really faced many problems with Terraform or provider version upgrades, maybe because I have predominantly worked on greenfield setups. However, as the post stresses, having a Terraform and provider version upgrade strategy can be very useful.
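
A simple starting point for such a strategy is to pin versions explicitly so that upgrades happen when you choose, not when a new release appears. The block below is a minimal sketch; the version numbers are placeholders and not a recommendation.

```hcl
terraform {
  # Allow only 1.5.x releases of Terraform itself (placeholder version).
  required_version = "~> 1.5.0"

  required_providers {
    aws = {
      source = "hashicorp/aws"

      # Allow any 5.x release of the AWS provider (placeholder version).
      version = "~> 5.0"
    }
  }
}
```

The exact provider versions selected are recorded in the `.terraform.lock.hcl` dependency lock file, and running `terraform init -upgrade` moves to the newest versions that the constraints allow. Committing the lock file keeps every pipeline run on the same provider build until you decide to upgrade.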
