A Fundamental Mistake in "DevOps"

I've been working as a "DevOps Engineer" for about 8 years, having been an infrastructure guy for about 15 years before that. I've been part of many Agile software teams, at companies both small and huge. I build CI/CD pipelines, Infrastructure as Code (IaC), and automation scripts for a living. That's what DevOps Engineers do (even if that's not the spirit of the term "DevOps" as it was originally envisioned).

The default tool that companies use for IaC is HashiCorp Terraform. It supplies a high-level declarative language that anyone can learn and carry from company to company. It is extensible with plug-ins (Terraform calls them providers) for nearly anything you care to automate. Some of the most common plug-ins are those for the cloud providers: Azure, AWS, GCP, etc. This makes Terraform appear less cloud-specific, because you can use the same language and just change your plug-ins.

One of the major features of Terraform is that after you run an "apply" to do some work, you get a text file containing the details of what was done. This is called the "state file", and you can keep it in source control or wherever you like. The next time you run an "apply", Terraform looks at the earlier state file, figures out what is changing, and changes exactly that. Terraform treats the state file as an authoritative record of what has been done. This means that if users make manual changes within whatever system you're automating, those changes can be reset back to what you have coded every time.
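To make that workflow concrete, here is a minimal Python sketch of the idea as described above. It is a toy model, not Terraform's actual implementation: the `plan` function, the resource names, and the flat dictionaries standing in for a state file are all made up for illustration.

```python
def plan(recorded_state: dict, desired: dict) -> dict:
    """Compare the last recorded state with the desired config (toy model)."""
    actions = {}
    for name, attrs in desired.items():
        if name not in recorded_state:
            actions[name] = "create"   # in the code, but never applied before
        elif recorded_state[name] != attrs:
            actions[name] = "update"   # the code differs from the record
    for name in recorded_state:
        if name not in desired:
            actions[name] = "delete"   # recorded earlier, removed from the code
    return actions


# Hypothetical resources, reduced to plain dictionaries.
recorded_state = {
    "vm-web": {"size": "small"},
    "vm-old": {"size": "small"},
}
desired = {
    "vm-web": {"size": "large"},
    "vm-db": {"size": "medium"},
}

print(plan(recorded_state, desired))
```

Note that nothing in this sketch looks at the live system. The record of the last run is the only thing the desired configuration is compared against, which is the assumption the rest of this article questions.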


Now for an analogous scenario. When you take your car in for maintenance, the technicians in the shop check your tire pressures and battery, do an oil change, check your filters, and God knows what else. They diagnose what they need to do, they fix all the things, and you leave with a newly maintained car. They also hand you a form that shows what they did, the tire pressures when they finished, and so on. This is great!

However, by the time you get home, at least one of those values is likely different. Maybe you got a flat tire. Maybe the guy didn't tighten something correctly and there's a new leak. Heck, you could have a complete engine failure before you leave the lot.

Neither your car nor the technicians in the shop give a damn about what is written on that document once you leave the shop. Furthermore, when the car comes back, they don't even look at the earlier form. They know what they are doing, and they know how to find the proper values from their own experience and the actual manuals. Your earlier record is useless to them.




For IT organizations, this means that treating the state file as sacrosanct ignores reality. It is a fundamental mistake made by DevOps engineers around the world.

In all but the smallest IT organizations, there are many teams with their fingers in the infrastructure. Security teams must push policies. Support teams must be able to fix stuff as it breaks. And so on. To get their work done, those teams cannot go to every infrastructure team and beg for changes to that team's Terraform code.

No team works within a bubble. Not the development teams, not the security teams, not the support teams, not any other teams. You cannot expect that your hands are the only ones touching "your stuff". Believing so is delusional.

Just like the auto techs, when your systems are out of whack, you should not default to setting them back to the way they were the last time you touched them. Your team should have the skills and knowledge to diagnose problems and set things to what is proper RIGHT NOW, without assuming that everything was correct before. Blindly setting them back is akin to rebooting your computer every time there's a problem: it might fix the problem temporarily, but the root cause is not corrected.
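One way to picture "diagnose first, then decide" is to compare three views of a setting instead of two. The Python sketch below is only an illustration under stated assumptions: `classify_drift` is a made-up helper, not part of any tool, and the settings and values are hypothetical. `desired` stands for what your code says, `recorded` for what the last run wrote down, and `actual` for what the live system reports right now.

```python
def classify_drift(desired: dict, recorded: dict, actual: dict) -> dict:
    """Label each setting by comparing code, last record, and live value."""
    report = {}
    for key in sorted(desired.keys() | recorded.keys() | actual.keys()):
        want, was, now = desired.get(key), recorded.get(key), actual.get(key)
        if now == want:
            report[key] = "in sync"
        elif now == was:
            # Nobody else touched it; only your code has moved on.
            report[key] = "pending change"
        else:
            # Someone changed it since your last run: find out why first.
            report[key] = "changed outside your code"
    return report


# Hypothetical example: a security team raised the minimum TLS version.
desired = {"tls_min_version": "1.0", "instance_count": 3, "log_retention_days": 30}
recorded = {"tls_min_version": "1.0", "instance_count": 2, "log_retention_days": 30}
actual = {"tls_min_version": "1.2", "instance_count": 2, "log_retention_days": 30}

for setting, status in classify_drift(desired, recorded, actual).items():
    print(f"{setting}: {status}")
```

In this made-up case, resetting `tls_min_version` to what the code says would undo the security team's work. The third label is the prompt to go find out what is proper right now, and possibly to update the code rather than the system.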


What does this mean for Terraform? Well, for me, it means that when you start using Terraform, you must accept that there are going to be state file problems like these down the road. These problems are unavoidable, and they have happened on every team I've ever been a part of. If you don't like it, pick another tool. If you get to a point where you are spending more time fixing Terraform-related issues than doing work that brings value to your users, start looking into a different solution. Pulumi, "just bash or PowerShell scripts", etc., are all possibilities. And don't automatically rule out "manual with a UI". It has worked for decades. Be open-minded and make your life better.

Jayaprakash Nimmala

Cloud Infrastructure Architect - Infrastructure Automation and Cloud Engineering

2 years ago

Cannot agree more. Terraform with a state file has its issues. People spend so much time trying to make the state file work.
