What platform engineering managers can do now
Taking stock of the platforms we’ve built over the past decade can be a sobering experience. Our cloud expenditure tends to be uncomfortably high and the level of complexity excessive. Finding ways to simplify our stack may be the greatest contribution we can make to the organisations we serve. Adding layer upon layer of increasingly sophisticated abstractions worked up to a point in a rising economy buoyed by low interest rates. It no longer does. Nobody has made this point more succinctly than Cindy Sridharan.
How far can we raise the waterline? How much proprietary and specialised tooling is absolutely necessary? What is clear is that we are spending too much time refining layers of the stack that either shouldn’t exist or should be somebody else’s problem. We are also far more tolerant of customisation than we should be. Not only does cost decline sharply as we reduce the complexity of our stack; we will find it more manageable, performant and resilient too.
As we approach engineering decisions, we need to weigh the cognitive burden of running a given tool against the utility of the abstractions it affords us. It may seem useful to us, even to teams adopting our platform. But does it make a difference to the end user? Sridharan is absolutely right to stress that we will be asked to justify what we build, and how much we spend running and maintaining it. We need to ask those questions before others do.
Farewell to Kubernetes?
Sridharan’s expression “every tech trend of the 2010’s” is squarely directed at Kubernetes and the Cloud Native Computing Foundation’s much-derided landscape of tools and services. It comes as no surprise that Sridharan approves of Greg Ferro’s withering account of what Kubernetes has achieved to date.
It is hardly controversial to argue that we shouldn’t choose Kubernetes lightly. From Heroku to a long line of imitators, there is no shortage of alternatives. However, it also seems to me that the prospect of cloud computing without Kubernetes is much less daunting in the warmth of the Bay Area sun than it is sitting on a pebble beach by the North Sea.
For most teams, the alternative to Kubernetes isn’t a slightly less onerous workload scheduler. It is either wholly proprietary tooling offered by one of the major cloud vendors (hello Elastic Container Service) or an open source competitor (Nomad for example) that cannot match the licensing guarantees and stewardship provided by the Cloud Native Computing Foundation.
All major cloud vendors and most niche players have taken a significant stake in the success of Kubernetes. Most of them offer rock-solid, fully managed Kubernetes services. We can attribute this to brand power, timing, features or luck, but there is no getting around the fact that Kubernetes has enjoyed a phenomenal run, “limitations and flaws” notwithstanding.
For now, then, let’s start by peeling off layers of Kubernetes complexity, one at a time. We will hardly go to production with a vanilla cluster, but there’s value in asking how close we can afford to get.
Kubernetes the lazy way
Kubernetes scales well for teams that add deployments over time, but it punishes teams that ratchet up the complexity of each individual deployment. It reserves the greatest pain for bespoke solutions that require diligence and care during cluster upgrades.
Let that be our starting point. Consider the following three suggestions.
Use network policies until you need a service mesh
I want to achieve zero trust networking within our clusters as much as the next person, and I suspect I am more excited than I should be about the prospect of pushing this work to the kernel using eBPF. But before we take the leap, we need to ask ourselves if mutual TLS is absolutely necessary for our clusters. In some industries, the answer is likely Yes, in which case you have every reason to go ahead. In many instances, network policies and robust role-based access control will be all that’s required to maintain the organisation’s prized ISO 27001 certification. The key here is to work out what you absolutely have to do—for compliance, for security, for the reputation of your organisation—and stop right there.
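To make that baseline concrete, here is a minimal sketch: a default-deny ingress policy for a namespace, followed by the one flow we can actually justify. The namespace, labels and port are hypothetical, and your CNI plugin must support network policies for any of this to be enforced.

```yaml
# Deny all ingress to every pod in the namespace by default.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: payments              # hypothetical namespace
spec:
  podSelector: {}                  # empty selector = all pods
  policyTypes:
    - Ingress
---
# Then allow only the traffic you can justify: frontend pods
# reaching the payments API on its service port.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: payments-api            # hypothetical label
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend        # hypothetical label
      ports:
        - protocol: TCP
          port: 8080               # hypothetical port
```

No sidecars and no certificates to rotate; the cluster’s network layer enforces the rules, and they survive upgrades without ceremony.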
Use object storage until you need a persistent volume
I’ve argued elsewhere that Kubernetes doesn’t have a state problem. It has a persistent volume problem. To solve the resilience and maintenance headache that is stateful Kubernetes, we need to make object storage the default storage mechanism for Kubernetes.
Applications increasingly give us a tiered storage option. Apache Flink, for example, will write checkpoints and savepoints to persistent volumes or buckets. If you set it up with persistent volumes, each release carries the risk of a prolonged outage should the deployment fail in the target environment. All that goes away when you switch to object storage: it’s much cheaper, more convenient and above all a mechanism we have learned to trust, which can’t be said of persistent volumes.
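A sketch of what the switch looks like for Flink, assuming one of its S3 filesystem plugins is on the classpath; the bucket name is made up, and key names vary a little between Flink versions:

```yaml
# flink-conf.yaml — checkpoints and savepoints go to a bucket,
# not a persistent volume. Requires an S3 filesystem plugin
# (e.g. flink-s3-fs-presto) in the Flink image.
state.checkpoints.dir: s3://acme-flink-state/checkpoints   # hypothetical bucket
state.savepoints.dir: s3://acme-flink-state/savepoints
execution.checkpointing.interval: 60000                    # checkpoint every 60s
```

A failed deployment now recovers from the bucket rather than fighting over a volume attachment in the target environment.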
Even if tiered storage won’t do, perhaps because retrieval times aren’t what they need to be for your application, all cloud vendors have a selection of reasonably priced databases to choose from. Please do yourself a favour: do not use persistent volumes unless you absolutely have to.
Use controllers until you need operators
I don’t value operators (Kubernetes extensions consisting of custom resource definitions and matching controllers) as highly as I used to. I will admit I was swept up in the wave of enthusiasm that took the concept from Brandon Philips’s 2016 blog post to something every major open source project with a Helm chart had to offer. The first indication that something was off, for me, was having to use a Prometheus service monitor to fetch metrics when previously a simple label had done the trick.
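For the record, this is roughly what the older convention looked like: an annotation on the pod, picked up by Prometheus’s own Kubernetes service discovery and a relabelling rule. The label and port here are hypothetical.

```yaml
# Pod template metadata under the pre-operator convention:
# Prometheus discovers the pod via kubernetes_sd_configs and a
# relabelling rule keyed on the prometheus.io/scrape annotation.
metadata:
  labels:
    app: payments-api              # hypothetical label
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9102"     # hypothetical metrics port
    prometheus.io/path: "/metrics"
```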
The Prometheus operator home page devotes almost 900 words to upgrade instructions specifically for these custom resource definitions. The documentation leaves no doubt in the reader’s mind that the Helm upgrade process is far from automated and is error-prone; for many organisations, it relies on an individual making changes with cluster administrator permissions.
Where did operators go wrong? Take a large cluster today, and the number of custom resource definitions is likely to be in triple digits. That by itself would not be an issue, but each custom resource definition poses a non-trivial challenge to platform teams contemplating backup/restore procedures or version upgrades at the cluster level.
We have to ask if these abstractions pay their way, and for many operators it’s hard to claim that they do. ServiceMonitors. PrometheusRules. AlertmanagerConfigs. These do not in all honesty require the administrative burden and overhead of custom resource definitions. What configuration is needed could be templated in YAML, loaded via ConfigMap and validated on pod start. These operators manage Kubernetes objects, and any controller can do so just as well as an operator.
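As a sketch of that alternative: alerting rules as plain Prometheus configuration in a ConfigMap, mounted into the Prometheus pod and referenced under rule_files in prometheus.yml. Names and thresholds are illustrative.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-alert-rules     # hypothetical name
  namespace: monitoring
data:
  alert-rules.yaml: |
    groups:
      - name: availability
        rules:
          - alert: HighErrorRate
            expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
            for: 10m
            labels:
              severity: page
            annotations:
              summary: 5xx error rate above 5% for ten minutes
```

A rolling restart or a configuration reload picks up changes; no custom resource definitions, and no cluster-administrator ceremony at upgrade time.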
Load balancing for platform teams
Clusters that play host to hundreds of service mesh proxies, persistent volumes and custom resource definitions will make our lives a lot harder when the time comes to upgrade or fix them.
We come back to the need to reduce cognitive load for engineering teams; Matthew Skelton and Manuel Pais have made that point best. In the recent past it was easy to lose sight of the cost of operating at the limits of what a team can handle. That cost is real for on-call team members paged in the middle of the night, trying to make sense of it all. It also siphons off time that could be spent building platform capabilities: making it easy to do the right thing; helping new platform users onboard with ease; facilitating change and experimentation; and observing the Perl rule that easy tasks should be easy, and hard ones possible.
The more evolved and complex our platform, the harder it can be to achieve these goals.