Kubernetes Operators - Challenges and the need for Hygiene guidelines
Recently, many people have started talking about the 'Kubernetes operator' pattern and its advantages in codifying human operator tasks. Some of the advantages cited for Kubernetes operators include:
- Control for developers over life cycle management operations (such as upgrades, backup/restore in the case of databases, and application-level monitoring).
- Control for developers over installing dependent packages (other K8s resources and applications).
- Control for developers over installing and configuring software packages on the host Linux (such as installing software, configuring and tuning the host, and exposing devices to K8s).
- Control for developers to execute tests, fault injections, etc.
- Choosing the leader for distributed services in the cluster, since the operator is the central entity in the cluster for that application.
These are all good goals. But based on the few operators I have looked at on OperatorHub, I get the impression that they impose too many restrictions, which would cause challenges in some deployments, especially for edge-computing and networking workloads (CNFs).
Note that many operator CRs that install applications create K8s resources such as deployments, services, and network policies. We see that these CRs do not take the full K8s resource information. They tend to take partial information, and the operator creates the K8s resources internally from it. Essentially, many operators seem to be creating abstraction CRs and not exposing the standard K8s resources to deployment administrators.
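To make the abstraction problem concrete, here is a minimal sketch of how such an operator might turn a CR into a Deployment. The CR schema (`name`, `image`, `replicas`, `cpu`, `memory`), the function `render_deployment`, and the image name are all hypothetical and not taken from any real operator; only the Deployment field names are standard K8s.

```python
def render_deployment(cr_spec):
    """Build a Deployment manifest from a hypothetical, simplified CR spec.

    Only the fields the operator author anticipated (name, image, replicas,
    cpu, memory) are read. Any other key in cr_spec is silently ignored.
    """
    name = cr_spec["name"]
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": cr_spec.get("replicas", 1),
            "selector": {"matchLabels": {"app": name}},
            "template": {
                "metadata": {"labels": {"app": name}},
                "spec": {
                    "containers": [{
                        "name": name,
                        "image": cr_spec["image"],
                        "resources": {"limits": {
                            "cpu": cr_spec.get("cpu", "500m"),
                            "memory": cr_spec.get("memory", "512Mi"),
                        }},
                    }],
                },
            },
        },
    }


# The admin would like a Kata runtime, but the (hypothetical) CR schema has
# no such field, so the request never reaches the generated pod spec.
cr = {"name": "mydb", "image": "example.com/mydb:1.0", "replicas": 3,
      "runtimeClassName": "kata"}
deployment = render_deployment(cr)
pod_spec = deployment["spec"]["template"]["spec"]
print("runtimeClassName" in pod_spec)  # prints False
```

Anything the CR author did not anticipate simply has no path into the final resource, which is the restriction the cases below illustrate.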
A few observed cases:
- Limited resource management: Many operators accept only basic resource information for the application Pods they deploy. They are limited to CPU and memory, plus disk in the case of databases. But other resources may be needed, such as dedicated L3 cache ways or dedicated DDR bandwidth. Many operators make this impossible.
- Lack of support for some critical features such as "node affinity": Pinning some Pods to particular nodes is needed for various reasons (such as security isolation, guaranteed performance, or the availability of large storage systems, GPUs, SmartNICs, etc.). Normally, deployment admins can tune the 'Deployment' or 'Pod' specs (via Helm profiles or Kustomize) to address affinity requirements. Many operator CRs offer no equivalent.
- Lack of support for using resources such as crypto and other accelerators: Some workloads that use common libraries can take advantage of accelerators for higher performance. The CRs provide no facility to assign accelerator SR-IOV VFs to the containers.
- Lack of support for different runtimes: To increase security, deployment admins at times like to run containers in thin-VM runtimes such as Kata. Many operators provide no facility to run the containers that deployment admins consider critical in Kata or in confidential-computing runtimes.
- Lack of support for K8s security features: The K8s ecosystem includes network policies for zero-trust security and Istio for traffic management and authorization policies. Many operators do not add network policy or Istio resources when they deploy applications.
- Lack of flexibility to change the service type: Some operators give no choice of service type (for example, ClusterIP or LoadBalancer). Some assume that services are always accessed locally within a cluster. In distributed computing, it is at times necessary to make local services accessible from other clusters.
- Lack of rolling upgrades using Istio virtual services: Istio provides a better way of rolling out application upgrades by ensuring that the new version of the software works before the old version is dropped. It does this by sending only selected traffic to the new version until there is enough confidence in it. Many operators do not use this kind of facility.
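Several of the cases above come down to a handful of standard pod spec fields that an admin can set directly when the raw resources are available. The sketch below is illustrative only: `customize_pod_spec` is a hypothetical helper, and the node label `example.com/accelerator`, the RuntimeClass name `kata`, and the extended resource name `example.com/sriov_vf` are assumptions that depend on how a given cluster and its device plugins are set up. The field names themselves (`affinity.nodeAffinity`, `runtimeClassName`, `resources.limits`) are standard K8s.

```python
import copy


def customize_pod_spec(pod_spec, node_label_key, node_label_value,
                       runtime_class, extended_limits):
    """Return a copy of pod_spec with placement, runtime and device settings."""
    spec = copy.deepcopy(pod_spec)
    # Node affinity: schedule only onto nodes carrying the given label.
    spec["affinity"] = {"nodeAffinity": {
        "requiredDuringSchedulingIgnoredDuringExecution": {
            "nodeSelectorTerms": [{
                "matchExpressions": [{
                    "key": node_label_key,
                    "operator": "In",
                    "values": [node_label_value],
                }],
            }],
        },
    }}
    # Runtime selection: must match a RuntimeClass object in the cluster.
    spec["runtimeClassName"] = runtime_class
    # Extended resources (e.g. accelerator VFs exposed by a device plugin).
    for container in spec["containers"]:
        limits = container.setdefault("resources", {}).setdefault("limits", {})
        limits.update(extended_limits)
    return spec


base = {"containers": [{"name": "cnf", "image": "example.com/cnf:1.0"}]}
tuned = customize_pod_spec(base, "example.com/accelerator", "qat",
                           "kata", {"example.com/sriov_vf": "1"})
```

With Helm or Kustomize this kind of change is a small patch on the Deployment. With an abstraction CR it is only possible if the operator author chose to expose each of these fields.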
When multi-cluster orchestrators (such as OpenNESS-EMCO) are used for distributed computing with intelligent placement and automation of secondary resources, these orchestrators expect to be able to customize resources at a fine granularity. The inflexibility of operators at times makes this a challenge for them. Note that edge computing requires higher security and better performance isolation, and hence deployment administrators like to leverage the security features of K8s and the K8s ecosystem.
Moreover, edges are resource constrained. One would rather not spend hardware resources on life cycle management there and would prefer LCM to be done from outside by another system. Operators do consume hardware resources. They also need to run permanently so that they can handle applications brought up on demand, and hence they occupy those hardware resources forever.
In complex systems, multiple applications are brought up as a single service. For example, in the networking world, multiple CNFs are included when defining a network service. As part of network service deployment, all these CNFs are brought up by multi-cluster orchestrators (such as OpenNESS-EMCO), sometimes in one cluster and often in different clusters. It is the orchestrator's responsibility to enable connectivity and security among the micro-services of different applications across clusters. Because operators abstract the standard K8s resources, they do not expose the status of those resources in an easy-to-consume fashion. The orchestrator is then expected to build an additional knowledge base that maps each operator custom resource to the standard K8s resources behind it, which negates the operators' advantage.
Recommendations for application developers
My belief is that applications should not do deployment configuration management via operators. Application developers should not provide only operators for deploying their applications. They should also continue to provide standard K8s YAML resources. In my view, developers will not be able to imagine every deployment scenario. Hence, giving deployment admins options is important.
If application developers decide to provide operators only, it is highly recommended that they do not abstract the K8s resources. Instead, bundle the resources and provide LCM on top of them, the way KUDO (https://kudo.dev/) is trying to do.
Providing operators for the following is good, though:
- Installing and configuring software packages on the host Linux. For example, device plugins (GPUs, SmartNICs, crypto accelerators, etc.) require the installation of drivers and firmware. Simplifying that installation and configuration via operators is good.
- Providing additional actions. For example, application-specific operators for databases that perform operations such as 'backup' and 'restore' are a good use of operators.
- Configuring applications. For example, creating a topic in Kafka or configuring NFs is well suited to operators.
Summary
The operator pattern is becoming popular. But I feel operators are not suitable for all kinds of applications. As indicated above, operators are good for taking application-specific actions and for configuring applications using CRs. Operators are good for installing infrastructure components. Operators are also good for configuration management of complex applications (those with many dependencies). But when they do that, it is not good to abstract the K8s resources, as this takes away the capabilities those resources offer. Finally, I believe that application developers may not know all the deployment scenarios, and hence abstraction is not always good. I hope that application developers continue to provide K8s YAML files with Helm templating as one of their offerings.
Comment (3 years ago):
> They tend to take partial information and the operators creates K8s resources internally with this partial information. Essentially, many operators seem to be creating abstraction CRs and not exposing standard K8s resources for deployment administrators.
An operator can be implemented so that its CR API is designed to accept "overlays". The Istio operator is an example: https://github.com/istio/istio/tree/master/operator#advanced-k8s-resource-overlays. This gives the flexibility to "overlay" any internally created K8s resource with specific configuration, for example taints, tolerations, or anything else. This is very powerful. It is like giving the operator user a backdoor key.
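The overlay idea mentioned in the comment can be sketched as a recursive merge applied to whatever resource the operator generated internally. This is a simplified illustration under my own assumptions, not the Istio operator's actual overlay syntax or implementation; `apply_overlay` and the sample manifests are hypothetical.

```python
import copy


def apply_overlay(resource, overlay):
    """Recursively merge overlay into a copy of resource.

    Dicts are merged key by key. Any other value, including a list,
    replaces the value generated by the operator.
    """
    merged = copy.deepcopy(resource)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_overlay(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Resource as the operator would have generated it on its own.
generated = {
    "kind": "Deployment",
    "spec": {"template": {"spec": {
        "containers": [{"name": "app", "image": "example.com/app:1.0"}],
    }}},
}

# Admin-supplied overlay carried in the CR: add a toleration.
overlay = {"spec": {"template": {"spec": {
    "tolerations": [{"key": "dedicated", "operator": "Exists",
                     "effect": "NoSchedule"}],
}}}}

final = apply_overlay(generated, overlay)
```

The operator keeps its simple CR for the common case, while the overlay lets the admin reach any field of the underlying resource without waiting for the CR schema to grow.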