Using k8s to satisfy all non-functional requirements
Victor Karabedyants
MSDP in Software Engineering, CTO, MBA, Cloud Manager at Sitecore | AI Engineer | Azure Solutions Architect | Azure Administrator | Azure Security Engineer | Azure Developer | Azure Data Engineer and DevOps | CKA
I have 20 years of experience building IT infrastructure, both on-premises and in the cloud, with a large number of projects under my belt.
This article will be helpful for those who want to develop high-quality software using modern and efficient architectural solutions. Every automated system should have a number of properties that guarantee its stable operation. The list of architectural requirements could be endless, but here I want to highlight and consider the most important ones. In this article I focus on five key requirements: Availability, Maintainability, Performance, Scalability and Security.
Not only my personal experience but also the example of large public cloud providers such as Google, Microsoft and Amazon shows that the best modern solution is Kubernetes (k8s). The story of k8s began back in 2014, when Google released it as open source. Every year more and more people have come to appreciate its implementation of container orchestration. I am among these people, and I can confidently say that Kubernetes is a game-changer in the field of modern web services.
Now I want to say a few words about the advantages I am particularly pleased with. The first (and one of the most important for me) is that k8s can be adopted in projects of any size. Kubernetes works great both with small products and with sites serving multi-million user audiences (Booking, Adidas, BlaBlaCar and the Wikimedia Foundation). The second is the ability to easily manage resources: k8s helps to abstract away the lower layers of the infrastructure. Last but not least, containerization lets you make any project reliable, with good failover protection, scaling and flexible configuration.
Now let’s take a look at how you can satisfy non-functional requirements using Kubernetes.
Availability – how to keep software running after an error occurs
What is the most important feature of quality software? The answer is very simple: the software performs correctly whenever users need it. Almost any Service Level Agreement (SLA) provides some information about the uptime of the software. It depends on the specific requirements of the system and is usually expressed as a percentage (the famous “nines”).
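To get a feeling for what these percentages mean in practice, here is a minimal sketch that converts an availability percentage into the downtime budget it allows per year:

```python
# Rough downtime budget implied by an availability percentage ("the nines").
HOURS_PER_YEAR = 365 * 24  # 8760

def downtime_hours_per_year(availability_pct: float) -> float:
    """Hours of allowed downtime per year for a given availability %."""
    return (100.0 - availability_pct) / 100.0 * HOURS_PER_YEAR

for pct in (99.0, 99.9, 99.99, 99.978):
    print(f"{pct}% -> {downtime_hours_per_year(pct):.2f} hours of downtime/year")
```

Note how each extra “nine” shrinks the budget by an order of magnitude; 99.978% works out to just under two hours per year.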
Here I want to recall Google again: in 2013, Gmail was available 99.978% of the time (about 2 hours of complete unavailability per year). Achieving availability above 99% may be harder than it looks at first sight. If a customer requests more than “three nines”, keep in mind that such a solution will be quite expensive.
The first thing I would like to say, as a person who has dealt a lot with sudden failures of physical disks, networks and other hardware, is that Kubernetes really helps in such cases. We simply placed the service on 2-5 different Nodes (physical servers), and even if one of them failed, all the others continued to work (k8s makes sure to restart the application using the available resources).
In general, there are several mechanisms in k8s that help an application run as several replicated Pods (e.g. ReplicaSet), but I want to highlight the Deployment.
A Deployment provides the following features for maximum availability:
- The ability to create and recreate many Pods. A Deployment can also quickly update the version of your container/application (e.g. via a rolling or canary deployment) without downtime. Follow this link https://kubernetes.io/docs/concepts/workloads/controllers/deployment/ to find more details.
- K8s automatically monitors availability using probes such as:
- Liveness – an application can break after running for a long time; in this case, the only way to restore it is to restart it. Liveness probes are designed specifically to detect this kind of failure so the container can be restarted;
- TCP Liveness – this type of liveness probe uses a TCP socket: k8s opens a connection to the container on a specific port. The container is considered healthy if the connection is established; otherwise it is considered unhealthy and will be restarted;
- Readiness – when an application is temporarily unable to cope with traffic (for example, because it depends on external factors) but is otherwise working properly, there is no point in killing it, just as there is no point in sending client requests to it. The k8s Readiness probe identifies and neutralizes such situations. The same is true for scaling: if you add several replicas, traffic must not be sent to a new replica until it is completely ready. Without a Readiness probe, Kubernetes will not wait until the application is fully launched and will send traffic to a new, unavailable replica. Only after the probe succeeds is the application considered healthy and ready to receive traffic. The configuration is the same for Readiness and Liveness probes, and you can choose between three types: HTTP, Command (if it is not possible, or you don’t want, to launch an HTTP server) and TCP (if Kubernetes can establish a TCP connection, the container is healthy and ready to work);
- A Deployment monitors and controls everything created for it. This means that if a physical Node has problems, the Deployment recreates your containers on another Node.
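Putting these pieces together, here is a minimal sketch of a Deployment manifest with liveness and readiness probes, written as a Python dict (the equivalent YAML is what you would pass to kubectl apply). The app name, image and port are illustrative assumptions, not taken from a real project:

```python
# A sketch of a Deployment with liveness and readiness probes.
# "myapp", the image tag and port 8080 are hypothetical placeholders.
deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "myapp"},
    "spec": {
        "replicas": 3,  # several Pods so a single failure is survivable
        "selector": {"matchLabels": {"app": "myapp"}},
        "strategy": {"type": "RollingUpdate"},  # update without downtime
        "template": {
            "metadata": {"labels": {"app": "myapp"}},
            "spec": {
                "containers": [{
                    "name": "myapp",
                    "image": "myapp:1.0",
                    "ports": [{"containerPort": 8080}],
                    # restart the container if it stops answering
                    "livenessProbe": {
                        "httpGet": {"path": "/healthz", "port": 8080},
                        "periodSeconds": 10,
                    },
                    # withhold traffic until the app is fully started
                    "readinessProbe": {
                        "tcpSocket": {"port": 8080},
                        "initialDelaySeconds": 5,
                    },
                }],
            },
        },
    },
}
```

With three replicas spread across Nodes, the loss of any single machine leaves the service running while the Deployment recreates the missing Pod elsewhere.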
Even if something breaks in k8s, the app is likely to continue working without problems.
There are plenty of components in Kubernetes, and they are all important. Still, I think one of the most important is etcd, which allows any Node in the cluster to write and read data.
K8s uses etcd as the data store for everything associated with the cluster. This means that absolutely everything is stored in etcd, including information not only about individual Pods but also about the entire cluster. That is why etcd must be able to survive failures without data loss.
It is critical to ensure that etcd works properly, because it is the so-called brain of the entire cluster. Etcd can run with several replicas.
Sometimes unexpected errors or malfunctions occur in the system. To avoid losing critical data, I advise you to do back-ups periodically so that the data can be easily restored after a failure. You should store the back-up data somewhere isolated from where the cluster runs, in order to protect it in case the etcd cluster itself fails. To record the current state of the data in a cluster, use the etcdctl tool through the command-line interface.
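As a sketch of how such a backup might be scripted, the snippet below assembles an etcdctl snapshot command; the endpoint address and output path are illustrative assumptions:

```python
# Assemble an "etcdctl snapshot save" command line (a sketch; the
# endpoint and backup path below are hypothetical placeholders).
def snapshot_cmd(endpoint: str, out_file: str) -> list[str]:
    return [
        "etcdctl",
        "--endpoints", endpoint,
        "snapshot", "save", out_file,
    ]

cmd = snapshot_cmd("https://127.0.0.1:2379", "/backup/etcd-snapshot.db")
# e.g. run it with subprocess.run(cmd) on a schedule, then copy the
# resulting file to storage isolated from the cluster itself
```

In a real setup you would add the TLS certificate flags your cluster requires and ship the snapshot file off-cluster immediately.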
Maintainability
ISO/IEC 25010 distinguishes between the concepts of Maintainability and Modifiability, although they both describe the ease of applying changes to a product or a system.
- Maintainability – the degree of efficiency and effectiveness with which a product or a system can be changed;
- Modifiability – the degree to which a product or a system can be effectively and efficiently modified without introducing defects or degrading the quality of the existing product.
Instead of creating products or solutions from scratch, we often deal with adding new features, re-planning and updating existing products. We constantly add new functionality or remove old functionality. Changes are also made when the product needs to be moved to a new platform, or when we need to adopt new protocols or standards.
Such changes may be either too expensive or too risky.
Kubernetes addresses this with a declarative description of the state an application should strive for.
We can easily add 2-5 containers (using the Deployment we talked about above) without stopping the application.
At the core of k8s management is the use of the IaC (Infrastructure as Code) approach, which offers the following benefits:
- Increased speed and efficiency – you can use continuous integration and continuous deployment to reduce human error and increase speed;
- Enhanced security – security standards can be deployed easily. There is no need to review and approve each change, because all network services are code-based and are deployed the same way every time;
- Economy – IaC offers the potential for low-cost disaster recovery. Since the production environment comes down to code, the same code can be reused without paying for standby backup environments;
- Improved customer service through standardization – with IaC the number of errors and the total downtime are reduced, which means better customer service.
K8s allows you to implement various approaches to update and change both the application (without stopping it) and the cluster itself.
Performance
Performance has always been a driving force behind system architecture in software development. Even when it is assumed that there are no special performance requirements, they still exist. On a practical level, the best way to check whether an app meets its performance requirements is to conduct performance testing.
Let’s take a look at some key performance metrics:
- Bandwidth – the maximum communication channel capacity in a digital system;
- Response time – the total time from the moment the user makes a request to the moment they receive a response;
- Delay – the time between the reception of a stimulus and the system’s response to it;
- Maximum processing speed – the amount of “something” per unit of time, for example transactions per second or bits per minute;
- The number of simultaneous users.
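These metrics are not independent: by Little’s law, the average number of requests in flight equals throughput multiplied by the average response time. A small sketch (the numbers are illustrative assumptions):

```python
# Little's law: concurrency = throughput * average response time.
def concurrent_requests(throughput_rps: float, avg_response_s: float) -> float:
    """Average number of requests in flight at any moment."""
    return throughput_rps * avg_response_s

# e.g. 200 requests/s with a 0.25 s average response time
print(concurrent_requests(200, 0.25))  # -> 50.0
```

This is handy for a sanity check: if a load test shows far more in-flight requests than this formula predicts, requests are queueing somewhere.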
When we launch an application, first of all we need to understand how it behaves during deployment. We used Kubernetes to test cluster performance by examining containers, Pods and services. This information helps us evaluate the performance of the application and identify the individual weak spots that still need work.
I want to note that it is hard to directly improve the performance of a site or application (if it is not good to begin with) using Kubernetes. However, we can do everything possible so that our application runs comfortably on the given capacity. An excellent solution for such a task is the Horizontal Pod Autoscaler and the Vertical Pod Autoscaler (https://cloud.google.com/kubernetes-engine/docs/concepts/verticalpodautoscaler). The Horizontal Pod Autoscaler works as follows:
The Horizontal Pod Autoscaler is a controller that periodically checks the load on our Pods. It requests resource-usage metrics for each period, either from the resource metrics API or from the custom metrics API. The controller calculates usage as a percentage of the equivalent resource request for the containers in each Pod, takes the average usage value, and produces a ratio used to scale the number of required replicas.
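The ratio calculation above can be sketched as follows (simplified: the real controller also applies a tolerance and stabilization windows before acting):

```python
from math import ceil

# Core HPA scaling rule:
# desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric)
def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float) -> int:
    """Number of replicas the HPA would aim for."""
    return ceil(current_replicas * current_metric / target_metric)

# 3 replicas at 90% average CPU with a 60% target -> scale up to 5
print(desired_replicas(3, 90, 60))  # -> 5
```

The same rule scales down: 4 replicas at 30% average CPU against a 60% target would be reduced to 2.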
You can read more about this here https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/
Alternatively, you can use the Vertical Pod Autoscaler, which increases the capacity of a Pod depending on the load. Globally speaking, with the Vertical Pod Autoscaler you don’t have to think about which CPU and memory values to specify for a Pod: the autoscaler can recommend values for CPU and memory requests and limits, or it can automatically update your Pods by raising their resources.
Scalability
The ability to scale a specific project in the future is the most common thing my clients are interested in before starting development. From a business point of view this makes sense, because every investor would like to grow the number of users of the application or platform. Scalability is usually associated with the cloud and microservices.
The term “scalability” has several meanings:
- It is the ability of a system, a network or a process to cope with the growing volume of work or its potential to meet this growth;
- It is the ability of the system to expand in order to meet the needs of the business;
- It is the ability of the system to handle the increased load without affecting system performance (or the ability to easily expand).
As you know, there are two main types of scalability: vertical (adding more power to an existing machine) and horizontal (the increase in power comes from adding new machines).
In my practice, there hasn’t been a single customer yet who would not be interested in the scalability of a project. As my experience shows, very few people really understand what kind of scalability they need, so it is very important to distinguish the real needs of the business from the unrealistic customer’s desires.
If you want to scale your project using k8s, all you need is the Horizontal Pod Autoscaler or the Vertical Pod Autoscaler, which we discussed in the previous part. They are used when growing demands on an application must be met by changing the number of Pods, or the resources of the Pods, where the workloads run. When there is a performance problem, the number of Pods automatically increases in accordance with the application’s requirements; conversely, when the load drops, the number of Pods decreases accordingly.
The question is what to do when there aren’t enough physical resources and Nodes, so k8s cannot find a place for new Pods.
Cloud providers came up with a simple solution: just increase the number of Nodes so that the cluster can quickly schedule new Pods. Typical implementations of this approach are cloud hosting services such as Amazon EKS and Azure Kubernetes Service with the Cluster Autoscaler. The Cluster Autoscaler is a tool that automatically adjusts the Kubernetes cluster size when one of the following conditions is true:
1. There are Pods that could not be scheduled in the cluster due to a lack of resources.
2. The cluster has Nodes that have been underutilized for a long period of time, and their Pods can be placed on other existing Nodes. In this case, the cluster simply helps you save money by removing the unnecessary Nodes.
Security
Confidentiality, integrity and availability – this is the classic data security model. Typically, a security expert is involved while a system is being developed or its security is being tested. Even in small applications, security problems can lead to significant financial losses.
I use a technique to model data security problems. It consists of four key steps: identifying threats, documenting them, assessing them, and identifying countermeasures.
Fortunately, there are quite a lot of software products which may be used to ensure the security of Kubernetes. Each of them has its own scope, goals and type of license.
It is easy to get lost choosing a security product because of the sheer number of options, so I would like to recommend several that I have worked with.
A good option for scanning Kubernetes images is Anchore.
Anchore integrates with k8s admission controllers, which makes it possible to deploy only those images that match user-defined policies. It can be deployed both standalone and as a service running inside a Kubernetes environment. In addition to scanning container images, Anchore performs many additional checks, covering software licenses, Dockerfiles, data leakage and so on. There are two types of licenses: a free license (Apache) and a commercial license with additional features.
Sysdig open source is suitable for checking the security of running containers.
Sysdig supports runtime analysis of k8s containers. There are several ways to inspect a cluster: you can start the interactive interface using the kubectl dig plugin, or capture the system state with kubectl capture. Sysdig provides powerful open-source tools. Companies such as Olx, Pixar, Booking.com, Quby and others use its services, and the total number of users worldwide is currently 274,428.
Network security in the Kubernetes cluster is also important, and we can use Istio to provide it.
Istio is an open platform for connecting, managing and protecting microservices. Although it is a relatively new product, it has already established itself as an excellent solution in terms of both fault tolerance and security. Istio is completely separate from the services themselves and works by intercepting network communication. Its network security features include transparent TLS encryption, which upgrades service-to-service communication to mutual TLS. It also has its own RBAC system, used to allow or deny data exchange between workloads in a cluster.
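As a sketch of how that mutual-TLS enforcement looks in practice, here is an Istio PeerAuthentication resource written as a Python dict (the equivalent YAML would be applied with kubectl); the namespace name is an illustrative assumption:

```python
# A sketch of an Istio PeerAuthentication resource that enforces
# mutual TLS for all workloads in one namespace ("production" is a
# hypothetical placeholder).
peer_authentication = {
    "apiVersion": "security.istio.io/v1beta1",
    "kind": "PeerAuthentication",
    "metadata": {"name": "default", "namespace": "production"},
    "spec": {"mtls": {"mode": "STRICT"}},  # reject plaintext traffic
}
```

With STRICT mode, any workload in the namespace that tries to talk plaintext to a mesh service is refused, so encryption cannot be silently skipped.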
Separately, I would like to highlight kube-hunter, which is used to conduct Kubernetes security audits.
Kube-hunter helps to identify remote code execution, data disclosure and other potential vulnerabilities in clusters. It can be run as a Pod within a cluster or as a remote scanner. As a result, you receive a full report highlighting the configuration problems that make the cluster vulnerable to attackers. The source code for kube-hunter is available on GitHub.
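A remote scan might be scripted as follows; this is a sketch, and the target address is an illustrative assumption:

```python
# Assemble a kube-hunter remote-scan command line (a sketch; the
# target IP below is a hypothetical placeholder).
def kube_hunter_cmd(target: str, report: str = "json") -> list[str]:
    return ["kube-hunter", "--remote", target, "--report", report]

cmd = kube_hunter_cmd("203.0.113.10")
# e.g. run with subprocess.run(cmd) and feed the JSON report into
# your vulnerability-tracking pipeline
```

Running this from outside the cluster shows what an external attacker could discover; running kube-hunter as a Pod inside the cluster shows the in-cluster attack surface instead.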
I would like to thank those who read the article to the end. For those who save time and jump straight to the conclusions, here is a brief summary of everything written above:
1. Using Kubernetes is very convenient because it is reliable, flexible and safe.
2. Use a Deployment to increase the resilience and availability of the application. There is a good chance that even if something breaks, neither you nor your users will notice.
3. If you don’t want to lose important data, periodically back up etcd and store the backups in an isolated place.
4. In k8s, you can update applications and add new containers without stopping them (use a Deployment).
5. Performance can be directly increased only by improving the application itself. If you suffer from a lack of capacity and there is no way to increase it, use the Horizontal Pod Autoscaler or the Vertical Pod Autoscaler.
6. Horizontal Pod Autoscaler helps to avoid problems with scalability.
7. Security is very important! To provide it, use Anchore, Sysdig open source, Istio and kube-hunter. You don’t have to choose only one of them; you can run these services in parallel, but be careful!
We have just covered the basic architectural requirements and their implementation using Kubernetes. The list of tools I described is far from exhaustive; in fact, k8s offers many more possibilities, and in this article I tried to describe my personal experience in solving the tasks that development teams face today. I would be pleased if you shared your experience, ideas and examples. I always give a thumbs-up to an active discussion of new opportunities!