Kubernetes Use Case

What is Kubernetes?

Simply put, Kubernetes, or K8s, is a container orchestration system. In other words, when you use K8s, a container-based application can be deployed, scaled, and managed automatically.

The objective of Kubernetes container orchestration is to abstract away the complexity of managing a fleet of containers that represent packaged applications and include the code and everything needed to run them wherever they're provisioned. By interacting with the K8s REST API, users can describe the desired state of their applications, and K8s does whatever is necessary to make the infrastructure conform. It deploys groups of containers, replicates them, redeploys them if some fail, and so on.
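As a minimal illustration of that declarative model (the Deployment name, image, and replica count below are arbitrary examples), here is how a desired state might be submitted through the official Kubernetes Python client:

```python
from kubernetes import client, config

# Load credentials from ~/.kube/config (use load_incluster_config() inside a pod).
config.load_kube_config()
apps = client.AppsV1Api()

# Describe the desired state: three replicas of a single-container Pod.
deployment = client.V1Deployment(
    metadata=client.V1ObjectMeta(name="web", labels={"app": "web"}),
    spec=client.V1DeploymentSpec(
        replicas=3,
        selector=client.V1LabelSelector(match_labels={"app": "web"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "web"}),
            spec=client.V1PodSpec(
                containers=[client.V1Container(name="web", image="nginx:1.25")]
            ),
        ),
    ),
)

# K8s now schedules the Pods, replaces them if they fail, and so on.
apps.create_namespaced_deployment(namespace="default", body=deployment)
```

Once the object is created, the control plane keeps reconciling the cluster toward the declared three replicas.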

Because it’s open source, a K8s cluster can run almost anywhere, and the major public cloud providers all provide easy ways to consume this technology. Private clouds based on OpenStack can also run Kubernetes, and bare metal servers can be leveraged as worker nodes for it. So if you describe your application with K8s building blocks, you’ll then be able to deploy it within VMs or bare metal servers, on public or private clouds.

What is a Kubernetes cluster?

The K8s architecture is relatively simple. You never interact directly with the nodes hosting your application, but only with the control plane, which presents an API and is in charge of scheduling and replicating groups of containers named Pods. Kubectl is the command line interface that allows you to interact with the API to share the desired application state or gather detailed information on the infrastructure’s current state.

Let’s look at the various pieces.

Nodes

Each node that hosts part of your distributed application does so by leveraging Docker or a similar container technology, such as rkt (Rocket) from CoreOS. The nodes also run two additional pieces of software: kube-proxy, which gives access to your running app, and kubelet, which receives commands from the K8s control plane. Nodes can also run flannel, an etcd-backed network fabric for containers.

Master

The control plane itself runs the API server (kube-apiserver), the scheduler (kube-scheduler), the controller manager (kube-controller-manager), and etcd, a highly available key-value store for shared configuration and service discovery that implements the Raft consensus algorithm.

Now let’s look at some of the terminology you might run into.

Terminology

K8s has its own vocabulary which, once you get used to it, gives you some sense of how things are organized. These terms include the following (a short sketch after the list shows a couple of them in use):

  • Pods: Pods are a group of one or more containers, their shared storage, and options about how to run them. Each pod gets its own IP address.
  • Labels: Labels are key/value pairs that Kubernetes attaches to objects such as Pods, Replication Controllers, and Endpoints.
  • Annotations: Annotations are key/value pairs used to store arbitrary non-queryable metadata.
  • Services: Services are an abstraction, defining a logical set of Pods and a policy by which to access them over the network.
  • Replication Controller: Replication controllers ensure that a specific number of pod replicas are running at any one time.
  • Secrets: Secrets hold sensitive information such as passwords, TLS certificates, OAuth tokens, and ssh keys.
  • ConfigMap: ConfigMaps are mechanisms used to inject containers with configuration data while keeping containers agnostic of Kubernetes itself.
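To make a couple of these concrete, here is a hedged sketch (the names and values are hypothetical) that creates a Service selecting Pods by label, plus a ConfigMap, using the Python client:

```python
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

# A Service: a stable virtual IP that routes to whichever Pods carry app=web.
service = client.V1Service(
    metadata=client.V1ObjectMeta(name="web", labels={"app": "web"}),
    spec=client.V1ServiceSpec(
        selector={"app": "web"},  # labels tie the Service to its Pods
        ports=[client.V1ServicePort(port=80, target_port=8080)],
    ),
)
core.create_namespaced_service(namespace="default", body=service)

# A ConfigMap: configuration injected into containers without baking it into images.
config_map = client.V1ConfigMap(
    metadata=client.V1ObjectMeta(name="web-config"),
    data={"LOG_LEVEL": "info"},
)
core.create_namespaced_config_map(namespace="default", body=config_map)
```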

Why Kubernetes?

So, what is K8s used for? To justify the added complexity that K8s brings, there need to be some benefits. At its core, a cluster manager such as K8s exists to serve developers so they can serve themselves without having to request support from the operations team.

Reliability is one of the major benefits of K8s; Google has over 10 years of experience when it comes to infrastructure operations with Borg, their internal container orchestration solution, and they’ve built K8s based on this experience. K8s can be used to prevent failure from impacting the availability or performance of your application, and that’s a great benefit.

Scalability is handled by Kubernetes on different levels. You can add cluster capacity by adding more worker nodes, which can even be automated in many public clouds with autoscaling based on CPU and memory triggers. The Kubernetes scheduler includes affinity features to spread your workloads evenly across infrastructure resources, maximizing availability. Finally, K8s can autoscale your application using the Pod autoscaler, which can be driven by custom triggers.
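As a rough sketch of the Pod autoscaler (assuming the autoscaling/v1 API and a hypothetical Deployment named web; custom triggers would use the newer autoscaling/v2 metrics instead), a CPU-driven autoscaler can be declared like this:

```python
from kubernetes import client, config

config.load_kube_config()

# Scale the hypothetical "web" Deployment between 2 and 10 replicas,
# targeting 70% average CPU utilization across its Pods.
hpa = client.V1HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="web"),
    spec=client.V1HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V1CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="web"
        ),
        min_replicas=2,
        max_replicas=10,
        target_cpu_utilization_percentage=70,
    ),
)
client.AutoscalingV1Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="default", body=hpa
)
```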

 Tinder’s move to Kubernetes

Why?

Almost two years ago, Tinder decided to move its platform to Kubernetes. Kubernetes afforded us an opportunity to drive Tinder Engineering toward containerization and low-touch operation through immutable deployment. Application build, deployment, and infrastructure would be defined as code.

We were also looking to address challenges of scale and stability. When scaling became critical, we often suffered through several minutes of waiting for new EC2 instances to come online. The idea of containers scheduling and serving traffic within seconds as opposed to minutes was appealing to us.

It wasn’t easy. During our migration in early 2019, we reached critical mass within our Kubernetes cluster and began encountering various challenges due to traffic volume, cluster size, and DNS. We solved interesting challenges to migrate 200 services and run a Kubernetes cluster at scale totaling 1,000 nodes, 15,000 pods, and 48,000 running containers.

How?

Starting in January 2018, we worked our way through various stages of the migration effort. We started by containerizing all of our services and deploying them to a series of Kubernetes-hosted staging environments. In October, we began methodically moving all of our legacy services to Kubernetes. By March the following year, we had finalized our migration, and the Tinder Platform now runs exclusively on Kubernetes.

Building Images for Kubernetes

There are more than 30 source code repositories for the microservices that are running in the Kubernetes cluster. The code in these repositories is written in different languages (e.g., Node.js, Java, Scala, Go) with multiple runtime environments for the same language.

The build system is designed to operate on a fully customizable “build context” for each microservice, which typically consists of a Dockerfile and a series of shell commands. While their contents are fully customizable, these build contexts are all written by following a standardized format. The standardization of the build contexts allows a single build system to handle all microservices.
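Tinder has not published the format itself, so the following is only a hypothetical sketch of the idea: each repository carries a small descriptor (here a build.json listing a Dockerfile, an image tag, and optional shell steps), and a single generic driver builds every microservice the same way.

```python
import json
import subprocess
from pathlib import Path

def build_service(repo_dir: str) -> None:
    """Build one microservice from its standardized build context (hypothetical format)."""
    repo = Path(repo_dir)
    ctx = json.loads((repo / "build.json").read_text())

    # Optional pre-build steps, e.g. code generation or dependency vendoring.
    for step in ctx.get("steps", []):
        subprocess.run(step, shell=True, check=True, cwd=repo)

    # Every service is built the same way: docker build with its own Dockerfile.
    subprocess.run(
        ["docker", "build",
         "-f", ctx.get("dockerfile", "Dockerfile"),
         "-t", ctx["image_tag"],
         "."],
        check=True,
        cwd=repo,
    )

if __name__ == "__main__":
    build_service("./services/example-service")  # hypothetical path
```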

[Figure 1–1: Standardized build process through the Builder container]

In order to achieve maximum consistency between runtime environments, the same build process is used during the development and testing phases. This imposed a unique challenge: we needed a way to guarantee a consistent build environment across the platform. As a result, all build processes are executed inside a special "Builder" container.

The implementation of the Builder container required a number of advanced Docker techniques. This Builder container inherits local user ID and secrets (e.g., SSH key, AWS credentials, etc.) as required to access Tinder private repositories. It mounts local directories containing the source code to have a natural way to store build artifacts. This approach improves performance, because it eliminates copying built artifacts between the Builder container and the host machine. Stored build artifacts are reused next time without further configuration.
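The exact tooling isn't shown in the original write-up; the sketch below only illustrates the kind of docker run invocation being described, with the image name, mount paths, and build entry point as assumptions:

```python
import os
import subprocess

def run_in_builder(src_dir: str, image: str = "builder:latest") -> None:
    """Run a build inside a 'Builder' container (illustrative; image name is hypothetical)."""
    home = os.path.expanduser("~")
    subprocess.run(
        ["docker", "run", "--rm",
         # Run as the invoking user so build artifacts on the host keep sane ownership.
         "--user", f"{os.getuid()}:{os.getgid()}",
         # Share SSH keys and AWS credentials needed to reach private repositories.
         "-v", f"{home}/.ssh:/home/builder/.ssh:ro",
         "-v", f"{home}/.aws:/home/builder/.aws:ro",
         # Mount the source tree so build artifacts persist and can be reused next time.
         "-v", f"{os.path.abspath(src_dir)}:/workspace",
         "-w", "/workspace",
         image, "./build.sh"],
        check=True,
    )
```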

For certain services, we needed to create another container within the Builder to match the compile-time environment with the run-time environment (e.g., installing the Node.js bcrypt library generates platform-specific binary artifacts). Compile-time requirements may differ among services, and the final Dockerfile is composed on the fly.

Kubernetes Cluster Architecture and Migration

Cluster Sizing

We decided to use kube-aws for automated cluster provisioning on Amazon EC2 instances. Early on, we were running everything in one general node pool. We quickly identified the need to separate out workloads into different sizes and types of instances, to make better use of resources. The reasoning was that running fewer heavily threaded pods together yielded more predictable performance results for us than letting them coexist with a larger number of single-threaded pods.

We settled on:

  • m5.4xlarge for monitoring (Prometheus)
  • c5.4xlarge for Node.js workload (single-threaded workload)
  • c5.2xlarge for Java and Go (multi-threaded workload)
  • c5.4xlarge for the control plane (3 nodes)

Migration

One of the preparation steps for the migration from our legacy infrastructure to Kubernetes was to change existing service-to-service communication to point to new Elastic Load Balancers (ELBs) that were created in a specific Virtual Private Cloud (VPC) subnet. This subnet was peered to the Kubernetes VPC. This allowed us to granularly migrate modules with no regard to specific ordering for service dependencies.

These endpoints were created using weighted DNS record sets that had a CNAME pointing to each new ELB. To cut over, we added a new record pointing to the new Kubernetes service ELB with a weight of 0. We then set the Time To Live (TTL) on the record set to 0. The old and new weights were then slowly adjusted until 100% of traffic ended up on the new server. After the cutover was complete, the TTL was set to something more reasonable.
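A sketch of that weighted cutover with boto3 (the hosted zone ID, record name, and ELB DNS names are placeholders):

```python
import boto3

route53 = boto3.client("route53")

def set_weight(zone_id: str, record_name: str, set_id: str,
               elb_dns_name: str, weight: int, ttl: int) -> None:
    """Upsert one weighted CNAME record pointing at an ELB."""
    route53.change_resource_record_sets(
        HostedZoneId=zone_id,
        ChangeBatch={"Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "CNAME",
                "SetIdentifier": set_id,   # distinguishes the weighted records
                "Weight": weight,
                "TTL": ttl,
                "ResourceRecords": [{"Value": elb_dns_name}],
            },
        }]},
    )

# Start the cutover: new Kubernetes ELB at weight 0 and TTL 0, legacy ELB at 100.
set_weight("Z123EXAMPLE", "svc.internal.example.com", "kubernetes",
           "k8s-elb.example.amazonaws.com", weight=0, ttl=0)
set_weight("Z123EXAMPLE", "svc.internal.example.com", "legacy",
           "legacy-elb.example.amazonaws.com", weight=100, ttl=0)
# ...then gradually shift weight toward "kubernetes" and raise the TTL afterwards.
```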

Our Java modules honored low DNS TTL, but our Node applications did not. One of our engineers rewrote part of the connection pool code to wrap it in a manager that would refresh the pools every 60s. This worked very well for us with no appreciable performance hit.
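The actual fix was in the Node.js connection-pool code; the sketch below just illustrates the same idea in Python: wrap the pool in a manager that rebuilds it on an interval so fresh DNS answers are picked up.

```python
import threading
import time

class RefreshingPool:
    """Wrap a connection pool and rebuild it periodically so new DNS results take effect.

    `make_pool` is any factory that resolves DNS and returns a fresh pool object.
    """

    def __init__(self, make_pool, refresh_seconds=60):
        self._make_pool = make_pool
        self._refresh_seconds = refresh_seconds
        self._lock = threading.Lock()
        self._pool = make_pool()
        self._refreshed_at = time.monotonic()

    def get(self):
        with self._lock:
            if time.monotonic() - self._refreshed_at > self._refresh_seconds:
                old, self._pool = self._pool, self._make_pool()
                self._refreshed_at = time.monotonic()
                close = getattr(old, "close", None)
                if close:
                    close()  # release connections held by the stale pool
            return self._pool
```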

Learnings

Network Fabric Limits

In the early morning hours of January 8, 2019, Tinder’s Platform suffered a persistent outage. In response to an unrelated increase in platform latency earlier that morning, pod and node counts were scaled on the cluster. This resulted in ARP cache exhaustion on all of our nodes.

There are three Linux sysctl values relevant to the ARP (neighbor) cache:

  • gc_thresh1: the minimum number of entries to keep in the neighbor cache; garbage collection is not triggered if there are fewer entries than this.
  • gc_thresh2: the soft maximum number of entries in the cache.
  • gc_thresh3: the hard maximum number of entries in the cache.

gc_thresh3 is a hard cap. If you’re getting “neighbor table overflow” log entries, this indicates that even after a synchronous garbage collection (GC) of the ARP cache, there was not enough room to store the neighbor entry. In this case, the kernel just drops the packet entirely.

We use Flannel as our network fabric in Kubernetes. Packets are forwarded via VXLAN. VXLAN is a Layer 2 overlay scheme over a Layer 3 network. It uses MAC Address-in-User Datagram Protocol (MAC-in-UDP) encapsulation to provide a means to extend Layer 2 network segments. The transport protocol over the physical data center network is IP plus UDP.

[Figure 2–1: Flannel diagram]

[Figure 2–2: VXLAN packet]

Each Kubernetes worker node allocates its own /24 of virtual address space out of a larger /9 block. For each node, this results in 1 route table entry, 1 ARP table entry (on the flannel.1 interface), and 1 forwarding database (FDB) entry. These are added when the worker node first launches or as each new node is discovered.

In addition, node-to-pod (or pod-to-pod) communication ultimately flows over the eth0 interface (depicted in the Flannel diagram above). This will result in an additional entry in the ARP table for each corresponding node source and node destination.

In our environment, this type of communication is very common. For our Kubernetes service objects, an ELB is created and Kubernetes registers every node with the ELB. The ELB is not pod aware and the node selected may not be the packet’s final destination. This is because when the node receives the packet from the ELB, it evaluates its iptables rules for the service and randomly selects a pod on another node.

At the time of the outage, there were 605 total nodes in the cluster. For the reasons outlined above, this was enough to eclipse the default gc_thresh3 value. Once this happens, not only are packets being dropped, but entire Flannel /24s of virtual address space are missing from the ARP table. Node to pod communication and DNS lookups fail. (DNS is hosted within the cluster, as will be explained in greater detail later in this article.)
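A back-of-the-envelope check (assuming the stock kernel default of gc_thresh3 = 1024) shows why roughly 600 nodes is enough:

```python
# Rough estimate only: each peer node contributes about one ARP entry on flannel.1
# plus one on eth0 once node-to-pod traffic has touched that node.
nodes = 605
gc_thresh3_default = 1024

approx_arp_entries = 2 * nodes
print(approx_arp_entries, ">", gc_thresh3_default, "->",
      approx_arp_entries > gc_thresh3_default)
```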

To resolve the issue, the gc_thresh1, gc_thresh2, and gc_thresh3 values are raised, and Flannel must be restarted to re-register the missing networks.
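The sketch below shows one way to apply such an increase; the threshold values are hypothetical, since the right numbers depend on cluster size.

```python
# Hypothetical values; the same effect can be achieved with sysctl or /etc/sysctl.d.
# Writing to /proc requires root.
ARP_GC_THRESHOLDS = {
    "gc_thresh1": 80000,   # keep at least this many entries before GC considers pruning
    "gc_thresh2": 90000,   # soft maximum
    "gc_thresh3": 100000,  # hard maximum; beyond this, new neighbor entries are dropped
}

for name, value in ARP_GC_THRESHOLDS.items():
    with open(f"/proc/sys/net/ipv4/neigh/default/{name}", "w") as f:
        f.write(str(value))
```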

Unexpectedly Running DNS At Scale

To accommodate our migration, we leveraged DNS heavily to facilitate traffic shaping and incremental cutover from legacy to Kubernetes for our services. We set relatively low TTL values on the associated Route53 RecordSets. When we ran our legacy infrastructure on EC2 instances, our resolver configuration pointed to Amazon’s DNS. We took this for granted and the cost of a relatively low TTL for our services and Amazon’s services (e.g. DynamoDB) went largely unnoticed.

As we onboarded more and more services to Kubernetes, we found ourselves running a DNS service that was answering 250,000 requests per second. We were encountering intermittent and impactful DNS lookup timeouts within our applications. This occurred despite an exhaustive tuning effort and a DNS provider switch to a CoreDNS deployment that at one time peaked at 1,000 pods consuming 120 cores.

While researching other possible causes and solutions, we found an article describing a race condition affecting the Linux packet filtering framework netfilter. The DNS timeouts we were seeing, along with an incrementing insert_failed counter on the Flannel interface, aligned with the article’s findings.

The issue occurs during Source and Destination Network Address Translation (SNAT and DNAT) and subsequent insertion into the conntrack table. One workaround discussed internally and proposed by the community was to move DNS onto the worker node itself. In this case:

  • SNAT is not necessary, because the traffic is staying locally on the node. It doesn't need to be transmitted across the eth0 interface.
  • DNAT is not necessary, because the destination IP is local to the node and not a randomly selected pod per iptables rules.

We decided to move forward with this approach. CoreDNS was deployed as a DaemonSet in Kubernetes, and we injected the node's local DNS server into each pod's resolv.conf by configuring the kubelet's --cluster-dns flag. This workaround was effective for the DNS timeouts.

However, we still see dropped packets and the Flannel interface’s insert_failed counter increment. This will persist even after the above workaround because we only avoided SNAT and/or DNAT for DNS traffic. The race condition will still occur for other types of traffic. Luckily, most of our packets are TCP and when the condition occurs, packets will be successfully retransmitted. A long-term fix for all types of traffic is something that we are still discussing.

Using Envoy to Achieve Better Load Balancing

As we migrated our backend services to Kubernetes, we began to suffer from unbalanced load across pods. We discovered that due to HTTP Keepalive, ELB connections stuck to the first ready pods of each rolling deployment, so most traffic flowed through a small percentage of the available pods. One of the first mitigations we tried was to use a 100% MaxSurge on new deployments for the worst offenders. This was marginally effective and not sustainable long term with some of the larger deployments.
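For reference, raising a Deployment's surge setting is a small strategy change; here is a sketch with the Python client (the Deployment name is a placeholder):

```python
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

# Patch a (hypothetical) Deployment so a rolling update brings up a full extra
# set of replicas at once, spreading new connections across many fresh pods.
patch = {
    "spec": {
        "strategy": {
            "type": "RollingUpdate",
            "rollingUpdate": {"maxSurge": "100%", "maxUnavailable": "0%"},
        }
    }
}
apps.patch_namespaced_deployment(name="web", namespace="default", body=patch)
```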

Another mitigation we used was to artificially inflate resource requests on critical services so that colocated pods would have more headroom alongside other heavy pods. This was also not going to be tenable in the long run due to resource waste and our Node applications were single threaded and thus effectively capped at 1 core. The only clear solution was to utilize better load balancing.

We had internally been looking to evaluate Envoy. This afforded us a chance to deploy it in a very limited fashion and reap immediate benefits. Envoy is an open source, high-performance Layer 7 proxy designed for large service-oriented architectures. It is able to implement advanced load balancing techniques, including automatic retries, circuit breaking, and global rate limiting.

The configuration we came up with was to have an Envoy sidecar alongside each pod that had one route and cluster to hit the local container port. To minimize potential cascading and to keep a small blast radius, we utilized a fleet of front-proxy Envoy pods, one deployment in each Availability Zone (AZ) for each service. These hit a small service discovery mechanism one of our engineers put together that simply returned a list of pods in each AZ for a given service.
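The details of that mechanism aren't published; as a rough sketch, an AZ-aware pod lookup with the Kubernetes Python client could look like this (the zone label key and the app label convention are assumptions):

```python
from collections import defaultdict
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

# The zone label key differs by Kubernetes version; this was a common one at the time.
ZONE_LABEL = "failure-domain.beta.kubernetes.io/zone"

def pods_by_az(namespace: str, service_label: str) -> dict:
    """Return {availability_zone: [pod IPs]} for pods matching app=<service_label>."""
    result = defaultdict(list)
    node_zones = {
        node.metadata.name: (node.metadata.labels or {}).get(ZONE_LABEL, "unknown")
        for node in core.list_node().items
    }
    pods = core.list_namespaced_pod(namespace, label_selector=f"app={service_label}")
    for pod in pods.items:
        if pod.status.pod_ip:
            result[node_zones.get(pod.spec.node_name, "unknown")].append(pod.status.pod_ip)
    return dict(result)
```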

The service front-Envoys then utilized this service discovery mechanism with one upstream cluster and route. We configured reasonable timeouts, boosted all of the circuit breaker settings, and then put in a minimal retry configuration to help with transient failures and smooth deployments. We fronted each of these front Envoy services with a TCP ELB. Even if the keepalive from our main front proxy layer got pinned on certain Envoy pods, they were much better able to handle the load and were configured to balance via least_request to the backend.

For deployments, we utilized a preStop hook on both the application and the sidecar pod. This hook called the sidecar health check fail admin endpoint, along with a small sleep, to give some time to allow the inflight connections to complete and drain.
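A hedged sketch of what such a preStop step might look like; the Envoy admin address and the drain window are assumptions, not the actual values used:

```python
import time
import urllib.request

# Tell the local Envoy sidecar to start failing its health checks so the front
# proxies stop sending it new connections, then wait for in-flight requests to
# drain before the container is stopped.
ENVOY_ADMIN = "http://127.0.0.1:9901"   # assumed admin address
DRAIN_SECONDS = 20                      # assumed drain window

urllib.request.urlopen(
    urllib.request.Request(f"{ENVOY_ADMIN}/healthcheck/fail", method="POST"),
    timeout=5,
)
time.sleep(DRAIN_SECONDS)
```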

One reason we were able to move so quickly was due to the rich metrics we were able to easily integrate with our normal Prometheus setup. This allowed us to see exactly what was happening as we iterated on configuration settings and cut traffic over.

The results were immediate and obvious. We started with the most unbalanced services and at this point have it running in front of twelve of the most important services in our cluster. This year we plan on moving to a full-service mesh, with more advanced service discovery, circuit breaking, outlier detection, rate limiting, and tracing.

[Figure 3–1: CPU convergence of one service during cutover to Envoy]

The Result

Through these learnings and additional research, we have developed a strong in-house infrastructure team with great familiarity with how to design, deploy, and operate large Kubernetes clusters. Tinder's entire engineering organization now has knowledge and experience in how to containerize and deploy their applications on Kubernetes.

On our legacy infrastructure, when additional scale was required, we often suffered through several minutes of waiting for new EC2 instances to come online. Containers now schedule and serve traffic within seconds as opposed to minutes. Scheduling multiple containers on a single EC2 instance also provides improved horizontal density. As a result, we project substantial cost savings on EC2 in 2019 compared to the previous year.

It took nearly two years, but we finalized our migration in March 2019. The Tinder Platform runs exclusively on a Kubernetes cluster consisting of 200 services, 1,000 nodes, 15,000 pods, and 48,000 running containers. Infrastructure is no longer a task reserved for our operations teams. Instead, engineers throughout the organization share in this responsibility and have control over how their applications are built and deployed with everything as code.


Sharing with the Askllyy community (building) WhatsApp group: https://chat.whatsapp.com/HfnCnJXJ1zO1HyZHpj57rA. Ask and answer questions related to DevOps, SRE, DevSecOps, IT, and software security. Share and find jobs in the DevOps, DevSecOps, SRE, AppSec, and cybersecurity domains. Find or post gigs in DevSecOps, IT, and SRE. Find freelancers. Network with DevOps and security experts.
