Best Practices For Running Cost-Optimized Kubernetes Applications On Amazon EKS

This document discusses Amazon EKS features and options, and the best practices for running cost-optimized applications on EKS to take advantage of the elasticity provided by AWS. This document assumes that you are familiar with Kubernetes, AWS, EKS, and autoscaling.

Introduction

Kubernetes is a container orchestration platform that includes a variety of configurations. Containers are used to run applications, and they rely on container images to define all of the resources required. Kubernetes handles these containers by forming pods out of one or more of them. Within a cluster of compute nodes, pods can be scheduled and scaled.

Then there are namespaces, which are used to organize Kubernetes resources such as pods and deployments. A namespace can mirror the structure of an organization, for example a separate namespace for each team or for each developer environment.

As Kubernetes gains momentum, more businesses and platform-as-a-service (PaaS) and software-as-a-service (SaaS) providers are deploying multi-tenant Kubernetes clusters for their workloads. As a result, a single cluster could be hosting applications from many teams, departments, customers, or environments. Kubernetes multi-tenancy allows businesses to manage a few large clusters rather than many smaller ones, resulting in better resource planning, more effective oversight, and less fragmentation.

Some of these businesses with rapidly growing Kubernetes clusters begin to see a disproportionate increase in costs over time. This often happens because traditional businesses adopting cloud-based technologies like Kubernetes lack cloud experience, which can result in unstable applications during autoscaling.

This document provides best practices for running cost-optimized Kubernetes workloads on EKS.

The foundation of building cost-optimized applications is spreading the cost-saving culture across teams. Beyond moving cost discussions to the beginning of the development process, this approach forces you to better understand the environment that your applications are running in, which in this context is the EKS environment.

In order to achieve low cost and application stability, you must correctly set or tune some features and configurations (such as autoscaling, machine types, and region selection). Another important consideration is your workload type because, depending on the workload type and your application’s requirements, you must apply different configurations in order to further lower your costs. Finally, you must monitor your spending and create guardrails so that you can enforce best practices early in your development cycle.

EKS cost-optimization features and options

Cost-optimized Kubernetes applications rely heavily on EKS autoscaling. To balance cost, reliability, and scaling performance on EKS, you must understand how autoscaling works and what options you have. This section discusses EKS autoscaling and other useful cost-optimized configurations for both serving and batch workloads.

Fine-tune EKS autoscaling

Autoscaling is the strategy EKS uses to let AWS customers pay only for what they need by minimizing infrastructure uptime. In other words, autoscaling saves costs by

  1. making workloads and their underlying infrastructure start before demand increases, and
  2. shutting them down when demand decreases.

EKS handles these autoscaling scenarios by using features like the following:

  • Horizontal Pod Autoscaler (HPA): for adding and removing Pods based on utilization metrics.
  • Vertical Pod Autoscaler (VPA): for sizing your Pods.
  • Cluster Autoscaler: for adding and removing nodes based on the scheduled workload.
  • Karpenter: an open-source, flexible, high-performance Kubernetes cluster autoscaler built with AWS.
  • Managed Node Groups: for dynamically creating node groups with nodes that match the needs of your Pods.

Horizontal Pod Autoscaler

The Horizontal Pod Autoscaler (HPA) scales the number of Pods in a Deployment, StatefulSet, or ReplicaSet based on CPU or memory utilization, or on custom metrics exposed by your application. HPA works as a control loop: a separate HPA object exists for each Deployment, StatefulSet, or ReplicaSet, continuously compares that workload's metrics against the CPU or memory threshold you specify, and increases or decreases the replica count accordingly. By using HPA, you pay for extra resources only when you actually need them.

The following are best practices for enabling HPA in your application:

  • Size your application correctly by setting appropriate resource requests and limits.
  • Set your target utilization to reserve a buffer that can handle requests during a spike.
  • Make sure your application starts as quickly as possible and shuts down according to Kubernetes expectations.
  • Set meaningful readiness and liveness probes.
  • Make sure that your Metrics Server is always up and running.
  • Inform clients of your application that they must consider implementing exponential retries for handling transient issues.
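
As a minimal sketch, the following HorizontalPodAutoscaler targets a hypothetical Deployment named wordpress (the same name used in the resource-requests example later in this document) and keeps average CPU utilization around 70%, leaving headroom for spikes. The names and thresholds are illustrative, not prescriptive.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: wordpress
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: wordpress           # hypothetical Deployment to scale
  minReplicas: 2              # small baseline so spikes don't start from a single Pod
  maxReplicas: 10             # cap spend during extreme bursts
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # leave roughly 30% headroom for sudden traffic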

For more information, see the Horizontal Pod Autoscaler topic in the Amazon EKS documentation.

Vertical Pod Autoscaler

Unlike the Horizontal Pod Autoscaler (HPA), the Vertical Pod Autoscaler (VPA) automatically adjusts the CPU and memory attributes of your Pods. VPA does this by recreating each Pod with suitable CPU and memory values, which frees up CPU and memory for other Pods and helps you better utilize your Kubernetes cluster; worker nodes are used efficiently because Pods request exactly what they need. VPA can suggest memory and CPU requests and limits, and it can also apply them automatically if you enable that behavior. This reduces the time engineers spend on performance and benchmark testing to determine the correct values for CPU and memory requests.

VPA can work in three different modes:

  • Off: In this mode, also known as recommendation mode, VPA does not apply any changes to your Pods. The recommendations are calculated and can be inspected in the VPA object.
  • Initial: VPA assigns resource requests only at Pod creation and never changes them later.
  • Auto: VPA updates CPU and memory requests during the life of a Pod. That means the Pod is deleted, CPU and memory are adjusted, and then a new Pod is started.

If you plan to use VPA, the best practice is to start with the Off mode for pulling VPA recommendations. Make sure it’s running for 24 hours, ideally one week or more, before pulling recommendations. Then, only when you feel confident, consider switching to either Initial or Auto mode.
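
As a hedged example, the following VPA object runs in recommendation (Off) mode against the hypothetical wordpress Deployment; it assumes the open-source VPA components are already installed in your cluster, and the min and max bounds are illustrative.

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: wordpress-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: wordpress          # hypothetical Deployment to observe
  updatePolicy:
    updateMode: "Off"        # recommendation mode: calculate, but never evict Pods
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        minAllowed:
          cpu: 100m
          memory: 64Mi
        maxAllowed:          # keep recommendations within sane bounds
          cpu: "1"
          memory: 1Gi

After a day or more of real traffic, inspect the suggestions with kubectl describe vpa wordpress-vpa before considering a switch to Initial or Auto mode.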

Follow these best practices for enabling VPA, either in Initial or Auto mode, in your application:

  • Don’t use VPA either in Initial or Auto mode if you need to handle sudden spikes in traffic. Use HPA instead.
  • Make sure your application can grow vertically.
  • Set minimum and maximum container sizes in the VPA objects to avoid the autoscaler making significant changes when your application is not receiving traffic.
  • Don’t make abrupt changes, such as dropping the Pod’s replicas from 30 to 5 all at once. This kind of change requires a new deployment, a new label set, and a new VPA object.
  • Make sure your application starts as quickly as possible and shuts down according to Kubernetes expectations.
  • Set meaningful readiness and liveness probes.
  • Make sure that your Metrics Server is always up and running.
  • Inform clients of your application that they must consider implementing exponential retries for handling transient issues.
  • Consider using a node autoscaler such as Karpenter alongside VPA, so that if a resized Pod becomes too large to fit into the existing node types, larger nodes are provisioned to accommodate it.

Cluster Autoscaler

Cluster Autoscaler (CA) automatically resizes the underlying compute infrastructure. CA provides nodes for Pods that don't have a place to run in the cluster and removes under-utilized nodes. CA is optimized for the cost of infrastructure; in other words, if there are two or more node types in the cluster, CA chooses the least expensive one that fits the given demand.

Certain Pods cannot be restarted by any autoscaler without causing some temporary disruption, so the nodes they run on can't be deleted. For example, system Pods (such as metrics-server and kube-dns) and Pods using local storage won't be restarted by default. However, you can change this behavior by defining PodDisruptionBudgets (PDBs) for these system Pods and by setting the "cluster-autoscaler.kubernetes.io/safe-to-evict": "true" annotation on Pods that use local storage but are safe for the autoscaler to restart. Moreover, consider running long-lived Pods that can't be restarted on a separate node group, so they don't block scale-down of other nodes. Finally, learn how to analyze CA events in the logs to understand why a particular scaling activity didn't happen as expected.
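
The following sketch shows both techniques: a PDB that keeps at least one replica of a workload available during scale-down, and the safe-to-evict annotation on a hypothetical Deployment whose Pods use only disposable local scratch storage. All names are illustrative.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: wp-pdb
spec:
  minAvailable: 1            # never evict below one available replica
  selector:
    matchLabels:
      app: wp                # matches the wordpress Pods shown later in this document
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scratch-worker       # hypothetical workload that uses local storage
spec:
  replicas: 1
  selector:
    matchLabels:
      app: scratch-worker
  template:
    metadata:
      labels:
        app: scratch-worker
      annotations:
        # Tell Cluster Autoscaler it may evict this Pod even though it uses emptyDir.
        cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
    spec:
      containers:
        - name: worker
          image: busybox
          command: ["sh", "-c", "sleep 3600"]
          volumeMounts:
            - name: scratch
              mountPath: /scratch
      volumes:
        - name: scratch
          emptyDir: {}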

If your workloads are resilient to nodes restarting inadvertently and to capacity losses, you can save more money by creating a cluster or node group that uses Spot Instances. For CA to work as expected, Pod resource requests need to be large enough for the Pod to function normally. If resource requests are too small, nodes might not have enough resources and your Pods might crash or have trouble at runtime.

The following is a summary of the best practices for enabling Cluster Autoscaler in your cluster:

  • Use either HPA or VPA to autoscale your workloads.
  • Make sure you are following the best practices described in the chosen Pod autoscaler.
  • Size your application correctly by setting appropriate resource requests and limits or using VPA.
  • Define a PDB for your applications.
  • Define PDBs for system Pods that might block scale-down, for example kube-dns. To avoid temporary disruption in your cluster, don't set a PDB for system Pods that have only one replica (such as metrics-server).
  • Run short-lived Pods and Pods that can be restarted in separate node groups, so that long-lived Pods don't block their scale-down.
  • Avoid over-provisioning idle nodes in your cluster. To do that, know your minimum capacity (for many companies, it's during the night) and set the minimum number of nodes in your node groups to support that capacity.
  • If you need extra capacity to handle requests during spikes, use placeholder pause Pods for over-provisioning, as sketched after this list.
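
A common over-provisioning pattern (a sketch, with illustrative sizes) is a Deployment of pause containers at negative priority: they reserve spare capacity on the nodes, and the scheduler preempts them the moment real workloads need room, which triggers scale-up ahead of demand.

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -10                     # lower than any real workload, so these Pods are preempted first
globalDefault: false
description: "Placeholder priority for pause Pods"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning
spec:
  replicas: 2                  # tune to the amount of headroom you want
  selector:
    matchLabels:
      app: overprovisioning
  template:
    metadata:
      labels:
        app: overprovisioning
    spec:
      priorityClassName: overprovisioning
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: 500m        # the capacity each placeholder reserves
              memory: 512Mi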

For more information, see Autoscaling in the Amazon EKS documentation.

Karpenter

Karpenter is an open-source, flexible, high-performance Kubernetes cluster autoscaler built with AWS. It helps improve your application availability and cluster efficiency by rapidly launching right-sized compute resources in response to changing application load. Karpenter also provides just-in-time compute resources to meet your application’s needs and will soon automatically optimize a cluster’s compute resource footprint to reduce costs and improve performance.
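
As a rough sketch (assuming the Karpenter controller is installed and an EC2NodeClass named default already exists; field names follow the Karpenter v1 NodePool API and may differ in your version), a NodePool like the following lets Karpenter launch right-sized Spot or On-Demand capacity on demand:

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]   # allow cheaper Spot capacity where possible
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default                     # assumed pre-existing EC2NodeClass
  limits:
    cpu: "100"                            # hard cap on total vCPUs this NodePool may provision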

Managed Node Groups

Amazon EKS managed node groups automate the provisioning and lifecycle management of nodes (Amazon EC2 instances) for Amazon EKS Kubernetes clusters.

With Amazon EKS-managed node groups, you don’t need to separately provision or register the Amazon EC2 instances that provide compute capacity to run your Kubernetes applications. You can create, automatically update, or terminate nodes for your cluster with a single operation. Node updates and terminations automatically drain nodes to ensure that your applications stay available.

Every managed node is provisioned as part of an Amazon EC2 Auto Scaling group that’s managed for you by Amazon EKS. Every resource including the instances and Auto Scaling groups runs within your AWS account. Each node group runs across multiple Availability Zones that you define.

You can add a managed node group to new or existing clusters using the Amazon EKS console, eksctl, the AWS CLI, the AWS API, or infrastructure-as-code tools such as AWS CloudFormation. Nodes launched as part of a managed node group are automatically tagged for auto-discovery by the Kubernetes Cluster Autoscaler. You can use the node group to apply Kubernetes labels to nodes and update them at any time.
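
For example, a minimal eksctl ClusterConfig that defines a managed node group could look like the following sketch (the cluster name, region, instance type, and sizes are hypothetical):

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: demo-cluster          # hypothetical cluster name
  region: us-east-1
managedNodeGroups:
  - name: general-purpose
    instanceType: m5.large
    minSize: 2                # floor that covers your known minimum capacity
    maxSize: 6                # ceiling for the autoscaler to scale into
    desiredCapacity: 2
    labels:
      workload-type: general

Running eksctl create nodegroup --config-file=<file> against an existing cluster adds the node group; eksctl create cluster with the same file builds the cluster and node group together.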

There are no additional costs to use Amazon EKS managed node groups; you pay only for the AWS resources you provision, such as Amazon EC2 instances, Amazon EBS volumes, Amazon EKS cluster hours, and any other AWS infrastructure. There are no minimum fees and no upfront commitments.

To get started with a new Amazon EKS cluster and managed node group, see Getting started with Amazon EKS - AWS Management Console and AWS CLI.

To add a managed node group to an existing cluster, see Creating a managed node group.

Choose the right machine type

Beyond autoscaling, other configurations can help you run cost-optimized Kubernetes applications on EKS. This section discusses choosing the right machine type.

Spot Instances

Amazon EC2 Spot Instances are spare EC2 capacity offered at discounts of 70–90% compared to On-Demand prices. The Spot price is determined by long-term trends in supply and demand and by the amount of spare capacity for a particular instance size, family, Availability Zone, and AWS Region.

If EC2 needs the capacity back for a particular instance type, the Spot Instance receives an interruption notice two minutes in advance so that it can wrap up work gracefully. We recommend a diversified fleet of instances, with multiple instance types, created through Spot Fleets or EC2 Fleets.

Whatever the workload type, you must pay attention to the following constraints:

  • Pod Disruption Budgets might not be respected, because Spot nodes can shut down with little warning.
  • There is no guarantee that your Pods will shut down gracefully, because the two-minute interruption notice can cut the Pod's grace period short.
  • It might take several minutes for EKS to detect that the node was interrupted and that the Pods are no longer running, which delays rescheduling the Pods onto a new node.

Spot Instances can be a great fit if you are running a short-lived Kubernetes cluster for a proof of concept or a non-production environment, among many other fault-tolerant use cases. A sketch of a Spot-backed managed node group follows.
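
The following eksctl fragment (instance types, sizes, and labels are illustrative) defines a managed node group backed by Spot capacity. It diversifies across several similarly sized instance types and taints the nodes so that only workloads that explicitly tolerate interruptions land on them; it is assumed to be part of the same ClusterConfig shown earlier.

managedNodeGroups:
  - name: spot-workers
    spot: true                                             # request Spot capacity for this group
    instanceTypes: ["m5.large", "m5a.large", "m4.large"]   # diversified fleet of similar sizes
    minSize: 0
    maxSize: 10
    desiredCapacity: 2
    labels:
      capacity-type: spot
    taints:
      - key: spot
        value: "true"
        effect: NoSchedule                                 # only Pods tolerating Spot run here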

On-Demand Instance

With On-Demand Instances, you pay for computing capacity by the second with no long-term commitments. You have full control over the instance lifecycle: you decide when to launch, stop, hibernate, start, reboot, or terminate it.

On-Demand Instances are more suitable for running a stable Kubernetes cluster with a mixed pool of instances when you haven't purchased any Reserved Instances or Savings Plans, depending on your application requirements.

Reserved Instance

An AWS Reserved Instance is officially described as a "billing discount" applied to the use of an On-Demand Instance in your account. In other words, a Reserved Instance is not actually a physical instance; rather, it is the discounted billing you get when you commit to using a specific On-Demand instance configuration for a term of one or three years.

Reserved instances are ideal for steady and predictable usage. They can help you save significantly on your Amazon EC2 costs compared to on-demand instance pricing because, in exchange for your commitment to pay for all the hours in a one-year or three-year term, the hourly rate is lowered significantly.

Reserved Instances are more suitable for running long-term Kubernetes clusters with predefined EC2 instance types and sizes.

Select the appropriate region

When cost is a constraint, where you run your EKS clusters matters. Due to many factors, cost varies per AWS Region, so make sure you are running your workload in the least expensive Region where latency doesn't affect your customers. If your workload requires copying data from one Region to another (for example, to run a batch job), you must also consider the cost of moving that data.

Use RIs and Savings Plans

There are three important ways to optimize compute costs, and AWS has tools to help you with all of them: choosing the right EC2 purchase model for your workloads, selecting the right instance type to fine-tune price performance, and mapping usage to actual demand.

Amazon EC2 Reserved Instances (RI) provide a significant discount (up to 72%) compared to On-Demand pricing and provide a capacity reservation when used in a specific Availability Zone.

Savings Plans is a flexible pricing model that can help you reduce your bill by up to 72% compared to On-Demand prices, in exchange for a one- or three-year hourly spend commitment. AWS offers three types of Savings Plans: Compute Savings Plans, EC2 Instance Savings Plans, and SageMaker Savings Plans.

Review small development clusters

For small development clusters, such as clusters with three or fewer nodes or clusters that use machine types with limited resources, you can reduce resource usage by disabling or fine-tuning a few cluster add-ons. This practice is especially useful if you have a cluster-per-developer strategy and your developers don't need things like autoscaling, logging, and monitoring. However, because of the cost per cluster and for simpler management, we recommend moving toward a multi-tenant cluster strategy instead.

Understand your application capacity

When you plan for application capacity, know how many concurrent requests your application can handle, how much CPU and memory it requires, and how it responds under heavy load. Most teams don't know these numbers, so we recommend that you test how your application behaves under pressure. Start by isolating a single application Pod replica with autoscaling off, and then run tests that simulate an actual usage load. This helps you understand your per-Pod capacity. Next, configure your Cluster Autoscaler, resource requests and limits, and either HPA or VPA. Then stress your application again, but with more intensity, to simulate sudden bursts or spikes.

Ideally, to eliminate latency concerns, these tests should run from the same AWS Region or Availability Zone where the application is running. You can use the tool of your choice for these tests, whether it's a homemade script or a more advanced performance tool such as Apache Benchmark, JMeter, or Locust.

Make sure your application can grow vertically and horizontally

Ensure that your application can grow and shrink. This means you can choose to handle traffic increases either by adding more CPU and memory or adding more Pod replicas. This gives you the flexibility to experiment with what fits your application better, whether that’s a different autoscaler setup or a different node size. Unfortunately, some applications are single-threaded or limited by a fixed number of workers or subprocesses that make this experiment impossible without a complete refactoring of their architecture.

Set appropriate resource requests and limits

By understanding your application capacity, you can determine what to configure in your container resources. Resources in Kubernetes are mainly defined as CPU and memory (RAM). You configure the amount of CPU or memory required to run your application by using the request spec.containers[].resources.requests.<cpu|memory>, and you configure the cap by using the limit spec.containers[].resources.limits.<cpu|memory>.

When you've correctly set resource requests, the Kubernetes scheduler can use them to decide which node to place your Pod on. This guarantees that Pods are placed on nodes with enough resources for them to function normally, so you get better stability and less resource waste. Moreover, defining resource limits helps ensure that these applications never consume all of the available capacity of the underlying compute nodes.

A good practice for setting your container resources is to use the same amount of memory for requests and limits, and a larger or unbounded CPU limit. Take the following deployment as an example:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: wordpress
spec:
  replicas: 1
  selector:
    matchLabels:
      app: wp
  template:
    metadata:
      labels:
        app: wp
    spec:
      containers:
        - name: wp
          image: wordpress
          resources:
            requests:
              memory: "128Mi"
              cpu: "250m"
            limits:
              memory: "128Mi"

The reasoning for the preceding pattern is founded on how Kubernetes out-of-resource handling works. Briefly, when compute resources are exhausted, nodes become unstable. To avoid this situation, the kubelet monitors and prevents total starvation of these resources by ranking the resource-hungry Pods. When CPU is contended, these Pods can be throttled down to their requests. However, because memory is an incompressible resource, when memory is exhausted the Pod needs to be taken down. To avoid having Pods taken down (and consequently destabilizing your environment), you must set the requested memory equal to the memory limit.

You can also use VPA in recommendation mode to help you determine CPU and memory usage for a given application. Because VPA provides such recommendations based on your application usage, we recommend that you enable it in a production-like environment to face real traffic. VPA status then generates a report with the suggested resource requests and limits, which you can statically specify in your deployment manifest.

Make sure your container is as lean as possible

When you run applications in containers, it’s important to follow some practices for building those containers. When running those containers on Kubernetes, some of these practices are even more important because your application can start and stop at any moment. This section focuses mainly on the following two practices:

  • Have the smallest image possible. It’s a best practice to have small images because every time Cluster Autoscaler provisions a new node for your cluster, that node must download the images of the containers it will run. The smaller the image, the faster the node can download it.
  • Start the application as quickly as possible. Some applications can take minutes to start because of class loading, caching, and so on. When a Pod requires a long startup, your customers’ requests might fail while your application is booting.

Consider these two practices when designing your system, especially if you are expecting bursts or spikes. Having a small image and a fast startup helps you reduce scale-up latency. Consequently, you can better handle traffic increases without worrying too much about instability.

Set meaningful readiness and liveness probes for your application

Setting meaningful probes ensures your application receives traffic only when it is actually up and ready to accept it. EKS uses readiness probes to determine when to add Pods to or remove Pods from load balancers, and liveness probes to determine when to restart your Pods.
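
A minimal sketch for the wordpress container used elsewhere in this document follows; the port, paths, and timings are assumptions, so point the probes at whatever lightweight health endpoint your application actually exposes.

spec:
  containers:
    - name: wp
      image: wordpress
      ports:
        - containerPort: 80
      readinessProbe:            # gate traffic until the app can serve requests
        httpGet:
          path: /                # hypothetical lightweight health path
          port: 80
        initialDelaySeconds: 10
        periodSeconds: 5
      livenessProbe:             # restart the container if it stops responding
        httpGet:
          path: /
          port: 80
        initialDelaySeconds: 30
        periodSeconds: 10
        failureThreshold: 3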

Make sure your applications are shutting down according to Kubernetes expectations

Autoscalers help you respond to spikes by spinning up new Pods and nodes, and by deleting them when the spikes finish. That means that, to avoid errors while serving traffic, your Pods must be prepared for both a fast startup and a graceful shutdown.

Because Kubernetes asynchronously updates endpoints and load balancers, it’s important to follow these best practices in order to ensure non-disruptive shutdowns:

  • Don’t stop accepting new requests right after SIGTERM. Your application must not stop immediately, but instead, finish all requests that are in flight and still listen to incoming connections that arrive after the Pod termination begins. It might take a while for Kubernetes to update all kube-proxies and load balancers. If your application terminates before these are updated, some requests might cause errors on the client side.
  • If your application doesn't follow the preceding practice, use the preStop hook. Most programs don't stop accepting requests right away. However, if you're using third-party code or are managing a system that you don't have control over, such as nginx, the preStop hook is a good option for triggering a graceful shutdown without modifying the application. One common strategy is to execute, in the preStop hook, a sleep of a few seconds to postpone the SIGTERM, as sketched in the example after this list. This gives Kubernetes extra time to finish the Pod deletion process and reduces connection errors on the client side.
  • Handle SIGTERM for cleanups. If your application must clean up or has an in-memory state that must be persisted before the process terminates, now is the time to do it. Different programming languages have different ways to catch this signal, so find the right way in your language.
  • Configure terminationGracePeriodSeconds to fit your application needs. Some applications need more than the default 30 seconds to finish. In this case, you must specify terminationGracePeriodSeconds. High values might increase the time for node upgrades or rollouts, for example. Low values might not allow enough time for Kubernetes to finish the Pod termination process. Either way, we recommend that you set your application's termination period to less than 10 minutes because Cluster Autoscaler honors it for 10 minutes only.
  • If your application uses a load balancer that routes traffic directly to Pod IP addresses (for example, an ALB or NLB with IP targets managed by the AWS Load Balancer Controller), start failing your readiness probe when you receive a SIGTERM. This action directly signals the load balancer to stop forwarding new requests to the backend Pod. Depending on the race between the health check configuration and endpoint programming, the backend Pod might be taken out of traffic earlier.
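
Putting the preStop and grace-period practices together, a Pod spec fragment might look like the following sketch; the 10-second sleep and 60-second grace period are assumptions that you should tune to your application's real shutdown time.

spec:
  terminationGracePeriodSeconds: 60   # total time Kubernetes waits before sending SIGKILL
  containers:
    - name: wp
      image: wordpress
      lifecycle:
        preStop:
          exec:
            # Delay SIGTERM so kube-proxy and load balancers have time to remove
            # this Pod from their endpoints before it stops accepting connections.
            command: ["sh", "-c", "sleep 10"]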

Monitor your environment and enforce cost-optimized configurations and practices

In many medium and large enterprises, a centralized platform and infrastructure team is often responsible for creating, maintaining, and monitoring Kubernetes clusters for the entire company. This creates a strong need for resource usage accountability and for making sure all teams are following the company’s policies.

Amazon EKS supports Kubecost, which you can use to monitor your costs broken down by Kubernetes resources including pods, nodes, namespaces, and labels. As a Kubernetes platform administrator or finance leader, you can use Kubecost to visualize a breakdown of Amazon EKS charges, allocate costs, and charge back organizational units such as application teams. You can provide your internal teams and business units with transparent and accurate cost data based on their actual AWS bill, and you can also get customized recommendations for cost optimization based on their infrastructure environment and usage patterns within their clusters. For more information, see the Kubecost documentation and the Amazon EKS cost monitoring documentation.

Important Implementation links:

EKS Cluster with spot instance

Run your Kubernetes Workloads on Amazon EC2 Spot Instances with Amazon EKS (aws.amazon.com)

EKS Cluster with Karpenter

Introducing Karpenter - An Open-Source, High-Performance Kubernetes Cluster Autoscaler (aws.amazon.com)

EKS Cluster with Kubecost

Cost monitoring with Kubecost - Amazon EKS (docs.aws.amazon.com)

EKS Cluster Autoscaling

Autoscaling - Amazon EKS (docs.aws.amazon.com)
