Kubernetes Observability Boosts Productivity, Reduces Costs

Practical Steps to Kubernetes Observability

Following these 10 simple steps can help you get or take back control of your observability data:

1. Define Your Goals

First, establish the vision for your digital transformation initiative and set goals to achieve it. If it’s an application that allows a hybrid workforce to connect with customers, for example, what should your service-level objectives (SLOs) be?

On the flip side, what is the target mean time to remediate (MTTR), or how much downtime can the organization afford? The resource spikes you are willing to tolerate and the funding you can invest should also be defined from the start. Working backward from here, you can determine what your mean time to detect (MTTD) needs to be in order to meet your goals.
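To make the relationship between SLOs and downtime concrete, here is a minimal sketch in Go, with hypothetical numbers rather than recommendations, of turning an availability target into an error budget:

```go
package main

import "fmt"

func main() {
	// Hypothetical SLO target: 99.9% availability over a 30-day window.
	slo := 99.9
	windowMinutes := 30.0 * 24 * 60 // 43,200 minutes in 30 days

	// The error budget is whatever the SLO does not promise.
	errorBudgetMinutes := windowMinutes * (100 - slo) / 100
	fmt.Printf("Allowed downtime per 30 days: %.1f minutes\n", errorBudgetMinutes)

	// If MTTD plus MTTR comes to roughly 20 minutes per incident, this ~43-minute
	// budget absorbs about two incidents a month before the SLO is breached.
}
```

Working through numbers like these early makes the MTTD and MTTR targets you set later feel far less arbitrary.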

2. Select the Best Observability Solution(s)

As with every digital transformation project, your team needs the best solution or combination of solutions possible, which will depend on your use case(s) and goals. There is not one way to monitor Kubernetes or to be cloud native. It depends on your people, your organization, your business goals and your existing technology stack.

The types of tools to consider when choosing the optimal observability solution for your organization are:

Open source metrics tools: Prometheus is the de facto standard for Kubernetes metrics. Because a single Prometheus server has scalability limitations, teams often add a long-term storage backend such as Thanos, Cortex or M3DB.

Open source logging tools: In cloud native environments, many people look to log aggregation tools such as the Cloud Native Computing Foundation (CNCF) project Fluentd, which acts as a data collector and can send data to many different backends.

Open source tracing tools: Jaeger began as one of the most popular open source tracing tools. It has since been eclipsed by OpenTelemetry, currently the second most active project at the CNCF and the de facto standard for tracing data.

Open source cost optimization tools: These tools report cost information about your cluster and resources so you can take action. The most popular ones in this category are Kubecost and OpenCost.

Non–open source tools: There are many vendors that offer observability tooling. The most important issue here is to evaluate their use of open standards for observability data encoding, transport, ingestion and querying. Look for integration with CNCF projects in these observability vendors, such as Chronosphere’s commitment to adhering to community standards.

Once you determine the solution(s) you need, it’s time to decide how to take advantage of them. Open source is a critical characteristic of any collector in the cloud native ecosystem, especially when you rely on Kubernetes.

These are some of the key ways to deploy and access observability:

Open source, self-management (aka DIY): This is the entry point for most organizations because they want to stay in control of their data and observability innovation without vendor timelines and lock-in. In-house observability is a good choice if the environment is not required to scale rapidly or enormously. However, you must have the resources, experience and scale to run, host, operate and monitor the solution in an availability zone separate from the one running your production applications.

Proprietary SaaS solution: The health and well-being of your DevOps and site reliability engineering (SRE) organization is often a primary factor in choosing SaaS. If you have an existing software as a service (SaaS) observability vendor from your non-containerized environment, you might be able to extend it to monitor Kubernetes as well, but there are several pitfalls to watch out for: high licensing costs from pricing built for VM-centric environments; slow dashboards and delayed alerts due to the abundance of data; and a lack of visibility into key metrics in your Kubernetes cluster.

Open source-compatible SaaS solution: This “golden path” eases implementation by delivering all the benefits of full, scalable visibility into your stack along with data control and without vendor lock-in. Solutions that are fully Prometheus- and OpenTelemetry-compatible can give you the best of both worlds. As a result, your organization reaps the rewards of not only the open source solution, but also the community, ecosystem and tutorials. Compatibility with open source also provides an off-ramp back to self-hosting, should goals change.

The final step in selecting a solution is deciding whether to use a cloud provider’s tool. For a single cloud, using the cloud provider’s analytics and monitoring is smart because you gain a price advantage and visibility from deep integration with existing cloud infrastructure. Whether you use a single cloud or multiple clouds, though, you remain responsible for the customer experience.

3. Instrument Your Code

You need to instrument your code to get the most out of the tool(s) you’re using and to enable distributed tracing (see No. 7). In practice, instrumenting your code means collecting data and then sending it wherever you want, with no vendor lock-in to application performance monitoring (APM) or infrastructure monitoring providers. Many solutions work out of the box without much effort, but instrumenting your code gives you the most complete data and the best basis for action.

In the open source world, Prometheus is the standard for understanding Kubernetes cluster health. However, be cautious, because you might not actually need all of the data that is emitted. If the data isn’t useful to you and your organization, it becomes costly. Adapting for a specific use case and business need is always better than a one-size-fits-all monitoring experience. Be aware of this if you are learning with Prometheus dashboards.
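As an illustration, here is a minimal sketch of Prometheus instrumentation in a Go service using the client_golang library; the metric name, labels and port are hypothetical, and keeping the label set small is what prevents the cost problem described above:

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// requestsTotal counts handled requests by route and status code.
// Every extra label value multiplies the number of time series you pay to store.
var requestsTotal = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "myapp_http_requests_total",
		Help: "Total HTTP requests handled, by route and status.",
	},
	[]string{"route", "status"},
)

func checkoutHandler(w http.ResponseWriter, r *http.Request) {
	// Use a normalized route name, not the raw URL, to keep cardinality bounded.
	requestsTotal.WithLabelValues("/checkout", "200").Inc()
	w.Write([]byte("ok"))
}

func main() {
	http.HandleFunc("/checkout", checkoutHandler)
	// Prometheus scrapes this endpoint, typically discovered through Kubernetes
	// service discovery, pod annotations or a ServiceMonitor.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}
```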

4. Collect and Visualize Observability Data Using Dashboards

Your engineers will be tasked with creating dashboards that keep data visualizations at the ready, so you can glance at them and quickly understand exactly what’s happening in your system. Many solutions include a dashboard system. For example, Chronosphere delivers faster dashboards with its Query Accelerator technology, which stays fast and performant across your fleet and requires no manual optimization.

This approach is simpler because your engineers don’t need to be deep experts in a query language (such as PromQL), the architecture and scale of the environment, the observability solution’s underlying data model or how a query in testing will perform in production.
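For teams that do write their own panels, here is a minimal sketch of the kind of PromQL a dashboard runs behind the scenes, issued through the Prometheus Go API client; the Prometheus address and the metric name (carried over from the instrumentation sketch above) are assumptions:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Assumed in-cluster Prometheus address; adjust for your environment.
	client, err := api.NewClient(api.Config{Address: "http://prometheus.monitoring:9090"})
	if err != nil {
		log.Fatal(err)
	}
	promAPI := v1.NewAPI(client)

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Per-route request rate over the last five minutes, as a panel might chart it.
	query := `sum(rate(myapp_http_requests_total[5m])) by (route)`
	result, warnings, err := promAPI.Query(ctx, query, time.Now())
	if err != nil {
		log.Fatal(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println(result)
}
```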

5. Track Your Resource Utilization

Significant resource changes can mean good news or bad news: your customer base has suddenly spiked, or something has stopped working. Either way, it can be challenging with existing APM or infrastructure monitoring tools to understand which resources you’re using, how much, for which applications and whether consumption is excessive.

The Observability Data Optimization Cycle from Chronosphere helps your organization overcome these challenges by better understanding and taking action on the cost of your observability data through a process consisting of analyzing, refining and operating.
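As one concrete way to track utilization, here is a minimal sketch of a Prometheus range query for per-pod CPU usage in Go; it assumes the kubelet’s cAdvisor metrics are being scraped, and the namespace and Prometheus address are hypothetical:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	client, err := api.NewClient(api.Config{Address: "http://prometheus.monitoring:9090"})
	if err != nil {
		log.Fatal(err)
	}
	promAPI := v1.NewAPI(client)

	// Per-pod CPU usage in the "checkout" namespace over the last hour,
	// built from the container metrics the kubelet exposes via cAdvisor.
	query := `sum(rate(container_cpu_usage_seconds_total{namespace="checkout",container!=""}[5m])) by (pod)`
	rng := v1.Range{
		Start: time.Now().Add(-1 * time.Hour),
		End:   time.Now(),
		Step:  time.Minute,
	}

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	result, warnings, err := promAPI.QueryRange(ctx, query, rng)
	if err != nil {
		log.Fatal(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	// A sudden jump here is either growth or a regression; comparing the numbers
	// against your pods' resource requests shows over- or under-provisioning.
	fmt.Println(result)
}
```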

6. Logging and Log Aggregation

Logging is important in the cloud native world because it helps your team capture, aggregate and understand system events. In a cloud native architecture, the number of incidents increases, and so does the volume of logs, which often are not correlated in a single system. This makes it difficult to find the data you need and troubleshoot the problem. While metrics are an important tool for diagnosing the symptom of an issue, you use traces to locate the problem, and logs are best suited to uncovering its root cause.

To keep logs under control in a Kubernetes environment, you need to be able to aggregate and filter data to reduce waste, save money and make it easier to locate the data you need in a timely manner.
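As a small illustration, here is a minimal sketch of structured JSON logging with Go’s standard log/slog package (Go 1.21+); consistent, queryable field names like these hypothetical ones are what let a collector such as Fluentd filter and correlate log lines across pods:

```go
package main

import (
	"log/slog"
	"os"
)

func main() {
	// JSON output on stdout is what cluster log collectors pick up and parse most easily.
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))

	// Put stable fields in structured attributes rather than burying them in the message.
	logger.Info("payment processed",
		"service", "checkout",
		"order_id", "A-10293",
		"duration_ms", 134,
	)

	logger.Error("payment gateway timeout",
		"service", "checkout",
		"order_id", "A-10294",
		"upstream", "payments-gateway",
	)
}
```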

7. Distributed Tracing

If you don’t instrument your code properly (see No. 3), you can’t support distributed tracing. Yet distributed tracing allows you to see what a request did throughout the system. It is how you determine that a single function is taking a very long time so you can dive deeper into why, preferably before it hurts the customer experience.
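As an illustration, here is a minimal sketch of OpenTelemetry tracing in Go, exporting spans over OTLP to whichever backend you chose in step 2; the service name, span name and collector endpoint are assumptions:

```go
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	ctx := context.Background()

	// Export spans over OTLP/gRPC; the target defaults to localhost:4317 and can be
	// pointed at a collector with OTEL_EXPORTER_OTLP_ENDPOINT.
	exporter, err := otlptracegrpc.New(ctx)
	if err != nil {
		log.Fatal(err)
	}
	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exporter))
	defer tp.Shutdown(ctx)
	otel.SetTracerProvider(tp)

	// One span per unit of work; propagating ctx to downstream calls is what links
	// their spans into a single distributed trace.
	tracer := otel.Tracer("checkout-service")
	ctx, span := tracer.Start(ctx, "chargeCustomer")
	// ... call the payment gateway here, passing ctx along ...
	_ = ctx
	span.End()
}
```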

8. Alerts and Notifications

After completing steps 1–7, a best practice is setting up alerts and notifications to send to yourself or your team. That way, if and when something goes wrong, someone can triage and fix the issue.

9. Follow Best Practices and Updates

This step is common sense. Future-proofing is hard with new updates coming out almost daily. Keep up with solution patching and observability best practices. Add automation when you can to eliminate time-consuming and error-prone manual processes.

10. Control Costs

The best observability platforms will help you control your cloud costs and observability spend. Choosing a solution such as Chronosphere, whose Control Plane gives you tools along the observability pipeline, allows your organization to:

Understand the data you have

Analyze what is useful and what is waste

Know whether people are alerting on a metric

See which labels cause the most cardinality, and therefore cost the most while being used the least

This type of transparency allows valuable, talented engineers to work on projects that are more impactful to the business. With cost controls in place, you can then begin to fine-tune: see how useful your data is, set quotas for teams based on observability spend, and perform cost-accounting trend analyses across teams running independent microservices. One concrete starting point is shown in the sketch below.
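As that starting point, here is a minimal sketch that reads Prometheus’s TSDB status endpoint (/api/v1/status/tsdb) to show which metric names and labels hold the most series; the Prometheus address is an assumption, and the fields returned can vary by Prometheus version:

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// tsdbStatus mirrors only the parts of the /api/v1/status/tsdb response used below.
type tsdbStatus struct {
	Data struct {
		SeriesCountByMetricName []struct {
			Name  string `json:"name"`
			Value uint64 `json:"value"`
		} `json:"seriesCountByMetricName"`
		LabelValueCountByLabelName []struct {
			Name  string `json:"name"`
			Value uint64 `json:"value"`
		} `json:"labelValueCountByLabelName"`
	} `json:"data"`
}

func main() {
	// Assumed Prometheus address; adjust for your environment.
	resp, err := http.Get("http://prometheus.monitoring:9090/api/v1/status/tsdb")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var status tsdbStatus
	if err := json.NewDecoder(resp.Body).Decode(&status); err != nil {
		log.Fatal(err)
	}

	fmt.Println("Top metrics by series count:")
	for _, s := range status.Data.SeriesCountByMetricName {
		fmt.Printf("  %-50s %d\n", s.Name, s.Value)
	}
	fmt.Println("Top labels by distinct values:")
	for _, s := range status.Data.LabelValueCountByLabelName {
		fmt.Printf("  %-50s %d\n", s.Name, s.Value)
	}
}
```

Pairing output like this with a list of which metrics your dashboards and alerts actually use tells you which high-cardinality labels are safe to drop or aggregate away.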
