Is it up?
I was asked a question this week about setting up alerts based on container or pod count. The person asking thought they had a problem, but what they actually highlighted was a couple of Sysdig's really powerful features, and why I get so excited about metadata labels! The real core of the problem, though, and the reason for sharing the story, is that it touches on one of the most fundamental changes in the way we judge what is healthy in a microservices, containerised world.
The health of a service is not determined by looking at whether the servers or components behind it are up
First, a word on segmentation and labels
Initially the question was simply: "I need to count how many pods are responsible for this service, but I can't seem to get a simple up/down count for a deployment." This is easy enough, and while we do have pod count metrics through kube-state-metrics, we actually provide more flexibility than a bare count for catching errors and failures early.
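If you ever want to sanity-check that raw number yourself, the Kubernetes API will give it to you directly. Here's a minimal sketch using the official Kubernetes Python client; the `production` namespace and the `app=my-service` label selector are assumptions, so substitute whatever labels your deployment actually uses.

```python
# A minimal sketch of counting the pods behind a deployment yourself,
# using the official Kubernetes Python client. The namespace and label
# selector are assumptions -- use the labels your deployment really has.
from kubernetes import client, config

config.load_kube_config()        # or config.load_incluster_config() when running in-cluster
v1 = client.CoreV1Api()

pods = v1.list_namespaced_pod(
    namespace="production",
    label_selector="app=my-service",
)
running = [p for p in pods.items if p.status.phase == "Running"]
print(f"{len(running)} of {len(pods.items)} pods are Running")
```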
When you create an alert in Sysdig, the first thing you can do is scope it: maybe you only want it to trigger for production workloads, or for one very specific application. This is really easy because we've already pulled all that information in from the orchestrator. For example:
- namespace.name = production
- deployment.name contains prod
- pod.name does not contain dev
- ...and many more
The metric here doesn't necessarily need to be pod or container count either; memory and CPU checks can be important for detecting failures early, or pre-empting them altogether.
The vast majority of our customers are using Kubernetes or a Kubernetes variant, so native support for all these labels provides a familiar, simplified interface for building very flexible observability and alerting. Many people I talk to ask whether they can automate this in their deployment pipelines, which of course they can using the API; and because labels let you segment and scope the alerts, a small number of rules can have a very large impact.
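As a rough illustration of what that pipeline automation might look like, here's a hedged sketch of creating a scoped alert through the Sysdig REST API from Python. The endpoint path, payload fields and condition syntax are assumptions for illustration rather than a definitive reference, so check the current API documentation (or the python-sdc-client library) before relying on them.

```python
# A hedged sketch of automating alert creation from a deployment pipeline.
# The endpoint path and payload fields are assumptions for illustration --
# consult the current Sysdig API docs (or python-sdc-client) for the real contract.
import os
import requests

SYSDIG_URL = "https://app.sysdigcloud.com"    # assumed SaaS endpoint
API_TOKEN = os.environ["SYSDIG_API_TOKEN"]    # your Sysdig Monitor API token

alert = {
    "alert": {
        "type": "MANUAL",
        "name": "High CPU on production workloads",
        "severity": 4,
        # Scope via the orchestrator labels discussed above.
        "filter": 'kubernetes.namespace.name = "production"',
        "condition": "avg(cpu.used.percent) > 90",
        "timespan": 600_000_000,              # evaluation window in microseconds (10 min, assumed)
        "enabled": True,
    }
}

resp = requests.post(
    f"{SYSDIG_URL}/api/alerts",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json=alert,
)
resp.raise_for_status()
print("Created alert:", resp.json()["alert"]["id"])
```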
Segmentation for multiple triggers
The same orchestrator labels can be used to segment alerts. Again, this is really powerful: instead of applying a single rule to a whole group, you can split it up by label groupings. So, for example, if you want to be alerted when any of your deployments' pod count drops to zero, you simply create an alert for 'pod.count = 0' and then segment by kubernetes.deployment.name. Now any time any deployment (existing now or created in the future) has a pod.count of zero for a defined time period, the alert fires, with no maintenance required.
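That zero-pods rule might look something like the fragment below, reusing the same (assumed) payload shape as the API sketch above. The field names and condition syntax are illustrative, but the key idea holds: one rule plus segmentation covers every deployment, present and future.

```python
# Illustrative only -- the payload shape mirrors the earlier API sketch and
# the exact field names / condition syntax are assumptions.
zero_pods_alert = {
    "alert": {
        "type": "MANUAL",
        "name": "Deployment has zero pods",
        "severity": 2,
        "condition": "pod.count = 0",                  # the condition described above
        "segmentBy": ["kubernetes.deployment.name"],   # one evaluation per deployment
        "segmentCondition": {"type": "ANY"},           # fire if any segment matches
        "timespan": 300_000_000,                       # sustained for 5 minutes (assumed)
        "enabled": True,
    }
}
```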
Both segmentation and metrics can be combined, so you could say: if CPU is above 90% and memory is over 95% for a period of 10 minutes, trigger the alert. Segmentation could then give you individual alerts for both Deployments and StatefulSets. And the alert's action can send notifications to external services, so you can easily kick off a workflow or simply notify your admins.
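To make the "for a period of 10 minutes" part concrete, here's a tiny standalone sketch (plain Python, no Sysdig API involved) of evaluating a compound CPU-and-memory condition over a window of samples. The thresholds and the one-sample-per-minute assumption are just for illustration.

```python
# Toy illustration of a compound, time-windowed alert condition:
# fire only if *every* sample in the window breaches both thresholds.
# Thresholds and the 1-minute sample interval are assumptions.
from dataclasses import dataclass

@dataclass
class Sample:
    cpu_percent: float
    mem_percent: float

def should_fire(window: list[Sample], cpu_limit: float = 90.0, mem_limit: float = 95.0) -> bool:
    return all(s.cpu_percent > cpu_limit and s.mem_percent > mem_limit for s in window)

# One sample per minute, so 10 samples ~= a 10-minute window.
window = [Sample(cpu_percent=93.0, mem_percent=96.5) for _ in range(10)]
if should_fire(window):
    print("Trigger the alert -> notify the on-call channel or kick off a workflow")
```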
But this isn't HEALTH
This is where Google's Site Reliability Engineering comes in, and I highly recommend reading it, even if you only skim through the chapters and concepts. In a microservices world, you really shouldn't care about individual components being up; it is largely irrelevant to the health of a service. To abuse my Oak Trees vs Corn Fields analogy: I don't care that individual corn plants die, I care that my total corn yield is enough to meet my demands.
Jump straight to the Monitoring Distributed Systems chapter of the SRE book and you'll find their definition of the four main metrics that help you understand the health of a service. These form the Golden Signals (there's a small sketch after the list of how you might derive them from raw data):
- Latency - Is the response time to my users acceptable?
- Traffic - Is the load what I expect? (No load is often worse than high load)
- Errors - Fairly self-explanatory, but taken out of the context of Traffic this can often be misleading
- Saturation - I often refer to this as stress; it's usually a combination of things like CPU, memory, network and disk metrics, and it shows how much of the available resources have been consumed. In a container world, this is definitely not the same as host resources (I won't cover reservations and limits here, but that's essentially what's at play)
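To make the four signals a little more concrete, here's a toy sketch of deriving them from a batch of raw request records plus a couple of resource gauges. The record shape, the p99 choice and the saturation formula are all assumptions for illustration; in practice Sysdig derives these for you from the metrics and labels it already collects.

```python
# Toy derivation of the four Golden Signals from raw request records.
# The record shape (latency_ms, status) and the resource gauges are
# assumptions for illustration only.
from dataclasses import dataclass
from statistics import quantiles

@dataclass
class Request:
    latency_ms: float
    status: int          # HTTP status code

def golden_signals(reqs: list[Request], window_s: float,
                   cpu_percent: float, mem_percent: float) -> dict:
    latencies = [r.latency_ms for r in reqs]
    errors = sum(1 for r in reqs if r.status >= 500)
    return {
        # Latency: is response time acceptable? (p99 chosen as an example)
        "latency_p99_ms": quantiles(latencies, n=100)[98] if len(latencies) > 1 else 0.0,
        # Traffic: is the load what we expect?
        "requests_per_s": len(reqs) / window_s,
        # Errors: always read alongside traffic, never in isolation
        "error_rate": errors / len(reqs) if reqs else 0.0,
        # Saturation: how much of the available resources are consumed
        "saturation": max(cpu_percent, mem_percent) / 100.0,
    }

sample = [Request(latency_ms=120.0, status=200)] * 95 + [Request(latency_ms=900.0, status=503)] * 5
print(golden_signals(sample, window_s=60.0, cpu_percent=72.0, mem_percent=81.0))
```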
In Sysdig, once again combining the huge amount of cloud-native metrics we collect with the labels we ingest from the orchestrator, we can quickly build really comprehensive but easy-to-read dashboards of service health. In the screenshot below we segment by Deployment to show a per-deployment view of health, and using the Explore tree on the left we can narrow the scope down to just what is relevant to the user.
This is our default Golden Signals dashboard. You can see the graphs segmented by Deployment, and these are easily customised, or the scope narrowed, to make them relevant to your users, even creating different views for different teams.
From this table you can also see our APM-like capability of showing the Golden Signal metrics per endpoint, so you can easily assess the health of web services (including APIs). Scroll down on this same page and you get a view of the saturation of resources: CPU, memory, network, etc. This single page gives you full insight into the health of your services, and you don't need to have any idea (or care) how many individual components are behind the scenes supporting them. Of course you can go and see that, but you only need to when an anomaly appears on this dashboard.
It turns out that a native understanding of microservices is really important when assessing the _health_ of those microservices.
Quick Demo
Find out more
Drop me or one of my colleagues a note, or head over to Sysdig.com to learn a bit more and sign up for a free trial.