Is it up?
I was asked a question this week about setting up alerts based on container or pod count. The person asking thought they had a problem, but what they actually highlighted was a couple of Sysdig's really powerful features, and why I get so excited about metadata labels! The real core of the problem, though, and the reason for sharing the story, is that it touches on one of the most fundamental changes in the way we judge what is healthy in a microservices, containerised world.
The health of a service is not determined by looking at whether the servers or components behind it are up
First, a word on segmentation and labels
Initially the question was simply: "I need to count how many pods are responsible for this service, but I can't seem to get a simple up/down count for a deployment." This is easy enough, and while we do have pod count metrics through kube-state-metrics, we actually provide more flexibility than a bare count for catching errors and failures early.
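If you ever want to sanity-check that raw number yourself, the Kubernetes API will give it to you directly. Here's a minimal sketch using the official Kubernetes Python client; the `production` namespace and the `app=my-service` label selector are assumptions, so substitute whatever labels your deployment actually uses.

```python
# A minimal sketch of counting the pods behind a deployment yourself,
# using the official Kubernetes Python client. The namespace and label
# selector are assumptions -- use the labels your deployment really has.
from kubernetes import client, config

config.load_kube_config()        # or config.load_incluster_config() when running in-cluster
v1 = client.CoreV1Api()

pods = v1.list_namespaced_pod(
    namespace="production",
    label_selector="app=my-service",
)
running = [p for p in pods.items if p.status.phase == "Running"]
print(f"{len(running)} of {len(pods.items)} pods are Running")
```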
When you create an alert in Sysdig, the first thing you can do is scope it: maybe you only want it to trigger for production workloads, or for one very specific application. This is really easy because we've already pulled all that information in from the orchestrator. For example:
- namespace.name = production
- deployment.name contains prod
- pod.name does not contain dev
- ...and many more
The metric here doesn't necessarily need to be pod or container count either; memory and CPU checks can be important for detecting failures early, or pre-empting them altogether.
The vast majority of our customers are using Kubernetes or a Kubernetes variant, so native support for all these labels provides a familiar, simplified interface for building very flexible observability and alerting. Many people I talk to ask whether they can automate this in their deployment pipelines, which of course they can using the API; and because labels let you segment and scope the alerts, a small number of rules can have a very large impact.
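As a rough illustration of what that pipeline automation might look like, here's a hedged sketch of creating a scoped alert through the Sysdig REST API from Python. The endpoint path, payload fields and condition syntax are assumptions for illustration rather than a definitive reference, so check the current API documentation (or the python-sdc-client library) before relying on them.

```python
# A hedged sketch of automating alert creation from a deployment pipeline.
# The endpoint path and payload fields are assumptions for illustration --
# consult the current Sysdig API docs (or python-sdc-client) for the real contract.
import os
import requests

SYSDIG_URL = "https://app.sysdigcloud.com"    # assumed SaaS endpoint
API_TOKEN = os.environ["SYSDIG_API_TOKEN"]    # your Sysdig Monitor API token

alert = {
    "alert": {
        "type": "MANUAL",
        "name": "High CPU on production workloads",
        "severity": 4,
        # Scope via the orchestrator labels discussed above.
        "filter": 'kubernetes.namespace.name = "production"',
        "condition": "avg(cpu.used.percent) > 90",
        "timespan": 600_000_000,              # evaluation window in microseconds (10 min, assumed)
        "enabled": True,
    }
}

resp = requests.post(
    f"{SYSDIG_URL}/api/alerts",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json=alert,
)
resp.raise_for_status()
print("Created alert:", resp.json()["alert"]["id"])
```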
Segmentation for multiple triggers
The same orchestrator labels can be used to segment alerts. Again, this is really powerful: instead of applying a single rule to a whole group, you can split it up by label groupings. So, for example, if you want to be alerted when any of your deployments' pod count drops to zero, you simply create an alert for 'pod.count = 0' and then segment by kubernetes.deployment.name. Now any time any deployment (existing now or created in the future) has a pod.count of zero for a defined time period, the alert fires, with no maintenance required.
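That zero-pods rule might look something like the fragment below, reusing the same (assumed) payload shape as the API sketch above. The field names and condition syntax are illustrative, but the key idea holds: one rule plus segmentation covers every deployment, present and future.

```python
# Illustrative only -- the payload shape mirrors the earlier API sketch and
# the exact field names / condition syntax are assumptions.
zero_pods_alert = {
    "alert": {
        "type": "MANUAL",
        "name": "Deployment has zero pods",
        "severity": 2,
        "condition": "pod.count = 0",                  # the condition described above
        "segmentBy": ["kubernetes.deployment.name"],   # one evaluation per deployment
        "segmentCondition": {"type": "ANY"},           # fire if any segment matches
        "timespan": 300_000_000,                       # sustained for 5 minutes (assumed)
        "enabled": True,
    }
}
```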
Both segmentation and metrics can be combined, so you could say: if CPU is above 90% and memory is over 95% for a period of 10 minutes, trigger the alert. Segmentation could then give you individual alerts for both Deployments and StatefulSets. And the alert's action can send notifications to external services, so you can easily kick off a workflow or simply notify your admins.
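To make the "for a period of 10 minutes" part concrete, here's a tiny standalone sketch (plain Python, no Sysdig API involved) of evaluating a compound CPU-and-memory condition over a window of samples. The thresholds and the one-sample-per-minute assumption are just for illustration.

```python
# Toy illustration of a compound, time-windowed alert condition:
# fire only if *every* sample in the window breaches both thresholds.
# Thresholds and the 1-minute sample interval are assumptions.
from dataclasses import dataclass

@dataclass
class Sample:
    cpu_percent: float
    mem_percent: float

def should_fire(window: list[Sample], cpu_limit: float = 90.0, mem_limit: float = 95.0) -> bool:
    return all(s.cpu_percent > cpu_limit and s.mem_percent > mem_limit for s in window)

# One sample per minute, so 10 samples ~= a 10-minute window.
window = [Sample(cpu_percent=93.0, mem_percent=96.5) for _ in range(10)]
if should_fire(window):
    print("Trigger the alert -> notify the on-call channel or kick off a workflow")
```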
But this isn't HEALTH
This is where Google's Site Reliability Engineering comes in, and I highly recommend reading it, even if you only skim through the chapters and concepts. In a microservices world, you really shouldn't care about individual components being up; it is largely irrelevant to the health of a service. To abuse my Oak Trees vs Corn Fields analogy: I don't care that individual corn plants die, I care that my total corn yield is enough to meet my demands.
Jump straight to the Monitoring Distributed Systems chapter of the SRE book and you'll find their definition of the four main metrics that help you understand the health of a service. These form the Golden Signals (there's a small sketch after the list of how you might derive them from raw data):
- Latency - Is the response time to my users acceptable?
- Traffic - Is the load what I expect? (No load is often worse than high load)
- Errors - Fairly self-explanatory, but taken out of the context of Traffic this can often be misleading
- Saturation - I often refer to this as stress; it's usually a combination of things like CPU, memory, network and disk metrics, and it shows how much of the available resources have been consumed. In a container world, this is definitely not the same as host resources (I won't cover reservations and limits here, but that's essentially what's at play)
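To make the four signals a little more concrete, here's a toy sketch of deriving them from a batch of raw request records plus a couple of resource gauges. The record shape, the p99 choice and the saturation formula are all assumptions for illustration; in practice Sysdig derives these for you from the metrics and labels it already collects.

```python
# Toy derivation of the four Golden Signals from raw request records.
# The record shape (latency_ms, status) and the resource gauges are
# assumptions for illustration only.
from dataclasses import dataclass
from statistics import quantiles

@dataclass
class Request:
    latency_ms: float
    status: int          # HTTP status code

def golden_signals(reqs: list[Request], window_s: float,
                   cpu_percent: float, mem_percent: float) -> dict:
    latencies = [r.latency_ms for r in reqs]
    errors = sum(1 for r in reqs if r.status >= 500)
    return {
        # Latency: is response time acceptable? (p99 chosen as an example)
        "latency_p99_ms": quantiles(latencies, n=100)[98] if len(latencies) > 1 else 0.0,
        # Traffic: is the load what we expect?
        "requests_per_s": len(reqs) / window_s,
        # Errors: always read alongside traffic, never in isolation
        "error_rate": errors / len(reqs) if reqs else 0.0,
        # Saturation: how much of the available resources are consumed
        "saturation": max(cpu_percent, mem_percent) / 100.0,
    }

sample = [Request(latency_ms=120.0, status=200)] * 95 + [Request(latency_ms=900.0, status=503)] * 5
print(golden_signals(sample, window_s=60.0, cpu_percent=72.0, mem_percent=81.0))
```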
In Sysdig, once again combining the huge amount of cloud-native metrics we collect with the labels we ingest from the orchestrator, we can quickly build really comprehensive but easy-to-read dashboards of service health. In the screenshot below we segment by Deployment to show a per-deployment view of health, and using the Explore tree on the left we can narrow the scope down to just what is relevant to the user.
This is our default Golden Signals dashboard. You can see the graphs segmented by Deployment, and these are easily customised, or the scope narrowed, to make them relevant to your users, even creating different views for different teams.
From this table you can also see our APM-like capability of showing the Golden Signal metrics per endpoint, so you can easily assess the health of web services (including APIs). Scroll down on this same page and you get a view of the saturation of resources: CPU, memory, network, etc. This single page gives you full insight into the health of your services, and you don't need to have any idea (or care) how many individual components are behind the scenes supporting them. Of course you can go and see that, but you only need to when an anomaly appears on this dashboard.
It turns out that a native understanding of microservices is really important when assessing the _health_ of those microservices.
Quick Demo
Find out more
Drop me or one of my colleagues a note, or head over to Sysdig.com to learn a bit more and sign up for a free trial.