Are my systems "Observable"?
Kaushik Banerjee ( He/Him/His )
SVP | Autonomous & Accountable DevOps, APAC SRE Head for Trading Tech | Execution, Empathy & Unleashing Team's Potential | I help organizations reduce toil, MTTR & MTTD while improving resiliency & reliability
People often ask or wonder: we have so much (too much?) alerting and monitoring, so are my systems what they call "observable"?
Maybe, maybe not. Let's explore.
Monitoring: The gathering of surface-level data points (in legacy systems, monitoring may mostly be events-based alerting). In very simple cases, some of these isolated data points or alerts can tell you the cause of your system failure, e.g. hardware offline or database crashed.
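As a rough illustration of this kind of events-based alerting, here is a minimal sketch in Python. The host names, metric names, and threshold values are all hypothetical and chosen only for illustration; a real monitoring tool would pull readings from agents and route alerts somewhere.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    host: str      # hypothetical host name
    metric: str    # hypothetical metric name, e.g. "cpu_percent"
    value: float


# Hypothetical static thresholds: one fixed limit per metric.
THRESHOLDS = {"cpu_percent": 90.0, "disk_used_percent": 85.0}


def check(readings):
    """Return one alert string for each reading that exceeds its threshold."""
    alerts = []
    for r in readings:
        limit = THRESHOLDS.get(r.metric)
        if limit is not None and r.value > limit:
            alerts.append(f"{r.host}: {r.metric}={r.value} exceeds {limit}")
    return alerts


readings = [
    Reading("db-01", "cpu_percent", 97.5),
    Reading("db-01", "disk_used_percent", 40.0),
    Reading("app-02", "cpu_percent", 55.0),
]
alerts = check(readings)
for alert in alerts:
    print(alert)
```

Each alert here is an isolated data point: it says that one value crossed one line on one host, and nothing about why.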
Visibility: Understanding the various components in your system in isolation: visibility of your servers, of your networks, of your market data, of your distributed devices.
Observability: Understanding the internal state of a system from its surface-level information, i.e. from the data it is emitting.
It puts the above two (monitoring and visibility) together and contextualizes them by adding more layers.
It gives a holistic view of the entire system and/or ecosystem. It includes (but is not limited to) logs, traces (especially in distributed systems), metrics, and machine learning.
( Image credit, OpsRamp).
Centralized monitoring of the right data points, across all your devices and environments, is the foundation of observability.
Juxtaposing algorithmic real-time log analysis with centralized monitoring, visibility of the entire ecosystem, and tracing of distributed systems will go a long way toward providing observability in our systems.
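One way to picture this juxtaposition is to group signals from different sources by a shared request identifier. The sketch below is only an assumption about how that could look: the `trace_id`, `service`, and `self_ms` fields (time spent in a service excluding downstream calls) and the sample records are hypothetical, not taken from any particular tool.

```python
from collections import defaultdict

# Hypothetical log lines and trace spans that share a trace_id.
logs = [
    {"trace_id": "t1", "service": "orders", "level": "ERROR",
     "msg": "timeout calling pricing"},
    {"trace_id": "t2", "service": "orders", "level": "INFO", "msg": "ok"},
]
spans = [
    {"trace_id": "t1", "service": "orders", "self_ms": 20},
    {"trace_id": "t1", "service": "pricing", "self_ms": 5000},
    {"trace_id": "t2", "service": "orders", "self_ms": 35},
]


def build_request_views(logs, spans):
    """Group logs and spans by trace_id so each request can be read as one story."""
    views = defaultdict(lambda: {"logs": [], "spans": []})
    for entry in logs:
        views[entry["trace_id"]]["logs"].append(entry)
    for span in spans:
        views[span["trace_id"]]["spans"].append(span)
    return dict(views)


def slowest_service(view):
    """Return the service with the largest self time within one request."""
    return max(view["spans"], key=lambda s: s["self_ms"])["service"]


views = build_request_views(logs, spans)
for trace_id, view in views.items():
    errors = [e["msg"] for e in view["logs"] if e["level"] == "ERROR"]
    print(trace_id, "slowest:", slowest_service(view), "errors:", errors)
```

On its own, the error log only says that `orders` timed out. Laid over the spans for the same request, it points at `pricing` as the place where the time went.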
Applying ML to these will provide actionable insights that allow DevOps/SRE/ITOps teams to increase the stability of their systems. With a virtuous cycle of the above and improving SLIs, your SLOs and SLAs should be achievable.
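To make the SLI and SLO relationship concrete, here is a small worked sketch. It assumes a simple availability SLI (good requests divided by total requests) and a hypothetical 99.9% SLO target; the request counts are made up for illustration.

```python
def availability_sli(good: int, total: int) -> float:
    """Availability SLI: fraction of requests that were served successfully."""
    return good / total if total else 1.0


def error_budget_remaining(sli: float, slo: float) -> float:
    """Fraction of the error budget still unspent.

    1.0 means untouched, 0.0 means exhausted, negative means the SLO is breached.
    """
    budget = 1.0 - slo   # failure rate the SLO allows
    spent = 1.0 - sli    # failure rate actually observed
    return 1.0 - spent / budget


# Hypothetical month: 1,000,000 requests, 600 of which failed.
sli = availability_sli(good=999_400, total=1_000_000)
remaining = error_budget_remaining(sli, slo=0.999)
print(f"SLI={sli:.4%}, error budget remaining={remaining:.0%}")
```

With these numbers the SLO allows 1,000 failed requests and 600 have been used, so roughly 40% of the error budget is left. Tracking that figure over time is one way to see whether the SLI is improving toward the SLO.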
To visualize this, imagine separate drawings on sheets of tracing paper. One has a sun, one a palm tree, one a lake, one some boats. By themselves, each is a correct data point but doesn't tell you much. Lay them on top of each other and they form a story, a complete picture of a sunset on a lake.
To take another example.
Imagine your system were a person with communication problems (say, they speak a different language or are mute) who is therefore unable to tell you if anything is wrong.
We check their temperature. It's a bit high. Is something wrong?
They indicate that their left arm feels a bit tingly sometimes (like your intermittent connection errors from various systems).
So we SUSPECT something might be wrong, but we don't know how wrong, or whether it even merits any action (and if yes, what action?).
But if the person could talk and properly describe everything they are feeling (i.e. if we had proper observability), they could have told us about some numbness, a bit of dizziness, the left leg and arm intermittently not responding, and hazy vision. That would have told us there is a high chance the person has had a stroke, and we could take emergency measures accordingly.
So a perfectly observable system is one whose complete internal state can be understood just from the data (and the patterns in that data) the system provides.
In such a system, you can tell straight away (and maybe even see it coming from a few miles away) whether a slow response is due to calls going into loops, failed servers, memory exhaustion, or even network/switch-level issues.
The right monitoring is at its core, but observability is much more than that. Plain event- and threshold-based alerting is not a complete toolkit for complex systems.