A Zen of monitoring
This article is not a tutorial but a philosophical reflection on a question that many professionals involved in creating or using monitoring systems ask: “Why are we doing what we are doing, and how are we doing it?”
Introduction
What I say in this blog post may sound strange or incomplete, but this is how the problem presented itself to me. I do not know the ultimate answer to “what is monitoring and how to do it properly.” I doubt anyone knows an answer to this question that crosses all the t's and dots all the i's. Still, this post is a somewhat silly attempt to clarify some points. If some of my thoughts prove useful to you as well, I will be glad that I spent the time bringing my arguments on this matter together. Of course, as I see it, a side effect of this article is to raise more questions and dive deeper into this fascinating issue. And now, without further ado.
Humble History
When speaking about IT monitoring and observability, many IT professionals make the same mistake. They give the impression that the idea of monitoring and observability is something:
None of those statements is true. Monitoring has been a part of human activity for centuries. Whenever there is a process, someone usually observes it to make sure that it is carried through. When IT monitoring and observability first emerged as a topic, they were treated like any other form of monitoring and observability, and they shouldn't be treated any differently today, because nothing has changed in how human beings connect with their surroundings. We can create better tools, but we have yet to change the underlying mechanics of the eye or how the human brain processes input about those surroundings. With this idea in mind, let us try to answer a straightforward question:
Why are we doing that?
There are many answers to that question in the IT crowd, but let's think for a second:
I brought up those few examples of human activities to make a point. The purpose of all those actions is to keep driving some process: the fire, the ship's movements, the safety of the engine, the control of the airplane. So, every time we think of “monitoring” or “observability,” we need to think, “This is all for the control of some process.” It is not for personal curiosity, not to satisfy some external requirement without questioning “why,” and not merely to establish a fact without putting that fact into proper context. The foremost task of any “monitoring” and “observability” is therefore “Control.” Everything else is secondary. Even if we are monitoring and instrumenting a scientific experiment, the primary task is to keep the process under control to the maximum extent, and only then to gain scientific data. Having answered the first and probably most important question, we must ask ourselves:
How are we doing that?
At this point, there will be no shortage of answers. We will hear about how we collect and compute the data, about thresholds and aggregations, statistical computation, and visualization. But let us step back and think: “What does controlling actually involve? How can we be certain that the process is controllable? How shall we organize our observability so that each element in this effort matters?”
And again, I am stepping back from the temptation to propose a multiple-choice solution, and trying instead to dig to the root of the idea of “observability.” What are we observing while seeking control over something? Every time we build a fire, we watch that it neither dies nor gets out of control. The whole purpose of seamanship is to deliver people and cargo across the water, and while doing that, you deal with the unfavorable conditions that could prevent the delivery. Whenever you control some process or mechanism, you watch what you are managing and do everything to complete the task by removing obstacles. So, the method of control is detecting and preventing the barriers that stand in the way of a process and may block it from fulfilling its purpose. This means that to keep something under control, we have to detect not the problems themselves but the traits that lead to them. If the train boiler has already disintegrated into smithereens, we can safely say that its current state is beyond any control. But what are those traits? What are the indicators that an issue is about to occur? In a dynamic environment, which is characteristic of any process we seek to control, those traits are revealed through the collection and observation of patterns by various methods.
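To make the difference between detecting a problem and detecting the trait that precedes it a little more concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions, not a recommended implementation: the function names, the pressure readings, and the limit are all hypothetical. Instead of asking “has the boiler pressure crossed the limit?”, it fits a straight line to recent readings and asks “at this rate, how soon will it?”

```python
from typing import Optional, Sequence


def slope_per_sample(values: Sequence[float]) -> float:
    """Least-squares slope of the values against their index (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def samples_until_limit(values: Sequence[float], limit: float) -> Optional[float]:
    """Estimate how many more samples remain until the trend reaches `limit`.

    Returns 0.0 if the last value is already at or past the limit (the
    problem has happened), None if the trend is flat or falling, and a
    positive estimate otherwise (the trait that precedes the problem).
    """
    if not values:
        return None
    last = values[-1]
    if last >= limit:
        return 0.0
    slope = slope_per_sample(values)
    if slope <= 0:
        return None
    return (limit - last) / slope


# Hypothetical boiler pressure readings, one per minute; the limit is made up.
pressure = [10.0, 10.5, 11.0, 11.5, 12.0]
eta = samples_until_limit(pressure, limit=15.0)
print(eta)
```

A plain threshold check on the same readings would stay silent, because 12.0 is still below 15.0. The trend estimate gives the observer time to act while the process is still controllable, which is the whole point. A straight line is, of course, the crudest possible pattern; real telemetry usually needs something richer.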
So, if we look at the root of the problem, monitoring and observability is a process of constantly searching for and detecting patterns in the data. For the user who is trying to keep some process under control, the observability platform helps catch the “fingerprints” in the data that may lead to concerns, as well as the patterns that lead to the restoration of normality. While observing repeatable patterns is the primary tool for controlling a process, more than just patterns is needed. But why? What's wrong with just the patterns? Detecting potential issues by searching for known (or barely known) traits in telemetry removes a significant burden from the observer. Still, neither the observer nor we know everything. So, as a second line of observation, we may establish relationships between known patterns and other events in the system. Because “all good” and “all bad” are related, one observation usually leads to another. Without those relationships, you would first have to learn how to detect each thing on its own. What is the summary, and what can we do correctly or incorrectly? Let me recap my thoughts and come to a conclusion:
And now, let me give a counterexample of how we can get monitoring wrong:
This is all I can say about this very intricate engineering problem. I am not claiming that I am 100% correct. I do claim that these reflections are the result of years of observing monitoring and observability as an IT and industrial discipline, and of observing how engineers perceive it.