There is no spoon
Every joke is rooted in the human need to tell facts and a story the way the teller sees fit. And a good joke lands not for a few, but for many. With that said, let me tell you an old joke that grew out of some historical facts about the Soviet Union, survived the years and the Soviet Union itself, and is still somewhat alive and relevant today.
At an international exhibition, the Soviet Union announced the first of its kind: a mechanized, robotized barber machine. Curious spectators visited the venue and observed an enormous gray box with a head-sized hole in the middle. Most asked: “How does this work?” The dude in a white lab coat standing by the machine explained that this new, breakthrough invention removes the need for human barbers to cut anyone's hair. The apparatus can be installed anywhere and even dropped into a combat zone on a parachute. All you need to do is put your head into the hole, and the machine will do the rest. In the end, your hair will be trimmed appropriately. “But doesn't everyone have a differently shaped head? How does the machine deal with that?” the spectators asked curiously. “Oh, no problem,” the cheerful presenter smiled. “The difference will persist only until the first cutting session. After that, there will be no difference.”
As with any joke, this one carries a hidden message, and of course, the real point is not about cutting hair. It is about unifying views and ways of thinking in what you might call a “totalitarian state.”
How could this be related to the Art of Monitoring and Observability? In 2016, Beyer, B., Jones, C., Murphy, N. & Petoff, J. published the book “Site Reliability Engineering: How Google Runs Production Systems,” which introduced the idea of the “Four Golden Signals.”
In general, this book is handy for a practicing SRE: it has some very decent tips on risk management, toil elimination, tracking, troubleshooting, and much more. But it is not an unquestionable “Bible.” Every SRE should read this book and use what applies to their own practice.
One of the most questionable statements in this book is the idea of the “golden signals.” They are latency, traffic, errors, and saturation.
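To make the four categories concrete before questioning them, here is a minimal sketch, assuming a hypothetical service that records per-request duration, response size, and status code. The field names, the 5xx error criterion, and the capacity constant are illustrative assumptions, not definitions taken from the SRE book.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class Request:
    duration_ms: float  # how long the request took
    bytes_sent: int     # response size
    status: int         # HTTP status code


MAX_INFLIGHT = 100      # assumed concurrency capacity of the service


def golden_signals(window: list[Request], inflight_now: int, window_s: float) -> dict:
    """Compute the four classic signals over a window of requests (illustrative only)."""
    durations = [r.duration_ms for r in window]
    # Latency: p99 of request durations (fall back gracefully for tiny windows)
    if len(durations) >= 2:
        p99 = quantiles(durations, n=100)[98]
    else:
        p99 = durations[0] if durations else 0.0
    return {
        "latency_p99_ms": p99,
        # Traffic: either requests per second or bytes per second
        "traffic_rps": len(window) / window_s,
        "traffic_bytes_per_s": sum(r.bytes_sent for r in window) / window_s,
        # Errors: share of requests that failed (5xx here, by assumption)
        "error_ratio": sum(r.status >= 500 for r in window) / max(len(window), 1),
        # Saturation: how "full" the service is against its assumed capacity
        "saturation": inflight_now / MAX_INFLIGHT,
    }


if __name__ == "__main__":
    demo = [Request(12.0, 512, 200), Request(840.0, 128, 503), Request(35.0, 2048, 200)]
    print(golden_signals(demo, inflight_now=42, window_s=60.0))
```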
You may ask: why do I see this selection of signal categories as problematic? As in the opening joke, there are “different heads,” and that difference must be preserved even after the “first cut.” The problem is that many people, including many observability practitioners, trust Google's opinion without question or a shred of doubt. So when you try to apply those four supposedly universal categories to every IT practice and monitoring need, you sometimes have to “hammer them in” by force. Why? This is where the difficulties begin.
First, there are multiple personas in the observability business, and those personas serve different needs and provide different services. The “golden signals,” as Google defines them, suit a very narrow slice of IT professionals, primarily those in “Site Reliability Engineer” roles responsible for “reliability.” The signals are not “golden” even for many tasks a traditional SRE is frequently accountable for. Here is a roster of some functions that require access to observability data and are not covered by the “golden signals” categories:
And this non-exhaustive list of tasks does not even touch the needs of:
And this only scratches the surface. Numerous “personas” in the IT business have extremely diverse ideas about what a “golden signal” is in a particular context and for specific, sometimes loosely defined purposes. Those ideas are rooted in the fact that different metrics, and sometimes calculated, compound metrics, make more sense for specific tasks, as the sketch below illustrates.
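As a small illustration of such a compound metric, here is a hedged sketch for a capacity-planning task: a “days until storage is exhausted” projection derived from two simpler measurements. The function name, parameters, and numbers are hypothetical.

```python
def days_until_full(used_gb: float, capacity_gb: float, daily_growth_gb: float) -> float:
    """Project how many days remain before storage is exhausted,
    assuming the current linear growth rate continues."""
    if daily_growth_gb <= 0:
        return float("inf")  # not growing: no projected exhaustion
    return (capacity_gb - used_gb) / daily_growth_gb


# Example: 1.2 TB used of 2 TB, growing ~15 GB/day -> roughly 53 days of headroom.
print(round(days_until_full(used_gb=1200, capacity_gb=2000, daily_growth_gb=15), 1))
```

For a capacity planner, this single derived number is often far more actionable than any of the four raw “golden” signals on their own.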
And what are the classic Google “golden metrics” good for? As mentioned before, latency, traffic, errors, and saturation are well suited only to measuring reliability.

Latency is a time measure of some operation or request. When a “persona” is responsible for “reliability,” latency is a critical KPI. But for other purposes, such as capacity planning, latency is a secondary KPI, and capacity-based metrics become more relevant.

Traffic is a non-descriptive measure of either bytes or requests sent to some endpoint. When you are taking care of “reliability,” traffic is usually directly related to latency and gives you an idea of how well your endpoints handle the load. A DevOps “persona's” interests, however, are usually not directly connected to measuring load, so for this “persona” traffic is secondary telemetry.

Errors, while critical for operational “reliability” tasks, are secondary at best for capacity planning and resource management. They are also secondary for a SecOps “persona”: for this class of IT personnel, analysis of signals and patterns is more crucial.

Saturation, though, is the class of KPI we can call “most universal” across different personas. Most “personas” need to observe some exhaustible resource as a primary task, so saturation is, more or less, a universal “golden signal.”
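To summarize that argument as data, here is a rough sketch of how a few of these personas might rank the four classic signals. The personas follow the discussion above, but the exact rankings are illustrative assumptions, not a standard taxonomy.

```python
PRIMARY, SECONDARY = "primary", "secondary"

# Illustrative ranking of the four classic signals per persona (assumed values).
SIGNAL_PRIORITY = {
    "reliability":       {"latency": PRIMARY,   "traffic": PRIMARY,
                          "errors": PRIMARY,    "saturation": PRIMARY},
    "capacity_planning": {"latency": SECONDARY, "traffic": PRIMARY,
                          "errors": SECONDARY,  "saturation": PRIMARY},
    "secops":            {"latency": SECONDARY, "traffic": SECONDARY,
                          "errors": SECONDARY,  "saturation": PRIMARY},
}


def primary_signals(persona: str) -> list[str]:
    """Return the signals a given persona would treat as first-class."""
    return [s for s, rank in SIGNAL_PRIORITY[persona].items() if rank == PRIMARY]


for persona in SIGNAL_PRIORITY:
    print(f"{persona}: {', '.join(primary_signals(persona))}")
```

Running it shows that only saturation appears in every persona's primary list, which is exactly the point.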
So what is the verdict? What can you take away from this short article? Foremost, the “golden signals” as defined in Google's “SRE Book” are not universal across the board. Different environments and different “personas” call for different subsets of metrics and categories, ones that give an adequate, clear view of a specific problem or series of problems through the metrics and classes that fit best. And yes, just as there are no universal “golden metrics,” we can say with certainty that building a computerized barber machine that accounts for the differences between human beings is a task that is not only far from easy but on the very brink of feasibility and practicality.