In search of the unknown in the data
What are we looking for?
In the business of observability, we try to comprehend the processes happening in the system under observation through the use of telemetry data. Different types and classes of telemetry data tell us different stories, and you want to pay attention to all of them to get a complete picture. But the first problem you will face head-on is that it is almost impossible to predefine and pre-determine all variations and patterns in the data and, thus, recognize and comprehend all the stories your source is telling you. Occasionally, you are not even ready to acknowledge a story hidden in the data: because you are not aware of what some pattern means, you cannot even see and immediately recognize that pattern.
So, what to do? What can you use to help recognize the patterns and the stories associated with them? How do you start?
What is an anomaly?
You may already hear a whisper in the air: “search for the anomalies in the data.” And while this suggestion partially addresses the issue, it is not a ready-to-use solution, because before you can treat it as a solution, you must understand what an “anomaly” is.
In essence, an anomaly in telemetry data is a combination of telemetry values that is observed and categorized for the first time at the present moment. The same combination is not an anomaly if you observed this pattern in the telemetry data before but did not categorize it; such uncategorized repeated data is called an “uncategorized pattern.” So, you can see that you are dealing with one of two cases while observing telemetry data:

- a previously observed but uncategorized pattern;
- a previously unobserved pattern.
And in this essay, we will discuss the importance of the “second type”: identifying previously unobserved patterns.
Previously unobserved pattern
While I've touched on the importance of finding “unknown” data in the telemetry stream, I did not focus on the practicality of searching for such data. Why should we look for it?
In another article I've published, called “Zen of monitoring,” I emphasized that you must develop a habit of searching for changes in the values of the telemetry data stream, and that there is no such thing as predefined “golden signals.” But let's talk about patterns first: what do you do if you do not have them yet? How do you start to detect and define them? The answer to those questions is this: observe the “unknown” values in the telemetry stream. “Unknown” values represent a previously unobserved pattern. Whenever you observe a value that is “out of line,” it might indicate a new pattern. Or, sometimes, just a glitch in the data; I will talk about detecting glitches in some other essay.

But how can you notice new and unseen data when you have multiple sources and thousands of telemetry items? There are quite a few mathematical instruments you can use, and I will present a straightforward one you can start with. I call it a “Coefficient-based filter.” It may already have a fancier name, but let's use my definition in this essay.
Filtering your data
Imagine you are prospecting for gold. You have a river, sand, and a pan. You put sand in the pan and “wash” it in the flowing water. The stream carries the lightweight sand away, and the gold, heavier than the sand, collects at the bottom of the pan. That is essentially what we will do with our telemetry data: we will let “the sand telemetry” flow away and make “the gold telemetry” known to us.
So, we will build a filter in which telemetry values similar to ones we've seen before signal TRUE, meaning “yes! You've seen something like this before,” and unseen data triggers FALSE, an indication that you potentially have something new. The idea behind such a filter is elementary: we take a sample of telemetry data and raise an alert when a value arrives that is not in this set. But that filter alone would not be beneficial, as numeric data rarely matches 100%. So, we need to build a filter that detects “that and similar values.” For a single value, the implementation is trivial: surround the value with an interval whose width is proportional to the value itself and controlled by a coefficient.
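As a sketch of that single-value case, here is a minimal Python illustration. The formula value × (1 ± C) is an assumption I inferred from the example output later in this essay, where a Coefficient of 0.2 turns the value 1.0 into the interval (0.8, 1.2):

```python
def interval(value: float, coeff: float) -> tuple[float, float]:
    # Interval of values "similar" to `value`: its width is
    # proportional to the value itself and controlled by `coeff`.
    # Assumed form: (value*(1-coeff), value*(1+coeff)).
    return (value * (1 - coeff), value * (1 + coeff))


def is_similar(candidate: float, value: float, coeff: float) -> bool:
    # True when `candidate` falls strictly inside the interval around `value`.
    lo, hi = interval(value, coeff)
    return lo < candidate < hi
```

With a coefficient of 0.2, `is_similar(1.1, 1.0, 0.2)` is True, while `is_similar(1.5, 1.0, 0.2)` is False.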
The next step is to define those intervals for all values in our sample and then merge the overlapping intervals. If a new value is within one of the resulting intervals, greater than its start and less than its end, we consider the value “known.” And if the new value does not fall within any interval range, it is “unseen data.”
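A minimal Python sketch of this step (a hypothetical implementation, not the Bund code shown later; the per-value interval value × (1 ± C) is inferred from the example output that follows): build the per-value intervals, sort them, merge the ones that overlap, and then test new values against the merged set.

```python
def build_intervals(sample: list[float], coeff: float) -> list[tuple[float, float]]:
    # Build per-value intervals (v*(1-coeff), v*(1+coeff)), sort them by
    # start, and merge any that overlap or touch.
    intervals = sorted((v * (1 - coeff), v * (1 + coeff)) for v in sample)
    merged = [intervals[0]]
    for lo, hi in intervals[1:]:
        prev_lo, prev_hi = merged[-1]
        # A small epsilon absorbs floating-point noise at touching boundaries.
        if lo <= prev_hi + 1e-9:
            merged[-1] = (prev_lo, max(prev_hi, hi))
        else:
            merged.append((lo, hi))
    return merged


def is_known(value: float, intervals: list[tuple[float, float]]) -> bool:
    # True when the value falls strictly inside some merged interval.
    return any(lo < value < hi for lo, hi in intervals)
```

With the sample values and the 0.2 coefficient used in the example that follows, `build_intervals` produces five merged intervals.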
So, the values we are looking for are those that do not fall into any interval window.
Would you mind showing me the code?
To demonstrate how this approach works, I will use some of the code I've created for Bund, a programming language I am developing. The language is not finished yet, but I can use it for demonstration. You do not need to be familiar with RPN notation or stack-based languages. The generator “I” creates an empty Interval Set and places it on the stack. The operator “I/Coeff” takes an Interval Set and data from the stack and creates a populated Interval Set. “println” does what you expect and prints to the console. The data set is a sequence of elements between “(*” and “)”, and the first number is not data but the Coefficient used to build the intervals.
(* 0.2 1.0 2.0 3.0 5.0 1.0 100.0 150.0 10.0) I I/Coeff println
In this example, we are building intervals for the values (1.0 2.0 3.0 5.0 1.0 100.0 150.0 10.0) with a Variability Coefficient of 0.2.
The outcome of this operation is a configured Interval Set consisting of 5 intervals:
[ (0.8,1.2) (1.6,3.6) (4,6) (8,12) (80,180) ]
If a value falls within one of these intervals, we will consider it “existing data”; if not, it is “unseen data” and worthy of an alarm. That is the gold we are looking for.
4.5 I/Test
Will return True; the value 4.5 falls within a known interval, so it is not what we are looking for. But the following will be an example of “unseen data” and will return False:
45 I/Test
Conclusion
This Coefficient-based Interval Filter can be an essential instrument for identifying new patterns, or simply for observing anomalies if that is what you are looking for. A fixed coefficient is not the only mathematical method available to you: instead of a hard-coded coefficient, you can dynamically calculate the Standard Deviation of the sample and use it as the Coefficient. Try it, and see what kind of results you get; or try another method of creating filters that detect “unseen data.” This article is a thought-provoking exercise rather than a production solution.
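As a starting point for that exercise, here is one possible reading of the Standard Deviation idea, sketched in Python. Using the coefficient of variation (sample standard deviation divided by the mean) as the Coefficient is my interpretation, not a method prescribed above, and it assumes a sample with a non-zero mean:

```python
import statistics


def adaptive_coeff(sample: list[float]) -> float:
    # Derive the Variability Coefficient from the sample itself as the
    # coefficient of variation: sample stdev / mean.
    # Assumes at least two data points and a non-zero mean.
    return statistics.stdev(sample) / statistics.mean(sample)
```

The result can then be plugged in wherever the hard-coded coefficient went; a tighter sample yields a smaller coefficient and therefore narrower, more sensitive intervals.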