In search of the unknown in the data
What are we looking for?
In the business of observability, we try to comprehend the processes happening in the system under observation through the use of telemetry data. Different types and classes of telemetry data tell us different stories, and you want to pay attention to all of them to get a complete picture. But the first problem you will face head-on is that it is almost impossible to predefine and pre-determine all variations and patterns in the data and, thus, recognize and comprehend all the stories your source is telling you. Occasionally, you are not even ready to acknowledge a story hidden in the data: because you are not aware of what some pattern means, you cannot even see and immediately recognize that pattern.
So, what to do? What can you use to help recognize the patterns and the stories associated with them? How do you start?
What is an anomaly?
You may already hear a whisper in the air: “search for the anomalies in the data.” And while this suggestion partially addresses the issue, it is not a ready-to-use solution, because before you can treat it as a solution, you must understand what an “anomaly” is.
In essence, an anomaly in telemetry data is a combination of telemetry values that is observed and categorized for the first time at the present moment. The same combination is not an anomaly if you observed this pattern in the telemetry data before but did not categorize it; such uncategorized repeated data is called an “uncategorized pattern.” So, you can see that you are dealing with one of two cases while observing telemetry data:

- a previously observed but uncategorized pattern;
- a previously unobserved pattern.
And in this essay, we will discuss the importance of the “second type”: identifying previously unobserved patterns.
Previously unobserved pattern
While I've touched on the importance of finding “unknown” data in the telemetry stream, I did not focus on the practicality of searching for such data. Why should we look for it?
In another article I've published, called “Zen of monitoring,” I emphasized that you must develop a habit of searching for changes in the values of the telemetry data stream, and that there is no such thing as predefined “golden signals.” But let's talk about patterns first: what do you do if you do not have them yet? How do you start to detect and define them? The answer to those questions is this: observe the “unknown” values in the telemetry stream. “Unknown” values represent a previously unobserved pattern. Whenever you observe a value that is “out of line,” it might indicate a new pattern. Or, sometimes, just a glitch in the data; I will talk about detecting glitches in some other essay.

But how can you notice new and unseen data when you have multiple sources and thousands of telemetry items? There are quite a few mathematical instruments you can use, and I will present a straightforward one you can start with. I call it a “Coefficient-based filter.” It may already have a fancier name, but let's use my definition in this essay.
Filtering your data
Imagine you are prospecting for gold. You have a river, sand, and a pan. You put sand in the pan and “wash” it in the flowing water. The stream carries the lightweight sand away, and the gold, heavier than the sand, collects at the bottom of the pan. That is essentially what we will do with our telemetry data: we will let “the sand telemetry” flow away and make “the gold telemetry” known to us.
So, we will build a filter in which telemetry values similar to ones we've seen before signal TRUE, meaning “yes! You've seen something like this before,” and unseen data triggers FALSE, an indication that you potentially have something new. The idea behind such a filter is elementary: we take a sample of telemetry data and raise an alert when a value arrives that is not in this set. But that filter alone would not be beneficial, as numeric data rarely matches 100%. So, we need to build a filter that detects “that and similar values.” For a single value, the implementation is trivial: surround the value with an interval whose width is proportional to the value itself and controlled by a coefficient.
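As a sketch of that single-value case, here is a minimal Python illustration. The formula value × (1 ± C) is an assumption I inferred from the example output later in this essay, where a Coefficient of 0.2 turns the value 1.0 into the interval (0.8, 1.2):

```python
def interval(value: float, coeff: float) -> tuple[float, float]:
    # Interval of values "similar" to `value`: its width is
    # proportional to the value itself and controlled by `coeff`.
    # Assumed form: (value*(1-coeff), value*(1+coeff)).
    return (value * (1 - coeff), value * (1 + coeff))


def is_similar(candidate: float, value: float, coeff: float) -> bool:
    # True when `candidate` falls strictly inside the interval around `value`.
    lo, hi = interval(value, coeff)
    return lo < candidate < hi
```

With a coefficient of 0.2, `is_similar(1.1, 1.0, 0.2)` is True, while `is_similar(1.5, 1.0, 0.2)` is False.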
The next step is to define those intervals for all values in our sample and then merge the overlapping intervals. If a new value is within one of the resulting intervals, greater than its start and less than its end, we consider the value “known.” And if the new value does not fall within any interval range, it is “unseen data.”
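A minimal Python sketch of this step (a hypothetical implementation, not the Bund code shown later; the per-value interval value × (1 ± C) is inferred from the example output that follows): build the per-value intervals, sort them, merge the ones that overlap, and then test new values against the merged set.

```python
def build_intervals(sample: list[float], coeff: float) -> list[tuple[float, float]]:
    # Build per-value intervals (v*(1-coeff), v*(1+coeff)), sort them by
    # start, and merge any that overlap or touch.
    intervals = sorted((v * (1 - coeff), v * (1 + coeff)) for v in sample)
    merged = [intervals[0]]
    for lo, hi in intervals[1:]:
        prev_lo, prev_hi = merged[-1]
        # A small epsilon absorbs floating-point noise at touching boundaries.
        if lo <= prev_hi + 1e-9:
            merged[-1] = (prev_lo, max(prev_hi, hi))
        else:
            merged.append((lo, hi))
    return merged


def is_known(value: float, intervals: list[tuple[float, float]]) -> bool:
    # True when the value falls strictly inside some merged interval.
    return any(lo < value < hi for lo, hi in intervals)
```

With the sample values and the 0.2 coefficient used in the example that follows, `build_intervals` produces five merged intervals.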
So, the values we are looking for are those that do not fall into any interval window.
Would you mind showing me the code?
To demonstrate how this approach works, I will use some of the code I've created for Bund, a programming language I am developing. The language is not finished yet, but I can use it for demonstration. You do not need to be familiar with RPN notation or stack-based languages. The generator “I” creates an empty Interval Set and places it on the stack. The operator “I/Coeff” takes an Interval Set and data from the stack and creates a populated Interval Set. “println” does what you expect and prints to the console. The data set is a sequence of elements between “(*” and “)”, and the first number is not data but the Coefficient used to build the intervals.
(* 0.2 1.0 2.0 3.0 5.0 1.0 100.0 150.0 10.0) I I/Coeff println
In this example, we are building intervals for the values (1.0 2.0 3.0 5.0 1.0 100.0 150.0 10.0) with a Variability Coefficient of 0.2.
The outcome of this operation is a configured Interval Set consisting of 5 intervals:
[ (0.8,1.2) (1.6,3.6) (4,6) (8,12) (80,180) ]
If a value falls within one of these intervals, we will consider it “existing data”; if not, it is “unseen data” and worthy of an alarm. That is the gold we are looking for.
4.5 I/Test
Will return True; the value 4.5 falls within a known interval, so it is not what we are looking for. But the following will be an example of “unseen data” and will return False:
45 I/Test
Conclusion
This Coefficient-based Interval Filter can be an essential instrument for identifying new patterns, or simply for observing anomalies if that is what you are looking for. A fixed coefficient is not the only mathematical method available to you: instead of a hard-coded coefficient, you can dynamically calculate the Standard Deviation of the sample and use it as the Coefficient. Try it, and see what kind of results you get; or try another method of creating filters that detect “unseen data.” This article is a thought-provoking exercise rather than a production solution.
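As a starting point for that exercise, here is one possible reading of the Standard Deviation idea, sketched in Python. Using the coefficient of variation (sample standard deviation divided by the mean) as the Coefficient is my interpretation, not a method prescribed above, and it assumes a sample with a non-zero mean:

```python
import statistics


def adaptive_coeff(sample: list[float]) -> float:
    # Derive the Variability Coefficient from the sample itself as the
    # coefficient of variation: sample stdev / mean.
    # Assumes at least two data points and a non-zero mean.
    return statistics.stdev(sample) / statistics.mean(sample)
```

The result can then be plugged in wherever the hard-coded coefficient went; a tighter sample yields a smaller coefficient and therefore narrower, more sensitive intervals.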