Anomalous Azure SPNs activities triage with ML clustering (Azure Sentinel Part 4)
Finally... After one year, here is the long-awaited sequel to my previous articles on Azure Sentinel.
Today's article provides a case study that sheds light on how to track anomalies at scale and in an automated way, once your HMM has been trained and is operational.
Farewell, periodicity!
Even though period analysis on raw activity logs does not provide useful information, one might be tempted to wonder whether it fares any better on Markov sequences. Kusto only provides series decomposition on timestamps, not on arbitrary sequences, so we need to plug our own custom Discrete Fourier Transforms (DFTs) into the output of the HMM residual anomalies.
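For readers who want to try the same experiment, here is a minimal sketch of a DFT-based period estimate using NumPy. The function name dominant_period and the synthetic input are illustrative assumptions, not the code used for this article; a real residual-anomaly sequence would be far noisier than this clean sine wave.

```python
import numpy as np

def dominant_period(sequence):
    """Return the period (in samples) of the strongest non-constant DFT component.

    Hypothetical helper, for illustration only.
    """
    x = np.asarray(sequence, dtype=float)
    x = x - x.mean()                      # remove the constant (zero-frequency) part
    spectrum = np.abs(np.fft.rfft(x))     # magnitude spectrum of a real-valued signal
    freqs = np.fft.rfftfreq(len(x))       # frequencies in cycles per sample
    k = int(np.argmax(spectrum[1:])) + 1  # strongest bin, skipping bin 0
    return 1.0 / freqs[k]

# Synthetic, perfectly periodic signal: 96 hourly samples with a 24-hour cycle.
hours = np.arange(96)
synthetic = 10 + 3 * np.sin(2 * np.pi * hours / 24)
period = dominant_period(synthetic)
print(period)
```

On such an idealized signal the spectrum has a single clean peak; the difficulty described here arises with short, irregular real-life sequences.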
Unfortunately, the few real-life samples I have fed to a DFT do not seem to provide a spectral measurement which is accurate enough to be exploited without human intervention... Maybe it's me, but I couldn't get it right.
I had to give up exploring this lane and find another route. Fortunately, my latest foray into Quantum Computing territory (see my recent articles on the matter) prompted me to study random walks and finite fields. Eventually, this is what led me to the very promising solution I'm about to explain :-)
Correlating SPN sequences
Unlike Azure Active Directory users, Application and service principals are very well groomed animals. So there has to be a way to nail down their deterministic behavior, don't you agree?
And you're right. To that end, we are going to get help from an old dependable friend: the "mark 1 eyeball"...
Armed with the precious device, take a look at residual anomalous sequences in this Application Insights workbook (look at the bottom part of the graph):
Each particular SPN sequence is depicted as a broken line with a distinct color. You see that several sequences behave in a similar fashion: their vertical scaling is different (and, in fact, the Y-axis is a logarithmic scale), but their envelopes are the same. It means that this bunch of SPNs are correlated: the sequences fire in harmony to perform a composite, multi-staged Azure operation.
Envelopes and multi-dimensions
So here we are with this new object: the sequence envelope. How do we compare two envelopes to check if they correlate?
We can imagine plenty of ways. Here is my line of thought: to me and my Mark 1 eyeball, an envelope looks like a one-dimensional random walk, going either up or down. I can describe an envelope as a list of just three tokens: '1' for upward, '-1' for downward, and '0' for steady.
Under this convention, [-1,1,1,-1,-1,1] is a zigzag, whereas [0,0,1,-1,0,0,0,0,0,0] is a single bump and [0,1,0,0,0,-1,0] is a plateau with ridges at both ends.
Now close your eyeball and remember your math at school, when you learnt about vectors... What is our zigzag, if not a 6-dimensional vector? Or our bump, if not a 10-dimensional vector?
More specifically, envelopes of length n are vectors in GF(3)^n, the n-dimensional vector space over the finite field GF(3).
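To make the encoding concrete, here is a minimal sketch of one possible way to turn a numeric sequence into such a token list. The helper to_envelope is hypothetical, and two choices in it are assumptions of mine: the first bin is compared with itself (so the envelope has as many tokens as the sequence has bins), and "steady" means strictly equal consecutive values.

```python
import numpy as np

def to_envelope(values):
    """Map a numeric sequence to tokens in {-1, 0, 1} (hypothetical helper)."""
    v = np.asarray(values, dtype=float)
    # Difference with the previous bin; the first bin is compared with itself,
    # so the envelope keeps the input length and always starts with 0.
    steps = np.diff(v, prepend=v[0])
    return np.sign(steps).astype(int)

a = [2, 2, 9, 4, 4, 4]
b = [100 * x for x in a]   # same shape, different vertical scale
env_a = to_envelope(a)
env_b = to_envelope(b)
print(env_a.tolist())      # [0, 0, 1, -1, 0, 0]
print(np.array_equal(env_a, env_b))
```

Because only the sign of each step is kept, two sequences with the same shape but different vertical scaling map to the same envelope.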
Clustering and machine learning
The more dimensions, the merrier! Why? Because high-dimensional spaces are extremely empty.
To give you an idea, take the volume of a unit sphere in dimensions 1 to 25:
After an initial sharp increase, it quickly vanishes into nothingness... The volume of a unit sphere in dimension 3 is a little more than 4, but it's about 1 in dimension 13.
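These figures can be checked with the standard closed-form volume of the unit ball in n dimensions, pi^(n/2) / Gamma(n/2 + 1). The short sketch below is an illustration, not the script that produced the original chart.

```python
import math

def unit_ball_volume(n):
    """Volume of the unit ball in n dimensions: pi^(n/2) / Gamma(n/2 + 1)."""
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1)

volumes = {n: unit_ball_volume(n) for n in range(1, 26)}
peak = max(volumes, key=volumes.get)   # dimension with the largest volume

for n in (3, 13, 25):
    print(n, round(volumes[n], 4))
print("largest volume at dimension", peak)
```

The volume peaks at dimension 5 and then shrinks steadily toward zero.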
I don't know about you, but I would rather eat an apple in dimension 3 than two in dimension 13!
So if something unusual pops up in such vast expanses of void, it's very easy to spot. Not easy for us, of course (we struggle to visualize things in a mere 3D space...), I mean for the machine.
What's more, in high-dimensional spaces, if two or more things pop up close to one another, there's next to no chance it's an accident.
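A back-of-the-envelope calculation gives a feel for this. Under the strongly simplifying assumption that each token is drawn uniformly and independently from {-1, 0, 1} (real SPN envelopes are certainly not uniform), the sketch below computes the probability that two random vectors of length n differ in at most d positions. The values n = 48 and d = 2 are chosen to match the demonstration later in this article.

```python
from fractions import Fraction
from math import comb

def prob_at_most_d_differences(n, d):
    """P(two uniform random vectors of {-1,0,1}^n differ in at most d positions)."""
    # Each position matches with probability 1/3 and differs with probability 2/3.
    return sum(
        comb(n, k) * Fraction(2, 3) ** k * Fraction(1, 3) ** (n - k)
        for k in range(d + 1)
    )

p = prob_at_most_d_differences(48, 2)
print(float(p))
```

Under that assumption the probability equals 4609 / 3^48, which is below 10^-18: a near match of this kind is very unlikely to be a coincidence.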
Then all we have to do is pour our n-dimension vectors into a Machine Learning clustering algorithm. The algorithm I'll pick for the demonstration is called Kmeans.
Demonstration time!
Let's start with this sample of anomalous sequences:
We see that the dark blue and red lines roughly follow the same winding pattern.
The time span we consider here is two days, or 48 hours. We have used data grouped in bins of 1 hour to construct the sequences, which means that each of our envelopes has 48 dimensions.
Here is how they get converted into two GF(3)^48 vectors by a simple normalization script:
Now that the vertical scaling effect has gone, we can easily tell the similarity. There are indeed only a couple of differences, located in dimensions 19 and 20; the one in dimension 20 is highlighted in green above: it reads '-1' in the top vector, and '0' in the bottom one.
Fine, but we are not the ones we want to wake up at night for investigations... That's the machine's job! And not only for this particular example, but for all anomalies.
So we feed not only those two vectors, but all the concurrent anomalous vectors (there are 23 of them sharing this same time range) into Kmeans, an ML multi-dimensional clustering algorithm available in the mighty Python scikit-learn module or as part of Azure Machine Learning.
Kmeans needs to be told the number of clusters it has to work with. Here the choice is ours; for the demo I have chosen 10 clusters labelled C0 to C9.
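As a rough illustration of this step, here is a minimal scikit-learn sketch. The input is synthetic stand-in data (23 random ternary vectors of length 48, one of which is turned into a near copy of another), and the SPN names are hypothetical; none of it is the data behind this article.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic stand-in data: 23 envelopes of 48 tokens drawn from {-1, 0, 1}.
envelopes = rng.integers(-1, 2, size=(23, 48))

# Make envelope 1 a near copy of envelope 0, differing in two positions only.
envelopes[1] = envelopes[0]
for pos in (19, 20):
    # Cyclic shift -1 -> 0 -> 1 -> -1 guarantees a different token.
    envelopes[1, pos] = (envelopes[0, pos] + 2) % 3 - 1

spn_names = [f"spn-{i:02d}" for i in range(len(envelopes))]  # hypothetical names

model = KMeans(n_clusters=10, init="k-means++", max_iter=300,
               n_init=10, random_state=0)
labels = model.fit_predict(envelopes.astype(float))

# Group SPN names by cluster label, using the C0..C9 naming convention.
clusters = {}
for name, label in zip(spn_names, labels):
    clusters.setdefault(f"C{label}", []).append(name)

for cluster_id in sorted(clusters):
    print(cluster_id, clusters[cluster_id])
```

Whether the two near-identical envelopes end up in the same cluster depends on the data and on the number of clusters, so the grouping should be inspected rather than assumed.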
After only a couple of seconds, here is the result of KMeans(n_clusters=10, init='k-means++', max_iter=300, n_init=10, random_state=0):
Conclusion
I wonder if you will be as impressed as I am with the outcome of this single, unoptimized run of Kmeans:
Maybe there will be no need to wake up at night with sore Mk1 eyeballs... Hopefully Kmeans can do the job? :)
What remains to be done is to trigger a low-severity alert in Sentinel for out-of-band investigation. (Had Kmeans found some more hectic patterns, we would have fired a higher-severity alert.)
Note: the results I'm sharing in this article are in an early phase of exploration. The solution needs much more analysis, refining and real-life testing to be considered a good candidate for production environments.