Anomalous Azure SPNs activities triage with ML clustering (Azure Sentinel Part 4)

Finally... After one year, here is the long-awaited sequel to my previous articles on Azure Sentinel.

  • Part 1 demonstrates the current limitations of Azure Kusto and Azure Sentinel time series decomposition for detecting statistical anomalies in SPN Azure Activity logs;
  • Part 2 proposes a new detection model based on a Hidden Markov Model (HMM) operating on sequences, i.e. transitions between SPN activities;
  • Part 3 introduces CRISPR-Cas9, a threat injector able to replay and/or forge malicious sequences that one may try against the model.

Today's article will provide a case study that sheds light on how to track anomalies at scale and in an automated way, once your HMM has been trained and is operational.

Farewell, periodicity!

Even though period analysis on raw activity logs does not provide useful information, one might be tempted to wonder if it fares any better on Markov sequences. Kusto only provides series decomposition on timestamps, not on arbitrary sequences, so we need to plug our own custom Discrete Fourier Transforms (DFTs) into the output of the HMM residual anomalies.
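For the record, here is a minimal sketch of the kind of DFT probe I mean; the `dominant_period` helper and the synthetic 12-hour signal are mine for illustration, not part of the actual pipeline:

```python
import numpy as np

def dominant_period(residuals, bin_hours=1):
    """Return the dominant period (in hours) of an HMM residual
    sequence, using a plain discrete Fourier transform."""
    x = np.asarray(residuals, dtype=float)
    x = x - x.mean()                      # drop the DC component
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=bin_hours)
    k = spectrum[1:].argmax() + 1         # skip the zero frequency
    return 1.0 / freqs[k]

# On a clean synthetic signal, the 12-hour cycle is recovered easily
hours = np.arange(48)
print(dominant_period(np.sin(2 * np.pi * hours / 12)))
```

On a clean synthetic signal the spectral peak is unambiguous; real residual sequences are far noisier.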

Unfortunately, the few real-life samples I have fed to a DFT do not seem to provide a spectral measurement which is accurate enough to be exploited without human intervention... Maybe it's me, but I couldn't get it right.

I had to give up exploring this lane and find another route. Fortunately, my latest foray into Quantum Computing territory (see my recent articles on the matter) prompted me to study random walks and finite fields. That is what eventually led me to the very promising solution I'm about to explain :-)

Correlating SPN sequences

Unlike Azure Active Directory users, Applications and Service Principals are very well-groomed animals. So there has to be a way to nail down their deterministic behavior, don't you agree?

And you're right. To that end, we are going to get help from an old dependable friend: the "mark 1 eyeball"...

Armed with this precious device, take a look at the residual anomalous sequences in this Application Insights workbook (look at the bottom part of the graph):

[Figure: Application Insights workbook showing residual anomalous SPN sequences]

Each particular SPN sequence is depicted as a broken line with a distinct color. You can see that several sequences behave in a similar fashion: their vertical scaling differs (the Y-axis is, in fact, logarithmic), but their envelopes are the same. It means that this bunch of SPNs is correlated: the sequences fire in harmony to perform a composite, multi-staged Azure operation.

Envelopes and multi-dimensions

So here we are with this new object: the sequence envelope. How do we compare two envelopes to check if they correlate?

We can imagine plenty of ways; here is my line of thought: to me and my Mark 1 eyeball, an envelope looks like a one-dimensional random walk: up, down, or steady. I can describe an envelope as a list of just three tokens: '1' for upward, '-1' for downward, and '0' for steady.

Under this convention, [-1,1,1,-1,-1,1] is a zigzag, whereas [0,0,1,-1,0,0,0,0,0,0] is a single bump and [0,1,0,0,0,-1,0] is a plateau with ridges at both ends.
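This tokenization is easy to script. Here is one possible sketch (the `envelope` helper is my naming; note that with this convention, n hourly bins yield n-1 tokens):

```python
import numpy as np

def envelope(counts):
    """Encode a sequence of hourly activity counts as an envelope:
    1 for an upward step, -1 for downward, 0 for steady."""
    return np.sign(np.diff(np.asarray(counts, dtype=float))).astype(int)

# A sequence that rises and falls produces the zigzag from above
print(envelope([5, 2, 8, 9, 4, 1, 7]))   # tokens: -1, 1, 1, -1, -1, 1
```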

Now close your eyeball and remember your math at school, when you learnt about vectors... What is our zigzag, if not a 6-dimension vector? Or our bump, if not a 10-dimension vector?

More specifically, envelopes of length n are vectors in GF(3)^n, the n-dimensional vector space over the finite field GF(3).

Clustering and machine learning

The more dimensions, the merrier! Why? Because high-dimensional spaces are extremely empty.

To give you an idea, consider the volume of the unit sphere in dimensions 1 to 25:

[Figure: volume of the unit n-ball as a function of dimension, n = 1 to 25]

After an initial sharp increase, it quickly vanishes into nothingness... The volume of the unit sphere in dimension 3 is a little more than 4, but it's back down to about 1 in dimension 13.
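The curve above comes from the classic formula V_n = π^(n/2) / Γ(n/2 + 1) for the volume of the unit ball in dimension n; a few lines of Python are enough to reproduce the figures quoted above:

```python
from math import pi, gamma

def unit_ball_volume(n):
    """Volume of the unit ball in dimension n: pi^(n/2) / Gamma(n/2 + 1)."""
    return pi ** (n / 2) / gamma(n / 2 + 1)

# A little more than 4 in dimension 3, about 1 in dimension 13,
# and next to nothing by dimension 25
for n in (1, 3, 5, 13, 25):
    print(n, unit_ball_volume(n))
```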

I don't know for you, but

I would rather eat an apple in dimension 3 than two in dimension 13!

So if something unusual pops up in such vast expanses of void, it's very easy to spot. Not easy for us, of course (we struggle to visualize things in a mere 3D space...), I mean for the machine.

What's more, in high-dimensional spaces, if two or more things pop up close to one another, there's next to no chance it's an accident.

Then all we have to do is pour our n-dimension vectors into a machine learning clustering algorithm. The one I'll pick for the demonstration is called K-means.

Demonstration time!

Let's start with this anomalous sequences sample:

[Figure: sample of anomalous SPN sequences over a two-day window]

We see that the dark blue and red lines roughly follow the same winding pattern.

The time span we consider here is two days, or 48 hours. We used data grouped in bins of 1 hour to construct the sequences, which means that our envelopes each contain 48 dimensions.

Here is how they get converted into two GF(3)^48 vectors by a simple normalization script:

[Figure: the two sequences normalized into GF(3)^48 vectors; a differing dimension is highlighted in green]

Now that the vertical scaling effect is gone, we can easily tell the similarity. There are indeed only a couple of differences, located in dimensions 19 and 20; the one in dimension 20 is highlighted in green above: it reads '-1' in the top vector and '0' in the bottom one.
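Counting those differing dimensions is just a Hamming distance. A tiny sketch, with toy vectors rather than the real 48-dimensional ones:

```python
import numpy as np

def hamming(u, v):
    """Number of dimensions in which two envelope vectors disagree."""
    return int(np.count_nonzero(np.asarray(u) != np.asarray(v)))

# Toy vectors (not the article's real data): they disagree in two dimensions
a = [0, 1, 1, 0, -1, 0]
b = [0, 1, 0, 0, -1, 1]
print(hamming(a, b))   # → 2
```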

Fine, but it's not us we want to wake up at night for investigations... it's the machine! And not just for this particular example, but for all anomalies.

So we feed not only those two vectors, but all the concurrent anomalous vectors (there are 23 of them sharing this same time range) into K-means, a multi-dimensional clustering algorithm available in the mighty Python scikit-learn module or as part of Azure Machine Learning.

K-means needs an idea of the number of clusters it should work with. Here we have the choice; for the demo I have chosen 10 clusters, labelled C0 to C9.

After only a couple of seconds, here is the result of KMeans(n_clusters=10, init='k-means++', max_iter=300, n_init=10, random_state=0):

[Figure: K-means clustering result, anomalous envelopes assigned to clusters C0 to C9]
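For readers who want to try this at home, here is a minimal scikit-learn sketch of the idea. The 23 real SPN envelope vectors aren't reproduced here, so the script fabricates toy GF(3)^48 vectors around four synthetic "behaviors"; everything about the data generation is illustrative, only the KMeans parameters match the run above.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy stand-in for the 23 anomalous envelope vectors (the real ones
# come out of the normalization step; seed and sizes are arbitrary).
rng = np.random.default_rng(0)
base = rng.integers(-1, 2, size=(4, 48))          # 4 underlying "behaviors"
samples = []
for _ in range(23):
    v = base[rng.integers(0, 4)].copy()
    v[rng.integers(0, 48)] = rng.integers(-1, 2)  # tiny perturbation, stays in {-1,0,1}
    samples.append(v)
X = np.array(samples)

km = KMeans(n_clusters=10, init='k-means++', max_iter=300,
            n_init=10, random_state=0).fit(X)
print(km.labels_)   # one cluster index (C0..C9) per anomalous envelope
```

Envelopes forged from the same underlying behavior land close together in the 48-dimensional space, so K-means has no trouble grouping them.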

Conclusion

I wonder if you will be as impressed as I am with the outcome of this single, unoptimized run of Kmeans:

  • our two correlated sequences were put into cluster C1, and this cluster contains only the two of them!
  • many other SPN anomalies are also efficiently correlated (look at clusters C0, C2 or C3);
  • in the end, only 8 of the 10 clusters were used: C0 to C7.

Maybe there will be no need to wake up at night with sore Mk1 eyeballs... Hopefully K-means can do the job? :)

What remains to be done is to trigger a low-severity alert in Sentinel for out-of-band investigation. (Had K-means found some more hectic patterns, we would have fired a higher-severity alert.)

Note: the results I'm sharing in this article are in an early phase of exploration. The solution needs much more analysis, refining and real-life testing to be considered a good candidate for production environments.
