Behaviour Analysis: Outlier Sequence Detection
Marjan Sterjev
IT Engineer | CISSP | CCSP | CEH (Master): research | learn | do | MENTOR
Every production system deals with a large number of signals and sequences of events. Some of these sequences are unusual and require further investigation. Manual investigation is impossible at that scale, so having automatic analysis in place is of paramount importance. The initial huge collection of event sequences should be narrowed down to a sensible number of alerts that can be analysed further.
So, let's see what we can do in order to detect these suspicious sequences!
Most system and user activities can be described as sequences of symbols. The set of all possible symbols {A, B, C, D, ...} could be anything: visited Web pages, executed commands, API calls and so on.
The sequence of n symbols can be represented as:

S = s_1 s_2 s_3 ... s_{n-1} s_n

where each s_i belongs to the set {A, B, C, D, ...}.
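For illustration, here is a minimal sketch of how raw events could be encoded as such a sequence; the page names and the vocabulary mapping are hypothetical, invented for this example:

# Hypothetical example: encode a user's page visits as a symbol sequence.
vocabulary = {"home": 0, "login": 1, "search": 2, "cart": 3, "checkout": 4}

visits = ["home", "login", "search", "cart"]
sequence = [vocabulary[page] for page in visits]
print(sequence)  # [0, 1, 2, 3]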
At the very beginning of sequence outlier analysis, the system starts with the learning phase. It must learn the baseline that defines what normal is. In the case of Web site navigation analysis, the system learns the users' browsing behaviour.
Once we have the baseline, the system switches into the detection phase, where it has to decide whether a new sequence is normal and fits the learned baseline, or whether it is an outlier that was probably never seen before.
The baseline is represented as a probabilistic model. For each sequence, the model gives us the probability of that sequence appearing. If that probability is sufficiently low, the sequence is an outlier (anomaly) and we shall trigger an alert.
The probability of a particular n-symbol sequence is given by the chain rule:

P(S) = P(s_1 s_2 s_3 ... s_{n-1} s_n)
     = P(s_1) * P(s_2 | s_1) * P(s_3 | s_1 s_2) * ... * P(s_n | s_1 s_2 ... s_{n-1})
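As a quick sanity check, here is the chain rule applied to a three-symbol sequence; the conditional probabilities below are made-up numbers, used only to show the arithmetic:

# Illustrative only: P(ABA) = P(A) * P(B | A) * P(A | AB)
p_a = 0.5           # assumed P(A)
p_b_given_a = 0.4   # assumed P(B | A)
p_a_given_ab = 0.2  # assumed P(A | AB)

print(p_a * p_b_given_a * p_a_given_ab)  # 0.04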
The question is: how can we build a probabilistic model from a set of known sequences and calculate sequence probabilities?
One of the most cited papers for this purpose is Mining for Outliers in Sequential Databases. Its algorithm relies on the concept of a Probabilistic Suffix Tree (PST).
Each node in this tree contains: a label (the suffix that leads to the node), a probability distribution over the next symbol given that suffix, and links to its child nodes.
Why do we need this suffix-based logic? Remember the sequence probability formula?
P(S) = P(s_1 s_2 s_3 ... s_{n-1} s_n)
     = P(s_1) * P(s_2 | s_1) * P(s_3 | s_1 s_2) * ... * P(s_n | s_1 s_2 ... s_{n-1})
The probability is a product of terms having the following conditional probability format:

P(s_i | s_1 s_2 ... s_{i-1})
Let's calculate the following conditional probability:
P(A | BBBA)
We have no guarantee that the suffix BBBA exists in the tree. However, we can always start from the Root and search for the longest label (suffix) that exists in the tree, including the empty one:

P(A | BBBA) = ?

P(A)         # the Root's distribution; the empty suffix always exists
P(A | A)     # used if the node A exists in the tree
P(A | BA)    # used if the node BA exists in the tree
P(A | BBA)   # used if the node BBA exists in the tree
P(A | BBBA)  # used if the node BBBA exists in the tree

There is no BBBA in the tree.
The longest suffix of BBBA that does exist in the tree is BBA. The probability is therefore approximated as:
P(A | BBBA) ~ P(A | BBA)
With this approximation logic in place, we are able to work with sequences of variable length and calculate their approximate probabilities.
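Before we get to the full tree implementation below, here is a minimal standalone sketch of this backoff idea, using a plain dict of known suffixes instead of a real tree; the suffixes and probabilities are made up:

# Toy model: next-symbol distributions for the suffixes we actually know.
known_suffixes = {
    "": {"A": 0.5, "B": 0.5},
    "A": {"A": 0.3, "B": 0.7},
    "BA": {"A": 0.6, "B": 0.4},
    "BBA": {"A": 0.8, "B": 0.2},
}

def backoff_probability(symbol, suffix):
    # Fall back to the longest known suffix, down to the empty one.
    for start in range(0, len(suffix) + 1):
        candidate = suffix[start:]
        if candidate in known_suffixes:
            return known_suffixes[candidate][symbol]

print(backoff_probability("A", "BBBA"))  # 0.8, taken from the suffix BBA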
Instead of the raw sequence probability, the paper introduces the so-called normalised similarity, defined as the log of the probability above divided by the length of the sequence:

SIM(S) = 1/len(S) * log(P(S))
       = 1/len(S) * (log(P(s_1)) + SUM(log(P(s_i | s_1 s_2 ... s_{i-1}))))

where the SUM runs over i from 2 to len(S).
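In log space this is simply the average log-probability per symbol. A minimal sketch, assuming the per-symbol conditional probabilities have already been computed:

import math

def normalised_similarity(conditional_probabilities):
    # SIM(S): average of the logs of the per-symbol conditional probabilities
    log_sum = sum(math.log(p) for p in conditional_probabilities)
    return log_sum / len(conditional_probabilities)

# Illustrative numbers: P(s_1), P(s_2 | s_1), P(s_3 | s_1 s_2)
print(normalised_similarity([0.5, 0.4, 0.2]))  # about -1.07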
The tree generation algorithm is defined in the paper. Let's implement it here in Python. As I've stated many times, most AI-like algorithms are counting algorithms. Yes, we count!
Let's first define the normal sequences acquired during the learning phase.
import math
import statistics

VOCABULARY_SIZE = 10

def get_sequences():
    sequences = [
        [0, 1, 2, 3, 0, 1, 2, 3, 5, 6, 7],
        [0, 1, 2, 3, 5, 6, 7, 0, 1, 2, 3, 5, 6, 7],
        [8, 9, 5, 6, 7],
        [5, 6, 7, 8, 9, 0, 1, 2, 3, 4],
        [8, 9, 1, 2, 3, 4],
        [0, 1, 2, 3, 0, 1, 2, 3, 5, 6, 7],
        [8, 9, 5, 6, 7, 0, 1, 2, 3],
        [0, 1, 2, 3, 0, 1, 2, 3, 5, 6, 7],
        [0, 1, 2, 3, 5, 6, 7, 0, 1, 2, 3, 5, 6, 7]
    ]
    return sequences
We need a function that creates a probability distribution from counts. In order to avoid zero probabilities, we need logic that smooths them if needed:
def get_smooth_probability_from_counts(counts):
    min_val = min(counts)
    max_val = max(counts)
    if min_val == 0:
        # all counts are zero: return a tiny uniform pseudo-probability
        if max_val == 0:
            return [0.00001] * len(counts)
        # replace zero counts with 1% of the smallest non-zero count
        min_val = min(filter(lambda x: x != 0, counts)) / 100.00
        counts = [c if c != 0 else min_val for c in counts]
    sum_val = float(sum(counts))
    # normalise the (smoothed) counts into probabilities
    return [c / sum_val for c in counts]
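For example, a zero count is replaced with 1% of the smallest non-zero count before normalisation:

print(get_smooth_probability_from_counts([3, 0, 1]))
# [0.7481..., 0.0024..., 0.2493...] (values truncated)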
The main logic is located in the class PSTNode:
class PSTNode:
    def __init__(self, label, probabilities):
        # label: the suffix this node represents (a tuple of symbols)
        # probabilities: distribution of the next symbol given this suffix
        self.label = label
        self.probabilities = probabilities
        self.children = {}

    def add_child(self, symbol, probabilities):
        # children are keyed by the symbol that extends the suffix backwards
        t_symbol = tuple((symbol,))
        child_label = t_symbol + self.label
        self.children[t_symbol] = PSTNode(child_label, probabilities)

    def find_node(self, label):
        # walk the tree following the label in reverse (suffix) order
        label = label[::-1]
        current_node = self
        for i in range(0, len(label)):
            child = label[i : i + 1]
            if child in current_node.children:
                current_node = current_node.children[child]
            else:
                return None
        return current_node

    @staticmethod
    def get_conditional_probability(T, symbol, suffix):
        # descend along the reversed suffix as deep as possible; the deepest
        # existing node approximates P(symbol | suffix)
        path = suffix[::-1]
        current_node = T
        for i in range(0, len(path)):
            child = path[i : i + 1]
            if child in current_node.children:
                current_node = current_node.children[child]
            else:
                break
        return current_node.probabilities[symbol]

    @staticmethod
    def get_similarity(T, sequence):
        # normalised similarity: average log-probability per symbol
        p = 0.0
        for i, symbol in enumerate(sequence):
            p_i = PSTNode.get_conditional_probability(T, symbol, tuple(sequence[0:i]))
            p += math.log(p_i)
        return p / len(sequence)
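A quick toy usage of the class, with a vocabulary of just two symbols (0 and 1) and hand-picked probabilities, shows the suffix backoff in action:

# Toy tree over the vocabulary {0, 1} with made-up distributions.
root = PSTNode(tuple(), [0.5, 0.5])
root.add_child(0, [0.9, 0.1])  # node (0,): distribution after seeing 0

# The suffix (0,) exists, so its distribution is used directly:
print(PSTNode.get_conditional_probability(root, 1, (0,)))    # 0.1
# The suffix (1, 0) is unknown; the walk stops at node (0,):
print(PSTNode.get_conditional_probability(root, 1, (1, 0)))  # 0.1
# The suffix (1,) is unknown too; the Root distribution is used:
print(PSTNode.get_conditional_probability(root, 1, (1,)))    # 0.5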
The actual logic for building the tree, as described in the paper, is:
def build_PST(sequences, l=5, min_count=1):
    # count the occurrences of each symbol
    S = [0] * VOCABULARY_SIZE
    for s in sequences:
        for i in s:
            S[i] += 1
    # candidate suffixes of length 1, kept only if frequent enough
    HM = {
        tuple((i,)): {"Before": [0] * VOCABULARY_SIZE, "After": [0] * VOCABULARY_SIZE}
        for i, v in enumerate(S)
        if v >= min_count
    }
    probabilities = get_smooth_probability_from_counts(S)
    T = PSTNode(tuple(), probabilities)
    k = 1
    while k <= l:
        # count the symbols seen immediately before and after each
        # occurrence of every candidate suffix of length k
        for s in sequences:
            for i in range(0, len(s) - k + 1):
                key = tuple(s[i : i + k])
                if key in HM:
                    maps = HM[key]
                    if i > 0:
                        before = s[i - 1]
                        maps["Before"][before] += 1
                    if i + k < len(s):
                        after = s[i + k]
                        maps["After"][after] += 1
        HM_next = {}
        for child, maps in HM.items():
            # the "After" counts become the node's next-symbol distribution
            probabilities = get_smooth_probability_from_counts(maps["After"])
            parent = T.find_node(child[1:])
            parent.add_child(child[0], probabilities)
            # frequent "Before" symbols extend the suffix to length k + 1
            for idx, before in enumerate(maps["Before"]):
                if before >= min_count:
                    key = tuple((idx,)) + child
                    HM_next[key] = {
                        "Before": [0] * VOCABULARY_SIZE,
                        "After": [0] * VOCABULARY_SIZE,
                    }
        HM = HM_next
        k += 1
    return T
Now, build the tree and enumerate it:
sequences = get_sequences()
T = build_PST(sequences, 10, 1)

def enum_tree(T):
    print(T.label)
    for key, child_node in T.children.items():
        enum_tree(child_node)

enum_tree(T)
The enumerated tree is:
()
(0,)
(3, 0)
(2, 3, 0)
(1, 2, 3, 0)
(0, 1, 2, 3, 0)
(7, 0)
(6, 7, 0)
(5, 6, 7, 0)
. . . . . . . .
Calculate the similarity for each normal sequence, and then the mean and standard deviation of these similarities. Let's assume that the outlier threshold is defined as mean - 3 * stdev (the classic three-sigma rule).
similarities = []
for sequence in sequences:
    similarity = PSTNode.get_similarity(T, sequence)
    similarities.append(similarity)

mean = statistics.mean(similarities)
stdev = statistics.stdev(similarities)
threshold = mean - 3 * stdev

print("Mean:", mean)
print("Stdev:", stdev)
print("Threshold:", threshold)
The output is:
Mean: -3.7365682297537006
Stdev: 0.18905646530756265
Threshold: -4.303737625676389
Finally, let's define some new query sequences, calculate the similarity for each one and decide whether it is an outlier or not.
query_sequences = [
    [0, 1, 2, 3],
    [3, 2, 1, 0],
    [0, 1, 2, 3, 0, 1, 2, 3, 5, 6, 7],
    [0, 1, 2, 3, 5, 6, 7, 0, 1, 2, 3, 5, 6, 7],
    [0, 2, 4, 6],
    [8, 9, 5, 6, 7],
    [2, 2, 2, 9, 8, 7],
]

for sequence in query_sequences:
    similarity = PSTNode.get_similarity(T, sequence)
    print("- - - - - - - -")
    print("Sequence:", sequence)
    print("Similarity:", similarity)
    print("Is it Outlier?", similarity < threshold)
The results of this outlier analysis are:
- - - - - - - -
Sequence: [0, 1, 2, 3]
Similarity: -2.873706579357047
Is it Outlier? False
- - - - - - - -
Sequence: [3, 2, 1, 0]
Similarity: -4.434283676143275
Is it Outlier? True
- - - - - - - -
Sequence: [0, 1, 2, 3, 0, 1, 2, 3, 5, 6, 7]
Similarity: -3.611735573367646
Is it Outlier? False
- - - - - - - -
Sequence: [0, 1, 2, 3, 5, 6, 7, 0, 1, 2, 3, 5, 6, 7]
Similarity: -3.8430810681236816
Is it Outlier? False
- - - - - - - -
Sequence: [0, 2, 4, 6]
Similarity: -5.730393521539341
Is it Outlier? True
- - - - - - - -
Sequence: [8, 9, 5, 6, 7]
Similarity: -3.456957297665089
Is it Outlier? False
- - - - - - - -
Sequence: [2, 2, 2, 9, 8, 7]
Similarity: -4.453352477235748
Is it Outlier? True
The more data we have, the more precise the detection system will be. Each production system should be further tuned by optimising the tree parameters (the maximum tree depth l or the pruning value min_count), as sketched below.
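A minimal sketch of how such tuning could look, assuming we hold out a few known-normal sequences and keep the parameter combinations that raise no false alerts on them; the helper name and the selection criterion are illustrative, not taken from the paper:

# Hypothetical tuning loop over the tree depth l and the pruning value
# min_count; a combination is acceptable if no held-out normal sequence
# falls below the three-sigma threshold learned on the training set.
def tune_parameters(train_sequences, holdout_sequences):
    candidates = []
    for l in (3, 5, 10):
        for min_count in (1, 2, 3):
            T = build_PST(train_sequences, l, min_count)
            sims = [PSTNode.get_similarity(T, s) for s in train_sequences]
            threshold = statistics.mean(sims) - 3 * statistics.stdev(sims)
            false_alerts = sum(
                1 for s in holdout_sequences
                if PSTNode.get_similarity(T, s) < threshold
            )
            if false_alerts == 0:
                candidates.append((l, min_count))
    return candidates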
To summarise: sequence outlier analysis is an important topic, and the Probabilistic Suffix Tree based outlier detection algorithm demonstrated in this article is a powerful tool for exactly that purpose.
Marjan