Behaviour Analysis: Outlier Sequence Detection

Each production system deals with a lot of signals and event sequences. Some of these sequences are unusual and require further investigation, but manual investigation of all of them is impossible. Having automatic analysis in place is of paramount importance: the initial huge collection of event sequences should be narrowed down to a sensible number of alerts that will be analysed further.

So, let's see what we can do in order to detect these suspicious sequences!

Most system and user activities can be described as sequences of symbols.

The set of all possible symbols {A, B, C, D...} could be anything:

  • If we analyse malware behaviour, each particular system API call is a symbol
  • If we analyse user interaction with a web site, each particular web page is a symbol (a minimal encoding sketch follows this list)
  • In the case of DNA sequences, each symbol is a particular nucleotide
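
As an illustration of the web site case, a navigation session can be encoded as a symbol sequence like this (the page names and the mapping are made up for the example):

# Hypothetical mapping from web pages to symbols
page_to_symbol = {"/home": 0, "/products": 1, "/cart": 2, "/checkout": 3}

session = ["/home", "/products", "/cart", "/checkout"]
sequence = [page_to_symbol[page] for page in session]   # [0, 1, 2, 3]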

The sequence of n symbols can be represented as:

S = s_1 s_2 s_3 . . . s_n-1 s_n

where s_i belongs to the set {A, B, C, D...}

At the very beginning of sequence outlier analysis the system starts with the learning phase. It must learn the baseline that defines what normal is. In the case of web site navigation analysis, the system learns the user's browsing behaviour.

Once we have the baseline, the system switches into the detection phase, where it has to decide whether a new sequence is normal and fits the learned baseline, or whether it is an outlier that was probably never seen up to that moment.

The baseline is represented as a probabilistic model. For each sequence, the model gives us the probability of that sequence appearing. If that probability is sufficiently low, the sequence is an outlier (anomaly) and we shall trigger an alert.

The probability of a particular n-symbol sequence is:

P(S) = 
P(s_1 s_2 s_3 . . . s_n-1 s_n) =
P(s_1) * 
P(s_2 | s_1) * 
P(s_3 | s_1 s_2) * 
. . . *
P(s_n | s_1 s_2 . . .s_n-1)        
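
For example, for the three-symbol sequence S = A B A the chain rule above expands to:

P(A B A) = P(A) * P(B | A) * P(A | A B)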

The question is: how can we build a probabilistic model from a set of known sequences and calculate the sequence probabilities?

One of the most cited algorithms and papers for this purpose is Mining for Outliers in Sequential Databases. The algorithm relies on the concept of a Probabilistic Suffix Tree.

Figure: Probabilistic Suffix Tree for a 2-symbol alphabet {A, B}


Each node in this tree contains (a minimal sketch of such a node follows the list):

  • A label (Example: BA)
  • Probabilities for the next symbol conditioned on the current node's label (Example: P(A | BA) and P(B | BA))
  • Child nodes; this node's label (BA) is a suffix of the child node labels (Example: ABA and BBA)
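
For illustration, such a node could be represented with a plain dictionary (the probability values below are made up):

# Hypothetical node from the {A, B} tree above; the probabilities are example values only
node_BA = {
    "label": "BA",
    "next_symbol": {"A": 0.7, "B": 0.3},    # P(A | BA) and P(B | BA)
    "children": {"ABA": ..., "BBA": ...},   # child labels that end with the suffix BA
}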

Why do we need this suffix-based logic? Remember the sequence probability formula?

P(S) = 
P(s_1 s_2 s_3 . . . s_n-1 s_n) =
P(s_1) * 
P(s_2 | s_1) * 
P(s_3 | s_1 s_2) * 
. . . *
P(s_n | s_1 s_2 . . .s_n-1)        

The probability is a product of terms having the following conditional probability format:

P(s_i | s_1 s_2 . . .s_i-1)        

Let's calculate the following conditional probability:

P(A | BBBA)        

There is no guarantee that the suffix BBBA exists in the tree. However, we can always start from the Root and search for the longest label (suffix) that exists in the tree, including the empty one.

P(A | BBBA) = ?

P(A)         # start at the Root; check if the node A exists and descend if it does
P(A | A)     # check if the node BA exists and descend if it does
P(A | BA)    # check if the node BBA exists and descend if it does
P(A | BBA)   # check if the node BBBA exists and descend if it does

There is no BBBA in the tree

The longest and closest suffix to BBBA in the tree is BBA. The probability is approximated as:

P(A | BBBA) ~ P(A | BBA)        

Having this approximation logic in place, we are able to work with sequences of variable length and calculate their approximated probability.

Instead of the raw sequence probability, the paper introduces the so-called normalised similarity, defined as the log of the probability above divided by the length of the sequence:

SIM(S) = 1/len(S) * log(P(S))

= 1/len(S) * (log(P(s_1)) + SUM(log(P(s_i | s_1 s_2 ... s_i-1))))

The SUM is over i from 2 to len(S)
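
For example, if a 5-symbol sequence has probability P(S) = 0.001, then SIM(S) = log(0.001) / 5 ≈ -6.91 / 5 ≈ -1.38 (natural logarithm, matching math.log used in the code below). Because the log-probability is averaged over the sequence length, longer sequences are not penalised just for being long.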

The tree generation algorithm is defined in the paper. Let's implement it here in Python. As I've stated many times, most of the AI-like algorithms are counting algorithms. Yes, we count!

Let's first define the normal sequences acquired during the learning phase.

import math
import statistics

VOCABULARY_SIZE = 10

def get_sequences():
    sequences = [
        [0, 1, 2, 3, 0, 1, 2, 3, 5, 6, 7],
        [0, 1, 2, 3, 5, 6, 7, 0, 1, 2, 3, 5, 6, 7],
        [8, 9, 5, 6, 7],
        [5, 6, 7, 8, 9, 0, 1, 2, 3, 4],
        [8, 9, 1, 2, 3, 4],
        [0, 1, 2, 3, 0, 1, 2, 3, 5, 6, 7],
        [8, 9, 5, 6, 7, 0, 1, 2, 3],
        [0, 1, 2, 3, 0, 1, 2, 3, 5, 6, 7],
        [0, 1, 2, 3, 5, 6, 7, 0, 1, 2, 3, 5, 6, 7]
    ]
    return sequences

We need a function that creates a probability distribution from counts. In order to avoid zeros, we need logic that smooths these probabilities if needed:

def get_smooth_probability_from_counts(counts):
    min_val = min(counts)
    max_val = max(counts)
    if min_val == 0:
        # All counts are zero: return a tiny probability for every symbol
        if max_val == 0:
            return [0.00001] * len(counts)
        # Replace each zero count with 1% of the smallest non-zero count
        min_val = min(filter(lambda x: x != 0, counts)) / 100.00
        counts = [c if c != 0 else min_val for c in counts]
    # Normalise the (smoothed) counts into probabilities
    sum_val = float(sum(counts))
    return [c / sum_val for c in counts]
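
A quick sanity check of the smoothing logic (the input counts below are arbitrary):

# The zero count is replaced with 1% of the smallest non-zero count (0.01 here)
# and the result is normalised, so the probabilities sum to 1
print(get_smooth_probability_from_counts([3, 0, 1]))
# [0.7481..., 0.0024..., 0.2493...]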

The main logic is located in the class PSTNode:

class PSTNode:

    def __init__(self, label, probabilities):
        # label is a tuple of symbols; probabilities[s] is P(s | label)
        self.label = label
        self.probabilities = probabilities
        self.children = {}

    def add_child(self, symbol, probabilities):
        # The child's label is the parent's label prefixed with the new symbol,
        # so the parent's label is a suffix of the child's label
        t_symbol = tuple((symbol,))
        child_label = t_symbol + self.label
        self.children[t_symbol] = PSTNode(child_label, probabilities)

    def find_node(self, label):
        # Walk down from this node following the label from its last symbol backwards
        label = label[::-1]
        current_node = self
        for i in range(0, len(label)):
            child = label[i : i + 1]
            if child in current_node.children:
                current_node = current_node.children[child]
            else:
                return None
        return current_node

    def get_conditional_probability(T, symbol, suffix):
        # Descend as deep as possible along the reversed suffix and use the
        # probabilities of the deepest (longest matching) node
        path = suffix[::-1]
        current_node = T
        for i in range(0, len(path)):
            child = path[i : i + 1]
            if child in current_node.children:
                current_node = current_node.children[child]
            else:
                break
        return current_node.probabilities[symbol]

    def get_similarity(T, sequence):
        # Normalised similarity: average log-probability of the sequence symbols
        p = 0.0
        for i, symbol in enumerate(sequence):
            p_i = PSTNode.get_conditional_probability(T, symbol, tuple(sequence[0:i]))
            p += math.log(p_i)
        return p / len(sequence)
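
Before building the full tree, here is a minimal hand-built example that shows the longest-suffix fallback in action. The two-symbol alphabet {0, 1} and all probability values are made up for illustration:

root = PSTNode(tuple(), [0.6, 0.4])              # P(0), P(1) at the root
root.add_child(0, [0.3, 0.7])                    # node (0,):   P(0 | 0), P(1 | 0)
root.add_child(1, [0.8, 0.2])                    # node (1,):   P(0 | 1), P(1 | 1)
root.find_node((0,)).add_child(1, [0.1, 0.9])    # node (1, 0): P(0 | 1 0), P(1 | 1 0)

# There is no node labelled (1, 1, 0), so the longest existing suffix (1, 0) is used
print(PSTNode.get_conditional_probability(root, 1, (1, 1, 0)))   # 0.9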

The actual logic for building the tree, as described in the paper, is:

def build_PST(sequences, l=5, min_count=1):
    # Count the occurrences of each symbol
    S = [0] * VOCABULARY_SIZE
    for s in sequences:
        for i in s:
            S[i] += 1

    # Candidate labels of length 1 that occur at least min_count times, together
    # with counters for the symbols seen immediately before and after them
    HM = {
        tuple((i,)): {"Before": [0] * VOCABULARY_SIZE, "After": [0] * VOCABULARY_SIZE}
        for i, v in enumerate(S)
        if v >= min_count
    }

    # The root node holds the unconditional symbol probabilities
    probabilities = get_smooth_probability_from_counts(S)
    T = PSTNode(tuple(), probabilities)

    k = 1
    while k <= l:
        # Collect the Before/After counts for all candidate labels of length k
        for s in sequences:
            for i in range(0, len(s) - k):
                key = tuple(s[i : i + k])
                if key in HM:
                    maps = HM[key]
                    if i > 0:
                        before = s[i - 1]
                        maps["Before"][before] += 1
                    if i < len(s) - 1:
                        after = s[i + 1]
                        maps["After"][after] += 1

        HM_next = {}
        for child, maps in HM.items():
            # Add the candidate label as a child of the node labelled with its suffix
            probabilities = get_smooth_probability_from_counts(maps["After"])
            parent = T.find_node(child[1:])
            parent.add_child(child[0], probabilities)

            # Extend the label to the left with every symbol seen often enough before it
            for idx, before in enumerate(maps["Before"]):
                if before >= min_count:
                    key = tuple((idx,)) + child
                    HM_next[key] = {
                        "Before": [0] * VOCABULARY_SIZE,
                        "After": [0] * VOCABULARY_SIZE,
                    }

        HM = HM_next
        k += 1

    return T

Now, build the tree and enumerate it:

sequences = get_sequences()

T = build_PST(sequences, 10, 1)

def enum_tree(T):
    print(T.label)
    for key, child_node in T.children.items():
        enum_tree(child_node)

enum_tree(T)

The enumerated tree is:

()
(0,)
(3, 0)
(2, 3, 0)
(1, 2, 3, 0)
(0, 1, 2, 3, 0)
(7, 0)
(6, 7, 0)
(5, 6, 7, 0)

. . . . . . . .        

Calculate the similarity for each normal sequence, then calculate the mean and standard deviation of these similarities. Let's assume that the outlier threshold is defined as mean - 3 * stdev.

similarities = []
for sequence in sequences:
    similarity = PSTNode.get_similarity(T, sequence)
    similarities.append(similarity)

mean = statistics.mean(similarities)
stdev = statistics.stdev(similarities)
threshold = mean - 3 * stdev

print("Mean:", mean)
print("Stdev:", stdev)
print("Threshold:", threshold)

The output is:

Mean: -3.7365682297537006
Stdev: 0.18905646530756265
Threshold: -4.303737625676389        
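
As a quick check, the reported threshold is exactly mean - 3 * stdev: -3.7366 - 3 * 0.18906 ≈ -4.3037, i.e. three standard deviations below the mean similarity of the normal sequences.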

Finally, let's define some new query sequences, calculate the similarity for each of them and decide whether it is an outlier or not.

query_sequences = [
    [0, 1, 2, 3],
    [3, 2, 1, 0],
    [0, 1, 2, 3, 0, 1, 2, 3, 5, 6, 7],
    [0, 1, 2, 3, 5, 6, 7, 0, 1, 2, 3, 5, 6, 7],
    [0, 2, 4, 6],
    [8, 9, 5, 6, 7],
    [2, 2, 2, 9, 8, 7],
]

for sequence in query_sequences:
    similarity = PSTNode.get_similarity(T, sequence)
    print("- - - - - - - -")
    print("Sequence:", sequence)
    print("Similarity:", similarity)
    print("Is it Outlier?", similarity < threshold)

The results of this outlier analysis are:

- - - - - - - -
Sequence: [0, 1, 2, 3]
Similarity: -2.873706579357047
Is it Outlier? False
- - - - - - - -
Sequence: [3, 2, 1, 0]
Similarity: -4.434283676143275
Is it Outlier? True
- - - - - - - -
Sequence: [0, 1, 2, 3, 0, 1, 2, 3, 5, 6, 7]
Similarity: -3.611735573367646
Is it Outlier? False
- - - - - - - -
Sequence: [0, 1, 2, 3, 5, 6, 7, 0, 1, 2, 3, 5, 6, 7]
Similarity: -3.8430810681236816
Is it Outlier? False
- - - - - - - -
Sequence: [0, 2, 4, 6]
Similarity: -5.730393521539341
Is it Outlier? True
- - - - - - - -
Sequence: [8, 9, 5, 6, 7]
Similarity: -3.456957297665089
Is it Outlier? False
- - - - - - - -
Sequence: [2, 2, 2, 9, 8, 7]
Similarity: -4.453352477235748
Is it Outlier? True        

The more data we have, the more precise the detection system will be. Each production system should be further tuned by optimising the tree parameters (the maximum tree depth l or the pruning value min_count).
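
A minimal sketch of such tuning, assuming only the normal training sequences are available, could compare a few (l, min_count) settings by the spread of the similarities they produce (the candidate values below are arbitrary):

# Rebuild the tree with several parameter combinations and inspect the
# mean and standard deviation of the similarities of the normal sequences
for l in (3, 5, 10):
    for min_count in (1, 2):
        T_candidate = build_PST(sequences, l, min_count)
        sims = [PSTNode.get_similarity(T_candidate, s) for s in sequences]
        print(l, min_count, statistics.mean(sims), statistics.stdev(sims))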

To summarise: sequence outlier analysis is an important topic, and the Probabilistic Suffix Tree based outlier detection algorithm demonstrated in this article is a powerful tool for that purpose.

Marjan

