Behaviour Analysis: Outlier Sequence Detection
Marjan Sterjev
IT Engineer | CISSP | CCSP | CEH (Master): research | learn | do | MENTOR
Each production system deals with a lot of signals and sequence of events. Some of the sequences are unusual and require further investigation. Manual investigation is impossible. Having automatic analysis in place is of paramount importance. The initial huge collection of event sequences should be narrowed down into a sensible count of alerts that will be further analysed.
So, let's see what we can do in order to detect these suspicious sequences!
Most of the system and user activities can be described as sequences of symbols.
The set of all possible symbols {A, B, C, D...} could be anything:
The sequence of n symbols can be represented as:
S = s_1 s_2 s_3 . . . s_n-1 s_n
where a_i belongs to the set {A, B, C, D....}
At the very beginning when dealing with sequence outlier analysis the system starts with the learning phase . It must learn the baseline that defines what the normal is. In the case of Web Site navigation analysis, the system learns the user's browsing behaviour.
Once we have the baseline, the system switches into the detection phase where it has to decide if some new sequence is normal and fits into the baseline learned or it is an outlier that was probably never seen up to that moment.
The baseline is represented as a probabilistic model. For each sequence the model give us the probability of that sequence appearance. If that probability is sufficiently low, the sequence is an outlier or anomaly and we shall trigger an alert.
The probability of particular n-symbol sequence is:
P(S) =
P(s_1 s_2 s_3 . . . s_n-1 s_n) =
P(s_1) *
P(s_2 | s_1) *
P(s_3 | s_1 s_2) *
. . . *
P(s_n | s_1 s_2 . . .s_n-1)
The question is how can we build a probabilistic model based on a set of known sequences and calculate the sequence probabilities?
One of the most cited algorithms and papers for this purpose is Mining for Outliers in Sequential Databases. The algorithm relies on the concept of Probabilistic Suffix Tree.
Each node in this tree contains:
Why do we need this suffix based logic? Remember the sequence probability formula?
P(S) =
P(s_1 s_2 s_3 . . . s_n-1 s_n) =
P(s_1) *
P(s_2 | s_1) *
P(s_3 | s_1 s_2) *
. . . *
P(s_n | s_1 s_2 . . .s_n-1)
The probability is a product of terms having the following conditional probability format:
P(s_i | s_1 s_2 . . .s_i-1)
Let's calculate the following conditional probability:
P(A | BBBA)
We do not have guarantees that the suffix BBBA exists in the tree. However, we can always start from the Root and search for the longest label (suffix) that exists in the tree, including the empty one.
P(A | BBBA) = ?
P(A) # check if A exists in the tree and use it if true
P(A | A) # check if BA exists in the tree and use it if true
P(A | BA) # check if BBA exists in the tree and use it if true
P(A | BBA) # check if BBBA exists in the tree and use it if true
There is no BBBA in the tree
The longest and closest suffix to BBBA in the tree is BBA. The probability is approximated as:
领英推荐
P(A | BBBA) ~ P(A | BBA)
Having this approximation logic in place, we are able to work with sequences of variable length and calculate their approximated probability.
Instead of sequence probability, the paper introduces the so-called normalised similarity defined as log of the probability above and divided with the length of the sequence:
SIM(S) = 1/len(S) * log(P(s))
= 1/len(S) * (log(P(s_1)) + SUM(log(P(s_i | s_1 s_2 ... s_i-1))))
The SUM is over i from 2 to len(S)
The tree generation algorithm is defined in the paper. Let's implement it here in Python. As I've stated many times, most of the AI (like) algorithms are counting algorithms. Yes, we count!
Let's define the normal sequences first acquired during the learning phase.
import math
import statistics
VOCABULARY_SIZE = 10
def get_sequences():
sequences = [
[0, 1, 2, 3, 0, 1, 2, 3, 5, 6, 7],
[0, 1, 2, 3, 5, 6, 7, 0, 1, 2, 3, 5, 6, 7],
[8, 9, 5, 6, 7],
[5, 6, 7, 8, 9, 0, 1, 2, 3, 4],
[8, 9, 1, 2, 3, 4],
[0, 1, 2, 3, 0, 1, 2, 3, 5, 6, 7],
[8, 9, 5, 6, 7, 0, 1, 2, 3],
[0, 1, 2, 3, 0, 1, 2, 3, 5, 6, 7],
[0, 1, 2, 3, 5, 6, 7, 0, 1, 2, 3, 5, 6, 7]
]
return sequences
We need a function that will create probability distribution from counts. In order to avoid zeros, we need a logic that will smooth these probabilities if needed:
def get_smooth_probability_from_counts(counts):
min_val = min(counts)
max_val = max(counts)
if min_val == 0:
if max_val == 0:
return [0.00001] * len(counts)
min_val = min(filter(lambda x: x != 0, counts)) / 100.00
counts = [c if c != 0 else min_val for c in counts]
sum_val = float(sum(counts))
return [c / sum_val for c in counts]
The main logic is located in the class PSTNode:
class PSTNode:
def __init__(self, label, probabilities):
self.label = label
self.probabilities = probabilities
self.children = {}
def add_child(self, symbol, probabilities):
t_symbol = tuple((symbol,))
child_label = t_symbol + self.label
self.children[t_symbol] = PSTNode(child_label, probabilities)
def find_node(self, label):
label = label[::-1]
current_node = self
for i in range(0, len(label)):
child = label[i : i+1]
if child in current_node.children:
current_node = current_node.children[child]
else:
return None
return current_node
def get_conditional_probability(T, symbol, suffix):
path = suffix[::-1]
current_node = T
for i in range(0, len(path)):
child = path[i : i+1]
if child in current_node.children:
current_node = current_node.children[child]
else:
break
return current_node.probabilities[symbol]
def get_similarity(T, sequence):
p = 0.0
for i, symbol in enumerate(sequence):
p_i = PSTNode.get_conditional_probability(T, symbol, tuple(sequence[0:i]))
p += math.log(p_i)
return p / len(sequence)
The actual logic for building the tree, as described in the paper is:
def build_PST(sequences, l=5, min_count=1):
S = [0] * VOCABULARY_SIZE
for s in sequences:
for i in s:
S[i] += 1
HM = {
tuple((i,)): {"Before": [0] * VOCABULARY_SIZE, "After": [0] * VOCABULARY_SIZE}
for i, v in enumerate(S)
if v >= min_count
}
probabilities = get_smooth_probability_from_counts(S)
T = PSTNode(tuple(), probabilities)
k = 1
while k <= l:
for s in sequences:
for i in range(0, len(s) - k):
key = tuple(s[i : i + k])
if key in HM:
maps = HM[key]
if i > 0:
before = s[i - 1]
maps["Before"][before] += 1
if i < len(s) - 1:
after = s[i + 1]
maps["After"][after] += 1
HM_next = {}
for child, maps in HM.items():
probabilities = get_smooth_probability_from_counts(maps["After"])
parent = T.find_node(child[1:])
parent.add_child(child[0], probabilities)
for idx, before in enumerate(maps["Before"]):
if before >= min_count:
key = tuple((idx,)) + child
HM_next[key] = {
"Before": [0] * VOCABULARY_SIZE,
"After": [0] * VOCABULARY_SIZE,
}
HM = HM_next
k += 1
return T
Now, build the tree and enumerate it:
sequences = get_sequences()
T = build_PST(sequences, 10, 1)
def enum_tree(T):
print(T.label)
for key, child_node in T.children.items():
enum_tree(child_node)
enum_tree(T)
The enumerated tree is:
()
(0,)
(3, 0)
(2, 3, 0)
(1, 2, 3, 0)
(0, 1, 2, 3, 0)
(7, 0)
(6, 7, 0)
(5, 6, 7, 0)
. . . . . . . .
Calculate similarity for each normal sequence and after that calculate the mean and standard deviation of these similarities. Let's assume that the outlier threshold is defined as the value of mean - 3 * stdev.
similarities = []
for sequence in sequences:
similarity = PSTNode.get_similarity(T, sequence)
similarities.append(similarity)
mean = statistics.mean(similarities)
stdev = statistics.stdev(similarities)
threshold = mean - 3 * stdev
print("Mean:", mean)
print("Stdev:", stdev)
print("Threshold:", threshold)
The output is:
Mean: -3.7365682297537006
Stdev: 0.18905646530756265
Threshold: -4.303737625676389
Finally, let's define some new query sequences, calculate the similarity for each query sequence and decide if it is an outlier or not.
query_sequences = [
[0, 1, 2, 3],
[3, 2, 1, 0],
[0, 1, 2, 3, 0, 1, 2, 3, 5, 6, 7],
[0, 1, 2, 3, 5, 6, 7, 0, 1, 2, 3, 5, 6, 7],
[0, 2, 4, 6],
[8, 9, 5, 6, 7],
[2, 2, 2, 9, 8, 7],
]
for sequence in query_sequences:
similarity = PSTNode.get_similarity(T, sequence)
print("- - - - - - - -")
print("Sequence:", sequence)
print("Similarity:", similarity)
print("Is it Outlier?", similarity < threshold)
The results of this outlier analysis are:
- - - - - - - -
Sequence: [0, 1, 2, 3]
Similarity: -2.873706579357047
Is it Outlier? False
- - - - - - - -
Sequence: [3, 2, 1, 0]
Similarity: -4.434283676143275
Is it Outlier? True
- - - - - - - -
Sequence: [0, 1, 2, 3, 0, 1, 2, 3, 5, 6, 7]
Similarity: -3.611735573367646
Is it Outlier? False
- - - - - - - -
Sequence: [0, 1, 2, 3, 5, 6, 7, 0, 1, 2, 3, 5, 6, 7]
Similarity: -3.8430810681236816
Is it Outlier? False
- - - - - - - -
Sequence: [0, 2, 4, 6]
Similarity: -5.730393521539341
Is it Outlier? True
- - - - - - - -
Sequence: [8, 9, 5, 6, 7]
Similarity: -3.456957297665089
Is it Outlier? False
- - - - - - - -
Sequence: [2, 2, 2, 9, 8, 7]
Similarity: -4.453352477235748
Is it Outlier? True
The more data we have, the more precise detection system will be. Each production system should be further polished with optimised tree parameters (max tree depth L or the pruning value min_count).
Let's summarise that the sequence outlier analysis is such an important topic and the Probabilistic Suffix Tree based Outlier Detection algorithm, as demonstrated in this article, is a powerful tool created for that purpose.
Marjan
Lean Thinking | Prince2 | Solution Architect | Cloud Platform, Internet of Things, Industry 4.0 | Microsoft Certified
10 个月One of the best things that I did in my life, it was to enable notifications on your posts. I don't want to suggest anything, but in a similar context (production system) I perform similar task using variance/median analysis and applying six-sigma concepts. I want to ask if, in your opinion, my approach is correct, and which are the limitations. Thanks.