Chapter 4: Isolation Forest
If you were asked to separate the above trees one by one, which tree would you start with? You may pick the one on the left because it stands alone. After removing that tree, which tree is the next easiest to separate? Probably the one at the bottom left of the big cluster. After removing that tree, which one is next? Probably the one at the upper left, and so forth. Here I present a very important intuition: an outlier should be the easiest to isolate. Just like peeling an onion, an outlier is on the outside layer. That's the intuition of Isolation Forest for finding outliers.
Isolation Forest is fast and does not consume much memory because it does not use any distance measures to detect anomalies. This advantage makes it suitable for large data sizes and high-dimensional problems.
(A) What Is Isolation Forest?
Many outlier detection techniques profile the normal data points first, then identify the instances that do not conform to the patterns of the normal data. The Isolation Forest, or IForest, proposed by Liu, Ting, and Zhou (2008) [1], departs from those methods. Rather than profiling normal data points to find outliers, IForest identifies anomalies directly. It applies a tree structure to isolate every instance. Anomalies are the data points that are singled out first, whereas normal points tend to hide deep in the tree. The authors call each tree an Isolation Tree, or iTree. Their algorithm builds an ensemble of iTrees. Anomalies are the instances that have short average path lengths on the iTrees.
Figure (A) uses a partition map and a tree to explain how an iTree isolates the data points. The red dot is the farthest from the other dots, then the green dot, then the blue dot. In the partition map, it takes only one "cut" to separate the red dot from the others. The second cut is for the green dot, the third cut is for the blue dot, and so on. The more cuts it takes to separate a dot, the deeper it sits in the tree. The tree structure on the right of Figure (A) tells the same story: it takes one split to single out the red dot, a second split to reach the green dot, a third split to get to the blue dot, and so on. The depth of a point therefore becomes a good proxy for its anomaly score. To be consistent with the convention that an anomaly is associated with a high score, the anomaly score is defined as the inverse of the depth.
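For readers who want the exact definition, the original paper [1] normalizes the expected path length E(h(x)) of a point x by the average path length c(n) of an unsuccessful search in a binary search tree built on n points. The formula below is my transcription of the published definition, not a formula from this chapter:

s(x, n) = 2^( -E(h(x)) / c(n) ),   where   c(n) = 2H(n-1) - 2(n-1)/n   and   H(i) ≈ ln(i) + 0.5772156649

A score close to 1 flags an anomaly, while a score well below 0.5 indicates a normal point. PyOD reports its own version of the outlier score, so the thresholds shown later in this chapter are on a different scale.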
An iTree is a binary tree in which each node has exactly zero or two daughter nodes. An iTree grows until one of these conditions is met: (i) an end node contains only one data point, (ii) all data in a node have the same values, or (iii) the tree reaches the height limit set by the researcher. An iTree does not need to be fully grown until every end node contains a single data point. It often stops growing when the depth reaches the set height limit, because our interest is in the anomalies closer to the root. Therefore, it is not necessary to construct a large iTree, since most of the data in an iTree are normal data points. A small sample size produces better iTrees because the swamping and masking effects are reduced. Note that the iTree algorithm is different from the decision tree algorithm because an iTree does not use a target variable to train the tree. It is an unsupervised learning method.
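To make the mechanics concrete, below is a minimal, illustrative sketch of how a single iTree could assign a path length to each point using purely random splits. This is not PyOD's or scikit-learn's implementation, and the function names build_itree and path_length are my own:

import numpy as np

def build_itree(X, depth=0, height_limit=8):
    # Split recursively with random cuts until a point is isolated,
    # all values are identical, or the height limit is reached.
    n = X.shape[0]
    if n <= 1 or depth >= height_limit or np.all(X == X[0]):
        return {"size": n}                       # external (leaf) node
    feature = np.random.randint(X.shape[1])      # pick a random feature
    lo, hi = X[:, feature].min(), X[:, feature].max()
    if lo == hi:
        return {"size": n}
    split = np.random.uniform(lo, hi)            # pick a random split value
    left = X[X[:, feature] < split]
    right = X[X[:, feature] >= split]
    return {"feature": feature, "split": split,
            "left": build_itree(left, depth + 1, height_limit),
            "right": build_itree(right, depth + 1, height_limit)}

def path_length(x, node, depth=0):
    # Depth at which a single point x ends up in the tree
    if "size" in node:
        return depth
    child = node["left"] if x[node["feature"]] < node["split"] else node["right"]
    return path_length(x, child, depth + 1)

np.random.seed(0)
X = np.vstack([np.random.normal(0, 1, (100, 2)), [[8.0, 8.0]]])  # one obvious outlier
tree = build_itree(X)
print(path_length(X[-1], tree), path_length(X[0], tree))  # outlier vs. normal point

The real algorithm builds each tree on a small random subsample and adds a correction term for unexpanded leaves, but the idea is the same: the obvious outlier is typically isolated after only a few cuts, while a normal point travels much deeper.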
(B) Why “Forest”?
You probably hear Random Forests more often than Isolation Forests. The "forest" refers to ensemble learning that constructs a forest of trees. Why is this needed? We all know the drawback of a single decision tree is overfitting, which means a model predicts very well on the training data but poorly on new data. The ensembling strategy overcomes this issue by building many trees and then averaging their predictions.
Figure (B) shows a data matrix in which each row is an instance with multi-dimensional values. The goal of IForest is to assign an outlier score to each instance. To start, it randomly selects a number of rows and a number of columns to create tables such as (1), (2), and (3). An instance will appear in at least one table. An iTree is built for each table to render outlier scores. Table (1) has 6 rows and 3 columns. The first cut for Table (1) is probably the 6th instance because its values are very different from the others. The second cut for Table (1) is probably the 4th instance. Similarly, in Table (3) the first cut is probably the 6th instance (the third record in the table), and the second cut is the 4th instance (the first record in the table). In short, if there are N tables, there will be N iTrees, and an instance can have up to N scores. IForest computes the arithmetic mean of the scores to get the final score.
(C) Modeling Procedure
Like other chapters, I use the following modeling procedure for the model development, assessment, and interpretation of the results.
· Model development
· Threshold determination
· Profiling of the normal and abnormal groups
The profiling of the two groups is important for communicating the soundness of the model. Your business knowledge should tell you whether the mean of a feature ought to be higher or lower in the abnormal group. If a feature looks counter-intuitive, you are advised to investigate, modify, or drop it. You should iterate the modeling process until all features make sense.
(C.1) Step 1: Build the Model
I generate six features and 500 instances each for the training and test data. Although the data have the target variable Y, the unsupervised model uses only the X variables; the Y variable is kept for validation. The percentage of outliers is set to 5% with "contamination=0.05". We can draw a scatter plot of the first two variables: the yellow points are the outliers, and the purple points are the normal data points.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pyod.utils.data import generate_data

X_train, X_test, y_train, y_test = generate_data(
    n_train=500, n_test=500, n_features=6,
    contamination=0.05, random_state=123)
X_train_pd = pd.DataFrame(X_train)
X_train_pd.head()

# Scatter plot of the first two variables (yellow = outliers, purple = normal)
plt.scatter(X_train_pd[0], X_train_pd[1], c=y_train, alpha=0.8)
plt.title('Scatter plot')
plt.xlabel('x0')
plt.ylabel('x1')
plt.show()
Below we declare and fit the model. The size of a tree, "max_samples", is set to 40 instances. In IForest it is not necessary to assign a large tree size; a small sample size can produce better iTrees. The percentage of outliers is set to 5% with "contamination=0.05". If unspecified, the default value of 10% is used.
from pyod.models.iforest import IForest

isft = IForest(contamination=0.05,
               max_samples=40, behaviour='new')
isft.fit(X_train)
Next, the function decision_function() generates the outlier scores, and the function predict() assigns the labels (1 or 0) based on the 5% contamination rate. The contamination parameter is only used by predict() to assign the binary labels and does not affect the computation of the outlier scores.
# Training data
y_train_scores = isft.decision_function(X_train)
y_train_pred = isft.predict(X_train)

# Test data: outlier scores and labels (0 or 1)
y_test_scores = isft.decision_function(X_test)
y_test_pred = isft.predict(X_test)

# Threshold for the defined contamination rate
print("The threshold for the defined contamination rate:", isft.threshold_)
# count_stat() is the utility function from Chapter 2 that counts the 0s and 1s
print("The training data:", count_stat(y_train_pred))
print("The test data:", count_stat(y_test_pred))
The threshold for the defined contamination rate: -5.007279313407054e-15
The training data: {0: 475, 1: 25}
The test data: {0: 474, 1: 26}
The attribute threshold_ stores the threshold value calculated from the training data. In this case, the threshold value is about -5.0e-15 when the contamination rate is 0.05.
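As a quick sanity check (not shown in the original text), the labels from predict() should match what you get by applying threshold_ to the scores yourself:

# The labels from predict() should equal thresholding the scores manually
manual_pred = (y_train_scores > isft.threshold_).astype(int)
print((manual_pred == y_train_pred).all())   # expected: True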
(C.1.1) Model Parameters
I am going to explain a few important parameters.
isft.get_params()
{'behaviour': 'new',
 'bootstrap': False,
 'contamination': 0.05,
 'max_features': 1.0,
 'max_samples': 40,
 'n_estimators': 100,
 'n_jobs': 1,
 'random_state': None,
 'verbose': 0}
First, “max_samples” is the number of samples to draw from X to train each base estimator. This is the size of a tree and is an important parameter.
Second, “n_estimators” is the number of trees in the ensemble. The default is 100 trees.
Third, “max_features” is the number of features to draw from X to train each base estimator. The default value is 1.0.
Fourth, "n_jobs" is the number of jobs to run in parallel for both fit and predict. The default value is 1. If it is set to -1, the number of jobs is set to the number of cores.
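For illustration, all of these parameters are set when the detector is declared; the values below are arbitrary choices for this sketch, not recommendations from the book:

isft = IForest(n_estimators=200,    # number of trees in the ensemble
               max_samples=64,      # rows sampled to grow each tree
               max_features=1.0,    # fraction of features drawn per tree
               n_jobs=-1,           # use all CPU cores
               contamination=0.05,
               random_state=123)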
(C.1.2) Variable Importance
Because IForest applies a tree framework, we are able to understand the relative importance of features in determining outliers. The feature importance is measured by the Gini impurity index. The values sum to 1.0.
array([0.18348639, 0.14334461, 0.15951035, 0.18151592, 0.1413287, 0.19081403])
We can plot the feature importance just like that of a tree-based model. Figure (C.1.2) shows the relative strength of features in determining outliers.
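The bar chart can be produced with a few lines, assuming the fitted PyOD model exposes the scikit-learn style feature_importances_ attribute:

# Plot the relative feature importance of the fitted IForest model
importances = isft.feature_importances_
plt.bar(range(len(importances)), importances)
plt.xlabel('Feature')
plt.ylabel('Importance')
plt.title('Feature importance')
plt.show()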
(C.2) Step 2 — Determine a Reasonable Threshold
The threshold can be determined from the histogram of the outlier scores. Figure (C.2) suggests a threshold around 0.0, meaning the outlier scores of most of the normal data are below 0.0, while the outlier scores of the abnormal data sit in the high range.
import matplotlib.pyplot as plt
plt.hist(y_train_scores, bins='auto')
plt.title("Outlier score")
plt.show()
(C.3) Step 3 — Profile the Normal and the Abnormal Groups
Profiling the normal and outlier groups is a critical step to demonstrate the soundness of a model. Let's get the descriptive statistics below. The code of the utility function has been explained in Chapter 2.
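For readers who do not have Chapter 2 at hand, below is a minimal sketch of what such a profiling utility could look like. It is my simplified version, not the book's exact code:

def descriptive_stat_threshold(df, pred_score, threshold):
    # Assign each observation to the normal or the outlier group
    df = pd.DataFrame(df).copy()
    df['Anomaly_Score'] = pred_score
    df['Group'] = np.where(pred_score < threshold, 'Normal', 'Outlier')
    # Feature means, average score, count, and count % per group
    stat = df.groupby('Group').mean()
    stat['Count'] = df.groupby('Group').size()
    stat['Count %'] = 100 * stat['Count'] / stat['Count'].sum()
    return stat.round(2)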
descriptive_stat_threshold(
    X_train, y_train_scores, isft.threshold_)
The above table includes the essential elements for model assessment and model outcome. Let me give a few highlights.
First is the size of the outlier group, "Count %". Table (C.3) shows it is 5% in our case. Remember the size of the outlier group is determined by the threshold. If you choose a higher value for the threshold, the size will decrease.
Second is the variable statistics in each group. The feature means should be consistent with any prior knowledge. If any feature shows a counter-intuitive result, the feature should be re-examined or removed. The model should be re-iterated until all features make sense. You are reminded to label the features with their feature names for an effective presentation.
Third is the average anomaly score. In Table (C.3), the average outlier score of the outlier group is far higher than that of the normal group, which confirms that the outlier group should have a higher anomaly score. You do not need to over-interpret the scores.
Because we have the ground truth in our data, we can produce a confusion matrix to understand the model performance. The function confusion_matrix_threshold() is a user-defined function explained in Chapter 2. The model does a decent job and identifies all 25 outliers.
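A minimal sketch of such a function, assuming it simply thresholds the scores and cross-tabulates them against the ground truth (again my simplified version, not the book's exact code):

def confusion_matrix_threshold(actual, score, threshold):
    # Convert the outlier scores into predicted labels using the threshold
    predicted = np.where(score > threshold, 1, 0)
    return pd.crosstab(pd.Series(actual, name='Actual'),
                       pd.Series(predicted, name='Predicted'))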
confusion_matrix_threshold(
    y_train, y_train_scores, isft.threshold_)
(D) Aggregate Model Predictions to Achieve Stability
The outcome of a single IForest model can vary because the trees are built with random splits on random subsamples. To produce a stable prediction outcome, we can aggregate the scores produced by multiple models. The PyOD module offers four methods to aggregate the outcomes: Average, Maximization (MAX), Average of Maximum (AOM), and Maximum of Average (MOA). You only need one method to produce your aggregate outcome. Remember to pip install combo for these functions.
from pyod.models.combination import aom, moa, average, maximization
from pyod.utils.utility import standardizer
from pyod.models.iforest import IForest
It is always safe to standardize the input values for modeling.
# Standardize data
X_train_norm, X_test_norm = standardizer(X_train, X_test)
Next is the model specification. Among the hyper-parameters, the number of trees is probably the most important one for IForest. I am going to produce 5 models over a range of the number of trees. The average prediction of these models will be the final model prediction.
# Test a range of the number of trees (n_estimators)
k_list = [100, 200, 300, 400, 500]
n_clf = len(k_list)
I prepare two zero-filled NumPy arrays, each with 5 columns, to store the predicted scores.
train_scores = np.zeros([X_train.shape[0], n_clf])
test_scores = np.zeros([X_test.shape[0], n_clf])
Below let’s iterate through the 5 models and store the outlier scores.
# Modeling
for i in range(n_clf):
    k = k_list[i]
    isft = IForest(contamination=0.05, n_estimators=k)
    isft.fit(X_train_norm)

    # Store the outlier scores of each model in its own column
    train_scores[:, i] = isft.decision_function(X_train_norm)
    test_scores[:, i] = isft.decision_function(X_test_norm)
Finally, we standardize the predicted scores of the five models and then compute their average. The histogram of the averaged scores is plotted in Figure (D).
# Decision scores have to be standardized before combination
train_scores_norm, test_scores_norm = standardizer(train_scores, test_scores)

# The result "y_train_by_average" is a single column of averaged scores
y_train_by_average = average(train_scores_norm)
y_test_by_average = average(test_scores_norm)
import matplotlib.pyplot as plt
plt.hist(y_train_by_average, bins='auto')
plt.title("Combination by average")
plt.show()
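If you prefer one of the other three aggregation methods, they are drop-in replacements for average(); a brief sketch is below (the variable names on the left are my own):

# Alternative aggregation methods from pyod.models.combination
y_train_by_max = maximization(train_scores_norm)
y_train_by_aom = aom(train_scores_norm, n_buckets=5)
y_train_by_moa = moa(train_scores_norm, n_buckets=5)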
Figure (D) suggests a threshold of 1.0. With that, I profile the characteristics of the normal and the outlier groups in Table (D). The threshold identifies 25 data points as outliers. Readers can apply interpretations similar to those for Table (C.3).
descriptive_stat_threshold(
    X_train, y_train_by_average, 1.0)
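Since the ground truth is available, the same confusion-matrix check from Step 3 can also be applied to the aggregated scores with the 1.0 threshold (a natural follow-up not shown in the original text):

confusion_matrix_threshold(y_train, y_train_by_average, 1.0)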
(E) Summary for IForest
· Most model-based approaches to anomaly detection construct a profile of normal instances, then identify instances that do not conform to the normal profile as outliers. IForest, in contrast, directly and explicitly isolates anomalies.
· IForest adopts a tree structure to isolate every single instance. Anomalies are the instances singled out first, whereas normal instances tend to cluster deep in the tree.
· Because Isolation Forest does not use any distance measures to detect anomalies, it is fast and suitable for large data sizes and high-dimensional problems.
References
1. Liu, F. T., Ting, K. M., & Zhou, Z.-H. (2008). Isolation Forest. 2008 Eighth IEEE International Conference on Data Mining, pp. 413–422.