Chapter 3: Empirical Cumulative Distribution-based Outlier Detection (ECOD)
Empirical Cumulative Distribution-based Outlier Detection (ECOD) takes a very intuitive approach: outliers are the rare events in the tails of a distribution, so they can be identified by measuring where a point sits in that distribution.
ECOD first estimates the distribution of each variable in a non-parametric fashion. It then multiplies the estimated tail probabilities across all dimensions to get the anomaly score for an observation. Mathematically, it is hard to estimate the joint distribution of multiple dimensions, so ECOD assumes that the variables are independent and estimates the empirical cumulative distribution of each variable separately. Although the independence assumption may be too restrictive, it is not new: HBOS in the previous chapter makes the same assumption and has proven effective.
(A) What Are the Advantages of ECOD?
The authors of [1] demonstrate that ECOD outperforms other popular baseline detection methods. Because ECOD has no hyperparameters to tune, it is also fast to apply to large amounts of data. The authors reported that it takes only about two hours for a large dataset with one million observations and ten thousand features on a standard personal laptop.
Another merit of ECOD is easy interpretation. It lets you inspect how each of the per-dimension tail probabilities contributes to the final outlier score. I will demonstrate the interpretability in a later section.
(B) How Does ECOD Work?
Many readers are familiar with parametric distributions but not with non-parametric distributions. I will describe what parametric and non-parametric distributions are and show how a non-parametric distribution is formed. Then I will present the ECOD algorithm and compare ECOD with HBOS.
(B.1) Understand Empirical Cumulative Distribution Function
To explain the terms “non-parametric” and “parametric”, it helps to first clarify a few related terms: “population”, “samples”, and “estimates”. The goal of statistics is to understand a “population” of interest. Quantities such as means, standard deviations, and proportions are called “parameters” that describe a population. We usually cannot get the data for the entire population, so we cannot calculate the parameters directly. A practical solution is to collect random “samples” from the population. The distribution of the samples lets us “estimate” the parameters of the population distribution.
A “parametric” approach makes assumptions about the shape of the underlying population distribution, such as assuming it is a normal distribution. The “non-parametric” approach does not make any assumptions about the shape or parameters of the population distribution. Instead, the distribution is estimated “empirically” from the samples.
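To make the contrast concrete, here is a minimal sketch that estimates the same right-tail probability both ways. The gamma sample, the location `x = 20`, and the choice of a normal distribution as the parametric shape are all assumptions made only for this illustration.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical right-skewed sample (gamma), chosen only for illustration.
sample = rng.gamma(shape=2.0, scale=3.0, size=5000)

x = 20.0  # location whose right-tail probability we want

# Parametric: assume a normal shape, estimate its two parameters, then
# read the tail probability off the fitted normal CDF.
mu, sigma = sample.mean(), sample.std(ddof=1)
normal_cdf = 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))
parametric_tail = 1.0 - normal_cdf

# Non-parametric: make no shape assumption and simply count the share of
# sample points that lie beyond x.
empirical_tail = float(np.mean(sample > x))

print(f"parametric (normal) P(X > {x}): {parametric_tail:.4f}")
print(f"empirical           P(X > {x}): {empirical_tail:.4f}")
```

Because the sample is skewed, the normal assumption understates how much mass sits in the right tail, while the empirical estimate just reports what the data show.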
Let me demonstrate the nonparametric approach and estimate a distribution empirically. To generate a distribution that does not follow any particular shape, I aggregate two gamma distributions and a normal distribution arbitrarily as shown in Figure (B.1). Some extreme values can be found in the right tail.
# Create a distribution that combines three other distributions
from matplotlib import pyplot
from numpy.random import normal, gamma
from numpy import hstack
shape, scale = 10, 2.
s1 = gamma(shape, scale, 1000)
s2 = gamma(shape * 2, scale * 2, 1000)
s3 = normal(loc=0, scale=5, size=1000)
sample = hstack((s1, s2, s3))
# plot the histogram
pyplot.hist(sample, bins=50)
pyplot.show()
To estimate the distribution empirically, I use ECDF() in the Python statsmodels module to derive the cumulative distribution function (CDF) as shown in Figure (B.2).
# fit a cdf
from statsmodels.distributions.empirical_distribution import ECDF
sample_ecdf = ECDF(sample)
# plot the cdf
pyplot.plot(sample_ecdf.x, sample_ecdf.y)
pyplot.show()
I choose some locations in Figure (B.2) to show the cumulative probability up to those locations. For example, the cumulative probability of X<0 is 0.173, and that of X<125 is 0.9967. In other words, ‘0’ is at the 17.3rd percentile and ‘125’ is at the 99.67th percentile. Notice that a CDF value close to 1.0 means the point lies in the far right tail (and, symmetrically, a value close to 0 means the far left tail). This property will help us find extreme values.
print('P(x<-20): %.4f' % sample_ecdf(-20))
print('P(x<-2): %.4f' % sample_ecdf(-2))
print('P(x<0): %.4f' % sample_ecdf(0))
print('P(x<25): %.4f' % sample_ecdf(25))
print('P(x<50): %.4f' % sample_ecdf(50))
print('P(x<75): %.4f' % sample_ecdf(75))
print('P(x<100): %.4f' % sample_ecdf(100))
print('P(x<125): %.4f' % sample_ecdf(125))
print('P(x<140): %.4f' % sample_ecdf(140))
print('P(x<150): %.4f' % sample_ecdf(150))
The above section demonstrates how to derive the distribution of a variable empirically. Since the CDF measures the “outlier-ness” in terms of a variable, it can be developed into a univariate outlier score for a variable.
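The empirical CDF is simple enough to compute by hand, which helps show how it turns into a univariate score. The sketch below is my own NumPy version, not the statsmodels implementation; the helper names `ecdf_at` and `right_tail_at` and the toy sample are hypothetical. I assume the "share of values less than or equal to x" convention.

```python
import numpy as np

def ecdf_at(sample, x):
    """Share of sample values that are <= x (the step-function ECDF)."""
    sorted_sample = np.sort(np.asarray(sample, dtype=float))
    # side="right" counts the values <= x, ties included
    return np.searchsorted(sorted_sample, x, side="right") / sorted_sample.size

def right_tail_at(sample, x):
    """Share of sample values that are >= x."""
    sorted_sample = np.sort(np.asarray(sample, dtype=float))
    n = sorted_sample.size
    # side="left" counts the values strictly below x
    return (n - np.searchsorted(sorted_sample, x, side="left")) / n

# Small hypothetical sample with one large value at the end
toy = np.array([1.0, 2.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 50.0])

print("ECDF at 2.0:", ecdf_at(toy, 2.0))

# A small right-tail probability becomes a large score after -log
right_tail_score = -np.log(right_tail_at(toy, toy))
print("index of the highest-scoring value:", int(np.argmax(right_tail_score)))
```

The value 50 has only itself at or beyond it, so its right-tail probability is the smallest possible for this sample and its score is the largest.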
(B.2) ECOD Algorithm
Multi-dimensional data are also called multivariate data. In multi-dimensional data, each observation has multiple values. An observation can have extreme values in some dimensions and normal values in others. Thus an observation may have high outlier scores in some dimensions and low scores in others. ECOD aggregates the univariate outlier scores to get the overall outlier score for an observation.
However, there is a small technical challenge when aggregating the univariate outlier scores: the distribution of a dimension can be either left-skewed or right-skewed, as shown in Figure (B.2). It does not make sense to assume outliers always fall on the left-hand or the right-hand side of the distribution. We should first determine whether a distribution is left- or right-skewed. In a left-skewed distribution, the mean is less than the mode, and in a right-skewed distribution, the mean is larger than the mode. ECOD uses the skewness of the distribution to assign the outlier score for a dimension. If a distribution is right-skewed, ECOD scores a point by its position in the right tail, so a CDF close to 1 signals an outlier; if it is left-skewed, ECOD uses the left tail, where one minus the CDF (1-CDF) close to 1 signals an outlier. ECOD then aggregates the univariate outlier scores across all dimensions to get the overall outlier score for an observation.
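The procedure above can be sketched in a few lines of NumPy. This is a simplified illustration under my reading of the aggregation described in [1] (negative log tail probabilities summed per observation, with the final score taken as the largest of the left-only, right-only, and skewness-selected sums); it is not PyOD's implementation. The function names and the planted outlier are hypothetical.

```python
import numpy as np

def tail_probabilities(column):
    """Left- and right-tail ECDF values of one column at its own points."""
    sorted_col = np.sort(column)
    n = column.size
    left = np.searchsorted(sorted_col, column, side="right") / n         # P(X <= x)
    right = (n - np.searchsorted(sorted_col, column, side="left")) / n   # P(X >= x)
    return left, right

def sample_skewness(column):
    """Third standardized moment; assumes the column is not constant."""
    centered = column - column.mean()
    return np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5

def ecod_scores_sketch(X):
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    neg_log_left = np.empty((n, d))
    neg_log_right = np.empty((n, d))
    neg_log_auto = np.empty((n, d))
    for j in range(d):
        left, right = tail_probabilities(X[:, j])
        neg_log_left[:, j] = -np.log(left)
        neg_log_right[:, j] = -np.log(right)
        # Left-skewed column: score the left tail; otherwise the right tail
        use_left = sample_skewness(X[:, j]) < 0
        neg_log_auto[:, j] = neg_log_left[:, j] if use_left else neg_log_right[:, j]
    # Summing -log(p) over columns equals -log of the product of the p's
    totals = np.stack([neg_log_left.sum(axis=1),
                       neg_log_right.sum(axis=1),
                       neg_log_auto.sum(axis=1)])
    return totals.max(axis=0)

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
X_demo = np.vstack([X_demo, [8.0, -8.0, 8.0]])   # planted outlier in the last row

scores = ecod_scores_sketch(X_demo)
print("highest-scoring row:", int(np.argmax(scores)))
```

Working with sums of negative logs instead of a raw product of probabilities avoids numerical underflow when there are many dimensions, and it keeps each dimension's contribution additive.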
(B.3) Comparisons of ECOD and HBOS
You may have noticed that the concepts behind HBOS in the previous chapter and ECOD in this chapter are very similar. Both are unsupervised learning methods. Both assume variable independence to obtain the distribution of each variable. While HBOS derives the histogram for a variable, ECOD derives the cumulative distribution of a variable empirically. ECOD has no hyperparameters to tune, and HBOS requires little beyond a choice of the number of bins. Further, both HBOS and ECOD are distribution-based algorithms. Because distribution-based methods are usually fast, they are recommended as starting techniques in a modeling project.
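The difference between the two univariate scores can be seen on a single variable. The sketch below is a rough illustration only: `histogram_score` is an HBOS-style score that I wrote for this comparison (it is not PyOD's HBOS), `right_tail_score` is an ECDF-style score that looks at the right tail only, and the data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical variable: 999 ordinary values plus one planted extreme at the end
x = np.append(rng.normal(0.0, 1.0, 999), 9.0)

def histogram_score(values, n_bins):
    """HBOS-style score: values in rare (low) histogram bins get high scores."""
    counts, edges = np.histogram(values, bins=n_bins)
    relative_height = counts / counts.max()
    # Interior edges map every value to the index of the bin that holds it
    bin_index = np.digitize(values, edges[1:-1])
    return -np.log(relative_height[bin_index])

def right_tail_score(values):
    """ECDF-style score: few points at or beyond a value gives a high score."""
    sorted_values = np.sort(values)
    n = values.size
    at_or_beyond = n - np.searchsorted(sorted_values, values, side="left")
    return -np.log(at_or_beyond / n)

for n_bins in (10, 50):
    planted = histogram_score(x, n_bins)[-1]
    print(f"histogram score of the planted point with {n_bins} bins: {planted:.2f}")
print(f"right-tail score of the planted point: {right_tail_score(x)[-1]:.2f}")
```

The histogram score depends on the number of bins, whereas the ECDF-based score has no such setting; both rank the planted point as the most extreme value here.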
(C) Modeling Procedure
This book suggests a three-step modeling procedure for anomaly detection: model development (Step 1), threshold determination (Step 2), and feature evaluation (Step 3).
Once a model is developed and outlier scores are assigned in Step 1, Step 2 suggests plotting the histogram of the outlier scores to choose a threshold. The histogram usually presents a natural cut, as we will see in Figure (C.2). If you do not see a natural cut in the histogram, it typically means the features are not effective in differentiating outliers and you need to revise the features.
(C.1) Step 1 — Build your Model
I generate a mock dataset of 500 observations and six variables. I set the percentage of outliers to 5% with “contamination=0.05.” I also create a target variable Y as the ground truth. However, the unsupervised models will only use the X variables and the Y variable is simply for validation.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pyod.utils.data import generate_data
contamination = 0.05 # percentage of outliers
n_train = 500 # number of training points
n_test = 500 # number of testing points
n_features = 6 # number of features
X_train, X_test, y_train, y_test = generate_data(
    n_train=n_train,
    n_test=n_test,
    n_features=n_features,
    contamination=contamination,
    random_state=123)
X_train_pd = pd.DataFrame(X_train)
X_train_pd.head()
The first few observations look like the following:
Let me plot the first two variables in a scatter plot as shown in Figure (C.1). The yellow points are the outliers and the purple points are the normal data points.
(C.1.1) The Model
Below we declare and fit the model, then use the function decision_function() to generate the outlier scores for the training and test data.
from pyod.models.ecod import ECOD
ecod = ECOD(contamination=0.05)
ecod.fit(X_train)
# Training data
y_train_scores = ecod.decision_function(X_train)
y_train_pred = ecod.predict(X_train)
# Test data
y_test_scores = ecod.decision_function(X_test)
y_test_pred = ecod.predict(X_test) # outlier labels (0 or 1)
def count_stat(vector):
    # Because the labels are '0' and '1', we can run a count statistic.
    unique, counts = np.unique(vector, return_counts=True)
    return dict(zip(unique, counts))

print("The training data:", count_stat(y_train_pred))
print("The test data:", count_stat(y_test_pred))
# Threshold for the defined contamination rate
print("The threshold for the defined contamination rate:", ecod.threshold_)
(C.1.2) Explain the Outlier Score of an Observation
Since the ECOD outlier score is the sum of the univariate scores, we can visualize the univariate scores to understand why an outlier has a high score. Such interpretability for individual predictions is important in machine learning, as explained in the book “The eXplainable A.I. with Python Examples”. Let me retrieve the observations with high outlier scores to demonstrate how to visualize the univariate scores. The result includes Observations 475 and 477, among others.
np.where(y_train_scores>22)
ECOD has a special function explain_outlier() to explain the univariate outlier scores. I plot the univariate outlier scores for the two observations in the left and right graphs of Figure (C.1). The x-axis is the dimension and the y-axis is the univariate outlier score. The blue and orange dashed lines are the 95th and 99th percentiles of the outlier scores. In the left graph the univariate outlier scores are all around the 95% cutoff band except for Variable 1, and in the right graph they are all above the 95% cutoff band. This explainability of the outlier scores is an appealing property of ECOD.
ecod.explain_outlier(475)
ecod.explain_outlier(477)
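To see what such a per-dimension view rests on, the sketch below computes one score per observation and column and compares a single row with the 99th-percentile cutoff of each column. It is a simplified illustration that uses the right tail only and hypothetical data; it does not reproduce the plot that explain_outlier() draws.

```python
import numpy as np

rng = np.random.default_rng(7)
X_demo = rng.normal(size=(300, 4))
# Hypothetical observation that is extreme in columns 0 and 2 only
X_demo[0] = [6.0, 0.1, 5.5, -0.2]

n = X_demo.shape[0]
# Rank of every value within its column (0 = smallest); assumes no ties
ranks = X_demo.argsort(axis=0).argsort(axis=0)
right_tail = (n - ranks) / n             # share of points at or beyond each value
dim_scores = -np.log(right_tail)         # one score per observation and column

cutoff_99 = np.percentile(dim_scores, 99, axis=0)
row = 0
above_cutoff = dim_scores[row] > cutoff_99
for j in range(X_demo.shape[1]):
    position = "above" if above_cutoff[j] else "below"
    print(f"column {j}: score {dim_scores[row, j]:.2f} is {position} "
          f"the 99% cutoff {cutoff_99[j]:.2f}")
print(f"total score of row {row}: {dim_scores[row].sum():.2f}")
```

Because the total is a plain sum, each column's share of the score can be read off directly, which is what makes the per-dimension view easy to interpret.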
(C.2) Step 2 — Determine a reasonable threshold
In most cases, we do not know the percentage of outliers. We can use the histogram of the outlier score to select a reasonable threshold value. The threshold determines the size of the abnormal group. If any prior knowledge suggests the percentage of anomalies should be no more than 1%, you can choose a threshold that results in approximately 1% of anomalies. Figure (C.2) presents the histogram of the ECOD outlier score. It appears we can set the threshold at 16.0 because there is a natural cut in the histogram. If we select a low value for the threshold, the count of outliers will be high, and vice versa.
import matplotlib.pyplot as plt
plt.hist(y_train_scores, bins='auto') # arguments are passed to np.histogram
plt.title("Outlier score")
plt.show()
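The relationship between the threshold and the size of the outlier group can be checked numerically as well as visually. The sketch below uses hypothetical scores and a percentile rule; I assume this mirrors how a contamination rate maps to a threshold, but `threshold_for_rate` is my own helper, not a PyOD function.

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical outlier scores: 475 ordinary points and 25 high-scoring ones
scores = np.concatenate([rng.normal(8.0, 2.0, 475), rng.normal(25.0, 2.0, 25)])

def threshold_for_rate(scores, contamination):
    """Score above which roughly `contamination` of the points fall."""
    return float(np.percentile(scores, 100.0 * (1.0 - contamination)))

results = {}
for rate in (0.01, 0.05, 0.10):
    cut = threshold_for_rate(scores, rate)
    flagged = int(np.sum(scores > cut))
    results[rate] = (cut, flagged)
    print(f"rate={rate:.2f}  threshold={cut:6.2f}  flagged={flagged}")
```

A lower threshold flags more observations, so trying a few candidate rates and looking at the resulting group sizes is a quick complement to the histogram.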
(C.3) Step 3 — Present the descriptive statistics of the normal and the abnormal groups
As explained in Chapter 1, the descriptive statistics (such as the means and standard deviations) of the features in the two groups are important for demonstrating the soundness of a model. I create a short function descriptive_stat_threshold() to show the sizes and descriptive statistics of the features for the normal and the outlier groups based on the threshold. Below I simply use the threshold that corresponds to the 5% contamination rate. You can test a range of thresholds to find a reasonable size for the outlier group.
threshold = ecod.threshold_ # Or another value from the above histogram
def descriptive_stat_threshold(df, pred_score, threshold):
    # Let's see how many '0's and '1's.
    df = pd.DataFrame(df)
    df['Anomaly_Score'] = pred_score
    df['Group'] = np.where(df['Anomaly_Score'] < threshold, 'Normal', 'Outlier')
    # Now let's show the summary statistics:
    cnt = df.groupby('Group')['Anomaly_Score'].count().reset_index().rename(columns={'Anomaly_Score': 'Count'})
    cnt['Count %'] = (cnt['Count'] / cnt['Count'].sum()) * 100 # The count and count %
    stat = df.groupby('Group').mean().round(2).reset_index() # The avg.
    stat = cnt.merge(stat, left_on='Group', right_on='Group') # Put the count and the avg. together
    return stat
descriptive_stat_threshold(X_train,y_train_scores, threshold)
The above table presents the characteristics of the normal and abnormal groups. It shows the count and count percentage of the normal and outlier groups. The “Anomaly_Score” is the average anomaly score. You are reminded to label the features with their feature names for an effective presentation. The table tells us several important results:
Because the data generation gives us the ground truth (y_train for the training data), we can produce a confusion matrix to understand the model performance. The model does a decent job and identifies all 25 outliers.
def confusion_matrix(actual, pred):
    Actual_pred = pd.DataFrame({'Actual': actual, 'Pred': pred})
    cm = pd.crosstab(Actual_pred['Actual'], Actual_pred['Pred'])
    return cm
confusion_matrix(y_train,y_train_pred)
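The counts in a confusion matrix are often summarized as precision and recall. The helper below is a small sketch with hypothetical labels; `precision_recall` is my own function and is not part of PyOD or pandas.

```python
import numpy as np

def precision_recall(actual, pred):
    """Precision and recall for 0/1 labels, where 1 marks an outlier."""
    actual = np.asarray(actual).astype(bool)
    pred = np.asarray(pred).astype(bool)
    tp = int(np.sum(actual & pred))     # outliers that were flagged
    fp = int(np.sum(~actual & pred))    # normal points that were flagged
    fn = int(np.sum(actual & ~pred))    # outliers that were missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical labels: 1 = outlier, 0 = normal
actual_demo = [0, 0, 0, 0, 1, 1, 0, 1]
pred_demo = [0, 0, 1, 0, 1, 1, 0, 0]

precision, recall = precision_recall(actual_demo, pred_demo)
print(f"precision={precision:.3f}  recall={recall:.3f}")
```

Recall answers "how many of the true outliers did the model find", while precision answers "how many of the flagged points are truly outliers".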
(D) Outliers Identified by Multiple Models
We have now covered two models: HBOS in the previous chapter and ECOD in this one. If an observation is identified as an outlier by multiple models, we can be more confident that it is a true outlier. In this section, I am going to cross-tabulate the predictions of the two models to identify outliers. I first replicate the HBOS and ECOD models and generate their thresholds.
########
# HBOS #
########
from pyod.models.hbos import HBOS
n_bins = 50
hbos = HBOS(n_bins=n_bins, contamination=0.05)
hbos.fit(X_train)
y_train_hbos_pred = hbos.labels_
y_test_hbos_pred = hbos.predict(X_test)
y_train_hbos_scores = hbos.decision_function(X_train)
y_test_hbos_scores = hbos.decision_function(X_test)
########
# ECOD #
########
from pyod.models.ecod import ECOD
clf_name = 'ECOD'
ecod = ECOD(contamination=0.05)
ecod.fit(X_train)
y_train_ecod_pred = ecod.labels_
y_test_ecod_pred = ecod.predict(X_test)
y_train_ecod_scores = ecod.decision_scores_ # raw outlier scores
y_test_ecod_scores = ecod.decision_function(X_test)
# Thresholds
[ecod.threshold_, hbos.threshold_]
I put the actual Y value and the “1” and “0” values predicted by HBOS and ECOD together in a data frame. When I cross-tabulate the HBOS and the ECOD predictions, 26 observations are identified as outliers by both models. ECOD and HBOS yield consistent results.
# Put the actual labels, the HBOS predictions and the ECOD predictions together
Actual_pred = pd.DataFrame({'Actual': y_test, 'HBOS_pred': y_test_hbos_pred, 'ECOD_pred': y_test_ecod_pred})
Actual_pred.head()
pd.crosstab(Actual_pred['HBOS_pred'],Actual_pred['ECOD_pred'])
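Beyond the cross-tabulation, it is often useful to list which observations both models agree on and which ones only one model flags. The sketch below uses hypothetical 0/1 labels for eight observations; the array names are placeholders for the prediction vectors of any two detectors.

```python
import numpy as np

# Hypothetical 0/1 labels from two detectors for the same eight observations
hbos_demo = np.array([0, 1, 0, 1, 0, 0, 1, 0])
ecod_demo = np.array([0, 1, 0, 0, 0, 1, 1, 0])

both = np.flatnonzero((hbos_demo == 1) & (ecod_demo == 1))     # flagged by both
either = np.flatnonzero((hbos_demo == 1) | (ecod_demo == 1))   # flagged by at least one
disputed = np.setdiff1d(either, both)                          # flagged by only one

print("flagged by both models:", both.tolist())
print("flagged by only one model:", disputed.tolist())
```

The observations flagged by both models are the strongest candidates, and the disputed ones are a natural shortlist for manual review.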
(E) Summary of the ECOD Algorithm
(F) Python Notebook: Click here for the Notebook
References