Chapter 3: Empirical Cumulative Distribution-based Outlier Detection (ECOD)

The Empirical Cumulative Distribution-based Outlier Detection (ECOD) algorithm takes a very intuitive approach: outliers are the rare events in the tails of a distribution, so they can be identified by measuring where a point falls in the distribution.

ECOD first estimates the distribution of each variable in a non-parametric fashion. It then multiplies the estimated tail probabilities across all dimensions to get the anomaly score for an observation. Because it is mathematically hard to estimate the joint distribution of multiple dimensions, ECOD assumes the variables are independent so it can estimate the empirical cumulative distribution of each variable separately. Although the assumption of variable independence may seem restrictive, it is not new: HBOS in the previous chapter makes the same assumption and has proven effective.

(A) What Are the Advantages of ECOD?

The authors of [1] demonstrate that ECOD outperforms other popular baseline detection methods. Because ECOD has no hyper-parameters to tune, it can handle a large amount of data quickly. The authors report that it takes only about two hours on a standard personal laptop for a large dataset with one million observations and ten thousand features.

Another merit of ECOD is easy interpretation. It lets you inspect how each of the univariate tail probabilities contributes to the final outlier score. I will demonstrate this interpretability in a later section.

(B) How Does ECOD Work?

Many readers are familiar with parametric distributions but not non-parametric ones. I will first describe what parametric and non-parametric distributions are and how a non-parametric distribution is formed. Then I will present the ECOD algorithm and compare ECOD with HBOS.

(B.1) Understand Empirical Cumulative Distribution Function

To explain the terms “non-parametric” and “parametric”, it is helpful to clarify a few related terms: “population”, “samples”, and “estimates”. The goal of statistics is to understand a “population” of interest. Quantities such as means, standard deviations, and proportions that describe a population are called “parameters”. We usually cannot collect the data of the entire population, so we cannot calculate the parameters directly. A practical solution is to collect random “samples” from the population. The distribution of the samples lets us “estimate” the parameters of the population distribution.

A “parametric” approach makes assumptions about the shape of the distribution of the underlying population, such as a normal distribution. A “non-parametric” approach does not make any assumptions about the shape or parameters of the population distribution. The distribution is estimated “empirically” from the samples.

Let me demonstrate the nonparametric approach and estimate a distribution empirically. To generate a distribution that does not follow any particular shape, I aggregate two gamma distributions and a normal distribution arbitrarily as shown in Figure (B.1). Some extreme values can be found in the right tail.

# Create a distribution that is the combination of three distributions
from matplotlib import pyplot
from numpy.random import normal, gamma
from numpy import hstack
shape, scale = 10, 2.  
s1 = gamma(shape, scale, 1000)
s2 = gamma(shape * 2, scale * 2, 1000)
s3 = normal(loc=0, scale=5, size=1000)
sample = hstack((s1, s2, s3))
# plot the histogram
pyplot.hist(sample, bins=50)
pyplot.show()        
Figure (B.1): A Distribution

To estimate the distribution empirically, I use ECDF() in the Python statsmodels module to derive the empirical cumulative distribution function (CDF), as shown in Figure (B.2).

# fit an empirical CDF
from statsmodels.distributions.empirical_distribution import ECDF
sample_ecdf = ECDF(sample)

# plot the cdf
pyplot.plot(sample_ecdf.x, sample_ecdf.y)
pyplot.show()        
Figure (B.2): The Empirical Cumulative Distribution Function (ECDF)

I choose some locations in Figure (B.2) to show the cumulative probability up to those locations. For example, the cumulative probability of X<0 is 0.173, and that of X<125 is 0.9967. In other words, ‘0’ is at the 17.3rd percentile and ‘125’ is at the 99.67th percentile. Notice that a location with a CDF close to 1.0 implies the point is near the extreme right tail. This property will help us find extreme values.

print('P(x<-20): %.4f' % sample_ecdf(-20))
print('P(x<-2): %.4f' % sample_ecdf(-2))
print('P(x<0): %.4f' % sample_ecdf(0))
print('P(x<25): %.4f' % sample_ecdf(25))
print('P(x<50): %.4f' % sample_ecdf(50))
print('P(x<75): %.4f' % sample_ecdf(75))
print('P(x<100): %.4f' % sample_ecdf(100))
print('P(x<125): %.4f' % sample_ecdf(125))
print('P(x<140): %.4f' % sample_ecdf(140))
print('P(x<150): %.4f' % sample_ecdf(150))        

The section above demonstrates how to derive the distribution of a variable empirically. Since the CDF measures the “outlier-ness” of a value of a variable, it can be developed into a univariate outlier score for that variable.
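
As a minimal sketch (my own illustration, not PyOD's implementation), a univariate outlier score can be defined as the negative log of the smaller tail probability, so that points deep in either tail receive a large score. The code reuses the sample array from Figure (B.1):

# A minimal sketch of a univariate outlier score from an ECDF.
# This is an illustration of the idea, not PyOD's implementation.
import numpy as np
from statsmodels.distributions.empirical_distribution import ECDF

def univariate_outlier_score(sample, x):
    ecdf = ECDF(sample)
    left_tail = ecdf(x)           # P(X <= x)
    right_tail = 1.0 - left_tail  # P(X > x)
    eps = 1e-10                   # avoids log(0) at the extreme ends
    # The rarer tail drives the score: the smaller the tail
    # probability, the larger the negative log.
    return -np.log(min(left_tail, right_tail) + eps)

print(univariate_outlier_score(sample, 150))  # extreme value -> high score
print(univariate_outlier_score(sample, 30))   # central value -> low score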

(B.2) ECOD Algorithm

Multi-dimensional data are also called multivariate data. In multi-dimensional data, each observation has multiple values. An observation can have extreme values in some dimensions and normal values in others, so it may have high univariate outlier scores in some dimensions and low scores in others. ECOD aggregates the univariate outlier scores to get the overall outlier score for an observation.

Figure (B.3): Left- and Right-Skewed Distributions (Image by author)


However, there is a small technical challenge when computing the univariate outlier scores: the distribution of a dimension can be either left-skewed or right-skewed, as shown in Figure (B.3). It does not make sense to assume outliers always fall on the left- or right-hand side of a distribution, so we should first determine whether a distribution is left- or right-skewed. In a left-skewed distribution the mean is less than the mode, and in a right-skewed distribution the mean is greater than the mode. ECOD uses the skewness of a distribution to decide which tail to measure: if a distribution is left-skewed, the outlier-ness of a point is measured by its left tail probability (the CDF), and if right-skewed, by its right tail probability (1-CDF). ECOD takes the negative log of the tail probability, so rarer points receive larger univariate scores, and then sums the univariate scores across all dimensions to get the overall outlier score for an observation. Summing negative log probabilities is equivalent to multiplying the tail probabilities mentioned earlier.
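
Below is a simplified sketch of this aggregation under the independence assumption. It is my own illustration of the idea, not PyOD's exact code (the full algorithm in [1] also computes left-only and right-only aggregations and takes the maximum of the variants):

# A simplified sketch of the ECOD aggregation (an illustration, not PyOD's code).
import numpy as np
from scipy.stats import skew
from statsmodels.distributions.empirical_distribution import ECDF

def ecod_scores_sketch(X):
    n, d = X.shape
    scores = np.zeros(n)
    eps = 1e-10  # avoids log(0)
    for j in range(d):
        ecdf = ECDF(X[:, j])
        cdf = ecdf(X[:, j])
        if skew(X[:, j]) < 0:    # left-skewed: measure the left tail (CDF)
            tail = cdf
        else:                    # right-skewed: measure the right tail (1-CDF)
            tail = 1.0 - cdf
        # Summing negative log tail probabilities across dimensions is
        # equivalent to multiplying the tail probabilities.
        scores += -np.log(tail + eps)
    return scores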

(B.3) Comparisons of ECOD and HBOS

You may have realized that the concepts of HBOS in the previous chapter and ECOD in this chapter are very similar. Both are unsupervised learning methods. Both assume variable independence to obtain the distribution of each variable. While HBOS derives the histogram of a variable, ECOD derives the cumulative distribution of a variable empirically. Neither method has hyper-parameters that need heavy tuning (HBOS only needs a bin count). Further, both HBOS and ECOD are distribution-based algorithms. Because distribution-based methods are usually fast, they are recommended as starting techniques in a modeling project.

(C) Modeling Procedure

This book suggests the Steps 1, 2, 3 modeling procedure for anomaly detection. It involves model development, threshold determination, and feature evaluation.

Once a model is developed and outlier scores are assigned in Step 1, Step 2 suggests plotting the histogram of the outlier scores to choose a threshold. The histogram usually presents a natural cut, as we will see in Figure (C.3). If you do not see a natural cut in the histogram, it typically means the features are not effective in differentiating outliers, and you need to revise the features.


(C.1) Step 1 — Build your Model

I generate a mock dataset of 500 observations and six variables. I set the percentage of outliers to 5% with “contamination=0.05”. I also create a target variable Y as the ground truth. However, the unsupervised models will only use the X variables; the Y variable is simply for validation.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pyod.utils.data import generate_data
contamination = 0.05 # percentage of outliers
n_train = 500       # number of training points
n_test = 500        # number of testing points
n_features = 6      # number of features
X_train, X_test, y_train, y_test = generate_data(
    n_train=n_train, 
    n_test=n_test, 
    n_features= n_features, 
    contamination=contamination, 
    random_state=123)

X_train_pd = pd.DataFrame(X_train)
X_train_pd.head()        

The first few observations look like the following:


Let me plot the first two variables in a scatter plot as shown in Figure (C.1). The yellow points are the outliers and the purple points are the normal data points.
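
A minimal sketch to produce Figure (C.1) could look like the following (it relies on matplotlib's default colormap to render the 0/1 labels as purple and yellow):

# A minimal sketch to reproduce the scatter plot in Figure (C.1).
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, alpha=0.8)
plt.xlabel('Variable 0')
plt.ylabel('Variable 1')
plt.show()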

Figure (C.1): Scatter plot of the first two variables (yellow points are outliers, purple points are normal)

(C.1.1) The Model

Below we declare and fit the model, then use the function decision_function() to generate the outlier scores for the training and test data.

from pyod.models.ecod import ECOD
ecod = ECOD(contamination=0.05)
ecod.fit(X_train)

# Training data
y_train_scores = ecod.decision_function(X_train)
y_train_pred = ecod.predict(X_train)

# Test data
y_test_scores = ecod.decision_function(X_test)
y_test_pred = ecod.predict(X_test) # outlier labels (0 or 1)

def count_stat(vector):
    # Because it is '0' and '1', we can run a count statistic. 
    unique, counts = np.unique(vector, return_counts=True)
    return dict(zip(unique, counts))

print("The training data:", count_stat(y_train_pred))
print("The training data:", count_stat(y_test_pred))
# Threshold for the defined contamination rate
print("The threshold for the defined contamination rate:", ecod.threshold_)

  • The parameter contamination=0.05 declares the percentage of outliers to be 5%. The contamination parameter does not affect the calculation of the outlier scores.
  • PyOD uses the given contamination rate to derive the threshold for the outlier scores and applies the function predict() to assign the labels (1 or 0).
  • I wrote a short function count_stat() in the code above to show the counts of the predicted “1” and “0” values.
  • The attribute .threshold_ stores the threshold at the assigned contamination rate. Any outlier score higher than the threshold is considered an outlier, as the sanity check below verifies.
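
The following sanity check (my own illustration, not part of the standard PyOD workflow) verifies that predict() simply compares the outlier scores with .threshold_:

# A sanity check (illustration): the labels from predict() should agree
# with manually comparing the outlier scores against the threshold.
manual_labels = (y_train_scores > ecod.threshold_).astype(int)
print((manual_labels == y_train_pred).all())  # expected: True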


(C.1.2) Explain the Outlier Score of an Observation

Since the ECOD outlier score is the sum of the univariate scores, we can visualize the univariate scores to understand why an outlier has a high score. Such interpretability for individual predictions is important in machine learning, as explained in the book “The eXplainable A.I. with Python Examples”. Let me retrieve the observations with high outlier scores to demonstrate how to visualize their univariate scores. The code below returns Observations 475 and 477, among others.

np.where(y_train_scores>22)        

ECOD has a special function explain_outlier() to explain the univariate outlier scores. I plot the univariate outlier scores for the two observations in the left and right graphs of Figure (C.2). The x-axis is the dimension and the y-axis is the univariate outlier score. The blue and orange dashed lines are the 95th and 99th percentiles of the outlier scores. The left graph shows the univariate outlier scores are all around the 95% cutoff band except Variable 1, while in the right graph they are all above the 95% cutoff band. This explainability of the outlier scores is a valuable property of ECOD.

ecod.explain_outlier(475)
ecod.explain_outlier(477)        
Figure (C.2): ECOD Explains Outliers

(C.2) Step 2 — Determine a reasonable threshold

In most cases, we do not know the percentage of outliers. We can use the histogram of the outlier scores to select a reasonable threshold. The threshold determines the size of the abnormal group. If prior knowledge suggests the percentage of anomalies should be no more than 1%, you can choose a threshold that results in approximately 1% of anomalies. Figure (C.3) presents the histogram of the ECOD outlier scores. It appears we can set the threshold at 16.0 because there is a natural cut in the histogram. If we select a lower threshold, the count of outliers will be higher, and vice versa.

import matplotlib.pyplot as plt
plt.hist(y_train_scores, bins='auto') # arguments are passed to np.histogram
plt.title("Outlier score")
plt.show()        
Figure (C.3): The histogram of the ECOD outlier scores
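
If prior knowledge caps the anomaly rate, say at 1%, a simple way to translate that rate into a threshold is to take the corresponding quantile of the training scores. The snippet below is my own sketch, not a PyOD function:

# A sketch: translate a prior anomaly rate (e.g., 1%) into a threshold
# by taking the corresponding quantile of the training scores.
threshold_1pct = np.quantile(y_train_scores, 0.99)
print('Threshold for about 1%% of outliers: %.2f' % threshold_1pct)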

(C.3) Step 3 — Present the descriptive statistics of the normal and the abnormal groups

As explained in Chapter 1, the descriptive statistics (such as the means and standard deviations) of the features in the two groups are important for demonstrating the soundness of a model. I create a short function descriptive_stat_threshold() to show the sizes and descriptive statistics of the features for the normal and the outlier groups based on the threshold. Below I simply use the threshold derived from the 5% contamination rate. You can test a range of thresholds to find a reasonable size for the outlier group.

threshold = ecod.threshold_ # Or another value from the histogram above

def descriptive_stat_threshold(df,pred_score, threshold):
    # Classify observations into the normal and the outlier groups by the threshold.
    df = pd.DataFrame(df)
    df['Anomaly_Score'] = pred_score
    df['Group'] = np.where(df['Anomaly_Score']< threshold, 'Normal', 'Outlier')

    # Now let's show the summary statistics:
    cnt = df.groupby('Group')['Anomaly_Score'].count().reset_index().rename(columns={'Anomaly_Score':'Count'})
    cnt['Count %'] = (cnt['Count'] / cnt['Count'].sum()) * 100 # The count and count %
    stat = df.groupby('Group').mean().round(2).reset_index() # The avg.
    stat = cnt.merge(stat, left_on='Group',right_on='Group') # Put the count and the avg. together
    return (stat)

descriptive_stat_threshold(X_train,y_train_scores, threshold)        

The above table presents the characteristics of the normal and abnormal groups. It shows the count and count percentage of each group. The “Anomaly_Score” column is the average anomaly score. You are reminded to label the features with their feature names for an effective presentation. The table tells us several important results:

  • The size of the outlier group: The outlier group is about 5% of the data. Remember the size of the outlier group is determined by the threshold; it will shrink if you choose a higher threshold.
  • The average anomaly score: The average outlier score of the outlier group is far higher than that of the normal group (22.86 > 9.40). You do not need to over-interpret the magnitude of the ECOD scores.
  • The feature statistics in each group: The table shows the means of the features in the outlier group are smaller than those of the normal group. Whether the feature means in the outlier group should be higher or lower depends on the business application. All the means should be consistent with domain knowledge.

Because the data generation provides the ground truth labels, we can produce a confusion matrix to understand the model performance. The model does a decent job and identifies all 25 outliers in the training data.

def confusion_matrix(actual,pred):
    Actual_pred = pd.DataFrame({'Actual': actual, 'Pred': pred})
    cm = pd.crosstab(Actual_pred['Actual'],Actual_pred['Pred'])
    return (cm)
confusion_matrix(y_train,y_train_pred)        

(D) Outliers Identified by Multiple Models

We have learned two models: HBOS in the previous chapter and ECOD in this chapter. If an outlier is identified by multiple models, the chance that it is a true outlier is much higher. In this section, I cross-tabulate the predictions of the two models to identify outliers. I first re-fit the HBOS and ECOD models and generate their thresholds.

########
# HBOS #
########
from pyod.models.hbos import HBOS
n_bins = 50
hbos = HBOS(n_bins=n_bins, contamination=0.05)
hbos.fit(X_train)
y_train_hbos_pred = hbos.labels_
y_test_hbos_pred = hbos.predict(X_test)
y_train_hbos_scores = hbos.decision_function(X_train)
y_test_hbos_scores = hbos.decision_function(X_test)

########
# ECOD #
########
from pyod.models.ecod import ECOD
clf_name = 'ECOD'
ecod = ECOD(contamination=0.05)
ecod.fit(X_train)
y_train_ecod_pred = ecod.labels_
y_test_ecod_pred = ecod.predict(X_test)
y_train_ecod_scores = ecod.decision_scores_  # raw outlier scores
y_test_ecod_scores = ecod.decision_function(X_test)

# Thresholds
[ecod.threshold_, hbos.threshold_]        

I put the actual Y values and the predicted “1” and “0” values of HBOS and ECOD together in a data frame. When I cross-tabulate the HBOS and ECOD predictions, 26 observations are identified as outliers by both models. ECOD and HBOS yield consistent results.

# Put the actual, the HBO score and the ECOD score together
Actual_pred = pd.DataFrame({'Actual': y_test, 'HBOS_pred': y_test_hbos_pred, 'ECOD_pred': y_test_ecod_pred})
Actual_pred.head()
pd.crosstab(Actual_pred['HBOS_pred'],Actual_pred['ECOD_pred'])
        

(E) Summary of the ECOD Algorithm

  • If an observation is extreme in terms of almost all variables, the observation is likely an outlier.
  • ECOD defines the univariate outlier score of each variable based on its empirical cumulative distribution (the tail probability).
  • The univariate outlier scores of all variables are summed up to get the multivariate outlier score of an observation.
  • Because empirical cumulative distributions are easy to compute, ECOD is an efficient unsupervised method for detecting anomalies.

(F) Python Notebook: Click here for the Notebook

References

  • [1] Z. Li, Y. Zhao, X. Hu, N. Botta, C. Ionescu, and G. Chen, “ECOD: Unsupervised Outlier Detection Using Empirical Cumulative Distribution Functions,” in IEEE Transactions on Knowledge and Data Engineering, DOI: 10.1109/TKDE.2022.3159580.
