Chapter 3: Empirical Cumulative Distribution-based Outlier Detection (ECOD)

Chris Kuo, Ph.D., CPCU

Data science & insurance professional

Published: January 28, 2023

The Empirical Cumulative Distribution-based Outlier Detection (ECOD) has a very intuitive approach: outliers are the rare events in the tails of a distribution, so they can be identified by measuring their location in the distribution.

ECOD first estimates the distribution of each variable in a non-parametric fashion. It then multiplies the estimated tail probabilities of all dimensions to get the anomaly score for an observation. Mathematically, it is hard to estimate the joint distribution of multiple dimensions. ECOD assumes the variables are independent, so it only needs to estimate the empirical cumulative distribution of each variable separately. Although the assumption of variable independence may be too restrictive, it is not new: HBOS in the previous chapter makes the same assumption and has proven effective.

(A) What Are the Advantages of ECOD?

The authors of [1] demonstrate that ECOD outperforms other popular baseline detection methods. Because ECOD has no hyper-parameters to tune, it is fast at handling a large amount of data. The authors report that it takes only about two hours to process a large dataset with one million observations and ten thousand features on a standard personal laptop.

Another merit of ECOD is easy interpretation. It lets you inspect how each of the multiple tail probabilities contributes to the final outlier score. I will demonstrate the interpretability in a later section.

(B) How Does ECOD Work?

Many readers are familiar with parametric distributions but not non-parametric distributions. I will describe what parametric and non-parametric distributions are and how a non-parametric distribution is formed. Then I’ll present the ECOD algorithm and compare ECOD with HBOS.

(B.1) Understand Empirical Cumulative Distribution Function

To explain the terms “non-parametric” and “parametric”, it helps to first clarify a few related terms: “population”, “samples”, and “estimates”. The goal of statistics is to understand a “population” of interest. Quantities such as means, standard deviations, and proportions are called “parameters”, and they describe a population. We usually cannot get data on the entire population, so we cannot calculate the parameters directly. A practical solution is to collect random “samples” from the population. The distribution of the samples lets us “estimate” the parameters of the population distribution.

A “parametric” approach makes assumptions about the shape of the distribution of the underlying population, such as a normal distribution. The “nonparametric” approach does not make any assumptions about the shape or parameters of the population distribution. The distribution is estimated “empirically” from the samples.
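
To make the contrast concrete, here is a minimal sketch that estimates P(X <= 60) in two ways on a skewed sample. The parametric estimate assumes a normal shape and plugs in the sample mean and standard deviation; the nonparametric estimate simply counts. The gamma sample, the seed, and the cut-off value 60 are illustrative assumptions, not part of the original example.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
# A right-skewed sample (gamma), so the normal assumption is wrong on purpose
sample = rng.gamma(shape=2.0, scale=10.0, size=5000)

x0 = 60.0

# Parametric: assume a normal shape and estimate its two parameters
mu, sigma = sample.mean(), sample.std()
p_parametric = 0.5 * (1.0 + math.erf((x0 - mu) / (sigma * math.sqrt(2.0))))

# Nonparametric: no shape assumption, just the fraction of samples <= x0
p_empirical = np.mean(sample <= x0)

print("normal-fit  P(X <= 60): %.4f" % p_parametric)
print("empirical   P(X <= 60): %.4f" % p_empirical)
```

On a skewed sample like this one, the normal fit understates how much mass sits in the right tail, which is the kind of error a nonparametric estimate avoids.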

Let me demonstrate the nonparametric approach and estimate a distribution empirically. To generate a distribution that does not follow any particular shape, I aggregate two gamma distributions and a normal distribution arbitrarily as shown in Figure (B.1). Some extreme values can be found in the right tail.

# Create a distribution that is the combination of three other distributions
from matplotlib import pyplot
from numpy.random import normal, gamma
from numpy import hstack
shape, scale = 10, 2.  
s1 = gamma(shape, scale, 1000)
s2 = gamma(shape * 2, scale * 2, 1000)
s3 = normal(loc=0, scale=5, size=1000)
sample = hstack((s1, s2, s3))
# plot the histogram
pyplot.hist(sample, bins=50)
pyplot.show()

Figure (B.1): A Distribution

To estimate the distribution empirically, I use ECDF() in the Python statsmodels module to derive the cumulative distribution function (CDF) as shown in Figure (B.2).

# fit a cdf
from statsmodels.distributions.empirical_distribution import ECDF
sample_ecdf = ECDF(sample)

# plot the cdf
pyplot.plot(sample_ecdf.x, sample_ecdf.y)
pyplot.show()

I choose some locations in Figure (B.2) to show the cumulative probability up to those locations. For example, the cumulative probability of X<0 is 0.173, and that of X<125 is 0.9967. Or we can say ‘0’ is at the 17.3 percentile, and ‘125’ is at the 99.67 percentile. Notice that a location with a CDF close to 1.0 is near the extreme of the right tail. This property will help us to find extreme values.

print('P(x<-20): %.4f' % sample_ecdf(-20))
print('P(x<-2): %.4f' % sample_ecdf(-2))
print('P(x<0): %.4f' % sample_ecdf(0))
print('P(x<25): %.4f' % sample_ecdf(25))
print('P(x<50): %.4f' % sample_ecdf(50))
print('P(x<75): %.4f' % sample_ecdf(75))
print('P(x<100): %.4f' % sample_ecdf(100))
print('P(x<125): %.4f' % sample_ecdf(125))
print('P(x<140): %.4f' % sample_ecdf(140))
print('P(x<150): %.4f' % sample_ecdf(150))

The above section demonstrates how to derive the distribution of a variable empirically. Since the CDF measures the “outlier-ness” of a value within a variable, it can be developed into a univariate outlier score for that variable.
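
The empirical CDF itself is simple to compute by hand: for a value x, it is the fraction of sample points less than or equal to x. The sketch below shows this with NumPy only; the helper name ecdf_at and the toy data are hypothetical and chosen so the results can be checked by counting.

```python
import numpy as np

def ecdf_at(sample, x):
    """Fraction of sample points less than or equal to x."""
    sorted_sample = np.sort(np.asarray(sample, dtype=float))
    # side="right" counts the points <= x
    counts = np.searchsorted(sorted_sample, x, side="right")
    return counts / sorted_sample.size

toy = np.array([1.0, 2.0, 2.0, 3.0, 10.0])
print(ecdf_at(toy, 2.0))    # 3 of 5 points are <= 2
print(ecdf_at(toy, 9.0))    # 4 of 5 points are <= 9
print(ecdf_at(toy, 10.0))   # all 5 points are <= 10
```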

(B.2) ECOD Algorithm

Multi-dimensional data are also called multivariate data. In multi-dimensional data, each observation has multiple values. An observation can have extreme values in some dimensions and normal values in other dimensions. Thus an observation may have high outlier scores in some dimensions and low scores in other dimensions. ECOD aggregates the univariate outlier scores to get the overall outlier score for an observation.

However, there is a small technical challenge when aggregating the univariate outlier scores: the distribution of a dimension can be either left-skewed or right-skewed as shown in Figure (B.2). It does not make sense to assume outliers always fall on the left- or right-hand side of the distribution. We should first determine whether a distribution is left- or right-skewed. In a left-skewed distribution, the mean is typically less than the mode, and in a right-skewed distribution, the mean is typically larger than the mode. ECOD uses the skewness of the distribution to assign the outlier score for a dimension. If a distribution is right-skewed, the outliers sit in the right tail, so the outlier score rises with the CDF; if it is left-skewed, the score rises with one minus the CDF (1-CDF). ECOD then aggregates the univariate outlier scores across all dimensions to get the overall outlier score for an observation.
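
The steps above can be sketched in a few lines of NumPy. This is a simplified illustration of the idea, not PyOD's implementation and not the full method in [1]: the function name ecod_like_scores is hypothetical, the skew direction is taken from the sign of the third central moment, and each dimension contributes the negative log of its tail probability, so multiplying tail probabilities becomes summing scores.

```python
import numpy as np

def ecod_like_scores(X):
    """Simplified ECOD-style scores: one score per row of X."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    total = np.zeros(n)
    for j in range(d):
        col = X[:, j]
        sorted_col = np.sort(col)
        # Left tail: fraction of points <= x; right tail: fraction of points >= x
        left = np.searchsorted(sorted_col, col, side="right") / n
        right = (n - np.searchsorted(sorted_col, col, side="left")) / n
        # Sign of the third central moment gives the skew direction
        skew = np.mean((col - col.mean()) ** 3)
        tail = left if skew < 0 else right
        # Small tail probability -> large score; sum over dimensions
        total += -np.log(tail)
    return total

rng = np.random.default_rng(42)
X = rng.gamma(shape=2.0, scale=1.0, size=(200, 3))   # right-skewed features
X[0] = [50.0, 50.0, 50.0]                            # plant one extreme row
scores = ecod_like_scores(X)
print("index of the highest score:", int(np.argmax(scores)))
```

The planted row is the largest value in every dimension, so its right-tail probability is 1/200 in each dimension and its score is the highest in the sample.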

(B.3) Comparisons of ECOD and HBOS

You may have realized that the concepts of HBOS in the previous chapter and ECOD in this chapter are very similar. Both are unsupervised learning methods. Both assume variable independence to obtain the distribution of each variable. While HBOS derives the histogram of a variable, ECOD derives the cumulative distribution of a variable empirically. ECOD has no hyperparameters to tune, and HBOS needs little tuning beyond the number of histogram bins. Further, both HBOS and ECOD are distribution-based algorithms. Because distribution-based methods are usually fast, they are recommended as the starting techniques in a modeling project.

This book suggests a three-step modeling procedure for anomaly detection: (1) model development, (2) threshold determination, and (3) feature evaluation.

Once a model is developed and outlier scores are assigned in Step 1, Step 2 suggests that you plot the histogram of the outlier scores to choose a threshold. The histogram usually presents a natural cut, as we will see in Figure (C.2). If you do not see a natural cut in the histogram, it typically means the features are not effective in differentiating outliers and you need to revise the features.

(C.1) Step 1 — Build your Model

I generate a mock dataset of 500 observations and six variables. I set the percentage of outliers to 5% with “contamination=0.05.” I also create a target variable Y as the ground truth. However, the unsupervised models will only use the X variables and the Y variable is simply for validation.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pyod.utils.data import generate_data
contamination = 0.05 # percentage of outliers
n_train = 500       # number of training points
n_test = 500        # number of testing points
n_features = 6      # number of features
X_train, X_test, y_train, y_test = generate_data(
    n_train=n_train, 
    n_test=n_test, 
    n_features= n_features, 
    contamination=contamination, 
    random_state=123)

X_train_pd = pd.DataFrame(X_train)
X_train_pd.head()

The first few observations look like the following:

Let me plot the first two variables in a scatter plot as shown in Figure (C.1). The yellow points are the outliers and the purple points are the normal data points.

(C.1.1) The Model

Below we declare and fit the model, then use the function decision_function() to generate the outlier scores for the training and test data.

from pyod.models.ecod import ECOD
ecod = ECOD(contamination=0.05)
ecod.fit(X_train)

# Training data
y_train_scores = ecod.decision_function(X_train)
y_train_pred = ecod.predict(X_train)

# Test data
y_test_scores = ecod.decision_function(X_test)
y_test_pred = ecod.predict(X_test) # outlier labels (0 or 1)

def count_stat(vector):
    # Because it is '0' and '1', we can run a count statistic. 
    unique, counts = np.unique(vector, return_counts=True)
    return dict(zip(unique, counts))

print("The training data:", count_stat(y_train_pred))
print("The test data:", count_stat(y_test_pred))
# Threshold for the defined contamination rate
print("The threshold for the defined contamination rate:", ecod.threshold_)

The parameter contamination=0.05 declares the percentage of outliers to be 5%. The contamination parameter does not affect the calculation of the outlier scores.
PyOD uses the given contamination rate to derive the threshold for the outlier scores, and applies the function predict() to assign the labels (1 or 0).
I wrote a short function count_stat() in the code above to show the count of predicted “1” and “0” values.
The attribute .threshold_ shows the threshold at the assigned contamination rate. Any outlier score higher than the threshold is considered an outlier.
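
To see why the contamination rate changes the labels but not the scores, consider the following sketch. It assumes the threshold is simply the (1 - contamination) percentile of the training scores, which is my reading of how PyOD derives it; the scores here are random stand-ins rather than real ECOD output.

```python
import numpy as np

contamination = 0.05
rng = np.random.default_rng(7)
scores = rng.normal(loc=10.0, scale=2.0, size=1000)   # stand-in outlier scores

# Assumed rule: the threshold is the (1 - contamination) percentile of the scores
threshold = np.percentile(scores, 100 * (1 - contamination))
labels = (scores > threshold).astype(int)             # 1 = outlier, 0 = normal

print("threshold: %.3f" % threshold)
print("flagged as outliers:", int(labels.sum()), "of", labels.size)
```

Changing contamination moves the threshold and therefore the number of points labeled “1”, while the scores array stays exactly the same.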

(C.1.2) Explain the Outlier Score of an Observation

Since the ECOD outlier score is the sum of the univariate scores, we can visualize the univariate scores to understand why an outlier has a high score. Such interpretability for individual predictions is important in machine learning, as explained in the book “The eXplainable A.I. with Python Examples”. Let me find the observations with high outlier scores to demonstrate how to visualize the univariate scores. The code below returns Observations 475 and 477, among others.

np.where(y_train_scores>22)

ECOD has a special function explain_outlier() to explain the univariate outliers. I plot the univariate outlier scores for the two observations in the left and right graphs of Figure (C.1). The x-axis is the dimension and the y-axis is the univariate outlier score. The blue and orange dashed lines are the 95th and 99th percentiles of the outlier scores. The left graph shows the univariate outlier scores are all about the 95% cutoff band except Variable 1, and in the right graph they are all above the 95% cutoff band. This explainability of the outlier scores is an attractive property of ECOD.

ecod.explain_outlier(475)
ecod.explain_outlier(477)
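
The same kind of breakdown can be reproduced as plain numbers. The sketch below is a simplified stand-in for the plot, not the PyOD function: it uses only right-tail scores, synthetic data with extreme values planted in three dimensions of one row, and per-dimension 95th and 99th percentile bands. All names and values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 6))
X[0, :3] = 8.0        # plant extreme values in the first three dimensions only

n, d = X.shape
# Per-dimension score: -log of the right-tail probability (fraction of points >= x)
per_dim = np.empty((n, d))
for j in range(d):
    sorted_col = np.sort(X[:, j])
    right_tail = (n - np.searchsorted(sorted_col, X[:, j], side="left")) / n
    per_dim[:, j] = -np.log(right_tail)

# Cutoff bands: 95th and 99th percentiles of each dimension's scores
band_95, band_99 = np.percentile(per_dim, [95, 99], axis=0)

row = 0
for j in range(d):
    flag = "above 99%" if per_dim[row, j] > band_99[j] else (
        "above 95%" if per_dim[row, j] > band_95[j] else "within band")
    print("dimension %d: score %.2f (%s)" % (j, per_dim[row, j], flag))
```

Reading the output row by row shows which dimensions push an observation's total score up, which is the question the explain_outlier() plot answers visually.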

(C.2) Step 2 — Determine a reasonable threshold

In most cases, we do not know the percentage of outliers. We can use the histogram of the outlier score to select a reasonable threshold value. The threshold determines the size of the abnormal group. If any prior knowledge suggests the percentage of anomalies should be no more than 1%, you can choose a threshold that results in approximately 1% of anomalies. Figure (C.2) presents the histogram of the ECOD outlier score. It appears we can set the threshold at 16.0 because there is a natural cut in the histogram. If we select a low value for the threshold, the count of outliers will be high, and vice versa.

import matplotlib.pyplot as plt
plt.hist(y_train_scores, bins='auto') # arguments are passed to np.histogram
plt.title("Outlier score")
plt.show()
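
Besides reading the histogram, it can help to tabulate how the size of the outlier group responds to a few candidate thresholds. The sketch below uses made-up stand-in scores (a large low-scoring group and a small high-scoring group) and three arbitrary candidate thresholds; with real data you would pass y_train_scores instead.

```python
import numpy as np

rng = np.random.default_rng(3)
# Stand-in scores: a large normal group plus a small high-scoring group
scores = np.concatenate([rng.normal(9.0, 2.0, 475), rng.normal(23.0, 1.5, 25)])

for threshold in (12.0, 16.0, 20.0):
    n_outliers = int((scores >= threshold).sum())
    share = 100.0 * n_outliers / scores.size
    print("threshold %.1f -> %d outliers (%.1f%%)" % (threshold, n_outliers, share))
```

A threshold sitting in the gap between the two groups gives a count that barely changes when the threshold moves a little, which is the numerical counterpart of a natural cut in the histogram.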

(C.3) Step 3 — Present the descriptive statistics of the normal and the abnormal groups

As explained in Chapter 1, the descriptive statistics (such as the means and standard deviations) of the features in the two groups are important to demonstrate the soundness of a model. I create a short function descriptive_stat_threshold() to show the sizes and descriptive statistics of the features for the normal and the outlier groups based on the threshold. Below I simply use the threshold at 5%. You can test a range of thresholds to find a reasonable size for the outlier group.

threshold = ecod.threshold_ # Or other value from the above histogram

def descriptive_stat_threshold(df,pred_score, threshold):
    # Let's see how many '0's and '1's.
    df = pd.DataFrame(df)
    df['Anomaly_Score'] = pred_score
    df['Group'] = np.where(df['Anomaly_Score']< threshold, 'Normal', 'Outlier')

    # Now let's show the summary statistics:
    cnt = df.groupby('Group')['Anomaly_Score'].count().reset_index().rename(columns={'Anomaly_Score':'Count'})
    cnt['Count %'] = (cnt['Count'] / cnt['Count'].sum()) * 100 # The count and count %
    stat = df.groupby('Group').mean().round(2).reset_index() # The avg.
    stat = cnt.merge(stat, left_on='Group',right_on='Group') # Put the count and the avg. together
    return (stat)

descriptive_stat_threshold(X_train,y_train_scores, threshold)

The above table presents the characteristics of the normal and abnormal groups. It shows the count and count percentage of the normal and outlier groups. The “Anomaly_Score” column is the average anomaly score. Remember to label the features with their feature names for an effective presentation. The table tells us several important results:

The size of the outlier group: The outlier group is about 5%. Remember the size of the outlier group is determined by the threshold. The size will shrink if you choose a higher value for the threshold.
The average anomaly score: The average outlier score of the outlier group is far higher than that of the normal group (22.86 > 9.40). You do not need to read too much into the exact score values.
The feature statistics in each group: The table shows the means of the features in the outlier group are smaller than those of the normal group. Whether the means of the features in the outlier group should be higher or lower depends on the business application. All the means must be consistent with the domain knowledge.

Because we have the ground truth in our generated data, we can produce a confusion matrix to understand the model performance. The code below does this for the training data with y_train and y_train_pred. The model does a decent job and identifies all 25 outliers.

def confusion_matrix(actual,pred):
    Actual_pred = pd.DataFrame({'Actual': actual, 'Pred': pred})
    cm = pd.crosstab(Actual_pred['Actual'],Actual_pred['Pred'])
    return (cm)
confusion_matrix(y_train,y_train_pred)
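
A confusion matrix is often summarized by precision and recall. The sketch below computes both from two small hand-made label vectors; the vectors are illustrative only and are not the model output above.

```python
import numpy as np

actual = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])
pred   = np.array([0, 0, 0, 1, 0, 0, 1, 1, 0, 0])

tp = int(np.sum((actual == 1) & (pred == 1)))   # outliers correctly flagged
fp = int(np.sum((actual == 0) & (pred == 1)))   # normal points flagged by mistake
fn = int(np.sum((actual == 1) & (pred == 0)))   # outliers that were missed

precision = tp / (tp + fp)
recall = tp / (tp + fn)
print("TP=%d FP=%d FN=%d precision=%.2f recall=%.2f" % (tp, fp, fn, precision, recall))
```

In anomaly detection the outlier class is small, so these two numbers are usually more informative than overall accuracy.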

(D) Outliers Identified by Multiple Models

We have learned two models, HBOS in the previous chapter and ECOD in this chapter. If an outlier is identified by multiple models, the chance that it is an outlier is much higher. In this section, I am going to cross-tabulate the predictions of the two models to identify outliers. I first replicate the HBOS and ECOD models and generate their thresholds.

########
# HBOS #
########
from pyod.models.hbos import HBOS
n_bins = 50
hbos = HBOS(n_bins=n_bins, contamination=0.05)
hbos.fit(X_train)
y_train_hbos_pred = hbos.labels_
y_test_hbos_pred = hbos.predict(X_test)
y_train_hbos_scores = hbos.decision_function(X_train)
y_test_hbos_scores = hbos.decision_function(X_test)

########
# ECOD #
########
from pyod.models.ecod import ECOD
clf_name = 'ECOD'
ecod = ECOD(contamination=0.05)
ecod.fit(X_train)
y_train_ecod_pred = ecod.labels_
y_test_ecod_pred = ecod.predict(X_test)
y_train_ecod_scores = ecod.decision_scores_  # raw outlier scores
y_test_ecod_scores = ecod.decision_function(X_test)

# Thresholds
[ecod.threshold_, hbos.threshold_]

I put the actual Y value, the predicted “1” and “0” values by HBOS and ECOD together in a data frame. When I cross-tabulate the HBOS and the ECOD predictions, 26 observations are identified by both models to be outliers. Both ECOD and HBOS yield consistent results.

# Put the actual labels, the HBOS predictions, and the ECOD predictions together
Actual_pred = pd.DataFrame({'Actual': y_test, 'HBOS_pred': y_test_hbos_pred, 'ECOD_pred': y_test_ecod_pred})
Actual_pred.head()
pd.crosstab(Actual_pred['HBOS_pred'],Actual_pred['ECOD_pred'])
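
Beyond the counts in the cross-tabulation, it is often useful to list which observations both models flag and which ones they disagree on. A minimal sketch with made-up label vectors (the variable values are hypothetical, not the predictions above):

```python
import numpy as np

hbos_pred = np.array([0, 0, 1, 1, 0, 1, 0, 0])   # made-up labels from one model
ecod_pred = np.array([0, 0, 1, 0, 0, 1, 1, 0])   # made-up labels from the other

both = np.flatnonzero((hbos_pred == 1) & (ecod_pred == 1))     # flagged by both
either = np.flatnonzero((hbos_pred == 1) | (ecod_pred == 1))   # flagged by at least one
only_one = np.setdiff1d(either, both)                          # the models disagree

print("flagged by both:", both.tolist())
print("flagged by only one model:", only_one.tolist())
```

Observations flagged by both models are the strongest candidates, while the ones flagged by a single model are natural cases for manual review.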

(E) Summary of the ECOD Algorithm

If an observation is an outlier in terms of almost all variables, the observation is likely an outlier.
ECOD defines the outlier score for each variable based on its empirical cumulative distribution.
The outlier scores of all variables can be added up to get the multivariate outlier score for an observation.
Because ECOD has no hyperparameters to tune and treats each variable independently, it is an efficient unsupervised method to detect anomalies.

(F) Python Notebook: Click here for the Notebook

References

[1] Z. Li, Y. Zhao, X. Hu, N. Botta, C. Ionescu, and G. Chen, “ECOD: Unsupervised Outlier Detection Using Empirical Cumulative Distribution Functions,” in IEEE Transactions on Knowledge and Data Engineering, DOI: 10.1109/TKDE.2022.3159580.
