Outlier Detection with Machine Learning

Outliers are common in real-world datasets in Data Science, and the identification of outliers and anomalies is an integral part of data preprocessing in a Machine Learning pipeline.


Difference between Outliers and Anomalies:

  • Anomalies are values that are unexplainable in nature (they deviate too much from the base distribution or from our assumptions). E.g., in a dataset containing the ages of students at a university, a value of -20 is an anomaly, as age cannot be negative.
  • Outliers are unlikely events or data points significantly different from the other points in the dataset. E.g., in the same university dataset, a student's age could be 87, i.e., an older person.


Univariate Methods

Z Scores

Z Score is the number of standard deviations a point is away from the mean.

Z = \frac{x - \mu}{\sigma}

where Z is the standard score, x is the observed value, \mu is the mean of the training sample, and \sigma is the sample's standard deviation.

This method filters out values outside a specified threshold, typically defined as ±3 standard deviations from the mean.

# Keep only rows within ±3 standard deviations of the column mean
mean = X[col].mean()
std = X[col].std()
X = X[(X[col] >= mean - 3 * std) & (X[col] <= mean + 3 * std)]
Normal distribution with outliers outside of 3 sigma
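
As a quick end-to-end illustration, here is a minimal sketch on synthetic data (the age distribution, seed, and injected value are assumptions for demonstration):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
X = pd.DataFrame({"age": rng.normal(21, 2, size=500)})
X.loc[len(X)] = 87  # inject one implausible value

col = "age"
mean, std = X[col].mean(), X[col].std()
X = X[(X[col] >= mean - 3 * std) & (X[col] <= mean + 3 * std)]
# The injected 87 lies far beyond mean + 3 * std and is dropped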

Pros:

  • Simple to compute and implement
  • Easy to understand and interpret

Cons:

  • As the mean and standard deviation are themselves sensitive to outliers, Z scores are skewed by the very points they are meant to detect and are not robust.
  • Z scores assume the data is normally distributed.


Modified Z Score

Modified Z Score is a more robust alternative to the Z score, as it uses the median and the Median Absolute Deviation (MAD) instead of the mean and standard deviation.

M_i = \frac{0.6745\,(x_i - \tilde{x})}{\mathrm{MAD}}

where x_i is the observed value, \tilde{x} is the sample median, MAD is the median absolute deviation, and 0.6745 is the 75th percentile (upper quartile) of the standard normal distribution.
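
The constant is easy to verify, assuming SciPy is available:

from scipy.stats import norm

print(norm.ppf(0.75))  # ≈ 0.6745, the 75th percentile of the standard normal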

med = X[col].median()
mad_val = (X[col] - med).abs().median()  # raw median absolute deviation
# Note: statsmodels' mad() already rescales by 0.6745, so combining it with
# the 0.6745 factor below would apply the constant twice
X = X[(0.6745 * (X[col] - med) / mad_val).abs() < 3]

Pros:

  • Robust to outliers
  • No assumption of normality

Cons:

  • Computing the median is more expensive than the mean, since it requires sorting (or partial sorting) of the data.


Interquartile Range (IQR) Rule

The IQR rule detects outliers with quartiles, excluding values that fall outside the following fences:

[\,Q_1 - 1.5 \cdot \mathrm{IQR},\; Q_3 + 1.5 \cdot \mathrm{IQR}\,], \quad \mathrm{IQR} = Q_3 - Q_1

where Q1 and Q3 are the first and third quartiles of the distribution.

q1 = X[col].quantile(0.25)
q3 = X[col].quantile(0.75)
iqr = q3 - q1
# Keep rows inside the Tukey fences
X = X[(X[col] >= q1 - 1.5 * iqr) & (X[col] <= q3 + 1.5 * iqr)]
Box plot to visually identify outliers
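
A minimal usage sketch on a hypothetical ages series (the data are an assumption; note that the fences are far less affected by the extreme value than the mean and standard deviation are):

import pandas as pd

ages = pd.Series([21, 22, 23, 20, 24, 22, 87])
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
mask = (ages >= q1 - 1.5 * iqr) & (ages <= q3 + 1.5 * iqr)
print(ages[mask])  # 87 falls above the upper fence and is excluded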

Pros:

  • Robust to outliers
  • Simple to understand

Cons:

  • Its univariate nature might remove data points that are essential in a multivariate context


Multivariate Methods

Mahalanobis Distance

Mahalanobis distance is a multivariate measure of the distance between a point and a distribution.

D_M(x) = \sqrt{(x - \mu)^T \, S^{-1} \, (x - \mu)}

where x is the vector for the datapoint, mu is the mean vector of the distribution, and S^-1 is the inverse of the covariance matrix S.

import numpy as np
from scipy.stats import chi2

# Precompute the mean vector and inverse covariance matrix once
mu = X[continuous_columns].mean().values
inv_cov = np.linalg.inv(np.cov(X[continuous_columns].values.T))

def mahalanobis_distance(x):
    x_minus_mu = x.values - mu
    return np.sqrt(x_minus_mu @ inv_cov @ x_minus_mu)

m_distances = X[continuous_columns].apply(mahalanobis_distance, axis=1)

# Squared Mahalanobis distances of normal data follow a chi-squared
# distribution with df equal to the number of variables
threshold = chi2.ppf(0.99, df=len(continuous_columns))
X = X[m_distances <= np.sqrt(threshold)]

Pros:

  • Accounts for correlations between features
  • Sensitive to the shape of the distribution
  • Works with any dimensionality

Cons:

  • Sensitive to outliers due to the use of the mean and covariance
  • Computationally expensive due to inverting the covariance matrix


Algorithm-Based Methods

One Class SVMs

One Class SVM is an unsupervised outlier detection algorithm that learns a decision boundary to classify data points as similar to or different from the training set.


from sklearn.svm import OneClassSVM

def one_class_svm_outlier_detection(data, nu=0.05):
    # nu upper-bounds the fraction of training points treated as outliers
    clf = OneClassSVM(nu=nu)
    clf.fit(data)
    outliers = clf.predict(data) == -1  # -1 marks points outside the boundary
    return outliers
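
A short usage sketch on synthetic 2-D data (the inlier distribution and the two injected points are assumptions for illustration):

import numpy as np

rng = np.random.default_rng(42)
inliers = rng.normal(0, 1, size=(200, 2))
data = np.vstack([inliers, [[6.0, 6.0], [-7.0, 5.0]]])  # two far-away points

mask = one_class_svm_outlier_detection(data, nu=0.05)
print(data[mask])  # flagged points should include the two injected ones

In practice, features should be standardized first, since the default RBF kernel is sensitive to scale.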

Pros:

  • Works well with high-dimensionality problems
  • Can understand complex relationships

Cons:

  • Collinearity issues with some kernels
  • Sensitive to hyperparameters
  • Cannot handle categorical variables well


Local Outlier Factor (LOF)

Local Outlier Factor is an unsupervised method that computes the deviation of a given data point's local density with respect to its neighbours.

Comparing the local density of a point with the densities of its neighbours. Source: Wikipedia
from sklearn.neighbors import LocalOutlierFactor

def lof_outlier_detection(data, k=20):
    # Compare each point's local density with that of its k nearest neighbours
    clf = LocalOutlierFactor(n_neighbors=k)
    outliers = clf.fit_predict(data) == -1  # -1 marks low-density points
    return outliers
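
Beyond the binary labels, the fitted estimator exposes a per-point score through its negative_outlier_factor_ attribute, which is useful for ranking candidates rather than hard-filtering. A sketch, assuming data is any (n_samples, n_features) array such as the synthetic one above:

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

clf = LocalOutlierFactor(n_neighbors=20)
labels = clf.fit_predict(data)
scores = clf.negative_outlier_factor_  # close to -1 for inliers, much lower for outliers
worst_idx = np.argsort(scores)[:5]  # indices of the five most outlying points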

Pros:

  • Works with high-dimensional problem statements
  • The distance metric can be swapped for other dissimilarity functions

Cons:

  • Choice of K is a crucial hyperparameter
  • Sensitive to the distance metric
  • Doesn't perform well with varying-density datasets


Isolation Forest

Isolation Forest is an ensemble method for outlier detection that isolates outliers in a dataset using binary decision trees.

Partitioning of anomaly and regular data points. | Image: Satyam Kumar


from sklearn.ensemble import IsolationForest

def isolation_forest_outlier_detection(data, contamination=0.05):
    # Outliers need fewer random splits to isolate, so they end up
    # closer to the root of the trees
    clf = IsolationForest(contamination=contamination)
    outliers = clf.fit_predict(data) == -1  # -1 marks isolated points
    return outliers
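
When a hard contamination fraction is difficult to choose up front, the continuous anomaly score from score_samples can be thresholded instead (a minimal sketch; the 5% cut-off and the reuse of data from the earlier examples are assumptions):

import numpy as np
from sklearn.ensemble import IsolationForest

clf = IsolationForest(random_state=0).fit(data)
scores = clf.score_samples(data)  # lower score = more anomalous
outliers = scores < np.quantile(scores, 0.05)  # flag the lowest-scoring 5%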

Pros:

  • Efficient for high-dimensional datasets.
  • Can handle large datasets with a low computational cost.
  • Invariant to feature scaling.

Cons:

  • Sensitivity to the choice of hyperparameters, such as the contamination parameter.
  • May perform poorly on datasets where anomalies form clusters of similar points.
  • Interpretability of results can be limited.


Conclusion:

In conclusion, outlier detection plays a crucial role in machine learning, aiding in the identification of anomalies and novelties within datasets. The choice of technique depends on the use case and the problem statement.

References:

  1. https://en.wikipedia.org/wiki/Standard_score
  2. https://www.statology.org/modified-z-score/
  3. https://online.stat.psu.edu/stat200/lesson/3/3.2
  4. https://www.sciencedirect.com/topics/engineering/mahalanobis-distance
  5. https://scikit-learn.org/stable/modules/outlier_detection.html
