Outlier Detection with Machine Learning

Outliers are common in real-world datasets in Data Science, and the identification of outliers and anomalies is an integral part of data preprocessing in a Machine Learning pipeline.


Difference between Outliers and Anomalies:

  • Anomalies are values that are unexplainable in nature (they deviate too much from the base distribution or from our assumptions). E.g., in a dataset containing the ages of students at a university, a value of -20 is an anomaly, as age cannot be negative.
  • Outliers are unlikely events or data points significantly different from the other points in the dataset. E.g., in the same university dataset, a student's age could be 87, i.e., an older person.


Univariate Methods

Z Scores

Z Score is the number of standard deviations a point is away from the mean.

Z = \frac{x - \mu}{\sigma}

where Z is the standard score, x is the observed value, \mu is the mean of the training sample, and \sigma is the sample's standard deviation.

This method filters out values outside a specified threshold, typically defined as ±3 standard deviations from the mean.

# Keep only rows within ±3 standard deviations of the column mean
mean = X[col].mean()
std = X[col].std()
X = X[(X[col] >= mean - 3 * std) & (X[col] <= mean + 3 * std)]
Normal distribution with outliers outside of 3 sigma
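
As a quick end-to-end illustration, here is a minimal sketch on synthetic data (the age distribution, seed, and injected value are assumptions for demonstration):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
X = pd.DataFrame({"age": rng.normal(21, 2, size=500)})
X.loc[len(X)] = 87  # inject one implausible value

col = "age"
mean, std = X[col].mean(), X[col].std()
X = X[(X[col] >= mean - 3 * std) & (X[col] <= mean + 3 * std)]
# The injected 87 lies far beyond mean + 3 * std and is dropped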

Pros:

  • Simple to compute and implement
  • Easy to understand and interpret

Cons:

  • As the mean and standard deviation are themselves sensitive to outliers, Z scores are skewed by the very points they are meant to detect and are not robust.
  • Z scores assume the data is normally distributed.


Modified Z Score

Modified Z Score is a more robust alternative to the Z score, as it uses the median and the Median Absolute Deviation (MAD) instead of the mean and standard deviation.

M_i = \frac{0.6745\,(x_i - \tilde{x})}{\mathrm{MAD}}

where x_i is the observed value, \tilde{x} is the sample median, MAD is the median absolute deviation, and 0.6745 is the 75th percentile (upper quartile) of the standard normal distribution.
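
The constant is easy to verify, assuming SciPy is available:

from scipy.stats import norm

print(norm.ppf(0.75))  # ≈ 0.6745, the 75th percentile of the standard normal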

med = X[col].median()
mad_val = (X[col] - med).abs().median()  # raw median absolute deviation
# Note: statsmodels' mad() already rescales by 0.6745, so combining it with
# the 0.6745 factor below would apply the constant twice
X = X[(0.6745 * (X[col] - med) / mad_val).abs() < 3]

Pros:

  • Robust to outliers
  • No assumption of normality

Cons:

  • Computing the median is more expensive than the mean, since it requires sorting (or partial sorting) of the data.


Interquartile Range (IQR) Rule

The IQR rule detects outliers with quartiles, excluding values that fall outside the following fences:

[\,Q_1 - 1.5 \cdot \mathrm{IQR},\; Q_3 + 1.5 \cdot \mathrm{IQR}\,], \quad \mathrm{IQR} = Q_3 - Q_1

where Q1 and Q3 are the first and third quartiles of the distribution.

q1 = X[col].quantile(0.25)
q3 = X[col].quantile(0.75)
iqr = q3 - q1
# Keep rows inside the Tukey fences
X = X[(X[col] >= q1 - 1.5 * iqr) & (X[col] <= q3 + 1.5 * iqr)]
Box plot to visually identify outliers
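
A minimal usage sketch on a hypothetical ages series (the data are an assumption; note that the fences are far less affected by the extreme value than the mean and standard deviation are):

import pandas as pd

ages = pd.Series([21, 22, 23, 20, 24, 22, 87])
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
mask = (ages >= q1 - 1.5 * iqr) & (ages <= q3 + 1.5 * iqr)
print(ages[mask])  # 87 falls above the upper fence and is excluded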

Pros:

  • Robust to outliers
  • Simple to understand

Cons:

  • Its univariate nature might remove data points that are essential in a multivariate context


Multivariate Methods

Mahalanobis Distance

Mahalanobis distance is a multivariate measure of the distance between a point and a distribution.

D_M(x) = \sqrt{(x - \mu)^T \, S^{-1} \, (x - \mu)}

where x is the vector for the datapoint, mu is the mean vector of the distribution, and S^-1 is the inverse of the covariance matrix S.

import numpy as np
from scipy.stats import chi2

# Precompute the mean vector and inverse covariance matrix once
mu = X[continuous_columns].mean().values
inv_cov = np.linalg.inv(np.cov(X[continuous_columns].values.T))

def mahalanobis_distance(x):
    x_minus_mu = x.values - mu
    return np.sqrt(x_minus_mu @ inv_cov @ x_minus_mu)

m_distances = X[continuous_columns].apply(mahalanobis_distance, axis=1)

# Squared Mahalanobis distances of normal data follow a chi-squared
# distribution with df equal to the number of variables
threshold = chi2.ppf(0.99, df=len(continuous_columns))
X = X[m_distances <= np.sqrt(threshold)]

Pros:

  • Accounts for correlations between features
  • Sensitive to the shape of the distribution
  • Works with any dimensionality

Cons:

  • Sensitive to outliers due to the use of the mean and covariance
  • Computationally expensive due to inverting the covariance matrix


Algorithm-Based Methods

One Class SVMs

One Class SVM is an unsupervised outlier detection algorithm that learns a decision boundary to classify data points as similar to or different from the training set.


from sklearn.svm import OneClassSVM

def one_class_svm_outlier_detection(data, nu=0.05):
    # nu upper-bounds the fraction of training points treated as outliers
    clf = OneClassSVM(nu=nu)
    clf.fit(data)
    outliers = clf.predict(data) == -1  # -1 marks points outside the boundary
    return outliers
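
A short usage sketch on synthetic 2-D data (the inlier distribution and the two injected points are assumptions for illustration):

import numpy as np

rng = np.random.default_rng(42)
inliers = rng.normal(0, 1, size=(200, 2))
data = np.vstack([inliers, [[6.0, 6.0], [-7.0, 5.0]]])  # two far-away points

mask = one_class_svm_outlier_detection(data, nu=0.05)
print(data[mask])  # flagged points should include the two injected ones

In practice, features should be standardized first, since the default RBF kernel is sensitive to scale.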

Pros:

  • Works well with high-dimensionality problems
  • Can understand complex relationships

Cons:

  • Collinearity issues with some kernels
  • Sensitive to hyperparameters
  • Cannot handle categorical variables well


Local Outlier Factor (LOF)

Local Outlier Factor is an unsupervised method that computes the deviation of a given data point's local density with respect to its neighbours.

Comparing the local density of a point with the densities of its neighbours. Source: Wikipedia
from sklearn.neighbors import LocalOutlierFactor

def lof_outlier_detection(data, k=20):
    # Compare each point's local density with that of its k nearest neighbours
    clf = LocalOutlierFactor(n_neighbors=k)
    outliers = clf.fit_predict(data) == -1  # -1 marks low-density points
    return outliers
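
Beyond the binary labels, the fitted estimator exposes a per-point score through its negative_outlier_factor_ attribute, which is useful for ranking candidates rather than hard-filtering. A sketch, assuming data is any (n_samples, n_features) array such as the synthetic one above:

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

clf = LocalOutlierFactor(n_neighbors=20)
labels = clf.fit_predict(data)
scores = clf.negative_outlier_factor_  # close to -1 for inliers, much lower for outliers
worst_idx = np.argsort(scores)[:5]  # indices of the five most outlying points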

Pros:

  • Works with high-dimensional problem statements
  • The distance metric can be swapped for other dissimilarity functions

Cons:

  • Choice of K is a crucial hyperparameter
  • Sensitive to the distance metric
  • Doesn't perform well with varying-density datasets


Isolation Forest

Isolation Forest is an ensemble method for outlier detection that isolates outliers in a dataset using binary decision trees.

Partitioning of anomaly and regular data points. | Image: Satyam Kumar


from sklearn.ensemble import IsolationForest

def isolation_forest_outlier_detection(data, contamination=0.05):
    # Outliers need fewer random splits to isolate, so they end up
    # closer to the root of the trees
    clf = IsolationForest(contamination=contamination)
    outliers = clf.fit_predict(data) == -1  # -1 marks isolated points
    return outliers
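
When a hard contamination fraction is difficult to choose up front, the continuous anomaly score from score_samples can be thresholded instead (a minimal sketch; the 5% cut-off and the reuse of data from the earlier examples are assumptions):

import numpy as np
from sklearn.ensemble import IsolationForest

clf = IsolationForest(random_state=0).fit(data)
scores = clf.score_samples(data)  # lower score = more anomalous
outliers = scores < np.quantile(scores, 0.05)  # flag the lowest-scoring 5%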

Pros:

  • Efficient for high-dimensional datasets.
  • Can handle large datasets with a low computational cost.
  • Invariant to feature scaling.

Cons:

  • Sensitivity to the choice of hyperparameters, such as the contamination parameter.
  • May perform poorly on datasets where anomalies form clusters of similar points.
  • Interpretability of results can be limited.


Conclusion:

In conclusion, outlier detection plays a crucial role in machine learning, aiding in the identification of anomalies and novelties within datasets. The choice of technique depends on the use case and the problem statement.

References:

  1. https://en.wikipedia.org/wiki/Standard_score
  2. https://www.statology.org/modified-z-score/
  3. https://online.stat.psu.edu/stat200/lesson/3/3.2
  4. https://www.sciencedirect.com/topics/engineering/mahalanobis-distance
  5. https://scikit-learn.org/stable/modules/outlier_detection.html
