Outlier Detection with Machine Learning
Harikrishna Dev
GenAI at CloudverseAI and Daifuku | UT Dallas | Ex-Flipkart | Ex-Tredence | NITK
Outliers are common in real-world datasets, and identifying outliers and anomalies is an integral part of data preprocessing in a machine learning pipeline.
Difference between Outliers and Anomalies:
Univariate Methods
Z Scores
Z Score is the number of standard deviations a point is away from the mean.
Z = (x - mu) / sigma, where Z is the standard score, x is the observed value, mu is the mean of the training sample, and sigma is the sample's standard deviation.
This method filters out values outside a specified threshold, typically defined as ±3 standard deviations from the mean.
mean = X[col].mean()
std = X[col].std()
X = X[(X[col] >= mean - 3 * std) & (X[col] <= mean + 3 * std)]
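As a rough illustration of the filter above, the sketch below wraps it in a helper and runs it on made-up data. The function name `zscore_filter`, the column name `reading`, and the synthetic values are assumptions for this example only.

```python
import numpy as np
import pandas as pd

def zscore_filter(df: pd.DataFrame, col: str, k: float = 3.0) -> pd.DataFrame:
    """Keep rows whose value in `col` lies within k standard deviations of the mean."""
    z = (df[col] - df[col].mean()) / df[col].std()
    return df[z.abs() <= k]

# Hypothetical data: 50 well-behaved readings plus one extreme value
rng = np.random.default_rng(0)
df = pd.DataFrame({"reading": np.append(rng.normal(10.0, 1.0, 50), 100.0)})

filtered = zscore_filter(df, "reading")
print(len(df), "rows before,", len(filtered), "rows after")
```

Note that the mean and standard deviation are themselves pulled towards the extreme value, which is why this rule can miss outliers in small samples.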
Pros:
Cons:
Modified Z Score
Modified Z Score is a more robust alternative to the Z score, as it uses the median and the Median Absolute Deviation (MAD) instead of the mean and standard deviation.
M = 0.6745 * (x - median) / MAD, where x is the observed value, and 0.6745 is approximately the upper quartile (75th percentile) of a standard normal distribution.
from statsmodels.robust.scale import mad
med = X[col].median()
# c=1 returns the raw MAD; the default c (about 0.6745) would apply the constant twice
mad_val = mad(X[col], c=1)
X = X[(0.6745 * (X[col] - med) / mad_val).abs() < 3]
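To show why the median-based score is considered more robust, here is a small sketch that compares both scores on a hand-made sample. The helper name `modified_z_scores` and the eight values are assumptions for illustration; the MAD is computed directly with pandas rather than with statsmodels.

```python
import pandas as pd

def modified_z_scores(s: pd.Series) -> pd.Series:
    """Modified Z score: 0.6745 * (x - median) / MAD, using the raw (unscaled) MAD."""
    med = s.median()
    mad = (s - med).abs().median()
    return 0.6745 * (s - med) / mad

# Hypothetical sample: seven similar values and one obvious outlier
s = pd.Series([10, 11, 9, 10, 12, 10, 11, 100], dtype=float)

classic_z = (s - s.mean()) / s.std()
mod_z = modified_z_scores(s)

flag_classic = classic_z.abs() > 3   # classic rule with the usual threshold
flag_modified = mod_z.abs() > 3      # same threshold on the modified score
print(flag_classic.sum(), "flagged by Z score,", flag_modified.sum(), "flagged by modified Z score")
```

In this sample the outlier inflates the standard deviation enough that its own Z score stays below 3, while the modified score still flags it. The sketch does not guard against a MAD of zero, which happens when more than half of the values are identical.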
Pros:
Cons:
Interquartile Range (IQR) Rule
The IQR rule detects outliers by excluding values that fall outside the lower and upper bounds given by the following formula.
Lower bound = Q1 - 1.5 * IQR and upper bound = Q3 + 1.5 * IQR, where Q1 and Q3 are the first and third quartiles of the distribution and IQR = Q3 - Q1.
q1 = X[col].quantile(0.25)
q3 = X[col].quantile(0.75)
iqr = q3 - q1
X = X[(X[col] >= q1 - 1.5 * iqr) & (X[col] <= q3 + 1.5 * iqr)]
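The bounds can be checked by hand on a tiny sample. The sketch below is one possible way to package the rule; the helper name `iqr_bounds` and the ten values are assumptions for illustration.

```python
import pandas as pd

def iqr_bounds(s: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) bounds of the IQR rule for a numeric series."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical sample: 1..9 plus one large value
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 50], dtype=float)

lower, upper = iqr_bounds(s)
kept = s[(s >= lower) & (s <= upper)]
print(lower, upper, len(kept))
```

With pandas' default linear interpolation, Q1 = 3.25 and Q3 = 7.75 here, so the bounds are -3.5 and 14.5 and only the value 50 is dropped.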
Pros:
Cons:
Multivariate Methods
Mahalanobis Distance
Mahalanobis distance is a multivariate measure of the distance between a point and a distribution.
D(x) = sqrt((x - mu)^T S^-1 (x - mu)), where x is the vector for the datapoint, mu is the mean vector of the distribution, and S^-1 is the inverse of the covariance matrix S.
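import numpy as np
from scipy.stats import chi2

alpha = 0.01  # assumed significance level; the original read this from self.threshold
values = X[continuous_columns].values
mu = values.mean(axis=0)  # column-wise mean vector, not the mean of a single row
inv_cov = np.linalg.inv(np.cov(values.T))  # computed once instead of once per row

def mahalanobis_distance(x):
    x_minus_mu = x - mu
    return np.sqrt(x_minus_mu @ inv_cov @ x_minus_mu)

m_distances = X[continuous_columns].apply(mahalanobis_distance, axis=1, raw=True)
# Squared distances are approximately chi-squared with len(continuous_columns) degrees of freedom
threshold = chi2.ppf(1 - alpha, len(continuous_columns))
X = X[m_distances <= np.sqrt(threshold)]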
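For a self-contained view of the same idea, the following sketch computes all squared distances at once and compares them with a chi-squared cutoff. The function name `mahalanobis_mask`, the significance level of 0.01, and the synthetic correlated data are assumptions made for this example.

```python
import numpy as np
from scipy.stats import chi2

def mahalanobis_mask(data: np.ndarray, alpha: float = 0.01) -> np.ndarray:
    """Return True for rows whose squared Mahalanobis distance exceeds the chi-squared cutoff."""
    mu = data.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(data, rowvar=False))
    diff = data - mu
    # Row-wise quadratic form: diff_i^T * inv_cov * diff_i
    d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)
    return d2 > chi2.ppf(1 - alpha, df=data.shape[1])

# Hypothetical data: strongly correlated features plus one point that breaks the correlation
rng = np.random.default_rng(0)
inliers = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.9], [0.9, 1.0]], size=200)
data = np.vstack([inliers, [[2.5, -2.5]]])

outlier_mask = mahalanobis_mask(data)
print(outlier_mask[-1], outlier_mask.sum())
```

The appended point is unremarkable in each feature on its own, which a univariate rule would likely keep, but it lies far from the data once the correlation between the two features is taken into account.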
Pros:
Cons:
Algorithm-Based Methods
One Class SVMs
One Class SVM is an unsupervised outlier detection algorithm that learns a decision boundary around the training data and classifies data points as similar to or different from the training set.
from sklearn.svm import OneClassSVM

def one_class_svm_outlier_detection(data, nu=0.05):
    clf = OneClassSVM(nu=nu)
    clf.fit(data)
    # predict returns -1 for outliers and 1 for inliers
    outliers = clf.predict(data) == -1
    return outliers
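Because the boundary is learned from a training set, the model can also score points it has not seen. The sketch below shows that usage under assumed conditions: the training data is a synthetic Gaussian cloud, and the two new points and all variable names are made up for illustration.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Hypothetical training set: 200 points around the origin
rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(200, 2))

# Hypothetical new observations far away from the training cloud
new_points = np.array([[8.0, 8.0], [-9.0, 7.0]])

clf = OneClassSVM(nu=0.05, kernel="rbf", gamma="scale").fit(train)

new_is_outlier = clf.predict(new_points) == -1       # -1 means "different from training data"
train_inlier_rate = (clf.predict(train) == 1).mean()  # share of training points inside the boundary
print(new_is_outlier, round(train_inlier_rate, 2))
```

The `nu` parameter roughly controls the share of training points that are allowed to fall outside the boundary, so it is worth tuning to the expected outlier rate.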
Pros:
Cons:
Local Outlier Factor (LOF)
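Local Outlier Factor is an unsupervised method that measures how much the local density of a given data point deviates from that of its neighbours. Points in substantially sparser regions than their neighbours are treated as outliers.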
from sklearn.neighbors import LocalOutlierFactor

def lof_outlier_detection(data, k=20):
    clf = LocalOutlierFactor(n_neighbors=k)
    # fit_predict returns -1 for outliers and 1 for inliers
    outliers = clf.fit_predict(data)
    outliers = outliers == -1
    return outliers
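A quick way to see the density idea at work is to add one isolated point to a dense cloud. This is a minimal sketch on synthetic data; the data and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Hypothetical data: a dense cloud plus one isolated point appended last
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)), [[8.0, 8.0]]])

lof = LocalOutlierFactor(n_neighbors=20)
is_outlier = lof.fit_predict(data) == -1

# scikit-learn stores the negated LOF; flip the sign so larger means more anomalous
scores = -lof.negative_outlier_factor_
print(is_outlier[-1], round(scores[-1], 2))
```

Scores close to 1 indicate a point whose density matches its neighbours, while the isolated point receives a much larger score.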
Pros:
Cons:
Isolation Forest
Isolation Forest is an ensemble method for outlier detection that isolates points using randomly built binary decision trees. Outliers tend to be separated from the rest of the data in fewer splits than regular points.
from sklearn.ensemble import IsolationForest

def isolation_forest_outlier_detection(data, contamination=0.05):
    clf = IsolationForest(contamination=contamination)
    # fit_predict returns -1 for outliers and 1 for inliers
    outliers = clf.fit_predict(data)
    outliers = outliers == -1
    return outliers
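The same synthetic setup can be used to try the forest. In this sketch, the data, the variable names, and the fixed `random_state` (added only to make the run repeatable) are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical data: a dense cloud plus one isolated point appended last
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)), [[8.0, 8.0]]])

forest = IsolationForest(contamination=0.05, random_state=0)
is_outlier = forest.fit_predict(data) == -1
print(is_outlier[-1], is_outlier.sum())
```

Since `contamination` sets the share of points to flag, roughly 5% of the rows are marked here even though only one point was planted, so the value should reflect the expected outlier rate in the data.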
Pros:
Cons:
Conclusion:
In conclusion, outlier detection plays a crucial role in machine learning, helping to identify anomalies and novelties within datasets. The choice of technique depends on the use case and problem statement.