Why Should We Treat Outliers with Nearest Neighbors and Local Outlier Factor?

Local Outlier Factor (LOF) is an unsupervised outlier detection algorithm. It computes the local density deviation of a given data point with respect to its neighbors. It was proposed by Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, and Jörg Sander in 2000.

The points that lie far from the dense clusters are outliers. To identify them, Local Outlier Factor uses the k-distance, i.e., the distance from a point to its k-th nearest neighbor.

In this article, I will skip the theory and formulas and go straight to coding Local Outlier Factor.

For more detailed information, see the resource links at the end.

Here’s how the Local Outlier Factor process works:

First, a dataset is selected and fitted to LocalOutlierFactor from scikit-learn.
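For illustration, here is a minimal sketch of that first step. The dataset is synthetic and the parameters are just the scikit-learn defaults; swap in your own data and tuning.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Synthetic data: two dense clusters plus a few far-away points.
rng = np.random.RandomState(42)
clusters = np.concatenate([rng.normal(0, 0.5, (100, 2)),
                           rng.normal(5, 0.5, (100, 2))])
far_points = rng.uniform(-6, 12, (10, 2))
X = np.concatenate([clusters, far_points])

# n_neighbors=20 is the scikit-learn default; tune it for your data.
lof = LocalOutlierFactor(n_neighbors=20)
lof.fit_predict(X)  # returns -1 for outliers, 1 for inliers
```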

Then the scores are obtained through the negative_outlier_factor_ attribute. The scores are sorted and inspected.
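Continuing from the snippet above, a sketch of inspecting the scores:

```python
# Scores are negative: the lower (more negative), the more anomalous.
scores = lof.negative_outlier_factor_

# Sort ascending and inspect the most extreme values.
sorted_scores = np.sort(scores)
print(sorted_scores[:15])
```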

Then a threshold value is chosen manually by inspecting the sorted scores. The observation sitting at that threshold is then assigned to all outliers.
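And a sketch of the thresholding step. The cutoff index here is an arbitrary, illustrative choice; in practice you would pick it by eyeballing the sorted scores.

```python
# Treat the 10th most extreme score as the cutoff (illustrative choice).
threshold_score = sorted_scores[9]

# Points with scores below the cutoff are flagged as outliers.
outlier_mask = scores < threshold_score

# Standard suppression: overwrite every outlier row with the
# observation sitting exactly at the threshold.
threshold_row = X[scores == threshold_score][0]
X_suppressed = X.copy()
X_suppressed[outlier_mask] = threshold_row
```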

That’s all.

It’s actually a simple process, but a closer look raises the following question: why do we assign the same value (the threshold observation) to all outliers?

Does it make sense to assign a distant observation’s values to an outlier A that sits close to a cluster?

Wouldn’t it be better to find the nearest neighbor of each outlier and assign that neighbor’s values to it instead?

That’s what we’re going to do today.

You can assign the nearest neighbor’s values to the outliers in the Local Outlier Factor process by following the code snippets below step by step.

Enjoy!
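The original post embedded its code as screenshots, which don’t survive here, so what follows is a hedged reconstruction built on the illustrative X, lof, scores, and outlier_mask from the snippets above. First, split the data:

```python
# Separate the detected outliers from the normal observations.
inliers = X[~outlier_mask]
outliers = X[outlier_mask]
print(f"{outlier_mask.sum()} outliers detected out of {len(X)} rows")
```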

Now we come to the point. Before going into the code, I need to say a little about scikit-learn’s NearestNeighbors.

The principle behind nearest neighbor methods is to find a predefined number of training samples closest in distance to the new point, and predict the label from these. The number of samples can be a user-defined constant (k-nearest neighbor learning), or vary based on the local density of points (radius-based neighbor learning). The distance can, in general, be any metric measure: standard Euclidean distance is the most common choice. Neighbors-based methods are known as non-generalizing machine learning methods, since they simply “remember” all of their training data (possibly transformed into a fast indexing structure such as a Ball Tree or KD Tree).
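With that, here is a sketch of the final step: for each outlier, query its single nearest inlier and copy that neighbor’s values over. This continues the snippets above and is my reconstruction, not the author’s exact code.

```python
from sklearn.neighbors import NearestNeighbors

# Fit on the inliers only, so each outlier is matched to its closest
# *normal* observation rather than to another outlier.
nn = NearestNeighbors(n_neighbors=1).fit(inliers)

# Indices (into `inliers`) of each outlier's single nearest neighbor.
distances, indices = nn.kneighbors(outliers)

# Assign the nearest neighbor's values to the outlier rows.
X_repaired = X.copy()
X_repaired[outlier_mask] = inliers[indices[:, 0]]
```

Fitting only on the inliers is a design choice: fitting on all of X would risk mapping an outlier to another nearby outlier instead of a normal observation.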

[ Unfortunately, since LinkedIn does not offer proper code sharing, I leave the gist link below and also add the code as a snippet. You can use whichever you prefer. ]

Bunyamin Ergen

# Resources

https://scikit-learn.org/stable/modules/neighbors.html

https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.LocalOutlierFactor.html

https://en.wikipedia.org/wiki/Local_outlier_factor

https://www.veribilimiokulu.com/local-outlier-factor-ile-anormallik-tespiti/

https://towardsdatascience.com/local-outlier-factor-lof-algorithm-for-outlier-identification-8efb887d9843
