Why Should We Treat Outliers with Nearest Neighbors and Local Outlier Factor?

Local Outlier Factor (LOF) is an unsupervised outlier detection algorithm. It computes the local density deviation of a given data point with respect to its neighbors. It was proposed by Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, and Jörg Sander in 2000.

The points that lie far from the dense clusters are outliers. To identify them, Local Outlier Factor uses the k-distance, i.e., the distance from a point to its k-th nearest neighbor.

In this article, I will skip the theory and formulas and go straight to coding Local Outlier Factor.

For more detailed information, see the resource links at the end.

Here’s how the Local Outlier Factor process works:

First, a dataset is selected and fitted to LocalOutlierFactor from scikit-learn.
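For illustration, here is a minimal sketch of that first step. The dataset is synthetic and the parameters are just the scikit-learn defaults; swap in your own data and tuning.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Synthetic data: two dense clusters plus a few far-away points.
rng = np.random.RandomState(42)
clusters = np.concatenate([rng.normal(0, 0.5, (100, 2)),
                           rng.normal(5, 0.5, (100, 2))])
far_points = rng.uniform(-6, 12, (10, 2))
X = np.concatenate([clusters, far_points])

# n_neighbors=20 is the scikit-learn default; tune it for your data.
lof = LocalOutlierFactor(n_neighbors=20)
lof.fit_predict(X)  # returns -1 for outliers, 1 for inliers
```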

Then the scores are obtained through the negative_outlier_factor_ attribute. The scores are sorted and inspected.
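Continuing from the snippet above, a sketch of inspecting the scores:

```python
# Scores are negative: the lower (more negative), the more anomalous.
scores = lof.negative_outlier_factor_

# Sort ascending and inspect the most extreme values.
sorted_scores = np.sort(scores)
print(sorted_scores[:15])
```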

Then a threshold value is chosen manually by inspecting the sorted scores. The observation sitting at that threshold is then assigned to all outliers.
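And a sketch of the thresholding step. The cutoff index here is an arbitrary, illustrative choice; in practice you would pick it by eyeballing the sorted scores.

```python
# Treat the 10th most extreme score as the cutoff (illustrative choice).
threshold_score = sorted_scores[9]

# Points with scores below the cutoff are flagged as outliers.
outlier_mask = scores < threshold_score

# Standard suppression: overwrite every outlier row with the
# observation sitting exactly at the threshold.
threshold_row = X[scores == threshold_score][0]
X_suppressed = X.copy()
X_suppressed[outlier_mask] = threshold_row
```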

That’s all.

It’s actually a simple process, but a closer look raises the following question: why do we assign the same value (the threshold observation) to all outliers?

Does it make sense to assign a distant observation’s values to an outlier A that sits close to a cluster?

Wouldn’t it be better to find the nearest neighbor of each outlier and assign that neighbor’s values to it instead?

That’s what we’re going to do today.

You can assign the nearest neighbor’s values to the outliers in the Local Outlier Factor process by following the code snippets below step by step.

Enjoy!
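The original post embedded its code as screenshots, which don’t survive here, so what follows is a hedged reconstruction built on the illustrative X, lof, scores, and outlier_mask from the snippets above. First, split the data:

```python
# Separate the detected outliers from the normal observations.
inliers = X[~outlier_mask]
outliers = X[outlier_mask]
print(f"{outlier_mask.sum()} outliers detected out of {len(X)} rows")
```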

Now we come to the point. Before going into the code, I need to say a little about scikit-learn’s NearestNeighbors.

The principle behind nearest neighbor methods is to find a predefined number of training samples closest in distance to the new point, and predict the label from these. The number of samples can be a user-defined constant (k-nearest neighbor learning), or vary based on the local density of points (radius-based neighbor learning). The distance can, in general, be any metric measure: standard Euclidean distance is the most common choice. Neighbors-based methods are known as non-generalizing machine learning methods, since they simply “remember” all of their training data (possibly transformed into a fast indexing structure such as a Ball Tree or KD Tree).
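With that, here is a sketch of the final step: for each outlier, query its single nearest inlier and copy that neighbor’s values over. This continues the snippets above and is my reconstruction, not the author’s exact code.

```python
from sklearn.neighbors import NearestNeighbors

# Fit on the inliers only, so each outlier is matched to its closest
# *normal* observation rather than to another outlier.
nn = NearestNeighbors(n_neighbors=1).fit(inliers)

# Indices (into `inliers`) of each outlier's single nearest neighbor.
distances, indices = nn.kneighbors(outliers)

# Assign the nearest neighbor's values to the outlier rows.
X_repaired = X.copy()
X_repaired[outlier_mask] = inliers[indices[:, 0]]
```

Fitting only on the inliers is a design choice: fitting on all of X would risk mapping an outlier to another nearby outlier instead of a normal observation.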

[ Unfortunately, since LinkedIn does not offer proper code sharing, I leave the gist link below and also add the code as a snippet. You can use whichever you prefer. ]

Bunyamin Ergen

# Resources

https://scikit-learn.org/stable/modules/neighbors.html

https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.LocalOutlierFactor.html

https://en.wikipedia.org/wiki/Local_outlier_factor

https://www.veribilimiokulu.com/local-outlier-factor-ile-anormallik-tespiti/

https://towardsdatascience.com/local-outlier-factor-lof-algorithm-for-outlier-identification-8efb887d9843
