What should we do when we have plenty of data for normal (nominal) behaviour, but negligible or zero examples of the anomalies we must detect?


From a technical perspective, this is commonly known as the anomaly detection problem. Typical business applications include fraud detection, detection of faulty parts in manufacturing, and monitoring for unusual behaviour of machines in data centers.

Since the problem above has essentially zero anomaly data, this unsupervised anomaly detection problem is treated as an outlier detection problem. There are several families of approaches (a short code sketch follows the list):

1. Density Based

  • RKDE: Robust Kernel Density Estimation
  • DBSCAN: Density-Based Spatial Clustering of Applications with Noise
  • EGMM: Ensemble Gaussian Mixture Model

2. Neighbour or Distance Based

  • LOF: Local Outlier Factor
  • ABOD: kNN Angle-Based Outlier Detector
  • ORCA: Outlier detection and Robust Clustering

3. Quantile Based

  • OCSVM: One-class SVM
  • SVDD: Support Vector Data Description

4. Projection Based

  • IFOR: Isolation Forest
  • LODA: Lightweight Online Detector of Anomalies
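Most of these families have off-the-shelf implementations. Below is a minimal sketch (assuming scikit-learn is available; RKDE, EGMM, ORCA, and LODA are not shipped with it) that fits one detector per family on nominal-only data and ranks new points by anomaly score:

```python
import numpy as np
from sklearn.ensemble import IsolationForest          # projection based (IFOR)
from sklearn.neighbors import LocalOutlierFactor      # neighbour/distance based (LOF)
from sklearn.svm import OneClassSVM                   # quantile based (OCSVM)

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 5))                    # nominal-only training data
X_test = np.vstack([rng.normal(size=(95, 5)),           # mostly nominal points...
                    rng.normal(loc=6.0, size=(5, 5))])  # ...plus a few anomalies

detectors = {
    "IFOR":  IsolationForest(random_state=0).fit(X_train),
    "OCSVM": OneClassSVM(nu=0.05, gamma="scale").fit(X_train),
    "LOF":   LocalOutlierFactor(novelty=True).fit(X_train),  # novelty=True allows scoring unseen data
}

for name, det in detectors.items():
    scores = -det.score_samples(X_test)  # scikit-learn: higher score_samples = more normal, so negate
    print(name, "three most anomalous test indices:", np.argsort(scores)[-3:])
```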

The obvious question is: how do we choose between these algorithms, and how can they be compared?

Usually, for imbalanced datasets, we compute a confusion matrix and then the ROC curve (true positive rate vs. false positive rate), plotting one point per increasing threshold and taking the AUC (area under the curve) to compare different algorithms. However, on a severely imbalanced dataset this measure is highly sensitive to small variations in the few outlier points.
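As a concrete illustration (with hypothetical labels and scores), the curve is traced by sweeping the decision threshold over the anomaly scores, recording one (FPR, TPR) point per threshold:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
y_true = np.array([0] * 95 + [1] * 5)               # severely imbalanced: 5% anomalies
scores = np.concatenate([rng.normal(0.0, 1.0, 95),  # nominal scores
                         rng.normal(2.0, 1.0, 5)])  # anomaly scores, shifted upward

fpr, tpr, thresholds = roc_curve(y_true, scores)    # one (FPR, TPR) point per threshold
print("AUC:", roc_auc_score(y_true, scores))
```

Note how few points the five anomalies contribute here: relabeling or perturbing just one of them visibly moves the curve, which is the instability described above.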

If you are choosing an algorithm for your own application, precision or recall at K may be a better evaluation criterion. Emmott et al. [4] have carried out a large study with varying factors and hypothesis testing, and concluded on the following criteria as appropriate for comparisons at that scale. As a basis for benchmarking different anomaly detection algorithms, they use AUC (area under the ROC curve), read as a probability: the chance that a randomly chosen nominal point is ranked below a randomly chosen anomaly, reported through the transformed value log(AUC / (1 − AUC)); and AP (average precision: area under the precision-recall curve), reported through the transformed value log(AP / E[AP]) = log(LIFT).
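In code, both transformed criteria can be computed from standard metrics on a labeled benchmark. One common assumption (used below, consistent with treating E[AP] as the score of a random ranking) is that E[AP] approximately equals the anomaly base rate:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def emmott_criteria(y_true, scores):
    """Transformed AUC and AP in the spirit of Emmott et al. [4]."""
    auc = roc_auc_score(y_true, scores)
    ap = average_precision_score(y_true, scores)
    base_rate = np.mean(y_true)            # E[AP] of a random ranking, approximately
    logit_auc = np.log(auc / (1.0 - auc))  # log(AUC / (1 - AUC))
    log_lift = np.log(ap / base_rate)      # log(AP / E[AP]) = log(LIFT)
    return logit_auc, log_lift
```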

To design the experiment comparing different algorithms, Emmott et al. [4] performed an ANOVA over benchmark datasets characterized by rf (relative frequency of anomalies), pd (point difficulty), cl (normalized clusteredness), ir (irrelevant features), the "mother" set each benchmark is derived from, and the algorithm. Varying only the algorithm factor, they obtain the results shown in the figure below.


Figure 1: Taken from a presentation by Dietterich et al. [3] on anomaly detection. Isolation Forest performs best on both evaluation criteria for anomaly detection with very few anomaly samples, in the experiments performed by Emmott et al. [4].

Isolation Forest (iForest) is regarded as the state of the art in anomaly detection. The idea is that data points are recursively split, just as in random trees, so that the leaves are the most isolated points; if a point is an anomaly, it gets isolated quickly, i.e. at a lower depth. The advantages of iForest are that it scales well to higher-dimensional data, has linear time complexity, and has a low memory requirement.

The Isolation Forest algorithm, explained in three steps (a from-scratch sketch follows the steps):

1. Build the tree: the inputs are a sample of the data, the current tree height, and the maximum depth; the output is the built tree.

2. Build the forest from the trees: the inputs are the data, the number of trees, and the sampling size (the amount of data fed to each tree); the output is a forest (several trees fitted to the data).

3. Find the path length: the inputs are an instance, an isolation tree, and the current path length (initially zero); the output is the path length (the number of nodes crossed before the instance gets isolated).
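A from-scratch sketch of these three steps (a toy illustration following the pseudocode in Liu et al. [7], not a reference implementation) might look like this:

```python
import math
import random

def c(n):
    """Average path length of an unsuccessful BST search; normalizes depths (Liu et al. [7])."""
    if n <= 1:
        return 0.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n

def build_itree(X, depth, max_depth):
    """Step 1: recursively split on a random feature at a random value."""
    if depth >= max_depth or len(X) <= 1:
        return {"size": len(X)}                      # external (leaf) node
    q = random.randrange(len(X[0]))                  # random split feature
    lo, hi = min(x[q] for x in X), max(x[q] for x in X)
    if lo == hi:
        return {"size": len(X)}
    p = random.uniform(lo, hi)                       # random split value
    return {"q": q, "p": p,
            "left":  build_itree([x for x in X if x[q] < p],  depth + 1, max_depth),
            "right": build_itree([x for x in X if x[q] >= p], depth + 1, max_depth)}

def build_iforest(X, n_trees=100, psi=256):
    """Step 2: fit n_trees trees, each on a random sub-sample of size psi."""
    psi = min(psi, len(X))
    max_depth = math.ceil(math.log2(psi))            # trees need not grow past ~log2(psi)
    return [build_itree(random.sample(X, psi), 0, max_depth) for _ in range(n_trees)], psi

def path_length(x, node, depth=0):
    """Step 3: nodes crossed before x is isolated, plus c(size) for the unbuilt subtree."""
    if "q" not in node:
        return depth + c(node["size"])
    child = node["left"] if x[node["q"]] < node["p"] else node["right"]
    return path_length(x, child, depth + 1)

def anomaly_score(x, forest, psi):
    """s(x, psi) = 2^(-E[h(x)] / c(psi)); scores near 1 indicate anomalies (Liu et al. [7])."""
    mean_depth = sum(path_length(x, tree) for tree in forest) / len(forest)
    return 2.0 ** (-mean_depth / c(psi))
```

Usage is then forest, psi = build_iforest(points) followed by anomaly_score(point, forest, psi): anomalies, being easier to isolate, end up with shorter average paths and scores close to 1.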

Important points to note while using Isolation Forests include tuning difficulties and limitations (original paper: Liu et al. [7]).

1. Although it is regarded as the most computationally efficient choice for higher-dimensional data, it suffers from the curse of dimensionality like any other algorithm. Let b be the maximum size of a node in bytes and t the number of trees; a working anomaly detection model is then estimated at less than 511 × t × b bytes (with the default sub-sampling size ψ = 256, a tree over ψ external nodes has at most 2ψ − 1 = 511 nodes), which is trivial on modern computing equipment: with t = 100 trees and b = 32 bytes, the bound is under 1.6 MB. iForest has time complexity O(t · ψ · log ψ) in the training stage and O(n · t · log ψ) in the evaluating stage, where ψ is the sub-sampling size and n is the testing data size.

2. iForest achieves high anomaly detection performance, with high efficiency, at a very small sub-sampling size, unlike other algorithms whose performance improves as the sampling size grows.

3. To avoid swamping (labelling normal instances as anomalies) and masking (too many anomalies concealing their own presence), it is important to tune two parameters: the number of sub-samples used to grow each tree and the number of trees (see the sketch below).
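A minimal tuning sketch with scikit-learn's IsolationForest (parameter names as in scikit-learn; the values below match the paper's recommended ψ = 256 and t = 100):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 5))  # nominal-only data

iforest = IsolationForest(
    max_samples=256,     # psi: sub-sample size used to grow each tree
    n_estimators=100,    # t: number of trees
    contamination=0.01,  # assumed anomaly rate; sets the score threshold used by predict()
    random_state=0,
).fit(X_train)

labels = iforest.predict(X_train)     # +1 = nominal, -1 = flagged as anomaly
```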

The above solution is valid when we have zero anomaly samples and only nominal ones. Now suppose we are working on a business case study alongside a domain expert who can help us confirm anomalies, and we want to make our algorithm more robust to them. For that setting, Das et al. [1] have proposed the Active Anomaly Discovery algorithm, which essentially re-weights the projections of LODA (Lightweight Online Detector of Anomalies), making the nominal and anomaly points more separable via a modified accuracy-at-the-top objective.
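For context, here is a minimal LODA-style scorer (a sketch of the base detector only, not Das et al.'s feedback-driven re-weighting; sparse random projections with one histogram density estimator each, following the general LODA recipe):

```python
import numpy as np

def fit_loda(X, k=100, n_bins=50, seed=0):
    """Fit k sparse random projections, each with a 1-D histogram density estimate."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    nnz = max(1, int(np.sqrt(d)))          # ~sqrt(d) non-zero entries per projection
    W = np.zeros((k, d))
    hists = []
    for i in range(k):
        idx = rng.choice(d, size=nnz, replace=False)
        W[i, idx] = rng.normal(size=nnz)
        counts, edges = np.histogram(X @ W[i], bins=n_bins, density=True)
        hists.append((counts, edges))
    return W, hists

def loda_score(x, W, hists, eps=1e-12):
    """Mean negative log density across projections; higher = more anomalous."""
    total = 0.0
    for w, (counts, edges) in zip(W, hists):
        z = x @ w
        j = int(np.clip(np.searchsorted(edges, z) - 1, 0, len(counts) - 1))
        total -= np.log(counts[j] + eps)   # out-of-range values fall in edge bins (a simplification)
    return total / len(W)
```

Das et al.'s contribution on top of this base detector is to replace the uniform average over projections with weights learned from the expert's feedback, so projections that separate the confirmed anomalies get emphasized.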

Many of the above algorithms may flag a point as an anomaly but say nothing about why it is an anomaly/outlier. The research community has made attempts in this direction: Siddiqui et al. [9] proposed SFE (Sequential Feature Explanation) to convey the reasons behind flagging a point, in a form friendly to a human analyst. Interactive outlier detection software has been built on top of SFE; one example is Quirk, proposed by Mokoena et al. [8].
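As a toy illustration of the idea (a greedy sketch only, not Siddiqui et al.'s actual algorithm; imputing hidden features with the training mean is an assumption made for illustration), features can be revealed one at a time in the order that keeps the instance looking most anomalous:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def greedy_feature_explanation(x, X_train, model):
    """Order features by how strongly revealing each one raises the anomaly score."""
    mean = X_train.mean(axis=0)
    revealed = np.zeros(len(x), dtype=bool)
    order = []
    for _ in range(len(x)):
        best_j, best_score = None, -np.inf
        for j in np.where(~revealed)[0]:
            trial = np.where(revealed, x, mean)  # hidden features imputed by the mean
            trial[j] = x[j]                      # ...with candidate feature j revealed
            score = -model.score_samples(trial.reshape(1, -1))[0]  # higher = more anomalous
            if score > best_score:
                best_j, best_score = j, score
        revealed[best_j] = True
        order.append(int(best_j))
    return order  # the first few features are the strongest evidence to show an analyst
```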

References

[1] Shubhomoy Das et al. "Incorporating Expert Feedback into Active Anomaly Discovery". In: 2016 IEEE 16th International Conference on Data Mining (ICDM). ISSN: 2374-8486. Dec. 2016, pp. 853–858. doi: 10.1109/ICDM.2016.0102.

[2] Jacob Devlin et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". In: arXiv:1810.04805 [cs] (May 2019). url: https://arxiv.org/abs/1810.04805 (visited on 10/11/2021).

[3] Tom Dietterich et al. "Anomaly Detection: Principles, Benchmarking, Explanation, and Theory". Presentation slides, p. 57.

[4] Andrew Emmott et al. "A Meta-Analysis of the Anomaly Detection Problem". In: arXiv:1503.01158 [cs, stat] (Aug. 2016). url: https://arxiv.org/abs/1503.01158

[5] Yaru Hao et al. "Self-Attention Attribution: Interpreting Information Interactions Inside Transformer". In: arXiv:2004.11207 [cs] (Feb. 2021). url: https://arxiv.org/abs/2004.11207

[6] Goro Kobayashi et al. "Attention is Not Only a Weight: Analyzing Transformers with Vector Norms". In: arXiv:2004.10102 [cs] (Oct. 2020). url: https://arxiv.org/abs/2004.10102

[7] Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. "Isolation Forest". In: 2008 Eighth IEEE International Conference on Data Mining. Pisa, Italy: IEEE, Dec. 2008, pp. 413–422. isbn: 978-0-7695-3502-9. doi: 10.1109/ICDM.2008.17. url: https://ieeexplore.ieee.org/document/4781136/

[8] Tshepiso Mokoena et al. "Bringing sequential feature explanations to life". In: 2017 IEEE AFRICON. Cape Town: IEEE, Sept. 2017, pp. 59–64. isbn: 978-1-5386-2775-4. doi: 10.1109/AFRCON.2017.8095456. url: https://ieeexplore.ieee.org/document/8095456/

[9] Md Amran Siddiqui et al. "Sequential Feature Explanations for Anomaly Detection". In: arXiv:1503.00038 [cs, stat] (Feb. 2015). url: https://arxiv.org/abs/1503.00038

[10] Songqiao Han et al. "ADBench: Anomaly Detection Benchmark". In: arXiv:2206.09426 [cs] (2022). url: https://arxiv.org/abs/2206.09426
