What should we do when we have plenty of data for normal (nominal) behaviour, but negligible or zero examples of the anomalies we must detect?


From a technical perspective, this is commonly known as the anomaly detection problem. Typical business applications include fraud detection, detection of faulty parts in manufacturing, and monitoring for unusual behaviour of machines in data centers.

Since the problem above has essentially zero anomaly data, this unsupervised anomaly detection problem is treated as an outlier detection problem. There are several families of approaches (a short code sketch follows the list):

1. Density Based

  • RKDE: Robust Kernel Density Estimation
  • DBSCAN: Density-Based Spatial Clustering of Applications with Noise
  • EGMM: Ensemble Gaussian Mixture Model

2. Neighbour or Distance Based

  • LOF: Local Outlier Factor
  • ABOD: kNN Angle-Based Outlier Detector
  • ORCA: Outlier detection and Robust Clustering

3. Quantile Based

  • OCSVM: One-class SVM
  • SVDD: Support Vector Data Description

4. Projection Based

  • IFOR: Isolation Forest
  • LODA: Lightweight Online Detector of Anomalies
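Most of these families have off-the-shelf implementations. Below is a minimal sketch (assuming scikit-learn is available; RKDE, EGMM, ORCA, and LODA are not shipped with it) that fits one detector per family on nominal-only data and ranks new points by anomaly score:

```python
import numpy as np
from sklearn.ensemble import IsolationForest          # projection based (IFOR)
from sklearn.neighbors import LocalOutlierFactor      # neighbour/distance based (LOF)
from sklearn.svm import OneClassSVM                   # quantile based (OCSVM)

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 5))                    # nominal-only training data
X_test = np.vstack([rng.normal(size=(95, 5)),           # mostly nominal points...
                    rng.normal(loc=6.0, size=(5, 5))])  # ...plus a few anomalies

detectors = {
    "IFOR":  IsolationForest(random_state=0).fit(X_train),
    "OCSVM": OneClassSVM(nu=0.05, gamma="scale").fit(X_train),
    "LOF":   LocalOutlierFactor(novelty=True).fit(X_train),  # novelty=True allows scoring unseen data
}

for name, det in detectors.items():
    scores = -det.score_samples(X_test)  # scikit-learn: higher score_samples = more normal, so negate
    print(name, "three most anomalous test indices:", np.argsort(scores)[-3:])
```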

The obvious question is: how do we choose between these algorithms, and how can they be compared?

Usually, for imbalanced datasets, we compute a confusion matrix and then the ROC curve (true positive rate vs. false positive rate), plotting one point per increasing threshold and taking the AUC (area under the curve) to compare different algorithms. However, on a severely imbalanced dataset this measure is highly sensitive to small variations in the few outlier points.
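As a concrete illustration (with hypothetical labels and scores), the curve is traced by sweeping the decision threshold over the anomaly scores, recording one (FPR, TPR) point per threshold:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
y_true = np.array([0] * 95 + [1] * 5)               # severely imbalanced: 5% anomalies
scores = np.concatenate([rng.normal(0.0, 1.0, 95),  # nominal scores
                         rng.normal(2.0, 1.0, 5)])  # anomaly scores, shifted upward

fpr, tpr, thresholds = roc_curve(y_true, scores)    # one (FPR, TPR) point per threshold
print("AUC:", roc_auc_score(y_true, scores))
```

Note how few points the five anomalies contribute here: relabeling or perturbing just one of them visibly moves the curve, which is the instability described above.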

If you are choosing an algorithm for your own application, precision or recall at K may be a better evaluation criterion. Emmott et al. [4] have carried out a large study with varying factors and hypothesis testing, and concluded on the following criteria as appropriate for comparisons at that scale. As a basis for benchmarking different anomaly detection algorithms, they use AUC (area under the ROC curve), read as a probability: the chance that a randomly chosen nominal point is ranked below a randomly chosen anomaly, reported through the transformed value log(AUC / (1 − AUC)); and AP (average precision: area under the precision-recall curve), reported through the transformed value log(AP / E[AP]) = log(LIFT).
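In code, both transformed criteria can be computed from standard metrics on a labeled benchmark. One common assumption (used below, consistent with treating E[AP] as the score of a random ranking) is that E[AP] approximately equals the anomaly base rate:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def emmott_criteria(y_true, scores):
    """Transformed AUC and AP in the spirit of Emmott et al. [4]."""
    auc = roc_auc_score(y_true, scores)
    ap = average_precision_score(y_true, scores)
    base_rate = np.mean(y_true)            # E[AP] of a random ranking, approximately
    logit_auc = np.log(auc / (1.0 - auc))  # log(AUC / (1 - AUC))
    log_lift = np.log(ap / base_rate)      # log(AP / E[AP]) = log(LIFT)
    return logit_auc, log_lift
```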

To design the experiment comparing different algorithms, Emmott et al. [4] performed an ANOVA over benchmark datasets characterized by rf (relative frequency of anomalies), pd (point difficulty), cl (normalized clusteredness), ir (irrelevant features), the "mother" set each benchmark is derived from, and the algorithm. Varying only the algorithm factor, they obtain the results shown in the figure below.


Figure 1: Taken from a presentation by Dietterich et al. [3] on anomaly detection. Isolation Forest performs best on both evaluation criteria for anomaly detection with very few anomaly samples, in the experiments performed by Emmott et al. [4].

Isolation Forest (iForest) is regarded as the state of the art in anomaly detection. The idea is that data points are recursively split, just as in random trees, so that the leaves are the most isolated points; if a point is an anomaly, it gets isolated quickly, i.e. at a lower depth. The advantages of iForest are that it scales well to higher-dimensional data, has linear time complexity, and has a low memory requirement.

The Isolation Forest algorithm, explained in three steps (a from-scratch sketch follows the steps):

1. Build the tree: the inputs are a sample of the data, the current tree height, and the maximum depth; the output is the built tree.

2. Build the forest from the trees: the inputs are the data, the number of trees, and the sampling size (the amount of data fed to each tree); the output is a forest (several trees fitted to the data).

3. Find the path length: the inputs are an instance, an isolation tree, and the current path length (initially zero); the output is the path length (the number of nodes crossed before the instance gets isolated).
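A from-scratch sketch of these three steps (a toy illustration following the pseudocode in Liu et al. [7], not a reference implementation) might look like this:

```python
import math
import random

def c(n):
    """Average path length of an unsuccessful BST search; normalizes depths (Liu et al. [7])."""
    if n <= 1:
        return 0.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n

def build_itree(X, depth, max_depth):
    """Step 1: recursively split on a random feature at a random value."""
    if depth >= max_depth or len(X) <= 1:
        return {"size": len(X)}                      # external (leaf) node
    q = random.randrange(len(X[0]))                  # random split feature
    lo, hi = min(x[q] for x in X), max(x[q] for x in X)
    if lo == hi:
        return {"size": len(X)}
    p = random.uniform(lo, hi)                       # random split value
    return {"q": q, "p": p,
            "left":  build_itree([x for x in X if x[q] < p],  depth + 1, max_depth),
            "right": build_itree([x for x in X if x[q] >= p], depth + 1, max_depth)}

def build_iforest(X, n_trees=100, psi=256):
    """Step 2: fit n_trees trees, each on a random sub-sample of size psi."""
    psi = min(psi, len(X))
    max_depth = math.ceil(math.log2(psi))            # trees need not grow past ~log2(psi)
    return [build_itree(random.sample(X, psi), 0, max_depth) for _ in range(n_trees)], psi

def path_length(x, node, depth=0):
    """Step 3: nodes crossed before x is isolated, plus c(size) for the unbuilt subtree."""
    if "q" not in node:
        return depth + c(node["size"])
    child = node["left"] if x[node["q"]] < node["p"] else node["right"]
    return path_length(x, child, depth + 1)

def anomaly_score(x, forest, psi):
    """s(x, psi) = 2^(-E[h(x)] / c(psi)); scores near 1 indicate anomalies (Liu et al. [7])."""
    mean_depth = sum(path_length(x, tree) for tree in forest) / len(forest)
    return 2.0 ** (-mean_depth / c(psi))
```

Usage is then forest, psi = build_iforest(points) followed by anomaly_score(point, forest, psi): anomalies, being easier to isolate, end up with shorter average paths and scores close to 1.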

Important points to note while using Isolation Forests include tuning difficulties and limitations (original paper: Liu et al. [7]).

1. Although it is regarded as the most computationally efficient choice for higher-dimensional data, it suffers from the curse of dimensionality like any other algorithm. Let b be the maximum size of a node in bytes and t the number of trees; a working anomaly detection model is then estimated at less than 511 × t × b bytes (with the default sub-sampling size ψ = 256, a tree over ψ external nodes has at most 2ψ − 1 = 511 nodes), which is trivial on modern computing equipment: with t = 100 trees and b = 32 bytes, the bound is under 1.6 MB. iForest has time complexity O(t · ψ · log ψ) in the training stage and O(n · t · log ψ) in the evaluating stage, where ψ is the sub-sampling size and n is the testing data size.

2. iForest achieves high anomaly detection performance, with high efficiency, at a very small sub-sampling size, unlike other algorithms whose performance improves as the sampling size grows.

3. To avoid swamping (labelling normal instances as anomalies) and masking (too many anomalies concealing their own presence), it is important to tune two parameters: the number of sub-samples used to grow each tree and the number of trees (see the sketch below).
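A minimal tuning sketch with scikit-learn's IsolationForest (parameter names as in scikit-learn; the values below match the paper's recommended ψ = 256 and t = 100):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 5))  # nominal-only data

iforest = IsolationForest(
    max_samples=256,     # psi: sub-sample size used to grow each tree
    n_estimators=100,    # t: number of trees
    contamination=0.01,  # assumed anomaly rate; sets the score threshold used by predict()
    random_state=0,
).fit(X_train)

labels = iforest.predict(X_train)     # +1 = nominal, -1 = flagged as anomaly
```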

The above solution is valid when we have zero anomaly samples and only nominal ones. Now suppose we are working on a business case study alongside a domain expert who can help us confirm anomalies, and we want to make our algorithm more robust to them. For that setting, Das et al. [1] have proposed the Active Anomaly Discovery algorithm, which essentially re-weights the projections of LODA (Lightweight Online Detector of Anomalies), making the nominal and anomaly points more separable via a modified accuracy-at-the-top objective.
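For context, here is a minimal LODA-style scorer (a sketch of the base detector only, not Das et al.'s feedback-driven re-weighting; sparse random projections with one histogram density estimator each, following the general LODA recipe):

```python
import numpy as np

def fit_loda(X, k=100, n_bins=50, seed=0):
    """Fit k sparse random projections, each with a 1-D histogram density estimate."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    nnz = max(1, int(np.sqrt(d)))          # ~sqrt(d) non-zero entries per projection
    W = np.zeros((k, d))
    hists = []
    for i in range(k):
        idx = rng.choice(d, size=nnz, replace=False)
        W[i, idx] = rng.normal(size=nnz)
        counts, edges = np.histogram(X @ W[i], bins=n_bins, density=True)
        hists.append((counts, edges))
    return W, hists

def loda_score(x, W, hists, eps=1e-12):
    """Mean negative log density across projections; higher = more anomalous."""
    total = 0.0
    for w, (counts, edges) in zip(W, hists):
        z = x @ w
        j = int(np.clip(np.searchsorted(edges, z) - 1, 0, len(counts) - 1))
        total -= np.log(counts[j] + eps)   # out-of-range values fall in edge bins (a simplification)
    return total / len(W)
```

Das et al.'s contribution on top of this base detector is to replace the uniform average over projections with weights learned from the expert's feedback, so projections that separate the confirmed anomalies get emphasized.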

Many of the above algorithms may flag a point as an anomaly but say nothing about why it is an anomaly/outlier. The research community has made attempts in this direction: Siddiqui et al. [9] proposed SFE (Sequential Feature Explanation) to convey the reasons behind flagging a point, in a form friendly to a human analyst. Interactive outlier detection software has been built on top of SFE; one example is Quirk, proposed by Mokoena et al. [8].
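As a toy illustration of the idea (a greedy sketch only, not Siddiqui et al.'s actual algorithm; imputing hidden features with the training mean is an assumption made for illustration), features can be revealed one at a time in the order that keeps the instance looking most anomalous:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def greedy_feature_explanation(x, X_train, model):
    """Order features by how strongly revealing each one raises the anomaly score."""
    mean = X_train.mean(axis=0)
    revealed = np.zeros(len(x), dtype=bool)
    order = []
    for _ in range(len(x)):
        best_j, best_score = None, -np.inf
        for j in np.where(~revealed)[0]:
            trial = np.where(revealed, x, mean)  # hidden features imputed by the mean
            trial[j] = x[j]                      # ...with candidate feature j revealed
            score = -model.score_samples(trial.reshape(1, -1))[0]  # higher = more anomalous
            if score > best_score:
                best_j, best_score = j, score
        revealed[best_j] = True
        order.append(int(best_j))
    return order  # the first few features are the strongest evidence to show an analyst
```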

References

[1] Shubhomoy Das et al. "Incorporating Expert Feedback into Active Anomaly Discovery". In: 2016 IEEE 16th International Conference on Data Mining (ICDM). ISSN: 2374-8486. Dec. 2016, pp. 853–858. doi: 10.1109/ICDM.2016.0102.

[2] Jacob Devlin et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". In: arXiv:1810.04805 [cs] (May 2019). url: https://arxiv.org/abs/1810.04805 (visited on 10/11/2021).

[3] Tom Dietterich et al. "Anomaly Detection: Principles, Benchmarking, Explanation, and Theory". Presentation slides, p. 57.

[4] Andrew Emmott et al. "A Meta-Analysis of the Anomaly Detection Problem". In: arXiv:1503.01158 [cs, stat] (Aug. 2016). url: https://arxiv.org/abs/1503.01158

[5] Yaru Hao et al. "Self-Attention Attribution: Interpreting Information Interactions Inside Transformer". In: arXiv:2004.11207 [cs] (Feb. 2021). url: https://arxiv.org/abs/2004.11207

[6] Goro Kobayashi et al. "Attention is Not Only a Weight: Analyzing Transformers with Vector Norms". In: arXiv:2004.10102 [cs] (Oct. 2020). url: https://arxiv.org/abs/2004.10102

[7] Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. "Isolation Forest". In: 2008 Eighth IEEE International Conference on Data Mining. Pisa, Italy: IEEE, Dec. 2008, pp. 413–422. isbn: 978-0-7695-3502-9. doi: 10.1109/ICDM.2008.17. url: https://ieeexplore.ieee.org/document/4781136/

[8] Tshepiso Mokoena et al. "Bringing sequential feature explanations to life". In: 2017 IEEE AFRICON. Cape Town: IEEE, Sept. 2017, pp. 59–64. isbn: 978-1-5386-2775-4. doi: 10.1109/AFRCON.2017.8095456. url: https://ieeexplore.ieee.org/document/8095456/

[9] Md Amran Siddiqui et al. "Sequential Feature Explanations for Anomaly Detection". In: arXiv:1503.00038 [cs, stat] (Feb. 2015). url: https://arxiv.org/abs/1503.00038

[10] Songqiao Han et al. "ADBench: Anomaly Detection Benchmark". In: arXiv:2206.09426 [cs] (2022). url: https://arxiv.org/abs/2206.09426
