Astor Perkins的动态

查看Astor Perkins的组织主页

1,120 位关注者

Perform outlier detection more effectively using subsets of features This article is part of a series related to the challenges, and the techniques that may be used, to best identify outliers in data, including articles related to using PCA, Distance Metric Learning, Shared Nearest Neighbors, Frequent Patterns Outlier Factor, Counts Outlier Detector (a multi-dimensional histogram-based method), and doping. This article also contains an excerpt from my book, Outlier Detection in Python. We look here at techniques to create, instead of a single outlier detector examining all features within a dataset, a series of smaller outlier detectors, each working with a subset of the features (referred to as subspaces). There are also a number of technical challenges that appear in outlier detection. Among these are the difficulties that occur where data has many features. As covered in previous articles related to Counts Outlier Detector and Shared Nearest Neighbors, where we have many features, we often face an issue known as the curse of dimensionality. This has a number of implications for outlier detection, including that it makes distance metrics unreliable. Many outlier detection algorithms rely on calculating the distances between records — in order to identify as outliers the records that are similar to unusually few other records, and that are unusually different from most other records — that is, records that are close to few other records and far from most other records. To address these issues, an important technique in outlier detection is using subspaces. The term subspaces simply refers to subsets of the features. In the example above, if we use the subspaces: A-B, C-D, E-F, A-E, B-C, B-D-F, and A-B-E, then we have seven subspaces (five 2d subspaces and two 3d subspaces). Creating these, we would run one (or more) detectors on each subspace, so would run at least seven detectors on each record. We’ve seen, then, a couple motivations for working with subspaces: we can mitigate the curse of dimensionality, and we can reduce where anomalies are not identified reliably where they are based on small numbers of features that are lost among many features. As well as handling situations like this, there are a number of other advantages to using subspaces with outlier detection. These include... https://lnkd.in/exfdadZC

Perform Outlier Detection More Effectively Using Subsets of Features | Towards Data Science

Perform Outlier Detection More Effectively Using Subsets of Features | Towards Data Science

https://towardsdatascience.com

要查看或添加评论,请登录