K-means Clustering and its real use cases in the security domain

What is K-means Clustering?

Every Machine Learning engineer wants to achieve accurate predictions with their algorithms. Such learning algorithms are generally divided into two types: supervised and unsupervised. K-means clustering is an unsupervised algorithm, meaning it works on input data that has no labeled responses.

K-means is an iterative algorithm that divides a dataset of n data points into K subgroups (clusters), placing each point in the cluster whose centroid is nearest. The centroid of a cluster is the mean of the points assigned to it. K is the number of clusters, fixed before the algorithm runs. For example, if K = 5, the algorithm forms 5 clusters from the dataset.
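
The iteration described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the function name `kmeans`, the random choice of starting centroids, the `max_iter` and `seed` parameters, and the toy points are all assumptions made for the example.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster a list of equal-length numeric tuples into k groups."""
    rng = random.Random(seed)
    # Assumption: start from k randomly chosen data points as centroids.
    centroids = rng.sample(points, k)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its assigned points.
        updated = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                updated.append(centroids[c])  # keep the centroid of an empty cluster
        if updated == centroids:
            break  # nothing moved, so the assignments are stable
        centroids = updated
    return centroids, labels


# Made-up toy data: two visibly separate groups of 2-D points.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8, 9.5)]
centroids, labels = kmeans(points, k=2)
print(labels)     # cluster index assigned to each point
print(centroids)  # final cluster means
```

Which group receives index 0 and which receives index 1 depends on the random starting centroids; only the grouping itself is meaningful.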

Choosing the right number of clusters is crucial, and it should not be picked at random. There are many methods for identifying a suitable number of clusters; a common one relies on the Within-Cluster Sum of Squares (WCSS). WCSS is the sum of the squared distances between each data point and the centroid of its cluster, taken over all clusters. For a given K, the algorithm tries to minimize this quantity: points are assigned to their nearest centroid, the centroids are recomputed, and the process repeats until the assignments stop changing. The result is a local minimum of WCSS, not necessarily the global one. The Elbow method then looks at the total WCSS as a function of the number of clusters.
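
Written as a formula, the WCSS definition above reads as follows. The symbols are introduced here for illustration and do not appear in the original text: C_j is the set of points in cluster j, and mu_j is its centroid.

```latex
\mathrm{WCSS}(K) = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^{2},
\qquad
\mu_j = \frac{1}{\lvert C_j \rvert} \sum_{x_i \in C_j} x_i
```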

The elbow method plots the value of the cost function (the WCSS, also called distortion) produced by different values of K. As K increases, the average distortion decreases: each cluster has fewer constituent instances, and the instances are closer to their respective centroids. However, the improvement in average distortion shrinks as K increases. The value of K after which the improvement drops off most sharply is called the elbow, and that is where we should stop dividing the data into further clusters.
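
A rough sketch of the elbow idea, using only the standard library. Several things here are assumptions for the sake of a small, repeatable example rather than part of the method itself: the helper names, the deterministic farthest-first choice of starting centroids (swapped in for the usual random start), the rule of picking the K with the largest second difference of the WCSS curve (one simple way to locate the bend numerically instead of reading it off a plot), and the made-up data of three tight groups.

```python
import math


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_first(points, k):
    """Deterministic start: first point, then repeatedly the point farthest from all chosen ones."""
    chosen = [points[0]]
    while len(chosen) < k:
        chosen.append(max(points, key=lambda p: min(math.dist(p, c) for c in chosen)))
    return chosen


def kmeans_wcss(points, k, max_iter=100):
    """Run the assign/update iteration and return the final WCSS."""
    centroids = farthest_first(points, k)
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        updated = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return sum(sq_dist(p, centroids[i]) for i, members in enumerate(clusters) for p in members)


def elbow(wcss_by_k):
    """Pick the K where the curve bends most: the largest second difference."""
    ks = sorted(wcss_by_k)
    return max(ks[1:-1], key=lambda k: wcss_by_k[k - 1] - 2 * wcss_by_k[k] + wcss_by_k[k + 1])


def blob(cx, cy):
    """Four made-up points around a centre."""
    return [(cx + dx, cy + dy) for dx in (-0.5, 0.5) for dy in (-0.5, 0.5)]


points = blob(0, 0) + blob(10, 0) + blob(5, 9)
wcss_by_k = {k: kmeans_wcss(points, k) for k in range(1, 7)}
best_k = elbow(wcss_by_k)
for k in sorted(wcss_by_k):
    print(k, round(wcss_by_k[k], 2))
print("elbow at K =", best_k)
```

On this toy data the WCSS falls steeply up to K = 3 and barely changes afterwards, so the bend sits at 3. On real data the curve is usually smoother, and reading the elbow becomes a judgment call.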

Use cases in the security domain

1. Cyber-profiling criminals

The idea of cyber profiling is derived from criminal profiling, which gives investigators information for classifying the types of criminals who were at a crime scene. A profile is information about an individual or group of individuals that is accumulated, stored, and used for various purposes, for example by monitoring their behavior through their internet activity. Cyber profiling is difficult to implement because user data is diverse and online behavior sometimes differs from actual behavior. Here, clustering techniques are used to group web-based content according to users' preferences. These preferences serve as an initial grouping of the data, so that the resulting clusters represent user profiles.

2. Insurance Fraud Detection

Machine Learning has a critical role to play in fraud detection, with numerous applications in automobile, healthcare, and insurance fraud detection. Using historical data on fraudulent claims, new claims can be isolated based on their proximity to clusters that indicate fraudulent patterns. Since insurance fraud can have a multi-million dollar impact on a company, the ability to detect fraud is crucial.
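
The proximity idea can be sketched as follows. Everything concrete here is hypothetical: the function names, the two scaled features, the centroid positions, the share of fraudulent historical claims per cluster, and the 0.5 threshold are invented for illustration and do not come from any real claims data.

```python
import math


def nearest_cluster(claim, centroids):
    """Index of the centroid closest to the claim's feature vector."""
    return min(range(len(centroids)), key=lambda i: math.dist(claim, centroids[i]))


def flag_claim(claim, centroids, fraud_share, threshold=0.5):
    """Flag a claim when it falls in a cluster dominated by past fraud."""
    cluster = nearest_cluster(claim, centroids)
    return cluster, fraud_share[cluster] >= threshold


# Hypothetical centroids learned from historical claims, with features
# scaled to the range 0..1 so that no single feature dominates the distance.
centroids = [(0.2, 0.8), (0.9, 0.1)]
# Hypothetical fraction of historical claims in each cluster that were fraudulent.
fraud_share = [0.02, 0.70]

suspicious = flag_claim((0.85, 0.2), centroids, fraud_share)
ordinary = flag_claim((0.25, 0.7), centroids, fraud_share)
print(suspicious, ordinary)
```

A flag of this kind only marks a claim for closer review; it is not proof of fraud.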

3. Automatic Clustering of IT Alerts

Large enterprise IT infrastructure components such as network, storage, or database systems generate large volumes of alert messages. Because these alerts potentially point to operational issues, they must be manually screened and prioritized for downstream processes. Clustering the alert data can provide insight into categories of alerts and mean time to repair, and can help with failure prediction.
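
Once alerts have been assigned to clusters, the per-category view mentioned above is a simple aggregation. The sketch below assumes the cluster labels already exist; the function name, the field names, and the repair times are made up for the example.

```python
from collections import defaultdict


def summarize_clusters(labels, repair_minutes):
    """Count alerts and average the repair time for each cluster label."""
    totals = defaultdict(lambda: [0, 0.0])  # label -> [alert count, total minutes]
    for label, minutes in zip(labels, repair_minutes):
        totals[label][0] += 1
        totals[label][1] += minutes
    return {
        label: {"alerts": count, "mttr_minutes": total / count}
        for label, (count, total) in totals.items()
    }


# Hypothetical cluster labels for five alerts and their repair times in minutes.
labels = [0, 0, 1, 1, 1]
repair_minutes = [10, 20, 60, 90, 120]
summary = summarize_clusters(labels, repair_minutes)
print(summary)
```

Clusters with many alerts or a long mean time to repair are natural candidates for prioritization.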
