K-Means Clustering Use Cases in the Security Domain
Introduction
K-means clustering is an unsupervised machine learning algorithm, and one of the simplest and most popular.
A cluster refers to a collection of data points aggregated together because of certain similarities. You define a target number k, which refers to the number of centroids you need in the dataset. A centroid is the imaginary or real location representing the centre of a cluster. Every data point is allocated to exactly one cluster in a way that reduces the within-cluster sum of squares. In other words, the K-means algorithm identifies k centroids and then allocates every data point to the nearest one, while keeping the clusters as compact as possible. The ‘means’ in K-means refers to averaging the data, that is, finding the centroid.
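To make the two quantities above concrete, here is a minimal sketch that computes a centroid as the mean of a cluster's points and then the within-cluster sum of squares around it. The function names and the three sample points are made up for illustration.

```python
def centroid(points):
    """Return the mean position of a list of equal-length numeric tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def within_cluster_ss(points, centre):
    """Sum of squared Euclidean distances from each point to the centre."""
    return sum(sum((a - b) ** 2 for a, b in zip(p, centre)) for p in points)


# Made-up sample cluster of three 2-D points.
cluster = [(1.0, 2.0), (3.0, 4.0), (5.0, 0.0)]

c = centroid(cluster)                   # (3.0, 2.0)
wcss = within_cluster_ss(cluster, c)    # 16.0
off_centre = within_cluster_ss(cluster, (2.0, 2.0))  # 19.0, larger than at the mean

print(c, wcss, off_centre)
```

Measuring from any point other than the mean gives a larger sum of squares, which is why K-means places each centroid at the average of its cluster.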
How the K-means algorithm works
To process the learning data, the K-means algorithm in data mining starts with a first group of randomly selected centroids, which are used as the beginning points for every cluster, and then performs iterative (repetitive) calculations to optimise the positions of the centroids.
It halts creating and optimising clusters when either the centroids have stabilised (their values no longer change between iterations) or the defined maximum number of iterations has been reached.
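The iterative procedure described above can be sketched in plain Python as follows. This is an illustrative version only: the function names, the `max_iter` and `seed` parameters, and the sample points are assumptions made for this example, and a production system would more likely rely on an established library.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise average of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def k_means(points, k, max_iter=100, seed=0):
    """Return (centroids, labels) for a list of numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # randomly selected starting centroids
    labels = []
    for _ in range(max_iter):          # stop 2: iteration limit reached
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # An empty cluster keeps its previous centroid.
            new_centroids.append(mean_point(members) if members else centroids[j])
        if new_centroids == centroids:  # stop 1: centroids have stabilised
            break
        centroids = new_centroids
    return centroids, labels


# Made-up data: two well-separated groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = k_means(points, k=2)
print(labels)
```

Which group receives label 0 and which receives label 1 depends on the random starting centroids, so only the grouping itself is meaningful, not the label numbers.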
Use Cases of K-means in the Security Domain
1: Document Analysis
There are many different reasons why you would want to run an analysis on a document. In this scenario, you want to be able to organise the documents quickly and efficiently.
Problem: Imagine you are limited in time and need to organise information held in documents quickly. To complete this task you need to understand the theme of each text, compare it with other documents, and classify it.
Working: Hierarchical clustering has been used to solve this problem, and K-means can be applied in a similar way once each document is represented as a numeric vector. The algorithm is able to look at the text and group it into different themes. Using this technique, you can cluster and organise similar documents quickly using the characteristics identified in the text.
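As a rough illustration of grouping documents by theme, the sketch below uses K-means rather than the hierarchical clustering mentioned above, in keeping with the topic of this article. It turns each document into a word-count vector and clusters the vectors. Everything here is an assumption made for the example: the four sample documents are invented, the helper names are hypothetical, and the first k documents are used as starting centroids so that the result is repeatable.

```python
def vectorise(docs):
    """Turn each document into a word-count vector over a shared vocabulary."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    return [tuple(float(d.lower().split().count(w)) for w in vocab) for d in docs]


def sq_dist(p, q):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def cluster_documents(docs, k, max_iter=20):
    """Return one cluster label per document."""
    vectors = vectorise(docs)
    centroids = vectors[:k]  # assumption: the first k documents seed the clusters
    labels = []
    for _ in range(max_iter):
        labels = [min(range(k), key=lambda j: sq_dist(v, centroids[j]))
                  for v in vectors]
        updated = []
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                # Column-wise mean of the member vectors.
                updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                updated.append(centroids[j])
        if updated == centroids:
            break
        centroids = updated
    return labels


# Invented sample documents: two about security alerts, two about invoices.
docs = [
    "firewall blocked malware traffic",
    "invoice payment overdue reminder",
    "malware detected firewall alert",
    "payment received invoice closed",
]
doc_labels = cluster_documents(docs, k=2)
print(doc_labels)  # [0, 1, 0, 1]
```

Raw word counts are the simplest possible representation. Real document sets usually need weighting of common words and a more careful choice of starting centroids.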
2: Criminal or Fraudulent Activities
In this scenario, we are going to focus on fraudulent taxi driver behaviour. However, the technique has been used in multiple scenarios.
Problem: You need to look into fraudulent driving activity. The challenge is how to tell genuine behaviour from fraudulent behaviour.
Working: By analysing the GPS logs, the algorithm is able to group similar behaviours. Based on the characteristics of the groups, you can then classify them into those that are genuine and those that are fraudulent.
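Before GPS logs can be clustered, each trip has to be summarised as a few numbers. The sketch below shows one possible way to do that, assuming a log is a time-ordered list of (timestamp in seconds, latitude, longitude) records. The log format, the choice of features (trip distance and average speed), and the sample trip are all assumptions for illustration, not details from the taxi study. The distance between consecutive points uses the standard haversine formula.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def trip_features(log):
    """Summarise a trip as (distance_km, average_speed_kmh).

    log: list of (timestamp_seconds, latitude, longitude), in time order.
    """
    distance = sum(
        haversine_km(a[1], a[2], b[1], b[2]) for a, b in zip(log, log[1:])
    )
    hours = (log[-1][0] - log[0][0]) / 3600
    avg_speed = distance / hours if hours > 0 else 0.0
    return (distance, avg_speed)


# Made-up trip: three GPS fixes along the equator over one hour.
trip = [(0, 0.0, 0.0), (1800, 0.0, 0.05), (3600, 0.0, 0.1)]
features = trip_features(trip)
print(features)  # roughly 11.12 km at roughly 11.12 km/h
```

Feature tuples like this one, computed for every trip, would then be the data points handed to K-means. Deciding which resulting group looks fraudulent remains a judgement made by an analyst.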