K-means Clustering and its real use case in the security domain
What is K-means Clustering?
Every Machine Learning engineer wants to achieve accurate predictions with their algorithms. Such learning algorithms are generally broken down into two types: supervised and unsupervised. K-means clustering is one of the unsupervised algorithms, where the available input data does not have a labeled response.
K-means is an iterative algorithm that divides a set of n data points into K subgroups (clusters) based on similarity, measured by each point's distance from the centroid (the mean) of its cluster. K is the predefined number of clusters to be formed by the algorithm. For example, if K=5, the dataset is partitioned into 5 clusters.
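The iteration described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: it assumes points are tuples of numbers, uses squared Euclidean distance as the similarity measure, and picks the starting centroids at random from the data. The function names (k_means, assign_labels, and so on) are made up for this example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean (the centroid) of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def assign_labels(points, centroids):
    """For each point, return the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
        for p in points
    ]


def k_means(points, k, max_iters=100, seed=0):
    """Partition points into k clusters; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Assumption: start from k distinct data points chosen at random.
    centroids = rng.sample(points, k)
    for _ in range(max_iters):
        labels = assign_labels(points, centroids)
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            # Keep the old centroid if a cluster ends up empty.
            new_centroids.append(mean_point(members) if members else centroids[c])
        if new_centroids == centroids:
            break  # centroids stopped moving, so the assignments are stable
        centroids = new_centroids
    return centroids, assign_labels(points, centroids)
```

Calling k_means(points, k=2) on two well-separated groups of points returns two centroids, one near the middle of each group, plus a list giving the cluster index of every point.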
Choosing the right number of clusters is crucial, and it shouldn't be picked at random. Each cluster is formed by comparing the distances of the data points from the cluster centroids and assigning every point to its nearest centroid. There are many methods for identifying the optimal number of clusters; a common one relies on the Within-Cluster Sum of Squares (WCSS). WCSS is the sum, over all clusters, of the squared distances between each data point and the centroid of its cluster. For a given K, the main idea is to minimize the distance between the data points and the centroids of their clusters, so the process is iterated until WCSS stops decreasing. The Elbow method then looks at the total WCSS as a function of the number of clusters.
The elbow method plots the value of the cost function produced by different values of k. As k increases, average distortion decreases: each cluster has fewer constituent instances, and the instances are closer to their respective centroids. However, the improvements in average distortion decline as k increases. The value of k at which the improvement in distortion declines the most is called the elbow, and it is the point at which we should stop dividing the data into further clusters.
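As a rough sketch of the elbow idea, the code below computes WCSS for several values of k and picks the k where the improvement drops off the most. Several things here are assumptions made for illustration: the helper names are invented, the k values are assumed to be consecutive integers, and the starting centroids are chosen with a deterministic "farthest-first" rule (rather than at random) so that repeated runs give the same curve. The elbow is located with a simple numeric heuristic; in practice it is often read off a plot instead.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign_labels(points, centroids):
    """For each point, return the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
        for p in points
    ]


def farthest_first_init(points, k):
    """Assumption: deterministic start. Take the first point, then repeatedly
    add the point that is farthest from all centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )
    return centroids


def k_means(points, k, max_iters=100):
    """Plain K-means iteration; returns (centroids, labels)."""
    centroids = farthest_first_init(points, k)
    for _ in range(max_iters):
        labels = assign_labels(points, centroids)
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                center = tuple(sum(xs) / len(members) for xs in zip(*members))
            else:
                center = centroids[c]  # keep an empty cluster's centroid in place
            new_centroids.append(center)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assign_labels(points, centroids)


def wcss(points, centroids, labels):
    """Sum of squared distances from each point to its cluster's centroid."""
    return sum(squared_distance(p, centroids[label]) for p, label in zip(points, labels))


def wcss_curve(points, ks):
    """Map each k to the WCSS reached by K-means with that many clusters."""
    return {k: wcss(points, *k_means(points, k)) for k in ks}


def elbow_k(curve):
    """Heuristic: the k whose preceding WCSS drop exceeds the following drop
    by the largest margin. Needs at least three consecutive k values."""
    ks = sorted(curve)
    best_k, best_bend = None, float("-inf")
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        bend = (curve[prev_k] - curve[k]) - (curve[k] - curve[next_k])
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k
```

A typical call would be elbow_k(wcss_curve(points, range(1, 7))). Plotting the values of wcss_curve against k gives the familiar elbow chart; the numeric heuristic is only a convenience and can be fooled by noisy curves.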
Use cases in the security domain
1. Cyber-profiling criminals
The idea of cyber profiling is derived from criminal profiling, which gives investigators information for classifying the types of criminals who were at a crime scene. A profile is information about an individual or group of individuals that is accumulated, stored, and used for various purposes, for example by monitoring their behavior through their internet activity. The difficulty in implementing cyber profiling lies in the diversity of user data and in the fact that online behavior sometimes differs from actual behavior. Here, clustering techniques are used to group Web-based content according to user preferences. These preferences can be interpreted as an initial grouping of the data, so that the resulting clusters represent user profiles.
2. Insurance Fraud Detection
Machine Learning has a critical role to play in fraud detection, with numerous applications in automobile, healthcare, and insurance fraud detection. Using historical data on fraudulent claims, it is possible to flag new claims based on their proximity to clusters that indicate fraudulent patterns. Since insurance fraud can potentially have a multi-million dollar impact on a company, the ability to detect fraud is crucial.
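The proximity check can be illustrated with a small sketch. It assumes that K-means has already been run on historical fraudulent claims, that each claim is a tuple of numeric features scaled to a comparable range, and that a claim counts as suspicious when it lies within some radius of a fraud centroid. The function names, the features, and the radius are all hypothetical and chosen only for illustration; a real system would tune them on its own data.

```python
import math


def nearest_centroid(claim, centroids):
    """Return (index, distance) of the centroid closest to the claim."""
    distances = [math.dist(claim, c) for c in centroids]
    index = min(range(len(centroids)), key=distances.__getitem__)
    return index, distances[index]


def is_suspicious(claim, fraud_centroids, radius):
    """Flag a claim that falls within `radius` of any fraud-cluster centroid.

    Assumption: `fraud_centroids` came from clustering past fraudulent
    claims, and `radius` is a threshold picked by the analyst.
    """
    _, distance = nearest_centroid(claim, fraud_centroids)
    return distance <= radius


# Made-up centroids over two hypothetical scaled features,
# e.g. (claim amount, time since policy start), both in [0, 1].
example_fraud_centroids = [(0.9, 0.1), (0.8, 0.9)]
print(is_suspicious((0.85, 0.15), example_fraud_centroids, radius=0.2))
print(is_suspicious((0.20, 0.50), example_fraud_centroids, radius=0.2))
```

A flag from a check like this is best treated as a prompt for manual review rather than a verdict, since legitimate claims can also land close to a fraud cluster.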
3. Automatic Clustering of IT Alerts
Large enterprise IT infrastructure components such as networks, storage, and databases generate large volumes of alert messages. Because alert messages potentially point to operational issues, they must be manually screened and prioritized for downstream processes. Clustering the data can provide insight into categories of alerts and mean time to repair, and can help in failure prediction.