K-means clustering: Applications in security domains
Ajeenkya S.
Jr. Soft Engg @Cognizant, EDI-Maps Developer, 2X OCI, 1xAWS Certified, 1X Aviatrix Certified, AT&T Summer Learning Academy Extern, LW summer Research Intern, ARTH Learner, 1X Gitlab Certified Associate, ARTH 2.0 LW_TV
K-means is an iterative algorithm that divides an unlabeled dataset into k clusters in such a way that each data point belongs to only one group, made up of points with similar properties.
What is K-Means Algorithm?
K-Means Clustering is an unsupervised learning algorithm that groups an unlabeled dataset into different clusters. Here K is the number of clusters to be created, and it is fixed before the algorithm runs.
The K-means clustering algorithm computes the centroids and iterates until the centroids stop improving. It assumes that the number of clusters is already known. It is also called a flat clustering algorithm.
In this algorithm, the data points are assigned to a cluster in such a manner that the sum of the squared distance between the data points and centroid would be minimum. It is to be understood that less variation within the clusters will lead to more similar data points within the same cluster.
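The quantity being minimized is commonly written as the within-cluster sum of squares. The notation below is introduced here only for illustration and is not from the original text: C_j is the j-th cluster, mu_j its centroid, and x_i a data point.

```latex
J = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```

A smaller J means the points sit closer to their own centroid, which is the "less variation within the clusters" described above.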
In more technical terms, we try to make the data within each cluster as homogeneous as possible, while keeping the clusters themselves as heterogeneous (different from one another) as possible. K is the number of clusters we try to obtain. We can experiment with K until we are satisfied with the results.
Some Advantages of K-Means Clustering Algorithm:
Some Disadvantages of K-Means Clustering Algorithm:
Some Uses of K-Means Clustering Algorithm in Real Life:
Working of the k-means algorithm
The K-Means Clustering algorithm works with a few simple steps.
The time needed to run the K-Means Clustering algorithm depends on the size of the dataset, the K number we define, and the patterns in the data.
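As a rough illustration of those steps, here is a minimal sketch of the standard (Lloyd-style) k-means loop in plain Python. It is not taken from the article or from any library: the function name kmeans, its parameters, and the choice of random data points as starting centroids are all assumptions made for this example.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Illustrative k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Initialization (assumed here): use k distinct data points as centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment: every point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Convergence: stop once no point changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Update: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels


if __name__ == "__main__":
    data = [(0, 0), (1, 1), (2, 2), (100, 100), (101, 101), (102, 102)]
    centers, assignment = kmeans(data, k=2)
    print(centers, assignment)
```

Each pass costs one distance computation per point per centroid, which is why the running time grows with the dataset size and with K.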
Application of K-means clustering in Security Domains:
The ability to group data records according to their features, attributes, or similarities makes clustering significant in many fields related to data analysis, such as pattern recognition, image processing, information retrieval, geography, and marketing. In the information era, both storing such massive data and performing computation on it have been a challenge. Much of this has been addressed by the wave of advancements in cloud technology.
With the migration from on-premises to cloud, the need for security upgrades has become even more apparent. Preserving data privacy during outsourced analysis has improved through successive iterations of protocols and algorithms, but it still falls short of perfection. This holds especially true for clustering techniques. Due to the sheer volume of inputs often involved in data mining or ML problems, generic multiparty computation (MPC) protocols become infeasible in terms of communication cost. This has led to function-specific multiparty protocols that handle one specific functionality efficiently while still providing privacy to the parties.
Secure two-party k-means Clustering:
A solution to the above problem was proposed by Paul Bunn and Rafail Ostrovsky in their research paper (refer here), in which they designed a protocol that takes the single-database protocol as a template and extends it to the two-party setting. They use numerous sub-protocols, each of which preserves privacy against an honest-but-curious adversary, together with standard cryptographic tools to maintain privacy in the two-party k-means clustering algorithm.
Use in Intrusion Detection Systems:
Intrusion Detection Systems (IDS) are mainly used to distinguish normal behavior from abnormal behavior and then take corresponding measures. Applying unsupervised clustering algorithms to anomaly detection can improve the detection efficiency of an IDS and increase its practical value. k-means is among the most commonly used and most practical ways to implement this.
In real systems, labeled data is often unavailable, so the normal or abnormal status of a connection record cannot be determined directly and the clusters have to be labeled instead. Typically a threshold is used: connection records above the threshold are kept in the normal cluster, whereas the rest form the exception (anomalous) cluster. According to the paper (refer) published by Chunfen Bu, the experimental results show an average detection rate of 89.24% and a false alarm (false positive) rate of 0.77%.
The improved k-means flow proposed in the research paper starts with data preprocessing on the collected data (the original dataset). Data normalization uses Min-Max linear normalization to map the data into the interval from 0 to 1. Feature extraction uses the PCA algorithm to reduce the feature dimension of the entire dataset. Outlier detection is then performed on the whole dataset, and the detected outliers are removed so that they do not distort the k-means clusters and their center points (centroids), which improves the accuracy of the clustering algorithm. This improved PCA-based k-means algorithm reached 99.02% accuracy with a false positive rate of only 1.144% (for various intrusion types).
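The Min-Max normalization step can be sketched as follows. This is a minimal illustration and not the paper's code: the function name min_max_normalize and the sample feature values are made up for the example, and the PCA and outlier-removal stages are left out.

```python
def min_max_normalize(rows):
    """Rescale each column of rows (equal-length numeric lists) into [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    normalized = []
    for row in rows:
        normalized.append([
            # A constant column has no spread; map it to 0.0 to avoid
            # dividing by zero.
            (value - lo) / (hi - lo) if hi > lo else 0.0
            for value, lo, hi in zip(row, lows, highs)
        ])
    return normalized


if __name__ == "__main__":
    # Hypothetical connection records: three made-up numeric features each.
    records = [[0, 10, 5], [5, 20, 5], [10, 40, 5]]
    print(min_max_normalize(records))
```

Scaling matters for k-means because the algorithm relies on Euclidean distance, so a feature with a large numeric range would otherwise dominate the clustering.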
Cyber Security Analytics on Apache Spark:
Apache Spark is an open-source unified analytics engine for large-scale data processing in fields like Big Data and Machine Learning. It has seen rapid adoption by enterprises across a wide range of industries.
It is one of the commonly used big data frameworks, which is scalable, in-memory persistent, fault tolerant, and supports programs that can be executed in parallel.
Cyber security analytics is an alternative to traditional security systems. It applies the techniques and methods of big data analytics to security-related problems. With the help of big data frameworks like Hadoop and Spark, it can handle large volumes of data in real time and provide data-driven insight into security incidents. Cyber security analytics can also detect attacks hidden inside an enormous number of security events by separating the relevant events from the irrelevant ones. This in turn speeds up the process of security analysis.
Elkan’s k-means clustering using the triangle inequality (k-meansTI) is one way to improve the performance of the original k-means algorithm. k-meansTI avoids computing point-to-center distances for cluster centers that are too far from a point to be its nearest center. Its main contribution is the possibility of reducing the cost of standard k-means from O(kne), for n points, k clusters, and e iterations, to approximately O(n) in practice.
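The pruning idea rests on a simple consequence of the triangle inequality: if the distance between two centers b and c is at least twice the distance from a point x to b, then c cannot be closer to x than b is. The sketch below applies only that one test to a single assignment pass. It is a simplified illustration, not Elkan's full algorithm (which also keeps upper and lower bounds across iterations) and not the Spark implementation; the function name assign_with_pruning is made up for this example.

```python
import math


def assign_with_pruning(points, centroids):
    """Assign each point to its nearest centroid, skipping distance
    computations that the triangle inequality proves unnecessary.

    Returns the labels and the number of point-to-centroid distances computed.
    """
    k = len(centroids)
    # Centroid-to-centroid distances are computed once and shared by all points.
    cc = [[math.dist(a, b) for b in centroids] for a in centroids]
    labels, computed = [], 0
    for p in points:
        best, best_d = 0, math.dist(p, centroids[0])
        computed += 1
        for j in range(1, k):
            # If d(c_best, c_j) >= 2 * d(p, c_best), then d(p, c_j) >= d(p, c_best),
            # so centroid j cannot win and its distance is never computed.
            if cc[best][j] >= 2 * best_d:
                continue
            d = math.dist(p, centroids[j])
            computed += 1
            if d < best_d:
                best, best_d = j, d
        labels.append(best)
    return labels, computed


if __name__ == "__main__":
    centers = [(0, 0), (10, 0), (0, 10)]
    pts = [(0.5, 0.5), (9, 1), (1, 9.5), (4, 4)]
    labels, computed = assign_with_pruning(pts, centers)
    print(labels, computed, "of", len(pts) * len(centers), "possible distances")
```

Points that lie close to their center trigger the skip most often, so the savings grow as the clusters become tighter and better separated.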
The results showed that the parallel implementation of k-meansTI on Spark can achieve better performance than the Spark ML k-means when the dataset is very large. However, performance degraded for small datasets. Clustering Web attacks shows that good clustering results can organize and reduce the data for further analysis and give important insight into the properties of the attacks. The knowledge obtained from the clustering results can also be used to quickly classify new data.
Some more applications of K-means under the security domain are as follows:
Conclusion:
I hope this information was relevant to you all. Here, I tried to explain the k-means clustering algorithm and its applications, which are currently being researched by many scientists. Please refer to the original research papers linked above for much more detailed information on the implementation and in-depth working of the use cases mentioned above.
Thank you :)