K-means clustering: Applications in security domains



k-means is an iterative algorithm that divides an unlabeled dataset into k distinct clusters such that each data point belongs to exactly one group of points with similar properties.


What is the K-Means Algorithm?

K-Means Clustering is an unsupervised learning algorithm that groups an unlabeled dataset into different clusters. Here K is the number of clusters to be created, which must be chosen in advance.



The K-means clustering algorithm computes the centroids and iterates until they converge (to a local, not necessarily global, optimum). It assumes that the number of clusters is already known. It is also called a flat clustering algorithm.

In this algorithm, the data points are assigned to a cluster in such a manner that the sum of the squared distance between the data points and centroid would be minimum. It is to be understood that less variation within the clusters will lead to more similar data points within the same cluster.
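The quantity being minimized is usually written as the within-cluster sum of squares. The symbols below ($C_j$ for the j-th cluster, $\mu_j$ for its centroid, $x_i$ for a data point) are my own notation for illustration:

```latex
J = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```

A smaller J means the points sit closer to their own centroid, which is the "less variation within the clusters" idea in the paragraph above.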

In more technical terms, we try to make the data within each cluster as homogeneous as possible, while making the clusters as heterogeneous from one another as possible. K is the number of clusters we try to obtain. We can experiment with K until we are satisfied with the results.



Some Advantages of the K-Means Clustering Algorithm:

  • It is fast
  • Robust
  • Easy to understand
  • Comparatively efficient
  • Gives the best results when the clusters are distinct and well separated
  • Produces tighter clusters
  • Clusters adapt as centroids are recomputed
  • Flexible
  • Easy to interpret
  • Better computational cost
  • Enhances Accuracy
  • Works better with spherical clusters





Some Disadvantages of the K-Means Clustering Algorithm:

  • Needs the number of cluster centers to be specified in advance
  • Cannot distinguish two highly overlapping groups of data, so it cannot tell that there are two clusters
  • Different representations of the data give different results
  • Euclidean distance can weigh the features unequally
  • Converges only to a local optimum of the squared error function
  • Randomly chosen initial centroids sometimes give poor results
  • Can be used only if the mean is defined
  • Cannot handle outliers and noisy data
  • Does not work well for non-linearly separable data
  • Lacks consistency, because different runs can give different clusters
  • Sensitive to the scale of the features
  • Very large data sets can exhaust the available memory
  • Prediction issues



Some Uses of the K-Means Clustering Algorithm in Real Life:

  • Market segmentation
  • Document clustering
  • Image segmentation
  • Image compression
  • Vector quantization
  • Cluster analysis
  • Feature learning or dictionary learning
  • Identifying crime-prone areas
  • Insurance fraud detection
  • Public transport data analysis
  • Clustering of IT assets
  • Customer segmentation
  • Identifying Cancerous data
  • Used in search engines
  • Drug Activity Prediction



Working of the k-means algorithm

The K-Means Clustering algorithm works with a few simple steps.

  1. Choose K, the number of clusters
  2. Shuffle the data, randomly assign each data point to one of the K clusters, and assign initial random centroids.
  3. Calculate the squared distance between each data point and all centroids.
  4. Reassign each data point to the closest centroid based on the computation in step 3.
  5. Update each centroid by calculating the mean of the points in its cluster
  6. Repeat steps 3, 4, and 5 until the cluster assignments no longer change
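
The steps above can be sketched in plain Python. This is a minimal illustration, not a production implementation: the function names, the toy points, and the choice to pick K data points as the initial centroids (instead of starting from a random partition) are my assumptions.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Steps 1-2 (assumption): use k distinct data points as initial centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Steps 3-4: assign every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Step 6: stop once no assignment changed.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 5: move each centroid to the mean of its cluster.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # keep the old centroid if a cluster went empty
                centroids[j] = mean_point(members)
    return centroids, labels

# Toy data: two small groups of 2-D points.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
centroids, labels = kmeans(points, k=2)
```

Because the starting centroids are random, different seeds can give different final clusters, which is the "local optima" caveat from the disadvantages list.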

The time needed to run the K-Means Clustering algorithm depends on the size of the dataset, the K number we define, and the patterns in the data.





Application of K-means clustering in Security Domains:


The ability to classify data records into groups according to their features, attributes, or similarities makes k-means significant in many fields related to data analysis, such as pattern recognition, image processing, information retrieval, geography, and marketing. In the information era, both storing such massive data and performing computation on it have become challenges, which advances in cloud technology have largely addressed.




With the migration from on-premises to the cloud, the need for security upgrades has become even more apparent. Preserving data privacy during outsourced analysis has improved through successive iterations of protocols and algorithms, but still falls short of perfection. This holds especially true for clustering techniques. Due to the sheer volume of inputs often involved in data mining or ML problems, generic multiparty computation (MPC) protocols become infeasible in terms of communication cost. This has led to the construction of function-specific multiparty protocols that attempt to handle a specific functionality efficiently while still providing privacy to the parties.


Secure two-party k-means Clustering:

A solution to the above problem was proposed by Paul Bunn and Rafail Ostrovsky in their research paper (refer here), in which they designed a protocol that takes the single-database protocol as a template and extends it to the two-party setting. They used numerous sub-protocols, each of which preserves privacy against an honest-but-curious adversary, together with standard cryptographic tools to maintain privacy in the two-party k-means clustering algorithm.
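
To give a feel for the kind of building block such protocols rely on, here is a minimal sketch of additive secret sharing between two parties. It is not the protocol from the paper: the modulus, the function names, and the example values are all assumptions made for illustration.

```python
import secrets

MODULUS = 2**61 - 1  # assumed public modulus, larger than any sum we expect

def share(value):
    """Split value into two additive shares that sum to value mod MODULUS."""
    r = secrets.randbelow(MODULUS)
    return r, (value - r) % MODULUS

def reconstruct(share_a, share_b):
    """Combine two additive shares back into the shared value."""
    return (share_a + share_b) % MODULUS

# Hypothetical private inputs: each party's local sum of one coordinate
# over the points it holds in some cluster.
alice_sum, bob_sum = 1234, 4321

# Each party keeps one share of its own value and sends the other share away.
a_keep, a_send = share(alice_sum)
b_keep, b_send = share(bob_sum)

# Each party adds the shares it holds; each partial result alone looks random.
alice_partial = (a_keep + b_send) % MODULUS
bob_partial = (b_keep + a_send) % MODULUS

joint_sum = reconstruct(alice_partial, bob_partial)
```

With only two parties, revealing the joint sum would let each side work out the other's input, so a real protocol keeps intermediate values in shared form and reconstructs only the final output.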


Use in Intrusion Detection Systems:


Intrusion Detection Systems (IDS) are mainly used to distinguish normal behavior from abnormal behavior and then take corresponding measures. Applying unsupervised clustering algorithms to anomaly detection can improve the detection efficiency of an IDS and increase its practical value. k-means is one of the most commonly used and most practical choices.

In real systems, labeled data is often unavailable, so you cannot clearly determine whether a connection record is normal or abnormal and tag the clusters accordingly. Typically, a threshold is used instead: clusters whose number of connection records is above the threshold are treated as normal clusters, whereas the others are treated as anomalous clusters. According to the paper (refer) published by Chunfen Bu, the experimental results show an average detection rate of 89.24% and a false alarm rate (false positive rate) of 0.77%.
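
The threshold-based tagging described above could look roughly like this. It is a sketch under my own assumptions: the threshold is applied to each cluster's share of all records, and the function name, the 0.25 value, and the sample cluster ids are hypothetical rather than taken from the paper.

```python
from collections import Counter

def label_clusters(cluster_ids, threshold=0.25):
    """Map each cluster id to 'normal' or 'anomalous' by its share of records."""
    counts = Counter(cluster_ids)
    total = len(cluster_ids)
    return {
        cid: "normal" if n / total > threshold else "anomalous"
        for cid, n in counts.items()
    }

# Hypothetical cluster assignments for ten connection records.
cluster_ids = [0, 0, 0, 0, 0, 0, 0, 1, 1, 2]
cluster_tags = label_clusters(cluster_ids)

# Every record inherits the tag of the cluster it belongs to.
record_tags = [cluster_tags[cid] for cid in cluster_ids]
```

The idea rests on the common assumption that normal traffic far outnumbers attack traffic, so the large clusters are the normal ones.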

The research paper proposes an improved K-Means flow that begins with data preprocessing on the collected or original dataset. Data normalization uses Min-Max linear normalization to map the data into the interval from 0 to 1. Feature extraction uses the PCA algorithm to reduce the feature dimension of the entire dataset. Outlier detection is then run on the whole dataset so that outliers, which would otherwise distort the k-means clusters and their center points (centroids), are removed, improving the accuracy of the clustering algorithm. This improved PCA-based k-means algorithm reached 99.02% accuracy with a false positive rate of only 1.144% (for various intrusion types).
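
As a small illustration of the normalization step only (not PCA or outlier removal), Min-Max scaling can be sketched as follows. The function name and the sample feature values are hypothetical, and mapping a constant column to 0 is my own choice.

```python
def min_max_normalize(rows):
    """Scale each column of rows (a list of equal-length lists) into [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [
            (value - lo) / (hi - lo) if hi > lo else 0.0  # constant column -> 0
            for value, lo, hi in zip(row, lows, highs)
        ]
        for row in rows
    ]

# Hypothetical connection features: duration (s), bytes sent, failed logins.
records = [[2, 500, 0], [10, 1500, 0], [6, 1000, 0]]
scaled = min_max_normalize(records)
```

After scaling, the bytes column no longer dominates the Euclidean distance just because its raw values are larger, which addresses the "sensitive to scale" drawback listed earlier.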


Cyber Security Analytics on Apache Spark:


Apache Spark is an open-source unified analytics engine for large-scale data processing in fields like Big Data and Machine Learning. It has seen rapid adoption by enterprises across a wide range of industries.

It is one of the commonly used big data frameworks, which is scalable, in-memory persistent, fault tolerant, and supports programs that can be executed in parallel.

Cyber security analytics is an alternative to traditional security systems. It applies the techniques and methods of big data analytics to security-related problems. With the help of big data frameworks like Hadoop and Spark, it can handle large volumes of data in real time and provide important insights into security incidents. Cyber security analytics can also detect attacks hidden inside an enormous number of security events by filtering out the irrelevant events. This in turn speeds up the process of security analysis.

Elkan’s k-means clustering using the triangle inequality (k-meansTI) is one way to improve the performance of the original k-means algorithm. k-meansTI avoids computing distances between data points and cluster centers that are too far away to become the nearest center. The main contribution of k-meansTI is that it can reduce the number of distance computations of standard k-means from O(kne), for k clusters, n points, and e iterations, to approximately O(n) in practice.
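
The pruning idea can be sketched with one consequence of the triangle inequality: if the distance between centers c and c' is at least twice the distance from point x to c, then c' cannot be closer to x than c. The code below is a simplified illustration of that single test, not the full algorithm (which also maintains distance bounds across iterations); the function name and sample centroids are my assumptions.

```python
import math

def nearest_centroid_pruned(point, centroids, center_dists):
    """Return (index of nearest centroid, number of point-centroid distances computed).

    center_dists[i][j] is the precomputed distance between centroids i and j.
    """
    best = 0
    best_dist = math.dist(point, centroids[0])
    computed = 1
    for j in range(1, len(centroids)):
        # Triangle inequality: if d(c_best, c_j) >= 2 * d(x, c_best),
        # then d(x, c_j) >= d(x, c_best), so c_j can be skipped.
        if center_dists[best][j] >= 2 * best_dist:
            continue
        d = math.dist(point, centroids[j])
        computed += 1
        if d < best_dist:
            best, best_dist = j, d
    return best, computed

# Hypothetical centroids, with center-to-center distances computed once.
centroids = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
center_dists = [[math.dist(a, b) for b in centroids] for a in centroids]

nearest, computed = nearest_centroid_pruned((1.0, 1.0), centroids, center_dists)
```

For the point (1.0, 1.0) only one point-to-centroid distance is computed instead of three, because both other centroids are ruled out by the test.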



The results showed that the parallel implementation of k-meansTI on Spark can achieve better performance than the Spark ML k-means when the dataset is very large. However, performance degraded for small datasets. Clustering web attacks shows that good clustering results can organize and reduce the data for further analysis and can be used to gain important insight into the properties of the attacks. The knowledge obtained from the clustering results can also be used to quickly classify new data.


Some more applications of K-means under the security domain are as follows:



  • Customer Segmentation: Subdivision of customers into groups/segments such that each segment consists of customers with similar market characteristics, such as pricing, loyalty, and spending behaviors. Segmentation variables could include, for example, the number of items bought on sale, average transaction value, and total number of transactions. Customer segmentation allows businesses to customize market programs that suit each of their customer segments.


  • Anomaly or Fraud Detection: Separate valid activity groups from bots and detect fraudulent claims.



  • Inventory Categorization based on sales or other manufacturing metrics
  • Creating NewsFeeds: K-Means can be used to cluster articles by their similarity — it can separate documents into disjoint clusters.




  • Cloud Computing Environment: Clustered storage to increase performance, capacity, or reliability — clustering distributes work loads to each server, manages the transfer of workloads between servers, and provides access to all files from any server regardless of the physical location of the file.
  • Environmental risks: K-means can be used to analyze environmental risk in an area — environmental risk zoning of a chemical industrial area.



  • Pattern Recognition in images: For example, to automatically detect infected fruits or for segmentation of blood cells for leukemia detection.





Conclusion:

Hope this information was relevant to you all. Here, I tried to explain the k-means clustering algorithm and its applications, which are currently being researched by a lot of scientists. Please refer to the original research papers linked above for more detailed information on the implementation and in-depth working of the above-mentioned use cases.

Thank you :)

