K-means clustering: Applications in security domains

Ajeenkya S.

Jr. Soft Engg @Cognizant, EDI-Maps Developer, 2X OCI, 1xAWS Certified, 1X Aviatrix Certified, AT&T Summer Learning Academy Extern, LW summer Research Intern, ARTH Learner, 1X Gitlab Certified Associate, ARTH 2.0 LW_TV

Published: July 26, 2021

K-means is an iterative algorithm that divides an unlabeled dataset into k clusters in such a way that each data point belongs to only one group, whose members share similar properties.

What is K-Means Algorithm?

K-Means Clustering is an unsupervised learning algorithm that groups an unlabeled dataset into different clusters. Here K is the number of clusters to create, which must be chosen before the algorithm runs.

The K-means clustering algorithm computes centroids and iterates until they stop changing, which gives a local optimum and not necessarily the global one. It assumes that the number of clusters is already known. It is also called a flat clustering algorithm.

In this algorithm, the data points are assigned to clusters in such a manner that the sum of the squared distances between the data points and their centroid is minimized. Less variation within a cluster means more similar data points within that cluster.
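To make the quantity being minimized concrete, here is a minimal sketch that computes the sum of squared distances for two different assignments of the same points. The points, centroids, and function names are made up for illustration and do not come from any of the papers discussed in this article.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def within_cluster_sum_of_squares(points, labels, centroids):
    """Sum, over all points, of the squared distance to the assigned centroid."""
    return sum(squared_distance(p, centroids[k]) for p, k in zip(points, labels))


# Illustrative data: two obvious groups and one centroid per group
points = [(1.0, 1.0), (1.0, 3.0), (9.0, 9.0), (9.0, 11.0)]
centroids = [(1.0, 2.0), (9.0, 10.0)]

# Each point assigned to the centroid of its own group
tight = within_cluster_sum_of_squares(points, [0, 0, 1, 1], centroids)
# Two points deliberately assigned to the wrong centroid
loose = within_cluster_sum_of_squares(points, [0, 1, 0, 1], centroids)

print(tight, loose)
```

The assignment that keeps similar points together produces the smaller sum, which is the sense in which k-means prefers clusters with less internal variation.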

In more technical terms, we try to make the data within one cluster as homogeneous as possible, while making the clusters as different from one another as possible. K is the number of clusters we try to obtain. We can experiment with K until we are satisfied with the results.

Some Advantages of the K-Means Clustering Algorithm:

It is fast
Robust
Easy to understand
Comparatively efficient
Gives the best results when the clusters in the data are distinct and well separated
Produces tighter clusters
Cluster assignments adapt whenever the centroids are recomputed
Flexible
Easy to interpret
Low computational cost
Enhances accuracy
Works best with spherical clusters

Some Disadvantages of the K-Means Clustering Algorithm:

Needs the number of cluster centers to be specified in advance
Cannot separate two groups of data that overlap heavily, and so cannot tell that there are two clusters
Different representations of the data produce different results
Euclidean distance can weigh the features unequally
Converges only to a local optimum of the squared-error function
Randomly chosen initial centroids sometimes give poor results
Can be used only when the mean is defined
Cannot handle outliers and noisy data
Does not work well on non-linearly separable data sets
Lacks consistency: different runs can yield different clusters
Sensitive to the scale of the features
Very large data sets can strain memory and compute resources
Prediction issues
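The points about data representation and scale sensitivity can be illustrated with a small sketch. The records below are hypothetical (traffic volume, failed logins) and the helper names are my own. The only thing that changes between the two runs is the unit of the first feature, yet the nearest neighbour of the first record changes too.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points given as sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def nearest_index(point, candidates):
    """Index of the candidate that is closest to the given point."""
    return min(range(len(candidates)), key=lambda i: euclidean(point, candidates[i]))


# Hypothetical connection records: (traffic volume in bytes, failed logins)
records_bytes = [(50_000.0, 0.0), (50_500.0, 9.0), (52_000.0, 0.0)]
# The same records with the traffic volume expressed in kilobytes
records_kilobytes = [(volume / 1000.0, fails) for volume, fails in records_bytes]

first_bytes, *others_bytes = records_bytes
first_kb, *others_kb = records_kilobytes

# Measured in bytes, the record with 9 failed logins looks closest
nearest_in_bytes = nearest_index(first_bytes, others_bytes)
# Measured in kilobytes, the record with 0 failed logins looks closest
nearest_in_kilobytes = nearest_index(first_kb, others_kb)

print(nearest_in_bytes, nearest_in_kilobytes)
```

Since k-means assigns points purely by distance, a change like this can move a record from one cluster to another, which is why features are usually rescaled before clustering.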

Some Uses of the K-Means Clustering Algorithm in Real Life:

Market segmentation
Document clustering
Image segmentation
Image compression
Vector quantization
Cluster analysis
Feature learning or dictionary learning
Identifying crime-prone areas
Insurance fraud detection
Public transport data analysis
Clustering of IT assets
Customer segmentation
Identifying Cancerous data
Used in search engines
Drug Activity Prediction

Working of the k-means algorithm

The K-Means Clustering algorithm works in a few simple steps.

1. Choose K, the number of clusters.
2. Shuffle the data, randomly assign each data point to one of the K clusters, and pick initial random centroids.
3. Calculate the squared distance between each data point and every centroid.
4. Reassign each data point to the closest centroid, based on the distances from step 3.
5. Recompute each centroid as the mean of the points in its cluster.
6. Repeat steps 3, 4, and 5 until the cluster assignments no longer change.
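The steps above can be sketched in plain Python as follows. This is a minimal illustration under a few assumptions of mine: the initial centroids are K randomly chosen data points (a common simplification of the random assignment described above), an empty cluster keeps its previous centroid, and the function names, the `seed` parameter, and the sample points are all made up for the example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, max_iterations=100, seed=0):
    rng = random.Random(seed)
    # Choose K and start from K randomly chosen data points as centroids
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iterations):
        # Assign every point to its closest centroid
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Stop once no assignment changes between two passes
        if new_labels == labels:
            break
        labels = new_labels
        # Recompute each centroid as the mean of its cluster
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = mean_point(members)
    return labels, centroids


# Illustrative data: two well-separated groups of three points each
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = k_means(points, k=2)
print(labels, centroids)
```

Which group ends up as cluster 0 and which as cluster 1 depends on the random start, so only the grouping itself should be relied on, not the label numbers.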

The time needed to run the K-Means Clustering algorithm depends on the size of the dataset, the K number we define, and the patterns in the data.

Application of K-means clustering in Security Domains:

The ability to group data records according to their features, attributes, or similarities makes clustering significant in many fields related to data analysis, such as pattern recognition, image processing, information retrieval, geography, and marketing. In the information era, both storing such massive data and computing on it have been a challenge, one that advances in cloud technology have largely addressed.

With the migration from on-premises to cloud, the need for security upgrades has become even more apparent. Preserving data privacy during outsourced analysis has improved through successive protocols and algorithms, but it still falls short of perfection. This holds especially true for clustering techniques. Because of the sheer volume of inputs often involved in data mining or ML problems, generic multiparty computation (MPC) protocols become infeasible in terms of communication cost. This has led to function-specific multiparty protocols that handle one particular functionality efficiently while still providing privacy to the parties.

Secure two-party k-means Clustering:

A solution to the above problem was proposed by Paul Bunn and Rafail Ostrovsky in their research paper, in which they designed a protocol that takes the single-database protocol as a template and extends it to the two-party setting. They used numerous sub-protocols, each of which preserves privacy against an honest-but-curious adversary, together with standard cryptographic tools to maintain privacy in the two-party k-means clustering algorithm.

Use in Intrusion Detection Systems:

Intrusion Detection Systems (IDS) are mainly used to distinguish normal behavior from abnormal behavior and then take corresponding measures. Applying unsupervised clustering algorithms to anomaly detection can improve the detection efficiency of an IDS and raise its practical value. k-means is among the most commonly used and most practical choices for this.

In real deployments, labeled data is often unavailable, so the normal or abnormal status of a connection record cannot be determined directly and the clusters have to be labeled after the fact. Typically a threshold is used: connection records above the threshold go to the normal cluster, and the rest go to the anomalous cluster. According to the paper published by Chunfen Bu, the experimental results show an average detection rate of 89.24% and a false alarm (false positive) rate of 0.77%.
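One way to picture threshold-based labeling is sketched below. The article does not say which quantity is compared against the threshold, so this sketch assumes it is each cluster's share of all connection records: large clusters are treated as normal traffic and small ones as anomalous. The function name, the threshold value, and the cluster ids are hypothetical and are not taken from the paper.

```python
from collections import Counter


def label_clusters_by_size(cluster_ids, threshold):
    """Label a cluster 'normal' if its share of all records exceeds the threshold."""
    counts = Counter(cluster_ids)
    total = len(cluster_ids)
    return {
        cid: "normal" if count / total > threshold else "anomalous"
        for cid, count in counts.items()
    }


# Hypothetical k-means output: one cluster id per connection record
cluster_ids = [0, 0, 0, 0, 0, 0, 1, 0, 0, 2]

# Assumed threshold: a cluster needs more than 20% of the records to count as normal
cluster_labels = label_clusters_by_size(cluster_ids, threshold=0.2)
record_labels = [cluster_labels[cid] for cid in cluster_ids]

print(cluster_labels)
```

The underlying assumption is that attacks are rare compared with normal traffic. When that does not hold, for example during a flood of attack traffic, size-based labeling would mislabel the clusters.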

The improved K-Means flow proposed in the research paper starts with data preprocessing of the collected data or the original dataset. Data normalization uses Min-Max linear normalization to map the data into the interval from 0 to 1. Feature extraction uses the PCA algorithm to reduce the feature dimensions of the entire dataset. Outlier detection is then run on the whole dataset so that outliers, which would otherwise distort the k-means clusters and their center points (centroids), can be removed, improving the accuracy of the clustering. This improved PCA-based k-means algorithm reached 99.02% accuracy with a false positive rate of only 1.144% (for various intrusion types).
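The Min-Max normalization step can be sketched as below. This is only the normalization part of the pipeline, written under my own assumptions: the function name and sample rows are illustrative, and a constant column is mapped to 0.0 to avoid dividing by zero. PCA and outlier removal are not shown.

```python
def min_max_normalize(rows):
    """Rescale every column of a list of numeric rows to the interval [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    normalized = []
    for row in rows:
        normalized.append(tuple(
            # Assumption: a constant column (hi == lo) is mapped to 0.0
            (value - lo) / (hi - lo) if hi > lo else 0.0
            for value, lo, hi in zip(row, lows, highs)
        ))
    return normalized


# Illustrative records with two features on very different scales
rows = [(200.0, 1.0), (1200.0, 3.0), (700.0, 2.0)]
scaled = min_max_normalize(rows)
print(scaled)
```

After this step every feature contributes on the same 0 to 1 scale, so no single feature dominates the Euclidean distances that k-means relies on.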

Cyber Security Analytics on Apache Spark:

Apache Spark is an open-source unified analytics engine for large-scale data processing in fields like Big Data and Machine Learning. It has seen rapid adoption by enterprises across a wide range of industries.

It is one of the commonly used big data frameworks, which is scalable, in-memory persistent, fault tolerant, and supports programs that can be executed in parallel.

Cyber security analytics is an alternative to traditional security systems. It applies the techniques and methods of big data analytics to security-related problems. With the help of big data frameworks like Hadoop and Spark, it can handle large volumes of data in real time and provide important insights into security incidents. Cyber security analytics can also detect attacks hidden inside an enormous number of security events by filtering out the irrelevant events. This in turn speeds up the process of security analysis.

Elkan's k-means clustering using the triangle inequality (k-meansTI) is one way to improve the performance of the original k-means algorithm. k-meansTI avoids computing the distance between a data point and any cluster center that is provably too far away to be its nearest center. Its main contribution is the possibility of reducing the time complexity of standard k-means from O(kne), for k clusters, n points, and e iterations, to approximately O(n) in practice.
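The core pruning idea can be sketched for a single assignment pass. By the triangle inequality, if the distance between the current best center and another center is at least twice the distance from the point to the current best center, the other center cannot be closer, so its distance never needs to be computed. This is only that one lemma, written with made-up data and names. The full k-meansTI algorithm also carries distance bounds from one iteration to the next, which is not shown here.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def assign_with_pruning(points, centroids):
    """Return (labels, number of point-to-centroid distances actually computed)."""
    k = len(centroids)
    # Centroid-to-centroid distances are computed once for the whole pass
    center_dist = [[euclidean(ci, cj) for cj in centroids] for ci in centroids]
    labels, computed = [], 0
    for p in points:
        best, best_dist = 0, euclidean(p, centroids[0])
        computed += 1
        for j in range(1, k):
            # If d(c_best, c_j) >= 2 * d(p, c_best), then c_j cannot be
            # closer to p than c_best, so its distance is skipped
            if center_dist[best][j] >= 2 * best_dist:
                continue
            dist = euclidean(p, centroids[j])
            computed += 1
            if dist < best_dist:
                best, best_dist = j, dist
        labels.append(best)
    return labels, computed


# Illustrative data: three centroids and four points close to them
centroids = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [(1.0, 0.0), (0.0, 1.0), (9.0, 0.0), (0.0, 9.0)]
labels, computed = assign_with_pruning(points, centroids)
print(labels, computed, len(points) * len(centroids))
```

On this toy input the pruned pass evaluates fewer point-to-centroid distances than the brute-force count of points times centroids, while returning the same nearest-center assignment.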

The reported results showed that the parallel implementation of k-meansTI on Spark can outperform Spark ML k-means when the dataset is very large. For small datasets, however, its performance was worse. Clustering Web attacks shows that good clustering results can organize and reduce the data for further analysis and can give important insight into the properties of the attacks. The knowledge obtained from the clustering results can also be used to quickly classify new data.
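Classifying new data from an existing clustering can be as simple as finding the nearest centroid. The sketch below assumes the centroids were already computed and that an analyst has given each cluster a name. The centroid values, the names, and the function itself are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def classify(point, centroids, cluster_names):
    """Return the name of the cluster whose centroid is closest to the point."""
    nearest = min(
        range(len(centroids)),
        key=lambda j: squared_distance(point, centroids[j]),
    )
    return cluster_names[nearest]


# Hypothetical centroids from an earlier clustering of normalized request features
centroids = [(0.1, 0.1), (0.9, 0.8)]
# Hypothetical names assigned by an analyst after inspecting each cluster
cluster_names = ["benign-looking", "attack-like"]

first = classify((0.8, 0.9), centroids, cluster_names)
second = classify((0.2, 0.0), centroids, cluster_names)
print(first, second)
```

Each lookup costs one distance computation per centroid, which is why it is fast compared with re-running the clustering on the full dataset.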

Some more applications of K-means under the security domain are as follows:

Customer Segmentation: Subdivision of customers into groups or segments such that each segment consists of customers with similar market characteristics, such as pricing, loyalty, and spending behavior. Segmentation variables could include, for example, the number of items bought on sale, the average transaction value, and the total number of transactions. Customer segmentation allows a business to tailor its marketing programs to each of its customer segments.

Anomaly or Fraud Detection: Separate valid activity groups from bots and detect fraudulent claims.

Inventory Categorization: based on sales or other manufacturing metrics.
Creating NewsFeeds: K-Means can be used to cluster articles by their similarity — it can separate documents into disjoint clusters.

Cloud Computing Environment: Clustered storage to increase performance, capacity, or reliability — clustering distributes work loads to each server, manages the transfer of workloads between servers, and provides access to all files from any server regardless of the physical location of the file.
Environmental risks: K-means can be used to analyze environmental risk in an area — environmental risk zoning of a chemical industrial area.

Pattern Recognition in images: For example, to automatically detect infected fruits or to segment blood cells for leukemia detection.

Conclusion:

I hope this information was relevant to you all. Here, I tried to explain the k-means clustering algorithm and its applications, which are currently being researched by many scientists. Please refer to the original research papers linked above for much more detailed information on the implementation and in-depth working of the use cases mentioned above.

Thank you :)
