K-means Clustering and its use-case in the Security Domain
Harshita Kumari
DevOps Engineer | Terraform/RHCSA/AWS/Azure Certified | AWS,Docker, Ansible, Kubernetes ,Terraform ,Python ,Jenkins,
What is K-means Clustering?
K-means clustering is one of the simplest and popular unsupervised machine learning algorithms.
Clustering is the task of dividing the population or data points into a number of groups such that data points in the same groups are more similar to other data points in the same group than those in other groups. in simple words, the aim is to segregate groups with similar traits and assign them into clusters. The goal of the k-means algorithm is to find groups in the data.
K-Means Algorithm
K-Means Clustering is an unsupervised learning algorithm , which groups the unlabeled dataset into different clusters. Here K defines the number of pre-defined clusters that need to be created in the process, as if K=2, there will be two clusters, and for K=3, there will be three clusters, and so on.
It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that each dataset belongs only one group that has similar properties.
It allows us to cluster the data into different groups and a convenient way to discover the categories of groups in the unlabeled dataset on its own without the need for any training.
It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of this algorithm is to minimize the sum of distances between the data point and their corresponding clusters.
The k-means clustering?algorithm mainly performs two tasks:
Hence each cluster has datapoints with some commonalities, and it is away from other clusters.
The below diagram explains the working of the K-means Clustering Algorithm:
WORKING OF K-Means Algorithm
The working of the K-Means algorithm is explained in the below steps:
Step-1:?Select the number K to decide the number of clusters.
Step-2:?Select random K points or centroids. (It can be other from the input dataset).
Step-3:?Assign each data point to their closest centroid, which will form the predefined K clusters.
Step-4:?Calculate the variance and place a new centroid of each cluster.
Step-5:?Repeat the third steps, which means reassign each datapoint to the new closest centroid of each cluster.
Step-6:?If any reassignment occurs, then go to step-4 else go to FINISH.
Step-7: The model is ready.
Use-Cases in the Security Domain
Malware Detection
Malware detection refers to the process of detecting the presence of malware on a host system or of distinguishing whether a specific program is malicious or benign. Malware detection technique plays vital role in detecting malware attack that can give high impact towards the cyber world. By using clustering, unsupervised machine learning is able to detect malware attack by identifying the behavior of the malware.
Clustering detection model by using?K-Means?clustering approach to detect malware behavior of data based on the features of the malware. Clustering techniques that use unsupervised algorithm in machine learning plays an important role in grouping similar malware characteristics by studying the behavior of the malware which results in, model is capable to cluster normal and suspicious data into two separate groups with high detection rate which is more than 90 percent accuracy.
Anomaly detection
Anomaly detection refers to methods that provide warnings of unusual behaviors which may compromise the security and performance of communication networks. Anomalous behaviors can be identified by comparing the distance between real data and cluster centroids. Identifying network anomalies is essential for communication networks of enterprises or institutions. The goal is to provide an early warning about an unusual behavior which can affect the security and the performance of a network.
Crime Analysis
Crime analysis is a law enforcement function that involves systematic analysis for identifying and analyzing patterns and trends in crime and disorder. Crime analysis also plays a role in devising solutions to crime problems, and formulating crime prevention strategies. Analysis of crime is essential for providing safety and security to the civilian population. K means?clustering technique is used to extract useful information from the high volume crime dataset and to interpret the data which assist police in identify and analyze crime patterns to reduce further occurrences of similar incidence and provide information to reduce the crime.
Crime document classification
Cluster documents in multiple categories based on tags, topics, and the content of the document. This is a very standard classification problem and k-means is a highly suitable algorithm for this purpose. The initial processing of the documents is needed to represent each document as a vector and uses term frequency to identify commonly used terms that help classify the document. the document vectors are then clustered to help identify similarity in document groups.
Identifying crime localities
With data related to crimes available in specific localities in a city, the category of crime, the area of the crime, and the association between the two can give quality insight into crime-prone areas within a city or a locality.?
Hope you like it.Thankyou!
???????? ?????? (?????? ?????, ???????? ?????)
3 年Wow great
ATSE@RedHat || Openshift || 3x RedHat Certified || DevOps(Docker??, Kubernetes?, Jenkins????) || Ansible || Cloud Computing ?(AWS) |||
3 年Good Job Harshita Kumari
PriceFx Certified Configuration Engineer
3 年??