K-means clustering and its real-world use cases in the security domain

K-means clustering is an unsupervised machine learning algorithm. Clustering means dividing data into groups of points with similar properties; these groups are called clusters. The number of clusters is set by the parameter k.

k is also the number of centroids the algorithm maintains, one per cluster. A centroid is the point at the center of a cluster, computed as the mean of the points assigned to it.

How k-means Clustering Works

Step-01 : Choose k, the number of clusters.

Step-02 : Select k random points from the dataset as the initial centroids.

Step-03 : Calculate the Euclidean distance from each data point to every centroid and assign each point to its nearest centroid, creating k clusters.

Step-04 : Recompute the centroid of each cluster as the mean of the points assigned to it.

Step-05 : Repeat Step-03 and Step-04 until the assignments stop changing or a maximum number of iterations is reached.
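
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than how scikit-learn implements k-means; the function name kmeans_sketch, its parameters, and the six demo points are made up for this example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=0):
    """Minimal k-means sketch (hypothetical helper, not a library function)."""
    rng = np.random.default_rng(seed)
    # Step 2: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: Euclidean distance from every point to every centroid,
        # then assign each point to its nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Made-up demo data: two well-separated groups of three points each.
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels_demo, centroids_demo = kmeans_sketch(X_demo, k=2)
print(labels_demo)
print(centroids_demo)
```

Which group receives label 0 and which receives label 1 depends on the random initial centroids, so only the grouping itself is meaningful, not the label numbers.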

The following simple example shows how k-means works. We first generate a 2D dataset containing four blobs and then apply the k-means algorithm to see the result.

First, we import the necessary packages:

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import numpy as np
from sklearn.cluster import KMeans        

The following code generates the 2D dataset containing four blobs:

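# make_blobs is imported from sklearn.datasets; the old samples_generator module was removed in newer scikit-learn releases
from sklearn.datasets import make_blobs
X, y_true = make_blobs(n_samples = 400, centers = 4, cluster_std = 0.60, random_state = 0)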

Next, the following code visualizes the dataset:

plt.scatter(X[:, 0], X[:, 1], s = 20);
plt.show()        

Next, create a KMeans object with the number of clusters, train the model, and predict the cluster of each point as follows:

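# Fit k-means with four clusters; fixing n_init and random_state keeps runs reproducible
kmeans = KMeans(n_clusters = 4, n_init = 10, random_state = 0)
kmeans.fit(X)
# Assign each point the index of its nearest learned centroid
y_kmeans = kmeans.predict(X)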

Now, with the following code we can plot the clusters and the cluster centers found by the k-means estimator:

plt.scatter(X[:, 0], X[:, 1], c = y_kmeans, s = 20, cmap = 'summer')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c = 'blue', s = 100, alpha = 0.9);
plt.show()        

Elbow Method : In the elbow method, we vary the number of clusters and, for each value of k, calculate the WCSS (Within-Cluster Sum of Squares). WCSS is the sum of squared distances between each point and the centroid of its cluster. When we plot WCSS against k, the curve looks like an elbow.
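
Written out, the WCSS described above takes the following form. The symbols are notation introduced here for illustration: C_j is the set of points assigned to cluster j and \mu_j is that cluster's centroid.

```latex
\mathrm{WCSS}(k) = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2
```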

As the number of clusters increases, the WCSS value decreases; it is largest at k=1. At some point the curve bends sharply, creating the elbow shape, and beyond that point it runs almost parallel to the x-axis. The value of k at this bend is usually taken as a good choice for the number of clusters.
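
As a rough sketch of the elbow method on the same four-blob data used earlier, the loop below fits k-means for k = 1 to 10 and records the WCSS, which scikit-learn exposes as the inertia_ attribute. The range of k values and the n_init and random_state settings are choices made for this illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Same four-blob dataset as in the example above.
X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.60, random_state=0)

k_values = range(1, 11)
wcss = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is scikit-learn's name for the within-cluster sum of squares.
    wcss.append(model.inertia_)

for k, value in zip(k_values, wcss):
    print(f"k={k}: WCSS={value:.1f}")
```

Plotting wcss against k_values, for example with plt.plot(list(k_values), wcss, marker = 'o'), should show the bend around k = 4 for this dataset.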

How K-means Clustering Helps in the Security Domain

1. Insurance fraud detection

Machine learning plays a critical role in fraud detection, with applications in automobile, healthcare, and insurance fraud. Using historical data on fraudulent claims, new claims can be flagged based on their proximity to clusters that indicate fraudulent patterns. Since insurance fraud can potentially have a multi-million dollar impact on a company, the ability to detect fraud is crucial.
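
One way to picture "proximity to fraudulent clusters" is the sketch below. Everything in it is an assumption made for illustration: the data is synthetic, the two features are unnamed stand-ins for real claim attributes, and the 0.5 threshold is arbitrary. K-means itself never sees the fraud labels; they are only used afterwards to describe each cluster.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic, made-up claims with two numeric features each.
legit = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(200, 2))
fraud = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(20, 2))
claims = np.vstack([legit, fraud])
is_fraud = np.array([0] * 200 + [1] * 20)  # historical labels (synthetic)

# Cluster the historical claims without using the labels.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(claims)

# Share of known-fraudulent claims inside each cluster.
fraud_rate = np.array([is_fraud[model.labels_ == c].mean() for c in range(2)])

# Flag new claims whose nearest cluster is dominated by past fraud.
new_claims = np.array([[0.2, -0.1], [4.8, 5.3]])
nearest = model.predict(new_claims)
flagged = fraud_rate[nearest] > 0.5  # arbitrary illustrative threshold
print(flagged)
```

Real claim data is rarely this cleanly separated, so a flag like this would more plausibly prioritize claims for review than decide them.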

2. Cyber-profiling criminals

Cyber-profiling is the process of collecting data from individuals and groups to identify significant correlations. The idea is derived from criminal profiling, which provides the investigation division with information to classify the types of criminals who were at the crime scene.

3. Automatic clustering of IT alerts

Large enterprise IT infrastructure components such as network, storage, or database systems generate large volumes of alert messages. Because alert messages potentially point to operational issues, they have to be screened manually and prioritized for downstream processes. Clustering the data can provide insight into categories of alerts and mean time to repair, and help in failure prediction.
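
A minimal sketch of alert clustering, assuming the alerts are short text messages: each message is turned into a TF-IDF vector and the vectors are grouped with k-means. The six alert strings and the choice of two clusters are invented for this example.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical alert messages, invented for illustration.
alerts = [
    "disk usage above 90 percent on db01",
    "disk usage above 95 percent on db02",
    "disk usage above 92 percent on web01",
    "network link down on switch03",
    "network link down on switch07",
    "network link down on router01",
]

# Turn each message into a TF-IDF vector, then group the vectors.
vectors = TfidfVectorizer().fit_transform(alerts)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

for label, alert in zip(labels, alerts):
    print(label, alert)
```

Here the disk alerts and the network alerts should land in separate clusters, so each category could be routed or prioritized as a group.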

"Keep sharing, keep learning"

