K-means clustering and its real use cases in the security domain
K-means clustering is an unsupervised machine learning algorithm. Clustering means dividing data into groups, called clusters, so that the data points in each cluster have similar traits. The number of clusters is set by the variable k.
k is also the number of centroids the algorithm places in the dataset. A centroid is the point representing the center of a cluster.
How K-means Clustering Works
Step-01 : Select the number k to decide how many clusters to form.
Step-02 : Select k random points from the dataset as the initial centroids.
Step-03 : Calculate the Euclidean distance from each data point to every centroid and assign each point to its nearest centroid, creating k clusters.
Step-04 : Recompute the centroid of each cluster as the mean of the points assigned to it.
Step-05 : Repeat Step-03 and Step-04 until the assignments stop changing (or a maximum number of iterations is reached).
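The steps above can be sketched from scratch with NumPy. This is only a minimal illustration, not how scikit-learn implements it: the function name `kmeans`, its parameters, and the six sample points are made up for this example, and the initial centroids are drawn uniformly at random from the data.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain k-means on an array of shape (n_samples, n_features)."""
    rng = np.random.default_rng(seed)
    # Step-02: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step-03: Euclidean distance from every point to every centroid,
        # then assign each point to its nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step-04: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step-05: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical data: two well-separated groups of 2D points.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])
labels, centroids = kmeans(X, k=2)
print(labels)
print(centroids)
```

Because the starting centroids are random, the numeric cluster labels (0 or 1) can swap between runs with different seeds, but the grouping of the points should stay the same on data this well separated.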
The following simple example shows how k-means works. We first generate a 2D dataset containing 4 different blobs and then apply the k-means algorithm to see the result.
First, we start by importing the necessary packages:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import numpy as np
from sklearn.cluster import KMeans
The following code generates the 2D dataset containing four blobs:
from sklearn.datasets import make_blobs
X, y_true = make_blobs(n_samples = 400, centers = 4, cluster_std = 0.60, random_state = 0)
Next, the following code helps us visualize the dataset:
plt.scatter(X[:, 0], X[:, 1], s = 20);
plt.show()
Next, create a KMeans object with the desired number of clusters, train the model, and predict the cluster of each point as follows:
kmeans = KMeans(n_clusters = 4)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)
Now, with the following code we can plot the clustered points and the cluster centers found by the k-means estimator:
plt.scatter(X[:, 0], X[:, 1], c = y_kmeans, s = 20, cmap = 'summer')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c = 'blue', s = 100, alpha = 0.9);
plt.show()
Elbow Method : In the elbow method, we vary the number of clusters and, for each value of k, calculate the WCSS (Within-Cluster Sum of Squares). WCSS is the sum of squared distances between each point and the centroid of its cluster. When we plot WCSS against k, the graph looks like an elbow.
As the number of clusters increases, the WCSS value decreases; it is largest at k = 1. At some point the rate of decrease changes sharply, creating the elbow shape, and beyond that point the graph runs almost parallel to the x-axis. The value of k at the elbow is usually taken as the optimal number of clusters.
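As a rough sketch of the elbow method, the code below fits k-means for k = 1 to 10 on the same kind of four-blob dataset used earlier and records the WCSS, which scikit-learn exposes as the `inertia_` attribute of a fitted `KMeans`. The range of k values and the `n_init` and `random_state` settings are choices made for this example only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Same kind of four-blob dataset as in the example above.
X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.60, random_state=0)

k_values = list(range(1, 11))
wcss = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the sum of squared distances of points to their closest centroid.
    wcss.append(model.inertia_)

for k, value in zip(k_values, wcss):
    print(f"k = {k:2d}  WCSS = {value:.1f}")
```

Plotting `wcss` against `k_values` (for example with `plt.plot(k_values, wcss, marker = 'o')`) should show the curve flattening after k = 4 for this dataset, which matches the four blobs that were generated.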
How K-means Clustering Helps in the Security Domain
1. Insurance fraud detection
Machine learning has a critical role to play in fraud detection, with applications in automobile, healthcare, and insurance fraud. Using historical data on fraudulent claims, it is possible to isolate new claims based on their proximity to clusters that indicate fraudulent patterns. Since insurance fraud can have a multi-million dollar impact on a company, the ability to detect fraud is crucial.
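One way to picture "proximity to fraudulent clusters" is sketched below. Everything here is an assumption for illustration: the two features are synthetic and already on a comparable scale, the fraud labels are made up, and marking a cluster as risky when more than half of its historical claims were fraudulent is just one possible rule, not an established method.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic historical claims with two made-up, pre-scaled features.
normal_claims = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(200, 2))
fraud_claims = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(20, 2))
history = np.vstack([normal_claims, fraud_claims])
was_fraud = np.array([0] * 200 + [1] * 20)

# Cluster the historical claims.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(history)

# Share of known fraud in each cluster; a cluster counts as risky
# when more than half of its claims were fraudulent (hypothetical rule).
fraud_rate = np.array([was_fraud[model.labels_ == j].mean() for j in range(2)])
risky_cluster = fraud_rate > 0.5

# New claims are flagged if their nearest centroid belongs to a risky cluster.
new_claims = np.array([[0.1, -0.2], [4.8, 5.1]])
flags = risky_cluster[model.predict(new_claims)]
print(flags)
```

In practice the features would need scaling first, since k-means relies on Euclidean distance, and a flagged claim would be a candidate for review rather than a verdict.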
2. Cyber-profiling criminals
Cyber-profiling is the process of collecting data from individuals and groups to identify significant correlations. The idea of cyber-profiling is derived from criminal profiles, which give the investigation division information to classify the types of criminals who were at the crime scene.
3. Automatic clustering of IT alerts
Large enterprise IT infrastructure components such as network, storage, or database systems generate large volumes of alert messages. Because alert messages potentially point to operational issues, they must be manually screened and prioritized for downstream processes. Clustering the alert data can provide insight into categories of alerts and mean time to repair, and help in failure prediction.
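A minimal sketch of alert clustering is shown below. It assumes the alerts are short text messages and converts them to TF-IDF vectors with scikit-learn's `TfidfVectorizer` before running k-means; that vectorization step, the six sample alerts, and the choice of k = 3 are assumptions made for this example.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical alert messages from three kinds of components.
alerts = [
    "disk usage above 90 percent on db01",
    "disk usage above 95 percent on db02",
    "network link down on switch sw1",
    "network link down on switch sw2",
    "database connection timeout on app01",
    "database connection timeout on app02",
]

# Turn each message into a TF-IDF vector so k-means can measure distances.
vectors = TfidfVectorizer().fit_transform(alerts)

# Group the alerts into three clusters.
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(vectors)

for label, alert in zip(model.labels_, alerts):
    print(label, alert)
```

Alerts that share most of their wording should end up with the same cluster label, which gives a first rough grouping to review instead of screening every message by hand.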
"Keep sharing, keep learning"