K-means clustering and its real use cases in the security domain
What is Clustering?
Clustering algorithms are used to subdivide a dataset into clusters of data points that are most similar with respect to predefined attributes. If you have a dataset that describes multiple attributes of a particular feature and you want to group the data points according to their attribute similarities, use a clustering algorithm.
A simple scatter plot of Country Income and Education datasets yields the chart we see here.
In unsupervised clustering, we start with this data and then proceed to divide it into subsets. These subsets are called clusters and consist of data points that are most similar to one another. It appears that there are at least two clusters, probably three: one at the bottom with low income and education, and then the high-education countries, which look like they might be split between low and high income.
The following figure shows the result of eyeballing (making a visual estimate of) clusters in this dataset.
Clustering algorithms are one type of approach in unsupervised machine learning — other approaches include Markov methods and methods for dimension reduction. Clustering algorithms are appropriate in situations where the following characteristics are true:
K-Means Clustering
K-means clustering aims to partition data into k clusters in such a way that data points in the same cluster are similar and data points in different clusters are farther apart.
Similarity of two points is determined by the distance between them.
There are many ways to measure distance. Euclidean distance (Minkowski distance with p=2) is one of the most commonly used. The figure below shows how to calculate the Euclidean distance between two points in a 2-dimensional space. It is the square root of the sum of the squared differences between the x coordinates and between the y coordinates of the points.
In the case above, the Euclidean distance is the square root of (16 + 9), which is 5. Euclidean distance in two dimensions reminds us of the famous Pythagorean theorem.
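The arithmetic above can be checked in a few lines of NumPy. This is a minimal sketch: the two points are hypothetical, chosen only so that the coordinate differences are 4 and 3 and therefore match the squared terms 16 and 9, and the helper name euclidean_distance is illustrative.

```python
import numpy as np

# Hypothetical points chosen so the coordinate differences are 4 and 3
p = np.array([1.0, 2.0])
q = np.array([5.0, 5.0])

def euclidean_distance(a, b):
    """Square root of the sum of squared coordinate differences."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

distance = euclidean_distance(p, q)
print(distance)  # 5.0
```

The same value can be obtained with np.linalg.norm(p - q), which is usually the more convenient call in practice.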
There are other ways to measure distance, such as cosine similarity, average distance, and so on. The similarity measure is at the core of k-means clustering. The best choice depends on the type of problem, so good domain knowledge is important for choosing the right measure.
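To see why the choice of measure matters, the sketch below compares Euclidean distance with cosine distance (1 minus cosine similarity) on three hypothetical vectors. The helper names and the vectors are illustrative assumptions. Note that scikit-learn's KMeans itself works with Euclidean distance, so other measures usually mean transforming the data first or using a different algorithm.

```python
import numpy as np

def euclidean_distance(a, b):
    """Straight-line distance between two vectors."""
    return float(np.linalg.norm(a - b))

def cosine_distance(a, b):
    """1 - cosine similarity; close to 0 when the vectors point the same way."""
    cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return float(1.0 - cos_sim)

# Hypothetical vectors
a = np.array([1.0, 1.0])
b = np.array([10.0, 10.0])  # same direction as a, much larger magnitude
c = np.array([1.0, -1.0])   # same magnitude as a, orthogonal direction

print("a vs b:", euclidean_distance(a, b), cosine_distance(a, b))
print("a vs c:", euclidean_distance(a, c), cosine_distance(a, c))
```

By Euclidean distance, a is closer to c than to b; by cosine distance, it is the other way around. Which answer is "right" depends on whether magnitude or direction carries the meaning in your data.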
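K-means clustering tries to minimize the distances within a cluster, measured as the sum of squared distances from each point to the centroid of its cluster. For a fixed dataset, making clusters compact in this way is equivalent to maximizing the spread between different clusters.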
Let’s start with a simple example to understand the concept. As usual, we import the dependencies first:
# Importing necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
Scikit-learn provides many useful functions for creating synthetic datasets, which are very helpful for practicing machine learning algorithms. I will use the make_blobs function.
X, y = make_blobs(n_samples = 200, centers=4, cluster_std = 0.5, random_state = 0)
plt.scatter(X[:, 0], X[:, 1], s=50)
Then we create a KMeans object and fit the data:
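# n_init and random_state are set explicitly so the result is reproducible
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)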
kmeans.fit(X)
We can now partition the dataset into clusters:
y_pred = kmeans.predict(X)
plt.scatter(X[:, 0], X[:, 1], c = y_pred, s=50)
Real-life datasets are much more complex, and their clusters are not clearly separated. However, the algorithm works in the same way.
The k-means algorithm cannot determine the number of clusters on its own. We need to specify it when creating the KMeans object, which can be a challenging task.
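One common heuristic for choosing the number of clusters is the elbow method: fit KMeans for several values of k and look for the point where the within-cluster sum of squared distances (exposed by scikit-learn as the inertia_ attribute) stops dropping sharply. The sketch below assumes the same kind of synthetic data as in the example above; the range of k values and the n_init and random_state settings are illustrative choices, not recommendations.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Same synthetic setup as the make_blobs example: 4 true clusters
X, _ = make_blobs(n_samples=200, centers=4, cluster_std=0.5, random_state=0)

k_values = list(range(1, 9))
inertias = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the sum of squared distances of points to their closest centroid
    inertias.append(model.inertia_)

for k, inertia in zip(k_values, inertias):
    print(f"k={k}: inertia={inertia:.1f}")
```

Plotting inertias against k_values (for example with plt.plot) makes the bend easier to spot. On real data the elbow is often much less pronounced, so treat it as a hint and not a definitive answer.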
K-Means Algorithm
K-means is an iterative process. It is built on the expectation-maximization algorithm. After the number of clusters is determined, it works by executing the following steps:
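The iteration commonly described for k-means (often called Lloyd's algorithm) can be sketched as follows: pick initial centroids, assign each point to its nearest centroid, move each centroid to the mean of its assigned points, and repeat until the centroids stop moving. This is an illustrative NumPy version, not scikit-learn's implementation; the function name kmeans_sketch, the random initialization, and the stopping tolerance are all assumptions made for the example.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, tol=1e-6, seed=0):
    """Illustrative k-means loop (hypothetical helper, not sklearn's code)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2 (expectation): assign each point to its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3 (maximization): move each centroid to the mean of its points;
        # an empty cluster keeps its previous centroid
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids barely move
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    return labels, centroids

# Two tight, well-separated groups of hypothetical points
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centroids = kmeans_sketch(X, k=2)
print(labels)
print(centroids)
```

Which group ends up with label 0 depends on the random initialization, so only the grouping itself is meaningful, not the label values.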
Pros and Cons
Pros:
Cons:
Real use cases in the security domain
K-means can typically be applied to data that has a small number of dimensions, is numeric, and is continuous. Think of a scenario in which you want to make groups of similar things from a randomly distributed collection of things; k-means is very suitable for such scenarios.
Here is a list of interesting use cases for k-means:
1. Information security problem
With the rapid development of computer technology, information security has become an important safeguard for social development. Most technology depends on networked information, so information security has become an issue worthy of attention. Intrusion detection systems occupy an important place in the information security architecture, and as intrusions become more diverse and complex, higher requirements are placed on these systems. In this work, the principle and flow of the K-means algorithm are explained, and the problems that arise when applying it are analyzed: the initial centers are easily affected by isolated points, and the result can easily converge to a local optimum. It is suggested that isolated points be removed and the initial centers be optimized, and the clustering method is improved accordingly. Simulation experiments show that, compared with the traditional K-means algorithm, the improved K-means algorithm raises the detection rate and lowers the false detection rate in intrusion detection on mixed data, and the clustering quality improves noticeably.
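The two suggestions above (remove isolated points, choose better initial centers) can be illustrated with a rough sketch. This is not the improved algorithm from that work: the data is synthetic, the isolation score (mean distance to the 5 nearest neighbours) and its threshold are assumptions made for the example, and k-means++ initialization from scikit-learn is swapped in as one well-known way to pick better starting centers.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Synthetic stand-in for numeric traffic features, plus three isolated points
X, _ = make_blobs(n_samples=200, centers=3, cluster_std=0.5, random_state=0)
isolated = np.array([[30.0, 30.0], [-30.0, 30.0], [30.0, -30.0]])
X_all = np.vstack([X, isolated])

# Score each point by its mean distance to its 5 nearest neighbours
n_neighbors = 5
nn = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(X_all)
distances, _ = nn.kneighbors(X_all)
score = distances[:, 1:].mean(axis=1)  # column 0 is the point itself

# Assumption: a point is "isolated" if its score is far above the typical score
threshold = 10 * np.median(score)
keep = score <= threshold
X_clean = X_all[keep]

# k-means++ spreads the initial centroids out instead of picking them at random
kmeans = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0)
labels = kmeans.fit_predict(X_clean)
print(f"removed {np.sum(~keep)} isolated points, clustered {len(X_clean)} points")
```

In a real intrusion detection setting the removed points deserve a look of their own, since an isolated record may be exactly the anomalous connection you are trying to find.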
2. Cyber-profiling criminals
Cyber-profiling is the process of collecting data from individuals and groups to identify significant correlations. The idea of cyber-profiling is derived from criminal profiling, which provides information to the investigation division for classifying the types of criminals who were at the crime scene. K-means is used as the algorithm for the cyber-profiling process. Its use is in line with the expectations of this study, because it is a simple algorithm with a good degree of accuracy.
3. Insurance fraud detection
Machine learning has a critical role to play in fraud detection and has numerous applications in automobile, healthcare, and insurance fraud detection. Using historical data on fraudulent claims, it is possible to isolate new claims based on their proximity to clusters that indicate fraudulent patterns. Since insurance fraud can potentially have a multi-million dollar impact on a company, the ability to detect fraud is crucial.
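A minimal sketch of the proximity idea, under stated assumptions: the claim features, their values, and the distance cutoff are all synthetic and hypothetical. K-means is fit on historical claims already known to be fraudulent, and a new claim is flagged for review when it lies close to one of the resulting centroids.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical historical fraudulent claims: [claim amount, days since policy start]
pattern_a = rng.normal(loc=[20000.0, 10.0], scale=[1500.0, 3.0], size=(50, 2))
pattern_b = rng.normal(loc=[5000.0, 300.0], scale=[500.0, 20.0], size=(50, 2))
fraud_history = np.vstack([pattern_a, pattern_b])

# Scale features so the claim amount does not dominate the Euclidean distance
scaler = StandardScaler().fit(fraud_history)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(scaler.transform(fraud_history))

def distance_to_nearest_fraud_cluster(claims):
    """Distance from each claim to the closest fraud centroid (scaled units)."""
    # KMeans.transform returns the distance from each sample to every centroid
    return kmeans.transform(scaler.transform(claims)).min(axis=1)

new_claims = np.array([
    [19500.0, 12.0],   # resembles pattern_a
    [12000.0, 150.0],  # resembles neither pattern
])
cutoff = 0.5  # hypothetical cutoff in scaled units; tune it on labeled data
distances = distance_to_nearest_fraud_cluster(new_claims)
flagged = distances < cutoff
print(distances, flagged)
```

A flag here only means "similar to past fraud", so in practice it would feed a manual review queue and not an automatic rejection.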
Thank You for reading
Hope you will like it.
Feel Free to Comment!!