??K-mean clustering and its real usecase in the security domain??
Let's first understand, what exactly is K-mean clustering in machine learning?
K-means clustering is one of the simplest and popular unsupervised machine learning algorithms.
A cluster refers to a collection of data points aggregated together because of certain similarities.
What exactly is clustering?
Let’s kick things off with a simple example. A bank wants to give credit card offers to its customers. Currently, they look at the details of each customer and based on this information, decide which offer should be given to which customer.
Now, the bank can potentially have millions of customers. Does it make sense to look at the details of each customer separately and then make a decision? Certainly not! It is a manual process and will take a huge amount of time.
So what can the bank do? One option is to segment its customers into different groups. For instance, the bank can group the customers based on their income:
The bank can now make three different strategies or offers, one for each group. Here, instead of creating different strategies for individual customers, they only have to make 3 strategies. This will reduce the effort as well as the time.
Thus, Clustering is the process of dividing the entire data into groups (also known as clusters) based on the patterns in the data.
Properties Of Cluster:-
1) All the data points in a cluster should be similar to each other.?
2) The data points from different clusters should be as different as possible.
TYPES OF CLUSTERING:-
The various types of clustering are:
Hierarchical clustering is further subdivided into:
Partitioning clustering is further subdivided into:
Applications of Clustering in Real-World Scenarios
1) Customer Segmentation
2) Document Clustering
3) Image Segmentation
4) Recommendation Engines
WHAT IS MEANT BY K-MEANS ALGORITHM?
K-Means clustering is an unsupervised learning algorithm. There is no labeled data for this clustering, unlike in supervised learning. K-Means performs the division of objects into clusters that share similarities and are dissimilar to the objects belonging to another cluster.
The term ‘K’ is a number. You need to tell the system how many clusters you need to create. For example, K = 2 refers to two clusters. There is a way of finding out what is the best or optimum value of K for a given data.
For a better understanding of k-means, let’s take an example from cricket. Imagine you received data on a lot of cricket players from all over the world, which gives information on the runs scored by the player and the wickets taken by them in the last ten matches. Based on this information, we need to group the data into two clusters, namely batsman and bowlers.
HOW THE K-MEANS ALGORITHM WORKS?
To process the learning data, the K-means algorithm in data mining starts with a first group of randomly selected centroids, which are used as the beginning points for every cluster, and then performs iterative (repetitive) calculations to optimize the positions of the centroids.
It halts creating and optimizing clusters when either:
The?k-means clustering algorithm attempts to split a given anonymous data set (a set containing no information as to class identity) into a fixed number (k) of clusters.
CHOOSING K
The algorithm described above finds the clusters and data set labels for a particular pre-chosen?K. To find the number of clusters in the data, the user needs to run the?K-means clustering algorithm for a range of?K?values and compare the results. In general, there is no method for determining exact value of?K, but an accurate estimate can be obtained using the following techniques.
One of the metrics that is commonly used to compare results across different values of?K?is the mean distance between data points and their cluster centroid. Since increasing the number of clusters will always reduce the distance to data points, increasing?K?will?always?decrease this metric, to the extreme of reaching zero when?K?is the same as the number of data points. Thus, this metric cannot be used as the sole target. Instead, mean distance to the centroid as a function of?K?is plotted and the "elbow point," where the rate of decrease sharply shifts, can be used to roughly determine?K.
A number of other techniques exist for validating?K, including cross-validation, information criteria, the information theoretic jump method, the silhouette method, and the G-means algorithm. In addition, monitoring the distribution of data points across groups provides insight into how the algorithm is splitting the data for each?K.
领英推荐
USE-CASES OF K-MEAN CLUSTERING IN SECURITY DOMAIN:-
In today’s world security is an aspect which is given higher priority by all political and government worldwide and aiming to reduce crime incidence. As data mining is the appropriate field to apply on high volume crime dataset and knowledge gained from data mining approaches will be useful and support police force.
Crime analysis is done by performing k-means clustering on crime dataset using rapid miner tool.?
Crime analysis is defined as analytical processes which provides relevant information relative to crime patterns and trend correlations to assist personnel in planning the deployment of resources for the prevention and suppression of criminal activities.
It is important to analyze crime due to following reasons :
1. Analyze crime to inform law enforcers about general and specific crime trends in timely manner.
2. Analyze crime to take advantage of the plenty of information existing in justice system and public domain.
Crime rates are rapidly changing and improved analysis finds hidden patterns of crime, if any, without any explicit prior knowledge of these patterns.
The main objectives of crime analysis include:
1. Extraction of crime patterns by analysis of available crime and criminal data.
2. Prediction of crime based on spatial distribution of existing data and anticipation of crime rate using different data mining techniques.
3. Detection of crime.
Let's go through its procedure:-
1. First we take crime dataset.
2. Filter dataset according to requirement and create new dataset which has attribute according to analysis to be done.
3. Open rapid miner tool and read excel file of crime dataset and apply “Replace Missing value operator” on it and execute operation.
4. Perform “Normalize operator” on resultant dataset and execute operation.
5. Perform k means clustering on resultant dataset formed after normalization and execute operation.
6. From plot view of result plot data between crimes and get required cluster.
7. Analysis can be done on cluster formed.
?Flow chart of crime analysis:-
K-means clustering is one of the method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean.
Process:
1. Initially, the number of clusters must be known let it be k.
2. The initial step is the choose a set of K instances as centres of the clusters.
3. Next, the algorithm considers each instance and assigns it to the cluster which is closest.
4. The cluster centroids are recalculated either after whole cycle of re-assignment or each instance assignment.
5. This process is iterated. K means algorithm complexity is O(tkn), where n is instances, c is clusters, and t is iterations and relatively efficient .
It often terminates at a local optimum.
Its disadvantage is applicable only when mean is defined and need to specify c, the number of clusters, in advance. It unable to handle noisy data and outliers and not suitable to discover clusters with non-convex shapes.
So, thus we have seen:-
1) What is clustering?
2) Types of Clustering.
3) Applications of Clustering
4) What is K-Mean Algorithm?
5) Working of K-Mean Algorithm
6) Choosing K
7) Its use-case in security domain
Thanks for Reading??
Keep Learning?? Keep Sharing??