K-means clustering and its real use case in the security domain

Let's first understand what exactly K-means clustering is in machine learning.

K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms.

A cluster refers to a collection of data points aggregated together because of certain similarities.

What exactly is clustering?

Let’s kick things off with a simple example. A bank wants to give credit card offers to its customers. Currently, they look at the details of each customer and based on this information, decide which offer should be given to which customer.

Now, the bank can potentially have millions of customers. Does it make sense to look at the details of each customer separately and then make a decision? Certainly not! It is a manual process and will take a huge amount of time.

So what can the bank do? One option is to segment its customers into different groups. For instance, the bank can group the customers into three bands based on their income.

The bank can now make three different strategies or offers, one for each group. Here, instead of creating different strategies for individual customers, they only have to make 3 strategies. This will reduce the effort as well as the time.

Thus, clustering is the process of dividing the entire data into groups (also known as clusters) based on the patterns in the data.

Properties of Clusters:-

1) All the data points in a cluster should be similar to each other.

2) The data points from different clusters should be as different as possible.
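
These two properties can be checked numerically. The sketch below is only an illustration: the points, the cluster names "A" and "B", and the helper function names are all made up here, and plain Euclidean distance is assumed as the measure of similarity.

```python
from itertools import combinations
from math import dist
from statistics import mean

# Hypothetical 2-D points, already grouped into two clusters for illustration.
clusters = {
    "A": [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5)],
    "B": [(8.0, 8.0), (8.5, 9.0), (9.0, 8.5)],
}

def mean_intra_distance(points):
    """Average distance between every pair of points inside one cluster."""
    return mean(dist(p, q) for p, q in combinations(points, 2))

def mean_inter_distance(points_a, points_b):
    """Average distance between points drawn from two different clusters."""
    return mean(dist(p, q) for p in points_a for q in points_b)

intra_a = mean_intra_distance(clusters["A"])
intra_b = mean_intra_distance(clusters["B"])
inter_ab = mean_inter_distance(clusters["A"], clusters["B"])

print(f"intra A: {intra_a:.2f}, intra B: {intra_b:.2f}, inter A-B: {inter_ab:.2f}")
```

A good clustering keeps the intra-cluster numbers small (property 1) and the inter-cluster number large (property 2).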

TYPES OF CLUSTERING:-

The various types of clustering are:

  • Hierarchical clustering
  • Partitioning clustering

Hierarchical clustering is further subdivided into:

  • Agglomerative clustering
  • Divisive clustering

Partitioning clustering is further subdivided into:

  • K-Means clustering
  • Fuzzy C-Means clustering

Applications of Clustering in Real-World Scenarios

1) Customer Segmentation

2) Document Clustering

3) Image Segmentation

4) Recommendation Engines

WHAT IS MEANT BY THE K-MEANS ALGORITHM?

K-Means clustering is an unsupervised learning algorithm. There is no labeled data for this clustering, unlike in supervised learning. K-Means divides objects into clusters so that objects in the same cluster are similar to each other and dissimilar to the objects belonging to other clusters.

The term ‘K’ is a number. You need to tell the system how many clusters you need to create. For example, K = 2 refers to two clusters. There are ways of estimating the best or optimum value of K for a given data set, which are covered under CHOOSING K.

For a better understanding of k-means, let’s take an example from cricket. Imagine you received data on a lot of cricket players from all over the world, which gives information on the runs scored by each player and the wickets taken by them in the last ten matches. Based on this information, we need to group the data into two clusters, namely batsmen and bowlers.

HOW DOES THE K-MEANS ALGORITHM WORK?

To process the learning data, the K-means algorithm in data mining starts with a first group of randomly selected centroids, which are used as the beginning points for every cluster, and then performs iterative (repetitive) calculations to optimize the positions of the centroids.

It halts creating and optimizing clusters when either:

  • The centroids have stabilized: their values no longer change because the clustering has converged.
  • The defined number of iterations has been reached.

The k-means clustering algorithm attempts to split a given anonymous data set (a set containing no information as to class identity) into a fixed number (k) of clusters.
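
The loop described in this section can be sketched in plain Python. This is a minimal illustration, not a production implementation: the function name k_means, its parameters, and the cricket figures (which echo the batsmen and bowlers example) are all invented for this sketch, and Euclidean distance is assumed.

```python
import random
from math import dist

def k_means(points, k, max_iterations=100, seed=0):
    """Basic k-means loop: returns (centroids, labels) for a list of numeric tuples."""
    rng = random.Random(seed)
    # Start from k data points chosen at random as the initial centroids.
    centroids = rng.sample(points, k)
    labels = [0] * len(points)

    for _ in range(max_iterations):  # stop condition 2: iteration limit reached
        # Assignment step: attach every point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: dist(p, centroids[c]))
            for p in points
        ]

        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                # Keep the centroid of an empty cluster where it was.
                new_centroids.append(centroids[c])

        if new_centroids == centroids:  # stop condition 1: centroids stabilized
            break
        centroids = new_centroids

    return centroids, labels

# Made-up (runs scored, wickets taken) totals over ten matches, for illustration only.
players = [
    (420, 1), (385, 0), (510, 2), (460, 1),   # batsman-like profiles
    (60, 18), (45, 22), (80, 15), (30, 20),   # bowler-like profiles
]

centroids, labels = k_means(players, k=2)
print(centroids)
print(labels)
```

With K = 2, the first four players should end up under one label and the last four under the other. Which group gets label 0 depends on the random starting centroids.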

CHOOSING K

The algorithm described above finds the clusters and data set labels for a particular pre-chosen K. To find the number of clusters in the data, the user needs to run the K-means clustering algorithm for a range of K values and compare the results. In general, there is no method for determining the exact value of K, but a reasonable estimate can be obtained using the following techniques.

One of the metrics that is commonly used to compare results across different values of K is the mean distance between data points and their cluster centroid. Since increasing the number of clusters will always reduce the distance to data points, increasing K will always decrease this metric, to the extreme of reaching zero when K is the same as the number of data points. Thus, this metric cannot be used as the sole target. Instead, mean distance to the centroid as a function of K is plotted and the "elbow point," where the rate of decrease sharply shifts, can be used to roughly determine K.
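
Here is a rough sketch of how the numbers behind such an elbow plot could be produced. Everything in it is an assumption made for illustration: the synthetic data has three groups around made-up centers, the helper names are invented, and each K is run with several random restarts (keeping the best run) so that one unlucky start does not distort the curve.

```python
import random
from math import dist

def k_means(points, k, rng, max_iterations=100):
    """Compact k-means loop; returns the final centroids."""
    centroids = rng.sample(points, k)
    for _ in range(max_iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: dist(p, centroids[c]))
            groups[nearest].append(p)
        updated = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids

def mean_distance_to_centroid(points, centroids):
    """Average distance from each point to its nearest centroid."""
    return sum(min(dist(p, c) for c in centroids) for p in points) / len(points)

# Synthetic data: three tight groups around made-up centers.
rng = random.Random(42)
centers = [(0, 0), (10, 0), (5, 9)]
points = [
    (cx + rng.gauss(0, 0.6), cy + rng.gauss(0, 0.6))
    for cx, cy in centers
    for _ in range(30)
]

scores = {}
for k in range(1, 7):
    # Several random restarts per K; keep the best run.
    scores[k] = min(
        mean_distance_to_centroid(points, k_means(points, k, rng))
        for _ in range(30)
    )

for k, score in scores.items():
    print(f"K={k}: mean distance to centroid = {score:.2f}")
```

For data like this, the metric should drop steeply up to K = 3 and flatten afterwards, so the elbow points to three clusters.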

A number of other techniques exist for validating K, including cross-validation, information criteria, the information theoretic jump method, the silhouette method, and the G-means algorithm. In addition, monitoring the distribution of data points across groups provides insight into how the algorithm is splitting the data for each K.

USE CASES OF K-MEANS CLUSTERING IN THE SECURITY DOMAIN:-

In today’s world, security is given high priority by political and government bodies worldwide, which aim to reduce the incidence of crime. Data mining is well suited to high-volume crime datasets, and the knowledge gained from data mining approaches can support the police force.

Here, crime analysis is done by performing k-means clustering on a crime dataset using the RapidMiner tool.

Crime analysis is defined as a set of analytical processes that provide relevant information on crime patterns and trend correlations, to assist personnel in planning the deployment of resources for the prevention and suppression of criminal activities.

It is important to analyze crime for the following reasons:

1. Analyze crime to inform law enforcers about general and specific crime trends in a timely manner.

2. Analyze crime to take advantage of the wealth of information existing in the justice system and the public domain.

Crime rates are rapidly changing and improved analysis finds hidden patterns of crime, if any, without any explicit prior knowledge of these patterns.

The main objectives of crime analysis include:

1. Extraction of crime patterns by analysis of available crime and criminal data.

2. Prediction of crime based on spatial distribution of existing data and anticipation of crime rate using different data mining techniques.

3. Detection of crime.

Let's go through its procedure:-

1. First, we take a crime dataset.

2. Filter the dataset according to the requirement and create a new dataset that has the attributes needed for the analysis.

3. Open the RapidMiner tool, read the Excel file of the crime dataset, apply the “Replace Missing Values” operator to it, and execute the operation.

4. Apply the “Normalize” operator to the resulting dataset and execute the operation.

5. Perform k-means clustering on the dataset formed after normalization and execute the operation.

6. In the plot view of the result, plot the data between crimes to get the required clusters.

7. Analysis can then be done on the clusters formed.
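
For readers without RapidMiner, steps 3 and 4 can be approximated in a few lines of Python. This is only a sketch under stated assumptions: missing values are filled with the column mean, normalization is done as a z-score, and the crime counts and function names are invented. It does not claim to reproduce what the RapidMiner operators do internally.

```python
from statistics import mean, pstdev

# Hypothetical crime records: (burglaries, thefts, assaults) per district.
# None marks a missing value; the numbers are invented for illustration.
records = [
    [12, 150, 7],
    [30, None, 15],
    [8, 90, None],
    [25, 210, 11],
]

def replace_missing_with_mean(rows):
    """Fill each missing cell with the mean of the observed values in its column."""
    columns = list(zip(*rows))
    fills = [mean(v for v in col if v is not None) for col in columns]
    return [
        [fills[j] if value is None else value for j, value in enumerate(row)]
        for row in rows
    ]

def z_score_normalize(rows):
    """Rescale every column to zero mean and unit standard deviation."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stdevs = [pstdev(col) or 1.0 for col in columns]  # guard against constant columns
    return [
        [(value - means[j]) / stdevs[j] for j, value in enumerate(row)]
        for row in rows
    ]

filled = replace_missing_with_mean(records)
normalized = z_score_normalize(filled)

for row in normalized:
    print([round(value, 2) for value in row])
```

Normalizing before clustering matters because k-means relies on distances: without it, an attribute with large raw values (thefts in this made-up table) would dominate the result. The normalized rows are what step 5 would cluster.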

K-means clustering is one of the methods of cluster analysis. It aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean.
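
In symbols, this description corresponds to the usual within-cluster sum of squares objective. The notation is chosen just for this sketch: S_i stands for the i-th cluster and \mu_i for its mean.

```latex
\underset{S_1,\dots,S_k}{\arg\min} \; \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2,
\qquad \mu_i = \frac{1}{|S_i|} \sum_{x \in S_i} x
```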

Process:

1. Initially, the number of clusters must be known; let it be k.

2. The initial step is to choose a set of k instances as the centres of the clusters.

3. Next, the algorithm considers each instance and assigns it to the cluster whose centre is closest.

4. The cluster centroids are recalculated either after a whole cycle of re-assignment or after each instance assignment.

5. This process is iterated.

The complexity of the k-means algorithm is O(tkn), where n is the number of instances, k is the number of clusters, and t is the number of iterations, which makes it relatively efficient. It often terminates at a local optimum.

Its disadvantages are that it is applicable only when a mean is defined and that k, the number of clusters, must be specified in advance. It is also unable to handle noisy data and outliers well, and it is not suitable for discovering clusters with non-convex shapes.
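
The sensitivity to outliers comes straight from the use of the mean. A tiny illustration with invented one-dimensional values follows; the median is shown only as a point of comparison and is not part of k-means itself.

```python
from statistics import mean, median

# Invented one-dimensional cluster with a single extreme value appended.
cluster = [10, 11, 12, 13, 14]
with_outlier = cluster + [100]

centroid_before = mean(cluster)
centroid_after = mean(with_outlier)

print(f"centroid without outlier: {centroid_before}")
print(f"centroid with outlier:    {centroid_after:.2f}")
print(f"median with outlier:      {median(with_outlier)}")
```

One extreme point is enough to drag the centroid well away from where most of the cluster sits, which in turn can change which points get assigned to it.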

So, we have seen:-

1) What is clustering?

2) Types of Clustering.

3) Applications of Clustering

4) What is the K-Means Algorithm?

5) Working of the K-Means Algorithm

6) Choosing K

7) Its use case in the security domain

Thanks for Reading

Keep Learning, Keep Sharing
