
K-means clustering and its real use case in the security domain

Vrushali Mahajan

Software Engineer at UBS

Published: July 20, 2021


Let's first understand what exactly K-means clustering is in machine learning.

K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms.

A cluster refers to a collection of data points aggregated together because of certain similarities.

What exactly is clustering?

Let’s kick things off with a simple example. A bank wants to give credit card offers to its customers. Currently, they look at the details of each customer and based on this information, decide which offer should be given to which customer.

Now, the bank can potentially have millions of customers. Does it make sense to look at the details of each customer separately and then make a decision? Certainly not! It is a manual process and will take a huge amount of time.

So what can the bank do? One option is to segment its customers into different groups. For instance, the bank can group the customers based on their income into three bands (say, high, average, and low income).

The bank can now make three different strategies or offers, one for each group. Here, instead of creating different strategies for individual customers, they only have to make 3 strategies. This will reduce the effort as well as the time.

Thus, clustering is the process of dividing the entire dataset into groups (also known as clusters) based on the patterns in the data.

Properties of Clusters:-

1) All the data points in a cluster should be similar to each other.

2) The data points from different clusters should be as different as possible.
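These two properties can be checked numerically. The sketch below is only an illustration: the points are made up, and the two helper functions are hypothetical names, not part of any library. It compares the average distance inside a cluster with the average distance between two clusters.

```python
from itertools import combinations
from math import dist

def mean_intra_cluster_distance(cluster):
    """Average pairwise distance inside one cluster (lower means tighter)."""
    pairs = list(combinations(cluster, 2))
    return sum(dist(p, q) for p, q in pairs) / len(pairs)

def mean_inter_cluster_distance(cluster_a, cluster_b):
    """Average distance between points of two clusters (higher means better separated)."""
    total = sum(dist(p, q) for p in cluster_a for q in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))

# Hypothetical 2-D points, invented for this example.
cluster_a = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5)]
cluster_b = [(8.0, 8.0), (8.5, 9.0), (9.0, 8.5)]

intra_a = mean_intra_cluster_distance(cluster_a)
intra_b = mean_intra_cluster_distance(cluster_b)
inter_ab = mean_inter_cluster_distance(cluster_a, cluster_b)

print("within A:", round(intra_a, 2))
print("within B:", round(intra_b, 2))
print("between A and B:", round(inter_ab, 2))
```

For a good clustering, the within-cluster values should be clearly smaller than the between-cluster value.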

TYPES OF CLUSTERING:-

The various types of clustering are:

Hierarchical clustering
Partitioning clustering

Hierarchical clustering is further subdivided into:

Agglomerative clustering
Divisive clustering

Partitioning clustering is further subdivided into:

K-Means clustering
Fuzzy C-Means clustering

Applications of Clustering in Real-World Scenarios

1) Customer Segmentation

2) Document Clustering

3) Image Segmentation

4) Recommendation Engines

WHAT IS MEANT BY THE K-MEANS ALGORITHM?

K-Means clustering is an unsupervised learning algorithm. Unlike supervised learning, it uses no labeled data. K-Means divides objects into clusters so that objects within a cluster are similar to one another and dissimilar to objects belonging to other clusters.

The term ‘K’ is a number: you need to tell the system how many clusters to create. For example, K = 2 refers to two clusters. There are ways of estimating the best or optimum value of K for a given dataset.

For a better understanding of k-means, let’s take an example from cricket. Imagine you received data on a lot of cricket players from all over the world, giving the runs scored and the wickets taken by each player in the last ten matches. Based on this information, we need to group the data into two clusters, namely batsmen and bowlers.

HOW DOES THE K-MEANS ALGORITHM WORK?

To process the learning data, the K-means algorithm starts with an initial set of randomly selected centroids, which are used as the starting points for the clusters, and then performs iterative (repetitive) calculations to optimize the positions of the centroids.

It stops creating and optimizing clusters when either:

1) The centroids have stabilized: their values no longer change because the clustering has converged.

2) The defined number of iterations has been reached.

The k-means clustering algorithm attempts to split a given anonymous data set (a set containing no information as to class identity) into a fixed number (k) of clusters.
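The loop described above can be sketched in plain Python. This is a minimal illustration, not a production implementation: the function name `kmeans`, its parameters, and the player figures are all assumptions made up for this example, echoing the cricket scenario of runs and wickets.

```python
import random
from math import dist

def kmeans(points, k, max_iters=100, seed=0):
    """Cluster `points` (tuples of numbers) into k groups; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initial centroids drawn from the data
    labels = None
    for _ in range(max_iters):  # stop condition 2: iteration limit
        # Assignment step: each point joins the cluster of its nearest centroid.
        new_labels = [min(range(k), key=lambda j: dist(p, centroids[j])) for p in points]
        if new_labels == labels:
            break  # stop condition 1: assignments (and hence centroids) are stable
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, labels

# Hypothetical (runs, wickets) totals over ten matches; the first four are batsmen.
players = [(450, 1), (520, 0), (390, 2), (610, 1),
           (60, 18), (45, 22), (80, 15), (30, 20)]

centroids, labels = kmeans(players, k=2)
print(labels)
print(centroids)
```

Which group receives label 0 and which label 1 depends on the random initialization, so only the grouping itself is meaningful. On real data the features would normally be scaled first, since here the runs column dominates the distance.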

CHOOSING K

The algorithm described above finds the clusters and data set labels for a particular pre-chosen K. To find the number of clusters in the data, the user needs to run the K-means clustering algorithm for a range of K values and compare the results. In general, there is no method for determining the exact value of K, but a reasonable estimate can be obtained using the following techniques.

One of the metrics commonly used to compare results across different values of K is the mean distance between data points and their cluster centroid. Because adding clusters brings the centroids closer to the data points, increasing K will always decrease this metric, to the extreme of reaching zero when K equals the number of data points. Thus, this metric cannot be used as the sole target. Instead, the mean distance to the centroid is plotted as a function of K, and the "elbow point," where the rate of decrease sharply shifts, can be used to roughly determine K.
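As a rough sketch of the elbow method, the code below computes the mean distance to the nearest centroid for K = 1 to 6. Everything here is an assumption for illustration: the data are three synthetic blobs, the compact `kmeans` helper is a hypothetical stand-in for whatever implementation is in use, and several random restarts are taken per K because a single run can end in a poor local optimum.

```python
import random
from math import dist

def kmeans(points, k, max_iters=50, seed=0):
    """Compact k-means with random initialization; returns the final centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iters):
        new_labels = [min(range(k), key=lambda j: dist(p, centroids[j])) for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids

def mean_distance_to_centroid(points, k, restarts=10):
    """Best (lowest) mean point-to-nearest-centroid distance over several restarts."""
    best = float("inf")
    for seed in range(restarts):
        centroids = kmeans(points, k, seed=seed)
        score = sum(min(dist(p, c) for c in centroids) for p in points) / len(points)
        best = min(best, score)
    return best

# Synthetic data: three blobs around made-up centers.
rng = random.Random(42)
centers = [(0, 0), (10, 0), (5, 9)]
points = [(cx + rng.gauss(0, 0.7), cy + rng.gauss(0, 0.7))
          for cx, cy in centers for _ in range(30)]

scores = {k: mean_distance_to_centroid(points, k) for k in range(1, 7)}
for k, score in scores.items():
    print(k, round(score, 2))
```

Plotting `scores` against K and looking for the point where the curve flattens gives the rough estimate of K described above.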

A number of other techniques exist for validating K, including cross-validation, information criteria, the information theoretic jump method, the silhouette method, and the G-means algorithm. In addition, monitoring the distribution of data points across groups provides insight into how the algorithm is splitting the data for each K.
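Of the techniques listed, the silhouette method is easy to sketch by hand. For each point it compares a, the mean distance to the other points in its own cluster, with b, the mean distance to the points of the nearest other cluster, and scores the point as (b - a) / max(a, b). The function name and the sample points below are assumptions for illustration; the sketch expects at least two clusters and no fully duplicated data.

```python
from math import dist

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (between -1 and 1)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton cluster contributes a silhouette of 0
        # dist(p, p) is 0, so summing over the whole cluster and dividing
        # by (size - 1) gives the mean distance to the *other* members.
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        total += (b - a) / max(a, b)
    return total / len(points)

# Hypothetical points forming two obvious groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
good_labels = [0, 0, 0, 1, 1, 1]   # matches the visible grouping
bad_labels = [0, 1, 0, 1, 0, 1]    # deliberately mixed up

good = silhouette_score(points, good_labels)
bad = silhouette_score(points, bad_labels)
print(round(good, 2), round(bad, 2))
```

Running k-means for several values of K and keeping the K with the highest silhouette score is one common way to use this measure.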


USE CASES OF K-MEANS CLUSTERING IN THE SECURITY DOMAIN:-

In today’s world, security is given high priority by governments and political bodies worldwide, with the aim of reducing the incidence of crime. Data mining is well suited to high-volume crime datasets, and the knowledge gained from data mining approaches can support the police force.

Here, crime analysis is done by performing k-means clustering on a crime dataset using the RapidMiner tool.

Crime analysis is defined as a set of analytical processes that provide relevant information on crime patterns and trend correlations, to assist personnel in planning the deployment of resources for the prevention and suppression of criminal activities.

It is important to analyze crime for the following reasons:

1. To inform law enforcers about general and specific crime trends in a timely manner.

2. To take advantage of the wealth of information available in the justice system and the public domain.

Crime rates are rapidly changing and improved analysis finds hidden patterns of crime, if any, without any explicit prior knowledge of these patterns.

The main objectives of crime analysis include:

1. Extraction of crime patterns by analysis of available crime and criminal data.

2. Prediction of crime based on spatial distribution of existing data and anticipation of crime rate using different data mining techniques.

3. Detection of crime.

Let's go through the procedure:-

1. First, take a crime dataset.

2. Filter the dataset as required and create a new dataset containing only the attributes needed for the analysis.

3. Open the RapidMiner tool, read the Excel file of the crime dataset, apply the “Replace Missing Values” operator to it, and execute the operation.

4. Apply the “Normalize” operator to the resulting dataset and execute the operation.

5. Perform k-means clustering on the normalized dataset and execute the operation.

6. In the plot view of the result, plot the data between crimes to obtain the required clusters.

7. Analyze the clusters formed.
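For readers without RapidMiner, steps 3 and 4 can be approximated in plain Python. This sketch rests on assumptions: the article does not say how the operators were configured, so missing values are filled with the column mean and normalization is a z-score using the population standard deviation. The function names and the crime counts are invented for illustration.

```python
from statistics import mean, pstdev

def replace_missing_with_mean(rows):
    """Replace None entries with the mean of the observed values in that column."""
    columns = list(zip(*rows))
    col_means = [mean(v for v in col if v is not None) for col in columns]
    return [[col_means[j] if v is None else v for j, v in enumerate(row)]
            for row in rows]

def z_score_normalize(rows):
    """Rescale every column to zero mean and unit (population) standard deviation."""
    columns = list(zip(*rows))
    mus = [mean(col) for col in columns]
    sigmas = [pstdev(col) or 1.0 for col in columns]  # constant column: avoid dividing by 0
    return [[(v - mus[j]) / sigmas[j] for j, v in enumerate(row)] for row in rows]

# Hypothetical yearly counts per district: [theft, burglary, assault].
raw = [
    [120, 30, 15],
    [95, None, 12],
    [400, 110, 60],
    [380, 95, None],
]

filled = replace_missing_with_mean(raw)
normalized = z_score_normalize(filled)

for row in normalized:
    print([round(v, 2) for v in row])
```

The normalized rows are what step 5 would pass to k-means. Normalizing first keeps an attribute with large raw counts from dominating the distance calculation.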

Flow chart of crime analysis:-

K-means clustering is one of the methods of cluster analysis. It aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean.
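Written as an optimization problem, this is usually stated as minimizing the within-cluster sum of squared distances. The notation is introduced here and does not appear in the article: S_1, ..., S_k are the clusters and mu_i is the mean of cluster S_i.

```latex
\underset{S_1,\dots,S_k}{\arg\min} \; \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2,
\qquad
\mu_i = \frac{1}{|S_i|} \sum_{x \in S_i} x
```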

Process:

1. Initially, the number of clusters must be known; let it be k.

2. The first step is to choose a set of k instances as the centres of the clusters.

3. Next, the algorithm considers each instance and assigns it to the closest cluster.

4. The cluster centroids are recalculated either after a whole cycle of re-assignment or after each instance assignment.

5. This process is iterated. The complexity of the k-means algorithm is O(tkn), where n is the number of instances, k the number of clusters, and t the number of iterations, which makes it relatively efficient.

It often terminates at a local optimum.

Its disadvantages are that it is applicable only when a mean is defined and that k, the number of clusters, must be specified in advance. It is also unable to handle noisy data and outliers, and it is not suitable for discovering clusters with non-convex shapes.
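The sensitivity to outliers comes from the centroid being a mean. A tiny example with made-up numbers shows how a single extreme value drags the mean far from the bulk of the data, while the median barely moves.

```python
from statistics import mean, median

# Made-up one-dimensional cluster, then the same cluster with one outlier.
values = [10, 11, 12, 13, 14]
with_outlier = values + [500]

mean_before = mean(values)
mean_after = mean(with_outlier)
median_after = median(with_outlier)

print("mean without outlier:", mean_before)
print("mean with outlier:", round(mean_after, 2))
print("median with outlier:", median_after)
```

This is why noisy data is often cleaned before running k-means, or a median-based variant is used instead.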

So, we have seen:-

1) What is clustering?

2) Types of clustering

3) Applications of clustering

4) What is the K-Means algorithm?

5) Working of the K-Means algorithm

6) Choosing K

7) Its use case in the security domain

Thanks for reading!

Keep learning! Keep sharing!
