K-means clustering and its real use cases in the security domain

Saurav Majumder

Associate IT Consultant

Published: July 19, 2021

What is Clustering?

Clustering algorithms are used to subdivide a dataset into clusters of data points that are most similar with respect to a predefined attribute. If we have a dataset that describes multiple attributes of each observation and want to group the data points according to the similarity of those attributes, clustering algorithms are the tool to use.

A simple scatter plot of Country Income and Education datasets yields the chart we see here.

In unsupervised clustering, we start with this data and then proceed to divide it into subsets. These subsets are called clusters and consist of data points that are most similar to one another. It appears that there are at least two clusters, probably three: one at the bottom with low income and education, while the high-education countries look like they might be split between low and high income.

The following figure shows the result of eyeballing (making a visual estimate of) the clusters in this dataset.

Clustering algorithms are one type of approach in unsupervised machine learning; other approaches include Markov methods and methods for dimension reduction. Clustering algorithms are appropriate when the following are true:

We know and understand the dataset we’re analyzing.
Before running the clustering algorithm, we don’t have an exact idea of the nature of the subsets (clusters). Often, we won’t even know how many subsets there are in the dataset before we run the algorithm.
The subsets (clusters) are determined by only the one dataset we’re analyzing.
Our goal is to determine a model that describes the subsets in a single dataset and only this dataset.

K-Means Clustering

K-means clustering aims to partition data into k clusters such that data points in the same cluster are similar and data points in different clusters are farther apart.

Similarity of two points is determined by the distance between them.

There are many ways to measure distance. Euclidean distance (Minkowski distance with p=2) is one of the most commonly used. The figure below shows how to calculate the Euclidean distance between two points in a 2-dimensional space: square the difference between the points' x coordinates and the difference between their y coordinates, add the two squares, and take the square root of the sum.

In the case above, the Euclidean distance is the square root of (16 + 9), which is 5. Euclidean distance in two dimensions reminds us of the famous Pythagorean theorem.
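The same arithmetic can be checked in a few lines of NumPy. This is a minimal sketch: the two points are hypothetical and were chosen only so that their coordinates differ by 4 and 3, matching the squares 16 and 9 above.

```python
import numpy as np

# Two hypothetical points whose coordinates differ by 4 and 3.
p = np.array([1.0, 2.0])
q = np.array([5.0, 5.0])

# Square the coordinate differences, sum them, then take the square root.
squared_diffs = (q - p) ** 2             # array([16., 9.])
distance = np.sqrt(squared_diffs.sum())

# np.linalg.norm computes the same Euclidean (L2) length in one call.
assert np.isclose(distance, np.linalg.norm(q - p))
print(distance)  # 5.0
```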

There are other ways to measure distance, such as cosine similarity, average distance and so on. The similarity measure is at the core of k-means clustering. The best measure depends on the type of problem, so good domain knowledge is important for choosing it. Note that scikit-learn's KMeans, used below, always works with Euclidean distance.

K-means clustering tries to minimize distances within a cluster and maximize the distance between different clusters.
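This goal is usually written as an objective function. As a sketch in standard notation (the symbols C_i for the i-th cluster and \mu_i for its centroid are introduced here and do not appear elsewhere in this article), k-means looks for the partition that minimizes the within-cluster sum of squared Euclidean distances:

```latex
J = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2,
\qquad
\mu_i = \frac{1}{|C_i|} \sum_{x \in C_i} x
```

Because the total spread of the data is fixed, making J small inside the clusters is equivalent to making the size-weighted spread between the cluster centroids large.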

Let’s start with a simple example to understand the concept. As usual, we import the dependencies first:

# Importing necessary libraries
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

Scikit-learn provides many useful functions for creating synthetic datasets, which are very helpful for practicing machine learning algorithms. I will use the make_blobs function.

X, y = make_blobs(n_samples = 200, centers=4, cluster_std = 0.5, random_state = 0)

plt.scatter(X[:, 0], X[:, 1], s=50)

Then we create a KMeans object and fit the data:

kmeans = KMeans(n_clusters = 4)

kmeans.fit(X)

We can now partition the dataset into clusters:

y_pred = kmeans.predict(X)

plt.scatter(X[:, 0], X[:, 1], c = y_pred, s=50)
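After fitting, the model exposes what it has learned. The sketch below reuses the same make_blobs settings as above; random_state and n_init are passed to KMeans only to make the run repeatable and are not part of the original example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=4, cluster_std=0.5, random_state=0)

# random_state and n_init are set here only so that repeated runs agree.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

print(kmeans.cluster_centers_.shape)  # (4, 2): one 2-D centroid per cluster
print(kmeans.inertia_)                # sum of squared distances to the closest centroid
print(kmeans.labels_[:10])            # cluster index of the first ten points
```

The centroids in cluster_centers_ can be drawn on top of the scatter plot to see where each cluster is anchored.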

Real-life datasets are much more complex, and their clusters are not clearly separated. However, the algorithm works in the same way.

The k-means algorithm cannot determine the number of clusters by itself. We need to define it when creating the KMeans object, which may be a challenging task.
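Two common heuristics can help with this choice: the "elbow" in the inertia curve and the silhouette score. The sketch below tries both on the synthetic blobs; the search range of 2 to 8 clusters is an assumption made for illustration, and neither heuristic is guaranteed to find the right answer on real data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=200, centers=4, cluster_std=0.5, random_state=0)

candidate_ks = range(2, 9)  # assumed search range
inertias, silhouettes = [], []
for k in candidate_ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)
    silhouettes.append(silhouette_score(X, model.labels_))

# Inertia tends to keep falling as k grows, so look for the point where the
# drop flattens out (the elbow). The silhouette score lies between -1 and 1,
# and higher values indicate better-separated clusters.
best_k = max(zip(candidate_ks, silhouettes), key=lambda pair: pair[1])[0]
print("k with the highest silhouette score:", best_k)
```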

K-Means Algorithm

K-means is an iterative process. It follows the same alternating pattern as the expectation-maximization algorithm: assign points to clusters, then update the cluster centers. Once the number of clusters has been chosen, it works by executing the following steps:

1. Randomly select a centroid (cluster center) for each cluster.
2. Calculate the distance of all data points to the centroids.
3. Assign each data point to the closest cluster.
4. Find the new centroid of each cluster by taking the mean of all data points in the cluster.
5. Repeat steps 2, 3 and 4 until the assignments no longer change and the cluster centers stop moving.
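The steps above can be sketched directly in NumPy. This is an illustrative implementation, not scikit-learn's; the function name kmeans_numpy, its parameters, and the choice to leave an empty cluster's centroid where it was are all assumptions made for this example.

```python
import numpy as np

def kmeans_numpy(X, k, max_iter=100, seed=0):
    """Plain k-means on an (n_samples, n_features) array."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: distance from every point to every centroid, shape (n, k).
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # Step 3: assign each point to its closest centroid.
        labels = distances.argmin(axis=1)
        # Step 4: move each centroid to the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop once the centroids no longer move.
        moved = not np.allclose(new_centroids, centroids)
        centroids = new_centroids
        if not moved:
            break
    return centroids, labels

# Usage on a small synthetic dataset with two groups.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
centroids, labels = kmeans_numpy(X, k=2)
print(centroids)
```

Because the starting centroids are random, different seeds can lead to different final clusters, which is why library implementations usually run several initializations and keep the best one.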

Pros and Cons

Pros:

Easy to interpret
Relatively fast
Scalable for large data sets
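Initial centroid positions can be chosen in a smart way that speeds up convergence (for example, the k-means++ initialization that scikit-learn uses by default)
Guaranteed to converge, although possibly to a local optimum rather than the best possible clustering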

Cons:

The number of clusters must be predetermined. The k-means algorithm cannot guess how many clusters exist in the data, and determining that number may well be a challenging task.
Can only draw linear boundaries. If a non-linear structure separates the groups in the data, k-means will not be a good choice.
Slows down as the number of samples increases, because at each step the k-means algorithm accesses all data points and calculates distances. An alternative is to use a subset of the data points to update the location of the centroids (e.g. sklearn.cluster.MiniBatchKMeans).
Sensitive to outliers
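The mini-batch alternative mentioned in the list can be tried with almost the same code as the full algorithm. A minimal sketch, assuming a synthetic dataset of 20,000 points and a batch size of 1,024 (both numbers are arbitrary choices for illustration):

```python
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

# A larger synthetic dataset; the sizes are arbitrary.
X, _ = make_blobs(n_samples=20_000, centers=4, cluster_std=0.5, random_state=0)

# Full k-means looks at every point in every iteration.
full = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# Mini-batch k-means updates the centroids from small random batches.
mini = MiniBatchKMeans(n_clusters=4, batch_size=1024, n_init=3, random_state=0).fit(X)

# The mini-batch variant typically gives up a little clustering quality
# (a somewhat higher inertia) in exchange for less work per iteration.
print(full.inertia_, mini.inertia_)
```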

Real use cases in the security domain

K-means can typically be applied to data that is numeric and continuous and has a relatively small number of dimensions. Think of a scenario in which you want to make groups of similar things from a randomly distributed collection of things; k-means is very suitable for such scenarios.
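Because k-means relies on Euclidean distance, numeric features measured on very different scales are usually standardized first. The sketch below shows one way to do that with a scikit-learn pipeline; the two features (bytes sent and session duration) and their distributions are hypothetical and generated at random.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical features on very different scales.
bytes_sent = rng.normal(50_000, 15_000, 300)
duration_s = rng.normal(30, 10, 300)
X = np.column_stack([bytes_sent, duration_s])

# Standardize each column to zero mean and unit variance before clustering,
# so that the large-valued column does not dominate the distance.
pipeline = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=3, n_init=10, random_state=0),
)
labels = pipeline.fit_predict(X)
print(np.bincount(labels))  # number of points assigned to each cluster
```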

Here are three interesting use cases for k-means:

1. Information security problem

With the rapid development of computer technology, information security has become an important safeguard for social development: most technology depends on network information infrastructure, so securing it has become a significant concern. Intrusion detection systems are an important part of the information security architecture, and as intrusion detection grows more diverse and complex, the requirements placed on these systems rise as well. One study of k-means for intrusion detection explains the principle and flow of the algorithm and analyzes the problems in applying it: the initial centers are easily affected by isolated points (outliers), and the result easily converges to a local optimum. The study suggests removing the isolated points and optimizing the initial centers, and improves the clustering method accordingly. In its simulation experiments on intrusion detection with mixed data, the improved k-means algorithm achieved a higher detection rate and a lower false detection rate than traditional k-means, and the clustering effect improved noticeably.
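To make the idea of removing isolated points before clustering concrete, here is one simple way it could be done. This is a sketch under stated assumptions and not the method of the study described above: the data is synthetic, a point counts as isolated when its mean distance to its 5 nearest neighbors is unusually large, and the cut-off of mean plus three standard deviations is an arbitrary rule.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)
# Synthetic stand-in for scaled connection features; not real traffic data.
normal_points = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(250, 2)),
    rng.normal(loc=[5.0, 5.0], scale=0.5, size=(250, 2)),
])
isolated_points = np.array([[10.0, 10.0], [-10.0, 10.0], [10.0, -10.0],
                            [-10.0, -10.0], [0.0, 15.0]])
X = np.vstack([normal_points, isolated_points])

# Mean distance to the 5 nearest neighbors. The query returns each point
# itself at distance 0, so ask for 6 neighbors and drop the first column.
neighbor_dist, _ = NearestNeighbors(n_neighbors=6).fit(X).kneighbors(X)
mean_knn_dist = neighbor_dist[:, 1:].mean(axis=1)

# Assumed rule: a point is isolated if it lies far above the typical value.
threshold = mean_knn_dist.mean() + 3 * mean_knn_dist.std()
is_isolated = mean_knn_dist > threshold

# Cluster only the remaining points, so outliers cannot drag the centroids.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X[~is_isolated])
print(is_isolated.sum(), kmeans.cluster_centers_)
```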

2. Cyber-profiling criminals

Cyber-profiling is the process of collecting data from individuals and groups to identify significant correlations. The idea of cyber-profiling is derived from criminal profiling, which gives investigators information for classifying the types of criminals who were at a crime scene. In one study, the k-means algorithm was used for the cyber-profiling process and met the study's expectations, because it is a simple algorithm with a good degree of accuracy.

3. Insurance fraud detection

Machine learning has a critical role to play in fraud detection and has numerous applications in automobile, healthcare, and insurance fraud detection. Using historical data on fraudulent claims, it is possible to isolate new claims based on their proximity to clusters that indicate fraudulent patterns. Since insurance fraud can have a multi-million dollar impact on a company, the ability to detect fraud is crucial.
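The proximity idea can be sketched as follows: cluster past fraudulent claims, then check how close a new claim lies to the nearest fraud centroid. Everything specific here is an assumption for illustration: the claims are synthetic two-feature vectors that are already scaled, there are two fraud patterns, and the review radius is set to the 95th percentile of the distances seen among past fraud cases.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(7)
# Synthetic, already-scaled features for past fraudulent claims.
past_fraud = np.vstack([
    rng.normal(loc=[3.0, -2.0], scale=0.3, size=(100, 2)),
    rng.normal(loc=[-2.5, 2.5], scale=0.3, size=(100, 2)),
])
fraud_model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(past_fraud)

# transform() returns the distance from each row to every centroid.
past_dist = fraud_model.transform(past_fraud).min(axis=1)
radius = np.percentile(past_dist, 95)  # assumed cut-off

new_claims = np.array([
    [3.1, -2.1],  # close to the first fraud pattern
    [0.0, 0.0],   # far from both patterns
])
nearest_dist = fraud_model.transform(new_claims).min(axis=1)
needs_review = nearest_dist <= radius
print(needs_review)
```

A claim flagged this way is only a candidate for manual review; distance to a cluster is a signal, not proof of fraud.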

Thank You for reading

Hope you will like it.

Feel Free to Comment!!
