K-means clustering in the security domain

Shradha Seth

Mts software Engineer 2 at NetApp

Published: August 11, 2021

Clustering :-

Clustering is one of the most common exploratory data analysis techniques, used to get an intuition about the structure of the data. It can be defined as the task of identifying subgroups in the data such that data points in the same subgroup (cluster) are very similar while data points in different clusters are very different. In other words, we try to find homogeneous subgroups within the data such that data points in each cluster are as similar as possible according to a similarity measure such as Euclidean-based distance or correlation-based distance. The decision of which similarity measure to use is application-specific.
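To make the difference between the two similarity measures concrete, here is a minimal sketch in plain Python. The function names and the two "customer" vectors are made up for illustration and are not part of the original study.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def correlation_distance(a, b):
    """1 minus the Pearson correlation of the two vectors.

    Close to 0 when the vectors rise and fall together, regardless of scale.
    Assumes neither vector is constant (otherwise the denominator is zero).
    """
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    norm_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return 1.0 - cov / (norm_a * norm_b)

# Hypothetical data: same spending pattern, very different scale.
small_spender = [1.0, 2.0, 3.0]
big_spender = [10.0, 20.0, 30.0]

print("euclidean:  ", euclidean_distance(small_spender, big_spender))
print("correlation:", correlation_distance(small_spender, big_spender))
```

The Euclidean distance treats these two vectors as far apart because of the difference in scale, while the correlation-based distance treats them as nearly identical because the pattern is the same. Which of the two is "right" depends on the application.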

Clustering analysis can be done on the basis of features, where we try to find subgroups of samples based on features, or on the basis of samples, where we try to find subgroups of features based on samples. We’ll cover clustering based on features here. Clustering is used in market segmentation, where we try to find customers that are similar to each other in terms of behaviors or attributes; in image segmentation and compression, where we try to group similar regions together; in document clustering based on topics; and so on.

Unlike supervised learning, clustering is considered an unsupervised learning method, since we don’t have ground-truth labels against which to compare the output of the clustering algorithm and evaluate its performance. We only want to investigate the structure of the data by grouping the data points into distinct subgroups.

In this post, we will cover only K-means, which is considered one of the most used clustering algorithms due to its simplicity.

K-means Algorithm :-

The K-means algorithm is an iterative algorithm that tries to partition the dataset into K pre-defined, distinct, non-overlapping subgroups (clusters), where each data point belongs to only one group. It tries to make the intra-cluster data points as similar as possible while also keeping the clusters as different (far apart) as possible. It assigns data points to a cluster such that the sum of the squared distances between the data points and the cluster’s centroid (the arithmetic mean of all the data points that belong to that cluster) is at the minimum. The less variation we have within clusters, the more homogeneous (similar) the data points are within the same cluster.
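Written as a formula, the quantity being minimized can be sketched as follows. The symbols are notation introduced here for illustration: $C_k$ is the set of points assigned to cluster $k$, $\mu_k$ is its centroid, and $x_i$ is a data point.

```latex
J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2,
\qquad
\mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i
```

A smaller $J$ means the points sit closer to their own centroid, which is the "less variation within clusters" described above.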

The K-means algorithm works as follows:

1. Specify the number of clusters K.
2. Initialize the centroids by first shuffling the dataset and then randomly selecting K data points for the centroids without replacement.
3. Keep iterating until there is no change to the centroids, i.e. the assignment of data points to clusters isn’t changing:
   a. Compute the sum of the squared distance between data points and all centroids.
   b. Assign each data point to the closest cluster (centroid).
   c. Compute the centroids for the clusters by taking the average of all the data points that belong to each cluster.
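The steps above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a production implementation: the function names, the `seed` and `max_iter` parameters, and the six sample points are all assumptions made for the example.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    """Arithmetic mean of a non-empty list of points, coordinate by coordinate."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Initialization: pick k data points at random, without replacement.
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Stop once the assignment of points to clusters no longer changes.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean_point(members)
    return centroids, assignments

# Hypothetical 2-D data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(centroids)
print(labels)
```

Because the starting centroids are random, the cluster numbering can differ between seeds, but on data this cleanly separated the grouping itself should come out the same.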

Now let's understand the K-means algorithm with the help of an example :-

A pizza chain wants to open its delivery centres across a city. What do you think would be the possible challenges?

- They need to analyse the areas from which pizza is ordered frequently.
- They need to understand how many pizza stores have to be opened to cover delivery in the area.
- They need to figure out the locations for the pizza stores within all these areas in order to keep the distance between the store and the delivery points to a minimum.

Resolving these challenges involves a lot of analysis and mathematics. We will now see how clustering provides a meaningful and easy method of sorting out such real-life challenges.

Let’s consider the above problem and see how we can help the pizza chain come up with centres based on the K-means algorithm.
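As a rough sketch of how this could look in code, the snippet below treats each order as an (x, y) location, uses K-means centroids as candidate store locations, and compares the total squared delivery distance for different numbers of stores. The order coordinates, the `place_stores` function, and the restart strategy are all assumptions for illustration; real data would come from the chain's order history.

```python
import random

def sq_dist(a, b):
    """Squared straight-line distance between two (x, y) locations."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def centre_of(group):
    """Mean location of a non-empty group of orders."""
    return (sum(x for x, _ in group) / len(group),
            sum(y for _, y in group) / len(group))

def place_stores(orders, n_stores, n_restarts=10, seed=42):
    """Run K-means from several random starts and keep the cheapest layout."""
    rng = random.Random(seed)
    best_cost, best_stores = float("inf"), None
    for _restart in range(n_restarts):
        stores = rng.sample(orders, n_stores)
        for _iteration in range(50):
            # Send every order to its nearest store.
            groups = [[] for _ in stores]
            for order in orders:
                nearest = min(range(n_stores),
                              key=lambda s: sq_dist(order, stores[s]))
                groups[nearest].append(order)
            # Move each store to the centre of the orders it serves.
            new_stores = [centre_of(g) if g else s
                          for g, s in zip(groups, stores)]
            if new_stores == stores:
                break
            stores = new_stores
        cost = sum(min(sq_dist(o, s) for s in stores) for o in orders)
        if cost < best_cost:
            best_cost, best_stores = cost, stores
    return best_stores, best_cost

# Hypothetical order locations forming three neighbourhoods.
orders = [(1, 1), (1, 2), (2, 1), (2, 2),
          (8, 8), (8, 9), (9, 8), (9, 9),
          (1, 8), (1, 9), (2, 8), (2, 9)]

for n in range(1, 5):
    stores, cost = place_stores(orders, n)
    print(n, "stores -> total squared distance", round(cost, 2))
```

Adding stores always lowers the total distance, so the useful question is where the improvement flattens out. On this toy data the drop should become small once there is roughly one store per neighbourhood, which is one common way to pick the number of stores.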

Use-Case : Cyber Profiling using K-Means Clustering

Cyber Profiling :-

The idea of cyber profiling is derived from criminal profiling, which provides the investigation division with information to classify the types of criminals who were at the crime scene. Profiling is based, more specifically, on what is known and not known about the criminal. A profile is information about an individual or group of individuals that is accumulated, stored, and used for various purposes, for example by monitoring their behavior through their internet activity. The difficulty in implementing cyber profiling lies in the diversity of user data and in the fact that online behavior is sometimes different from actual behavior. Given how individual personal behavior is, inductive generalizations can be very reliable but can also lead to misunderstandings in behavior analysis. The cyber-profiling process therefore combines deductive and inductive methods. For investigations, cyber profiling makes a good contribution to the field of forensic computer science. Cyber profiling is one of the efforts made by the investigator to identify alleged offenders through the analysis of data patterns that include aspects of technology, investigation, psychology, and sociology.

The cyber profiling process can be directed to the benefit of:

- Identification of users of computers that have been used previously.

- Mapping the subject's family, social life, work, or network-based organizations, including those for whom he/she worked.

- Provision of information about the user regarding his or her ability, level of threat, and vulnerability to threats.

- Identification of the suspected abuser.

The process of profiling criminals is often also known as cyber-criminal profiling or criminal investigative analysis. Criminal profiles are generated in the form of data on the offender's personal traits, tendencies, habits, and geographic-demographic characteristics (for example: age, gender, socioeconomic status, education, and place of origin or residence). Preparing a criminal profile involves analyzing the physical evidence found at the crime scene, building an understanding of the victim (victimology), looking for a modus operandi (whether the crime was planned or unplanned), and tracing what the perpetrator deliberately left behind (signature). The new approach to cyber profiling is to use clustering techniques to classify web-based content according to users' preference data. These preferences can be interpreted as an initial grouping of the data, so that the resulting clusters show user profiles. User profiling can be seen as inferring the interests, intentions, characteristics, behavior, and preferences of users. User profiles are created to describe the background knowledge of the user. A user profile represents the conceptual model held by the user when searching for information on the web.

The initial values are determined from the data: the highest value, the median value, and the smallest value. These values serve as the initial cluster centers for the K-means process that follows.

The centroids are then recalculated repeatedly (iteration) until an iteration produces the same centroids as the previous one.
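The initialization and stopping rule described above can be sketched for one-dimensional traffic counts as follows. The function name `kmeans_1d` and the visit counts are hypothetical, and the centroids are ordered smallest, median, largest so that the first cluster is the low-traffic one; the original study's data and code are not shown here.

```python
import statistics

def kmeans_1d(values, max_iter=100):
    """Three-cluster K-means on single numbers with min/median/max starting centroids."""
    centroids = [
        float(min(values)),
        float(statistics.median(values)),
        float(max(values)),
    ]
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assign every value to the nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)),
                          key=lambda c: abs(v - centroids[c]))
            clusters[nearest].append(v)
        # Recompute each centroid as the mean of its members.
        new_centroids = [
            sum(members) / len(members) if members else old
            for members, old in zip(clusters, centroids)
        ]
        # Stop when the centroids are the same as in the previous iteration.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical visit counts per website.
visits = [1, 2, 2, 3, 5, 8, 12, 15, 20, 30, 40, 55, 63]
centroids, clusters = kmeans_1d(visits)
for centroid, members in zip(centroids, clusters):
    print(round(centroid, 2), members)
```

Starting from the minimum, median, and maximum makes the run deterministic, unlike the random initialization used in general-purpose K-means.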

The results of clustering details will be explained as follows:

Cluster-1. The first cluster contains 1467 records. It has the most members, but its values are below the overall average of the data studied. Values in the first cluster lie in the range 1-10, because the data in this cluster have a low level of traffic. Cluster one is therefore categorized as the websites with the least traffic compared with the other clusters.

Cluster-2. The second cluster contains 126 records, with values in the range 11-31. These values indicate that the members of the second cluster have a medium level of visits, because they are higher than the average value produced by the clustering. The second cluster is therefore categorized as having moderate traffic levels.

Cluster-3. The third cluster contains 33 records. It has the fewest members of all the clusters, but its members have the highest values in the data. Values in this cluster lie in the range 34-63, far above average. The third cluster is therefore categorized as the cluster with the highest traffic levels.
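A summary like the one above (record count, value range, and position relative to the overall average) can be produced from any clustering result with a small helper. The sketch below uses made-up clusters, not the study's 1467/126/33 records, and the `summarize_clusters` function and its labels are assumptions for illustration.

```python
def summarize_clusters(clusters, labels=("low", "moderate", "high")):
    """Describe each cluster by size, value range, and position versus the overall mean."""
    all_values = [v for members in clusters for v in members]
    overall_mean = sum(all_values) / len(all_values)
    # Order clusters from lowest to highest mean so the labels line up.
    ordered = sorted(clusters, key=lambda members: sum(members) / len(members))
    summary = []
    for label, members in zip(labels, ordered):
        cluster_mean = sum(members) / len(members)
        summary.append({
            "traffic": label,
            "records": len(members),
            "range": (min(members), max(members)),
            "above_overall_mean": cluster_mean > overall_mean,
        })
    return summary

# Hypothetical clusters of visit counts, in arbitrary order.
example_clusters = [[34, 50, 63], [1, 1, 2, 3, 4, 5, 7, 10], [11, 20, 31]]
for row in summarize_clusters(example_clusters):
    print(row)
```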

Analysis Results :-

In this study, the K-means clustering algorithm was implemented and performed in line with expectations. In the early stages, primary data were obtained containing information about the websites accessed by users via the internet. Besides informational websites, the data also contain operating system updates, web browser updates, and website advertising that usually appears as a pop-up.

The K-means results, as shown in the figure above, indicate that each cluster has a significantly different number of members. The first cluster shows low levels of traffic but contains the most websites. Data in the first cluster consist mostly of advertising websites that were loaded alongside a visit to another website. The second cluster, which has moderate traffic levels, contains news sites. The third cluster is the group of websites with the highest traffic levels but the fewest websites. Data in this cluster show that social media websites have relatively high traffic levels. Other data from the third cluster show that internet users access search engine websites more frequently than other websites, including social media websites.
