登录查看更多内容

K-means Clustering

Nehal Ingole

DevOps | Full Stack Development | CICD | React Native

发布日期: 2021年7月19日

+ 关注

Clustering

Clustering?is one of the most common exploratory data analysis technique used to get an intuition about the structure of the data. It can be defined as the task of identifying subgroups in the data such that data points in the same subgroup (cluster) are very similar while data points in different clusters are very different. In other words, we try to find homogeneous subgroups within the data such that data points in each cluster are as similar as possible according to a similarity measure such as euclidean-based distance or correlation-based distance. The decision of which similarity measure to use is application-specific.

Clustering analysis can be done on the basis of features where we try to find subgroups of samples based on features or on the basis of samples where we try to find subgroups of features based on samples. We’ll cover here clustering based on features. Clustering is used in market segmentation; where we try to find customers that are similar to each other whether in terms of behaviors or attributes, image segmentation/compression; where we try to group similar regions together, document clustering based on topics, etc.

Unlike supervised learning, clustering is considered an unsupervised learning method since we don’t have the ground truth to compare the output of the clustering algorithm to the true labels to evaluate its performance. We only want to try to investigate the structure of the data by grouping the data points into distinct subgroups.

In this post, we will cover only?Kmeans?which is considered as one of the most used clustering algorithms due to its simplicity.

Kmeans Algorithm

Kmeans?algorithm is an iterative algorithm that tries to partition the dataset into?Kpre-defined distinct non-overlapping subgroups (clusters) where each data point belongs to?only one group. It tries to make the intra-cluster data points as similar as possible while also keeping the clusters as different (far) as possible. It assigns data points to a cluster such that the sum of the squared distance between the data points and the cluster’s centroid (arithmetic mean of all the data points that belong to that cluster) is at the minimum. The less variation we have within clusters, the more homogeneous (similar) the data points are within the same cluster.

The way kmeans algorithm works is as follows:

Specify number of clusters?K.
Initialize centroids by first shuffling the dataset and then randomly selecting?K?data points for the centroids without replacement.
Keep iterating until there is no change to the centroids. i.e assignment of data points to clusters isn’t changing.

Compute the sum of the squared distance between data points and all centroids.
Assign each data point to the closest cluster (centroid).
Compute the centroids for the clusters by taking the average of the all data points that belong to each cluster.

How K-Means Clustering Works.....

K-means is an algorithm that trains a model that groups similar objects together. The k-means algorithm accomplishes this by mapping each observation in the input dataset to a point in the?n-dimensional space (where?n?is the number of attributes of the observation). For example, your dataset might contain observations of temperature and humidity in a particular location, which are mapped to points (t, h) in 2-dimensional space.

In k-means clustering, each cluster has a center. During model training, the k-means algorithm uses the distance of the point that corresponds to each observation in the dataset to the cluster centers as the basis for clustering. You choose the number of clusters (k) to create.

For example, suppose that you want to create a model to recognize handwritten digits and you choose the MNIST dataset for training. The dataset provides thousands of images of handwritten digits (0 through 9). In this example, you might choose to create 10 clusters, one for each digit (0, 1, …, 9). As part of model training, the k-means algorithm groups the input images into 10 clusters.

Each image in the MNIST dataset is a 28x28-pixel image, with a total of 784 pixels. Each image corresponds to a point in a 784-dimensional space, similar to a point in a 2-dimensional space (x,y). To find a cluster to which a point belongs, the k-means algorithm finds the distance of that point from all of the cluster centers. It then chooses the cluster with the closest center as the cluster to which the image belongs.

In Sage Maker, you specify the number of clusters when creating a training job. For more information, see?Create Training Job. In the request body, you add the?Hyper Parameters?string map to specify the?k?and?extra_center_factor?strings.

The following is a summary of how k-means works for model training in SageMaker:

领英推荐

Data Science Best Practices

Pratibha Kumari J. 1 年前

Basic Building Blocks of K-Means Clustering Algorithms

Harry Thapa 9 个月前

Types of Clustering Methods

Shashank Sharma 1 年前

It determines the initial?K?cluster centers.
It iterates over input training data and recalculates cluster centers.
It reduces resulting clusters to?k?(if the data scientist specified the creation of?k*x?clusters in the request).

Criminal Analysis using K-means clustering

Criminal activities are a major cause for concern for law enforcement officials. Existing strategies to control crime are usually reactive, responding to the crime scene after the crimes have occurred. However, with the advent of technology and data analytics, it is now possible to recognize patterns in criminal activities using historical data and help law enforcement officers do a better job in crime prevention and control.

Steps of crime pattern analysis

Determine the geospatial plot of crimes in the city: The first step is the collection of crime information in a given city. This is usually available from multiple places such as law enforcement reports, victimization statistical surveys, collation of newspaper articles etc. This data can be plotted on a geographical map such as the one shown above.
Use clustering techniques to identify patterns: Clustering is a method to depict the dataset in the form of subsets called clusters so that the observations in the same cluster make some sense. It is a method of unsupervised learning and is used for statistical data analysis.

Advantages of clustering for crime pattern analysis

There are several advantages to using this approach for crime pattern analysis:

This approach helps us to analyse the historical crime rates and enhance the crime resolution rate of the present.
Take actions to prevent future incidents by using preventive mechanisms based on observed patterns.
Reduce the training time of the officers that are assigned to a new location and have no prior knowledge of site-specific crimes.
Increase operational efficiency by optimally redeploying limited resources to the right places at the right times.

Limitations of crime pattern detection

There are a few limitations to using this approach for crime pattern detection:

Crime pattern analysis can only help the detectives and not replace them. It is up to the human experts to interpret what the clusters are telling us.
Data mining is sensitive to the quality of input data and that can be inaccurate sometimes. Missing information can also cause errors.
Mapping data mining attributes is a difficult task and hence it requires a skilled data miner and a crime data analyst with good domain knowledge.

要查看或添加评论，请登录

Nehal Ingole的更多文章

Machine Learning for CyberSecurity

2021年11月19日

Machine Learning for CyberSecurity

Security have been a major issue in today’s digital world organizations does there business via the websites/ webapp…

2 条评论
WordPress

2021年7月14日

WordPress

ARTH - Task 18 ??????? Task Description?? ?? Create an AWS EC2 instance ?? Configure the instance with Apache…
Using Face Recognition to send a mail ?? , WhatsApp ?? and launching AWS instance ??

2021年6月21日

Using Face Recognition to send a mail ?? , WhatsApp ?? and launching AWS instance ??

So In this task we are using the Face Recognition model to recognize someone's Face and after that we will ask the…

2 条评论
JavaScript

2021年6月20日

JavaScript

JavaScript is a light-weight object-oriented programming language that is used by several websites for scripting the…
Images Processing with python

2021年6月9日

Images Processing with python

Hello Guys !! In this article we are going to learn about the Images Processing with the help of python but before that…
Cyber Security And Confusion Matrix

2021年6月6日

Cyber Security And Confusion Matrix

Hello everyone, In this article we will discuss what role does Confusion Matrix plays in the Cybersecurity field?? As…
LVM in Hadoop

2021年3月15日

LVM in Hadoop

What is Hadoop? Apache Hadoop is the software it is a collection of open-source software utilities that facilitates…
Create a partitions using python

2021年3月14日

Create a partitions using python

In this task we are going to see how to Create a partitions using python for the python code you can visit to my GitHub…

1 条评论
PARTITION IN LINUX

2021年3月14日

PARTITION IN LINUX

In this task we going to see how to create a PARTITION IN LINUX First add a Hardisk to the ec2 instances Then install…
GUI CONTAINER ON THE DOCKER

2021年3月12日

GUI CONTAINER ON THE DOCKER

Task Description ??GUI container on the docker ??Launch a container on Docker in GUI mode ??Run any GUI software on…

See all articles

K-means Clustering

Nehal Ingole

DevOps | Full Stack Development | CICD | React Native

Clustering

Kmeans Algorithm

How K-Means Clustering Works.....

领英推荐

Criminal Analysis using K-means clustering

Steps of crime pattern analysis

Advantages of clustering for crime pattern analysis

Limitations of crime pattern detection

Nehal Ingole的更多文章

社区洞察

其他会员也浏览了

Exploratory Data Analysis (EDA)

Statistics in Data Science: From Analysis to Decision Making and Beyond

Data Science Notes _ Part 1

Hierarchical Clustering: A Comprehensive Guide to Understanding and Applying This Powerful Data Analysis Technique

Data Science Revolutionizing the Way We Analyze Data

Data Science Workflow: From Data Collection to Insights

Resampling Techniques: Unlocking the Hidden Potential of Your Data

Dendrograms in Data Science: A Comprehensive Overview

Exploratory Data Analysis - Critical step for AI / ML based solution

What does a Data Science do ?

Clustering

Kmeans Algorithm

How K-Means Clustering Works.....

领英推荐

Criminal Analysis using K-means clustering

Steps of crime pattern analysis

Advantages of clustering for crime pattern analysis

Limitations of crime pattern detection

Nehal Ingole的更多文章

Machine Learning for CyberSecurity

WordPress

Using Face Recognition to send a mail ?? , WhatsApp ?? and launching AWS instance ??

JavaScript

Images Processing with python

Cyber Security And Confusion Matrix

LVM in Hadoop

Create a partitions using python

PARTITION IN LINUX

GUI CONTAINER ON THE DOCKER

社区洞察

其他会员也浏览了

Exploratory Data Analysis (EDA)

Statistics in Data Science: From Analysis to Decision Making and Beyond

Data Science Notes _ Part 1

Hierarchical Clustering: A Comprehensive Guide to Understanding and Applying This Powerful Data Analysis Technique

Data Science Revolutionizing the Way We Analyze Data

Data Science Workflow: From Data Collection to Insights

Resampling Techniques: Unlocking the Hidden Potential of Your Data

Dendrograms in Data Science: A Comprehensive Overview

Exploratory Data Analysis - Critical step for AI / ML based solution

What does a Data Science do ?