K-means Clustering

K-means Clustering

No alt text provided for this image

Clustering

Clustering?is one of the most common exploratory data analysis technique used to get an intuition about the structure of the data. It can be defined as the task of identifying subgroups in the data such that data points in the same subgroup (cluster) are very similar while data points in different clusters are very different. In other words, we try to find homogeneous subgroups within the data such that data points in each cluster are as similar as possible according to a similarity measure such as euclidean-based distance or correlation-based distance. The decision of which similarity measure to use is application-specific.

Clustering analysis can be done on the basis of features where we try to find subgroups of samples based on features or on the basis of samples where we try to find subgroups of features based on samples. We’ll cover here clustering based on features. Clustering is used in market segmentation; where we try to find customers that are similar to each other whether in terms of behaviors or attributes, image segmentation/compression; where we try to group similar regions together, document clustering based on topics, etc.

Unlike supervised learning, clustering is considered an unsupervised learning method since we don’t have the ground truth to compare the output of the clustering algorithm to the true labels to evaluate its performance. We only want to try to investigate the structure of the data by grouping the data points into distinct subgroups.

In this post, we will cover only?Kmeans?which is considered as one of the most used clustering algorithms due to its simplicity.

No alt text provided for this image


Kmeans Algorithm

Kmeans?algorithm is an iterative algorithm that tries to partition the dataset into?Kpre-defined distinct non-overlapping subgroups (clusters) where each data point belongs to?only one group. It tries to make the intra-cluster data points as similar as possible while also keeping the clusters as different (far) as possible. It assigns data points to a cluster such that the sum of the squared distance between the data points and the cluster’s centroid (arithmetic mean of all the data points that belong to that cluster) is at the minimum. The less variation we have within clusters, the more homogeneous (similar) the data points are within the same cluster.

No alt text provided for this image

The way kmeans algorithm works is as follows:

  1. Specify number of clusters?K.
  2. Initialize centroids by first shuffling the dataset and then randomly selecting?K?data points for the centroids without replacement.
  3. Keep iterating until there is no change to the centroids. i.e assignment of data points to clusters isn’t changing.

  • Compute the sum of the squared distance between data points and all centroids.
  • Assign each data point to the closest cluster (centroid).
  • Compute the centroids for the clusters by taking the average of the all data points that belong to each cluster.


No alt text provided for this image



How K-Means Clustering Works.....

K-means is an algorithm that trains a model that groups similar objects together. The k-means algorithm accomplishes this by mapping each observation in the input dataset to a point in the?n-dimensional space (where?n?is the number of attributes of the observation). For example, your dataset might contain observations of temperature and humidity in a particular location, which are mapped to points (t, h) in 2-dimensional space.

No alt text provided for this image

In k-means clustering, each cluster has a center. During model training, the k-means algorithm uses the distance of the point that corresponds to each observation in the dataset to the cluster centers as the basis for clustering. You choose the number of clusters (k) to create.

For example, suppose that you want to create a model to recognize handwritten digits and you choose the MNIST dataset for training. The dataset provides thousands of images of handwritten digits (0 through 9). In this example, you might choose to create 10 clusters, one for each digit (0, 1, …, 9). As part of model training, the k-means algorithm groups the input images into 10 clusters.

Each image in the MNIST dataset is a 28x28-pixel image, with a total of 784 pixels. Each image corresponds to a point in a 784-dimensional space, similar to a point in a 2-dimensional space (x,y). To find a cluster to which a point belongs, the k-means algorithm finds the distance of that point from all of the cluster centers. It then chooses the cluster with the closest center as the cluster to which the image belongs.

No alt text provided for this image

In Sage Maker, you specify the number of clusters when creating a training job. For more information, see?Create Training Job. In the request body, you add the?Hyper Parameters?string map to specify the?k?and?extra_center_factor?strings.

The following is a summary of how k-means works for model training in SageMaker:

  1. It determines the initial?K?cluster centers.
  2. It iterates over input training data and recalculates cluster centers.
  3. It reduces resulting clusters to?k?(if the data scientist specified the creation of?k*x?clusters in the request).


Criminal Analysis using K-means clustering

Criminal activities are a major cause for concern for law enforcement officials. Existing strategies to control crime are usually reactive, responding to the crime scene after the crimes have occurred. However, with the advent of technology and data analytics, it is now possible to recognize patterns in criminal activities using historical data and help law enforcement officers do a better job in crime prevention and control.

Steps of crime pattern analysis

No alt text provided for this image


  • Determine the geospatial plot of crimes in the city: The first step is the collection of crime information in a given city. This is usually available from multiple places such as law enforcement reports, victimization statistical surveys, collation of newspaper articles etc. This data can be plotted on a geographical map such as the one shown above.
  • Use clustering techniques to identify patterns: Clustering is a method to depict the dataset in the form of subsets called clusters so that the observations in the same cluster make some sense. It is a method of unsupervised learning and is used for statistical data analysis.

No alt text provided for this image


Advantages of clustering for crime pattern analysis

There are several advantages to using this approach for crime pattern analysis:

  • This approach helps us to analyse the historical crime rates and enhance the crime resolution rate of the present.
  • Take actions to prevent future incidents by using preventive mechanisms based on observed patterns.
  • Reduce the training time of the officers that are assigned to a new location and have no prior knowledge of site-specific crimes.
  • Increase operational efficiency by optimally redeploying limited resources to the right places at the right times.

No alt text provided for this image


Limitations of crime pattern detection

There are a few limitations to using this approach for crime pattern detection:

  • Crime pattern analysis can only help the detectives and not replace them. It is up to the human experts to interpret what the clusters are telling us.
  • Data mining is sensitive to the quality of input data and that can be inaccurate sometimes. Missing information can also cause errors.
  • Mapping data mining attributes is a difficult task and hence it requires a skilled data miner and a crime data analyst with good domain knowledge.

No alt text provided for this image








要查看或添加评论,请登录

Nehal Ingole的更多文章

  • Machine Learning for CyberSecurity

    Machine Learning for CyberSecurity

    Security have been a major issue in today’s digital world organizations does there business via the websites/ webapp…

    2 条评论
  • WordPress

    WordPress

    ARTH - Task 18 ??????? Task Description?? ?? Create an AWS EC2 instance ?? Configure the instance with Apache…

  • Using Face Recognition to send a mail ?? , WhatsApp ?? and launching AWS instance ??

    Using Face Recognition to send a mail ?? , WhatsApp ?? and launching AWS instance ??

    So In this task we are using the Face Recognition model to recognize someone's Face and after that we will ask the…

    2 条评论
  • JavaScript

    JavaScript

    JavaScript is a light-weight object-oriented programming language that is used by several websites for scripting the…

  • Images Processing with python

    Images Processing with python

    Hello Guys !! In this article we are going to learn about the Images Processing with the help of python but before that…

  • Cyber Security And Confusion Matrix

    Cyber Security And Confusion Matrix

    Hello everyone, In this article we will discuss what role does Confusion Matrix plays in the Cybersecurity field?? As…

  • LVM in Hadoop

    LVM in Hadoop

    What is Hadoop? Apache Hadoop is the software it is a collection of open-source software utilities that facilitates…

  • Create a partitions using python

    Create a partitions using python

    In this task we are going to see how to Create a partitions using python for the python code you can visit to my GitHub…

    1 条评论
  • PARTITION IN LINUX

    PARTITION IN LINUX

    In this task we going to see how to create a PARTITION IN LINUX First add a Hardisk to the ec2 instances Then install…

  • GUI CONTAINER ON THE DOCKER

    GUI CONTAINER ON THE DOCKER

    Task Description ??GUI container on the docker ??Launch a container on Docker in GUI mode ??Run any GUI software on…

社区洞察

其他会员也浏览了