K-means Clustering
What is Clustering?
Clustering is one of the most common exploratory data analysis techniques, used to get an intuition about the structure of the data. It can be defined as the task of identifying subgroups in the data such that data points in the same subgroup (cluster) are very similar, while data points in different clusters are very different. In other words, we try to find homogeneous subgroups within the data such that data points in each cluster are as similar as possible according to a similarity measure such as Euclidean distance or correlation-based distance. The decision of which similarity measure to use is application-specific.
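As a quick illustration of these two similarity measures, here is a minimal sketch assuming NumPy and SciPy are available; the two feature vectors are made up purely for demonstration.

```python
import numpy as np
from scipy.spatial.distance import euclidean, correlation

# Two hypothetical feature vectors (e.g., two samples described by 4 numeric attributes)
a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([2.0, 4.0, 6.0, 8.0])

print(euclidean(a, b))    # Euclidean distance: sqrt(sum((a - b)^2)) ~= 5.48
print(correlation(a, b))  # correlation distance: 1 - Pearson correlation ~= 0, since b is just a scaled copy of a
```

Note how the two measures can disagree: the vectors are far apart in Euclidean terms but perfectly similar in correlation terms, which is why the choice of measure is application-specific.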
Clustering analysis can be done on the basis of features, where we try to find subgroups of samples based on features, or on the basis of samples, where we try to find subgroups of features based on samples. Here we cover clustering based on features. Clustering is used in market segmentation, where we try to find customers that are similar to each other in terms of behaviors or attributes; in image segmentation/compression, where we try to group similar regions together; in document clustering based on topics; and so on.
Unlike supervised learning, clustering is considered an unsupervised learning method since we don't have ground-truth labels against which to compare the clustering algorithm's output in order to evaluate its performance. We only want to investigate the structure of the data by grouping the data points into distinct subgroups.
K-means Clustering:
K-means is one of the simplest unsupervised learning algorithms that solves the well-known clustering problem. K-means stores k centroids that it uses to define clusters. A point is considered to be in a particular cluster if it is closer to that cluster's centroid than to any other centroid.
The K-means algorithm is an iterative algorithm that tries to partition the dataset into K pre-defined, distinct, non-overlapping subgroups (clusters), where each data point belongs to only one group. It tries to make the intra-cluster data points as similar as possible while keeping the clusters as different (far apart) as possible. It assigns data points to a cluster such that the sum of the squared distances between the data points and the cluster's centroid (the arithmetic mean of all the data points that belong to that cluster) is at a minimum. The less variation we have within clusters, the more homogeneous (similar) the data points are within the same cluster.
The K-means clustering algorithm tries to group similar items in the form of clusters. The number of groups is represented by K. K-means clustering uses the Euclidean distance to measure how far points are from each centroid.
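The quantity K-means minimizes, the within-cluster sum of squared distances (often called inertia or WCSS), can be written down directly. Below is a minimal sketch assuming NumPy; the tiny 2-D dataset and the cluster assignment are made up for illustration.

```python
import numpy as np

def within_cluster_sse(X, labels, centroids):
    """Sum of squared Euclidean distances between each point and its assigned centroid.

    This is the objective K-means tries to minimize (often called inertia or WCSS).
    """
    return sum(
        np.sum((X[labels == k] - centroids[k]) ** 2)
        for k in range(len(centroids))
    )

# Tiny illustrative example: two obvious clusters in 2-D
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.1, 4.9]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])
print(within_cluster_sse(X, labels, centroids))  # small value -> tight, homogeneous clusters
```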
Real-world example:
Suppose you go to a vegetable shop to buy some vegetables. There you will see different kinds of vegetables, and one thing you will notice is that they are arranged in groups by type: all the carrots are kept in one place, the potatoes are kept with their kind, and so on. In other words, the vegetables form groups, or clusters, where each vegetable is kept within its own kind of group.
How Does the K-means clustering algorithm work?
K-means clustering tries to group similar kinds of items in the form of clusters. It finds the similarity between the items and groups them into clusters. The K-means clustering algorithm works in three steps: (1) choose K and initialize K centroids, for example by picking K random data points; (2) assign each data point to its nearest centroid; (3) recompute each centroid as the mean of the points assigned to it. Steps 2 and 3 are repeated until the assignments stop changing. A minimal sketch of these steps is shown below.
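This is a from-scratch sketch of the three steps using only NumPy, under the assumption of random-point initialization and a fixed iteration cap; in practice you would typically reach for a library implementation such as scikit-learn's KMeans.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """A minimal K-means sketch following the three steps described above."""
    rng = np.random.default_rng(seed)

    # Step 1: initialize k centroids by picking k random data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    for _ in range(n_iters):
        # Step 2: assign each point to its nearest centroid (Euclidean distance).
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)

        # Step 3: move each centroid to the mean of the points assigned to it.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])

        # Stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    return labels, centroids
```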
Choosing the optimal K value:
One way of choosing the k value is the elbow method. First, you compute the sum of squared errors (SSE) for several values of k. The SSE is the sum of the squared distances between each member of a cluster and its centroid. If you plot k against the SSE, you will see that the error decreases as k increases; as the number of clusters grows, each point lies closer to its centroid, so the distortion shrinks. The idea of the elbow method is to choose the k at which the curve bends, i.e., where the SSE stops decreasing significantly and adding more clusters yields diminishing returns, as in the sketch below.
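Here is a small sketch of the elbow method, assuming scikit-learn and Matplotlib are available; the synthetic blob dataset and the choice of trying k from 1 to 10 are purely for illustration.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with an assumed true structure of 4 clusters, just for illustration.
X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

sse = []
k_values = range(1, 11)
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    sse.append(km.inertia_)  # inertia_ is the SSE described above

plt.plot(k_values, sse, marker="o")
plt.xlabel("k")
plt.ylabel("SSE (inertia)")
plt.title("Elbow method: pick k where the curve bends")
plt.show()
```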
Applications of K-Means Clustering:
K-means can typically be applied to data that has a smaller number of dimensions, is numeric, and is continuous. Think of a scenario in which you want to make groups of similar things from a randomly distributed collection of things; k-means is very suitable for such scenarios.
Here is a list of ten interesting use cases for k-means.
1. Document classification
Cluster documents into multiple categories based on tags, topics, and the content of the document. This is a very standard clustering use case, and k-means is a highly suitable algorithm for it. Initial processing of the documents is needed to represent each document as a vector, using term frequency to identify commonly used terms that characterize the document. The document vectors are then clustered to help identify similarity between groups of documents, as in the sketch below.
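This is a hedged sketch of that pipeline with scikit-learn; the toy documents, the TF-IDF weighting (a common variant of plain term frequency), and the choice of k = 2 are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# A handful of toy documents; a real corpus would be much larger.
docs = [
    "the stock market fell as investors sold shares",
    "investors bought shares after the market rallied",
    "the team won the match with a late goal",
    "the striker scored the winning goal for the team",
]

# Represent each document as a (TF-IDF weighted) term-frequency vector ...
X = TfidfVectorizer(stop_words="english").fit_transform(docs)

# ... then cluster the document vectors. k=2 is assumed here (finance vs. football).
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)  # expect the two finance documents in one cluster and the two football documents in the other
```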
2. Delivery store optimization
Optimize the process of goods delivery using trucks and drones by combining k-means, to find the optimal number of launch locations, with a genetic algorithm that solves the truck route as a traveling salesman problem.
3. Identifying crime localities
With data related to crimes available for specific localities in a city, the category of crime, the area of the crime, and the association between the two can give quality insight into crime-prone areas within a city or a locality.
4. Customer segmentation
Clustering helps marketers improve their customer base, work on target areas, and segment customers based on purchase history, interests, or activity monitoring, as in the sketch below.
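Here is a small illustrative sketch of customer segmentation with scikit-learn; the feature names and values are entirely hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical customer features: [annual spend, number of purchases, days since last purchase]
customers = np.array([
    [5000, 40, 5],
    [4800, 35, 7],
    [300, 2, 200],
    [250, 3, 180],
    [1500, 12, 60],
    [1700, 15, 45],
])

# Scale the features so no single attribute dominates the Euclidean distance.
X = StandardScaler().fit_transform(customers)

segments = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(segments)  # the three pairs of similar customers should each share a segment label
```

Scaling matters here because raw spend (in the thousands) would otherwise swamp the other attributes when computing distances.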
5. Fantasy league stat analysis
Analyzing player stats has always been a critical element of the sporting world, and with increasing competition, machine learning has a critical role to play here. As an interesting exercise, if you would like to create a fantasy draft team and identify similar players based on player stats, k-means can be a useful option.
6. Insurance fraud detection
Machine learning has a critical role to play in fraud detection and has numerous applications in automobile, healthcare, and insurance fraud detection. Utilizing historical data on fraudulent claims, it is possible to isolate new claims based on their proximity to clusters that indicate fraudulent patterns. Since insurance fraud can potentially have a multi-million-dollar impact on a company, the ability to detect fraud is crucial.
7. Rideshare data analysis
The publicly available Uber ride information dataset provides a large amount of valuable data about traffic, transit times, peak pickup localities, and more. Analyzing this data is useful not just in the context of Uber but also in providing insight into urban traffic patterns and helping us plan for the cities of the future.
8. Cyber-profiling criminals
Cyber-profiling is the process of collecting data from individuals and groups to identify significant correlations. The idea of cyber-profiling is derived from criminal profiling, which provides information to the investigation division for classifying the types of criminals present at a crime scene.
9. Call detail record analysis
A call detail record (CDR) is the information captured by telecom companies during a customer's call, SMS, and internet activity. When combined with customer demographics, this information provides greater insight into the customer's needs.
10. Automatic clustering of IT alerts
Large enterprise IT infrastructure components such as networks, storage, and databases generate large volumes of alert messages. Because alert messages potentially point to operational issues, they must be screened and prioritized for downstream processes; clustering the alerts automatically helps reduce this manual screening effort.
Drawbacks
The K-means algorithm is good at capturing the structure of the data if the clusters have a roughly spherical shape; it always tries to construct a spherical region around each centroid. This means that as soon as the clusters have more complicated geometric shapes, k-means does a poor job of clustering the data, as the sketch below illustrates.
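A quick way to see this failure mode, assuming scikit-learn is available, is to run K-means on the classic two-moons dataset, whose two clusters are well separated but far from spherical.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaving half-moons: clearly separated, but not spherical.
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# A score well below 1.0 shows K-means failing to recover the true moon-shaped clusters.
print(adjusted_rand_score(y_true, labels))
```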
Conclusion
K-means clustering is one of the most popular clustering algorithms and is usually the first thing practitioners apply when solving clustering tasks to get an idea of the dataset's structure. The goal of k-means is to group data points into distinct, non-overlapping subgroups. It does an excellent job when the clusters have a roughly spherical shape; however, it suffers as the geometric shapes of the clusters deviate from spherical.
Moreover, it doesn't learn the number of clusters from the data and requires it to be pre-defined. It's always good to know the assumptions behind algorithms and methods so that you have a good idea of each technique's strengths and weaknesses. This will help you decide when to use each technique and under what circumstances.