K-Means Clustering - Use Cases
KARTIK LOKARE
What is Clustering?
Clustering is one of the most common exploratory data analysis techniques, used to get an intuition about the structure of the data. It can be defined as the task of identifying subgroups in the data such that data points in the same subgroup (cluster) are very similar while data points in different clusters are very different. In other words, we try to find homogeneous subgroups within the data such that data points in each cluster are as similar as possible according to a similarity measure such as Euclidean distance or correlation-based distance. The choice of similarity measure is application-specific.
Clustering analysis can be done on the basis of features, where we try to find subgroups of samples based on their features, or on the basis of samples, where we try to find subgroups of features based on the samples. Here we'll cover clustering based on features. Clustering is used in market segmentation, where we try to find customers that are similar to each other in behavior or attributes; in image segmentation/compression, where we try to group similar regions together; in document clustering based on topics; and so on.
K-Means Algorithm
It is one of the most popular machine learning algorithms for cluster analysis. It has been used for customer segmentation, delivery optimization, topic modeling of text data, network traffic segmentation, and more.
Machine learning algorithms are of two kinds: supervised, which learn from labeled examples, and unsupervised, which find structure in unlabeled data.
K-means clustering is an example of an unsupervised algorithm. It works on a basic principle: the closeness of distance between data points. There is no correct label or outcome.
The k-means algorithm is an iterative algorithm that tries to partition the dataset into K pre-defined, distinct, non-overlapping subgroups (clusters), where each data point belongs to only one group. It tries to make the intra-cluster data points as similar as possible while keeping the clusters as different (far apart) as possible. It assigns data points to a cluster such that the sum of the squared distances between the data points and the cluster's centroid (the arithmetic mean of all the data points that belong to that cluster) is minimized. The less variation we have within clusters, the more homogeneous (similar) the data points are within the same cluster.
The way the k-means algorithm works is as follows:
1. Specify the number of clusters K.
2. Initialize centroids, e.g. by picking K data points at random.
3. Assign each data point to its closest centroid.
4. Recompute each centroid as the mean of the points assigned to it.
5. Repeat steps 3 and 4 until the assignments stop changing.
The approach k-means follows to solve the problem is called Expectation-Maximization. The E-step assigns the data points to the closest cluster. The M-step computes the centroid of each cluster. Below is a breakdown of how we can solve it mathematically (feel free to skip it).
The objective function is:

J = Σi Σk wik ‖xi − μk‖²
where wik = 1 if data point xi belongs to cluster k and wik = 0 otherwise, and μk is the centroid of cluster k.
It's a minimization problem in two parts. We first minimize J w.r.t. wik while treating μk as fixed, then minimize J w.r.t. μk while treating wik as fixed. Technically speaking, we differentiate J w.r.t. wik first and update the cluster assignments (E-step). Then we differentiate J w.r.t. μk and recompute the centroids after the cluster assignments from the previous step (M-step). Therefore, the E-step is:

wik = 1 if k = argminj ‖xi − μj‖², and wik = 0 otherwise.
In other words, assign the data point xi to the closest cluster, as judged by its squared distance from the cluster's centroid.
And the M-step is:

μk = Σi wik xi / Σi wik
Which translates to re-computing the centroid of each cluster to reflect the new assignments.
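The two alternating steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a production implementation, and the function and variable names are ours:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal k-means via alternating E- and M-steps."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k distinct random data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # E-step: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # M-step: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # converged: the update no longer moves the centroids
        centroids = new_centroids
    return labels, centroids
```

Note that this sketch assumes no cluster ever becomes empty; library implementations handle that case explicitly.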
A few things to note here:
- The objective function is non-convex, so the algorithm is only guaranteed to converge to a local optimum, not the global one.
- The result depends on the initial centroids, so it is common to run the algorithm several times with different random initializations and keep the solution with the lowest SSE.
Understanding K-means Clustering Algorithm
For this algorithm to work, we first need to decide the number of clusters into which we want to group our data. Based on that, the algorithm assigns each data point to one of the clusters. The data points in a cluster are more similar to each other than to data points in other clusters.
For example, let's say we have data points about customers in a shop, described by two variables, such as the amount spent and the frequency of visits.
When the variables are plotted against each other, they will look something like below:
Now, these customers can be grouped into separate clusters based on their buying patterns. Let's say the marketing team wants to group them into four different clusters. We can run the k-means algorithm on these data points with k=4 (yes, the k stands for the number of clusters!).
The result would be four different sets of data points.
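A sketch of this segmentation with scikit-learn, using simulated customer data since the actual values in the example are not specified:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customer data with two variables per customer:
# four buying-pattern groups are simulated around distinct centers.
rng = np.random.default_rng(42)
centers = np.array([[20, 5], [20, 50], [80, 5], [80, 50]])
X = np.vstack([rng.normal(c, 2.0, (50, 2)) for c in centers])

# Fit k-means with k=4, the number of desired customer segments
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
print(km.labels_[:5])       # cluster assignment of the first five customers
print(km.cluster_centers_)  # one centroid per cluster
```

Each customer ends up with a label from 0 to 3, and each centroid summarizes the typical buying pattern of its segment.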
?
We can see the high-spending cluster (in red) and the low-spending cluster (in blue). Clustering thus allows us to cater to four different kinds of customers with four different buying patterns.
This is just a small example with only two variables. K-means can cluster data points with a much higher number of variables, though such clusters become trickier to visualize.
Choosing the optimal K value:
One way of choosing the k value is the elbow method. First, compute the sum of squared errors (SSE) for a range of k values. SSE is the sum of the squared distances between each member of a cluster and its centroid. If you plot SSE against k, you will see that the error decreases as k increases: with more clusters, each point is closer to its centroid, so the distortion is smaller. The idea of the elbow method is to choose the k at the "elbow" of the curve, the point beyond which increasing k yields only small reductions in SSE.
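The elbow method can be sketched as follows; the synthetic dataset (with four true clusters) is an assumption for illustration:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 well-separated clusters
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=0)

sse = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse.append(km.inertia_)  # inertia_ is the SSE to each point's centroid

# SSE always shrinks as k grows; look for the "elbow" where the drop flattens
for k, err in zip(range(1, 9), sse):
    print(f"k={k}: SSE={err:.1f}")
```

Plotting `sse` against k would show a sharp drop up to k=4 and only marginal improvement afterwards, which is the elbow.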
History
The term "k-means" was first used by James MacQueen in 1967 in his paper "Some Methods for Classification and Analysis of Multivariate Observations". The standard algorithm was used at Bell Labs in 1957 as part of a technique for pulse-code modulation. It was also published in 1965 by E. W. Forgy, which is why the standard algorithm is often known as the Lloyd–Forgy method.
How does the K-means clustering algorithm work?
This algorithm works in an interesting way.
It first randomly places k vectors, called centroids. Each centroid represents a cluster.
We can see the four large circles: those are the centroid vectors from clustering the four groups of customers we encountered earlier.
Once the centroids are placed, the algorithm assigns each data point to the cluster of its closest centroid.
The algorithm then moves the centroids and keeps repeating the previous step until suitable clusters are formed. A suitable cluster formation is one where the points in a cluster are more similar to each other than to points in the other clusters.
Each centroid is the vector mean of the data points in its cluster, hence the term k-means clustering.
There are various ways to measure the suitability of clusters. Since clustering is unsupervised, there are no correct answers, so the metrics for a clustering model are geared toward measuring the similarity of data points within a cluster.
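One such metric is the silhouette score, which compares each point's distance to its own cluster against its distance to the nearest other cluster. A sketch on synthetic data (the three-cluster dataset is an assumption for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 3 well-separated groups
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

scores = {}
for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Silhouette ranges from -1 to 1; higher means tighter, better-separated clusters
    scores[k] = silhouette_score(X, labels)
    print(k, round(scores[k], 3))
```

On this data the score peaks at k=3, matching the true number of groups.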
Applications
The k-means algorithm is very popular and used in a variety of applications such as market segmentation, document clustering, image segmentation, and image compression. When we undertake a cluster analysis, the goal is usually either to get a meaningful intuition about the structure of the data, or to follow a cluster-then-predict approach, where separate models are built for the different subgroups.
We'll apply clustering to two cases:
i) Geyser eruptions segmentation (2D dataset).
ii) Image compression.
K-means on Geyser’s Eruptions Segmentation
We'll first implement the k-means algorithm on a 2D dataset and see how it works. The dataset has 272 observations and 2 features. The data covers the waiting time between eruptions and the duration of the eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA. We will try to find K subgroups within the data points and group them accordingly. Below is a description of the features:
- eruptions: duration of an eruption, in minutes.
- waiting: waiting time until the next eruption, in minutes.
We'll use this data because it's easy to plot and to spot the clusters visually, since it's a 2-dimensional dataset. It's obvious that we have 2 clusters. Let's standardize the data first and run the k-means algorithm on the standardized data with K=2.
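A sketch of this step with scikit-learn, using a few illustrative rows of the Old Faithful data in place of the full 272-row file:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# A few rows of the Old Faithful data: (eruption duration, waiting time),
# both in minutes; the full dataset has 272 such rows.
X = np.array([
    [3.600, 79], [1.800, 54], [3.333, 74], [2.283, 62],
    [4.533, 85], [2.883, 55], [4.700, 88], [3.600, 85],
    [1.950, 51], [4.350, 85], [1.833, 54], [3.917, 84],
    [4.200, 78], [1.750, 47], [4.700, 83], [2.167, 52],
])

# Standardize first: waiting times are ~20x larger than durations, so
# unscaled distances would be dominated by the waiting-time feature.
X_std = StandardScaler().fit_transform(X)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_std)
print(km.labels_)
```

The two resulting clusters correspond to short-eruption/short-wait and long-eruption/long-wait behavior.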
The above graph shows a scatter plot of the data, colored by the cluster each point belongs to. In this example, we chose K=2. The '*' symbol marks the centroid of each cluster. We can think of those 2 clusters as the geyser exhibiting different kinds of behavior under different scenarios.
K-means on Image Compression
In this part, we'll implement k-means to compress an image. The image we'll work on is 396 x 396 x 3: for each pixel location we have three 8-bit integers that specify the red, green, and blue intensity values. Our goal is to reduce the number of colors to 30 and represent (compress) the photo using those 30 colors only. To pick which colors to use, we'll run the k-means algorithm on the image, treating every pixel as a data point. That means reshaping the image from height x width x channels to (height * width) x channels, i.e. 396 x 396 = 156,816 data points in the 3-dimensional RGB space.
Doing so will allow us to represent the image using the 30 centroids as a color palette, reducing its size by a factor of roughly 5. The original image size was 396 x 396 x 24 = 3,763,584 bits; the compressed image would be 30 x 24 + 396 x 396 x 5 = 784,800 bits. The difference comes from storing, for each pixel location, only a 5-bit index into the 30-color centroid lookup table (2^5 = 32 ≥ 30) instead of the full 24-bit color.
From now on we'll use the sklearn implementation of k-means. A few things to note here:
- By default, sklearn runs the algorithm with several different centroid initializations (the n_init parameter) and keeps the best result.
- The number of iterations and the convergence tolerance can be controlled with the max_iter and tol parameters.
IMPLEMENTATION:
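A sketch of the compression step. A real photo would be loaded with e.g. `matplotlib.image.imread`; here a small random array stands in for the image, and the function name is ours:

```python
import numpy as np
from sklearn.cluster import KMeans

def compress_image(img, n_colors=30, seed=0):
    """Compress an RGB image by quantizing its colors with k-means."""
    h, w, c = img.shape
    pixels = img.reshape(-1, c).astype(float)  # (h*w, 3) data points
    km = KMeans(n_clusters=n_colors, n_init=4, random_state=seed).fit(pixels)
    # Replace every pixel by its cluster's centroid color
    compressed = km.cluster_centers_[km.labels_].reshape(h, w, c)
    return compressed.astype(np.uint8)

# Demo on a small random "image" (stand-in for the 396 x 396 photo)
img = np.random.default_rng(0).integers(0, 256, (32, 32, 3), dtype=np.uint8)
out = compress_image(img, n_colors=8)
print(len(np.unique(out.reshape(-1, 3), axis=0)))  # at most 8 distinct colors
```

The output array has the same shape as the input but draws every pixel from the centroid palette.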
Real World Use-Cases of K-Means Clustering Algorithm
1. Document Classification
Cluster documents into multiple categories based on tags, topics, and content. K-means is a highly suitable algorithm for this purpose. Initial processing is needed to represent each document as a vector, using term frequency to identify commonly used terms. The document vectors are then clustered to identify groups of similar documents.
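This pipeline can be sketched with scikit-learn's TF-IDF vectorizer; the toy corpus below (two finance-flavored and two sports-flavored sentences) is our own illustrative assumption:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Toy corpus: two hypothetical topics, two documents each
docs = [
    "the stock market rallied as shares rose",
    "investors bought shares after the market dip",
    "the team scored a late goal to win the match",
    "a stunning goal decided the football match",
]

# Represent each document as a TF-IDF vector, then cluster the vectors
vectors = TfidfVectorizer().fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)
print(km.labels_)
```

Documents sharing vocabulary ("market", "shares" vs. "goal", "match") end up in the same cluster.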
2. Delivery Store Optimization
Optimize the process of goods delivery using trucks and drones, by using k-means to find the optimal number of launch locations and a genetic algorithm to solve the truck route as a traveling salesman problem.
3. Identifying Crime Localities
With data on crimes in specific localities of a city, the category of crime, the area of the crime, and the association between the two can give quality insight into crime-prone areas within a city or locality.
4. Customer Segmentation
Clustering helps marketers understand their customer base, work on target areas, and segment customers based on purchase history, interests, or activity.
5. Fantasy League Stat Analysis
Analyzing player stats has always been a critical element of the sporting world, and with increasing competition, machine learning has a critical role to play here. As an interesting exercise, if you would like to create a fantasy draft team and identify similar players based on player stats, k-means can be a useful option.
6. Insurance Fraud Detection
Machine learning has a critical role to play in fraud detection, with numerous applications in automobile, healthcare, and insurance claims. Using historical data on fraudulent claims, new claims can be flagged based on their proximity to clusters that indicate fraudulent patterns. Since insurance fraud can have a multi-million-dollar impact on a company, the ability to detect fraud is crucial.
7. Rideshare Data Analysis
The publicly available Uber ride dataset provides a large amount of valuable data about traffic, transit times, peak pickup localities, and more. Analyzing this data is useful not just in the context of Uber but also for gaining insight into urban traffic patterns and planning the cities of the future.
8. Cyber-profiling Criminals
Cyber-profiling is the process of collecting data from individuals and groups to identify significant correlations. The idea is derived from criminal profiling, which provides the investigation division with information for classifying the types of criminals present at a crime scene.
9. Call Record Detail Analysis
A call detail record (CDR) is the information captured by telecom companies for the calls, SMS, and internet activity of a customer. Combined with customer demographics, this information provides greater insight into the customer's needs.
10. Automatic Clustering of IT Alerts
Components of large enterprise IT infrastructure, such as networks, storage, and databases, generate large volumes of alert messages. Because alert messages potentially point to operational issues, they must be screened and prioritized for downstream processes; clustering similar alerts automatically reduces that manual effort.
THANK YOU!!!