K-Means Clustering - Use Cases

What is Clustering?

Clustering is one of the most common exploratory data analysis techniques, used to get an intuition about the structure of the data. It can be defined as the task of identifying subgroups in the data such that data points in the same subgroup (cluster) are very similar while data points in different clusters are very different. In other words, we try to find homogeneous subgroups within the data such that data points in each cluster are as similar as possible according to a similarity measure such as Euclidean distance or correlation-based distance. The choice of similarity measure is application-specific.

Clustering analysis can be done on the basis of features, where we try to find subgroups of samples based on their features, or on the basis of samples, where we try to find subgroups of features based on the samples. Here we'll cover clustering based on features. Clustering is used in market segmentation, where we try to find customers that are similar to each other in terms of behaviors or attributes; in image segmentation/compression, where we try to group similar regions together; in document clustering based on topics; and so on.

K-Means Algorithm

K-means is one of the most popular machine learning algorithms for performing cluster analysis. It has been used for customer segmentation, delivery optimization, topic modeling of text data, network traffic segmentation, and more.

Machine learning algorithms are of two kinds:

  • Supervised algorithms: The model learns to predict answers from a given set of correct answers.
  • Unsupervised algorithms: The model does not learn from correct answers; it learns to detect patterns in a given dataset without any examples of correct answers.

K-means clustering is an example of an unsupervised algorithm. It works on a basic principle – the closeness (distance) between data points. There is no correct label or outcome.

The K-means algorithm is an iterative algorithm that tries to partition the dataset into K pre-defined, distinct, non-overlapping subgroups (clusters), where each data point belongs to only one group. It tries to make the intra-cluster data points as similar as possible while also keeping the clusters as different (far apart) as possible. It assigns data points to clusters such that the sum of the squared distances between the data points and their cluster's centroid (the arithmetic mean of all the data points that belong to that cluster) is minimized. The less variation we have within clusters, the more homogeneous (similar) the data points are within the same cluster.


The k-means algorithm works as follows:

  1. Specify the number of clusters K.
  2. Initialize centroids by first shuffling the dataset and then randomly selecting K data points for the centroids, without replacement.
  3. Keep iterating until there is no change to the centroids, i.e., the assignment of data points to clusters isn't changing:

  • Compute the sum of the squared distance between data points and all centroids.
  • Assign each data point to the closest cluster (centroid).
  • Compute the centroids for the clusters by taking the average of all data points that belong to each cluster.
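The steps above can be sketched directly in NumPy. This is a minimal illustration (the function name and demo data are invented for this example, not taken from a library), not a production implementation:

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Minimal k-means; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 2: pick K distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # E-step: compute distances and assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # M-step: move each centroid to the mean of its assigned points
        # (keep the old centroid if a cluster ends up empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 3: stop once the centroids (and hence the assignments) settle.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Demo on two well-separated blobs of points.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, size=(50, 2)),
               rng.normal(5.0, 0.3, size=(50, 2))])
centroids, labels = kmeans(X, 2)
print(centroids.round(1))   # one centroid near (0, 0), one near (5, 5)
```

In practice you would normally use sklearn's KMeans (introduced later in this article) rather than a hand-rolled loop.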

The approach k-means follows to solve the problem is called Expectation-Maximization. The E-step assigns the data points to the closest cluster; the M-step computes the centroid of each cluster. Below is a breakdown of how we can solve it mathematically (feel free to skip it).

The objective function is:

J = Σ_{i=1..m} Σ_{k=1..K} w_ik · ||x_i − μ_k||²


where w_ik = 1 if data point x_i belongs to cluster k, and w_ik = 0 otherwise. Also, μ_k is the centroid of cluster k.

It’s a minimization problem in two parts. We first minimize J w.r.t. w_ik with μ_k fixed; then we minimize J w.r.t. μ_k with w_ik fixed. Technically speaking, we differentiate J w.r.t. w_ik first and update the cluster assignments (E-step); then we differentiate J w.r.t. μ_k and recompute the centroids after the cluster assignments from the previous step (M-step). Therefore, the E-step is:

w_ik = 1 if k = argmin_j ||x_i − μ_j||², and w_ik = 0 otherwise


In other words, assign the data point x_i to the closest cluster as judged by its squared distance from the cluster’s centroid.

And M-step is:

μ_k = ( Σ_{i=1..m} w_ik · x_i ) / ( Σ_{i=1..m} w_ik )


Which translates to re-computing the centroid of each cluster to reflect the new assignments.

A few things to note here:

  • Since clustering algorithms, including k-means, use distance-based measurements to determine the similarity between data points, it’s recommended to standardize the data to have a mean of zero and a standard deviation of one, since the features in a dataset almost always have different units of measurement (such as age vs. income).
  • Given k-means’ iterative nature and the random initialization of centroids at the start of the algorithm, different initializations may lead to different clusters, since the algorithm may get stuck in a local optimum and not converge to the global optimum. Therefore, it’s recommended to run the algorithm with several different centroid initializations and pick the run that yielded the lowest sum of squared distances.
  • No change in the assignment of examples is equivalent to no change in the within-cluster variation:

Σ_{k=1..K} Σ_{i=1..m} w_ik · ||x_i − μ_k||²
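To illustrate the standardization point above, here is a small sketch (the customer numbers are invented) showing why features on very different scales need rescaling before distances are computed. The manual transform below is the same one sklearn's StandardScaler applies:

```python
import numpy as np

# Hypothetical customers: [age in years, annual income in dollars].
X = np.array([[25.0, 40_000.0],
              [60.0, 42_000.0],
              [27.0, 95_000.0]])

# Raw Euclidean distance is dominated by income, simply because its
# numbers are thousands of times larger than the ages.
print(np.linalg.norm(X[0] - X[1]))   # driven almost entirely by the $2,000 income gap

# Standardize: zero mean, unit standard deviation per feature.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_std.round(2))   # now age and income contribute on an equal footing
```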


Understanding K-means Clustering Algorithm

For this algorithm to work, we first need to decide the number of clusters into which we want to group our data. Based on that, the algorithm assigns each data point to one of the clusters. All the data points in a cluster are more similar to each other than to data points in different clusters.

For example, let’s say we have data points about customers in a shop, with two variables:

  • the price of the goods that they bought, and
  • the number of goods that they purchased.

When the variables are plotted against each other, they will look something like below:

[Figure: scatter plot of price vs. number of goods purchased]

Now, these customers can be grouped into separate clusters based on their buying patterns. Let’s say the marketing team wants to group them into four different clusters. We can run a k-means clustering algorithm on these data points with k=4 (yes, the k stands for the number of clusters!).

The result would be four different sets of data points.

[Figure: the same scatter plot, with the four clusters shown in different colors]


We can see the high-spending cluster (in red) and the low-spending cluster (in blue). Thus, clustering allows us to cater to the four different kinds of customers with four different buying patterns.

This is just a small example where we had only two variables. With the k-means clustering algorithm, we can cluster data points with a much higher number of variables. However, it becomes trickier to visualize such clusters.



Choosing the optimal K value:

One way of choosing the k value is to use the elbow method. First, you compute the sum of squared errors (SSE) for several values of k. The SSE is the sum of the squared distances between each member of a cluster and its centroid. If you plot k against the SSE, you will see that the error decreases with increasing k: with more clusters, each data point is closer to its centroid, so the distortion shrinks. The idea of the elbow method is to choose the k value at which the rate of decrease sharply slows – the "elbow" of the curve.
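A sketch of the elbow method using sklearn (the three-blob dataset below is synthetic, invented purely for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three well-separated blobs of 60 points each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, size=(60, 2)) for c in (0.0, 5.0, 10.0)])

# Compute the SSE (sklearn calls it inertia_) for a range of k values.
sse = {}
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse[k] = km.inertia_

for k, err in sse.items():
    print(k, round(err, 1))
# The SSE drops steeply up to k=3 and only slightly after it:
# the "elbow" at k=3 matches the 3 blobs we generated.
```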

History

The term "k-means" was first used by James MacQueen in 1967 in his paper "Some Methods for Classification and Analysis of Multivariate Observations". The standard algorithm had already been used at Bell Labs by Stuart Lloyd in 1957 as part of a technique for pulse-code modulation, though it was not published outside the lab until much later. It was also published in 1965 by E. W. Forgy, which is why the standard algorithm is often known as the Lloyd–Forgy method.

How does the K-means clustering algorithm work?

The algorithm works in an interesting way.

It first randomly places k vectors, which are called centroids. Each centroid represents a cluster.

[Figure: the customer data with four large circles marking the centroids]


We can see the four large circles – those are the centroid vectors from clustering the four groups of customers that we encountered earlier.

Once the centroids are placed, the algorithm assigns each data point to the cluster of its closest centroid.

The algorithm then moves the centroids around and keeps repeating the previous step until suitable clusters are formed. A suitable cluster formation is one where the points in a cluster are more similar to each other than to points in the other clusters.

[Figure: the centroids and clusters after convergence]


The centroids represent the vector mean of the data points in their cluster – hence the name k-means clustering.

There are various ways to measure the suitability of clusters. Since clustering is an unsupervised task, there are no correct answers, so the metrics for a clustering model are geared towards measuring the similarity of data points within a cluster.

Applications

The k-means algorithm is very popular and used in a variety of applications, such as market segmentation, document clustering, image segmentation, and image compression. Usually, when we undertake a cluster analysis, the goal is either to:

  1. Get a meaningful intuition about the structure of the data we’re dealing with.
  2. Cluster-then-predict, where different models are built for different subgroups, if we believe there is wide variation in the behaviors of those subgroups. An example is clustering patients into subgroups and building a model for each subgroup to predict the risk of a heart attack.

We will apply clustering to two cases:

i) Geyser eruptions segmentation (2D dataset).

ii) Image compression.

K-means on Geyser’s Eruptions Segmentation

We’ll first implement the k-means algorithm on a 2D dataset and see how it works. The dataset has 272 observations and 2 features. The data covers the waiting time between eruptions and the duration of the eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA. We will try to find K subgroups within the data points and group them accordingly. Below is the description of the features:

  • eruptions (float): Eruption time in minutes.
  • waiting (int): Waiting time to the next eruption, in minutes.



We’ll use this data because it’s easy to plot and visually spot the clusters, since it’s a 2-dimensional dataset. It’s obvious that we have 2 clusters. Let’s standardize the data first and run the k-means algorithm on the standardized data with K=2.

[Figure: standardized geyser data colored by cluster, with ‘*’ marking each centroid]


The above graph shows the scatter plot of the data, colored by the cluster each point belongs to. In this example, we chose K=2. The symbol ‘*’ marks the centroid of each cluster. We can think of these 2 clusters as the geyser exhibiting different kinds of behavior under different scenarios.
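A sketch of this workflow with sklearn. The eight observations below are invented stand-ins shaped like the eruptions/waiting data; in practice you would load all 272 rows:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Stand-in sample: [eruption minutes, waiting minutes].
X = np.array([
    [1.8, 54], [1.9, 51], [2.0, 55], [2.1, 58],   # short eruptions, short waits
    [4.3, 80], [4.5, 84], [4.1, 78], [4.7, 88],   # long eruptions, long waits
], dtype=float)

X_std = StandardScaler().fit_transform(X)   # zero mean, unit variance per feature
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_std)

print(km.labels_)            # one cluster id per observation
print(km.cluster_centers_)   # the two '*' centroids, in standardized units
```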

K-means on Image Compression

In this part, we’ll implement k-means to compress an image. The image we’ll be working on is 396 x 396 x 3. Therefore, for each pixel location we have three 8-bit integers that specify the red, green, and blue intensity values. Our goal is to reduce the number of colors to 30 and represent (compress) the photo using only those 30 colors. To pick which colors to use, we’ll run the k-means algorithm on the image and treat every pixel as a data point. That means reshaping the image from height x width x channels to (height * width) x channels, i.e., we have 396 x 396 = 156,816 data points in 3-dimensional RGB space.

Doing so allows us to represent the image using the 30 centroids as a color palette, which significantly reduces its size. The original image takes 396 x 396 x 24 = 3,763,584 bits. The compressed image takes 30 x 24 + 396 x 396 x 5 = 784,800 bits, a reduction of roughly a factor of 4.8. The difference comes from using the centroids as a lookup table for the pixels’ colors, which reduces each pixel from 24 bits to a 5-bit palette index (5 bits suffice because 2^5 = 32 ≥ 30).

From now on we will be using the sklearn implementation of k-means. A few things to note here:

  • n_init is the number of times k-means is run with different centroid initializations; the result of the best run is reported.
  • tol is the within-cluster-variation tolerance used to declare convergence.
  • The default for init is k-means++, which is supposed to yield better results than plain random initialization of the centroids.

IMPLEMENTATION:

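A self-contained sketch of the compression pipeline described above. The small random stand-in image and the parameter choices are mine; in practice you would load the actual 396 x 396 photo (e.g. with matplotlib.pyplot.imread):

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for the photo: a small random RGB image so the sketch runs on its own.
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)

h, w, c = img.shape
pixels = img.reshape(-1, c).astype(float)   # (h*w) x 3 data points in RGB space

# Learn a 30-color palette: each centroid is one palette color.
km = KMeans(n_clusters=30, n_init=3, random_state=0).fit(pixels)

# Compress: store the palette plus one small index per pixel, then
# rebuild the image by looking each pixel's color up in the palette.
palette = km.cluster_centers_.round().astype(np.uint8)
compressed = palette[km.labels_].reshape(h, w, c)

print(compressed.shape)                                    # (64, 64, 3)
print(len(np.unique(compressed.reshape(-1, 3), axis=0)))   # at most 30 colors
```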


Real World Use-Cases of K-Means Clustering Algorithm

1. Document Classification

Cluster documents into multiple categories based on tags, topics, and the content of the document. This is a very standard clustering problem, and k-means is a highly suitable algorithm for it. The initial processing of the documents represents each document as a vector, using term frequency to identify commonly used terms that help characterize the document. The document vectors are then clustered to help identify groups of similar documents.
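One way this pipeline can be sketched with sklearn (the four-document corpus below is invented for illustration): TF-IDF vectors fed straight into k-means.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# A tiny made-up corpus: two documents about football, two about cooking.
docs = [
    "the team won the football match after a late goal",
    "the striker scored a goal and the team won the league",
    "add the onions and garlic and simmer the sauce slowly",
    "season the sauce with garlic and salt and simmer gently",
]

# Represent each document as a TF-IDF vector, then cluster the vectors.
vectors = TfidfVectorizer(stop_words="english").fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)

print(km.labels_)   # documents about the same topic share a cluster id
```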

2. Delivery Store Optimization

Optimize the delivery of goods using truck-launched drones, combining k-means to find the optimal number of launch locations with a genetic algorithm to solve the truck route as a traveling salesman problem.

3. Identifying Crime Localities

With data available on crimes in specific localities of a city, the category of crime, the area of the crime, and the association between the two can give quality insight into crime-prone areas within a city or a locality.

4. Customer Segmentation

Clustering helps marketers improve their customer base, work on target areas, and segment customers based on purchase history, interests, or activity monitoring.

5. Fantasy League Stat Analysis

Analyzing player stats has always been a critical element of the sporting world, and with increasing competition, machine learning has a critical role to play here. As an interesting exercise, if you would like to create a fantasy draft team and identify similar players based on player stats, k-means can be a useful option.

6. Insurance Fraud Detection

Machine learning has a critical role to play in fraud detection and has numerous applications in automobile, healthcare, and insurance fraud detection. Utilizing historical data on fraudulent claims, it is possible to flag new claims based on their proximity to clusters that indicate fraudulent patterns. Since insurance fraud can have a multi-million-dollar impact on a company, the ability to detect fraud is crucial.

7. Rideshare Data Analysis

The publicly available Uber ride information dataset provides a large amount of valuable data around traffic, transit time, peak pickup localities, and more. Analyzing this data is useful not just in the context of Uber but also in providing insight into urban traffic patterns and helping us plan for the cities of the future.

8. Cyber-profiling Criminals

Cyber-profiling is the process of collecting data from individuals and groups to identify significant correlations. The idea of cyber-profiling is derived from criminal profiles, which provide information to the investigation division to classify the types of criminals present at a crime scene.

9. Call Record Detail Analysis

A call detail record (CDR) is the information captured by telecom companies during a customer’s call, SMS, and internet activity. Combined with customer demographics, this information provides greater insight into the customer’s needs.

10. Automatic Clustering of IT Alerts

Large enterprise IT infrastructure components such as networks, storage, and databases generate large volumes of alert messages. Because alert messages potentially point to operational issues, they must be screened and prioritized for downstream processes – a manual effort that automatic clustering of similar alerts can greatly reduce.

THANK YOU!!!






#linuxworld #vimaldaga #righteducation

