Ever Wondered How Similar Patterns Are Identified? A Complete Guide to K-Means, K-Means++, K-Medoids & PAM Clustering.
Chandra Prakash Bathula
Adjunct Faculty at Saint Louis University | Machine Learning Practitioner | Web Developer | GenAI Developer
Have you ever wondered how internet companies identify similar customer behaviors and offer personalized discounts based on purchasing power and preferences? Take Amazon, any other retail application, or any tech giant with a large user base.
Let’s imagine we have a task at one of these companies to identify groups of people who exhibit similar behavior and uncover patterns. How would we tackle this challenge?
To address such tasks and uncover behavioral patterns, we turn to a powerful technique in Machine Learning called Clustering. Originally used in Data Mining, clustering can also serve as a crucial preprocessing step in various Machine Learning algorithms.
Now, let’s dive into the world of Clustering:
Clustering is the process of grouping similar data points based on their inherent similarities, such as customer preferences, purchase history, or browsing habits. By applying clustering algorithms, distinct clusters or groups can be automatically identified within a dataset. Each cluster represents individuals who share common characteristics and behaviors, empowering companies to gain insights into their customer base and tailor strategies accordingly.
Clustering uncovers hidden patterns, allowing businesses to understand preferences, target specific demographics, and offer personalized recommendations. It enables companies to streamline marketing efforts, enhance customer satisfaction, and drive revenue growth by segmenting their customer base and catering to unique needs and preferences.
Through clustering, companies can harness the power of data to better understand customers and create personalized experiences. Behind the scenes, clustering algorithms work diligently, analyzing behavior and identifying patterns to provide an enhanced shopping experience. By leveraging clustering, businesses unlock the potential of data, enabling them to deliver tailored strategies that keep customers engaged and coming back for more.
Among the many clustering algorithms, K-Means, hierarchical clustering, and DBSCAN are the most widely used.
K-Means Clustering:
Now, let’s delve into the K-Means algorithm, which is known for its simplicity and efficiency. In K-Means clustering, the parameter K represents the number of clusters, and it is a hyperparameter that needs to be determined. The optimal value for K can be found using ideas like cross-validation (CV).
Geometrically,
K-Means clustering involves assigning groups of points to centroids. Assuming we have a dataset D and K clusters, we will have one centroid per cluster:
Centroids: (C1, C2, C3, …, Ck).
Sets: (S1, S2, S3, …, Sk).
D = S1 U S2 U … U Sk, and Si ∩ Sj is an empty set for every i ≠ j (the sets do not overlap).
Mathematically,
So the core idea is to find K centroids and then the corresponding K sets of points based on those centroids.
But now comes the big problem: how can we identify the K centroids?
K-Centroids = C1, C2, C3, …, Ck
and their corresponding sets of points
K-Sets = S1, S2, S3, …, Sk.
argmin over C1, …, Ck of Σ (over all points Xp in D) min (k = 1, …, K) ||Xp − Ck||²
Minimizing the above objective function exactly is a complex computational problem (it is NP-hard), so we rely on approximation algorithms to solve it. One widely used optimization algorithm is Lloyd’s algorithm, which combines a simple iterative scheme with a few clever strategies.
By employing these mathematical formulations and optimization algorithms, K-means clustering enables us to identify the optimal K-centroids, group similar data points, and uncover valuable insights from the data.
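To make the objective concrete, here is a minimal sketch (assuming NumPy arrays X for the data points and centroids for C1, …, Ck; both names are just for illustration) of how the K-Means objective can be evaluated for a given set of centroids:
import numpy as np

def kmeans_objective(X, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    # pairwise squared distances, shape (n_points, n_centroids)
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    # each point is charged its distance to the closest centroid
    return d2.min(axis=1).sum()
The smaller this value, the tighter the clusters; Lloyd’s algorithm described below tries to drive it down iteratively.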
Optimal Clustering:
An optimal clustering is one in which the intra-cluster distances are small and the inter-cluster distances are large.
K-Means: Lloyd’s Algorithm:
Lloyd’s algorithm is remarkably simple and intuitive, providing a solution that is as close as possible to the optimal one while remaining straightforward to implement. It serves as an excellent approximation method.
Let’s delve into Lloyd’s algorithm and explore how it tackles the challenge of finding centroids for K-means clustering.
Step 1: Initialization:
In this step, we randomly select “K” points from the dataset “D” as the initial K-centroids for clustering. This random initialization sets the foundation for subsequent steps.
Step 2: Assignment:
For each point ‘Xi’ in the dataset ‘D,’ we identify the nearest centroid “Cj” and assign ‘Xi’ to the corresponding set ‘Sj.’ This assignment is determined by evaluating the distances between ‘Xi’ and all the centroids ‘Cj.’
Step 3: Re-compute centroid stage/ Update stage:
Each centroid Cj is recomputed as the mean of the points currently assigned to its set:
Cj = (1/|Sj|) * Σ Xi, summing over all Xi ∈ Sj.
As a result, new centroids are obtained after each iteration.
Step 4: Termination:
To achieve convergence, we repeat steps 2 and 3 until little or no change is observed in the centroids. This iterative process is crucial for reaching a stable solution.
Convergence occurs when the differences between the old centroids (C1, C2, C3, …, Ck) and the new centroids (C1', C2', C3', …, Ck’) are sufficiently small. At this point, we can consider the new centroids as the final centroids. Ultimately, we obtain K centroids and K corresponding sets at the conclusion of this stage.
Lloyd’s algorithm follows a straightforward four-step process: Initialization, Assignment, Update (or Re-compute), and Repeat and Terminate. By using this approximation algorithm, we sidestep the computational complexity of the NP-hard K-Means problem.
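Putting the four steps together, here is a minimal sketch of Lloyd’s algorithm (the function name lloyds_kmeans and the default parameters are illustrative choices, not a standard API):
import numpy as np

def lloyds_kmeans(X, k, max_iter=100, tol=1e-4, seed=0):
    """A minimal sketch of Lloyd's algorithm with random initialization."""
    rng = np.random.default_rng(seed)
    # Step 1: Initialization - pick K points from the dataset at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # Step 2: Assignment - each point goes to its nearest centroid
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Step 3: Update - recompute each centroid as the mean of its set
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: Termination - stop when the centroids barely move
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    return centroids, labels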
Is that all? Will Lloyd’s algorithm solve all the issues?
The answer is no! While Lloyd’s algorithm offers a solution for the K-Means clustering problem, it does not address all the challenges associated with initialization. One significant concern is the problem of “Initialization Sensitivity.”
Appropriate & Inappropriate Initializations:
K-Means ++ :
K-Means++ is an advanced algorithm that improves upon the random initialization step in the traditional K-Means clustering method, offering better cluster initialization and mitigating the impact of outliers. While the overall framework of K-Means remains the same, the key difference lies in the initialization process.
In K-Means++, instead of randomly selecting the initial centroids, a smarter initialization scheme is used. The algorithm starts by randomly choosing one centroid, C1, from the dataset D. For each data point Xi in D, a distribution of distances from the centroid to every other point is calculated. The next centroid, C2, is then selected from the remaining data points in D, excluding C1, with a probability proportional to the squared distance, di, from the point to the previously chosen centroid, C1.
Mathematically,
the distance, d, between a data point Xi and centroid C1 is calculated using the squared Euclidean distance formula:
d = ||Xi − C1||²
The process continues iteratively, with subsequent centroids chosen in a similar manner until the desired number of clusters is obtained. By initializing centroids strategically, K-Means++ addresses the limitations of the classic K-Means and Lloyd’s algorithm. It provides more robust and accurate clustering results, particularly in scenarios with potential outliers.
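A minimal sketch of this seeding step (assuming X is a NumPy array of data points; the function name is only illustrative) might look like this:
import numpy as np

def kmeans_plus_plus_init(X, k, seed=0):
    """K-Means++ seeding: each new centroid is sampled with probability
    proportional to its squared distance to the nearest centroid chosen so far."""
    rng = np.random.default_rng(seed)
    # the first centroid is chosen uniformly at random
    centroids = [X[rng.integers(len(X))]]
    for _ in range(1, k):
        # squared distance from every point to its closest existing centroid
        d2 = np.min(
            ((X[:, None, :] - np.array(centroids)[None, :, :]) ** 2).sum(axis=2),
            axis=1,
        )
        # sample the next centroid with probability proportional to d squared
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)
These centroids then feed into the usual Lloyd iterations. Scikit-learn’s KMeans uses this scheme by default via init='k-means++'.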
The question may arise as to why we adopt a probabilistic approach instead of a deterministic one in the initialization step of K-Means++. If we always picked the farthest point deterministically, an outlier would be chosen every time; sampling proportionally to the squared distance still spreads the centroids out while reducing the chance of seeding on an outlier.
Limitation Of K-means:
-> K-Means has the following problems or limitations.
-> K-Means clustering is affected, and we cannot attain optimal results, when clusters differ in size, density, or shape (non-globular clusters).
And the most common problem is that K-Means can be heavily affected by outliers, since the mean is pulled toward extreme points.
Overcoming K-Means limitations:
K-Means is not a perfect solution but is a good approximation.
After all these computations and variations, can we interpret the results?
Not easily with classical K-Means or K-Means++, because all we get at the end are centroids. Originally, we are given the dataset D;
D = {X1, X2, …, Xn}.
At the end we are left with centroid points, and centroids are hard to interpret.
Example: in the case of text reviews, if we use BoW (Bag of Words) or any other vectorization technique, the centroid we get is just a geometric point in that space; after removing stop words, most word dimensions are 0, and the centroid usually does not correspond to any actual review.
Big Idea:
In this case, instead of some arbitrary centroid point, suppose the cluster representative is an actual point Xi from the dataset D. Then we can read that particular review, see whether it is positive or negative, and understand which group the cluster represents. This is much more interpretable.
The lack of interpretable centroids is the drawback of K-Means & K-Means++.
So the big idea is: instead of a computed centroid, what if each cluster is represented by a data point from D?
=> Ci = Xj, where Xj ∈ D.
Now the method is no longer based on mean centroids, and this variant is referred to as “K-Medoids”.
Because we no longer use the mean point as the center but an actual data point, this is called K-Medoids, and it is a very popular algorithm, especially when we have to (or want to) interpret or read the cluster centers and make sense of what they are.
One of the most popular algorithms for K-Medoids is called “Partitioning Around Medoids” (PAM).
Just as Lloyd’s algorithm is the standard algorithm for K-Means, PAM is the standard algorithm for K-Medoids.
It is very similar to K-Means, with subtle differences.
In the case of two clusters:
In every iteration, swap a medoid with the non-medoid points and compute the loss: for example, swap M1 with X2, X3, X4 & X5 while keeping M2 as is, then do the same for M2, trying all possible swaps.
We then look for the swap with the minimal loss.
Once we swap, we need to re-do the assignment step. A swap is kept only when it yields a lower loss value.
The key difference is that instead of recomputing the mean as in K-Means, we update the cluster representatives by swapping medoids.
This algorithm is called Partitioning Around Medoids (PAM).
The only mathematical quantity we need is ||Xi − Mj||².
||Xi − Mj||² => distance(Xi, Mj), where Mj belongs to the dataset D.
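Under those definitions, here is a simplified sketch of the PAM swap loop (the function name pam and the brute-force swap search are illustrative simplifications, not the full PAM algorithm with its BUILD phase):
import numpy as np

def pam(X, k, max_iter=100, seed=0):
    """Simplified Partitioning Around Medoids: keep swapping medoids with
    non-medoid points as long as the total loss decreases."""
    rng = np.random.default_rng(seed)
    medoid_idx = rng.choice(len(X), size=k, replace=False)

    def total_cost(idx):
        # every point pays the squared distance to its nearest medoid
        d2 = ((X[:, None, :] - X[idx][None, :, :]) ** 2).sum(axis=2)
        return d2.min(axis=1).sum()

    best_cost = total_cost(medoid_idx)
    for _ in range(max_iter):
        improved = False
        for m in range(k):                  # for every current medoid ...
            for j in range(len(X)):         # ... try every non-medoid point
                if j in medoid_idx:
                    continue
                candidate = medoid_idx.copy()
                candidate[m] = j            # swap medoid m with point j
                cost = total_cost(candidate)
                if cost < best_cost:        # keep only loss-reducing swaps
                    best_cost, medoid_idx, improved = cost, candidate, True
        if not improved:
            break
    # final assignment: each point goes to its nearest medoid
    d2 = ((X[:, None, :] - X[medoid_idx][None, :, :]) ** 2).sum(axis=2)
    return X[medoid_idx], d2.argmin(axis=1)
Because the representatives returned here are rows of X, each cluster "center" is a real data point that can be read and interpreted directly.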
Standard K-Means is also not easy to kernelize, whereas K-Medoids only needs pairwise distances, which makes it easier to use with arbitrary distance or similarity measures.
Determining right K:
1. One way of determining the right K is domain knowledge.
Eg: in the case of review analysis, we may know there are only 2 groups in the reviews (positive and negative) and hence two clusters.
2. If we cannot determine the right K from domain knowledge, we can use the Elbow method (or Knee method).
In the Elbow method we determine the right “K” by looking at the rate of change, or steepness, of the loss curve at a particular point, knowing that increasing “K” all the way drives the loss to zero, because eventually every point becomes its own cluster.
Elbow Method:
The Elbow or Knee method is a popular technique used in K-means clustering to determine the optimal number of clusters, denoted as K, for a given dataset. The method involves plotting the within-cluster sum of squares (WCSS) against the number of clusters, and identifying the point where adding more clusters does not significantly reduce the WCSS.
To find the optimal K using the Elbow method, the following steps can be followed:
1. Run K-Means for a range of K values (for example, 1 to 10).
2. Compute the WCSS for each value of K.
3. Plot the WCSS against K.
4. Choose the K at the “elbow”, after which adding more clusters only marginally reduces the WCSS.
The Elbow or Knee method does not have a specific mathematical equation. However, the WCSS calculation used in this method can be expressed as follows:
WCSS = Σ (over clusters k) Σ (over points x in cluster k) ||x − Ck||²
where the outer sum runs over the K clusters, the inner sum runs over the points x assigned to cluster k, and Ck is the centroid of that cluster.
By applying the Elbow or Knee method, data analysts can make an informed decision about the appropriate number of clusters for their specific dataset, balancing the trade-off between model complexity and the level of within-cluster variation.
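As a minimal illustration of the method with scikit-learn (the data below is random placeholder data; substitute your own feature matrix), the WCSS is exposed as the fitted model’s inertia_ attribute:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# placeholder data; replace with your own feature matrix
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))

wcss = []
ks = range(1, 11)
for k in ks:
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
    km.fit(X)
    wcss.append(km.inertia_)  # inertia_ is the WCSS of the fitted model

plt.plot(list(ks), wcss, marker='o')
plt.xlabel('Number of clusters K')
plt.ylabel('WCSS')
plt.title('Elbow Method')
plt.show()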
Time complexity & Space complexity:
Time complexity:
In K-Means there is no separate train and test phase,
so the running time is on the order of O(n * k * d * i), where n is the number of points, d the dimensionality, k the number of clusters, and i the number of iterations.
Typically we have around 9-10 clusters, i.e., K ≤ 10, and iterations i ≤ 300.
Since K and i are very small compared to n and d,
the time complexity is approximately O(n*d).
Space complexity:
O(n*d + k*d), assuming K is much smaller than n.
As we need to store all of the points, it can be viewed as
≈ O(n*d) -> linear.
K-Means is actually quite fast for real-world applications, which is why it is so popular and widely used.
Sample Code For K-Means:
You can read about the parameters in the scikit-learn KMeans documentation.
Below, K-Means is applied to an example dataset for customer-behavior clustering with five numeric features:
#Algorithm:
from sklearn.cluster import KMeans
import numpy as np
data = [
[35, 50000, 4, 10, 7.5],
[42, 75000, 8, 15, 8.2],
[28, 40000, 3, 5, 6.7],
[55, 100000, 2, 3, 9.0],
[38, 60000, 5, 12, 7.8],
[46, 85000, 9, 20, 8.9],
[31, 45000, 2, 6, 6.5],
[39, 70000, 7, 18, 8.4],
[27, 35000, 4, 8, 7.2],
[48, 90000, 6, 14, 8.7],
[33, 55000, 3, 7, 6.8],
[50, 95000, 5, 16, 8.1],
[25, 30000, 1, 4, 6.0],
[41, 80000, 8, 17, 8.5],
[29, 38000, 2, 9, 6.3],
[37, 65000, 6, 13, 7.9],
[43, 82000, 7, 19, 8.6],
[30, 42000, 3, 11, 6.9],
[52, 98000, 4, 15, 8.3],
[36, 58000, 5, 10, 7.6],
[44, 90000, 9, 21, 9.2],
[32, 50000, 2, 5, 6.6],
[49, 88000, 7, 16, 8.0],
[26, 32000, 4, 7, 6.4],
[40, 75000, 6, 15, 7.7],
[34, 52000, 3, 9, 6.7],
[51, 93000, 5, 17, 8.2],
[24, 28000, 1, 3, 5.8],
[45, 89000, 8, 20, 8.8],
[33, 48000, 3, 8, 6.7],
[47, 92000, 7, 18, 8.3]
]
X = np.array(data)
kmeans = KMeans(n_clusters=3, init='k-means++', n_init=10, max_iter=300, tol=0.0001, verbose=0, copy_x=True, algorithm='lloyd')  # 'full' was renamed to 'lloyd' in newer scikit-learn versions
kmeans.fit(X)
labels = kmeans.labels_
print(labels) # Prints the cluster labels assigned to each data point
predictions = kmeans.predict([[0, 0, 0, 0, 0], [12, 3, 0, 0, 0]])
print(predictions) # Prints the predicted clusters for the given data points
centers = kmeans.cluster_centers_
print(centers) # Prints the coordinates of the cluster centers
#2D_Visualization:
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
# Perform PCA to reduce dimensionality to 2
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
# Fit KMeans on the reduced data
kmeans.fit(X_pca)
# Get cluster labels
labels = kmeans.labels_
# Get cluster centers
centers = kmeans.cluster_centers_
# Visualize the clusters
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels)
plt.scatter(centers[:, 0], centers[:, 1], c='red', marker='X')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('KMeans Clustering')
plt.show()
Visualization in 3D
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.decomposition import PCA
# Perform PCA to reduce dimensionality to 3
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X)
# Fit KMeans on the reduced data
kmeans.fit(X_pca)
# Get cluster labels
labels = kmeans.labels_
# Get cluster centers
centers = kmeans.cluster_centers_
# Visualize the clusters
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X_pca[:, 0], X_pca[:, 1], X_pca[:, 2], c=labels)
ax.scatter(centers[:, 0], centers[:, 1], centers[:, 2], c='red', marker='X', s=200)
ax.set_xlabel('PC1')
ax.set_ylabel('PC2')
ax.set_zlabel('PC3')
ax.set_title('KMeans Clustering')
plt.show()
So, enjoy the read, and I hope it gave you useful insights into K-Means, K-Means++, K-Medoids, and PAM, and where each of them is useful.
References:
Images of Optimal Clustering, Optimal Initializations and Limitations are taken from:
Ala Al-Fuqaha (“AL”), Ph.D., Professor and Director, NEST Research Lab, Computer Science Department, College of Engineering & Applied Sciences; Editor, IEEE Communications Letters. Clustering Analysis lecture notes: https://cs.wmich.edu/alfuqaha/summer14/cs6530/lectures/ClusteringAnalysis.pdf