Everything to Know about Hierarchical Clustering: Agglomerative & Divisive Hierarchical Clustering
Chandra Prakash Bathula
Adjunct Faculty at Saint Louis University | Machine Learning Practitioner | Web Developer | GenAI Developer
Hierarchical Clustering: Agglomerative & Divisive Hierarchical Clustering.
Hierarchical Clustering:
Before we go on to its applications & benefits, remember that there are two types of “Hierarchical Clustering”. They are:
1. Agglomerative Hierarchical Clustering (typically the most popular) and
2. Divisive Hierarchical Clustering.
Let’s get some understanding of these two types and what they actually mean, so that we get an idea of the “hierarchical” part of hierarchical clustering.
=> Let’s start with an example.
Suppose we have a bunch of points. The way Agglomerative clustering works is as follows:
At the beginning, each point is a cluster by itself.
Next, it merges points (or clusters) based on some notion of similarity (or distance).
=> This is the core idea of Agglomerative clustering.
=> As we keep merging, cluster by cluster, we eventually end up with a single large cluster containing everything.
=> It is called “Agglomerative” because we start with very small clusters, each being just a single point.
“Agglomerative” means to group together: we are grouping points (or clusters) together to build larger clusters.
Each of these larger clusters contains more and more points. This is how agglomerative clustering works: it starts with individual points and keeps grouping them together.
** On the other hand, Divisive works in exactly the opposite direction.
Let’s see how it works:
Divisive starts off by saying that everything belongs to one cluster containing all the points, and then it keeps splitting clusters into smaller ones.
=> Agglomerative is the other way around.
* Hence, Agglomerative is the most popular; it is the variant that is typically studied and implemented.
Divisive is seldom used, and people typically use Agglomerative because grouping based on similarity or distance is easier than trying to break a cluster apart. This is not to say divisive is hard or impossible, but agglomerative is the more popular one.
=> So, these are the two types: agglomerative and divisive hierarchical clustering.
Now, I hope you understand why it is called “Hierarchical Clustering”.
Because, if we think about it, there is a hierarchy in this entire process.
Let’s see what this “hierarchy” looks like.
But remember that the key ingredient for “Agglomerative Clustering” to work is a similarity (or distance) measure, not just between points but between clusters too.
* The sequence of merges can be recorded as a tree, and hence the name “hierarchy”. In data mining and clustering, this tree is often referred to as a “Dendrogram”.
A dendrogram is nothing but a tree that records the sequence of merges (in the case of agglomerative clustering) or splits (in the case of divisive clustering).
=> One thing we can notice from this structure is that when we do “Agglomerative Clustering” (or “Divisive Clustering”) we never specified the number of clusters. Once we build the hierarchy, we can obtain whatever clusters we want by choosing where to cut the sequence of merges or splits.
Example:
In K-Means, if we set k=3 we get C1, C2, C3 as clusters, and if we then want k=7 we need to re-train to get the new clusters. There is no way to get them organically or easily; we have to redo the computation to get the desired output. In the case of “Agglomerative” clustering, changing the number of clusters is simple: we can go from 2 clusters to 5 clusters, and deeper and deeper, very easily, without having to re-compute anything.
***This is the most important property of “Hierarchical Clustering”, which makes it popular and a good alternative to K-Means.***
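To make this concrete, here is a minimal sketch (on hypothetical toy data, not the wine dataset used later) of building the hierarchy once with SciPy and then cutting it at different depths to get 2 or 5 clusters without any re-computation.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
# toy data: 30 random 2-D points (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
# the hierarchy is computed exactly once
Z = linkage(X, method='ward')
# cut the same hierarchy at different levels: no re-training needed
labels_k2 = fcluster(Z, t=2, criterion='maxclust')   # 2 clusters
labels_k5 = fcluster(Z, t=5, criterion='maxclust')   # 5 clusters
print(labels_k2)
print(labels_k5)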
Agglomerative Clustering Algorithm:
=> Proximity Matrix = Similarity Matrix = Distance Matrix.
Now, we will have a loop as follows:
Merge the two closest clusters,
and repeat with the updated proximity matrix until only a single cluster remains.
We ignore the diagonal values, since each of them is the distance of a point from itself; we only need to look at the off-diagonal values of this matrix.
***The key operation here is how to compute the proximity of two clusters, i.e., the similarity (or distance) between two clusters.***
Once we figure this out, the whole algorithm falls into place. We also need to be careful when updating the proximity matrix.
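Here is a rough, unoptimized sketch of that loop, assuming Euclidean distances and single-link (MIN) merging purely for illustration; for simplicity it re-derives cluster proximities from the point-level proximity matrix instead of updating rows and columns in place.
import numpy as np
def naive_agglomerative(X):
    # proximity (distance) matrix between individual points
    diff = X[:, None, :] - X[None, :, :]
    point_dist = np.sqrt((diff ** 2).sum(axis=2))
    clusters = [{i} for i in range(len(X))]   # each point starts as its own cluster
    merges = []
    while len(clusters) > 1:
        # find the pair of clusters with the smallest single-link (MIN) distance,
        # looking only at off-diagonal pairs
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(point_dist[i, j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[a] = clusters[a] | clusters[b]   # merged cluster = set union
        del clusters[b]
    return merges
# a few toy points to show the sequence of merges
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 5.2], [9.0, 0.0]])
for left, right, d in naive_agglomerative(X):
    print(left, "+", right, f"merged at distance {d:.2f}")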
Intermediate Situation:
Now, suppose we have to merge two clusters, say C3 & C4, and update the proximity matrix accordingly.
And the big question is: “How do we update the Proximity Matrix?”
Whichever clusters you merge, their rows and columns in the Proximity Matrix have to be replaced by a single row and column for the merged cluster.
And yet again, how do you define the similarity? How do you define inter-cluster similarity?
There are multiple approaches to compute it. They are:
1. MIN
2. MAX
3. Group Average
4. Distance Between Centroids
5. Ward’s Method
The only key idea is how you measure the similarity between two clusters. Once you figure that out, the rest of the algorithm is a cakewalk.
To keep things simple, when two clusters are merged, the new cluster is simply the set union of their points.
Defining Inter-Cluster Similarity:
Suppose we are given a cluster “C1” with one or more points, and another cluster “C2”. How do we define the similarity (or proximity) between the two clusters?
There are multiple approaches, all based on simple ideas. They are:
MIN:
Sim(C1, C2) = min[ Sim(Pi, Pj) ]
such that Pi belongs to C1 and Pj belongs to C2,
(or) d_min(C_i, C_j) = min{ dist(a, b) : a in C_i, b in C_j }.
This is how we define similarity using the MIN approach (a combined numerical sketch is given after the Group-Average section).
MAX is the exact opposite.
MAX:
Sim(C1, C2) = max[ Sim(Pi, Pj) ]
such that Pi belongs to C1 and Pj belongs to C2,
(or)
d_max(C_i, C_j) = max{ dist(a, b) : a in C_i, b in C_j }.
This is how the MAX computation works in Agglomerative Hierarchical Clustering (again, see the combined sketch after the Group-Average section).
Group-Average:
The Group Average (average linkage) proximity is the average of all pairwise distances between points that sit in different clusters:
Average intergroup distance = (sum of all pairwise distances between points in the two clusters) / (total number of such pairs).
If cluster C_i has n_i data points and cluster C_j has n_j data points, the total number of pairs between them is n_i * n_j, so:
d_avg(C_i, C_j) = ( 1 / (n_i * n_j) ) * sum over a in C_i, b in C_j of dist(a, b).
This is how the Group Average computation works.
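A small illustration of the three definitions above, assuming two hypothetical toy clusters and Euclidean distance; it computes every pairwise distance between C1 and C2 once and then reads off the MIN, MAX, and Group Average proximities.
import numpy as np
from scipy.spatial.distance import cdist
# two toy clusters (illustrative only)
C1 = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
C2 = np.array([[4.0, 4.0], [5.0, 5.0]])
pairwise = cdist(C1, C2)     # |C1| x |C2| matrix of dist(a, b)
d_min = pairwise.min()       # MIN / single linkage
d_max = pairwise.max()       # MAX / complete linkage
d_avg = pairwise.mean()      # Group Average: sum of all pairs / (n_i * n_j)
print(f"MIN: {d_min:.3f}  MAX: {d_max:.3f}  Group Average: {d_avg:.3f}")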
Distance between Centroids:
With all the points in a cluster we can compute its centroid; we then take the similarity (or distance) between the two cluster centroids as the similarity between the clusters. This approach is less popular.
***One important thing to note here is that every other computation only needs similarities between points; the centroid approach is the exception!***
=> Since Sim(Pi, Pj) is all we need in those cases, we can kernelize it.
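As a hedged sketch of that “kernelize it” remark: since MIN, MAX, and Group Average only ever need pairwise similarities between points, one option is to build a kernel similarity matrix (here an RBF kernel on hypothetical toy data), turn it into a dissimilarity, and pass it to scikit-learn's AgglomerativeClustering as a precomputed metric (Ward is excluded because it needs raw Euclidean coordinates).
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.cluster import AgglomerativeClustering
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))           # toy data (illustrative only)
K = rbf_kernel(X, gamma=0.5)           # kernel similarity between all pairs of points
D = 1.0 - K                            # convert similarity into a dissimilarity
# average (or single/complete) linkage works with a precomputed matrix;
# note: older scikit-learn versions call the `metric` parameter `affinity`
model = AgglomerativeClustering(n_clusters=3, metric='precomputed', linkage='average')
labels = model.fit_predict(D)
print(labels[:10])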
Comparison Of all the Computation Techniques:
Strength of MIN: It can handle non-elliptical (arbitrarily shaped) clusters.
Limitations: It is sensitive to noise and outliers.
Strength of MAX: It is less susceptible to noise and outliers.
Limitations: It tends to break large clusters and is biased toward globular clusters, so it cannot accommodate clusters of very different sizes.
MIN is also called the “Single Linkage” algorithm.
MAX is also called the “Complete Linkage” algorithm.
Group Average:
It is a compromise between MIN and MAX.
Strength of Group Average: It is less susceptible to noise and outliers.
Limitations: It is biased toward globular clusters.
And there is one more method for computing cluster similarity, called “Ward’s Method”.
Ward’s Method:
=> d(A, B) = sqrt( (n_A * n_B) / (n_A + n_B) ) * ||c_A - c_B||_2,
where n_A and n_B are the cluster sizes and c_A, c_B are the cluster centroids; squaring this quantity gives the increase in the within-cluster sum of squared errors caused by merging A and B.
Strength of Ward’s Method: It is less susceptible to noise and outliers, and it is the hierarchical analogue of K-Means, since it merges the pair of clusters that least increases the within-cluster variance.
Limitations: It is biased toward globular clusters.
Sometimes people even use it to initialize K-Means.
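A minimal sketch of that idea, on hypothetical blob data: run Ward agglomerative clustering first, compute the centroid of each resulting cluster, and hand those centroids to K-Means as its initial centers.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering, KMeans
X, _ = make_blobs(n_samples=500, centers=3, random_state=42)   # toy data
# hierarchical (Ward) clustering first
ward = AgglomerativeClustering(n_clusters=3, linkage='ward')
ward_labels = ward.fit_predict(X)
# centroid of each Ward cluster becomes an initial center for K-Means
init_centers = np.vstack([X[ward_labels == k].mean(axis=0) for k in range(3)])
kmeans = KMeans(n_clusters=3, init=init_centers, n_init=1, random_state=42)
kmeans_labels = kmeans.fit_predict(X)
print("K-Means inertia with Ward-based initialization:", round(kmeans.inertia_, 2))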
=> All of them are used in Agglomerative Clustering.
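All five criteria have counterparts in SciPy's linkage function ('single' for MIN, 'complete' for MAX, 'average' for Group Average, 'centroid' for distance between centroids, and 'ward'); here is a quick sketch on hypothetical blob data comparing the cluster sizes each one produces.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=200, centers=3, random_state=0)   # toy data
for method in ['single', 'complete', 'average', 'centroid', 'ward']:
    Z = linkage(X, method=method)                     # build the hierarchy
    labels = fcluster(Z, t=3, criterion='maxclust')   # cut it into 3 clusters
    sizes = np.bincount(labels)[1:]                   # cluster sizes (labels start at 1)
    print(f"{method:>9}: cluster sizes {sizes.tolist()}")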
Space & Time Complexity of Hierarchical Clustering:
Space Complexity:
It is roughly O(n²), where n is the number of data points, since we have to store the n × n proximity matrix.
Time Complexity:
=> It is roughly O(n³).
There are at most n iterations (merges).
Searching and updating the similarity matrix in each iteration is roughly ~O(n²).
=> O(n) iterations * O(n²) per iteration
=> ~O(n³).
“This is not the exact value, as we can optimize the updating and searching computations to get a better final time complexity.”
By using more sophisticated data structures (such as a heap/priority queue), we can bring it down further.
Typically, Space = O(n²)
and
Time = O(n² log n); but if you implement it by brute force, it will be O(n³).
One thing we need to remember is that
O(n²), O(n² log n), and O(n³) are quite large complexities,
because “n” can be large in real data.
Imagine we are given 1 million points: O(n²) = 10⁶ * 10⁶ => 10¹² entries.
Even at one byte per entry, O(n²) bytes => about 1000 GB, i.e., roughly 1 TB or more, and we would need it in RAM, which is impractical.
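A back-of-the-envelope check of that claim:
# rough memory estimate for the full n x n proximity matrix
n = 1_000_000                  # one million points
entries = n * n                # 10**12 pairwise entries
print(entries * 1 / 1e12, "TB at 1 byte per entry")   # ~1 TB even in the best case
print(entries * 8 / 1e12, "TB stored as float64")     # ~8 TB at 8 bytes per entry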
One big problem with “Agglomerative Clustering” is that it is not very useful when the data is very large, because of its time and space complexity.
Limitations of Hierarchical clustering:
1. There is no mathematical objective (or optimization) function that we are directly solving.
For example, K-Means directly minimizes a well-defined objective: the within-cluster sum of squared distances.
But we cannot connect hierarchical clustering to any of the existing optimization frameworks very easily. There are ways to connect it, but it is not straightforward.
2. Take any proximity computation in Agglomerative clustering, be it MIN, MAX, or Group Average; each has its own limitations:
* MIN: sensitive to outliers.
* MAX: breaks large clusters and cannot accommodate clusters of different sizes.
* Group Average: has some limitations of its own, as it is a combination of MIN and MAX.
3. ***The most important limitation, from a practical (or applied) standpoint, is the “Space & Time Complexity”.***
It has very high space and time complexity.
So it cannot be used when “n” is large, which makes it hard for large-scale datasets. For small datasets it may work, but it does not fit well with large-scale datasets.
For large-scale datasets, Agglomerative (or hierarchical clustering in general) goes out the door, and a very good alternative is K-Means, which has the following complexity:
=> K-Means: “O(n·d)” per iteration (treating the number of clusters as a small constant) -> linear in n, ~O(n) when “d” is small.
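As a quick illustrative sketch (on hypothetical synthetic data), K-Means, or its mini-batch variant, comfortably handles a dataset size for which the n × n proximity matrix would already be far too large to store:
from sklearn.datasets import make_blobs
from sklearn.cluster import MiniBatchKMeans
# one million synthetic points: the n x n proximity matrix would not fit in RAM,
# but mini-batch K-Means runs in roughly linear time in n
X, _ = make_blobs(n_samples=1_000_000, n_features=10, centers=5, random_state=0)
mbk = MiniBatchKMeans(n_clusters=5, batch_size=10_000, random_state=0)
labels = mbk.fit_predict(X)
print("Fitted on", len(X), "points; inertia:", round(mbk.inertia_))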
***All the other limitations can be worked around, but nothing much can be done about the complexity.***
So there has been a lot of investment in K-Means; it is an old algorithm and researchers keep trying to improve it.
=> K-Means (originally from the 1960s) => has nice later extensions:
K-Means++ (early 2000s); Kernel K-Means; K-Medoids.
"Agglomerative Clustering is a powerful technique with a wide range of applications across various domains."
Here are some applications and benefits of Agglomerative Clustering:
Applications Of Agglomerative Clustering in General:
1. Customer Segmentation: Agglomerative Clustering can be used to segment customers based on their behavior, preferences, or purchasing patterns. This segmentation enables businesses to better understand their customer base and tailor marketing strategies accordingly.
2. Image Segmentation: Agglomerative Clustering can partition an image into distinct regions based on similarity criteria. It is commonly used in computer vision applications, such as object detection, image recognition, and image compression.
3. Document Clustering: Agglomerative Clustering can group similar documents together based on their content. This is useful in text mining, information retrieval, and document organization tasks.
4. Anomaly Detection: Agglomerative Clustering can identify outliers or anomalies in a dataset by considering them as separate clusters. This is valuable in fraud detection, network intrusion detection, and quality control.
5. Biological Data Analysis: Agglomerative Clustering is widely used in bioinformatics for analyzing gene expression data, protein sequence analysis, and clustering biological samples based on their characteristics.
Benefits:
1. Flexibility: Agglomerative Clustering can handle various types of data, including numerical, categorical, and mixed data. It can also accommodate different distance metrics and linkage methods, providing flexibility in analyzing different datasets.
2. Hierarchy Exploration: Agglomerative Clustering creates a hierarchy of clusters, represented by a dendrogram. This allows for easy exploration of different levels of granularity in the clustering structure.
3. Interpretability: Agglomerative Clustering produces clusters that are easy to interpret and understand. The hierarchical structure and visualizations aid in identifying meaningful patterns and relationships within the data.
4. Scalability: Agglomerative Clustering handles small datasets efficiently. The time complexity of the algorithm is manageable for such data, especially when using optimized implementations.
5. No Prespecified Number of Clusters: Unlike some other clustering algorithms, Agglomerative Clustering does not require the number of clusters to be specified beforehand. The full hierarchy is built first, and the number of clusters can be chosen afterwards by cutting the dendrogram at the desired level.
Agglomerative Clustering can be highly useful in scenarios where the underlying data structure is hierarchical, and the goal is to identify meaningful clusters at different levels. It is particularly beneficial when there is no prior knowledge about the number of clusters or when the number of clusters can vary depending on the granularity of analysis. Additionally, the ability to visually explore the clustering hierarchy aids in gaining insights and making informed decisions based on the data.
Here is an example implementation of Agglomerative Clustering on the wine clustering dataset.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage
from mpl_toolkits.mplot3d import Axes3D
import seaborn as sns
# Load the dataset
df = pd.read_csv('wine-clustering.csv')
# Extract features: this clustering version of the wine dataset is assumed to
# contain only numeric feature columns (no ID/label column), so all columns are
# used; adjust the selection if your CSV differs
X = df.values
# Perform Agglomerative Clustering (Ward linkage always uses Euclidean distance;
# the old `affinity` parameter was renamed `metric` and removed in recent scikit-learn)
agg_clustering = AgglomerativeClustering(n_clusters=3, linkage='ward')
labels = agg_clustering.fit_predict(X)
# Print the labels
print("Cluster Labels:")
print(labels)
# Create linkage matrix
linkage_matrix = linkage(X, method='ward')
# Plot the dendrogram
plt.figure(figsize=(12, 6))
dendrogram(linkage_matrix)
plt.title('Dendrogram')
plt.xlabel('Samples')
plt.ylabel('Distance')
plt.show()
# Box plots
plt.figure(figsize=(10, 6))
sns.boxplot(x=labels, y=df['Alcohol'])
plt.title('Box Plot - Alcohol')
plt.xlabel('Cluster Labels')
plt.ylabel('Alcohol')
plt.show()
plt.figure(figsize=(10, 6))
sns.boxplot(x=labels, y=df['Malic_Acid'])
plt.title('Box Plot - Malic Acid')
plt.xlabel('Cluster Labels')
plt.ylabel('Malic Acid')
plt.show()
# Violin plots
plt.figure(figsize=(10, 6))
sns.violinplot(x=labels, y=df['Alcohol'])
plt.title('Violin Plot - Alcohol')
plt.xlabel('Cluster Labels')
plt.ylabel('Alcohol')
plt.show()
plt.figure(figsize=(10, 6))
sns.violinplot(x=labels, y=df['Malic_Acid'])
plt.title('Violin Plot - Malic Acid')
plt.xlabel('Cluster Labels')
plt.ylabel('Malic Acid')
plt.show()
# Plot the clusters in 2D
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.title('Agglomerative Clustering - 2D')
plt.xlabel('Alcohol')
plt.ylabel('Malic Acid')
plt.show()
# Visualize clusters in 3D
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=labels)
# label the axes with the first three feature names from the dataset
ax.set_xlabel(df.columns[0])
ax.set_ylabel(df.columns[1])
ax.set_zlabel(df.columns[2])
plt.title('Agglomerative Clustering (3D)')
plt.show()
Key Words:
#HierarchicalClustering #AgglomerativeClustering #ClusteringAlgorithm #MachineLearning #ArtificialIntelligence #DataClustering #PatternRecognition #UnsupervisedLearning #DataMining #DataAnalysis #ClusterAnalysis #DataVisualization #BigData #FeatureExtraction #DataPreprocessing #DataScience #Algorithm #DataPoints #Dendrogram #SimilarityMeasure #DataExploration #DataClusteringTechniques #DataClassification #DataModelling #DataPatterns #ClusterValidation #IntraclusterDistance #InterclusterDistance #ScikitLearn #PythonProgramming