Welcome to the BxD Primer Series, where we cover topics such as Machine Learning models, Neural Nets, GPT, Ensemble models, and Hyper-automation in a ‘one-post-one-topic’ format. Today’s post is on Agglomerative Clustering Models. Let’s get started:
The What:
Agglomerative clustering is a general framework for hierarchical clustering that creates a nested hierarchy of clusters. It starts by treating each data point as its own cluster and then merges clusters iteratively based on their similarity until a stopping criterion is met. This process creates a hierarchical structure, represented by a dendrogram, that displays the relationships between clusters.
It is most suitable for small to medium-sized datasets in a variety of tasks such as:
- Image segmentation: It can be used to group similar pixels in an image and segment it into different regions.
- Customer segmentation: It can be used to group customers based on their purchasing behavior and demographics, allowing for more targeted marketing campaigns.
- Gene expression analysis: It can be used to group genes with similar expression patterns, which can provide insights into biological processes and disease mechanisms.
- Document clustering: It can be used to group similar documents based on their content, which can help with document organization and retrieval.
- Brain connectivity analysis: It has been used successfully to analyze brain connectivity from fMRI data.
- Social network analysis: It can be used to group individuals with similar social network connections, which can provide insights into community structure and social influence.
- Recommendation systems: It can be used to group similar products or items based on user behavior, which helps with personalized recommendations.
Note: Agglomerative and divisive clustering are two different approaches to hierarchical clustering.
- Agglomerative clustering is a bottom-up approach that starts with each data point as its own cluster and iteratively merges clusters until a stopping criterion is met.
- Divisive clustering is a top-down approach that starts with all data points in a single cluster and recursively divides it into smaller clusters until a stopping criterion is met.
The How:
Below is how the agglomerative clustering algorithm works (a minimal code sketch follows the steps):
- Initialize the algorithm: Begin by treating each data point as a separate cluster.
- Compute the pairwise distances: Compute the distance between all pairs of clusters using a specified distance metric. This produces a distance matrix that represents the dissimilarity between every pair of clusters.
- Merge the closest clusters: Identify the two closest clusters according to the distance matrix and merge them into a new cluster.
- Update the distance matrix: Recompute the distances between the new cluster and all other clusters using the chosen linkage method. This updates the distance matrix to reflect the new cluster structure.
- Repeat the merge and update steps: Continue merging the closest clusters and updating the distance matrix until a stopping criterion is met.
- Output the final clustering solution: The final clustering solution is the set of clusters that were produced when the stopping criterion was met.
Note: In any standard implementation, only two clusters can be merged in one iteration of the algorithm. If multiple cluster pairs are at equal distances, the algorithm typically merges only one pair rather than all of them at once.
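To make these steps concrete, here is a minimal from-scratch sketch in Python. It assumes Euclidean distances, single linkage, and a fixed number of clusters as the stopping criterion; the function name and interface are illustrative, not taken from any library.

```python
import numpy as np

def agglomerative_single_linkage(X, n_clusters):
    """Toy agglomerative clustering: single linkage, fixed cluster count."""
    # Each data point starts as its own cluster (clusters are lists of indices).
    clusters = [[i] for i in range(len(X))]
    # Pairwise point distances (Euclidean), computed once up front.
    diff = X[:, None, :] - X[None, :, :]
    point_dist = np.sqrt((diff ** 2).sum(axis=-1))

    # Repeat: find the closest pair of clusters and merge it,
    # until the stopping criterion (target cluster count) is met.
    while len(clusters) > n_clusters:
        best_a, best_b, best_d = None, None, np.inf
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: minimum distance between any two members.
                d = point_dist[np.ix_(clusters[a], clusters[b])].min()
                if d < best_d:
                    best_a, best_b, best_d = a, b, d
        # Merge the closest pair into one cluster.
        clusters[best_a] = clusters[best_a] + clusters[best_b]
        del clusters[best_b]
    return clusters

# Example: three obvious groups on a line.
X = np.array([[0.0], [0.1], [5.0], [5.1], [10.0], [10.1]])
print(agglomerative_single_linkage(X, n_clusters=3))
```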
Choosing a distance metric:
There are multiple ways to measure the distance between points, and the right choice depends on the values of the features. Euclidean, Manhattan, and Minkowski distances and Cosine and Jaccard similarities are a few examples. We have explained their formula and implications in another edition (check here).
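As a quick illustration (assuming numeric features and SciPy available), the same points give different pairwise distance matrices under different metrics:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]])

# Each call returns the full pairwise distance matrix for one metric.
print(squareform(pdist(X, metric="euclidean")))
print(squareform(pdist(X, metric="cityblock")))  # Manhattan
print(squareform(pdist(X, metric="cosine")))     # 1 - cosine similarity
```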
Choosing Linkage:
There are several linkage methods that can be used to measure the distance between clusters (a short sketch comparing them follows this list):
Single Linkage: Distance between two clusters is defined as the minimum distance between any two points in the two clusters.
- Tends to create long chains of similar data points and is susceptible to the "chaining phenomenon" where distant points are linked together.
- Useful when data contains long, narrow clusters or when there is a clear chain-like structure in the data.
- It can also produce clusters that are too scattered or that contain outliers.
Complete Linkage: Distance between two clusters is defined as the maximum distance between any two points in the two clusters.
- Tends to create compact, spherical clusters and is less sensitive to the chaining phenomenon.
- Useful when data contains well-defined, compact clusters or when the goal is to create non-overlapping clusters.
- It can also produce clusters that are too small or that merge too quickly.
Average Linkage: Distance between two clusters is defined as the average distance between all pairs of points in the two clusters.
- Combines the advantages of single and complete linkage and is a popular choice for many applications.
- Useful when there is no clear preference for a particular linkage method or when data contains clusters of varying shapes and sizes.
- It can also produce clusters that are too large or too small if the distance metric is not carefully chosen.
Ward's Method: Distance between two clusters is defined as the increase in the total within-cluster sum of squared distances when the two clusters are merged.
- This method tends to create clusters with low variance and is often used in analysis of variance (ANOVA) applications.
- Useful when the goal is to identify clusters that have low within-cluster variance or when data contains groups that differ in their means or variances.
- It can also produce clusters that are too homogeneous or that merge too quickly.
Centroid Linkage: Distance between two clusters is defined as the distance between the centroids (mean points) of the two clusters.
- Tends to create clusters that are more balanced than single linkage and less compact than complete linkage.
- Useful when the data contains clusters of varying shapes and sizes or when the goal is to identify clusters that are relatively evenly distributed.
- It is sensitive to outliers in the data.
Median Linkage: Distance between two clusters is defined as the distance between the median values of the two clusters.
- Compromise between single and complete linkage and tends to produce more balanced clusters than single linkage.
- A variation of centroid linkage that is less sensitive to outliers in the data.
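To see how the linkage choice changes the result on the same data, here is a short sketch with SciPy; the toy data and the cut into three clusters are illustrative:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))  # toy data

for method in ["single", "complete", "average", "ward", "centroid", "median"]:
    # linkage() runs the full agglomerative merge and returns the merge tree.
    Z = linkage(X, method=method, metric="euclidean")
    # Cut the tree into (at most) three flat clusters and report their sizes.
    labels = fcluster(Z, t=3, criterion="maxclust")
    print(method, np.bincount(labels)[1:])
```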
Choosing Stopping Criterion:
The stopping criterion determines when the algorithm will terminate and produce a final clustering solution. Some common approaches to defining a stopping criterion (a short sketch contrasting the first two follows this list):
- Number of clusters: Set a fixed number of clusters in advance and stop the algorithm when that number is reached. Useful only when there is prior knowledge about the data or a clear goal for the number of clusters.
- Distance threshold: Set a threshold for the distance between clusters and stop the algorithm once no remaining pair of clusters is closer than that threshold. Useful when the goal is to identify clusters that are well-separated in the feature space. The threshold value depends on the choice of distance metric and type of linkage.
- Merge threshold: Set a limit on the number of merges that can occur and stop the algorithm when that limit is reached. Useful when the goal is to limit the number of clusters that are generated or when there is noise or outliers in the data. It may produce clusters that are too large or too small, depending on the threshold value.
- Evaluate clustering quality: A more sophisticated approach is to evaluate the quality of the clustering solution at each step of the algorithm and stop when it reaches a certain level of quality. This approach is useful when there is no clear goal for the number of clusters or when the data contains complex or overlapping clusters.
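In scikit-learn, the first two criteria correspond to the n_clusters and distance_threshold arguments of AgglomerativeClustering. A minimal sketch, with an illustrative threshold value:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))

# Stop when a fixed number of clusters is reached.
fixed_k = AgglomerativeClustering(n_clusters=4, linkage="ward").fit(X)

# Stop merging once clusters are farther apart than the threshold
# (n_clusters must be None when distance_threshold is used).
by_distance = AgglomerativeClustering(
    n_clusters=None, distance_threshold=5.0, linkage="ward"
).fit(X)

print(fixed_k.labels_[:10])     # cluster assignments
print(by_distance.n_clusters_)  # number of clusters implied by the threshold
```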
Evaluating Clusters:
Eventually, you will need to accept or reject the clustering results. Certain methods can be used to judge the quality of clusters, such as the Silhouette Coefficient, Gap Statistic, etc. We have covered their formula and implications in another edition (check here).
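For instance, a quick way to compare candidate cluster counts is the average silhouette score from scikit-learn (higher is better; the data here is only a placeholder):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))

# Fit the same model with different cluster counts and compare silhouettes.
for k in range(2, 6):
    labels = AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
```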
Another approach is to examine the dendrogram, which displays the hierarchical structure of the clusters. The vertical axis of the dendrogram represents the distance or dissimilarity between clusters, while the horizontal axis represents the clusters themselves. The clusters should seem reasonable for your data, and an appropriate level can be selected based on the problem at hand.
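A minimal sketch of drawing a dendrogram with SciPy and cutting it at a chosen height (the data and the cut height of 1.5 are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))

Z = linkage(X, method="average")

# Vertical axis: merge distance; horizontal axis: the clustered points.
dendrogram(Z)
plt.axhline(y=1.5, linestyle="--")  # a candidate level at which to cut the tree
plt.show()

# Cutting at that height yields the corresponding flat clustering.
labels = fcluster(Z, t=1.5, criterion="distance")
print(labels)
```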
Once the clusters have been identified, their characteristics can be examined to gain insights into the data. This involves examining the mean or median values of features within each cluster, or the most representative data points for each cluster.
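A small sketch of that kind of profiling with pandas, using placeholder column names and an assumed cluster count:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
labels = AgglomerativeClustering(n_clusters=3, linkage="average").fit_predict(X)

df = pd.DataFrame(X, columns=["feature_1", "feature_2", "feature_3"])
df["cluster"] = labels

# Per-cluster means, medians, and sizes summarize what sets each group apart.
print(df.groupby("cluster").mean())
print(df.groupby("cluster").median())
print(df["cluster"].value_counts())
```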
The Why:
There are several reasons one might choose agglomerative clustering over other clustering models:
- Handles non-linearly separable data: It can identify clusters that are not easily detected by methods that assume compact, convex cluster shapes.
- Produces a hierarchical structure that can be useful for visualizing and interpreting clusters in a dendrogram.
- More intuitive than other clustering methods: Agglomerative clustering is a bottom-up approach, which means it starts by clustering individual data points and then progressively merges them into larger clusters.
- Useful for data exploration and pattern recognition: Natural groupings within the data can be visualized.
The Why Not:
Some reasons why you might not want to use agglomerative clustering:
- Sensitive to outliers: Outliers can cause the algorithm to merge clusters that should not be merged, resulting in inaccurate results. Even a single outlier can change the merge order.
- Computationally expensive for large datasets: The algorithm requires computing distances between all pairs of data points, so memory grows roughly as O(n²) and runtime is typically between O(n²) and O(n³), depending on the implementation and linkage.
- Requires a stopping point or number of clusters: Which can be difficult to determine in practice.
- Affected by merge order: When several pairs of clusters are equally close, the order in which ties are broken can lead to different final clustering results.
- Sensitive to choice of linkage method and distance metric: Different linkage methods and distance metrics can result in different clustering results.
Time for you to help:
- Reply to this email with your question
- Forward/Share to a friend who can benefit from this
- Chat on Substack with BxD (here)
- Engage with BxD on LinkedIn (here)
In the coming posts, we will cover four more types of clustering models - Spectral Clustering, Affinity Propagation, Mean-Shift, and Fuzzy C-Means.
Let us know your feedback!