Welcome to the BxD Primer Series, where we cover topics such as Machine Learning models, Neural Nets, GPT, Ensemble models, and Hyper-automation in a ‘one-post-one-topic’ format. Today’s post is on Agglomerative Clustering Models. Let’s get started:
The What:
Agglomerative clustering is a general framework for hierarchical clustering that creates a nested hierarchy of clusters. It starts by treating each data point as its own cluster and then merges clusters iteratively based on their similarity until a stopping criterion is met. This process creates a hierarchical structure, represented by a dendrogram, that displays the relationships between clusters.
It is most suitable for small to medium-sized datasets in a variety of tasks such as:
- Image segmentation: It can be used to group similar pixels in an image and segment it into different regions.
- Customer segmentation: It can be used to group customers based on their purchasing behavior and demographics, allowing for more targeted marketing campaigns.
- Gene expression analysis: It can be used to group genes with similar expression patterns, which can provide insights into biological processes and disease mechanisms.
- Document clustering: It can be used to group similar documents based on their content, which can help with document organization and retrieval.
- Brain connectivity analysis: It has been used successfully to analyze brain connectivity from fMRI data.
- Social network analysis: It can be used to group individuals with similar social network connections, which can provide insights into community structure and social influence.
- Recommendation systems: It can be used to group similar products or items based on user behavior, which helps with personalized recommendations.
Note: Agglomerative and divisive clustering are two different approaches to hierarchical clustering.
- Agglomerative clustering is a bottom-up approach that starts with each data point as its own cluster and iteratively merges clusters until a stopping criterion is met.
- Divisive clustering is a top-down approach that starts with all data points in a single cluster and recursively divides it into smaller clusters until a stopping criterion is met.
The How:
Below is how the agglomerative clustering algorithm works (a minimal code sketch follows the steps):
- Initialize the algorithm: Begin by treating each data point as a separate cluster.
- Compute the pairwise distances: Compute the distance between all pairs of clusters using a specified distance metric. This produces a distance matrix that represents the dissimilarity between every pair of clusters.
- Merge the closest clusters: Identify the two closest clusters according to the distance matrix and merge them into a new cluster.
- Update the distance matrix: Recompute the distances between the new cluster and all other clusters using the chosen linkage method. This updates the distance matrix to reflect the new cluster structure.
- Repeat the merge and update steps: Continue merging the closest clusters and updating the distance matrix until a stopping criterion is met.
- Output the final clustering solution: The final clustering solution is the set of clusters that were produced when the stopping criterion was met.
Note: In any standard implementation, only two clusters can be merged in one iteration of the algorithm. If multiple cluster pairs are at equal distances, the algorithm typically merges only one pair rather than all of them at once.
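To make these steps concrete, here is a minimal from-scratch sketch in Python. It assumes Euclidean distances, single linkage, and a fixed number of clusters as the stopping criterion; the function name and interface are illustrative, not taken from any library.

```python
import numpy as np

def agglomerative_single_linkage(X, n_clusters):
    """Toy agglomerative clustering: single linkage, fixed cluster count."""
    # Each data point starts as its own cluster (clusters are lists of indices).
    clusters = [[i] for i in range(len(X))]
    # Pairwise point distances (Euclidean), computed once up front.
    diff = X[:, None, :] - X[None, :, :]
    point_dist = np.sqrt((diff ** 2).sum(axis=-1))

    # Repeat: find the closest pair of clusters and merge it,
    # until the stopping criterion (target cluster count) is met.
    while len(clusters) > n_clusters:
        best_a, best_b, best_d = None, None, np.inf
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: minimum distance between any two members.
                d = point_dist[np.ix_(clusters[a], clusters[b])].min()
                if d < best_d:
                    best_a, best_b, best_d = a, b, d
        # Merge the closest pair into one cluster.
        clusters[best_a] = clusters[best_a] + clusters[best_b]
        del clusters[best_b]
    return clusters

# Example: three obvious groups on a line.
X = np.array([[0.0], [0.1], [5.0], [5.1], [10.0], [10.1]])
print(agglomerative_single_linkage(X, n_clusters=3))
```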
Choosing a distance metric:
There are multiple ways to measure the distance between points, and the right choice depends on the values of the features. Euclidean, Manhattan, and Minkowski distances and Cosine and Jaccard similarities are a few examples. We have explained their formula and implications in another edition (check here).
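As a quick illustration (assuming numeric features and SciPy available), the same points give different pairwise distance matrices under different metrics:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]])

# Each call returns the full pairwise distance matrix for one metric.
print(squareform(pdist(X, metric="euclidean")))
print(squareform(pdist(X, metric="cityblock")))  # Manhattan
print(squareform(pdist(X, metric="cosine")))     # 1 - cosine similarity
```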
Choosing Linkage:
There are several linkage methods that can be used to measure the distance between clusters (a short sketch comparing them follows this list):
Single Linkage: Distance between two clusters is defined as the minimum distance between any two points in the two clusters.
- Tends to create long chains of similar data points and is susceptible to the "chaining phenomenon" where distant points are linked together.
- Useful when data contains long, narrow clusters or when there is a clear chain-like structure in the data.
- It can also produce clusters that are too scattered or that contain outliers.
Complete Linkage: Distance between two clusters is defined as the maximum distance between any two points in the two clusters.
- Tends to create compact, spherical clusters and is less sensitive to the chaining phenomenon.
- Useful when data contains well-defined, compact clusters or when the goal is to create non-overlapping clusters.
- It can also produce clusters that are too small or that merge too quickly.
Average Linkage: Distance between two clusters is defined as the average distance between all pairs of points in the two clusters.
- Combines the advantages of single and complete linkage and is a popular choice for many applications.
- Useful when there is no clear preference for a particular linkage method or when data contains clusters of varying shapes and sizes.
- It can also produce clusters that are too large or too small if the distance metric is not carefully chosen.
Ward's Method: Distance between two clusters is defined as the increase in the total within-cluster sum of squared distances when the two clusters are merged.
- This method tends to create clusters with low variance and is often used in analysis of variance (ANOVA) applications.
- Useful when the goal is to identify clusters that have low within-cluster variance or when data contains groups that differ in their means or variances.
- It can also produce clusters that are too homogeneous or that merge too quickly.
Centroid Linkage: Distance between two clusters is defined as the distance between the centroids (mean points) of the two clusters.
- Tends to create clusters that are more balanced than single linkage and less compact than complete linkage.
- Useful when the data contains clusters of varying shapes and sizes or when the goal is to identify clusters that are relatively evenly distributed.
- It is sensitive to outliers in the data.
Median Linkage: Distance between two clusters is defined as the distance between the median values of the two clusters.
- Compromise between single and complete linkage and tends to produce more balanced clusters than single linkage.
- A variation of centroid linkage that is less sensitive to outliers in the data.
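To see how the linkage choice changes the result on the same data, here is a short sketch with SciPy; the toy data and the cut into three clusters are illustrative:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))  # toy data

for method in ["single", "complete", "average", "ward", "centroid", "median"]:
    # linkage() runs the full agglomerative merge and returns the merge tree.
    Z = linkage(X, method=method, metric="euclidean")
    # Cut the tree into (at most) three flat clusters and report their sizes.
    labels = fcluster(Z, t=3, criterion="maxclust")
    print(method, np.bincount(labels)[1:])
```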
Choosing Stopping Criterion:
The stopping criterion determines when the algorithm will terminate and produce a final clustering solution. Some common approaches to defining a stopping criterion (a short sketch contrasting the first two follows this list):
- Number of clusters: Set a fixed number of clusters in advance and stop the algorithm when that number is reached. Useful only when there is prior knowledge about the data or a clear goal for the number of clusters.
- Distance threshold: Set a threshold for the distance between clusters and stop the algorithm once no remaining pair of clusters is closer than that threshold. Useful when the goal is to identify clusters that are well-separated in the feature space. The threshold value depends on the choice of distance metric and type of linkage.
- Merge threshold: Set a limit on the number of merges that can occur and stop the algorithm when that limit is reached. Useful when the goal is to limit the number of clusters that are generated or when there is noise or outliers in the data. It may produce clusters that are too large or too small, depending on the threshold value.
- Evaluate clustering quality: A more sophisticated approach is to evaluate the quality of the clustering solution at each step of the algorithm and stop when it reaches a certain level of quality. This approach is useful when there is no clear goal for the number of clusters or when the data contains complex or overlapping clusters.
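In scikit-learn, the first two criteria correspond to the n_clusters and distance_threshold arguments of AgglomerativeClustering. A minimal sketch, with an illustrative threshold value:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))

# Stop when a fixed number of clusters is reached.
fixed_k = AgglomerativeClustering(n_clusters=4, linkage="ward").fit(X)

# Stop merging once clusters are farther apart than the threshold
# (n_clusters must be None when distance_threshold is used).
by_distance = AgglomerativeClustering(
    n_clusters=None, distance_threshold=5.0, linkage="ward"
).fit(X)

print(fixed_k.labels_[:10])     # cluster assignments
print(by_distance.n_clusters_)  # number of clusters implied by the threshold
```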
Evaluating Clusters:
Eventually, you will need to accept or reject the clustering results. Certain methods can be used to judge the quality of clusters, such as the Silhouette Coefficient, Gap Statistic, etc. We have covered their formula and implications in another edition (check here).
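For instance, a quick way to compare candidate cluster counts is the average silhouette score from scikit-learn (higher is better; the data here is only a placeholder):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))

# Fit the same model with different cluster counts and compare silhouettes.
for k in range(2, 6):
    labels = AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
```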
Another approach is to examine the dendrogram, which displays the hierarchical structure of the clusters. The vertical axis of the dendrogram represents the distance or dissimilarity between clusters, while the horizontal axis represents the clusters themselves. The clusters should seem reasonable for your data, and an appropriate level can be selected based on the problem at hand.
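A minimal sketch of drawing a dendrogram with SciPy and cutting it at a chosen height (the data and the cut height of 1.5 are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))

Z = linkage(X, method="average")

# Vertical axis: merge distance; horizontal axis: the clustered points.
dendrogram(Z)
plt.axhline(y=1.5, linestyle="--")  # a candidate level at which to cut the tree
plt.show()

# Cutting at that height yields the corresponding flat clustering.
labels = fcluster(Z, t=1.5, criterion="distance")
print(labels)
```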
Once the clusters have been identified, their characteristics can be examined to gain insights into the data. This involves examining the mean or median values of features within each cluster, or the most representative data points for each cluster.
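A small sketch of that kind of profiling with pandas, using placeholder column names and an assumed cluster count:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
labels = AgglomerativeClustering(n_clusters=3, linkage="average").fit_predict(X)

df = pd.DataFrame(X, columns=["feature_1", "feature_2", "feature_3"])
df["cluster"] = labels

# Per-cluster means, medians, and sizes summarize what sets each group apart.
print(df.groupby("cluster").mean())
print(df.groupby("cluster").median())
print(df["cluster"].value_counts())
```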
The Why:
There are several reasons one might choose agglomerative clustering over other clustering models:
- Handles non-linearly separable data: It can identify clusters that are not easily detected by methods that assume compact, convex cluster shapes.
- Produces a hierarchical structure that can be useful for visualizing and interpreting clusters in a dendrogram.
- More intuitive than other clustering methods: Agglomerative clustering is a bottom-up approach, which means it starts by clustering individual data points and then progressively merges them into larger clusters.
- Useful for data exploration and pattern recognition: Natural groupings within the data can be visualized.
The Why Not:
Some reasons why you might not want to use agglomerative clustering:
- Sensitive to outliers: Outliers can cause the algorithm to merge clusters that should not be merged, resulting in inaccurate results. Even a single outlier can change the merge order.
- Computationally expensive for large datasets: The algorithm requires computing distances between all pairs of data points, so memory grows roughly as O(n²) and runtime is typically between O(n²) and O(n³), depending on the implementation and linkage.
- Requires a stopping point or number of clusters: Which can be difficult to determine in practice.
- Affected by merge order: When several pairs of clusters are equally close, the order in which ties are broken can lead to different final clustering results.
- Sensitive to choice of linkage method and distance metric: Different linkage methods and distance metrics can result in different clustering results.
Time for you to help:
- Reply to this email with your question
- Forward/Share to a friend who can benefit from this
- Chat on Substack with BxD (here)
- Engage with BxD on LinkedIn (here)
In the coming posts, we will cover four more types of clustering models - Spectral Clustering, Affinity Propagation, Mean-Shift, and Fuzzy C-Means.
Let us know your feedback!