BxD Primer Series: Affinity Propagation Clustering Models


Hey there 👋

Welcome to the BxD Primer Series, where we cover topics such as Machine Learning models, Neural Nets, GPT, Ensemble models, and Hyper-automation in a 'one-post-one-topic' format. Today's post is on Affinity Propagation Clustering Models. Let's get started:

The What:

The main idea behind Affinity Propagation Clustering is that each data point sends messages to all other data points to indicate how well-suited it is to represent a cluster center. The messages sent between data points are based on two factors: the similarity between the data points, and the availability of other data points to serve as cluster centers.

Rather than starting from initial cluster centers, Affinity Propagation Clustering first builds a similarity matrix that measures the similarity between each pair of data points. Once the similarity matrix is established, the algorithm starts iterating between two steps: "responsibility" and "availability".

In the responsibility step, each data point sends a responsibility value to every candidate exemplar, measuring how well-suited that candidate is to serve as the cluster center for it, relative to all other candidates.

In the availability step, each candidate exemplar sends an availability value back to every data point, measuring how available it is to serve as that point's cluster center, based on the responsibilities it has received from other data points.

The iterations continue until the algorithm converges and identifies the cluster centers. Data points are then assigned to these centers based on their similarity to center points.
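To make this concrete, here is a minimal sketch using scikit-learn's AffinityPropagation implementation (assuming scikit-learn is installed; the toy blob data and parameter values are illustrative, not from this post):

```python
# Minimal sketch: Affinity Propagation via scikit-learn on toy 2-D data.
import numpy as np
from sklearn.cluster import AffinityPropagation

rng = np.random.default_rng(42)
# Two well-separated blobs of 20 points each
X = np.vstack([rng.normal(0.0, 0.5, (20, 2)),
               rng.normal(5.0, 0.5, (20, 2))])

# damping slows the message updates; the number of clusters is NOT specified
model = AffinityPropagation(damping=0.9, random_state=0).fit(X)

print("Exemplar indices:", model.cluster_centers_indices_)
print("Number of clusters found:", len(model.cluster_centers_indices_))
```

Notice that no cluster count is passed in; the algorithm infers it from the data and the preference value (by default, the median of the pairwise similarities).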

The Why:

Here are a few advantages that Affinity Propagation Clustering provides over other clustering models:

  1. Automatically determines the number of clusters, with no need for a predefined value
  2. Can handle non-linear and complex data patterns
  3. Suitable for high-dimensional datasets with a large number of features
  4. Can identify small clusters and outliers that may be missed by other clustering algorithms
  5. Can handle missing data by ignoring those points in the clustering process

The Why Not:

Reasons to not use Affinity Propagation:

  1. Requires multiple iterations over n × n message matrices, so it scales poorly to large datasets and can take a long time to converge
  2. Requires tuning of hyper-parameters (chiefly the preference and the damping factor), which can be difficult and time-consuming
  3. May result in unevenly sized clusters
  4. Produces clusters based on similarity metrics and message passing, which can be difficult to interpret

The How:

There are three matrices involved in the Affinity Propagation algorithm: Similarity (S), Responsibility (R), and Availability (A). If there are n data points in the dataset, each of these matrices has dimension n × n.

Similarity Matrix stores pairwise similarities between all data points using a chosen distance or similarity metric. There are multiple ways to measure the distance between points, and the right choice depends on the types of feature values. We have explained their formulas and implications in another edition (check here). Brief below:

  1. Euclidean distance is a popular choice for datasets with continuous numerical features.
  2. Manhattan distance is useful for datasets with discrete or categorical features.
  3. Minkowski distance is a generalized distance metric that includes Euclidean and Manhattan as special cases.
  4. Cosine similarity is commonly used for text data, where the similarity between two documents is based on the angle between their feature vectors.
  5. Jaccard similarity is useful for datasets with binary features, such as presence/absence of certain attributes.
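As an illustration, S is often built from the negative squared Euclidean distance, so that higher values mean "more similar". The sketch below uses made-up toy points and fills the diagonal with the median preference, a common default choice:

```python
# Sketch: building the similarity matrix S for Affinity Propagation
# using negative squared Euclidean distance.
import numpy as np

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])

# S[i, k] = -||x_i - x_k||^2  (higher = more similar)
diff = X[:, None, :] - X[None, :, :]
S = -np.sum(diff ** 2, axis=2)

# The diagonal ("preference") controls how many exemplars emerge;
# the median of the off-diagonal similarities is a common default.
np.fill_diagonal(S, np.median(S[~np.eye(len(X), dtype=bool)]))
print(S.shape)  # (4, 4)
```

A larger (less negative) preference on the diagonal makes each point argue more strongly for itself, producing more clusters.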

Responsibility Matrix element R(i,k) represents the extent to which data point x_i considers x_k to be a good exemplar for it. These messages are sent from data points to candidate exemplars.

Availability Matrix element A(i,k) represents the extent to which candidate exemplar x_k signals to data point x_i that it is available to serve as x_i's cluster center. These messages are sent from candidate exemplars back to data points.


The algorithm works as follows:

Step 1: Initialize the responsibility and availability matrices to zero, and set each self-similarity S(k,k) (the "preference") to a common value, often the median of the pairwise similarities:

R(i,k) = 0 and A(i,k) = 0, for all i, k

In every subsequent iteration, each matrix update is damped to avoid numerical oscillations:

M = λ · M_old + (1 − λ) · M_new

Where, λ is the damping factor that controls the rate at which the responsibility and availability matrices are updated during each iteration. A higher value leads to slower and more stable convergence, while a lower value leads to faster but more volatile convergence.

Step 2: Compute the responsibility R(i,k) for each data point x_i and candidate exemplar x_k as:

R(i,k) = S(i,k) − max_{k' ≠ k} [ A(i,k') + S(i,k') ]

Where,

  • R(i,k) represents the responsibility of data point i towards exemplar k
  • S(i,k) is the similarity between data point i and exemplar k
  • A(i,k') is the availability of a competing exemplar k' to data point i
  • The max term captures the competition from all other candidate exemplars k' ≠ k

Step 3: Compute the availability of all data points as:

A(i,k) = min( 0, R(k,k) + Σ_{i' ∉ {i,k}} max(0, R(i',k)) ), for i ≠ k

A(k,k) = Σ_{i' ≠ k} max(0, R(i',k))

Where,

  • A(k,k) represents the availability of exemplar k to itself, accumulated from the positive responsibilities it receives from other points
  • A(i,k) represents the availability of exemplar k to data point i, capped at zero so that a few strong supporters cannot make k look unconditionally available

Step 4: Identify the current exemplars as the data points for which the sum of self-responsibility and self-availability is positive, i.e. R(k,k) + A(k,k) > 0.

Step 5: Repeat steps 2-4 until convergence, which is defined as the point when responsibility and availability matrices do not change significantly.

Step 6: Assign each data point to its closest exemplar based on the similarity matrix.

Note: The number of exemplars is not fixed in advance; it can change from iteration to iteration, ranging anywhere from 0 to n, where n is the number of data points.


Convergence Criterion?is typically based on the change in exemplar set and cluster assignments between iterations. The algorithm terminates when the change in exemplar set and cluster assignments falls below a certain threshold, indicating that the algorithm has converged to a stable solution.

Change in exemplar set: Measures the percentage of exemplars that are added or removed between two consecutive iterations, and compares it to the threshold value. If the percentage is less than the threshold value, the algorithm is considered to have converged.

Change in cluster assignment: Measures the percentage of data points whose cluster assignment changes between two consecutive iterations, and compares it to the threshold value. If the percentage is less than the threshold value, the algorithm is considered to have converged.


Evaluating Clusters: Eventually, you will need to accept or reject the clustering results. Certain methods can be used to judge the quality of clusters. We have covered their formulas and implications in another edition (check here). Brief below:

  1. Silhouette Coefficient: This metric measures the compactness and separation of clusters. It is useful when the data is well-separated and the clusters have similar sizes.
  2. Davies-Bouldin Index: This metric measures the similarity between clusters and the distance between their centroids. It is useful when the clusters have different sizes and densities.
  3. Calinski-Harabasz Index: This metric measures the ratio of the between-cluster variance to the within-cluster variance. It is useful when the clusters are well-separated and have similar sizes.
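For example, all three metrics are available in scikit-learn (assuming it is installed); the toy data and labels below are illustrative:

```python
import numpy as np
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

rng = np.random.default_rng(0)
# Two compact, well-separated blobs with known labels
X = np.vstack([rng.normal(0.0, 0.3, (25, 2)),
               rng.normal(4.0, 0.3, (25, 2))])
labels = np.array([0] * 25 + [1] * 25)

print("Silhouette:        ", silhouette_score(X, labels))        # in [-1, 1], higher is better
print("Davies-Bouldin:    ", davies_bouldin_score(X, labels))    # lower is better
print("Calinski-Harabasz: ", calinski_harabasz_score(X, labels)) # higher is better
```

In practice, compute these on the labels returned by Affinity Propagation and compare against alternative clusterings (or alternative preference values) of the same data.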

Time for you to support:

  1. Reply to this email with your question
  2. Forward/Share to a friend who can benefit from this
  3. Chat on Substack with BxD (here)
  • Engage with BxD on LinkedIn (here)

In the coming posts, we will cover two more types of clustering models: Mean-Shift and Fuzzy C-Means.

Let us know your feedback!

Until then,

Have a great time!

#businessxdata #bxd #affinity #propagation #clustering #primer
