BxD Primer Series: Affinity Propagation Clustering Models
Hey there!
Welcome to BxD Primer Series, where we cover topics such as Machine Learning models, Neural Nets, GPT, Ensemble models, and Hyper-automation in a 'one-post-one-topic' format. Today's post is on Affinity Propagation Clustering Models. Let's get started:
The What:
The main idea behind Affinity Propagation Clustering is that each data point sends messages to all other data points to indicate how well-suited it is to represent a cluster center, called an exemplar. The messages exchanged between data points are based on two factors: the similarity between the data points, and the availability of other data points to serve as cluster centers.
Affinity Propagation Clustering takes as input a similarity matrix that measures the similarity between each pair of data points. Once the similarity matrix is established, the algorithm alternates between two message-passing steps: "responsibility" and "availability".
In the responsibility step, each data point x_i sends to every candidate exemplar x_k a responsibility value that measures how well-suited x_k is to serve as the cluster center for x_i.
In the availability step, each candidate exemplar x_k sends back to every data point x_i an availability value that measures how available x_k is to serve as a cluster center, based on the responsibilities it has received from other data points.
The iterations continue until the algorithm converges and identifies the cluster centers. Data points are then assigned to these centers based on their similarity to center points.
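To make this concrete, here is a minimal sketch using scikit-learn's AffinityPropagation on synthetic data. The dataset and parameter values are illustrative choices, not prescriptions:

```python
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

# Illustrative synthetic data: 300 points around 4 blob centers
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# damping stabilizes the message updates; see "The How" section below
ap = AffinityPropagation(damping=0.9, random_state=42).fit(X)

print("Clusters found:", len(ap.cluster_centers_indices_))  # indices of exemplars
print("First 10 labels:", ap.labels_[:10])
```

Note that the number of clusters is discovered by the algorithm, not passed in as a parameter.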
The Why:
Here are a few advantages that Affinity Propagation Clustering provides over other clustering models:
No preset cluster count: The number of clusters does not need to be specified in advance; the number of exemplars emerges from the data.
Interpretable centers: Cluster centers (exemplars) are actual data points, which makes the results easy to interpret.
Flexible similarities: It works directly from a similarity matrix, so it can handle non-metric or asymmetric similarities that centroid-based methods cannot.
Deterministic: For a given similarity matrix and preference, the result does not depend on random initialization, unlike K-Means.
The Why Not:
Reasons to not use Affinity Propagation:
Poor scalability: Time and memory complexity are O(n^2) per iteration, so it scales poorly to large datasets.
Sensitive tuning: Results depend strongly on the preference value, which indirectly sets the number of clusters and can be hard to choose.
Oscillation risk: Messages can oscillate without converging, which is why damping is needed.
No direct cluster-count control: If you need exactly K clusters, other algorithms are a better fit.
The How:
There are three matrices involved in the Affinity Propagation algorithm: Similarity (S), Responsibility (R), and Availability (A). If there are n data points in the dataset, each of these matrices has dimension n × n.
Similarity Matrix stores pairwise similarities between all data points using a chosen distance or similarity metric. There are multiple ways to measure the distance between points, and the right choice depends on the feature values. We have explained their formulas and implications in another edition (check here). Brief below:
Negative squared Euclidean distance: The most common choice for Affinity Propagation, suited to continuous features on comparable scales.
Manhattan distance: More robust to outliers than Euclidean distance.
Cosine similarity: Suited to high-dimensional, sparse data such as text.
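As a sketch, the negative squared Euclidean similarity matrix can be built with NumPy. Setting the diagonal ("preference") to the median similarity is one conventional default, not the only option:

```python
import numpy as np

def similarity_matrix(X, preference=None):
    """Pairwise negative squared Euclidean similarities, shape (n, n)."""
    sq_norms = np.sum(X ** 2, axis=1)
    # ||x_i - x_k||^2 = ||x_i||^2 + ||x_k||^2 - 2 * (x_i . x_k)
    S = -(sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T)
    # The diagonal ("preference") influences how many exemplars emerge;
    # the median of the similarities is a common default.
    if preference is None:
        preference = np.median(S)
    np.fill_diagonal(S, preference)
    return S
```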
Responsibility Matrix element R_{i,k} represents the extent to which data point x_i considers candidate exemplar x_k to be a good representative for it, relative to other candidate exemplars.
Availability Matrix element A_{i,k} represents the extent to which candidate exemplar x_k considers data point x_i to be a good candidate for membership in its cluster, based on the support x_k has received from other points.
The algorithm works as follows:
Step 1: Initialize the responsibility and availability matrices to zero, and set the self-similarity S_{k,k} (the "preference") to a chosen value, commonly the median of the pairwise similarities:

R_{i,k} = 0 and A_{i,k} = 0 for all i, k

In every later iteration, updates to both matrices are damped:

M_new = λ * M_old + (1 − λ) * M_computed

Where, λ is the damping factor that controls the rate at which the matrices are updated during each iteration. A higher value leads to slower and more stable convergence, while a lower value leads to faster but more volatile convergence.
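In code, Step 1 and the damped update rule might look like this, assuming S is the similarity matrix from the earlier sketch (lam plays the role of λ; values around 0.5 to 0.9 are typical):

```python
import numpy as np

n = S.shape[0]            # S: (n, n) similarity matrix from the sketch above
R = np.zeros((n, n))      # responsibility messages
A = np.zeros((n, n))      # availability messages

def damp(old, new, lam=0.9):
    # Higher lam keeps more of the previous message: slower, steadier updates
    return lam * old + (1.0 - lam) * new
```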
Step 2: Compute the responsibility R_{i,k} for each data point x_i and candidate exemplar x_k as:

R_{i,k} = S_{i,k} − max_{k' ≠ k} [ A_{i,k'} + S_{i,k'} ]
Where,
S_{i,k} is the similarity between x_i and x_k, and the max term is the strongest competing offer to x_i: the best availability-plus-similarity from any other candidate exemplar x_{k'}.
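A vectorized NumPy sketch of this update. The masking trick finds, for each row, the best and second-best competitor, so the maximum over k' ≠ k is correct even at the argmax column:

```python
import numpy as np

def update_responsibility(S, R, A, lam=0.9):
    """One damped responsibility update: R = S - max over competitors of (A + S)."""
    n = S.shape[0]
    AS = A + S
    rows = np.arange(n)
    best = AS.argmax(axis=1)                     # strongest competitor per row
    first = AS[rows, best]
    AS[rows, best] = -np.inf                     # mask it to expose the runner-up
    second = AS.max(axis=1)
    R_new = S - first[:, None]                   # subtract best competitor...
    R_new[rows, best] = S[rows, best] - second   # ...except at the best column itself
    return lam * R + (1.0 - lam) * R_new
```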
Step 3: Compute the availability of all data points as:

A_{i,k} = min( 0, R_{k,k} + Σ_{i' ∉ {i, k}} max(0, R_{i',k}) ) for i ≠ k

A_{k,k} = Σ_{i' ≠ k} max(0, R_{i',k})
Where,
R_{k,k} is the self-responsibility of candidate exemplar x_k, and the sums collect only the positive support that other data points x_{i'} send to x_k.
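The same update as a NumPy sketch, with damping applied as in Step 1:

```python
import numpy as np

def update_availability(R, A, lam=0.9):
    """One damped availability update from the formulas above."""
    Rp = np.maximum(R, 0)
    np.fill_diagonal(Rp, np.diag(R))     # keep self-responsibility un-clipped
    col_sums = Rp.sum(axis=0)            # R_{k,k} + total positive support for k
    A_new = col_sums[None, :] - Rp       # drop point i's own contribution
    diag = np.diag(A_new).copy()         # A_{k,k} must not be clipped below 0
    A_new = np.minimum(A_new, 0)
    np.fill_diagonal(A_new, diag)
    return lam * A + (1.0 - lam) * A_new
```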
Step 4: Identify the exemplars as the data points whose self-responsibility plus self-availability is positive:

x_k is an exemplar if R_{k,k} + A_{k,k} > 0
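This criterion, together with the Step 6 assignment below, as a sketch building on the functions above:

```python
import numpy as np

def extract_clusters(S, R, A):
    """Exemplars are points with positive R_{k,k} + A_{k,k}; every other point
    joins the exemplar it is most similar to."""
    E = np.diag(R) + np.diag(A)
    exemplars = np.flatnonzero(E > 0)
    if exemplars.size == 0:              # no exemplar has emerged yet
        return exemplars, None
    labels = exemplars[np.argmax(S[:, exemplars], axis=1)]
    labels[exemplars] = exemplars        # each exemplar represents itself
    return exemplars, labels
```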
Step 5: Repeat steps 2-4 until convergence, which is defined as the point when responsibility and availability matrices do not change significantly.
Step 6: Assign each data point to its closest exemplar based on the similarity matrix.
Note: The value of k (the number of exemplars) is not fixed in advance; it changes over the iterations and can range from 0 to n, where n is the number of data points.
Convergence Criterion is typically based on the change in the exemplar set and in cluster assignments between iterations. The algorithm terminates when these changes fall below a certain threshold, indicating that it has converged to a stable solution.
Change in exemplar set: Measures the percentage of exemplars that are added or removed between two consecutive iterations, and compares it to the threshold value. If the percentage is less than the threshold value, the algorithm is considered to have converged.
Change in cluster assignment: Measures the percentage of data points whose cluster assignment changes between two consecutive iterations, and compares it to the threshold value. If the percentage is less than the threshold value, the algorithm is considered to have converged.
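A sketch of such a check, where tol is a hypothetical tolerance on the fraction of points whose assignment changed:

```python
import numpy as np

def has_converged(prev_exemplars, exemplars, prev_labels, labels, tol=0.0):
    """True when the exemplar set is unchanged and the fraction of points
    whose cluster label changed is at or below tol."""
    if set(prev_exemplars) != set(exemplars):
        return False
    changed = np.mean(np.asarray(prev_labels) != np.asarray(labels))
    return changed <= tol
```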
Evaluating Clusters: Eventually, you will need to accept or reject the clustering results. Certain methods can be used to judge cluster quality. We have covered their formulas and implications in another edition (check here). Brief below:
Silhouette Score: Measures how similar each point is to its own cluster compared to other clusters; ranges from -1 to 1, higher is better.
Davies-Bouldin Index: Ratio of within-cluster scatter to between-cluster separation; lower is better.
Calinski-Harabasz Index: Ratio of between-cluster dispersion to within-cluster dispersion; higher is better.
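For example, with scikit-learn you can score an Affinity Propagation result with the silhouette coefficient (data and parameters are illustrative):

```python
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
labels = AffinityPropagation(damping=0.9, random_state=42).fit_predict(X)
print(f"Silhouette score: {silhouette_score(X, labels):.3f}")  # higher is better
```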
Time for you to support:
In the coming posts, we will be covering two more types of clustering models: Mean-Shift and Fuzzy C-Means.
Let us know your feedback!
Until then,
Have a great time!