Welcome to BxD Primer Series, where we cover topics such as Machine learning models, Neural Nets, GPT, Ensemble models, and Hyper-automation in a ‘one-post-one-topic’ format. Today’s post is on Spectral Clustering Models. Let’s get started:
The What:
Spectral clustering is a powerful technique for clustering data that is not easily separated by linear boundaries and that other clustering algorithms struggle with. Instead of clustering the original dataset directly, it looks at the underlying structure of the data in terms of similarity between data points, and then uses that structure to arrive at clusters.
It's called spectral clustering because it relies on the spectral properties (eigenvectors and eigenvalues) of a similarity graph that is constructed from the data. In summary:
Spectral Clustering = Reduced Underlying Structure + Any other Clustering Algorithm
The Accuracy:
You can observe from the image below that spectral clustering is able to separate almost all data distribution types into appropriate clusters, whereas some other algorithms fail to do so on particular data distributions.
The How:
There are eight general steps to perform spectral clustering, and multiple choices can be made at each step. The eight steps are listed below; the choices at each step are then explained separately (a minimal end-to-end sketch in Python follows the list):
- Choose a pairwise similarity measure or distance metric to quantify the similarity or dissimilarity between each pair of data points.
- Construct an affinity matrix that captures the pairwise relationships between data points based on the similarity measure or distance metric chosen.
- Normalize the affinity matrix to obtain a matrix that satisfies certain properties, such as being symmetric, positive semi-definite, and having rows that sum to one.
- Compute the eigenvectors and eigenvalues of the normalized affinity matrix.
- Choose a small subset of the top eigenvectors based on the desired number of clusters or other criteria.
- Use the selected eigenvectors to form a lower-dimensional representation of the data points.
- Apply a clustering algorithm, such as k-means or hierarchical clustering, to the lower-dimensional representation to group the data points into clusters.
- Map the cluster assignments back to the original data space to obtain the final clustering result.
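Before diving into the choices at each step, here is a minimal end-to-end sketch using scikit-learn's SpectralClustering class. The toy dataset (two concentric circles) and every parameter value are illustrative assumptions rather than recommendations:

```python
# Minimal end-to-end spectral clustering sketch with scikit-learn.
# Dataset and parameter choices are illustrative only.
from sklearn.datasets import make_circles
from sklearn.cluster import SpectralClustering

# Two concentric circles: a classic case where k-means on raw coordinates fails.
X, _ = make_circles(n_samples=300, factor=0.5, noise=0.05, random_state=0)

# affinity="rbf" builds a Gaussian-kernel affinity matrix (gamma controls its width);
# assign_labels="kmeans" runs k-means on the spectral embedding, as in step 7 above.
model = SpectralClustering(n_clusters=2, affinity="rbf", gamma=10.0,
                           assign_labels="kmeans", random_state=0)
labels = model.fit_predict(X)
print(labels[:10])
```

The sections below unpack what such a library call is doing internally at each of the eight steps.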
Choice of pairwise similarity measure or distance metric depends on the nature of the data (a short sketch of a few of these measures follows the list):
- Euclidean distance: It measures the distance between two points in a straight line.
- Cosine similarity: This measures the angle between two vectors, which is useful for text or image data.
- Correlation coefficient: This measures the linear correlation between two vectors.
- Gaussian kernel similarity: This is a similarity measure that assigns a weight to each pair of points based on their distance.
- Pearson correlation: This measures the linear correlation between two vectors after they have been centered.
- k-Nearest Neighbors graph: Connects each point to its k nearest neighbors based on the chosen distance metric.
- Fully connected graph: Connects each point to all other points based on the chosen distance metric.
- Laplacian kernel: This is a kernel-based approach that uses the Laplacian matrix to measure similarity between points.
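As a rough illustration, a few of these measures can be computed directly with NumPy. The array shape, the variable names, and the sigma value are assumptions made only for this example:

```python
import numpy as np

# Toy data: 5 points with 3 features each (purely illustrative).
X = np.random.rand(5, 3)

# Euclidean distance matrix: d_ij = ||x_i - x_j||
diff = X[:, None, :] - X[None, :, :]
euclidean = np.sqrt((diff ** 2).sum(axis=-1))

# Cosine similarity: s_ij = <x_i, x_j> / (||x_i|| * ||x_j||)
norms = np.linalg.norm(X, axis=1, keepdims=True)
cosine = (X @ X.T) / (norms @ norms.T)

# Gaussian-kernel similarity: s_ij = exp(-||x_i - x_j||^2 / (2 * sigma^2))
sigma = 1.0
gaussian = np.exp(-euclidean ** 2 / (2 * sigma ** 2))
```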
The affinity matrix depends on the choice of similarity or distance metric:
Each row and column of the affinity matrix corresponds to a single data point, and the value of each entry in the matrix represents the degree of similarity or dissimilarity between the corresponding pair of data points.
Depending on choice of similarity or distance measure, there are three types of affinity matrix:
- Binary adjacency matrix, where entries are 1 if two points are connected, and 0 otherwise
- Weighted adjacency matrix, where entries are weighted by the strength of the similarity or distance measure
- Gaussian similarity matrix, where entries are computed using a Gaussian kernel function
The affinity matrix is a key input to spectral clustering algorithms, as it is used to construct the lower-dimensional representation of the data points.
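A minimal sketch of the three flavours, assuming scikit-learn's kneighbors_graph and euclidean_distances helpers; the values of k and sigma are illustrative:

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from sklearn.metrics.pairwise import euclidean_distances

X = np.random.rand(100, 2)  # toy data

# 1. Binary adjacency matrix from a k-nearest-neighbours graph (symmetrised).
knn = kneighbors_graph(X, n_neighbors=10, mode="connectivity").toarray()
binary_A = np.maximum(knn, knn.T)            # 1 if two points are connected, else 0

# 2. Weighted adjacency matrix: edges of the kNN graph weighted by distance.
dist = kneighbors_graph(X, n_neighbors=10, mode="distance").toarray()
weighted_A = np.maximum(dist, dist.T)

# 3. Gaussian similarity matrix on the fully connected graph.
sigma = 0.5
D = euclidean_distances(X)
gaussian_A = np.exp(-D ** 2 / (2 * sigma ** 2))
```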
Normalizing the affinity matrix: An unnormalized affinity matrix is rarely used to compute eigenvectors, as it can produce unstable and inconsistent clustering results.
It is normalized (for example, so that rows sum to 1), and the resulting matrix is then used for eigenvector computation.
Row normalization: Divide each row of the affinity matrix by the sum of the elements in that row.
Symmetric normalization: Symmetry ensures that eigenvectors of the matrix are orthogonal.
- Compute a diagonal matrix (D) whose diagonal entries are sum of the elements in that row of affinity matrix (A).
- Compute the unnormalized Laplacian matrix L: L = D - A.
- Compute the normalized Laplacian matrix L_sym = D^(-1/2) L D^(-1/2) = I - D^(-1/2) A D^(-1/2).
- The resulting matrix L_sym is symmetric and has non-negative eigenvalues.
Random walk normalization is closely related: the resulting matrix L_rw is not symmetric in general, but it is similar to L_sym and therefore shares its eigenvalues. Symmetric normalization takes a global view of the graph, while random walk normalization takes a more local, diffusion-based view, which makes it better suited to datasets with disconnected or well-separated cluster regions.
- Compute a diagonal matrix (D) whose diagonal entries are sum of the elements in that row of affinity matrix (A).
- Compute the random walk transition matrix P = D^(-1) A and the random walk normalized Laplacian L_rw = I - P = D^(-1) L.
- Each row of P represents a probability distribution over the neighboring nodes, which is where the ‘random walk’ interpretation comes from.
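A small NumPy sketch of the three normalizations discussed above, assuming A is a symmetric, non-negative affinity matrix with no isolated points (so every row sum is positive):

```python
import numpy as np

def normalize_affinity(A):
    d = A.sum(axis=1)                       # node degrees = row sums of A
    D_inv = np.diag(1.0 / d)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.diag(d) - A                      # unnormalized Laplacian: L = D - A

    row_norm = D_inv @ A                    # row normalization: each row sums to 1
    L_sym = D_inv_sqrt @ L @ D_inv_sqrt     # symmetric: I - D^(-1/2) A D^(-1/2)
    L_rw = D_inv @ L                        # random walk: I - D^(-1) A
    return row_norm, L_sym, L_rw
```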
Computing Eigenvectors and Eigenvalues: Eigenvectors are vectors that, when multiplied by a matrix, produce a scalar multiple of themselves; eigenvalues are the corresponding scalar multiples. Details will be covered in the PCA primer.
The whole idea of spectral clustering is to use eigenvectors and eigenvalues instead of the original features; they are the ‘spectral’ properties of the data. They compress most of the information in the data into a smaller number of features, effectively performing dimensionality reduction.
The eigenvectors represent the directions in which the data is most separable, while the eigenvalues correspond to the degree of influence or importance of each direction.
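In practice, the eigen-decomposition of the symmetric normalized Laplacian can be computed with np.linalg.eigh, which returns eigenvalues in ascending order; for the Laplacian, it is the eigenvectors attached to the smallest eigenvalues that carry the cluster structure. The toy affinity construction below exists only to make the snippet self-contained:

```python
import numpy as np

# Toy Gaussian affinity matrix and its symmetric normalized Laplacian
# (illustrative stand-in for the matrices built in the earlier steps).
rng = np.random.default_rng(0)
X = rng.random((20, 2))
A = np.exp(-np.square(X[:, None, :] - X[None, :, :]).sum(-1))
d = A.sum(axis=1)
L_sym = np.eye(len(A)) - (A / np.sqrt(d)[:, None]) / np.sqrt(d)[None, :]

# eigh handles symmetric matrices and sorts eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(L_sym)
```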
Selecting a subset of (m) eigenvectors: Common options when selecting a subset of top eigenvectors:
- Fixed number of eigenvectors: Choose a fixed number of top eigenvectors based on the desired number of clusters or other criteria. This is the simplest approach, but it may not always lead to the best clustering result.
- Ratio of eigenvalues: Choose the top eigenvectors based on a ratio of their corresponding eigenvalues. For example, you can choose the eigenvectors that correspond to the top 10% of eigenvalues. This approach aims to retain the eigenvectors that capture the most variance in the data.
- Gap statistic:?Compare the eigenvalues of the normalized affinity matrix to the eigenvalues of random matrices generated with the same dimensions, and identify a "gap" in the eigenvalue spectrum that indicates the optimal number of eigenvectors to pick.
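A common relative of this idea is the eigengap heuristic: keep the eigenvectors that come before the largest jump between consecutive sorted eigenvalues. A minimal sketch, assuming the Laplacian eigenvalues are sorted in ascending order (as np.linalg.eigh returns them):

```python
import numpy as np

def choose_m_by_eigengap(eigenvalues, max_m=10):
    # Gaps between consecutive eigenvalues among the first max_m + 1 of them.
    gaps = np.diff(eigenvalues[:max_m + 1])
    # Keep everything before the largest gap; with 3 well-separated clusters
    # the big jump typically appears after the third eigenvalue.
    return int(np.argmax(gaps)) + 1
```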
Forming a lower-dimensional representation of the data:
- Form a matrix V by stacking the m selected eigenvectors as columns. Dimension of V: p*m, where p is the number of data points (each eigenvector of the p*p normalized Laplacian has length p).
- Optionally, as in the Ng-Jordan-Weiss variant, normalize each row of V to unit length.
- The rows of V represent the lower-dimensional embedding of the original data points: row i is the m-dimensional representation of data point i.
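A minimal sketch of this step in the Ng-Jordan-Weiss style (eigenvectors with the smallest L_sym eigenvalues, optional row normalization); the function and variable names are assumptions made for the example:

```python
import numpy as np

def spectral_embedding(L_sym, m):
    # Eigenvectors of the symmetric normalized Laplacian, eigenvalues ascending.
    eigenvalues, eigenvectors = np.linalg.eigh(L_sym)
    V = eigenvectors[:, :m]                              # p x m: one row per data point
    V = V / np.linalg.norm(V, axis=1, keepdims=True)     # optional row normalization
    return V
```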
Apply a clustering algorithm on the lower-dimensional data:
By this step, the ‘spectral’ part of the algorithm is done. Now a suitable clustering algorithm can be applied to group the lower-dimensional data points into clusters. Some common choices:
- K-means clustering: This is a popular choice for spectral clustering. We have already covered this algorithm in our previous edition (check here).
- Hierarchical clustering: This is another popular choice for spectral clustering. We have already covered the agglomerative algorithm in our previous edition (check here).
- DBSCAN: We have already covered this algorithm in our previous edition (check here).
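For example, running k-means on the rows of the spectral embedding with scikit-learn could look like the sketch below; the random matrix V is just a stand-in for the embedding built in the previous step:

```python
import numpy as np
from sklearn.cluster import KMeans

V = np.random.rand(100, 3)       # stand-in for the p x m spectral embedding
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(V)   # labels[i] is the cluster of original data point i
```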
Mapping lower-dimensional clusters to original data points:
Options for mapping the cluster assignments from the lower-dimensional space back to the original data space:
- One-to-one map: Each row of the embedding corresponds to exactly one original data point, so each point is simply assigned the cluster of its embedding.
- Nearest neighbor: For each data point in the original space, find its nearest neighbor in the lower-dimensional space. Assign the data point to the same cluster as its nearest neighbor in the lower-dimensional space.
Evaluating Clusters:
The spectral clustering algorithm has multiple parameters that can affect the resulting cluster formation, and there is no direct way to visualize the clustering results.
Eventually, you will need to accept or reject the clustering results. Certain statistical methods can be used to judge cluster quality; we have covered their formulas and implications in a previous edition (check here).
Another approach is to use a measure of clustering stability. Some common techniques (a small bootstrap-stability sketch follows this list):
- Bootstrap resampling: The original dataset is sampled with replacement to create multiple bootstrap datasets. Each bootstrap dataset is then clustered using spectral clustering, and the resulting clustering solutions are compared using a stability metric such as Normalized Mutual Information (NMI). NMI ranges from 0 to 1, with 1 indicating perfect agreement between two solutions and 0 indicating no agreement. The stability metric is then averaged across all bootstrap samples to produce an overall measure of stability.
- Subsampling: A random subset of the original dataset is selected, and spectral clustering is performed on the subset. This process is repeated multiple times with different subsets of the data, and the resulting clustering solutions are compared using a stability metric.
- Parameter perturbation: The clustering algorithm is run multiple times with small perturbations to the parameters, and the resulting clustering solutions are compared using a stability metric to identify the most stable solution.
- Consensus clustering: Multiple clustering algorithms are run with different parameter settings or initializations, and the resulting solutions are combined to produce a consensus clustering that represents the most stable clusters across all the individual solutions.
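A rough sketch of the bootstrap-with-NMI idea, assuming scikit-learn; the number of rounds, the RBF affinity, and the way reference labels are restricted to the resampled points are all illustrative choices:

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics import normalized_mutual_info_score

def bootstrap_stability(X, n_clusters=3, n_rounds=10, seed=0):
    rng = np.random.default_rng(seed)
    model = SpectralClustering(n_clusters=n_clusters, affinity="rbf", random_state=0)
    reference = model.fit_predict(X)          # clustering of the full dataset
    scores = []
    for _ in range(n_rounds):
        idx = rng.choice(len(X), size=len(X), replace=True)   # bootstrap sample
        labels = model.fit_predict(X[idx])                    # re-cluster the sample
        # Compare against the reference labels of the same (resampled) points.
        scores.append(normalized_mutual_info_score(reference[idx], labels))
    return float(np.mean(scores))
```

A score close to 1 suggests the clustering is stable under resampling; a score near 0 suggests the clusters are largely an artifact of the particular sample or parameter choice.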
The Why:
There are several reasons why one might choose spectral clustering over other clustering models:
- Can handle datasets with complex or nonlinear structures, where other clustering algorithms struggle
- Can identify clusters of arbitrary shapes and provide clear separation
- Generally produces high-quality clusters, especially when the underlying data has clear geometric structure
- Can be used with a variety of similarity measures, including both graph-based and distance-based measures
- Can be used with both normalized and unnormalized Laplacians, allowing for more flexibility in the algorithm's behavior
- Embeddings of data points in the low-dimensional space can be used as features for downstream tasks
The Why Not:
Some reasons why you might not want to use spectral clustering:
- Can be computationally expensive, especially for large datasets or high-dimensional feature spaces
- Requires selection and tuning of several parameters, such as the similarity function, affinity matrix type and normalization techniques
- May not perform well on datasets with low-dimensional structure or when the clusters are highly overlapping or dispersed
Time for you to help:
- Reply to this email with your question
- Forward/Share to a friend who can benefit from this
- Chat on Substack with BxD (here)
- Engage with BxD on LinkedIN (here)
In the next few posts, we will cover three more types of clustering models: Affinity Propagation, Mean-Shift, and Fuzzy C-Means.
Let us know your feedback!