BxD Primer Series: Introduction to Unsupervised Learning and K-Means Clustering Models
Hey there 👋
Welcome to the BxD Primer Series, where we cover topics such as Machine Learning models, Neural Nets, GPT, Ensemble models, and Hyper-automation in a 'one-post-one-topic' format. Today's post is on K-Means Clustering Models, with an Introduction to Unsupervised Learning. Let's get started:
Unsupervised Learning Models 101:
Supervised learning requires a labeled dataset, where the input data is labeled with the corresponding output or target variable. In contrast, unsupervised learning algorithms learn patterns, structures, and relationships within the data without explicit guidance from a labeled dataset.
This makes them more versatile in scenarios where labeled data is difficult or expensive to obtain. However, unsupervised learning models also require more processing power and more expertise to interpret correctly, since there is no explicit target variable against which to evaluate the model's performance.
Use cases of unsupervised learning models:
• Customer segmentation: grouping customers with similar behavior
• Anomaly detection: flagging unusual data points, e.g. fraudulent transactions
• Dimensionality reduction: compressing many features into a few informative ones
• Recommender systems: grouping similar users or items to drive suggestions
Starting with K-Means Clustering Model - The What:
The K-Means algorithm attempts to partition a given dataset into K clusters, where each data point belongs to the cluster with the nearest mean (centroid). K-Means is a 'hard clustering' algorithm, meaning each data point is assigned to exactly one cluster.
The algorithm works by first selecting K initial centroids randomly from the dataset. Then it iteratively assigns each data point to the nearest centroid and updates each centroid as the mean of the data points assigned to it. This process continues until the centroids no longer move significantly or a maximum number of iterations is reached.
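To make this loop concrete, here is a minimal sketch in Python with NumPy. It assumes `X` is an array of shape (n_samples, n_features); the function name and parameter defaults are illustrative, not a reference implementation:

```python
import numpy as np

def kmeans(X, k, max_iters=100, tol=1e-4, seed=0):
    """Minimal K-Means sketch: returns final centroids and labels."""
    rng = np.random.default_rng(seed)
    # Step 1: pick K initial centroids at random from the dataset
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2: assign each point to its nearest centroid (Euclidean)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move significantly
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    return centroids, labels
```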
The Why:
Some reasons to use K-Means for clustering tasks:
• It is simple to understand and implement.
• It is fast and scales well to large datasets.
• Its results are easy to interpret: each cluster is summarized by its centroid.
• It works well when clusters are roughly spherical and of similar size.
The Why Not:
Some reasons to not use K-Means for clustering tasks and use other algorithms:
• The number of clusters K must be chosen in advance.
• It assumes roughly spherical, similar-sized clusters and struggles with irregular shapes and densities.
• It is sensitive to outliers and to the initial placement of centroids.
• Hard assignments cannot express overlapping or partial cluster membership.
The How:
The objective of K-Means clustering is to minimize the distances between each data point and its assigned centroid. Depending on data types, multiple distance metrics are available for use, such as Euclidean, Manhattan, Cosine, etc. We explained choosing distance metrics in another edition; please check it here. In this edition, we will use Euclidean distance. Steps in K-Means:
• Given a set of K cluster centroids, the K-Means algorithm assigns each data point to its closest centroid, based on the Euclidean distance between the data point and each centroid.
• Once each data point is assigned to a centroid, the algorithm calculates the distance between each data point and its assigned centroid. These distances are then summed together.
• The goal of the K-Means algorithm is to minimize this sum of distances, called the cost function.
The Cost Function in K-Means is called the within-cluster sum of squares (WCSS). It can be defined as:

WCSS = \sum_{i=1}^{K} \sum_{x \in C_i} \| x - \mu_i \|^2

Where,
• K is the number of clusters
• C_i is the set of data points assigned to cluster i
• \mu_i is the centroid of cluster i
• \| x - \mu_i \| is the Euclidean distance between data point x and centroid \mu_i
• During each iteration, the algorithm reduces the cost function by updating the cluster assignments of data points and recalculating cluster centroids. The algorithm stops when the change in the cost function between iterations falls below a certain threshold, indicating that it has converged.
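As a small sketch, the WCSS above can be computed directly from the data, the labels, and the centroids (array names follow the earlier snippet and are assumptions):

```python
import numpy as np

def wcss(X, labels, centroids):
    """Within-cluster sum of squares: squared distance from each
    point to its assigned centroid, summed over all points."""
    return float(np.sum((X - centroids[labels]) ** 2))
```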
Note One: A cluster centroid is the simple average of all the data points in that cluster. It has n dimensions if there are n features in the data.
Note Two: K-Means is sensitive to the scale of features, so it is recommended to normalize or standardize the data before applying the algorithm.
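One common way to do this, assuming scikit-learn is available:

```python
from sklearn.preprocessing import StandardScaler

# Rescale every feature to zero mean and unit variance so that
# no single feature dominates the Euclidean distance computation.
X_scaled = StandardScaler().fit_transform(X)
```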
Once converged, the centroid values are stored as final. During prediction, the algorithm calculates the Euclidean distance between the new data point and each of the fixed cluster centroids. The new data point is then assigned to the closest cluster centroid based on this distance.
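A sketch of that prediction step, reusing the NumPy conventions from the earlier snippets:

```python
import numpy as np

def predict(x_new, centroids):
    """Assign a new point to the cluster with the nearest fixed centroid."""
    dists = np.linalg.norm(centroids - x_new, axis=1)
    return int(dists.argmin())
```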
Parameters in K-Means Clustering:
There are multiple options that control the quality of the clusters created by the algorithm:
• Number of clusters (K): It is important to strike a balance between simplicity and accuracy when choosing the optimal K value for your data. A small K can oversimplify the clustering and miss finer details in the data, while a large K can capture more patterns but may result in overfitting, with clusters too small to be meaningful. Some methods used to decide the K value (a sketch of the first follows after this list):
- Elbow method: plot WCSS against K and pick the K where the curve flattens into an 'elbow'
- Silhouette analysis: pick the K that maximizes the average silhouette width
- Gap statistic: compare WCSS against its expected value under a reference (random) distribution
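As a sketch of the elbow method: scikit-learn's KMeans exposes the final WCSS as its `inertia_` attribute; `X` and the range of K values tried here are assumptions:

```python
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Fit K-Means for a range of K and record the WCSS (inertia_);
# the "elbow" where the curve flattens is a reasonable choice of K.
ks = range(1, 11)
wcss_values = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    for k in ks
]

plt.plot(ks, wcss_values, marker="o")
plt.xlabel("Number of clusters (K)")
plt.ylabel("WCSS (inertia)")
plt.show()
```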
• Initialization methods: K-Means starts from the initial centroids provided and then moves them to fit variation in the underlying data. The choice of initialization method can affect the performance and accuracy of the clustering algorithm:
- Random initialization: pick K data points at random as initial centroids; cheap, but can converge to a poor local optimum
- K-Means++: spread the initial centroids apart, sampling each new centroid with probability proportional to its squared distance from the nearest centroid already chosen (a sketch follows below)
In practice, it is a good idea to try multiple initializations or initialization methods and compare the results to choose the best one.
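For intuition, here is a sketch of the K-Means++ seeding idea in NumPy; it is illustrative only, not scikit-learn's exact implementation:

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """K-Means++ seeding: sample each new centroid with probability
    proportional to its squared distance from the nearest centroid
    chosen so far, spreading the initial centroids apart."""
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from each point to its nearest chosen centroid
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)
```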
• Distance metrics: The distance metric determines how similar or dissimilar two data points are. It is important to choose a metric that is appropriate for the data being clustered. We explained choosing distance metrics in another edition; please check it here.
• Convergence criteria: In theory, the K-Means algorithm could iterate forever, so selecting convergence criteria is a tradeoff between clustering accuracy and computational cost. The most commonly used criteria are (mapped to scikit-learn parameters in the snippet below):
- A maximum number of iterations is reached
- The centroids move by less than a small tolerance between iterations
- The change in the cost function (WCSS) between iterations falls below a threshold
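In scikit-learn, these criteria map to the `max_iter` and `tol` parameters of KMeans; the values shown are illustrative:

```python
from sklearn.cluster import KMeans

# Stop after at most 300 iterations, or earlier once the change in
# cluster centers between iterations falls below the tolerance.
model = KMeans(n_clusters=3, max_iter=300, tol=1e-4,
               n_init=10, random_state=0).fit(X)
```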
Evaluating Cluster Quality:
Clusters can be evaluated after the algorithm converges. Based on the quality of the clusters, the model can be accepted or re-trained; eventually, whichever version of the model produces the best clusters is accepted. Two main concepts help judge cluster quality: cohesion (how close the points within a cluster are to each other) and separation (how far apart different clusters are from each other).
Measuring cluster cohesion and separation:
• Within-Cluster Sum of Squares (WCSS): It measures the sum of squared distances between each data point and its assigned cluster center. A lower WCSS value indicates better cluster cohesion, as it means that the data points are closer to their respective cluster centers.
• Between-Cluster Sum of Squares (BCSS): It measures the sum of squared distances between each cluster center and the overall mean of the data, weighted by cluster size. A higher BCSS value indicates better cluster separation, as it means that the clusters are far apart from each other.
• Average Silhouette Width: A measure of cluster quality that combines both cluster cohesion and cluster separation. The average silhouette width for a cluster is the mean of the silhouette coefficients of all data points in that cluster. For a data point i, the silhouette coefficient is:

s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}

Where,
• a(i) is the mean distance from point i to the other points in its own cluster (cohesion)
• b(i) is the mean distance from point i to the points in the nearest other cluster (separation)
The coefficient ranges from -1 to 1: a value near -1 indicates the data point is probably assigned to the wrong cluster, near 0 indicates it lies on the border between two clusters, and near 1 indicates it is well clustered.
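scikit-learn computes the mean silhouette coefficient over all samples directly; `X` and the fitted `model` here are assumed from the earlier snippet:

```python
from sklearn.metrics import silhouette_score

# Mean silhouette coefficient over all samples: closer to 1 is better.
score = silhouette_score(X, model.labels_)
```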
• Density-Based Clustering Validity (DBCV): DBCV ranges from -1 to 1. A higher value indicates better cluster quality, with compact, well-separated clusters; a negative value indicates highly overlapping or poorly separated clusters. Unlike WCSS and silhouette, it is designed for density-based, arbitrarily shaped clusters.
There are multiple other criteria to measure cluster quality; those are left for your self-study. Start with: Dunn Index, Davies-Bouldin Index (DBI), Calinski-Harabasz Index.
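For reference, scikit-learn implements the last two of these; `X` and `model` are assumed to come from a fitted clustering model:

```python
from sklearn.metrics import davies_bouldin_score, calinski_harabasz_score

# Lower Davies-Bouldin is better; higher Calinski-Harabasz is better.
dbi = davies_bouldin_score(X, model.labels_)
ch = calinski_harabasz_score(X, model.labels_)
```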
Variations of K-Means:
Other than K-Means++ (discussed earlier), there are two more variations of K-Means that are useful in particular scenarios:
Time for you to support:
In the coming posts, we will cover a few more clustering models, such as DBSCAN and Affinity Propagation. Then we will move on to pattern search algorithms.
Let us know your feedback!
Until then,
Enjoy life in full!