BxD Primer Series: Introduction to Unsupervised Learning and K-Means Clustering Models

Hey there!

Welcome to the BxD Primer Series, where we cover topics such as machine learning models, neural nets, GPT, ensemble models, and hyper-automation in a ‘one-post-one-topic’ format. Today’s post is on K-Means Clustering Models, with an introduction to unsupervised learning. Let’s get started:

Unsupervised Learning Models 101:

Supervised learning requires a labeled dataset, where the input data is labeled with the corresponding output or target variable. In contrast, unsupervised learning algorithms learn patterns, structures, and relationships within the data without explicit guidance from a labeled dataset.

This makes them more versatile in scenarios where labeled data is difficult or expensive to obtain. However, unsupervised learning models also require more processing power and expertise to correctly interpret the results, since there is no explicit target variable to evaluate the model's performance.

Use cases of unsupervised learning models:

  1. Clustering: Unsupervised learning models are commonly used for clustering data points into groups based on their similarities. For example, clustering algorithms can be used to group customers based on their purchasing patterns or group news articles based on their content.
  2. Anomaly detection: Unsupervised learning models can learn normal patterns of data and detect any deviations from those patterns. This is useful in fraud detection, network intrusion detection, or identifying defects in manufacturing processes.
  3. Dimensionality reduction: Unsupervised learning models can be used to identify and extract most important features or components of the input data. This is useful for visualizing high-dimensional data, compressing data for storage, or reducing noise in the data.
  4. Recommendation systems: Unsupervised learning models can be used to recommend products or content to users based on their behavior or preferences. For example, clustering algorithms can be used to group users based on their past purchases or interests, and then recommend products to users in those groups.
  5. Generative models: Unsupervised learning models can be used to generate new data that is similar to the input data. This is useful in image or speech generation, or for generating new designs based on existing designs.
  6. Pattern Search: The model searches for recurring patterns or motifs within the input data. This can be useful in fields such as bioinformatics where patterns in DNA sequences can be identified, or in finance where patterns in stock prices can be detected. Unsupervised learning algorithms such as self-organizing maps (SOM) or autoencoders can be used for pattern search.

Starting with K-Means Clustering Model - The What:

The K-Means algorithm attempts to partition a given dataset into K clusters, where each data point belongs to the cluster whose mean (centroid) is nearest to it. K-Means is a ‘hard clustering’ algorithm, which means that each data point is assigned to exactly one cluster.

The algorithm works by first selecting K initial centroids randomly from the dataset. Then, it iteratively assigns each data point to the nearest centroid and updates each centroid as the mean of the data points assigned to it. This process continues until the centroids no longer move significantly or a maximum number of iterations is reached.
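As a concrete illustration, here is a minimal scikit-learn sketch of this loop (not from the original post; the toy data, K=3, and random seeds are arbitrary choices for the example). The feature matrix X defined here is reused in the later sketches.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D dataset: three loose blobs (values chosen only for illustration)
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(50, 2)),
])

# fit() runs the assign-to-nearest-centroid / recompute-centroid loop until it converges
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(kmeans.cluster_centers_)  # final centroids, one row per cluster
print(kmeans.labels_[:10])      # hard cluster assignment of the first 10 points
```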

The Why:

Some reasons to use K-Means for clustering tasks:

  1. Speed and efficiency: K-Means is suitable for large datasets with a large number of features. It is particularly well-suited for situations where the number of clusters is relatively small compared to the number of data points.
  2. Ease of use and implementation: K-Means is simple to implement and works directly on numerical data (categorical data requires encoding or a variant such as K-Modes). It is also a well-studied algorithm, making it easy to find resources and support for implementation.
  3. Noise handling with preprocessing: combined with outlier removal, or by treating small, distant clusters as noise, a K-Means pipeline can isolate noisy data points or exclude them from the final clustering, although the algorithm itself is sensitive to outliers.
  4. Automatic natural groupings: K-Means doesn’t require prior knowledge of the data's structure. It is particularly useful in exploratory data analysis, where the goal is to identify patterns and relationships within the data.
  5. Scalability: K-Means can handle large datasets. It can also be parallelized to further increase its speed and efficiency.

The Why Not:

Some reasons to not use K-Means for clustering tasks and use other algorithms:

  1. K-Means requires the number of clusters to be specified in advance, which can be difficult if the optimal number of clusters is not known.
  2. The results of K-Means clustering are sensitive to the initial positions of the cluster centers. Different initial positions can result in different final clustering results.
  3. K-Means assumes that clusters are spherical, evenly sized, and have similar densities, which may not be true for all datasets.
  4. It is not suitable for datasets with non-linear cluster boundaries, where more complex clustering algorithms may be necessary.
  5. K-Means can be biased towards equal-sized clusters, which may not be appropriate for all datasets.

The How:

The objective of K-Means clustering is to minimize the distances between each data point and its assigned centroid. Depending on the data type, multiple distance metrics are available, such as Euclidean, Manhattan, cosine, etc. We explained how to choose distance metrics in another edition; please check it here. In this edition, we will use Euclidean distance. Steps in K-Means:


• Given a set of K cluster centroids, the K-Means algorithm assigns each data point to its closest centroid, based on the Euclidean distance between the data point and each centroid.

• Once each data point is assigned to a centroid, the algorithm calculates the distance between each data point and its assigned centroid. These distances are then summed together.

• The goal of the K-Means algorithm is to minimize this sum of distances, which is called the cost function.

The cost function in K-Means is called the within-cluster sum of squares (WCSS). It is defined as follows (a small computation sketch follows the symbol definitions):

WCSS = \sum_{i=1}^{n} \sum_{j=1}^{K} w_{i,j} \, \lVert x_i - c_j \rVert^2

Where,

  • n is the number of data points
  • K is the number of clusters
  • x_i is the i’th data point
  • c_j is the centroid of the j’th cluster
  • ||.|| denotes the Euclidean distance
  • w_i,j is an indicator that equals 1 if data point i is assigned to cluster j, and 0 otherwise
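To make the formula concrete, here is a small NumPy sketch (my own, not from the original post) that evaluates WCSS exactly as written, using a 0/1 indicator matrix w whose entries play the role of w_i,j:

```python
import numpy as np

def wcss(X, centroids, labels):
    """Within-cluster sum of squares: sum over i, j of w_ij * ||x_i - c_j||^2."""
    n, K = X.shape[0], centroids.shape[0]
    w = np.zeros((n, K))
    w[np.arange(n), labels] = 1.0  # w_ij = 1 only if point i is assigned to cluster j
    sq_dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)  # n-by-K matrix of ||x_i - c_j||^2
    return float((w * sq_dists).sum())
```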

• During each iteration, the algorithm minimizes the cost function by updating the cluster assignments of the data points and recalculating the cluster centroids. It stops when the change in the cost function between iterations falls below a certain threshold, indicating that the algorithm has converged.
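For readers who want to see the iteration spelled out, below is a from-scratch NumPy sketch of this loop (Lloyd's algorithm) that stops when the decrease in WCSS falls below a tolerance. The function and parameter names are my own; in practice a library implementation such as scikit-learn's KMeans is usually preferable.

```python
import numpy as np

def kmeans_fit(X, K, max_iter=300, tol=1e-4, seed=0):
    rng = np.random.default_rng(seed)
    # Random initialization: pick K distinct data points as the starting centroids
    centroids = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    prev_cost = np.inf
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid (Euclidean distance)
        sq_dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dists.argmin(axis=1)
        cost = sq_dists[np.arange(len(X)), labels].sum()  # WCSS for this assignment
        # Update step: move each centroid to the mean of the points assigned to it
        for j in range(K):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
        # Convergence check: stop once the cost barely changes between iterations
        if prev_cost - cost < tol:
            break
        prev_cost = cost
    return centroids, labels
```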


Note One: A cluster centroid is the simple average of all the data points in that cluster. It will have n dimensions if there are n features in the data.

Note Two: K-Means is sensitive to the scale of features. It is recommended to normalize or standardize the data before applying the algorithm.

Once converged, the centroid values are stored as final. During prediction, the algorithm calculates the Euclidean distance between a new data point and each of the fixed cluster centroids. The new data point is then assigned to the closest cluster centroid based on this distance.
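A minimal scikit-learn sketch that follows both notes above: standardize the features, fit, and then assign new points to the stored centroids with predict(). The numbers are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

X_train = np.array([[1.0, 200.0], [1.2, 220.0], [8.0, 30.0], [8.5, 25.0]])  # toy data with mixed scales
scaler = StandardScaler().fit(X_train)  # Note Two: put all features on a comparable scale
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaler.transform(X_train))

X_new = np.array([[1.1, 210.0]])                # a new, unseen data point
print(kmeans.predict(scaler.transform(X_new)))  # index of the nearest fixed centroid
```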

Parameters in K-Means Clustering:

There are multiple options that control the quality of clusters created by the algorithm:

• Number of clusters (K): It is important to strike a balance between simplicity and accuracy when choosing the optimal K value for your data. A small K value can lead to oversimplified clustering results, missing finer details in the data, while a large K value can capture more patterns but may result in overfitting and clusters that are too small to be meaningful. Some methods used to decide the K value (a short code sketch of the first two follows the list):

  1. Elbow Method: It involves plotting WCSS as a function of the number of clusters (K) and identifying the "elbow point" where the rate of decrease in WCSS slows down significantly. The number of clusters at this point is chosen as the optimal number.
  2. Silhouette Method: It involves computing the silhouette coefficient (explained later in this post) for each data point, which measures how well the point fits into its assigned cluster compared to other clusters. The average silhouette coefficient is then calculated for each value of K, and the value of K with the highest average silhouette coefficient is chosen as the optimal number.
  3. Gap Statistic Method: It involves comparing the WCSS for each value of K with the WCSS of a null reference distribution generated by randomly permuting the data. The optimal number of clusters is chosen as the value of K where the gap between the two is largest.
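A sketch of the first two methods with scikit-learn, where inertia_ is the WCSS of the fitted model (X is the feature matrix from the earlier sketch; the range of candidate K values is arbitrary):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

wcss_by_k, silhouette_by_k = {}, {}
for k in range(2, 11):  # candidate values of K
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss_by_k[k] = km.inertia_                            # elbow method: look for where this stops dropping sharply
    silhouette_by_k[k] = silhouette_score(X, km.labels_)  # silhouette method: higher is better

best_k = max(silhouette_by_k, key=silhouette_by_k.get)
print(wcss_by_k, best_k)
```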

• Initialization methods: K-Means starts from the provided initial centroids and then moves them to fit the structure of the underlying data. The choice of initialization method can affect the performance and accuracy of the clustering algorithm:

  1. Random Initialization: Initial centroids are randomly selected from the data points. It is simple and easy to implement, but the choice of random initial centroids may lead to suboptimal clusters, especially if the data is not well-separated or there are outliers in the data.
  2. K-Means++ Initialization: The first centroid is randomly selected from the data points, and then subsequent centroids are selected based on their distance from previous centroids. This ensures that initial centroids are well-distributed and can lead to better clustering results.
  3. Initialization with Prior Knowledge: In some cases, prior knowledge about data or clusters may be available. For example, if we already know that there should be three distinct clusters in the data, we can choose three initial centroids accordingly.

In practice, it is a good idea to try multiple initializations or initialization methods and compare the results to choose the best one.
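For example, in scikit-learn the init parameter of KMeans accepts 'random', 'k-means++', or an explicit array of starting centroids (prior knowledge), and n_init controls how many random restarts are compared. A sketch, assuming X is the feature matrix from earlier and the seed centroids are hypothetical:

```python
import numpy as np
from sklearn.cluster import KMeans

km_random = KMeans(n_clusters=3, init="random", n_init=20, random_state=0).fit(X)
km_pp = KMeans(n_clusters=3, init="k-means++", n_init=20, random_state=0).fit(X)

# Prior knowledge: pass explicit starting centroids (hypothetical values)
seed_centroids = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
km_prior = KMeans(n_clusters=3, init=seed_centroids, n_init=1).fit(X)

# Compare the final WCSS of each run and keep the best one
print(km_random.inertia_, km_pp.inertia_, km_prior.inertia_)
```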

• Distance metrics: The distance metric determines how similar or dissimilar two data points are. It is important to choose a metric that is appropriate for the data being clustered. We explained how to choose distance metrics in another edition; please check it here.

• Convergence criteria: Left unchecked, the algorithm could keep iterating for a long time before it fully settles. Selecting convergence criteria is a tradeoff between clustering accuracy and computational cost. The most commonly used criteria are listed below (a short sketch follows the list):

  1. Maximum number of iterations: The algorithm will stop iterating once it has performed a certain number of iterations, regardless of whether or not the clusters have converged. If too low, the algorithm may produce suboptimal clusters. If too high, the algorithm may keep iterating unnecessarily, increasing the computational cost. Depending on data size, a usual range is 100-500 iterations.
  2. Minimum change in WCSS: The algorithm will stop iterating once the WCSS metric no longer decreases by more than a specified amount. A commonly used range for the minimum change in WCSS is 0.0001 to 0.001.
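In scikit-learn these two criteria map to the max_iter and tol parameters of KMeans (tol is applied to how much the centroids move between iterations, which is closely related to the change in WCSS). A minimal sketch using X from earlier:

```python
from sklearn.cluster import KMeans

km = KMeans(n_clusters=3, max_iter=300, tol=1e-4, n_init=10, random_state=0).fit(X)
print(km.n_iter_)  # number of iterations the best run actually needed
```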

Evaluating Cluster Quality:

Clusters can be evaluated after the algorithm converges. Based on the quality of the clusters, the model can be accepted or re-trained; eventually, whichever version of the model produces the best clusters is accepted. There are two main concepts that help judge cluster quality:

  1. Cluster cohesion measures the similarity between data points within the same cluster. A good clustering algorithm should group data points that are similar in terms of their features into the same cluster, resulting in high cluster cohesion.
  2. Cluster separation measures the dissimilarity between data points in different clusters. A good clustering algorithm should keep data points that are dissimilar in separate clusters, resulting in high cluster separation.

Measuring cluster cohesion and separation:

• Within-Cluster Sum of Squares (WCSS): It measures the sum of squared distances between each data point and its assigned cluster center. A lower WCSS value indicates better cluster cohesion, as it means that the data points are closer to their respective cluster centers.

• Between-Cluster Sum of Squares (BCSS): BCSS measures the sum of squared distances between the cluster centers. A higher BCSS value indicates better cluster separation, as it means that the cluster centers are far apart from each other.

• Average Silhouette Width: It is a measure of cluster quality that combines both cluster cohesion and cluster separation. It is built from the silhouette coefficient, which is computed for each data point i as:

s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}

Where,

  • a(i) is the average distance between data point i and all other data points in the same cluster
  • b(i) is the average distance between data point i and all data points in the nearest neighboring cluster (the cluster, other than its own, with the smallest such average distance)

The coefficient ranges from -1 to 1, where -1 indicates that the data point is probably assigned to the wrong cluster, 0 indicates that it lies on the border between two clusters, and 1 indicates that it is well clustered.

  • The average silhouette width for a cluster is the mean of the silhouette coefficients of all data points in that cluster. The average silhouette width for a set of clusters is, in turn, the mean of the average silhouette widths of the individual clusters.
  • A value close to 1 indicates that the clusters are well separated, and a value close to 0 indicates that clusters are overlapping or not well-separated. A negative value indicates that the data points may be assigned to the wrong cluster.
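A short sketch computing these three quantities for a fitted model (X and km are the feature matrix and fitted KMeans from the earlier sketches; BCSS is obtained from the standard Euclidean decomposition total sum of squares = WCSS + BCSS):

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

wcss = km.inertia_                       # cluster cohesion: lower is better
tss = ((X - X.mean(axis=0)) ** 2).sum()  # total sum of squares around the global mean
bcss = tss - wcss                        # cluster separation: higher is better

s_per_point = silhouette_samples(X, km.labels_)   # silhouette coefficient s(i) for every data point
avg_silhouette = silhouette_score(X, km.labels_)  # average silhouette width over all points
print(wcss, bcss, avg_silhouette)
```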

• Density-Based Clustering Validity (DBCV): The DBCV value ranges between -1 and 1. A higher value indicates better cluster quality, with clusters that are well-separated and compact, while a negative value indicates highly overlapping or poorly separated clusters.

The DBCV formula combines the following quantities:

  • k is the number of clusters
  • d_i is the average distance of each data point in cluster i to its nearest cluster center
  • d_j is the average intra-cluster distance of cluster j

There are multiple other criteria for measuring cluster quality; those are left for your self-study. Start with the Dunn Index, Davies-Bouldin Index (DBI), and Calinski-Harabasz Index.
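Two of these are available directly in scikit-learn if you want a quick starting point (the Dunn Index has no built-in function there); X and km are from the earlier sketches:

```python
from sklearn.metrics import davies_bouldin_score, calinski_harabasz_score

print(davies_bouldin_score(X, km.labels_))     # Davies-Bouldin Index: lower is better
print(calinski_harabasz_score(X, km.labels_))  # Calinski-Harabasz Index: higher is better
```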

Variations of K-Means:

Other than K-Means++ (discussed earlier), there are two more variations of K-Means that are useful in particular scenarios:

  1. Mini-Batch K-Means can handle large datasets by using a random subset of the data to update the centroids in each iteration, instead of using the entire dataset. This makes the algorithm much faster and more scalable, especially when dealing with large datasets or streaming data (see the usage sketch after this list).
  2. Kernel K-Means applies a kernel function to map the data into a higher-dimensional space, which can improve the separation of non-linearly separable clusters. For example, in image segmentation, Kernel K-Means can separate objects in an image based on their texture or color, which may not be linearly separable in the original space; the usual K-Means assignment and update steps are then carried out in the transformed space.
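A minimal Mini-Batch K-Means sketch with scikit-learn; the batch size is arbitrary, and the partial_fit loop illustrates streaming-style updates on chunks of data. Kernel K-Means has no direct scikit-learn implementation, so it is not shown here.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Fit on the whole dataset, but update centroids from small random batches
mbk = MiniBatchKMeans(n_clusters=3, batch_size=256, n_init=10, random_state=0).fit(X)
print(mbk.cluster_centers_)

# Streaming-style usage: feed chunks of data as they arrive
mbk_stream = MiniBatchKMeans(n_clusters=3, batch_size=256, random_state=0)
for chunk in np.array_split(X, 5):
    mbk_stream.partial_fit(chunk)
```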

Time for you to support:

  1. Reply to this email with your question
  2. Forward/Share to a friend who can benefit from this
  3. Chat on Substack with BxD (here)
  4. Engage with BxD on LinkedIn (here)

In the coming posts, we will cover a few more clustering models, such as DBSCAN, Affinity Propagation, etc. Then we will move on to pattern search algorithms.

Let us know your feedback!

Until then,

Enjoy life in full!

#businessxdata #bxd #unsupervisedlearning #kmeans #primer
