K-means clustering

K-means clustering

Welcome to our third week of "Cup of Coffee With an ML Algorithm"! We're excited to have our new guest, K Means Clustering Algorithm, with us today.

With the K-Means Clustering Algorithm, we can group similar data points together and uncover hidden insights and patterns in our data. So grab your favorite cup of coffee and let's dive in!

New to ML? If so, the algorithm names, the math, and the jargon can feel overwhelming at first.

But don't worry, with patience, consistent learning, and a lot of coffee, you can become an ML expert. It may seem daunting at first, but the more you learn and practice, the more comfortable and confident you'll become with the algorithms, math, and data involved in ML. So keep at it, stay curious, and don't forget to take breaks and enjoy your favorite cup of caffeine along the way!


Alright, let's raise our cups and take our first sip with our new friend, the K-Means algorithm!


K-Means Clustering is an unsupervised machine learning algorithm that groups similar data points together into clusters. It does this by first randomly placing a number of cluster centers (known as centroids), and then iteratively moving those centroids toward the center of their respective clusters until the clusters stop changing. The final result is a set of clusters.


I understand that you may have the following questions:



  • What is unsupervised machine learning?
  • What is clustering? Both clustering and classification involve grouping, so what is the difference between them?
  • What does the "k" term mean in K-means?
  • Where is K-means used?
  • Does K-means have specific steps?
  • How do you determine the optimal number of clusters for K-means?
  • How do you choose the initial centroid values in K-means?
  • How do you measure the performance of K-means?
  • What are some alternatives to K-means?

Questions show you are trying to learn.


K-Means is Unsupervised

K-Means is an unsupervised learning algorithm. It is used when no labeled data is available and the goal is to discover patterns and relationships within the data.

We keep mentioning unlabeled data, so what does an unlabeled dataset actually look like? Haven't you had this question?


For example, in a dataset of customer transactions, each data point might contain information such as the customer's name, age, gender, purchase amount, and purchase date. In an unsupervised setting, there would be no additional information such as whether the purchase was for a specific product or category, or whether the purchase was made on a promotion or sale.



The goal of an unsupervised algorithm on this type of dataset would be to identify any patterns, groups, or similarities in the data based on the features provided, without any prior knowledge of the specific categories or labels.

Unlabeled data does not have any corresponding labels or target variables that the algorithm is attempting to learn from, and the algorithm must discover the inherent structure and patterns in the data on its own.
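To make this concrete, here is a minimal sketch of what such an unlabeled customer-transactions table could look like in pandas. The column names and values are purely illustrative, not taken from a real dataset; the key point is that there is no target column telling the algorithm which group each row belongs to.

import pandas as pd

# A hypothetical unlabeled dataset: only descriptive features,
# no label or target column for the algorithm to predict.
transactions = pd.DataFrame({
    'age':             [23, 45, 31, 52, 28],
    'gender':          ['F', 'M', 'F', 'M', 'F'],
    'purchase_amount': [120.5, 30.0, 75.2, 210.0, 15.8],
})
print(transactions)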


On the whole, K-Means does not make predictions. Instead, it groups the data points into clusters based on their similarity.

Here another question might knock on your mind: why don't we call it classification?



In classification, we have labeled data where each data point is associated with a specific label or class, and the goal is to train a model that can accurately predict the label for new, unseen data. The model is trained on a set of labeled data, and the accuracy of the predictions is evaluated based on how well it can correctly predict the labels of the test data.

In clustering, we have unlabeled data and the goal is to group similar data points together into clusters based on their similarity. The clustering algorithm does not attempt to predict a label or a class for each data point, but instead groups similar data points together. The quality of the clustering is evaluated based on how well the data points within each cluster are similar to each other, and how different they are from the data points in other clusters.



So, in short, their goals differ:

Classification Goal - To predict the class

Clustering Goal - To group the datapoints based on Similarities


Types of Clustering

Clustering algorithms are commonly grouped into families such as partition-based, hierarchical, and density-based clustering.

Why is K-Means based on partition clustering?

K-means is based on partition clustering because it partitions the data into k non-overlapping clusters, with each data point belonging to exactly one cluster.


What does the "K" term mean in K-means?

In K-means clustering, the "k" refers to the number of clusters that the algorithm will attempt to partition the data into, and it is a user-specified parameter.

Choosing the right value of k is an important step in the K-means clustering process.


How do we choose the right value of K?

The Elbow Method is a heuristic used to estimate the optimal number of clusters in K-means clustering. It involves plotting the sum of squared distances of each data point to its nearest cluster centroid against the number of clusters, K. The optimal value of K is at the "elbow" of the plot, where the decrease in the sum of squared distances starts to level off. Even though other approaches exist, this one is the most widely used.
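As a rough sketch of how the Elbow Method can be coded with scikit-learn (X here is a placeholder for your numeric feature matrix, which is an assumption on my part):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# X is assumed to be a numeric feature matrix (NumPy array or DataFrame)
wss = []
k_values = range(1, 11)
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wss.append(model.inertia_)  # sum of squared distances to nearest centroid

plt.plot(k_values, wss, marker='o')
plt.xlabel('Number of clusters (K)')
plt.ylabel('Within-cluster sum of squares')
plt.title('Elbow Method')
plt.show()

The "elbow" is the value of K where the curve visibly flattens out.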


How do you choose the initial centroid values in K-means?

  1. Random Initialization: The simplest method is to randomly select K data points from the dataset as the initial centroid values.



  2. K-Means++ Initialization: K-Means++ is an improvement over random initialization that aims to choose more representative initial centroids. The algorithm selects the first centroid randomly; each subsequent centroid is then chosen with probability proportional to its squared distance from the nearest existing centroid, so points far away from the current centroids are more likely to be picked.
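In scikit-learn, the initialization strategy is controlled by the init parameter, so switching between the two methods is a one-line change. A small sketch, assuming X is your feature matrix (the cluster count of 3 is arbitrary):

from sklearn.cluster import KMeans

# K-Means++ initialization (also scikit-learn's default)
kmeans_pp = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=0).fit(X)

# Plain random initialization, for comparison
kmeans_rand = KMeans(n_clusters=3, init='random', n_init=10, random_state=0).fit(X)

print(kmeans_pp.inertia_, kmeans_rand.inertia_)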


Don't get Confused


K - number of Clusters

Centroid - Represents the center of a cluster.

Steps to implement K Means

Step 1: Initialization

Choose K, and pick K random data points as the initial centroids.

Step 2: Assignment

Assign each data point to the nearest centroid to form K clusters. This is done by computing the Euclidean distance between each data point and each centroid, and assigning the data point to the centroid with the closest distance.


Step 3: Update

Compute the new centroids for each cluster by taking the mean of all the data points assigned to that cluster.

Step 4: Repeat

Repeat steps 2 and 3 until the centroids no longer change significantly or a maximum number of iterations is reached.

The K-means algorithm is an iterative process that aims to minimize the within-cluster sum of squares.
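If you want to see the four steps written out by hand, here is a minimal NumPy sketch. It is not production code: it assumes X is a 2-D numeric array and skips edge cases such as empty clusters.

import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: Initialization - pick k random data points as centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2: Assignment - each point goes to its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: Update - recompute each centroid as the mean of its points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: Repeat until the centroids stop changing
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

Calling kmeans(X, k=3) returns a cluster label for every row of X along with the final centroids.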

How do you measure the performance of K-means?

  1. Within-cluster sum of squares (WSS)
  2. Between-cluster sum of squares (BSS)


A higher value of BSS (Between-Cluster Sum of Squares) and a lower value of WSS (Within-Cluster Sum of Squares) are generally considered to indicate better cluster separation and therefore better performance.
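A quick sketch of how you might compute both quantities for a fitted scikit-learn model. It assumes X is the feature matrix and kmeans is an already-fitted KMeans model (like the one built later in this article), and uses the fact that, for Euclidean K-means, the total sum of squares equals WSS + BSS.

import numpy as np

X_arr = np.asarray(X, dtype=float)
wss = kmeans.inertia_                               # within-cluster sum of squares
tss = ((X_arr - X_arr.mean(axis=0)) ** 2).sum()     # total sum of squares
bss = tss - wss                                     # between-cluster sum of squares
print(f"WSS = {wss:.2f}, BSS = {bss:.2f}")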

Hope you got it!

A quick Summary of K Means Algorithm



  1. An unsupervised algorithm that performs clustering
  2. It is a partition-based clustering approach
  3. K - the number of clusters (user-defined; the Elbow Method can help choose it)
  4. Random data points are chosen as the initial centroids of the K clusters
  5. To initialize centroids, we can use Random or K-Means++ initialization
  6. To evaluate performance: a higher BSS and a lower WSS indicate better clustering


Finally, let's implement the K-Means algorithm.

1. Import Necessary Packages

import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt        

2. Create a Dataframe

# Create a dictionary of data
data = {'column1': [1, 2, 3, 4, 5, 6],
        'column2': [2, 3, 4, 5, 6, 7],
        'column3': [3, 4, 5, 6, 7, 8]}

# Create a data frame from the dictionary
df = pd.DataFrame(data)        

3. Convert that Dataframe to CSV

# Save the data frame to a CSV file
df.to_csv('data.csv', index=False)        

The step above is not mandatory; I am including it just to gain some additional knowledge.
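If you saved the CSV, you could load it back later with pandas, which is handy when your data comes from a file rather than a dictionary:

# Read the data back from the CSV file (optional)
df = pd.read_csv('data.csv')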

4. Implementation of the algorithm

# Select the columns to be used for clustering
X = df[['column1', 'column2', 'column3']]

# Perform K-means clustering with k=4 clusters
# (n_init and random_state are set explicitly so the results are reproducible)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
kmeans.fit(X)

# Add the predicted cluster labels to the data frame
df['cluster'] = kmeans.predict(X)        

5. Visualization

# Visualize the clustered data
fig, ax = plt.subplots(figsize=(10, 6))
scatter = ax.scatter(df['column1'], df['column2'], c=df['cluster'], s=50)
plt.xlabel('column1')
plt.ylabel('column2')
plt.title('K-means Clustering')
legend = ax.legend(*scatter.legend_elements(), loc="lower right", title="Clusters")
ax.add_artist(legend)
plt.show()        

Output

Running the code above produces a scatter plot of the data points colored by their assigned cluster.

Hope you got it!

Link to see how it works : link

I have answered all the questions!


Except one: where is K-Means used? In practice, it is commonly applied to tasks such as customer segmentation, document and image grouping, image compression, and anomaly detection.


Finally we made it!


See you all next week for a cup of coffee with another ML algorithm.

Cheers,

Kiruthika
