K-means clustering
Kiruthika Subramani
Innovating AI for a Better Tomorrow | AI Engineer | Google Developer Expert | Author | IBM Dual Champion | 200+ Global AI Talks | Master's Student at MILA
Welcome to our third week of "Cup of Coffee With an ML Algorithm"! We're excited to have our new guest, K Means Clustering Algorithm, with us today.
With the K-Means Clustering Algorithm, we can group similar data points together and uncover hidden insights and patterns in our data. So grab your favorite cup of coffee and let's dive in!
New to ML? I hope the meme below captures your mind voice.
But don't worry, with patience, consistent learning, and a lot of coffee, anyone can become an ML expert. It may seem daunting at first, but the more you learn and practice, the more comfortable and confident you'll become with the algorithms, math, and data involved in ML. So keep at it, stay curious, and don't forget to take breaks and enjoy your favorite cup of caffeine along the way!
Alright, let's raise our cups and take our first sip with our new friend, the K-Means algorithm!
K-Means clustering is an unsupervised machine learning algorithm that groups similar data points together into clusters. It does this by first randomly choosing a number of cluster centers (known as centroids), and then iteratively moving those centroids toward the center of their respective clusters until the clusters stop changing. The final result is a set of clusters.
For beginners, it looks something like the illustration below.
I understand that you may have the following questions:
Questions show you are trying to learn.
K-Means is Unsupervised
K-Means is an unsupervised learning algorithm, used when no labeled data is available and the goal is to discover patterns and relationships within the data.
We keep mentioning unlabeled data, so what does an unlabeled dataset actually look like? Haven't you got this question?
For example, in a dataset of customer transactions, each data point might contain information such as the customer's name, age, gender, purchase amount, and purchase date. In an unsupervised setting, there would be no additional information such as whether the purchase was for a specific product or category, or whether the purchase was made on a promotion or sale.
The goal of an unsupervised algorithm on this type of dataset would be to identify any patterns, groups, or similarities in the data based on the features provided, without any prior knowledge of the specific categories or labels.
Unlabeled data does not have any corresponding labels or target variables that the algorithm is attempting to learn from, and the algorithm must discover the inherent structure and patterns in the data on its own.
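To make that concrete, here is a tiny, made-up example of what such an unlabeled dataset could look like in code. The column names and values are purely illustrative:
import pandas as pd
# A hypothetical unlabeled customer-transactions table: only features, no target/label column
transactions = pd.DataFrame({
    'name': ['Asha', 'Ben', 'Carlos', 'Divya', 'Elena'],
    'age': [23, 45, 31, 52, 29],
    'gender': ['F', 'M', 'M', 'F', 'F'],
    'purchase_amount': [120.5, 75.0, 310.2, 45.9, 89.3],
    'purchase_date': ['2023-01-05', '2023-03-14', '2023-03-20', '2023-07-02', '2023-12-11']
})
print(transactions)   # notice: there is no "label" or "class" column anywhere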
On the whole, K-Means does not make predictions. Instead, it groups the data points into clusters based on their similarity.
Here your mind knocks with another question: why don't we call it classification?
In classification, we have labeled data where each data point is associated with a specific label or class, and the goal is to train a model that can accurately predict the label for new, unseen data. The model is trained on a set of labeled data, and the accuracy of the predictions is evaluated based on how well it can correctly predict the labels of the test data.
In clustering, we have unlabeled data and the goal is to group similar data points together into clusters based on their similarity. The clustering algorithm does not attempt to predict a label or a class for each data point, but instead groups similar data points together. The quality of the clustering is evaluated based on how well the data points within each cluster are similar to each other, and how different they are from the data points in other clusters.
So their goals differ:
Classification Goal - To predict the class of each data point.
Clustering Goal - To group the data points based on their similarities.
The tiny sketch below shows this difference in code.
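This is a minimal sketch using scikit-learn on made-up data; LogisticRegression simply stands in for any classifier:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [8, 7], [9, 6]])
y = np.array([0, 0, 1, 1])                 # labels exist only in the supervised case

clf = LogisticRegression().fit(X, y)       # classification: learns from X *and* y
print(clf.predict([[2, 3]]))               # predicts a class label for new data

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)   # clustering: only X, no y
print(km.labels_)                          # cluster indices discovered from the data alone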
Types of Clustering
Why is K-Means based on partition clustering?
K-means is based on partition clustering because it partitions the data into k non-overlapping clusters, with each data point belonging to exactly one cluster.
What does the "K" term mean in K-means?
In K-means clustering, the "k" refers to the number of clusters that the algorithm will attempt to partition the data into. And k is a user-specified parameter.
Choosing the right value of k is an important step in the K-means clustering process.
How do we choose the right value of K?
The Elbow Method is a heuristic used to estimate the optimal number of clusters in K-means clustering. It involves plotting the sum of squared distances of each data point to its nearest cluster centroid against the number of clusters, K. The optimal value of K is at the "elbow" of the plot, where the decrease in the sum of squared distances starts to level off. Although many approaches exist, this one is the most widely used.
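As a rough sketch on made-up data, the Elbow Method can be plotted like this with scikit-learn, where inertia_ is the sum of squared distances to the nearest centroid:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(200, 2))   # toy data, purely for illustration
wss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wss.append(km.inertia_)                           # sum of squared distances to nearest centroid

plt.plot(range(1, 11), wss, marker='o')
plt.xlabel('Number of clusters K')
plt.ylabel('Within-cluster sum of squares')
plt.title('Elbow Method')
plt.show()                                            # look for the "elbow" where the curve flattens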
How do you choose the initial centroid values in K-means?
K-Means++ Initialization: K-Means++ is an improvement over random initialization that aims to choose more representative initial centroid values. The algorithm selects the first centroid randomly; each subsequent centroid is then chosen from the remaining data points with a probability proportional to its squared distance from the nearest existing centroid, so points far away from the current centroids are more likely to be picked.
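In scikit-learn you don't have to implement this yourself; init='k-means++' is the default, but you can spell it out explicitly:
from sklearn.cluster import KMeans
# K-Means++ seeding (the scikit-learn default); random_state makes the seeding reproducible
kmeans = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=0)
# compare with plain random initialization: KMeans(n_clusters=3, init='random')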
Don't get Confused
K - number of Clusters
Centroid - Represents the center of a cluster.
Steps to implement K Means
Step 1: Initialization
Choose K, and pick K random data points to be the initial centroids.
Step 2: Assignment
Assign each data point to the nearest centroid to form K clusters. This is done by computing the Euclidean distance between each data point and each centroid, and assigning the data point to the centroid with the closest distance.
Step 3: Update
Compute the new centroids for each cluster by taking the mean of all the data points assigned to that cluster.
Step 4: Repeat
Repeat steps 2 and 3 until the centroids no longer change significantly or a maximum number of iterations is reached.
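Putting the four steps together, here is a minimal from-scratch sketch in NumPy. It is a toy illustration of the steps above, not a production implementation:
import numpy as np

def kmeans(X, k, max_iters=100, tol=1e-4, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: Initialization - pick K random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2: Assignment - each point goes to the centroid with the smallest Euclidean distance
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: Update - each centroid becomes the mean of the points assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        # Step 4: Repeat - stop when the centroids barely move (or max_iters is reached)
        if np.linalg.norm(new_centroids - centroids) < tol:
            return new_centroids, labels
        centroids = new_centroids
    return centroids, labels

X = np.array([[1.0, 2.0], [1.5, 1.8], [1.0, 0.6], [8.0, 8.0], [9.0, 11.0], [8.5, 9.0]])
centroids, labels = kmeans(X, k=2)
print(labels)      # cluster index of each point
print(centroids)   # final cluster centres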
The K-means algorithm is an iterative process that aims to minimize the within-cluster sum of squares.
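In symbols, the quantity being minimized is WCSS = Σⱼ Σ_{xᵢ ∈ Cⱼ} ‖xᵢ − μⱼ‖², where Cⱼ is the set of points in cluster j and μⱼ is its centroid. Every assignment and update step either lowers this value or leaves it unchanged, which is why the algorithm converges.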
How do you measure the performance of K-means?
A higher value of BSS (Between-Cluster Sum of Squares) and a lower value of WSS (Within-Cluster Sum of Squares) are generally considered to indicate better cluster separation and therefore better performance.
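As a rough sketch, assuming you already have a fitted scikit-learn KMeans model called kmeans and a feature matrix X (we build exactly that in the implementation section below), these quantities can be computed like this:
import numpy as np
from sklearn.metrics import silhouette_score

Xa = np.asarray(X, dtype=float)
wss = kmeans.inertia_                         # within-cluster sum of squares (lower is better)
tss = ((Xa - Xa.mean(axis=0)) ** 2).sum()     # total sum of squares around the overall mean
bss = tss - wss                               # between-cluster sum of squares (higher is better)
print(f"WSS = {wss:.2f}, BSS = {bss:.2f}")
print("Silhouette:", silhouette_score(Xa, kmeans.labels_))   # another common check, from -1 to 1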
Hope you got it!
A quick Summary of K Means Algorithm
Finally, let's implement the K-Means algorithm.
1. Import Necessary Packages
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
2. Create a Dataframe
# Create a dictionary of data
data = {'column1': [1, 2, 3, 4, 5, 6],
        'column2': [2, 3, 4, 5, 6, 7],
        'column3': [3, 4, 5, 6, 7, 8]}
# Create a data frame from the dictionary
df = pd.DataFrame(data)
3. Convert that Dataframe to CSV
# Save the data frame to a CSV file
df.to_csv('data.csv', index=False)
The step above is not mandatory; I am including it just to pick up a little extra knowledge.
4. Implementation of algorithm
# Select the columns to be used for clustering
X = df[['column1', 'column2', 'column3']]
# Perform K-means clustering with k=4 clusters
# (n_init and random_state keep the result reproducible across runs)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
kmeans.fit(X)
# Add the predicted cluster labels to the data frame
df['cluster'] = kmeans.predict(X)
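If you want to peek at what was learned before plotting, you can print the fitted centroids and the cluster assigned to each row:
# Inspect the fitted model: one centroid per cluster, plus the cluster index of every row
print(kmeans.cluster_centers_)
print(df)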
5. Visualization
# Visualize the clustered data
fig, ax = plt.subplots(figsize=(10, 6))
scatter = ax.scatter(df['column1'], df['column2'], c=df['cluster'], s=50)
plt.xlabel('column1')
plt.ylabel('column2')
plt.title('K-means Clustering')
legend = ax.legend(*scatter.legend_elements(), loc="lower right", title="Clusters")
ax.add_artist(legend)
plt.show()
Hope you got it!
Link to see how it works: link
I have answered all the questions!
Except one: where do we use K-Means?
Finally we made it!
Meeting you all next week, for a cup of coffee with another ML algorithm.
Cheers,
Kiruthika