K-means clustering
Kiruthika Subramani
Innovating AI for a Better Tomorrow | AI Engineer | Google Developer Expert | Author | IBM Dual Champion | 200+ Global AI Talks | Master's Student at MILA
Welcome to our third week of "Cup of Coffee With an ML Algorithm"! We're excited to have our new guest, K Means Clustering Algorithm, with us today.
With the K-Means Clustering Algorithm, we can group similar data points together and uncover hidden insights and patterns in our data. So grab your favorite cup of coffee and let's dive in!
New to ML? I hope the meme below captures your mind voice.
But don't worry, with patience, consistent learning, and a lot of coffee, anyone can become an ML expert. It may seem daunting at first, but the more you learn and practice, the more comfortable and confident you'll become with the algorithms, math, and data involved in ML. So keep at it, stay curious, and don't forget to take breaks and enjoy your favorite cup of caffeine along the way!
Alright, let's raise our cups and take our first sip with our new friend, the K-Means algorithm!
K-Means clustering is an unsupervised machine learning algorithm that groups similar data points together into clusters. It does this by first randomly choosing a number of cluster centers (known as centroids), and then iteratively moving those centroids toward the center of their respective clusters until the clusters stop changing. The final result is a set of clusters.
For beginners, it looks something like the illustration below.
I understand that you may have the following questions:
Questions show you are trying to learn.
K-Means is Unsupervised
K-Means is an unsupervised learning algorithm, used when no labeled data is available and the goal is to discover patterns and relationships within the data.
We keep mentioning unlabeled data, so what does an unlabeled dataset actually look like? Haven't you got this question?
For example, in a dataset of customer transactions, each data point might contain information such as the customer's name, age, gender, purchase amount, and purchase date. In an unsupervised setting, there would be no additional information such as whether the purchase was for a specific product or category, or whether the purchase was made on a promotion or sale.
The goal of an unsupervised algorithm on this type of dataset would be to identify any patterns, groups, or similarities in the data based on the features provided, without any prior knowledge of the specific categories or labels.
Unlabeled data does not have any corresponding labels or target variables that the algorithm is attempting to learn from, and the algorithm must discover the inherent structure and patterns in the data on its own.
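To make that concrete, here is a tiny, made-up example of what such an unlabeled dataset could look like in code. The column names and values are purely illustrative:
import pandas as pd
# A hypothetical unlabeled customer-transactions table: only features, no target/label column
transactions = pd.DataFrame({
    'name': ['Asha', 'Ben', 'Carlos', 'Divya', 'Elena'],
    'age': [23, 45, 31, 52, 29],
    'gender': ['F', 'M', 'M', 'F', 'F'],
    'purchase_amount': [120.5, 75.0, 310.2, 45.9, 89.3],
    'purchase_date': ['2023-01-05', '2023-03-14', '2023-03-20', '2023-07-02', '2023-12-11']
})
print(transactions)   # notice: there is no "label" or "class" column anywhere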
On the whole, K-Means does not make predictions. Instead, it groups the data points into clusters based on their similarity.
Here your mind knocks with another question: why don't we call it classification?
In classification, we have labeled data where each data point is associated with a specific label or class, and the goal is to train a model that can accurately predict the label for new, unseen data. The model is trained on a set of labeled data, and the accuracy of the predictions is evaluated based on how well it can correctly predict the labels of the test data.
In clustering, we have unlabeled data and the goal is to group similar data points together into clusters based on their similarity. The clustering algorithm does not attempt to predict a label or a class for each data point, but instead groups similar data points together. The quality of the clustering is evaluated based on how well the data points within each cluster are similar to each other, and how different they are from the data points in other clusters.
So their goals differ:
Classification Goal - To predict the class of each data point.
Clustering Goal - To group the data points based on their similarities.
The tiny sketch below shows this difference in code.
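This is a minimal sketch using scikit-learn on made-up data; LogisticRegression simply stands in for any classifier:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [8, 7], [9, 6]])
y = np.array([0, 0, 1, 1])                 # labels exist only in the supervised case

clf = LogisticRegression().fit(X, y)       # classification: learns from X *and* y
print(clf.predict([[2, 3]]))               # predicts a class label for new data

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)   # clustering: only X, no y
print(km.labels_)                          # cluster indices discovered from the data alone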
Types of Clustering
Why is K-Means based on partition clustering?
K-means is based on partition clustering because it partitions the data into k non-overlapping clusters, with each data point belonging to exactly one cluster.
What does the "K" term mean in K-means?
In K-means clustering, the "k" refers to the number of clusters that the algorithm will attempt to partition the data into. And k is a user-specified parameter.
Choosing the right value of k is an important step in the K-means clustering process.
How do we choose the right value of K?
The Elbow Method is a heuristic used to estimate the optimal number of clusters in K-means clustering. It involves plotting the sum of squared distances of each data point to its nearest cluster centroid against the number of clusters, K. The optimal value of K is at the "elbow" of the plot, where the decrease in the sum of squared distances starts to level off. Although many approaches exist, this one is the most widely used.
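As a rough sketch on made-up data, the Elbow Method can be plotted like this with scikit-learn, where inertia_ is the sum of squared distances to the nearest centroid:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(200, 2))   # toy data, purely for illustration
wss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wss.append(km.inertia_)                           # sum of squared distances to nearest centroid

plt.plot(range(1, 11), wss, marker='o')
plt.xlabel('Number of clusters K')
plt.ylabel('Within-cluster sum of squares')
plt.title('Elbow Method')
plt.show()                                            # look for the "elbow" where the curve flattens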
How do you choose the initial centroid values in K-means?
K-Means++ Initialization: K-Means++ is an improvement over random initialization that aims to choose more representative initial centroid values. The algorithm selects the first centroid randomly; each subsequent centroid is then chosen from the remaining data points with a probability proportional to its squared distance from the nearest existing centroid, so points far away from the current centroids are more likely to be picked.
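In scikit-learn you don't have to implement this yourself; init='k-means++' is the default, but you can spell it out explicitly:
from sklearn.cluster import KMeans
# K-Means++ seeding (the scikit-learn default); random_state makes the seeding reproducible
kmeans = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=0)
# compare with plain random initialization: KMeans(n_clusters=3, init='random')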
Don't get Confused
K - number of Clusters
Centroid - Represents the center of a cluster.
Steps to implement K Means
Step 1: Initialization
Choose K, and pick K random data points to be the initial centroids.
Step 2: Assignment
Assign each data point to the nearest centroid to form K clusters. This is done by computing the Euclidean distance between each data point and each centroid, and assigning the data point to the centroid with the closest distance.
Step 3: Update
Compute the new centroids for each cluster by taking the mean of all the data points assigned to that cluster.
Step 4: Repeat
Repeat steps 2 and 3 until the centroids no longer change significantly or a maximum number of iterations is reached.
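Putting the four steps together, here is a minimal from-scratch sketch in NumPy. It is a toy illustration of the steps above, not a production implementation:
import numpy as np

def kmeans(X, k, max_iters=100, tol=1e-4, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: Initialization - pick K random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2: Assignment - each point goes to the centroid with the smallest Euclidean distance
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: Update - each centroid becomes the mean of the points assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        # Step 4: Repeat - stop when the centroids barely move (or max_iters is reached)
        if np.linalg.norm(new_centroids - centroids) < tol:
            return new_centroids, labels
        centroids = new_centroids
    return centroids, labels

X = np.array([[1.0, 2.0], [1.5, 1.8], [1.0, 0.6], [8.0, 8.0], [9.0, 11.0], [8.5, 9.0]])
centroids, labels = kmeans(X, k=2)
print(labels)      # cluster index of each point
print(centroids)   # final cluster centres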
The K-means algorithm is an iterative process that aims to minimize the within-cluster sum of squares.
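In symbols, the quantity being minimized is WCSS = Σⱼ Σ_{xᵢ ∈ Cⱼ} ‖xᵢ − μⱼ‖², where Cⱼ is the set of points in cluster j and μⱼ is its centroid. Every assignment and update step either lowers this value or leaves it unchanged, which is why the algorithm converges.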
How do you measure the performance of K-means?
A higher value of BSS (Between-Cluster Sum of Squares) and a lower value of WSS (Within-Cluster Sum of Squares) are generally considered to indicate better cluster separation and therefore better performance.
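As a rough sketch, assuming you already have a fitted scikit-learn KMeans model called kmeans and a feature matrix X (we build exactly that in the implementation section below), these quantities can be computed like this:
import numpy as np
from sklearn.metrics import silhouette_score

Xa = np.asarray(X, dtype=float)
wss = kmeans.inertia_                         # within-cluster sum of squares (lower is better)
tss = ((Xa - Xa.mean(axis=0)) ** 2).sum()     # total sum of squares around the overall mean
bss = tss - wss                               # between-cluster sum of squares (higher is better)
print(f"WSS = {wss:.2f}, BSS = {bss:.2f}")
print("Silhouette:", silhouette_score(Xa, kmeans.labels_))   # another common check, from -1 to 1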
Hope you got it!
A quick Summary of K Means Algorithm
Finally, let's implement the K-Means algorithm.
1. Import Necessary Packages
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
2. Create a Dataframe
# Create a dictionary of data
data = {'column1': [1, 2, 3, 4, 5, 6],
        'column2': [2, 3, 4, 5, 6, 7],
        'column3': [3, 4, 5, 6, 7, 8]}
# Create a data frame from the dictionary
df = pd.DataFrame(data)
3. Convert that Dataframe to CSV
# Save the data frame to a CSV file
df.to_csv('data.csv', index=False)
The step above is not mandatory; I am including it just to pick up a little extra knowledge.
4. Implementation of algorithm
# Select the columns to be used for clustering
X = df[['column1', 'column2', 'column3']]
# Perform K-means clustering with k=4 clusters
# (n_init and random_state keep the result reproducible across runs)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
kmeans.fit(X)
# Add the predicted cluster labels to the data frame
df['cluster'] = kmeans.predict(X)
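If you want to peek at what was learned before plotting, you can print the fitted centroids and the cluster assigned to each row:
# Inspect the fitted model: one centroid per cluster, plus the cluster index of every row
print(kmeans.cluster_centers_)
print(df)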
5. Visualization
# Visualize the clustered data
fig, ax = plt.subplots(figsize=(10, 6))
scatter = ax.scatter(df['column1'], df['column2'], c=df['cluster'], s=50)
plt.xlabel('column1')
plt.ylabel('column2')
plt.title('K-means Clustering')
legend = ax.legend(*scatter.legend_elements(), loc="lower right", title="Clusters")
ax.add_artist(legend)
plt.show()
Hope you got it!
Link to see how it works: link
I have answered all the questions!
Except one: where do we use K-Means?
Finally we made it!
Meeting you all next week, for a cup of coffee with another ML algorithm.
Cheers,
Kiruthika