Day 10 - K-Means Clustering


  • Concept: Partitioning data into k clusters.
  • Implementation: Centroid initialization.
  • Evaluation: Inertia, silhouette score.


CONCEPT

K-Means is an unsupervised learning algorithm used for clustering tasks. The goal is to partition a dataset into k clusters, where each data point belongs to the cluster with the nearest mean. It is an iterative algorithm that aims to minimize the variance within each cluster.
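
Written out, the quantity being minimized is the within-cluster sum of squares (the same quantity reported later in this post as WCSS, or inertia):

\min_{C_1, \ldots, C_k} \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2

where \mu_i denotes the centroid (mean) of cluster C_i.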

The steps involved in K-Means clustering are:

  1. Initialization: Choose k initial cluster centroids randomly.
  2. Assignment: Assign each data point to the nearest cluster centroid.
  3. Update: Recalculate the centroids as the mean of all points in each cluster.
  4. Repeat: Repeat steps 2 and 3 until the centroids no longer change significantly or a maximum number of iterations is reached (a minimal NumPy sketch of these steps follows this list).
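
To make these steps concrete, here is a minimal from-scratch sketch in NumPy. The function name kmeans_sketch and its defaults are illustrative; in practice a library implementation such as scikit-learn's KMeans (used in the next section) is the better choice.

# Minimal from-scratch sketch of the four steps (illustrative only)

import numpy as np

def kmeans_sketch(X, k, n_iters = 100, seed = 0):
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick k distinct data points as starting centroids
    centroids = X[rng.choice(len(X), size = k, replace = False)]
    for _ in range(n_iters):
        # 2. Assignment: label each point with its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis = 2)
        labels = distances.argmin(axis = 1)
        # 3. Update: move each centroid to the mean of its assigned points
        #    (empty clusters are not handled in this sketch)
        new_centroids = np.array([X[labels == j].mean(axis = 0) for j in range(k)])
        # 4. Repeat until the centroids stop moving (or n_iters is reached)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels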


IMPLEMENTATION

Suppose we have a dataset with points in 2D space and we want to cluster them into k = 3 clusters:

# Import necessary libraries

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.simplefilter(action = 'ignore')

# Example data

np.random.seed(0)
X = np.vstack((np.random.normal(0, 1, (100, 2)),
    np.random.normal(5, 1, (100, 2)),
    np.random.normal(-5, 1, (100, 2))))

# Applying k-Means clustering

k = 3
kmeans = KMeans(n_clusters = k, random_state = 42)
y_kmeans = kmeans.fit_predict(X)

# Plotting the clusters

plt.figure(figsize = (8, 6))
sns.scatterplot(x = X[:, 0], y = X[:, 1], hue = y_kmeans, palette = 'viridis', s = 50, edgecolor = 'k')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 300, c = 'red', label = 'Centroids')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('k-Means Clustering')
plt.legend()
plt.show()        

EXPLANATION OF THE CODE

  1. Libraries: We import necessary libraries like numpy, pandas, sklearn, matplotlib, and seaborn.
  2. Data Preparation: We generate a synthetic dataset with three clusters using normal distributions.
  3. K-Means Clustering: We create a K-Means object with k = 3 clusters and fit it to the data. The fit_predict method assigns each data point to a cluster (the fitted attributes can then be inspected directly, as shown after this list).
  4. Plotting: We plot a scatterplot of the data points with colors indicating the assigned clusters and plot the centroids in red.
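
For reference, the fitted KMeans object from the code above exposes the learned quantities as attributes:

# Inspect the fitted model from the example above

print(kmeans.cluster_centers_)   # coordinates of the k centroids
print(kmeans.labels_)            # cluster index assigned to each point (same as y_kmeans)
print(kmeans.inertia_)           # within-cluster sum of squares (WCSS) of the final clustering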


Choosing the Number of Clusters

Selecting the appropriate number of clusters, k, is crucial. Common methods to determine k include:

  • Elbow Method: Plot the Within-Cluster Sum of Squares (WCSS) against the number of clusters and look for an ‘elbow’ point where the rate of decrease sharply slows.
  • Silhouette Score: Measures how similar an object is to its own cluster compared to other clusters. Higher silhouette scores indicate better-defined clusters.


ELBOW METHOD EXAMPLE

# Use Elbow method to find the optimal number of clusters

wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, random_state = 42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
    
plt.figure(figsize = (8, 6))
plt.plot(range(1, 11), wcss, marker = 'o')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')
plt.title('Elbow Method')
plt.show()        

EVALUATION METRICS

  • Within-Cluster Sum of Squares (WCSS): Measures the compactness of the clusters. Lower WCSS indicates more compact clusters.
  • Silhouette Score: Measures the separation between clusters. Values range from -1 to 1, with higher values indicating better-defined clusters (see the sketch after this list).
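
As a counterpart to the elbow plot above, a short sketch using scikit-learn's silhouette_score can compare candidate values of k. The range 2 to 10 here is an illustrative choice; the silhouette score requires at least two clusters.

# Compare silhouette scores for a range of cluster counts

from sklearn.metrics import silhouette_score

for i in range(2, 11):
    kmeans = KMeans(n_clusters = i, random_state = 42)
    labels = kmeans.fit_predict(X)
    print(f'k = {i}: silhouette score = {silhouette_score(X, labels):.3f}')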


APPLICATIONS

K-Means clustering is widely used in:

  • Market Segmentation: Grouping customers based on purchasing behaviour.
  • Image Compression: Reducing the number of colours in an image (see the sketch after this list).
  • Anomaly Detection: Identifying outliers in a dataset.
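
As an illustration of the image-compression use case, a colour-quantization sketch might look like the following. The function quantize_colours, the img array, and the choice of 16 colours are illustrative assumptions rather than part of the original post.

# Colour quantization sketch: represent an image with 16 representative colours
# (img is assumed to be an RGB array of shape (height, width, 3))

import numpy as np
from sklearn.cluster import KMeans

def quantize_colours(img, n_colours = 16):
    pixels = img.reshape(-1, 3).astype(float)        # flatten to (n_pixels, 3)
    km = KMeans(n_clusters = n_colours, random_state = 42).fit(pixels)
    quantized = km.cluster_centers_[km.labels_]      # replace each pixel with its centroid colour
    return quantized.reshape(img.shape).astype(img.dtype)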

K-Means is efficient and easy to implement, but it can be sensitive to the initial placement of centroids and to the choice of k. It works well for roughly spherical clusters of similar size but may struggle with non-spherical or overlapping clusters.
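
Regarding initialization sensitivity, scikit-learn's KMeans mitigates it by default with the k-means++ scheme combined with multiple restarts; making these settings explicit looks like this:

# Make the initialization strategy explicit

kmeans = KMeans(n_clusters = 3,
                init = 'k-means++',   # spread the initial centroids apart (scikit-learn's default)
                n_init = 10,          # run 10 initializations and keep the lowest-inertia result
                random_state = 42)
y_kmeans = kmeans.fit_predict(X)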


Download the Jupyter Notebook file for Day 10 here.

Ime Eti-mfon

Data Scientist | Machine Learning Engineer | Data Program Community Ambassador @ ALX

What do you think about the importance of K-Means Clustering?
