Day 10 - K-Means Clustering


  • Concept: Partitioning data into k clusters.
  • Implementation: Centroid initialization.
  • Evaluation: Inertia, silhouette score.


CONCEPT

K-Means is an unsupervised learning algorithm used for clustering tasks. The goal is to partition a dataset into k clusters, where each data point belongs to the cluster with the nearest mean. It is an iterative algorithm that aims to minimize the variance within each cluster.
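
Written out, the quantity being minimized is the within-cluster sum of squares (the same quantity reported later in this post as WCSS, or inertia):

\min_{C_1, \ldots, C_k} \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2

where \mu_i denotes the centroid (mean) of cluster C_i.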

The steps involved in K-Means clustering are:

  1. Initialization: Choose k initial cluster centroids randomly.
  2. Assignment: Assign each data point to the nearest cluster centroid.
  3. Update: Recalculate the centroids as the mean of all points in each cluster.
  4. Repeat: Repeat steps 2 and 3 until the centroids no longer change significantly or a maximum number of iterations is reached (a minimal NumPy sketch of these steps follows this list).
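
To make these steps concrete, here is a minimal from-scratch sketch in NumPy. The function name kmeans_sketch and its defaults are illustrative; in practice a library implementation such as scikit-learn's KMeans (used in the next section) is the better choice.

# Minimal from-scratch sketch of the four steps (illustrative only)

import numpy as np

def kmeans_sketch(X, k, n_iters = 100, seed = 0):
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick k distinct data points as starting centroids
    centroids = X[rng.choice(len(X), size = k, replace = False)]
    for _ in range(n_iters):
        # 2. Assignment: label each point with its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis = 2)
        labels = distances.argmin(axis = 1)
        # 3. Update: move each centroid to the mean of its assigned points
        #    (empty clusters are not handled in this sketch)
        new_centroids = np.array([X[labels == j].mean(axis = 0) for j in range(k)])
        # 4. Repeat until the centroids stop moving (or n_iters is reached)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels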


IMPLEMENTATION

Suppose we have a dataset with points in 2D space and we want to cluster them into k = 3 clusters:

# Import necessary libraries

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.simplefilter(action = 'ignore')

# Example data

np.random.seed(0)
X = np.vstack((np.random.normal(0, 1, (100, 2)),
    np.random.normal(5, 1, (100, 2)),
    np.random.normal(-5, 1, (100, 2))))

# Applying k-Means clustering

k = 3
kmeans = KMeans(n_clusters = k, random_state = 42)
y_kmeans = kmeans.fit_predict(X)

# Plotting the clusters

plt.figure(figsize = (8, 6))
sns.scatterplot(x = X[:, 0], y = X[:, 1], hue = y_kmeans, palette = 'viridis', s = 50, edgecolor = 'k')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 300, c = 'red', label = 'Centroids')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('k-Means Clustering')
plt.legend()
plt.show()        

EXPLANATION OF THE CODE

  1. Libraries: We import necessary libraries like numpy, pandas, sklearn, matplotlib, and seaborn.
  2. Data Preparation: We generate a synthetic dataset with three clusters using normal distributions.
  3. K-Means Clustering: We create a K-Means object with k = 3 clusters and fit it to the data. The fit_predict method assigns each data point to a cluster (the fitted attributes can then be inspected directly, as shown after this list).
  4. Plotting: We plot a scatterplot of the data points with colors indicating the assigned clusters and plot the centroids in red.
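
For reference, the fitted KMeans object from the code above exposes the learned quantities as attributes:

# Inspect the fitted model from the example above

print(kmeans.cluster_centers_)   # coordinates of the k centroids
print(kmeans.labels_)            # cluster index assigned to each point (same as y_kmeans)
print(kmeans.inertia_)           # within-cluster sum of squares (WCSS) of the final clustering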


Choosing the Number of Clusters

Selecting the appropriate number of clusters, k, is crucial. Common methods to determine k include:

  • Elbow Method: Plot the Within-Cluster Sum of Squares (WCSS) against the number of clusters and look for an ‘elbow’ point where the rate of decrease sharply slows.
  • Silhouette Score: Measures how similar an object is to its own cluster compared to other clusters. Higher silhouette scores indicate better-defined clusters.


ELBOW METHOD EXAMPLE

# Use Elbow method to find the optimal number of clusters

wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, random_state = 42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
    
plt.figure(figsize = (8, 6))
plt.plot(range(1, 11), wcss, marker = 'o')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')
plt.title('Elbow Method')
plt.show()        

EVALUATION METRICS

  • Within-Cluster Sum of Squares (WCSS): Measures the compactness of the clusters. Lower WCSS indicates more compact clusters.
  • Silhouette Score: Measures the separation between clusters. Values range from -1 to 1, with higher values indicating better-defined clusters (see the sketch after this list).
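
As a counterpart to the elbow plot above, a short sketch using scikit-learn's silhouette_score can compare candidate values of k. The range 2 to 10 here is an illustrative choice; the silhouette score requires at least two clusters.

# Compare silhouette scores for a range of cluster counts

from sklearn.metrics import silhouette_score

for i in range(2, 11):
    kmeans = KMeans(n_clusters = i, random_state = 42)
    labels = kmeans.fit_predict(X)
    print(f'k = {i}: silhouette score = {silhouette_score(X, labels):.3f}')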


APPLICATIONS

K-Means clustering is widely used in:

  • Market Segmentation: Grouping customers based on purchasing behaviour.
  • Image Compression: Reducing the number of colours in an image (see the sketch after this list).
  • Anomaly Detection: Identifying outliers in a dataset.
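
As an illustration of the image-compression use case, a colour-quantization sketch might look like the following. The function quantize_colours, the img array, and the choice of 16 colours are illustrative assumptions rather than part of the original post.

# Colour quantization sketch: represent an image with 16 representative colours
# (img is assumed to be an RGB array of shape (height, width, 3))

import numpy as np
from sklearn.cluster import KMeans

def quantize_colours(img, n_colours = 16):
    pixels = img.reshape(-1, 3).astype(float)        # flatten to (n_pixels, 3)
    km = KMeans(n_clusters = n_colours, random_state = 42).fit(pixels)
    quantized = km.cluster_centers_[km.labels_]      # replace each pixel with its centroid colour
    return quantized.reshape(img.shape).astype(img.dtype)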

K-Means is efficient and easy to implement, but it can be sensitive to the initial placement of centroids and to the choice of k. It works well for roughly spherical clusters of similar size but may struggle with non-spherical or overlapping clusters.
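
Regarding initialization sensitivity, scikit-learn's KMeans mitigates it by default with the k-means++ scheme combined with multiple restarts; making these settings explicit looks like this:

# Make the initialization strategy explicit

kmeans = KMeans(n_clusters = 3,
                init = 'k-means++',   # spread the initial centroids apart (scikit-learn's default)
                n_init = 10,          # run 10 initializations and keep the lowest-inertia result
                random_state = 42)
y_kmeans = kmeans.fit_predict(X)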


Download the Jupyter Notebook file for Day 10 here.

Ime Eti-mfon

Data Scientist | Machine Learning Engineer | Data Program Community Ambassador @ ALX

What do you think about the importance of K-Means Clustering?
