Day 10 - K-Means Clustering
Ime Eti-mfon
Data Scientist | Machine Learning Engineer | Data Program Community Ambassador @ ALX
CONCEPT
K-Means is an unsupervised learning algorithm used for clustering tasks. The goal is to partition a dataset into k clusters, where each data point belongs to the cluster with the nearest mean. It is an iterative algorithm that aims to minimize the variance within each cluster.
The steps involved in K-Means clustering are:
1. Choose the number of clusters, k.
2. Initialize k centroids (for example, by picking k data points at random).
3. Assign each data point to its nearest centroid.
4. Recompute each centroid as the mean of the points assigned to it.
5. Repeat steps 3 and 4 until the assignments stop changing or a maximum number of iterations is reached.
A minimal NumPy sketch of this loop is shown below.
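For intuition only, here is a minimal plain-NumPy sketch of the loop described above; the function name, the purely random initialization, and the lack of empty-cluster handling are simplifications made for illustration, and the scikit-learn implementation in the next section is what you would use in practice.
import numpy as np

def kmeans_sketch(X, k, n_iters = 100, seed = 0):
    # Step 2: initialize centroids by picking k random data points
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size = k, replace = False)]
    for _ in range(n_iters):
        # Step 3: assign each point to its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis = 2)
        labels = distances.argmin(axis = 1)
        # Step 4: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis = 0) for j in range(k)])
        # Stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids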
IMPLEMENTATION
Suppose we have a dataset with points in 2D space and we want to cluster them into k = 3 clusters:
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.simplefilter(action = 'ignore')
# Example data: three Gaussian blobs centered at (0, 0), (5, 5), and (-5, -5)
np.random.seed(0)
X = np.vstack((np.random.normal(0, 1, (100, 2)),
               np.random.normal(5, 1, (100, 2)),
               np.random.normal(-5, 1, (100, 2))))
# Applying K-Means clustering
k = 3
kmeans = KMeans(n_clusters = k, random_state = 42)
y_kmeans = kmeans.fit_predict(X)
# Plotting the clusters
plt.figure(figsize = (8, 6))
sns.scatterplot(x = X[:, 0], y = X[:, 1], hue = y_kmeans, palette = 'viridis', s = 50, edgecolor = 'k')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 300, c = 'red', label = 'Centroids')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('K-Means Clustering')
plt.legend()
plt.show()
EXPLANATION OF THE CODE
- np.random.normal and np.vstack build a synthetic dataset of 300 two-dimensional points drawn from three well-separated Gaussian blobs.
- KMeans(n_clusters = k, random_state = 42) configures the model; random_state makes the centroid initialization reproducible.
- fit_predict(X) runs the clustering and returns the cluster label assigned to each point (y_kmeans).
- kmeans.cluster_centers_ holds the final centroid coordinates, which the second scatter call overlays in red on top of the colored clusters.
CHOOSING THE NUMBER OF CLUSTERS
Selecting the appropriate number of clusters (k) is crucial. Common methods to determine k include:
- The elbow method: plot the within-cluster sum of squares (WCSS) against k and look for the point where the curve bends.
- The silhouette score: choose the k that maximizes the average silhouette across all points.
- Domain knowledge: the business or scientific context often suggests a natural number of groups.
ELBOW METHOD EXAMPLE
# Use the Elbow method to find the optimal number of clusters
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, random_state = 42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)    # inertia_ is the within-cluster sum of squares (WCSS)
plt.figure(figsize = (8, 6))
plt.plot(range(1, 11), wcss, marker = 'o')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')
plt.title('Elbow Method')
plt.show()
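As a complement to the elbow plot, and as one of the methods listed above, the silhouette score can also be used to pick k directly. The sketch below does so for k from 2 to 10 (an arbitrary range chosen for this example) and assumes X and KMeans from the code above are still in scope.
from sklearn.metrics import silhouette_score

# Silhouette-based selection: a higher average silhouette means better-separated clusters
scores = {}
for i in range(2, 11):                       # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters = i, random_state = 42).fit_predict(X)
    scores[i] = silhouette_score(X, labels)

best_k = max(scores, key = scores.get)
print('Best k by silhouette score:', best_k)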
EVALUATION METRICS
Because clustering has no ground-truth labels, internal metrics are commonly used to judge the result:
- Inertia (WCSS): the sum of squared distances from each point to its assigned centroid; lower means tighter clusters, but it always decreases as k grows.
- Silhouette score: ranges from -1 to 1 and measures how well each point fits its own cluster compared with the nearest other cluster; higher is better.
- Davies-Bouldin index: the average similarity between each cluster and its most similar neighbor; lower is better.
A short scikit-learn example follows the list.
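The snippet below computes two of these metrics for the clustering produced in the implementation section; it assumes X and y_kmeans are still in scope.
from sklearn.metrics import silhouette_score, davies_bouldin_score

# Silhouette score: closer to 1 means compact, well-separated clusters
print('Silhouette score:', silhouette_score(X, y_kmeans))

# Davies-Bouldin index: lower values indicate better-defined clusters
print('Davies-Bouldin index:', davies_bouldin_score(X, y_kmeans))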
APPLICATIONS
K-Means clustering is widely used in:
- Customer segmentation: grouping customers by purchasing behavior or demographics.
- Image compression (color quantization): representing an image with only k colors.
- Document clustering: grouping similar articles, emails, or support tickets.
- Anomaly detection: points that sit far from every centroid can be flagged as unusual.
A small color-quantization sketch follows the list.
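As a concrete illustration of one of these applications, here is a minimal color-quantization sketch; the use of scikit-learn's bundled sample image and the choice of 16 colors are assumptions made for this example, and any RGB image array would work the same way.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_sample_image

image = load_sample_image('china.jpg')            # RGB image, shape (height, width, 3)
pixels = image.reshape(-1, 3).astype(float)       # one row per pixel

# Cluster the pixel colors into 16 groups (this can take a moment on the full image)
kmeans = KMeans(n_clusters = 16, random_state = 42).fit(pixels)

# Replace each pixel with its cluster centroid: the image now uses only 16 colors
quantized = kmeans.cluster_centers_[kmeans.labels_].reshape(image.shape).astype(np.uint8)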
K-Means is efficient and easy to implement but can be sensitive to the initial placement of centroids and the choice of k. It works well for spherical clusters but may struggle with non-spherical or overlapping clusters.
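Note that scikit-learn already mitigates the initialization issue: its KMeans uses the k-means++ initialization and can rerun the algorithm several times, keeping the run with the lowest inertia. The snippet below simply makes those settings explicit.
# Making the initialization strategy explicit
kmeans = KMeans(n_clusters = 3,
                init = 'k-means++',    # spread the initial centroids apart
                n_init = 10,           # run 10 times with different starts, keep the best inertia
                random_state = 42)
labels = kmeans.fit_predict(X)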