K-means algorithm
The K-means algorithm is a clustering technique used to partition a set of data points into K clusters based on their similarity. It is an unsupervised learning algorithm, which means it does not require any labelled data to train. It can be applied in various fields, such as image segmentation, text clustering, and market segmentation.
The algorithm works by iteratively assigning each point to the cluster whose center is closest to it, and then updating the center of each cluster to the average of the points assigned to it. The algorithm has converged when the assignment of points to clusters no longer changes; in practice it also stops if a maximum number of iterations is reached.
Here are the steps involved in the K-means algorithm:
1. Initialize K centroids: The first step in the K-means algorithm is to randomly select K data points from the dataset as centroids. These centroids serve as the initial cluster centers.
2. Assign data points to clusters: Next, each data point is assigned to the closest centroid based on the Euclidean distance between the data point and the centroid. This step creates K clusters.
3. Recalculate centroids: Once all data points are assigned to the closest centroid, the next step is to recalculate the centroid of each cluster as the mean of all the points assigned to that cluster. This step moves the centroid to the center of the cluster.
4. Repeat steps 2 and 3: Steps 2 and 3 are repeated until there is no change in the cluster assignment of the data points, or the maximum number of iterations is reached.
5. Optimize K: To determine a suitable number of clusters, the Elbow method can be used. It involves calculating the sum of squared distances between each data point and its assigned centroid and plotting this value for different values of K. A good value of K is at the point where the curve bends (the "elbow"), after which increasing K produces only small further decreases in the sum of squared distances.
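Steps 1 to 4 above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the scikit-learn implementation used later in this article: the function name kmeans_sketch, its parameters, and the synthetic two-blob data are all assumptions made for the example.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, seed=0):
    """Illustrative K-means (steps 1 to 4); the name and signature are hypothetical."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 4: stop once the assignments no longer change.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Synthetic data (assumed for illustration): two well-separated blobs of 50 points.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])
labels, centroids = kmeans_sketch(X, k=2)
print(centroids)
```

With blobs this far apart, the two clusters should line up with the two blobs regardless of which points are drawn as initial centroids; on less clearly separated data the result can depend on the initialization.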
Some key points to keep in mind while using K-means algorithms:
Overall, the K-means algorithm is a powerful tool for clustering data points and finding meaningful patterns in the data. It is widely used in many fields due to its simplicity and effectiveness.
Implementation of K-Means Algorithm in Python (Example)
Step 1: Import the necessary libraries
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler
from matplotlib import pyplot as plt
%matplotlib inline
Step 2: Import data from the CSV file (or construct it from a dictionary)
df = pd.read_csv('driver_data.csv') # Note: Make sure the path is correct here.
df.head() # .head() gives us the first five rows of the dataset
Step 3: We draw a scatter plot of the dataset using pyplot.
plt.scatter(df.mean_dist_day, df.mean_over_speed_perc, s=20)
Step 4: Use the Elbow Method to find the optimal number of clusters.
The elbow method is a popular technique for choosing the number of clusters in a clustering algorithm. It is based on the idea that as the number of clusters increases, the within-cluster sum of squares (WCSS) decreases because the points in each cluster are closer to their centroid. However, at some point, adding more clusters no longer leads to a significant decrease in the WCSS, and the curve of WCSS versus number of clusters starts to flatten out. This point is often referred to as the "elbow" of the curve, and it is taken as a reasonable choice for the number of clusters.
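To make the WCSS concrete, here is a small hand-checkable sketch. The helper name wcss and the four toy points are assumptions made for illustration; the quantity computed is the same sum of squared distances that scikit-learn reports as inertia_ after fitting.

```python
import numpy as np

def wcss(X, labels, centroids):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    diffs = X - centroids[labels]
    return float((diffs ** 2).sum())

# Toy data (assumed): two tight pairs of points that are far apart.
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

# K = 1: a single centroid at the overall mean (5, 1).
wcss_k1 = wcss(X, np.array([0, 0, 0, 0]), np.array([[5.0, 1.0]]))
# K = 2: one centroid per pair, at (0, 1) and (10, 1).
wcss_k2 = wcss(X, np.array([0, 0, 1, 1]), np.array([[0.0, 1.0], [10.0, 1.0]]))

print(wcss_k1, wcss_k2)  # 104.0 4.0
```

Going from one cluster to two cuts the WCSS from 104 to 4. Any further split can reduce it by at most 4, which is the flattening that the elbow plot makes visible.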
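k_rng = range(1, 10)  # candidate values of K: 1 to 9
sse = []  # sum of squared errors (WCSS) for each K
for k in k_rng:
    km = KMeans(n_clusters=k)
    km.fit(df[['mean_dist_day', 'mean_over_speed_perc']])
    sse.append(km.inertia_)  # inertia_ holds the WCSS of the fitted model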
We plot the SSE using pyplot
plt.plot(k_rng, sse)
Step 5: Set K = 2 and perform the prediction
km = KMeans(n_clusters=2)
y_pred = km.fit_predict(df[['mean_dist_day', 'mean_over_speed_perc']])
We add a cluster column to the DataFrame.
df['cluster'] = y_pred
If we want, we can also find the centroids and plot them on the diagram using .cluster_centers_
Now, we plot the diagram.
df1 = df[df.cluster==0]
df2 = df[df.cluster==1]
plt.scatter(df1['mean_dist_day'], df1['mean_over_speed_perc'], c='red', s=10, label='mdd1')
plt.scatter(df2['mean_dist_day'], df2['mean_over_speed_perc'], c='blue', s=10, label='mdd2')
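plt.scatter(km.cluster_centers_[:,0], km.cluster_centers_[:,1], color='gold', marker='*', label='centroid')
plt.xlabel('mean_dist_day')
plt.ylabel('mean_over_speed_perc')
plt.legend()  # show the labels passed to the scatter calls above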
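MinMaxScaler is imported in Step 1 but not used in the steps above. Because K-means relies on Euclidean distance, a feature with a much larger numeric range can dominate the clustering, so one option is to rescale both columns to [0, 1] before fitting. The sketch below assumes this is the intended use of the import. The values in the demo DataFrame are made up for illustration and are not taken from driver_data.csv.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Made-up values for illustration only; not taken from driver_data.csv.
demo = pd.DataFrame({
    'mean_dist_day': [30.0, 45.0, 50.0, 170.0, 180.0, 200.0],
    'mean_over_speed_perc': [5.0, 8.0, 4.0, 20.0, 25.0, 30.0],
})
cols = ['mean_dist_day', 'mean_over_speed_perc']

# Rescale each column to the range [0, 1] so both features weigh equally.
scaler = MinMaxScaler()
scaled = scaler.fit_transform(demo[cols])

# Cluster on the scaled features; fixed settings keep the run reproducible.
km_scaled = KMeans(n_clusters=2, n_init=10, random_state=0)
demo['cluster'] = km_scaled.fit_predict(scaled)

# Map the centroids back to the original units for plotting or reporting.
centers = scaler.inverse_transform(km_scaled.cluster_centers_)
print(demo)
print(centers)
```

Whether scaling changes the result on the real dataset depends on how different the ranges of the two columns are, so it is worth comparing the clusters with and without it.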