K-Means Clustering Algorithm Overview

K-means algorithm

The K-means algorithm is a clustering technique used to partition a set of data points into K clusters based on their similarity. It is an unsupervised learning algorithm, meaning it does not require any labelled data to train. K-means can be applied in many fields, such as image segmentation, text clustering, and market segmentation.

The algorithm works by iteratively assigning each point to the cluster whose center is closest to it, and then updating the center of each cluster based on the average of the points assigned to it. The algorithm converges when the assignment of points to clusters does not change anymore, or a maximum number of iterations is reached.

Here are the steps involved in the K-means algorithm:

1. Initialize K centroids: The first step in the K-means algorithm is to randomly select K data points from the dataset as centroids. These centroids serve as the initial cluster centers.

2. Assign data points to clusters: Next, each data point is assigned to the closest centroid based on the Euclidean distance between the data point and the centroid. This step creates K clusters.

3. Recalculate centroids: Once all data points are assigned to the closest centroid, the next step is to recalculate the centroid of each cluster as the mean of all the points assigned to that cluster. This moves the centroid to the center of the cluster.

4. Repeat steps 2 and 3: Steps 2 and 3 are repeated until there is no change in the cluster assignment of the data points, or the maximum number of iterations is reached.

5. Optimize K: To determine the optimal number of clusters, the Elbow method can be used: calculate the sum of squared distances between each data point and its assigned centroid, and plot this value for different values of K. The optimal K is where the curve bends, i.e., where increasing K further no longer produces a significant decrease in the sum of squared distances.
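Steps 1 through 4 can be sketched in a few lines of NumPy. This is a minimal illustration of the idea, not the scikit-learn implementation used later in this article:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal K-means sketch: X is an (n, d) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly pick k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        # Step 4: stop once the centroids (and hence the assignments) stop moving
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

On two well-separated groups of points, this sketch recovers the groups and returns one centroid near the middle of each.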

Some key points to keep in mind while using the K-means algorithm:

  • The algorithm is sensitive to the initial placement of centroids, so multiple runs with different initializations may be necessary to get the best results.
  • The algorithm may converge to local optima, meaning it may not find the globally optimal solution.
  • The algorithm assumes that the clusters are spherical and have equal variance, which may not be true for all datasets.
  • The algorithm can be computationally expensive for large datasets, as it requires calculating distances between all data points and centroids.
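Because of the sensitivity to initialization, scikit-learn's KMeans can re-run the algorithm from several different random centroid seeds and keep the best run. A short sketch on synthetic data (the array X here is made up purely for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D data, purely for illustration
X = np.random.default_rng(0).random((100, 2))

# n_init=10 runs the whole algorithm 10 times from different random
# initializations and keeps the run with the lowest inertia (the sum of
# squared distances); random_state makes the result reproducible.
km = KMeans(n_clusters=3, n_init=10, random_state=42)
km.fit(X)
print(km.inertia_)  # SSE of the best of the 10 runs
```

This does not guarantee the globally optimal clustering, but it makes a bad local optimum much less likely.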

Overall, the K-means algorithm is a powerful tool for clustering data points and finding meaningful patterns in data. It is widely used in many fields due to its simplicity and effectiveness.

Implementation of K-Means Algorithm in Python (Example)

Step 1: Import the necessary libraries

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler
from matplotlib import pyplot as plt
%matplotlib inline

Step 2: Import data from the CSV file (or construct data from a dictionary)

df = pd.read_csv('driver_data.csv') # Note: Make sure the path is correct here. 
df.head() # .head() gives us the first five rows of the dataset        

Step 3: We draw a scatter plot of the dataset using pyplot.

plt.scatter(df.mean_dist_day, df.mean_over_speed_perc, s=20)        
From the scatter plot, we can see the data is roughly divided into two clusters. But how do we verify it?

Step 4: Use the Elbow Method to find the optimal number of clusters.

The elbow method is a popular technique for determining the optimal number of clusters in a clustering algorithm. It is based on the idea that as the number of clusters increases, the within-cluster sum of squares (WCSS) will decrease, because the points in each cluster will be closer together. However, at some point, adding more clusters will not lead to a significant decrease in the WCSS, and the curve of WCSS vs. number of clusters will start to flatten out. This point is often referred to as the "elbow" of the curve, and it represents the optimal number of clusters.

k_rng = range(1, 10)
sse = []
for k in k_rng:
    km = KMeans(n_clusters=k)
    km.fit(df[['mean_dist_day', 'mean_over_speed_perc']])
    sse.append(km.inertia_)
sse
These are the SSEs (Sum of Squared Errors)

We plot the SSE using pyplot

plt.xlabel('K')
plt.ylabel('SSE')
plt.plot(k_rng, sse)        
Now, we can clearly see that the "elbow" happens when there are 2 clusters. (i.e., We should set K=2.)

Step 5: Set K = 2 and perform the prediction

km = KMeans(n_clusters=2) 
km        
Set the number of clusters to 2.
y_pred = km.fit_predict(df[['mean_dist_day', 'mean_over_speed_perc']])
y_pred        

We are adding a cluster column to the DataFrame.

df['cluster'] = y_pred
df        
A new column "cluster" is added.

If we want, we can also find the centroids and plot them on the diagram using .cluster_centers_

km.cluster_centers_        
We have two clusters and thus have two centroids.

Now, we plot the diagram.

df1 = df[df.cluster==0]
df2 = df[df.cluster==1]
plt.scatter(df1['mean_dist_day'], df1['mean_over_speed_perc'], c='red', s=10, label='mdd1')
plt.scatter(df2['mean_dist_day'], df2['mean_over_speed_perc'], c='blue', s=10, label='mdd2')
plt.scatter(km.cluster_centers_[:,0], km.cluster_centers_[:,1], color='gold', marker='*', label='centroid')

plt.xlabel('mean_dist_day')
plt.ylabel('mean_over_speed_perc')
plt.legend()        
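One refinement worth noting: the two features are on very different scales, which can bias the Euclidean distances toward the larger one. The MinMaxScaler imported in Step 1 (but not used above) rescales each column to the [0, 1] range before clustering. A sketch with a small made-up DataFrame standing in for driver_data.csv:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Hypothetical rows standing in for driver_data.csv
df = pd.DataFrame({'mean_dist_day': [50.0, 55.0, 180.0, 175.0],
                   'mean_over_speed_perc': [5.0, 8.0, 60.0, 65.0]})

scaler = MinMaxScaler()  # rescales each column to the [0, 1] range
X = scaler.fit_transform(df[['mean_dist_day', 'mean_over_speed_perc']])

km = KMeans(n_clusters=2, n_init=10, random_state=0)
df['cluster'] = km.fit_predict(X)
print(df)
```

After scaling, both features contribute comparably to the distance, so neither dominates the cluster assignment.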
