K-Means Clustering Algorithm Overview

K-means algorithm

The K-means algorithm is a clustering technique used to partition a set of data points into K clusters based on their similarity. It is an unsupervised learning algorithm, meaning it does not require any labelled data to train. K-means can be applied in many fields, such as image segmentation, text clustering, and market segmentation.

The algorithm works by iteratively assigning each point to the cluster whose center is closest to it, and then updating the center of each cluster based on the average of the points assigned to it. The algorithm converges when the assignment of points to clusters does not change anymore, or a maximum number of iterations is reached.

Here are the steps involved in the K-means algorithm:

1. Initialize K centroids: The first step in the K-means algorithm is to randomly select K data points from the dataset as centroids. These centroids serve as the initial cluster centers.

2. Assign data points to clusters: Next, each data point is assigned to the closest centroid based on the Euclidean distance between the data point and the centroid. This step creates K clusters.

3. Recalculate centroids: Once all data points are assigned to the closest centroid, the next step is to recalculate the centroid of each cluster as the mean of all the points assigned to that cluster. This moves the centroid to the center of the cluster.

4. Repeat steps 2 and 3: Steps 2 and 3 are repeated until there is no change in the cluster assignment of the data points, or the maximum number of iterations is reached.

5. Optimize K: To determine the optimal number of clusters, the Elbow method can be used: calculate the sum of squared distances between each data point and its assigned centroid, and plot this value for different values of K. The optimal K is where the curve bends, i.e., where increasing K further no longer produces a significant decrease in the sum of squared distances.
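Steps 1 through 4 can be sketched in a few lines of NumPy. This is a minimal illustration of the idea, not the scikit-learn implementation used later in this article:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal K-means sketch: X is an (n, d) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly pick k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        # Step 4: stop once the centroids (and hence the assignments) stop moving
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

On two well-separated groups of points, this sketch recovers the groups and returns one centroid near the middle of each.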

Some key points to keep in mind while using the K-means algorithm:

  • The algorithm is sensitive to the initial placement of centroids, so multiple runs with different initializations may be necessary to get the best results.
  • The algorithm may converge to local optima, meaning it may not find the globally optimal solution.
  • The algorithm assumes that the clusters are spherical and have equal variance, which may not be true for all datasets.
  • The algorithm can be computationally expensive for large datasets, as it requires calculating distances between all data points and centroids.
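Because of the sensitivity to initialization, scikit-learn's KMeans can re-run the algorithm from several different random centroid seeds and keep the best run. A short sketch on synthetic data (the array X here is made up purely for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D data, purely for illustration
X = np.random.default_rng(0).random((100, 2))

# n_init=10 runs the whole algorithm 10 times from different random
# initializations and keeps the run with the lowest inertia (the sum of
# squared distances); random_state makes the result reproducible.
km = KMeans(n_clusters=3, n_init=10, random_state=42)
km.fit(X)
print(km.inertia_)  # SSE of the best of the 10 runs
```

This does not guarantee the globally optimal clustering, but it makes a bad local optimum much less likely.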

Overall, the K-means algorithm is a powerful tool for clustering data points and finding meaningful patterns in data. It is widely used in many fields due to its simplicity and effectiveness.

Implementation of K-Means Algorithm in Python (Example)

Step 1: Import the necessary libraries

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler
from matplotlib import pyplot as plt
%matplotlib inline

Step 2: Import data from the CSV file (or construct data from a dictionary)

df = pd.read_csv('driver_data.csv') # Note: Make sure the path is correct here. 
df.head() # .head() gives us the first five rows of the dataset        

Step 3: We draw a scatter plot of the dataset using pyplot.

plt.scatter(df.mean_dist_day, df.mean_over_speed_perc, s=20)        
From the scatter plot, we can see the data is roughly divided into two clusters. But how do we verify it?

Step 4: Use the Elbow Method to find the optimal number of clusters.

The elbow method is a popular technique for determining the optimal number of clusters in a clustering algorithm. It is based on the idea that as the number of clusters increases, the within-cluster sum of squares (WCSS) will decrease, because the points in each cluster will be closer together. However, at some point, adding more clusters will not lead to a significant decrease in the WCSS, and the curve of WCSS vs. number of clusters will start to flatten out. This point is often referred to as the "elbow" of the curve, and it represents the optimal number of clusters.

k_rng = range(1, 10)
sse = []
for k in k_rng:
    km = KMeans(n_clusters=k)
    km.fit(df[['mean_dist_day', 'mean_over_speed_perc']])
    sse.append(km.inertia_)
sse
These are the SSEs (Sum of Squared Errors)

We plot the SSE using pyplot

plt.xlabel('K')
plt.ylabel('SSE')
plt.plot(k_rng, sse)        
Now, we can clearly see that the "elbow" happens when there are 2 clusters. (i.e., We should set K=2.)

Step 5: Set K = 2 and perform the prediction

km = KMeans(n_clusters=2) 
km        
Set the number of clusters to 2.
y_pred = km.fit_predict(df[['mean_dist_day', 'mean_over_speed_perc']])
y_pred        

We are adding a cluster column to the DataFrame.

df['cluster'] = y_pred
df        
A new column "cluster" is added.

If we want, we can also find the centroids and plot them on the diagram using .cluster_centers_

km.cluster_centers_        
We have two clusters and thus have two centroids.

Now, we plot the diagram.

df1 = df[df.cluster==0]
df2 = df[df.cluster==1]
plt.scatter(df1['mean_dist_day'], df1['mean_over_speed_perc'], c='red', s=10, label='mdd1')
plt.scatter(df2['mean_dist_day'], df2['mean_over_speed_perc'], c='blue', s=10, label='mdd2')
plt.scatter(km.cluster_centers_[:,0], km.cluster_centers_[:,1], color='gold', marker='*', label='centroid')

plt.xlabel('mean_dist_day')
plt.ylabel('mean_over_speed_perc')
plt.legend()        
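One refinement worth noting: the two features are on very different scales, which can bias the Euclidean distances toward the larger one. The MinMaxScaler imported in Step 1 (but not used above) rescales each column to the [0, 1] range before clustering. A sketch with a small made-up DataFrame standing in for driver_data.csv:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Hypothetical rows standing in for driver_data.csv
df = pd.DataFrame({'mean_dist_day': [50.0, 55.0, 180.0, 175.0],
                   'mean_over_speed_perc': [5.0, 8.0, 60.0, 65.0]})

scaler = MinMaxScaler()  # rescales each column to the [0, 1] range
X = scaler.fit_transform(df[['mean_dist_day', 'mean_over_speed_perc']])

km = KMeans(n_clusters=2, n_init=10, random_state=0)
df['cluster'] = km.fit_predict(X)
print(df)
```

After scaling, both features contribute comparably to the distance, so neither dominates the cluster assignment.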
