K-Means Clustering: An Unsupervised Machine Learning Algorithm
Khushboo Jain
SQL-savvy Data Analyst with Excel expertise at Eli Lilly, crunching numbers and unlocking insights to drive business growth.
K-Means is an unsupervised learning algorithm used to solve clustering problems. This article covers what clustering is, what K-Means clustering is, how it works, how to choose an appropriate value of K, and the advantages and disadvantages of the K-Means clustering algorithm.
Let's get started.
What is clustering?
Clustering is the process of dividing data into groups such that points in the same group are more similar to each other than to points in other groups. In short, it means grouping things that have similar properties.
For example, consider an inter-school quiz competition in which students from different classes take part. How are we going to differentiate between these students?
We will group them according to the class in which they are studying: students in the same class will be in one group, and students in another class will be in a different group. For instance, a group of four students studying in Class 11 forms group 1, and similarly, the students studying in Class 12 form group 2.
The students within group 1 have similar properties, as do the students within group 2, but groups 1 and 2 do not share similar properties with each other.
This is what clustering is: a grouping of similar things into one.
Now let's discuss the properties of clusters:
1. Data points belonging to the same cluster should be similar to each other.
For example: almonds, apricots, and peaches all belong to the family Rosaceae, as they are similar to each other.
2. The data points from different clusters should be as different as possible.
Which of these cases do you think will give us the better clusters?
If you look at Case 1, the points in the blue-grey cluster are completely different from the points in the dark blue-grey cluster; this case gives us the better clustering of the data points.
Whereas, if you look at Case 2, the data points in the blue-grey and dark blue-grey clusters are quite similar to each other: some points in the blue-grey cluster share properties with the dark blue-grey cluster.
Data points from different clusters should be as different from each other as possible to produce more meaningful clusters.
So far, we have understood what clustering is and the different properties of clusters.
K-means Clustering Algorithm:
It groups the unlabeled data into different clusters.
The main goal of the K-Means clustering algorithm is to minimize the sum of distances between the points and their respective cluster centroids, assigning each data point to its closest of the K centres. The points near a particular centre together form a cluster.
It is an iterative algorithm that divides the unlabeled dataset into K clusters so that each data point belongs to only one group of similar points.
It is a centroid-based algorithm, where each cluster is associated with a centroid.
Here K defines the number of pre-defined clusters that need to be created. If we have K=2, then there will be 2 clusters or 2 different categories, if K=3, there will be 3 clusters.
The value of K is predetermined in this algorithm.
This algorithm takes an unlabeled dataset as input, divides it into K clusters, and repeats the process until it finds the best clusters.
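Before walking through the steps, here is a minimal sketch of the algorithm in practice, assuming scikit-learn and NumPy are installed; the toy dataset and parameter values below are illustrative, not from this article:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy unlabeled dataset: two loose groups of 2-D points.
X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],
              [8.0, 8.0], [8.5, 7.8], [7.8, 8.3]])

# K = 2 is fixed up front; the algorithm finds the two clusters itself.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(kmeans.labels_)           # cluster index assigned to each point
print(kmeans.cluster_centers_)  # final centroid of each cluster
```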
Working of K-Means Algorithm:
Step 1: Choose K points, randomly or from the dataset, as the initial centroids.
Step 2: Assign each data point to its nearest centroid, which will form the clusters.
Step 3: Calculate the new centroid of each cluster.
Step 4: Repeat step 2, i.e., reassign each data point to the nearest of the new centroids.
Step 5: Repeat this process until the centroids stop changing or the data points stay in the same clusters.
Step 6: Once the stopping condition is met, stop the algorithm; the model is ready.
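The six steps above can be sketched directly in NumPy. This is an illustrative, unoptimized implementation that assumes Euclidean distance; the toy dataset at the bottom is made up for the example:

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Plain NumPy sketch of the steps above (illustrative, not optimized)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick K initial centroids from the dataset itself.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2: assign each point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points
        # (keep the old centroid if a cluster ends up empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two obvious groups of 2-D points.
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
labels, centroids = kmeans(X, k=2)
```

Steps 4 and 6 are folded into the loop: each pass reassigns the points, and the loop exits as soon as the centroids stop changing.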
Let us understand this with an example:
Suppose we have two variables A and B. The XY Axis scatter plot of these two variables is given below:
1. Initially, take K = 2. This means we will try to group this dataset into two different clusters.
2. Now choose K random points to form the clusters. These points can be either from the dataset or any other points.
We are taking the points that belong to the data set, consider the below image:
3. Now assign each data point to its closest K-point, or centroid. We do this using the Euclidean distance formula, which gives the shortest distance between two points.
4. Now compute the new centroids or K-point by calculating the mean of data points in a cluster.
5. Reassign each data point to the new centroid. For this, again calculate the shortest distance between two points.
We can see that two Blue-Gray points and one Dark Blue-Gray point lie closer to the other cluster's centroid than to their own. Assign them to the new centroids.
Now go back to step 4 and repeat the process of finding new centroid points.
6. Once we have the new centroid points, reassign the data points again.
After multiple iterations, we'll see that no data points change clusters, which means our model is ready:
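The reassignment in each pass boils down to comparing Euclidean distances. A tiny sketch with made-up numbers:

```python
import numpy as np

# One data point and two hypothetical centroids (K = 2).
point = np.array([2.0, 3.0])
centroids = np.array([[1.0, 2.0], [8.0, 9.0]])

# Euclidean distance from the point to each centroid.
distances = np.sqrt(((point - centroids) ** 2).sum(axis=1))
nearest = distances.argmin()  # the point joins this centroid's cluster
```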
Stopping Condition for K-Means clustering:
There are three conditions; if any one of them is met, the K-Means clustering algorithm stops.
1. The centroids of the newly formed clusters remain the same as in the previous step.
2. The data points in each cluster do not change.
3. The maximum number of iterations is reached.
We can stop the algorithm when the new centroids do not change. If, after multiple iterations, the centroids of all clusters remain the same, the algorithm is no longer learning any new pattern, and training can stop.
Another way to stop the training is when all the data points remain in the same clusters after successive iterations.
Finally, we can stop the training when the maximum number of iterations is reached. For example, if the maximum is set to 10 iterations and neither of the above two conditions is met within them, we stop the algorithm after the 10th iteration.
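In scikit-learn's `KMeans` (assuming that library), the last two stopping conditions map onto the `max_iter` and `tol` parameters, and `n_iter_` reports how many iterations actually ran before a stopping condition was met; the data here is made up:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])

# max_iter caps the number of iterations; tol stops the run early once the
# centroids move less than the given threshold between iterations.
km = KMeans(n_clusters=2, max_iter=10, tol=1e-4, n_init=10, random_state=0).fit(X)
print(km.n_iter_)  # iterations run before a stopping condition was met
```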
Choosing the best value of K:
The performance of the K-Means clustering algorithm depends on how good the clusters it forms are. Since the value of K must be chosen in advance, it can be difficult to pick the right number of clusters.
To find the optimal value of K, we use the elbow method.
To find the optimal value of clusters, the elbow method follows the below steps:
1. Run K-Means clustering on the given dataset for different values of K (typically ranging from 1 to 10).
2. For every value of K, calculate the WCSS (Within-Cluster Sum of Squares) value.
3. Plot a curve between the calculated WCSS values and the number of clusters K.
4. The point where the curve bends sharply, like an arm's elbow, is taken as the best value of K.
If we look at the graph carefully, after the elbow point the WCSS value decreases very slowly, so the elbow point should be taken as the final number of clusters.
Since the graph shows a sharp bend that looks like an elbow, the technique is known as the elbow method. The graph for the elbow method looks like the image below:
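The elbow method can be sketched as follows, assuming scikit-learn, where the fitted model's `inertia_` attribute is exactly the WCSS; the three synthetic blobs are made up so that the elbow should appear near K = 3:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three synthetic, well-separated blobs of 2-D points.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 2))
               for c in ((0, 0), (4, 4), (0, 4))])

# Steps 1-2: run K-Means for K = 1..10 and record the WCSS each time.
wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)  # inertia_ is the WCSS for this K

# Steps 3-4 would plot wcss against K and read off the sharp bend.
```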
Advantage of K-Means Algorithm:
1. The K-Means algorithm produces tighter clusters than hierarchical clustering.
2. It works well for large-scale data.
3. It is easy to implement compared to other, more complex clustering techniques.
Disadvantages of K-Means Algorithm:
1. Its efficiency depends on the value of K, which is difficult to choose in advance.
2. If the classes have complex geometric shapes, K-Means does a poor job of clustering the data.
3. The K-Means algorithm does not let data points that are far away from each other share a cluster, even when they actually belong to the same group.
Conclusion:
In this article, we understood one of the most famous clustering algorithms: K-Means. We discussed it from scratch and walked through its working step by step. We also saw how to choose an optimal value of K.
If you have any doubts or feedback, feel free to share them in the comments section below.