Understanding the basics of Data Clustering
Clustering
Clustering is the task of dividing a population of data points into groups such that data points in the same group are more similar to each other than to data points in other groups. In simple words, the aim is to segregate groups with similar traits and assign them into clusters.
In terms of data analytics, when we plot the data points on an X-Y plane, the distance between two points can be computed easily using the formula

Distance = √(ΔX² + ΔY²)

Here, ΔX and ΔY are the differences of the X and Y coordinates of the two points.
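The formula above can be computed directly in Python; the two points here are arbitrary values chosen for illustration:

```python
import math

# Two points on the X-Y plane (illustrative values).
p1 = (1.0, 2.0)
p2 = (4.0, 6.0)

dx = p2[0] - p1[0]  # ΔX
dy = p2[1] - p1[1]  # ΔY

# Distance = √(ΔX² + ΔY²)
distance = math.sqrt(dx**2 + dy**2)
print(distance)  # 5.0
```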
Now, clustering is a method of segregating the data points in such a way that the distance between the data points within one group is smaller than their distance to points in other groups. Data points that are close to each other form a group, whereas data points at a relatively greater distance are not included in that group.
Here, the distance between the data points inside one group is called the intra-cluster distance, whereas the distance between the data points of two different groups is called the inter-cluster distance.
As we can see, there is no fixed shape in which a group must be defined: a group of data points can be of any shape and size.
The clustering technique is mainly used in unsupervised learning, where we do not have labelled data. We need to draw references from datasets consisting of input data without labelled responses. Generally, it is used as a process to find meaningful structure, underlying explanatory processes, generative features, and groupings inherent in a set of data points.
Clustering Methods
1. Density-Based Methods: These methods treat clusters as dense regions of the space that are similar internally and distinct from the surrounding lower-density regions. They have good accuracy and the ability to merge two clusters.
2. Hierarchical-Based Methods: The clusters formed by these methods make up a tree-type structure based on hierarchy, with new clusters formed from previously formed ones. They are divided into two categories:
· Agglomerative (bottom-up approach).
· Divisive (top-down approach).
3. Partitioning Methods: These methods partition the objects into k clusters, with each partition forming one cluster. They optimize an objective criterion (similarity function), typically one in which distance is the major parameter; examples include K-Means and CLARANS (Clustering Large Applications based upon RANdomized Search). This approach is also referred to as centroid-based clustering, as it organizes the data in a non-hierarchical manner.
4. Distribution-based Methods: This clustering approach assumes the data is composed of distributions, such as Gaussian distributions. For example, a distribution-based algorithm might cluster the data into three Gaussian distributions. As the distance from a distribution's center increases, the probability that a point belongs to that distribution decreases.
5. Fuzzy Clustering: In this type of clustering, data points can belong to more than one cluster. Each data point has a membership coefficient for each cluster that corresponds to its degree of belonging to that cluster. Fuzzy clustering is also known as soft clustering.
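As an illustration of the distribution-based approach above, the following sketch fits a Gaussian mixture with scikit-learn; the two-blob data is synthetic and chosen purely for demonstration:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic data: two well-separated Gaussian blobs (illustrative assumption).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])

# Fit two Gaussian components to the data.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

labels = gmm.predict(X)         # hard assignment: most likely component
probs = gmm.predict_proba(X)    # membership probability per component,
                                # which falls off with distance from each center
```

Note that `predict_proba` exposes exactly the soft, probability-based membership described above, in contrast to the hard assignments of K-Means.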
K-Means Clustering
There are multiple ways to cluster data, but the K-Means algorithm is the most commonly used one. It tries to maximize the similarity within each group while keeping the groups as far apart from each other as possible.
Basically, K-Means runs on distance calculations, using the "Euclidean distance" for this purpose. The Euclidean distance between two given points (x₁, y₁) and (x₂, y₂) is calculated using the following formula:

Euclidean Distance = √((x₂ − x₁)² + (y₂ − y₁)²)
The formula above captures the distance in two-dimensional space, but the same applies in multi-dimensional space as well, with one additional squared term added per extra dimension. The "K" in K-Means represents the number of clusters into which we want our data divided. The basic restriction of the K-Means algorithm is that your data should be continuous in nature; it won't work if the data is categorical.
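The multi-dimensional generalisation can be seen directly with NumPy, where one squared-difference term is summed per coordinate (the points below are arbitrary example values):

```python
import numpy as np

# Two points in 3-dimensional space (illustrative values).
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

# One squared-difference term per dimension, summed, then square-rooted.
distance = np.sqrt(np.sum((a - b) ** 2))

# Equivalent shortcut using NumPy's built-in norm.
same = np.linalg.norm(a - b)
print(distance)  # 5.0
```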
Algorithm
K-Means is an iterative clustering process, which keeps iterating until it converges on a solution (a set of clusters) for our problem space. The following steps describe how K-Means is generally used to cluster data:
1. Start with the number of clusters we want, e.g. 3 in this case. The K-Means algorithm starts the process with random centers in the data and then tries to attach the nearest points to these centers.
2. The algorithm then moves the randomly allocated centers to the means of the created groups.
3. In the next step, the data points are reassigned to these newly created centers.
4. Steps 2 and 3 are repeated until no member changes its group.
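The steps above can be sketched in a minimal from-scratch implementation; this is illustrative only (it ignores edge cases such as empty clusters) and the demo data is synthetic:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-Means sketch following the steps above.

    X: (n_samples, n_features) array; k: number of clusters.
    Returns (centers, labels).
    """
    rng = np.random.default_rng(seed)
    # Step 1: pick k random data points as the initial centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Attach each point to its nearest center (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: move each center to the mean of its assigned group.
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centers (and hence assignments) stop changing.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Demo on two synthetic, well-separated blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)),
               rng.normal(10.0, 0.1, (20, 2))])
centers, labels = kmeans(X, k=2)
```

On this data the loop converges in a couple of iterations, with each blob assigned to its own cluster.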
Advantages
· Fast, robust, and easy to understand.
· Relatively simple to implement.
· Scales to large data sets.
· Guarantees convergence.
· Can warm-start the positions of centroids.
· Easily adapts to new examples.
· Generalizes (with modifications) to clusters of different shapes and sizes, such as elliptical clusters.
Disadvantages
· The learning algorithm requires a prior specification of the number of cluster centers.
· The use of exclusive assignment: if two clusters overlap heavily, K-Means will not be able to resolve that there are two clusters.
· The learning algorithm is not invariant to non-linear transformations, i.e. different representations of the data give different results (data represented in Cartesian coordinates and in polar coordinates will give different results).
· Euclidean distance measures can unequally weight underlying factors.
· The learning algorithm converges to a local optimum of the squared-error function.
· A poor random choice of the initial cluster centers may not lead to a fruitful result.
· Applicable only when the mean is defined, i.e. it fails for categorical data.
· Unable to handle noisy data and outliers.
· The algorithm fails for non-linearly separable data sets.
Example
Below is a simple Python example showing how K-Means is used. We import KMeans from scikit-learn and apply it to some sample data:
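A minimal sketch, assuming scikit-learn is installed; the two-blob data here is synthetic and stands in for whatever dataset you want to cluster:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: two obvious groups (illustrative assumption).
rng = np.random.default_rng(42)
data = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[8.0, 8.0], scale=0.5, size=(50, 2)),
])

# n_clusters is the "K"; n_init=10 reruns the algorithm with
# different random initial centers and keeps the best result.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)

print(kmeans.cluster_centers_)  # one row per cluster center
print(kmeans.labels_[:5])       # cluster index of the first five points
```

Because K-Means converges only to a local optimum, rerunning with several initialisations (the `n_init` parameter) is a common way to mitigate the sensitivity to the random starting centers noted in the disadvantages above.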