Understanding the Basics of Data Clustering


Clustering

Clustering is the task of dividing a population or set of data points into groups such that data points in the same group are more similar to one another than to data points in other groups. In simple words, the aim is to segregate groups with similar traits and assign them into clusters.

In terms of data analytics, when we plot the data points on an X-Y plane, the distance between two points can be computed easily using the formula

distance = √(ΔX² + ΔY²)

Here, ΔX and ΔY are the differences of the X and Y coordinates of the two points.
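As a minimal illustration (plain Python, using only the standard math module):

```python
import math

# Two points on the X-Y plane
p1 = (1.0, 2.0)
p2 = (4.0, 6.0)

# Differences of the X and Y coordinates
dx = p2[0] - p1[0]
dy = p2[1] - p1[1]

# Distance = sqrt(dX^2 + dY^2)
distance = math.sqrt(dx**2 + dy**2)
print(distance)  # 5.0
```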

Now, clustering is a method of segregating the data points in such a way that the distance between data points within one group is smaller than their distance to the data points of other groups. Data points that are close to each other form a group, whereas data points at a relatively greater distance are not included in that group.

Here, the distance between the data points inside one group is called the intra-cluster distance, whereas the distance between the data points of two different groups is called the inter-cluster distance.


Note that there is no fixed shape in which a group must be defined; a group of data points can be of any shape and size.

The clustering technique is mainly used with unsupervised learning, where we do not have labelled data. We need to draw references from datasets consisting of input data without labelled responses. Generally, it is used as a process to find meaningful structure, explanatory underlying processes, generative features, and groupings inherent in a set of given data points.

Clustering Methods

1.    Density-Based Methods: These methods consider clusters to be dense regions of similar points, separated from the rest of the space by regions of lower density. These methods have good accuracy and the ability to merge two clusters. (A short scikit-learn sketch contrasting representatives of several of these method families appears after this list.)


2.    Hierarchical-Based Methods: The clusters formed by these methods form a tree-type structure based on the hierarchy; new clusters are formed using the previously formed ones. It is divided into two categories:

·        Agglomerative (bottom-up approach).

·        Divisive (top-down approach).


3.    Partitioning Methods: These methods partition the objects into k clusters, with each partition forming one cluster. They optimize an objective criterion (a similarity function) in which distance is the major parameter; examples include K-means and CLARANS (Clustering Large Applications based upon RANdomized Search). This family is also referred to as centroid-based clustering, as it organizes the data in the non-hierarchical manner described above.


4.    Distribution-Based Methods: This clustering approach assumes the data is composed of distributions, such as Gaussian distributions. A distribution-based algorithm might, for example, cluster the data into three Gaussian distributions; as the distance from a distribution's center increases, the probability that a point belongs to that distribution decreases.


5.    Fuzzy Clustering: In this type of clustering, data points can belong to more than one cluster. Each data point has a membership coefficient for each cluster that corresponds to its degree of belonging to that cluster. Fuzzy clustering is therefore also known as a soft method of clustering.
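To make these categories concrete, here is a minimal sketch showing how a representative algorithm from several of the families above is invoked; it assumes scikit-learn is installed and uses synthetic blob data. Fuzzy clustering is omitted because it is not part of scikit-learn.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.mixture import GaussianMixture

# Synthetic 2-D data: 300 points around 3 centers
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Density-based: clusters are dense regions separated by sparse ones
db_labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)

# Hierarchical (agglomerative, bottom-up): repeatedly merge the closest clusters
agg_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

# Partitioning (centroid-based): k-means with k = 3
km_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Distribution-based: a mixture of 3 Gaussians
gm_labels = GaussianMixture(n_components=3, random_state=42).fit_predict(X)
```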

K-Means Clustering

There are multiple ways to cluster data, but the K-Means algorithm is the most commonly used one. It tries to maximize the similarity within each group while keeping the groups as far apart from each other as possible.

Basically, K-Means runs on distance calculations, and it uses Euclidean distance for this purpose. The Euclidean distance between two given points (x₁, y₁) and (x₂, y₂) is calculated using the following formula:

Euclidean Distance = √((x₂ − x₁)² + (y₂ − y₁)²)

The above formula captures the distance in two-dimensional space, but the same applies in multi-dimensional space as well, with one additional squared term added per dimension. “K” in K-Means represents the number of clusters into which we want to divide our data. The basic restriction of the K-Means algorithm is that the data must be continuous in nature; it will not work if the data is categorical.
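For instance, the same formula extends to any number of dimensions; a small illustration, assuming NumPy is available:

```python
import numpy as np

# Two points in 4-dimensional space
a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([2.0, 4.0, 6.0, 8.0])

# Euclidean distance: square root of the sum of squared coordinate differences
distance = np.sqrt(np.sum((a - b) ** 2))
print(distance)  # equivalent to np.linalg.norm(a - b)
```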

Algorithm

K-Means is an iterative clustering process that keeps iterating until it reaches the best solution, i.e., the best clusters, in our problem space. The following steps describe how K-Means is generally used to cluster our data:

  • Start with the number of clusters we want, e.g., 3 in this case. The K-Means algorithm starts the process with random centers in the data and then tries to attach the nearest points to these centers.
  • The algorithm then moves the randomly allocated centers to the means of the created groups.
  • In the next step, the data points are reassigned to these newly created centers.
  • Steps 2 & 3 are repeated until no member changes its association/group.
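The steps above can be written out as a short from-scratch sketch. This is a simplified illustration rather than a production implementation: it assumes NumPy, uses a fixed iteration cap as a safeguard, and does not handle empty clusters.

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k random data points as the initial centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2: attach each point to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each center to the mean of its group
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Repeat until no center (and hence no membership) changes
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```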

Advantages

·        Fast, robust, and easy to understand.

·        Relatively simple to implement.

·        Scales to large data sets.

·        Guarantees convergence.

·        Can warm-start the positions of centroids (see the sketch after this list).

·        Easily adapts to new examples.

·        Generalizes to clusters of different shapes and sizes, such as elliptical clusters.
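As an example of the warm-start point: scikit-learn's KMeans accepts an explicit array of starting centroids through its init parameter, so centroids learned on earlier data can seed a fit on new data. A sketch using synthetic data:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Two batches of synthetic data drawn around 3 centers
X_old, _ = make_blobs(n_samples=200, centers=3, random_state=0)
X_new, _ = make_blobs(n_samples=200, centers=3, random_state=1)

# Fit once on the earlier batch
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_old)

# Warm-start on the new batch: reuse the learned centroids as the initial positions
km_warm = KMeans(n_clusters=3, init=km.cluster_centers_, n_init=1).fit(X_new)
```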

Disadvantages

·        The learning algorithm requires a prior specification of the number of cluster centers.

·        The use of exclusive assignment: if two clusters in the data overlap heavily, k-means will not be able to resolve that there are two clusters.

·        The learning algorithm is not invariant to non-linear transformations, i.e., with different representations of the data we get different results (data represented in the form of Cartesian coordinates and polar coordinates will give different results).

·        Euclidean distance measures can unequally weight underlying factors.

·        The learning algorithm provides the local optima of the squared error function. 

·        A random choice of the initial cluster centers may not lead us to a fruitful result.

·        Applicable only when the mean is defined, i.e., it fails for categorical data.

·        Unable to handle noisy data and outliers.

·        The algorithm fails for non-linear data sets (see the sketch below).
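A quick way to see the last point: on the classic two-moons shape, k-means draws a straight boundary and mixes the two curved clusters, while a density-based method recovers them. A small sketch, assuming scikit-learn:

```python
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN

# Two interleaving half-circles: the clusters are non-linear in shape
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

# K-means splits the data along a straight boundary and mixes the two moons
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# A density-based method can follow the curved clusters
db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)
```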

Example

Below is a simple Python example showing how K-Means is used. First, we create and plot some sample data:

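The original code screenshot is unavailable, so the following is a minimal reconstruction; it assumes synthetic two-dimensional data generated with make_blobs and plotted with matplotlib.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Synthetic 2-D data with 3 natural groupings
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# Visualize the raw, unlabelled points
plt.scatter(X[:, 0], X[:, 1], s=20)
plt.show()
```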

Now, we will import KMeans from scikit-learn and apply it to this data:

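Again, the original screenshot is unavailable; this sketch continues the snippet above, fitting k-means with k = 3 and plotting the resulting clusters and centroids.

```python
from sklearn.cluster import KMeans

# Fit k-means with k = 3 and label each point
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

# Plot the clusters and mark the final centroids
plt.scatter(X[:, 0], X[:, 1], c=labels, s=20)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            c="red", marker="x", s=100)
plt.show()
```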


