Introduction to Clustering and Its Use Cases

Overview of Clustering

Clustering is the task of dividing a population or set of data points into groups such that data points in the same group are more similar to one another than to data points in other groups. In simple words, the aim is to segregate groups with similar traits and assign them into clusters.

Types of Clustering

  • Hard Clustering: In hard clustering, each data point either belongs to a cluster completely or not. For example, if a retail store segments its customers into 10 groups, each customer is put into exactly one of the 10 groups.
  • Soft Clustering: In soft clustering, instead of putting each data point into a single cluster, a probability or likelihood of that data point belonging to each cluster is assigned. In the same scenario, each customer is assigned a probability of belonging to each of the retail store’s 10 clusters (see the sketch after this list).
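
A minimal sketch of the two notions using scikit-learn (the three-cluster toy data below is an assumption, standing in for the 10-group retail example):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))  # toy stand-in for customer features

    # Hard clustering: every point receives exactly one label.
    hard_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

    # Soft clustering: every point receives a probability for each cluster.
    gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
    soft_probs = gmm.predict_proba(X)  # shape (200, 3); each row sums to 1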


Types of Clustering Algorithms

  • Connectivity models: As the name suggests, these models are based on the notion that data points closer together in data space exhibit more similarity to each other than data points lying farther away. These models can follow two approaches. In the first approach, they start by classifying all data points into separate clusters and then aggregate them as the distance decreases. In the second approach, all data points are classified as a single cluster and then partitioned as the distance increases. The choice of distance function is subjective. These models are very easy to interpret but lack the scalability to handle big datasets. Examples are the hierarchical clustering algorithm and its variants.
  • Centroid models: These are iterative clustering algorithms in which the notion of similarity is derived from the closeness of a data point to the centroid of a cluster. The K-means clustering algorithm is a popular algorithm in this category. In these models, the number of clusters required at the end has to be specified beforehand, which makes it important to have prior knowledge of the dataset. These models run iteratively to find a local optimum.
  • Distribution models: These clustering models are based on the notion of how probable it is that all data points in a cluster belong to the same distribution (for example, Gaussian). These models often suffer from overfitting. A popular example is the expectation-maximization algorithm, which uses multivariate normal distributions.
  • Density models: These models search the data space for regions with varying densities of data points. They isolate the different density regions and assign the data points within the same region to the same cluster. Popular examples are DBSCAN and OPTICS. (A scikit-learn implementation of each of the four families is sketched below.)
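
scikit-learn ships an implementation of each of the four families; the sketch below lines them up side by side (the parameter values are illustrative assumptions, not recommendations):

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering, KMeans, DBSCAN
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(42)
    X = rng.normal(size=(300, 2))

    # Connectivity model: merges the closest clusters step by step.
    conn = AgglomerativeClustering(n_clusters=4).fit_predict(X)

    # Centroid model: the number of clusters k must be fixed in advance.
    cent = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)

    # Distribution model: fits a mixture of Gaussians via expectation-maximization.
    dist = GaussianMixture(n_components=4, random_state=42).fit_predict(X)

    # Density model: eps and min_samples define a "dense" region; -1 marks noise.
    dens = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)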

K-means Clustering

K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms.

Typically, unsupervised algorithms make inferences from datasets using only input vectors without referring to known, or labelled, outcomes.

A cluster refers to a collection of data points aggregated together because of certain similarities.

You’ll define a target number k, which refers to the number of centroids you need in the dataset. A centroid is the imaginary or real location representing the center of the cluster.

Every data point is allocated to one of the clusters by minimizing the in-cluster sum of squares.
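
In symbols (the notation here is a standard formulation, not from the original article), the quantity being minimized is the within-cluster sum of squares:

    \mathrm{WCSS} = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2

where C_i is the set of points assigned to cluster i and \mu_i is its centroid.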

To process the learning data, the K-means algorithm in data mining starts with a first group of randomly selected centroids, which are used as the starting points for every cluster, and then performs iterative (repetitive) calculations to optimize the positions of the centroids (a minimal sketch of this loop follows the stopping criteria below).

It halts creating and optimizing clusters when either:

  • The centroids have stabilized — there is no change in their values because the clustering has been successful.
  • The defined number of iterations has been achieved.
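
Putting the loop and both stopping conditions together, here is a minimal NumPy sketch (the function name, tolerance, and initialization scheme are our assumptions, not a reference implementation):

    import numpy as np

    def kmeans(X, k, max_iter=100, tol=1e-6, seed=0):
        # Start from k randomly selected data points as the initial centroids.
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(max_iter):  # stop 2: the iteration budget is exhausted
            # Assign every point to its nearest centroid.
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Move each centroid to the mean of its assigned points
            # (assumes no cluster empties out on the way).
            new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            if np.allclose(new_centroids, centroids, atol=tol):  # stop 1: stabilized
                break
            centroids = new_centroids
        return labels, centroids

For example, kmeans(X, k=3) returns a cluster label for every row of X along with the final centroid positions.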

Hierarchical Clustering

Hierarchical clustering, as the name suggests, is an algorithm that builds a hierarchy of clusters. In its agglomerative (bottom-up) form, the algorithm starts with every data point assigned to a cluster of its own. The two nearest clusters are then merged repeatedly, and the algorithm terminates when only a single cluster is left.
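
Here is a small sketch of this bottom-up procedure using SciPy (the 'ward' linkage and 25-point toy data are assumptions; plotting it produces the kind of dendrogram discussed next):

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import dendrogram, linkage

    rng = np.random.default_rng(0)
    X = rng.normal(size=(25, 2))  # 25 toy points, matching the example below

    # 'ward' merges the pair of clusters whose union increases total
    # within-cluster variance the least; 'single', 'complete', and 'average'
    # are other common notions of distance between clusters.
    Z = linkage(X, method="ward")

    dendrogram(Z)
    plt.show()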

The results of hierarchical clustering can be shown using a dendrogram, which can be interpreted as follows:

(Figure: a dendrogram of 25 data points, merged pairwise from the bottom up.)

At the bottom, we start with 25 data points, each assigned to a separate cluster. The two closest clusters are then merged repeatedly until just one cluster remains at the top. The height at which two clusters are merged in the dendrogram represents the distance between those clusters in the data space.

The number of clusters that best depicts the different groups can be chosen by observing the dendrogram. The best choice is the number of vertical lines in the dendrogram cut by a horizontal line that can traverse the maximum distance vertically without intersecting a cluster.

In the example above, the best choice is four clusters, since the red horizontal line in the dendrogram below covers the maximum vertical distance AB.

(Figure: the same dendrogram with a red horizontal line cutting four vertical lines and spanning the maximum vertical distance AB.)
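
Continuing the SciPy sketch above, that choice can be turned into flat cluster labels by cutting the tree:

    from scipy.cluster.hierarchy import fcluster

    # Cut the linkage tree Z from the sketch above into 4 flat clusters,
    # matching what the horizontal-line heuristic suggests here.
    labels = fcluster(Z, t=4, criterion="maxclust")  # labels take values 1..4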

Differences between K-means and Hierarchical Clustering

  • Hierarchical clustering can’t handle big data well, but K-means can. This is because the time complexity of K-means is roughly linear in the number of points, i.e. O(n), while that of hierarchical clustering is quadratic, i.e. O(n²).
  • In K-means, since we start with a random choice of centroids, the results produced by running the algorithm multiple times may differ, whereas results are reproducible in hierarchical clustering (see the sketch after this list).
  • K-means is found to work well when the shape of the clusters is hyperspherical (like a circle in 2D or a sphere in 3D).
  • K-means requires prior knowledge of K, i.e. the number of clusters you want to divide your data into, but in hierarchical clustering you can stop at whatever number of clusters you find appropriate by interpreting the dendrogram.
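
A small sketch of the reproducibility point (the toy data and seeds are assumptions): different random initializations can land k-means in different local optima, while agglomerative clustering involves no randomness at all.

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.random.default_rng(7).normal(size=(200, 2))

    # A single run per seed (n_init=1) exposes the dependence on initialization.
    for seed in (0, 1, 2):
        km = KMeans(n_clusters=4, n_init=1, random_state=seed).fit(X)
        print(seed, round(km.inertia_, 2))  # inertia_ = in-cluster sum of squares

    # scikit-learn's default of several restarts (n_init > 1) mitigates this
    # by keeping the run with the lowest inertia.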

Applications of Clustering

Clustering has a large number of applications spread across various domains. Some of the most popular applications of clustering are:

  • Recommendation engines
  • Market segmentation
  • Social network analysis
  • Search result grouping
  • Medical imaging
  • Image segmentation
  • Anomaly detection

In this article, we discussed various ways of performing clustering, which finds applications for unsupervised learning in a large number of domains. Clustering can also serve as a preprocessing step that improves the accuracy of supervised machine learning models.

Although clustering is easy to implement, you need to take care of some important aspects, such as treating outliers in your data and making sure each cluster has a sufficient population. A small sketch of these two checks follows.
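
A minimal sketch, assuming standardization as the outlier-mitigation step and an arbitrary 2% size threshold (both are our assumptions, not from the original article):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    X = np.random.default_rng(3).normal(size=(500, 4))

    # Standardizing first keeps one wide-ranging or outlier-prone feature
    # from dominating the distance computations.
    Xs = StandardScaler().fit_transform(X)
    labels = KMeans(n_clusters=5, n_init=10, random_state=3).fit_predict(Xs)

    # Flag clusters whose population is too thin to be useful.
    sizes = np.bincount(labels)
    if sizes.min() < 0.02 * len(X):  # 2% cut-off is an arbitrary assumption
        print("warning: at least one cluster holds under 2% of the points")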
