Introduction to Clustering and Its Use Cases

Overview of Clustering

Clustering is the task of dividing a population or set of data points into groups such that data points in the same group are more similar to one another than to data points in other groups. In simple words, the aim is to segregate groups with similar traits and assign them into clusters.

Types of Clustering

  • Hard Clustering: In hard clustering, each data point either belongs to a cluster completely or not. For example, if a retail store segments its customers into 10 groups, each customer is put into exactly one of the 10 groups.
  • Soft Clustering: In soft clustering, instead of putting each data point into a single cluster, a probability or likelihood of that data point belonging to each cluster is assigned. In the same scenario, each customer is assigned a probability of belonging to each of the retail store’s 10 clusters (see the sketch after this list).
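
A minimal sketch of the two notions using scikit-learn (the three-cluster toy data below is an assumption, standing in for the 10-group retail example):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))  # toy stand-in for customer features

    # Hard clustering: every point receives exactly one label.
    hard_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

    # Soft clustering: every point receives a probability for each cluster.
    gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
    soft_probs = gmm.predict_proba(X)  # shape (200, 3); each row sums to 1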


Types of Clustering Algorithms

  • Connectivity models: As the name suggests, these models are based on the notion that data points closer together in data space exhibit more similarity to each other than data points lying farther away. These models can follow two approaches. In the first approach, they start by classifying all data points into separate clusters and then aggregate them as the distance decreases. In the second approach, all data points are classified as a single cluster and then partitioned as the distance increases. The choice of distance function is subjective. These models are very easy to interpret but lack the scalability to handle big datasets. Examples are the hierarchical clustering algorithm and its variants.
  • Centroid models: These are iterative clustering algorithms in which the notion of similarity is derived from the closeness of a data point to the centroid of a cluster. The K-means clustering algorithm is a popular algorithm in this category. In these models, the number of clusters required at the end has to be specified beforehand, which makes it important to have prior knowledge of the dataset. These models run iteratively to find a local optimum.
  • Distribution models: These clustering models are based on the notion of how probable it is that all data points in a cluster belong to the same distribution (for example, Gaussian). These models often suffer from overfitting. A popular example is the expectation-maximization algorithm, which uses multivariate normal distributions.
  • Density models: These models search the data space for regions with varying densities of data points. They isolate the different density regions and assign the data points within the same region to the same cluster. Popular examples are DBSCAN and OPTICS. (A scikit-learn implementation of each of the four families is sketched below.)
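
scikit-learn ships an implementation of each of the four families; the sketch below lines them up side by side (the parameter values are illustrative assumptions, not recommendations):

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering, KMeans, DBSCAN
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(42)
    X = rng.normal(size=(300, 2))

    # Connectivity model: merges the closest clusters step by step.
    conn = AgglomerativeClustering(n_clusters=4).fit_predict(X)

    # Centroid model: the number of clusters k must be fixed in advance.
    cent = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)

    # Distribution model: fits a mixture of Gaussians via expectation-maximization.
    dist = GaussianMixture(n_components=4, random_state=42).fit_predict(X)

    # Density model: eps and min_samples define a "dense" region; -1 marks noise.
    dens = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)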

K-means Clustering

K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms.

Typically, unsupervised algorithms make inferences from datasets using only input vectors without referring to known, or labelled, outcomes.

A cluster refers to a collection of data points aggregated together because of certain similarities.

You’ll define a target number k, which refers to the number of centroids you need in the dataset. A centroid is the imaginary or real location representing the center of the cluster.

Every data point is allocated to one of the clusters by minimizing the in-cluster sum of squares.
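
In symbols (the notation here is a standard formulation, not from the original article), the quantity being minimized is the within-cluster sum of squares:

    \mathrm{WCSS} = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2

where C_i is the set of points assigned to cluster i and \mu_i is its centroid.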

To process the learning data, the K-means algorithm in data mining starts with a first group of randomly selected centroids, which are used as the starting points for every cluster, and then performs iterative (repetitive) calculations to optimize the positions of the centroids (a minimal sketch of this loop follows the stopping criteria below).

It halts creating and optimizing clusters when either:

  • The centroids have stabilized — there is no change in their values because the clustering has been successful.
  • The defined number of iterations has been achieved.
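
Putting the loop and both stopping conditions together, here is a minimal NumPy sketch (the function name, tolerance, and initialization scheme are our assumptions, not a reference implementation):

    import numpy as np

    def kmeans(X, k, max_iter=100, tol=1e-6, seed=0):
        # Start from k randomly selected data points as the initial centroids.
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(max_iter):  # stop 2: the iteration budget is exhausted
            # Assign every point to its nearest centroid.
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Move each centroid to the mean of its assigned points
            # (assumes no cluster empties out on the way).
            new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            if np.allclose(new_centroids, centroids, atol=tol):  # stop 1: stabilized
                break
            centroids = new_centroids
        return labels, centroids

For example, kmeans(X, k=3) returns a cluster label for every row of X along with the final centroid positions.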

Hierarchical Clustering

Hierarchical clustering, as the name suggests, is an algorithm that builds a hierarchy of clusters. In its agglomerative (bottom-up) form, the algorithm starts with every data point assigned to a cluster of its own. The two nearest clusters are then merged repeatedly, and the algorithm terminates when only a single cluster is left.
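
Here is a small sketch of this bottom-up procedure using SciPy (the 'ward' linkage and 25-point toy data are assumptions; plotting it produces the kind of dendrogram discussed next):

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import dendrogram, linkage

    rng = np.random.default_rng(0)
    X = rng.normal(size=(25, 2))  # 25 toy points, matching the example below

    # 'ward' merges the pair of clusters whose union increases total
    # within-cluster variance the least; 'single', 'complete', and 'average'
    # are other common notions of distance between clusters.
    Z = linkage(X, method="ward")

    dendrogram(Z)
    plt.show()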

The results of hierarchical clustering can be shown using a dendrogram, which can be interpreted as follows:

(Figure: a dendrogram of 25 data points, merged pairwise from the bottom up.)

At the bottom, we start with 25 data points, each assigned to a separate cluster. The two closest clusters are then merged repeatedly until just one cluster remains at the top. The height at which two clusters are merged in the dendrogram represents the distance between those clusters in the data space.

The number of clusters that best depicts the different groups can be chosen by observing the dendrogram. The best choice is the number of vertical lines in the dendrogram cut by a horizontal line that can traverse the maximum distance vertically without intersecting a cluster.

In the example above, the best choice is four clusters, since the red horizontal line in the dendrogram below covers the maximum vertical distance AB.

(Figure: the same dendrogram with a red horizontal line cutting four vertical lines and spanning the maximum vertical distance AB.)
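
Continuing the SciPy sketch above, that choice can be turned into flat cluster labels by cutting the tree:

    from scipy.cluster.hierarchy import fcluster

    # Cut the linkage tree Z from the sketch above into 4 flat clusters,
    # matching what the horizontal-line heuristic suggests here.
    labels = fcluster(Z, t=4, criterion="maxclust")  # labels take values 1..4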

Differences between K-means and Hierarchical Clustering

  • Hierarchical clustering can’t handle big data well, but K-means can. This is because the time complexity of K-means is roughly linear in the number of points, i.e. O(n), while that of hierarchical clustering is quadratic, i.e. O(n²).
  • In K-means, since we start with a random choice of centroids, the results produced by running the algorithm multiple times may differ, whereas results are reproducible in hierarchical clustering (see the sketch after this list).
  • K-means is found to work well when the shape of the clusters is hyperspherical (like a circle in 2D or a sphere in 3D).
  • K-means requires prior knowledge of K, i.e. the number of clusters you want to divide your data into, but in hierarchical clustering you can stop at whatever number of clusters you find appropriate by interpreting the dendrogram.
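
A small sketch of the reproducibility point (the toy data and seeds are assumptions): different random initializations can land k-means in different local optima, while agglomerative clustering involves no randomness at all.

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.random.default_rng(7).normal(size=(200, 2))

    # A single run per seed (n_init=1) exposes the dependence on initialization.
    for seed in (0, 1, 2):
        km = KMeans(n_clusters=4, n_init=1, random_state=seed).fit(X)
        print(seed, round(km.inertia_, 2))  # inertia_ = in-cluster sum of squares

    # scikit-learn's default of several restarts (n_init > 1) mitigates this
    # by keeping the run with the lowest inertia.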

Applications of Clustering

Clustering has a large number of applications spread across various domains. Some of the most popular applications of clustering are:

  • Recommendation engines
  • Market segmentation
  • Social network analysis
  • Search result grouping
  • Medical imaging
  • Image segmentation
  • Anomaly detection

In this article, we discussed various ways of performing clustering, which finds applications for unsupervised learning in a large number of domains. Clustering can also serve as a preprocessing step that improves the accuracy of supervised machine learning models.

Although clustering is easy to implement, you need to take care of some important aspects, such as treating outliers in your data and making sure each cluster has a sufficient population. A small sketch of these two checks follows.
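
A minimal sketch, assuming standardization as the outlier-mitigation step and an arbitrary 2% size threshold (both are our assumptions, not from the original article):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    X = np.random.default_rng(3).normal(size=(500, 4))

    # Standardizing first keeps one wide-ranging or outlier-prone feature
    # from dominating the distance computations.
    Xs = StandardScaler().fit_transform(X)
    labels = KMeans(n_clusters=5, n_init=10, random_state=3).fit_predict(Xs)

    # Flag clusters whose population is too thin to be useful.
    sizes = np.bincount(labels)
    if sizes.min() < 0.02 * len(X):  # 2% cut-off is an arbitrary assumption
        print("warning: at least one cluster holds under 2% of the points")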
