Hierarchical clustering: The simplest clustering algorithm
Suravi Mahanta
Senior Consultant at EY GDS | Ex-Accenture | Microsoft Modern Data Platform Expert | Big Data Specialist | AI/ML Engineer | 4X Microsoft Certified | 3X Databricks Certified | Data Architecture
Hierarchical clustering, also known as hierarchical cluster analysis, is an algorithm that groups similar objects into groups called clusters.
Hierarchical clustering is one of the popular and easy to understand clustering technique. This clustering technique is divided into two types:
- Agglomerative Hierarchical clustering:
In this technique, initially each data point is considered as an individual cluster. At each iteration, the similar clusters merge with other clusters until one cluster or K clusters are formed.
The basic algorithm of Agglomerative is straight forward.
- Compute the proximity matrix
- Let each data point be a cluster
- Repeat: Merge the two closest clusters and update the proximity matrix
- Until only a single cluster remains
Let's try to understand it with a example:
To understand better let’s see a pictorial representation of the Agglomerative Hierarchical clustering Technique. Lets say we have six data points {A,B,C,D,E,F}.
- Step- 1: In the initial step, we calculate the proximity of individual points and consider all the six data points as individual clusters as shown in the image below.
- Step- 2: In step two, similar clusters are merged together and formed as a single cluster. Let’s consider B,C, and D,E are similar clusters that are merged in step two. Now, we’re left with four clusters which are A, BC, DE, F.
- Step- 3: We again calculate the proximity of new clusters and merge the similar clusters to form new clusters A, BC, DEF.
- Step- 4: Calculate the proximity of the new clusters. The clusters DEF and BC are similar and merged together to form a new cluster. We’re now left with two clusters A, BCDEF.
- Step- 5: Finally, all the clusters are merged together and form a single cluster.
The Hierarchical clustering Technique can be visualized using a Dendrogram.
2. Divisive Hierarchical clustering Technique:
Since the Divisive Hierarchical clustering Technique is not much used in the real world, I’ll give a brief of the Divisive Hierarchical clustering Technique.
In simple words, we can say that the Divisive Hierarchical clustering is exactly the opposite of the Agglomerative Hierarchical clustering.
In Divisive Hierarchical clustering, we consider all the data points as a single cluster and in each iteration, we separate the data points from the cluster which are not similar. Each data point which is separated is considered as an individual cluster. In the end, we’ll be left with n clusters.As we’re dividing the single clusters into n clusters, it is named as Divisive Hierarchical clustering.
We divide the cluster into smaller chunks using the Measures of distance or the linkage criteria. So, let's understand what it is
Measures of distance (similarity)
Calculating the similarity between two clusters is important to merge or divide the clusters. There are certain approaches which are used to calculate the similarity between two clusters:
The pros and cons of the cluster are given below:
Space and Time Complexity of Hierarchical clustering Technique:
Space complexity:
The space required for the Hierarchical clustering Technique is very high when the number of data points are high as we need to store the similarity matrix in the RAM. The space complexity is the order of the square of n.
Space complexity = O(n2) where n is the number of data points.
Time complexity:
Since we’ve to perform n iterations and in each iteration, we need to update the similarity matrix and restore the matrix, the time complexity is also very high. The time complexity is the order of cube of n.
Time complexity = O(n3) where n is the number of data points.
Limitations of Hierarchical clustering Technique:
- There is no mathematical objective for Hierarchical clustering.
- All the approaches to calculate the similarity between clusters has its own disadvantages.
- High space and time complexity for Hierarchical clustering. Hence this clustering algorithm cannot be used when we have huge data.
Hierarchical clustering with Python:
You can download the dataset using https://archive.ics.uci.edu/ml/machine-learning-databases/00292/Wholesale%20customers%20data.csv this link. The data is hosted on the UCI Machine Learning repository. The aim of this problem is to segment the clients of a wholesale distributor based on their annual spending on diverse product categories, like milk, grocery, region, etc. I come to know about this dataset in Analytic Vidhya "A begineer guide to Hierarchical clustering" artivle.
Let’s explore the data first and then apply Hierarchical Clustering to segment the clients.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data = pd.read_csv('Wholesale customers data.csv')
from sklearn.preprocessing import normalize
data_scaled = normalize(data)
data_scaled = pd.DataFrame(data_scaled, columns=data.columns)
from sklearn.cluster import AgglomerativeClustering
cluster = AgglomerativeClustering(n_clusters=2, affinity='euclidean', linkage='ward')
cluster.fit_predict(data_scaled)
Sklearn has a library called sklearn.cluster which contain all the popular clustering concepts. For more information you can refer this link https://scikit-learn.org/stable/modules/clustering.html
I hope you like this article.
Through this article, I would also like to thank each and everyone who read, liked, clapped, commented on my articles. This is the sole motivation which encourages me to write articles.
Keep reading and I’ll keep writing.
References:
https://www.analyticsvidhya.com/blog/2019/05/beginners-guide-hierarchical-clustering/
Senior Business Systems Analyst / Data Analyst | 9+ yrs exp. | Banking Fintech domain | Agile Scrum | Data Analysis, API Design Architecture (SA) & Integration, Digital Transformation (Web/Mobile apps), AML projects.
1 年Business Data Analyst
Student at Wayamba University of Sri Lanka
3 年mu
Expert CISCO/Brocade SAN Switches, DELL EMC Isilon/SAN, IBM SAN/Netapp | Storage Lead at Kyndryl | Ex-IBM | GCP & Azure & AWS Certified Solution Architect | 2X Microsoft Certified Cloud Practitioner
5 年keep on writing awesome contents like this ????
Data Analytics - Tableau || DOMO || Power BI || SQL || Python || Machine Learning
5 年Nice mam....thanks