登录查看更多内容

Hierarchical clustering: The simplest clustering algorithm

Suravi Mahanta

Senior Consultant at EY GDS | Ex-Accenture | Microsoft Modern Data Platform Expert | Big Data Specialist | AI/ML Engineer | 4X Microsoft Certified | 3X Databricks Certified | Data Architecture

发布日期: 2019年8月21日

Hierarchical clustering, also known as hierarchical cluster analysis, is an algorithm that groups similar objects into groups called clusters.

Hierarchical clustering is one of the popular and easy to understand clustering technique. This clustering technique is divided into two types:

Agglomerative Hierarchical clustering:

In this technique, initially each data point is considered as an individual cluster. At each iteration, the similar clusters merge with other clusters until one cluster or K clusters are formed.

The basic algorithm of Agglomerative is straight forward.

Compute the proximity matrix
Let each data point be a cluster
Repeat: Merge the two closest clusters and update the proximity matrix
Until only a single cluster remains

Let's try to understand it with a example:

To understand better let’s see a pictorial representation of the Agglomerative Hierarchical clustering Technique. Lets say we have six data points {A,B,C,D,E,F}.

Step- 1: In the initial step, we calculate the proximity of individual points and consider all the six data points as individual clusters as shown in the image below.
Step- 2: In step two, similar clusters are merged together and formed as a single cluster. Let’s consider B,C, and D,E are similar clusters that are merged in step two. Now, we’re left with four clusters which are A, BC, DE, F.
Step- 3: We again calculate the proximity of new clusters and merge the similar clusters to form new clusters A, BC, DEF.
Step- 4: Calculate the proximity of the new clusters. The clusters DEF and BC are similar and merged together to form a new cluster. We’re now left with two clusters A, BCDEF.
Step- 5: Finally, all the clusters are merged together and form a single cluster.

The Hierarchical clustering Technique can be visualized using a Dendrogram.

2. Divisive Hierarchical clustering Technique:

Since the Divisive Hierarchical clustering Technique is not much used in the real world, I’ll give a brief of the Divisive Hierarchical clustering Technique.

In simple words, we can say that the Divisive Hierarchical clustering is exactly the opposite of the Agglomerative Hierarchical clustering.

In Divisive Hierarchical clustering, we consider all the data points as a single cluster and in each iteration, we separate the data points from the cluster which are not similar. Each data point which is separated is considered as an individual cluster. In the end, we’ll be left with n clusters.As we’re dividing the single clusters into n clusters, it is named as Divisive Hierarchical clustering.

We divide the cluster into smaller chunks using the Measures of distance or the linkage criteria. So, let's understand what it is

Measures of distance (similarity)

Calculating the similarity between two clusters is important to merge or divide the clusters. There are certain approaches which are used to calculate the similarity between two clusters:

The pros and cons of the cluster are given below:

Space and Time Complexity of Hierarchical clustering Technique:

Space complexity:

The space required for the Hierarchical clustering Technique is very high when the number of data points are high as we need to store the similarity matrix in the RAM. The space complexity is the order of the square of n.

Space complexity = O(n2) where n is the number of data points.

Time complexity:

Since we’ve to perform n iterations and in each iteration, we need to update the similarity matrix and restore the matrix, the time complexity is also very high. The time complexity is the order of cube of n.

Time complexity = O(n3) where n is the number of data points.

Limitations of Hierarchical clustering Technique:

There is no mathematical objective for Hierarchical clustering.
All the approaches to calculate the similarity between clusters has its own disadvantages.
High space and time complexity for Hierarchical clustering. Hence this clustering algorithm cannot be used when we have huge data.

Hierarchical clustering with Python:

You can download the dataset using https://archive.ics.uci.edu/ml/machine-learning-databases/00292/Wholesale%20customers%20data.csv this link. The data is hosted on the UCI Machine Learning repository. The aim of this problem is to segment the clients of a wholesale distributor based on their annual spending on diverse product categories, like milk, grocery, region, etc. I come to know about this dataset in Analytic Vidhya "A begineer guide to Hierarchical clustering" artivle.

Let’s explore the data first and then apply Hierarchical Clustering to segment the clients.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data = pd.read_csv('Wholesale customers data.csv')
from sklearn.preprocessing import normalize
data_scaled = normalize(data)
data_scaled = pd.DataFrame(data_scaled, columns=data.columns)
from sklearn.cluster import AgglomerativeClustering
cluster = AgglomerativeClustering(n_clusters=2, affinity='euclidean', linkage='ward')  
	cluster.fit_predict(data_scaled)

Sklearn has a library called sklearn.cluster which contain all the popular clustering concepts. For more information you can refer this link https://scikit-learn.org/stable/modules/clustering.html

I hope you like this article.

Through this article, I would also like to thank each and everyone who read, liked, clapped, commented on my articles. This is the sole motivation which encourages me to write articles.

Keep reading and I’ll keep writing.

References:

https://www.analyticsvidhya.com/blog/2019/05/beginners-guide-hierarchical-clustering/

https://scikit-learn.org/stable/modules/clustering.html

Priti Handa

Senior Business Systems Analyst / Data Analyst | 9+ yrs exp. | Banking Fintech domain | Agile Scrum | Data Analysis, API Design Architecture (SA) & Integration, Digital Transformation (Web/Mobile apps), AML projects.

1 年

Business Data Analyst

Mekhala Uduweriya

Student at Wayamba University of Sri Lanka

3 年

Dibya Ranjan Rath

Expert CISCO/Brocade SAN Switches, DELL EMC Isilon/SAN, IBM SAN/Netapp | Storage Lead at Kyndryl | Ex-IBM | GCP & Azure & AWS Certified Solution Architect | 2X Microsoft Certified Cloud Practitioner

5 年

keep on writing awesome contents like this ????

1 次回应

Atul Kumar

Data Analytics - Tableau || DOMO || Power BI || SQL || Python || Machine Learning

5 年

Nice mam....thanks

1 次回应

查看更多评论

要查看或添加评论，请登录

Suravi Mahanta的更多文章

Why we need Analytics in Suppl chain management

2020年6月20日

Why we need Analytics in Suppl chain management

Supply chain is the backbone of any product based company. To have a succesful busniess, you need a sucessfull supply…
Prophet Forecasting

2020年6月7日

Prophet Forecasting

Forecasting is one of the most commonly used machine learning algorithms in any business. It’s become a necessity to…
Important statistics for Data science

2020年2月13日

Important statistics for Data science

In this blog I'll try to cover most of the statistical measures which are used in almost all data science projects. So…

5 条评论
Text Mining

2019年7月7日

Text Mining

Text mining is the process of exploring and analyzing large amounts of unstructured text data aided by software that…

2 条评论
Let's evaluate classification model with ROC and PR curves.

2019年5月31日

Let's evaluate classification model with ROC and PR curves.

Model evaluation is one of the most important part while building model. And to evaluate the model we use different…

4 条评论
Let's find the value of K for K means in 2 minutes and by using two methods.

2019年5月10日

Let's find the value of K for K means in 2 minutes and by using two methods.

Determining the optimal number of clusters in a data set is a fundamental issue in partitioning clustering, such as K…

1 条评论
What is the best algorithm for classification problem?

2019年4月23日

What is the best algorithm for classification problem?

Classification is one of the data mining tasks, applied in many area especially in retail, banking sector…
Journey from DataBricks to Azure DataBricks

2019年4月4日

Journey from DataBricks to Azure DataBricks

DataBricks is an organization and big data processing platform designed by the creators of Apache Spark. It was founded…
What’s intelligence and how intelligent an Artificial Intelligence Engineer must be?

2019年3月24日

What’s intelligence and how intelligent an Artificial Intelligence Engineer must be?

Intelligence has been defined in many ways, including: the capacity for logic, understanding, self-awareness, learning,…

See all articles

Hierarchical clustering: The simplest clustering algorithm

Suravi Mahanta

Senior Consultant at EY GDS | Ex-Accenture | Microsoft Modern Data Platform Expert | Big Data Specialist | AI/ML Engineer | 4X Microsoft Certified | 3X Databricks Certified | Data Architecture

2. Divisive Hierarchical clustering Technique:

Measures of distance (similarity)

Space and Time Complexity of Hierarchical clustering Technique:

Space complexity:

Time complexity:

Limitations of Hierarchical clustering Technique:

Hierarchical clustering with Python:

References:

Suravi Mahanta的更多文章

社区洞察

其他会员也浏览了

Ghosts In The Machine: Uncovering Five Hidden Patterns In Your Data

Hierarchical Clustering: A Comprehensive Guide to Understanding and Applying This Powerful Data Analysis Technique

Hierarchical Clustering: Financial Market Analysis

Resampling Techniques: Unlocking the Hidden Potential of Your Data

K-Means Clustering - Use Cases

Introduction to K-Means Clustering

Hierarchial Clustering

Clustering

Data Science.

Data Science

2. Divisive Hierarchical clustering Technique:

Measures of distance (similarity)

Space and Time Complexity of Hierarchical clustering Technique:

Space complexity:

Time complexity:

Limitations of Hierarchical clustering Technique:

Hierarchical clustering with Python:

References:

Suravi Mahanta的更多文章

Why we need Analytics in Suppl chain management

Prophet Forecasting

Important statistics for Data science

Text Mining

Let's evaluate classification model with ROC and PR curves.

Let's find the value of K for K means in 2 minutes and by using two methods.

What is the best algorithm for classification problem?

Journey from DataBricks to Azure DataBricks

What’s intelligence and how intelligent an Artificial Intelligence Engineer must be?

社区洞察

其他会员也浏览了

Ghosts In The Machine: Uncovering Five Hidden Patterns In Your Data

Hierarchical Clustering: A Comprehensive Guide to Understanding and Applying This Powerful Data Analysis Technique

Hierarchical Clustering: Financial Market Analysis

Resampling Techniques: Unlocking the Hidden Potential of Your Data

K-Means Clustering - Use Cases

Introduction to K-Means Clustering

Hierarchial Clustering

Clustering

Data Science.

Data Science