From Noise to Knowledge: Exploring the Magic of DBSCAN, Beyond Traditional Clustering.
Chandra Prakash Bathula
Adjunct Faculty at Saint Louis University | Machine Learning Practitioner | Web Developer | GenAI Developer
DBSCAN in Density-Based Algorithms: Density-Based Spatial Clustering of Applications with Noise.
The third popular type of clustering technique is called “Density-Based Clustering”. One of the many density-based algorithms is “DBSCAN”.
Theme:
=> Here, the dense regions, which are the “clusters”, are separated by “sparse regions”.
Now, a question strikes our mind:
What do ‘dense’ and ‘sparse’ mean here? And how do we mathematically quantify these terms?
Intuitively, sparse means that there are very few points in an area, while dense means that there are a lot of points packed into a small area.
=> Let’s see how to quantify them mathematically…!
*** NOTE: “Density-Based Clustering” is a popular clustering scheme, and one of the most important and popular density-based clustering algorithms is “DBSCAN”.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise): the name itself says that it is based on density and works with noise.
If we jump into the concept, a lot of questions start to attack us, and one among them is: how do we measure density? (The big question we need to deal with…!)
Measuring Density: Min Pts & Epsilon (ε):
There are two hyper-parameters in DBSCAN to measure density. They are: i) Min Pts and ii) Epsilon (ε).
i) Min Pts:
First, let us understand what the density at a point “P” is. How do we define it? It can be defined as the number of points that lie within a given radius around “P”.
The minimum number of points required to form a dense region here is called ‘Min Pts’.
ii) Epsilon (ε):
Epsilon (ε) is the maximum distance at which a point is considered part of the neighborhood in this context.
The second most important concept is the “dense region”.
Dense region:
A dense region is a hyper-sphere of radius ε around a point that contains at least “Min Pts” points.
Core Points, Border Points & Noise Points:
So, in a given dataset D = {xi}, with Min Pts and ε as hyper-parameters, we can categorize each and every point in the dataset as a “core point”, a “border point”, or a “noise point”. Each point can be defined as follows:
i) Core Points: A point “P” is a core point if its ε-neighborhood contains at least “Min Pts” points (including “P” itself).
ii) Border Points: A point “P” is a border point if it is not a core point itself, but lies in the neighborhood of at least one core point.
So, what do we mean here by neighborhood?
The neighborhood here means that the distance between P & Q is less than or equal to Epsilon (ε).
*** => dist(P, Q) ≤ ε.
iii) Noise Points: A point “P” is a noise point if it is neither a core point nor a border point. These are the outliers that DBSCAN explicitly identifies and sets aside.
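To make these three categories concrete, here is a minimal brute-force sketch (our own illustrative code, not from any library; the function name classify_points and the toy data are made up for the example):
import numpy as np

def classify_points(X, eps, min_pts):
    # Pairwise Euclidean distances (brute force; fine for a small dataset)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbors = dists <= eps                    # ε-neighborhood membership
    counts = neighbors.sum(axis=1)              # neighborhood sizes (self included)
    core = counts >= min_pts                    # core: at least Min Pts neighbors
    # Border: not a core point, but has at least one core point in its neighborhood
    border = ~core & (neighbors & core[None, :]).any(axis=1)
    return np.where(core, 'core', np.where(border, 'border', 'noise'))

# Toy example: a tight group of four points plus one far-away point
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.3, 0.0], [5.0, 5.0]])
print(classify_points(X, eps=0.3, min_pts=4))
# -> ['core' 'core' 'border' 'border' 'noise']
Here the isolated point at (5, 5) comes out as noise, while the two points that have only three neighbors each become border points attached to the dense core.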
Density Edge & Density Connected Points:
These two concepts are crucial ideas that we need to understand before we dive deep into learning “DBSCAN”. They are:
i) Density Edge:
So, what do we mean by a density edge? It is an edge connecting two core points that lie within ε of each other.
*** Note: There are two conditions to be satisfied: i) the two points must be core points, and ii) the distance between them must be less than or equal to Epsilon (ε). That’s what a “Density Edge” is..!
ii) Density Connected Points:
Two core points “P” and “Q” are density-connected if there is a path of density edges between them.
=> A connection formed by “edges” is called a “path” in graph theory, and the same idea applies here..!
P, Q: core points & dist(P, Q) ≤ ε
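As a quick sketch of this graph view (our own illustrative code; density_edges is a made-up helper name): the density edges form a graph over the core points, and two core points are density-connected exactly when they lie in the same connected component of that graph.
import numpy as np
from itertools import combinations

def density_edges(X, core_idx, eps):
    # A density edge joins two CORE points that are within eps of each other
    edges = []
    for i, j in combinations(core_idx, 2):
        if np.linalg.norm(X[i] - X[j]) <= eps:
            edges.append((i, j))
    return edges
The clusters that DBSCAN produces are, in essence, the connected components of this graph, with the border points attached to them afterwards.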
Now, let’s see what we have covered so far before we jump into the DBSCAN algorithm. The terminology includes “Min Pts, Core Point, Noise Point, Border Point, Density Edge, and Density Connected Points.”
So far we have covered the above concepts; using them, let us dive into how the “DBSCAN” algorithm itself works. Here, the two hyper-parameters, combined with the remaining key concepts, make the algorithm a very noise-resistant and robust clustering algorithm.
DBSCAN Algorithm:
Given all the terms we have learnt so far, let’s get to the core of the DBSCAN algorithm and leverage them.
Step 1: For each point xi in the dataset, find its ε-neighborhood using a “range query”, and label the point as a core point, a border point, or a noise point.
Range Query:
A range query basically does this: given a point, it returns all the points that lie within an epsilon (ε) radius of it.
*** Si = rangeQuery(xi, D, ε).
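A minimal brute-force sketch of such a range query (our own illustration; real systems typically use spatial indexes such as k-d trees, or the optimized region queries of a database):
import numpy as np

def range_query(X, i, eps):
    # Si: indices of all points within eps of point X[i] (the point itself included)
    return np.where(np.linalg.norm(X - X[i], axis=1) <= eps)[0]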
Step 2: Remove all the noise points from the dataset, since no cluster can grow from them.
*** The core step is Step 3: For each core point “P” that has not yet been assigned to a cluster:
a) Create a new cluster with “P”.
b) Add all the points that are density-connected to “P” into this new cluster.
Now that we have removed the noise points and resolved all the core points, what happens to the border points?
Step 4: Assign each border point to the cluster of its nearest core point.
This is how we handle all the points in a step-by-step process; a compact sketch of the four steps follows below.
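Here is a minimal from-scratch sketch of those four steps (our own illustrative code under the definitions above, not a reproduction of any library’s implementation; it reuses the brute-force range_query from earlier and folds Step 4 into the cluster-expansion loop, as practical implementations usually do):
import numpy as np

def range_query(X, i, eps):
    # Step 1's building block: all points within eps of X[i]
    return np.where(np.linalg.norm(X - X[i], axis=1) <= eps)[0]

def dbscan(X, eps, min_pts):
    n = len(X)
    labels = np.full(n, -1)                 # -1 = noise until proven otherwise (Step 2)
    neighborhoods = [range_query(X, i, eps) for i in range(n)]
    core = np.array([len(nb) >= min_pts for nb in neighborhoods])

    cluster_id = 0
    for i in range(n):
        if not core[i] or labels[i] != -1:  # only unassigned core points seed clusters
            continue
        labels[i] = cluster_id              # Step 3a: new cluster with "P"
        queue = list(neighborhoods[i])      # Step 3b: expand along density edges
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id      # core and border points join the cluster (Step 4)
                if core[j]:
                    queue.extend(neighborhoods[j])
        cluster_id += 1
    return labels                           # unreached points stay at -1 (noise)
Points that no core point ever reaches keep the label -1, which is also the convention scikit-learn’s DBSCAN uses for noise in the sample code below.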
Sample Code and Implementation:
This is sample code for an implementation of the DBSCAN algorithm on the famous Mall Customer Segmentation dataset, retrieved from Kaggle.
# Runs the algorithm and prints the cluster labels
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
# Load the dataset
data = pd.read_csv('Mall_Customers.csv')
# Select features for clustering
features = ['Age', 'Annual Income (k$)', 'Spending Score (1-100)']
X = data[features]
# Preprocess the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Perform DBSCAN clustering
dbscan = DBSCAN(eps=0.4, min_samples=5, metric='euclidean')
labels = dbscan.fit_predict(X_scaled)
# Print cluster labels
print("Cluster labels:")
print(labels)
2D & 3D Visualizations of the DBSCAN Clusters.
# Visualize the clusters in 2D
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=labels)
plt.xlabel('Age (standardized)')
plt.ylabel('Annual Income (standardized)')
plt.title('Mall Customer Segmentation (2D)')
plt.show()
# Visualize the clusters in 3D
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X_scaled[:, 0], X_scaled[:, 1], X_scaled[:, 2], c=labels)
ax.set_xlabel('Age (standardized)')
ax.set_ylabel('Annual Income (standardized)')
ax.set_zlabel('Spending Score (standardized)')
ax.set_title('Mall Customer Segmentation (3D)')
plt.show()
After removing the noise points, we can see the clusters clearly and well separated.
# Identify and remove noise points (DBSCAN labels them -1)
noise_mask = labels == -1
cleaned_labels = labels[~noise_mask]
# Visualize the clusters after removing noise points
plt.scatter(X_scaled[~noise_mask, 0], X_scaled[~noise_mask, 1], c=cleaned_labels)
plt.xlabel('Age (standardized)')
plt.ylabel('Annual Income (standardized)')
plt.title('Mall Customer Segmentation without Noise Points')
plt.show()
This is just a sample implementation, without any EDA, feature engineering, or further data preparation.
Hyper-parameters: Min Pts & Eps
So, the two hyper-parameters we have in DBSCAN are Min Pts and Eps.
Now, let us see how to determine them.
1. Min Pts:
Typically, Min Pts helps us do one of the most important things: remove outliers, which in this case are the noise points.
i) So, the rule of thumb is that we should always have Min Pts ≥ d + 1, where “d” is the dimensionality of the space each point “xi” belongs to.
Typically, a good rule of thumb is Min Pts ≈ 2 * d, i.e., roughly twice the dimensionality of the data. For example, for the three-feature mall dataset above, that suggests Min Pts ≈ 6.
ii) More often, we tend to choose a larger value of Min Pts if the dataset is noisier. At a given point “P”, if the ε-neighborhood of “P” contains fewer than Min Pts points and “P” is not a border point, we call it a noise point. This is good logic, because it lets the noisy points be removed; so choosing a larger Min Pts helps us remove more noisy points.
*** Choose a larger Min Pts if we know the dataset is noisier. ***
iii) In typical cases, consulting a domain expert is how we determine the hyper-parameter Min Pts.
So, the first task is to use the rule of thumb as a rough approximation: choose Min Pts roughly equal to twice the dimensionality of the data.
Choose a larger Min Pts if we know that the dataset contains too much noise.
And finally, we can consult a domain expert and ask how many points we should expect in a point’s neighborhood for the given point “P” to be declared a noise point or not.
2. Epsilon (ε):
Step 1: For each point “xi”, we compute “di”, the distance from “xi” to its k-th nearest neighbor, where k = Min Pts.
Step 2: Sort the “di” values in increasing order and plot them. Recall the “Elbow/Knee Method” that we discussed in “K-Means Clustering”: the value of “di” at the elbow/knee of this plot is a good choice for ε. Typically, we use this approach for hyper-parameter tuning.
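A minimal sketch of this elbow heuristic on the scaled mall data from the sample code above (assuming k = Min Pts = 5 to match min_samples there; NearestNeighbors is scikit-learn’s neighbor-search utility):
from sklearn.neighbors import NearestNeighbors
import numpy as np
import matplotlib.pyplot as plt

k = 5  # k = Min Pts; the query point itself counts as one of the k neighbors
nn = NearestNeighbors(n_neighbors=k).fit(X_scaled)
distances, _ = nn.kneighbors(X_scaled)
d_i = np.sort(distances[:, -1])   # each point's k-th nearest-neighbor distance, ascending

plt.plot(d_i)
plt.xlabel('Points sorted by d_i')
plt.ylabel('Distance to k-th nearest neighbor (d_i)')
plt.title('Elbow/Knee plot for choosing eps')
plt.show()
The value of d_i where the curve bends sharply upward separates the dense majority from the sparse tail, and is a reasonable starting value for eps.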
Advantages & Limitations of DBSCAN:
Advantages: DBSCAN does not require us to specify the number of clusters in advance, it can discover clusters of arbitrary shape, and it is robust to noise and outliers.
Limitations: DBSCAN is extremely sensitive to its hyper-parameters and struggles when clusters have widely varying densities. A small change in ε can alter the clustering drastically, as these two settings illustrate:
Eps(ε) = 0.4 & Min Pts = 6.
Eps(ε) = 0.5 & Min Pts = 6.
Time & Space Complexity:
The average time and space complexity for DBSCAN are as follows:
Time Complexity:
Time Complexity ≈ O(n log n), assuming the range queries are served by an efficient spatial index.
This is fairly decent and far better than hierarchical clustering, which is O(n²).
Space Complexity:
Space Complexity ≈ O(n).
So, the time & space complexity is very healthy and better than “Hierarchical Clustering”. The only problem is that it is extremely sensitive to the hyper-parameters.
Interpretability:
Typically, what happens with the DBSCAN algorithm in the real world is this: since it was designed by the database and data-mining community, it works well when the data is stored in traditional databases such as Oracle, MySQL, etc., where range (region) queries are highly optimized.