From Noise to Knowledge: Exploring the Magic of DBSCAN, Beyond Traditional Clustering.
Chandra Prakash Bathula
Adjunct Faculty at Saint Louis University | Machine Learning Practitioner | Web Developer | GenAI Developer
DBSCAN in Density-Based Algorithms: Density-Based Spatial Clustering of Applications with Noise.
The third popular type of clustering technique is called “Density-Based Clustering”. One of the many density-based algorithms is “DBSCAN”.
Theme:
=> Here, the dense regions, which are the “clusters”, are separated by “sparse regions”.
Now, a question strikes our mind:
What do ‘dense’ and ‘sparse’ mean here? And how do we mathematically quantify these terms?
Intuitively, sparse means that there are very few points in an area, while dense means that there are a lot of points packed into a small area.
=> Let’s see how to quantify them mathematically…!
*** NOTE: “Density-Based Clustering” is a popular clustering scheme, and one of the most important and popular density-based clustering algorithms is “DBSCAN”.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise): the name itself says that it is based on density and works with noise.
If we jump into the concept, a lot of questions start to attack us, and one among them is: how do we measure density? (The big question we need to deal with…!)
Measuring Density: Min Pts & Epsilon (ε):
There are two hyper-parameters in DBSCAN to measure density. They are: i) Min Pts and ii) Epsilon (ε).
i) Min Pts:
First, let us understand what the density at a point “P” is. How do we define it? It can be defined as the number of points that lie within a given radius around “P”.
The minimum number of points required to form a dense region here is called ‘Min Pts’.
ii) Epsilon (ε):
Epsilon (ε) is the maximum distance at which a point is considered part of the neighborhood in this context.
The second most important concept is the “dense region”.
Dense region:
A dense region is a hyper-sphere of radius ε around a point that contains at least “Min Pts” points.
Core Points, Border Points & Noise Points:
So, in a given dataset D = {xi}, with Min Pts and ε as hyper-parameters, we can categorize each and every point in the dataset as a “core point”, a “border point”, or a “noise point”. Each point can be defined as follows:
i) Core Points: A point “P” is a core point if its ε-neighborhood contains at least “Min Pts” points (including “P” itself).
ii) Border Points: A point “P” is a border point if it is not a core point itself, but lies in the neighborhood of at least one core point.
So, what do we mean here by neighborhood?
The neighborhood here means that the distance between P & Q is less than or equal to Epsilon (ε).
*** => dist(P, Q) ≤ ε.
iii) Noise Points: A point “P” is a noise point if it is neither a core point nor a border point. These are the outliers that DBSCAN explicitly identifies and sets aside.
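To make these three categories concrete, here is a minimal brute-force sketch (our own illustrative code, not from any library; the function name classify_points and the toy data are made up for the example):
import numpy as np

def classify_points(X, eps, min_pts):
    # Pairwise Euclidean distances (brute force; fine for a small dataset)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbors = dists <= eps                    # ε-neighborhood membership
    counts = neighbors.sum(axis=1)              # neighborhood sizes (self included)
    core = counts >= min_pts                    # core: at least Min Pts neighbors
    # Border: not a core point, but has at least one core point in its neighborhood
    border = ~core & (neighbors & core[None, :]).any(axis=1)
    return np.where(core, 'core', np.where(border, 'border', 'noise'))

# Toy example: a tight group of four points plus one far-away point
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.3, 0.0], [5.0, 5.0]])
print(classify_points(X, eps=0.3, min_pts=4))
# -> ['core' 'core' 'border' 'border' 'noise']
Here the isolated point at (5, 5) comes out as noise, while the two points that have only three neighbors each become border points attached to the dense core.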
Density Edge & Density Connected Points:
These two concepts are crucial ideas that we need to understand before we dive deep into learning “DBSCAN”. They are:
i) Density Edge:
So, what do we mean by a density edge? It is an edge connecting two core points that lie within ε of each other.
*** Note: There are two conditions to be satisfied: i) the two points must be core points, and ii) the distance between them must be less than or equal to Epsilon (ε). That’s what a “Density Edge” is..!
ii) Density Connected Points:
Two core points “P” and “Q” are density-connected if there is a path of density edges between them.
=> A connection formed by “edges” is called a “path” in graph theory, and the same idea applies here..!
P, Q: core points & dist(P, Q) ≤ ε
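As a quick sketch of this graph view (our own illustrative code; density_edges is a made-up helper name): the density edges form a graph over the core points, and two core points are density-connected exactly when they lie in the same connected component of that graph.
import numpy as np
from itertools import combinations

def density_edges(X, core_idx, eps):
    # A density edge joins two CORE points that are within eps of each other
    edges = []
    for i, j in combinations(core_idx, 2):
        if np.linalg.norm(X[i] - X[j]) <= eps:
            edges.append((i, j))
    return edges
The clusters that DBSCAN produces are, in essence, the connected components of this graph, with the border points attached to them afterwards.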
Now, let’s see what we have covered so far before we jump into the DBSCAN algorithm. The terminology includes “Min Pts, Core Point, Noise Point, Border Point, Density Edge, and Density Connected Points.”
So far we have covered the above concepts; using them, let us dive into how the “DBSCAN” algorithm itself works. Here, the two hyper-parameters, combined with the remaining key concepts, make the algorithm a very noise-resistant and robust clustering algorithm.
DBSCAN Algorithm:
Given all the terms we have learnt so far, let’s get to the core of the DBSCAN algorithm and leverage them.
Step 1: For each point xi in the dataset, find its ε-neighborhood using a “range query”, and label the point as a core point, a border point, or a noise point.
Range Query:
A range query basically does this: given a point, it returns all the points that lie within an epsilon (ε) radius of it.
*** Si = rangeQuery(xi, D, ε).
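A minimal brute-force sketch of such a range query (our own illustration; real systems typically use spatial indexes such as k-d trees, or the optimized region queries of a database):
import numpy as np

def range_query(X, i, eps):
    # Si: indices of all points within eps of point X[i] (the point itself included)
    return np.where(np.linalg.norm(X - X[i], axis=1) <= eps)[0]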
Step 2: Remove all the noise points from the dataset, since no cluster can grow from them.
*** The core step is Step 3: For each core point “P” that has not yet been assigned to a cluster:
a) Create a new cluster with “P”.
b) Add all the points that are density-connected to “P” into this new cluster.
Now that we have removed the noise points and resolved all the core points, what happens to the border points?
Step 4: Assign each border point to the cluster of its nearest core point.
This is how we handle all the points in a step-by-step process; a compact sketch of the four steps follows below.
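Here is a minimal from-scratch sketch of those four steps (our own illustrative code under the definitions above, not a reproduction of any library’s implementation; it reuses the brute-force range_query from earlier and folds Step 4 into the cluster-expansion loop, as practical implementations usually do):
import numpy as np

def range_query(X, i, eps):
    # Step 1's building block: all points within eps of X[i]
    return np.where(np.linalg.norm(X - X[i], axis=1) <= eps)[0]

def dbscan(X, eps, min_pts):
    n = len(X)
    labels = np.full(n, -1)                 # -1 = noise until proven otherwise (Step 2)
    neighborhoods = [range_query(X, i, eps) for i in range(n)]
    core = np.array([len(nb) >= min_pts for nb in neighborhoods])

    cluster_id = 0
    for i in range(n):
        if not core[i] or labels[i] != -1:  # only unassigned core points seed clusters
            continue
        labels[i] = cluster_id              # Step 3a: new cluster with "P"
        queue = list(neighborhoods[i])      # Step 3b: expand along density edges
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id      # core and border points join the cluster (Step 4)
                if core[j]:
                    queue.extend(neighborhoods[j])
        cluster_id += 1
    return labels                           # unreached points stay at -1 (noise)
Points that no core point ever reaches keep the label -1, which is also the convention scikit-learn’s DBSCAN uses for noise in the sample code below.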
Sample Code and Implementation:
This is sample code for an implementation of the DBSCAN algorithm on the famous Mall Customer Segmentation dataset, retrieved from Kaggle.
# Runs the algorithm and prints the cluster labels
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
# Load the dataset
data = pd.read_csv('Mall_Customers.csv')
# Select features for clustering
features = ['Age', 'Annual Income (k$)', 'Spending Score (1-100)']
X = data[features]
# Preprocess the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Perform DBSCAN clustering
dbscan = DBSCAN(eps=0.4, min_samples=5, metric='euclidean')
labels = dbscan.fit_predict(X_scaled)
# Print cluster labels
print("Cluster labels:")
print(labels)
2D & 3D Visualizations of the DBSCAN Clusters.
# Visualize the clusters in 2D
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=labels)
plt.xlabel('Age (standardized)')
plt.ylabel('Annual Income (standardized)')
plt.title('Mall Customer Segmentation (2D)')
plt.show()
# Visualize the clusters in 3D
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X_scaled[:, 0], X_scaled[:, 1], X_scaled[:, 2], c=labels)
ax.set_xlabel('Age (standardized)')
ax.set_ylabel('Annual Income (standardized)')
ax.set_zlabel('Spending Score (standardized)')
ax.set_title('Mall Customer Segmentation (3D)')
plt.show()
After removing the noise points, we can see the clusters clearly and well separated.
# Identify and remove noise points (DBSCAN labels them -1)
noise_mask = labels == -1
cleaned_labels = labels[~noise_mask]
# Visualize the clusters after removing noise points
plt.scatter(X_scaled[~noise_mask, 0], X_scaled[~noise_mask, 1], c=cleaned_labels)
plt.xlabel('Age (standardized)')
plt.ylabel('Annual Income (standardized)')
plt.title('Mall Customer Segmentation without Noise Points')
plt.show()
This is just a sample implementation, without any EDA, feature engineering, or further data preparation.
Hyper-parameters: Min Pts & Eps
So, the two hyper-parameters we have in DBSCAN are Min Pts and Eps.
Now, let us see how to determine them.
1. Min Pts:
Typically, Min Pts helps us do one of the most important things: remove outliers, which in this case are the noise points.
i) So, the rule of thumb is that we should always have Min Pts ≥ d + 1, where “d” is the dimensionality of the space each point “xi” belongs to.
Typically, a good rule of thumb is Min Pts ≈ 2 * d, i.e., roughly twice the dimensionality of the data. For example, for the three-feature mall dataset above, that suggests Min Pts ≈ 6.
ii) More often, we tend to choose a larger value of Min Pts if the dataset is noisier. At a given point “P”, if the ε-neighborhood of “P” contains fewer than Min Pts points and “P” is not a border point, we call it a noise point. This is good logic, because it lets the noisy points be removed; so choosing a larger Min Pts helps us remove more noisy points.
*** Choose a larger Min Pts if we know the dataset is noisier. ***
iii) In typical cases, consulting a domain expert is how we determine the hyper-parameter Min Pts.
So, the first task is to use the rule of thumb as a rough approximation: choose Min Pts roughly equal to twice the dimensionality of the data.
Choose a larger Min Pts if we know that the dataset contains too much noise.
And finally, we can consult a domain expert and ask how many points we should expect in a point’s neighborhood for the given point “P” to be declared a noise point or not.
2. Epsilon (ε):
Step 1: For each point “xi”, we compute “di”, the distance from “xi” to its k-th nearest neighbor, where k = Min Pts.
Step 2: Sort the “di” values in increasing order and plot them. Recall the “Elbow/Knee Method” that we discussed in “K-Means Clustering”: the value of “di” at the elbow/knee of this plot is a good choice for ε. Typically, we use this approach for hyper-parameter tuning.
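A minimal sketch of this elbow heuristic on the scaled mall data from the sample code above (assuming k = Min Pts = 5 to match min_samples there; NearestNeighbors is scikit-learn’s neighbor-search utility):
from sklearn.neighbors import NearestNeighbors
import numpy as np
import matplotlib.pyplot as plt

k = 5  # k = Min Pts; the query point itself counts as one of the k neighbors
nn = NearestNeighbors(n_neighbors=k).fit(X_scaled)
distances, _ = nn.kneighbors(X_scaled)
d_i = np.sort(distances[:, -1])   # each point's k-th nearest-neighbor distance, ascending

plt.plot(d_i)
plt.xlabel('Points sorted by d_i')
plt.ylabel('Distance to k-th nearest neighbor (d_i)')
plt.title('Elbow/Knee plot for choosing eps')
plt.show()
The value of d_i where the curve bends sharply upward separates the dense majority from the sparse tail, and is a reasonable starting value for eps.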
Advantages & Limitations of DBSCAN:
Advantages: DBSCAN does not require us to specify the number of clusters in advance, it can discover clusters of arbitrary shape, and it is robust to noise and outliers.
Limitations: DBSCAN is extremely sensitive to its hyper-parameters and struggles when clusters have widely varying densities. A small change in ε can alter the clustering drastically, as these two settings illustrate:
Eps(ε) = 0.4 & Min Pts = 6.
Eps(ε) = 0.5 & Min Pts = 6.
Time & Space Complexity:
The average time and space complexity for DBSCAN are as follows:
Time Complexity:
Time Complexity ≈ O(n log n), assuming the range queries are served by an efficient spatial index.
This is fairly decent and far better than hierarchical clustering, which is O(n²).
Space Complexity:
Space Complexity ≈ O(n).
So, the time & space complexity is very healthy and better than “Hierarchical Clustering”. The only problem is that it is extremely sensitive to the hyper-parameters.
Interpretability:
Typically, what happens with the DBSCAN algorithm in the real world is this: since it was designed by the database and data-mining community, it works well when the data is stored in traditional databases such as Oracle, MySQL, etc., where range (region) queries are highly optimized.