Understanding the basics of Data Clustering
Clustering
Clustering is the task of dividing a population of data points into groups such that data points in the same group are more similar to each other than to data points in other groups. In simple words, the aim is to segregate groups with similar traits and assign them into clusters.
In terms of data analytics, when we plot the data points on an X-Y plane, the distance between two points can be computed easily using the formula

Distance = √(ΔX² + ΔY²)

Here, ΔX and ΔY are the differences of the X and Y coordinates of the two points.
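The formula above can be computed directly in Python; the two points here are arbitrary values chosen for illustration:

```python
import math

# Two points on the X-Y plane (illustrative values).
p1 = (1.0, 2.0)
p2 = (4.0, 6.0)

dx = p2[0] - p1[0]  # ΔX
dy = p2[1] - p1[1]  # ΔY

# Distance = √(ΔX² + ΔY²)
distance = math.sqrt(dx**2 + dy**2)
print(distance)  # 5.0
```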
Now, clustering is a method of segregating the data points in such a way that the distance between the data points within one group is smaller than their distance to points in other groups. Data points that are close to each other form a group, whereas data points at a relatively greater distance are not included in that group.
Here, the distance between the data points inside one group is called the intra-cluster distance, whereas the distance between the data points of two different groups is called the inter-cluster distance.
As we can see, there is no fixed shape in which a group must be defined: a group of data points can be of any shape and size.
The clustering technique is mainly used in unsupervised learning, where we do not have labelled data. We need to draw references from datasets consisting of input data without labelled responses. Generally, it is used as a process to find meaningful structure, underlying explanatory processes, generative features, and groupings inherent in a set of data points.
Clustering Methods
1. Density-Based Methods: These methods treat clusters as dense regions of the space that are similar internally and distinct from the surrounding lower-density regions. They have good accuracy and the ability to merge two clusters.
2. Hierarchical-Based Methods: The clusters formed by these methods make up a tree-type structure based on hierarchy, with new clusters formed from previously formed ones. They are divided into two categories:
· Agglomerative (bottom-up approach).
· Divisive (top-down approach).
3. Partitioning Methods: These methods partition the objects into k clusters, with each partition forming one cluster. They optimize an objective criterion (similarity function), typically one in which distance is the major parameter; examples include K-Means and CLARANS (Clustering Large Applications based upon RANdomized Search). This approach is also referred to as centroid-based clustering, as it organizes the data in a non-hierarchical manner.
4. Distribution-based Methods: This clustering approach assumes the data is composed of distributions, such as Gaussian distributions. For example, a distribution-based algorithm might cluster the data into three Gaussian distributions. As the distance from a distribution's center increases, the probability that a point belongs to that distribution decreases.
5. Fuzzy Clustering: In this type of clustering, data points can belong to more than one cluster. Each data point has a membership coefficient for each cluster that corresponds to its degree of belonging to that cluster. Fuzzy clustering is also known as soft clustering.
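As an illustration of the distribution-based approach above, the following sketch fits a Gaussian mixture with scikit-learn; the two-blob data is synthetic and chosen purely for demonstration:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic data: two well-separated Gaussian blobs (illustrative assumption).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])

# Fit two Gaussian components to the data.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

labels = gmm.predict(X)         # hard assignment: most likely component
probs = gmm.predict_proba(X)    # membership probability per component,
                                # which falls off with distance from each center
```

Note that `predict_proba` exposes exactly the soft, probability-based membership described above, in contrast to the hard assignments of K-Means.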
K-Means Clustering
There are multiple ways to cluster data, but the K-Means algorithm is the most commonly used one. It tries to maximize the similarity within each group while keeping the groups as far apart from each other as possible.
Basically, K-Means runs on distance calculations, using the "Euclidean distance" for this purpose. The Euclidean distance between two given points (x₁, y₁) and (x₂, y₂) is calculated using the following formula:

Euclidean Distance = √((x₂ − x₁)² + (y₂ − y₁)²)
The formula above captures the distance in two-dimensional space, but the same applies in multi-dimensional space as well, with one additional squared term added per extra dimension. The "K" in K-Means represents the number of clusters into which we want our data divided. The basic restriction of the K-Means algorithm is that your data should be continuous in nature; it won't work if the data is categorical.
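The multi-dimensional generalisation can be seen directly with NumPy, where one squared-difference term is summed per coordinate (the points below are arbitrary example values):

```python
import numpy as np

# Two points in 3-dimensional space (illustrative values).
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

# One squared-difference term per dimension, summed, then square-rooted.
distance = np.sqrt(np.sum((a - b) ** 2))

# Equivalent shortcut using NumPy's built-in norm.
same = np.linalg.norm(a - b)
print(distance)  # 5.0
```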
Algorithm
K-Means is an iterative clustering process, which keeps iterating until it converges on a solution (a set of clusters) for our problem space. The following steps describe how K-Means is generally used to cluster data:
1. Start with the number of clusters we want, e.g. 3 in this case. The K-Means algorithm starts the process with random centers in the data and then tries to attach the nearest points to these centers.
2. The algorithm then moves the randomly allocated centers to the means of the created groups.
3. In the next step, the data points are reassigned to these newly created centers.
4. Steps 2 and 3 are repeated until no member changes its group.
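The steps above can be sketched in a minimal from-scratch implementation; this is illustrative only (it ignores edge cases such as empty clusters) and the demo data is synthetic:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-Means sketch following the steps above.

    X: (n_samples, n_features) array; k: number of clusters.
    Returns (centers, labels).
    """
    rng = np.random.default_rng(seed)
    # Step 1: pick k random data points as the initial centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Attach each point to its nearest center (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: move each center to the mean of its assigned group.
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centers (and hence assignments) stop changing.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Demo on two synthetic, well-separated blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)),
               rng.normal(10.0, 0.1, (20, 2))])
centers, labels = kmeans(X, k=2)
```

On this data the loop converges in a couple of iterations, with each blob assigned to its own cluster.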
Advantages
· Fast, robust, and easy to understand.
· Relatively simple to implement.
· Scales to large data sets.
· Guarantees convergence.
· Can warm-start the positions of centroids.
· Easily adapts to new examples.
· Generalizes (with modifications) to clusters of different shapes and sizes, such as elliptical clusters.
Disadvantages
· The learning algorithm requires a prior specification of the number of cluster centers.
· The use of exclusive assignment: if two clusters overlap heavily, K-Means will not be able to resolve that there are two clusters.
· The learning algorithm is not invariant to non-linear transformations, i.e. different representations of the data give different results (data represented in Cartesian coordinates and in polar coordinates will give different results).
· Euclidean distance measures can unequally weight underlying factors.
· The learning algorithm converges to a local optimum of the squared-error function.
· A poor random choice of the initial cluster centers may not lead to a fruitful result.
· Applicable only when the mean is defined, i.e. it fails for categorical data.
· Unable to handle noisy data and outliers.
· The algorithm fails for non-linearly separable data sets.
Example
Below is a simple Python example showing how K-Means is used. We import KMeans from scikit-learn and apply it to some sample data:
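A minimal sketch, assuming scikit-learn is installed; the two-blob data here is synthetic and stands in for whatever dataset you want to cluster:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: two obvious groups (illustrative assumption).
rng = np.random.default_rng(42)
data = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[8.0, 8.0], scale=0.5, size=(50, 2)),
])

# n_clusters is the "K"; n_init=10 reruns the algorithm with
# different random initial centers and keeps the best result.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)

print(kmeans.cluster_centers_)  # one row per cluster center
print(kmeans.labels_[:5])       # cluster index of the first five points
```

Because K-Means converges only to a local optimum, rerunning with several initialisations (the `n_init` parameter) is a common way to mitigate the sensitivity to the random starting centers noted in the disadvantages above.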