Types of Clustering Methods
Shashank Sharma
Network Security: Cisco | Linux, Cloud: Red Hat | Writer | Public Speaker | Content Writing | Open Source Enthusiast
Types of Clustering Methods
The clustering methods are broadly divided into Hard Clustering (each data point belongs to only one group) and Soft Clustering (a data point can also belong to other groups). Beyond this, various other approaches to clustering exist. Below are the main clustering methods used in machine learning:
Partitioning Clustering
It is a type of clustering that divides the data into non-hierarchical groups. It is also known as the centroid-based method. The most common example of partitioning clustering is the K-Means clustering algorithm.
In this type, the dataset is divided into a set of k groups, where K defines the number of pre-defined groups. Each cluster center is chosen so that the distance between the data points and their own cluster centroid is smaller than their distance to any other cluster's centroid. A centroid is the average of all data objects in a partition, while the medoid is the most representative actual point of a cluster. The fundamental requirements of the partitioning-based methods are that each cluster must contain at least one data object, and each data object must belong to exactly one cluster. Various methods in this category of clustering have been developed[3][4]. A distance measure over the feature space is used to quantify the similarity or dissimilarity between data objects.
Some of the well-known methods are k-means and the k-medoids family, which includes:
i) Partitioning Around Medoids (PAM)
ii) Clustering LARge Applications (CLARA)
iii) Clustering Large Applications based upon RANdomized Search (CLARANS)
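The following is a minimal sketch, assuming scikit-learn is available and using a purely synthetic dataset, of how partitioning clustering with K-Means might be run in practice:

```python
# Hypothetical K-Means sketch using scikit-learn on synthetic data.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Generate a toy dataset with 3 well-separated groups (purely illustrative).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)

# K is pre-defined (here K = 3); each point ends up in exactly one cluster.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print("Cluster centroids:\n", kmeans.cluster_centers_)
print("First 10 cluster assignments:", labels[:10])
```

Each fitted centroid is simply the mean of the points assigned to it, matching the description above of a centroid as the average of all data objects in a partition.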
Density-Based Clustering
Density-based clustering connects highly dense regions of the feature space into clusters, so clusters of arbitrary shape can be formed as long as the dense regions stay connected. The algorithm keeps growing a cluster while the density of points in the neighborhood exceeds a threshold, and points lying in sparse regions are treated as noise or outliers. DBSCAN is the best-known example of this approach.
These algorithms can have difficulty clustering the data points when the dataset has varying densities or high dimensionality.
Density-based clustering is controlled by two parameters:
Eps: the maximum radius of the neighborhood around a point.
MinPts: the minimum number of points required within the Eps-neighborhood of a point for that point to be a core point.
Density reachable: A point i is density-reachable from a point j with respect to Eps and MinPts if there is a chain of points i1, …, in with i1 = j and in = i such that each point i(k+1) is directly density-reachable from i(k).
Density connected:
A point i is density-connected to a point j with respect to Eps and MinPts if there is a point o such that both i and j are density-reachable from o with respect to Eps and MinPts.
The DBSCAN algorithm can be abstracted into the following steps:
1. Find all the neighboring points within Eps of each point and identify the core points: those with at least MinPts neighbors.
2. For each core point, if it is not already assigned to a cluster, create a new cluster.
3. Recursively find all of its density-connected points and assign them to the same cluster as the core point.
Points a and b are said to be density connected if there exists a point c that has a sufficient number of points in its neighborhood and both a and b are within Eps distance of it. This is a chaining process: if b is a neighbor of c, c is a neighbor of d, and d is a neighbor of e, which in turn is a neighbor of a, then b and a end up in the same cluster.
4. Iterate through the remaining unvisited points in the dataset; any point that does not belong to a cluster is labeled as noise.
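A minimal sketch of these steps, assuming scikit-learn's DBSCAN implementation and a hypothetical synthetic dataset (the eps and min_samples values are illustrative, not recommendations):

```python
# Hypothetical DBSCAN sketch: eps and min_samples correspond to Eps and MinPts above.
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

db = DBSCAN(eps=0.5, min_samples=5)   # Eps = 0.5, MinPts = 5 (illustrative values)
labels = db.fit_predict(X)

# Points labeled -1 did not fall inside any dense region and are treated as noise.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = list(labels).count(-1)
print("Estimated clusters:", n_clusters)
print("Noise points:", n_noise)
```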
Major Features of Density-Based Clustering
The primary features of density-based clustering are given below:
· It can discover clusters of arbitrary shape rather than only spherical ones.
· It handles noise and outliers naturally, labeling them instead of forcing them into a cluster.
· It does not require the number of clusters to be specified in advance; only the density parameters Eps and MinPts are needed.
Disadvantages of K-Means:
K-Means forms spherical clusters only; the algorithm fails when the clusters are not spherical (i.e., do not have the same variance in all directions).
K-Means is sensitive to outliers; outliers can skew the clusters to a very large extent.
K-Means requires the number of clusters to be specified a priori.
The DBSCAN algorithm overcomes all of the above drawbacks of K-Means: it identifies dense regions by grouping together data points that are close to each other according to a distance measure, without needing the number of clusters in advance.
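To make the contrast concrete, the sketch below (again assuming scikit-learn, with illustrative parameter values) compares K-Means and DBSCAN on two interleaving half-moons, a non-spherical shape that K-Means cannot separate with its straight decision boundary:

```python
# K-Means vs DBSCAN on non-spherical data (illustrative values only).
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# K-Means cuts the plane with a straight boundary and mixes the two moons,
# while DBSCAN follows the dense curved regions and typically recovers each moon.
print("K-Means label counts:", [list(km_labels).count(c) for c in set(km_labels)])
print("DBSCAN label counts:", [list(db_labels).count(c) for c in set(db_labels)])
```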
Distribution Model-Based Clustering
In this approach, clusters are formed by fitting the data to a statistical model: each cluster corresponds to one component distribution, and a data point is assigned according to how likely it is to have been generated by that distribution. The component distributions are most commonly Gaussian (normal). The model assumes a fixed number of distributions and adjusts their parameters so that the likelihood of the observed data is maximized.
Distribution-based clustering produces clusters that rely on a precisely specified mathematical model of the data, which is a strong assumption for some datasets. The approach works well with synthetic data and with clusters of different sizes, but it can run into problems unless constraints are applied to limit model complexity.
In the distribution model-based clustering method, the data is divided based on the probability that each data point belongs to a particular distribution. The grouping is done by assuming a family of distributions, most commonly the Gaussian distribution.
A common example of this type is the Expectation-Maximization (EM) clustering algorithm, which uses Gaussian Mixture Models (GMM).
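The following sketch, assuming scikit-learn and a synthetic dataset, fits a Gaussian Mixture Model with the EM algorithm and reads off both the hard assignments and the underlying component probabilities:

```python
# Hypothetical Gaussian Mixture Model sketch fit with Expectation-Maximization.
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.5, random_state=7)

# Assume 3 Gaussian components; EM alternates E-steps (computing responsibilities)
# and M-steps (re-estimating means, covariances, and mixing weights).
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=7)
gmm.fit(X)

hard_labels = gmm.predict(X)        # most probable component for each point
soft_probs = gmm.predict_proba(X)   # probability of each component for each point

print("Component means:\n", gmm.means_)
print("First point's component probabilities:", soft_probs[0])
```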
Hierarchical Clustering
Hierarchical clustering is another unsupervised machine learning algorithm used to group unlabeled datasets into clusters; it is also known as hierarchical cluster analysis (HCA).
Hierarchical clustering can be used as an alternative to partitioning clustering because there is no requirement to pre-specify the number of clusters. In this technique, the dataset is organized into a tree-like structure of clusters called a dendrogram. Any desired number of clusters can then be obtained by cutting the tree at the appropriate level. The most common example of this method is the Agglomerative Hierarchical Clustering algorithm.
Sometimes the results of K-means clustering and hierarchical clustering may look similar, but the two differ in how they work: hierarchical clustering does not require the number of clusters to be predetermined, as K-Means does.
The hierarchical clustering technique has two approaches:
1. Agglomerative: a bottom-up approach in which each data point starts as its own cluster, and the closest pairs of clusters are merged step by step.
2. Divisive: a top-down approach in which all data points start in a single cluster, which is then split recursively.
Why hierarchical clustering?
We already have clustering algorithms such as K-Means, so why do we need hierarchical clustering? As we have seen, K-Means has some challenges: the number of clusters must be predetermined, and it tends to create clusters of roughly the same size. To address these two challenges, we can opt for hierarchical clustering, because this algorithm does not require prior knowledge of the number of clusters.
Agglomerative Hierarchical clustering
The agglomerative hierarchical clustering algorithm is a popular example of HCA. To group data into clusters, it follows a bottom-up approach: the algorithm treats each data point as a single cluster at the start and then repeatedly merges the closest pair of clusters. It does this until all points have been merged into a single cluster containing the entire dataset.
This hierarchy of clusters is represented in the form of the dendrogram.
How Does Agglomerative Hierarchical Clustering Work?
The working of the AHC algorithm can be explained using the steps below:
Step-1: Treat each data point as a single cluster; if there are N data points, there will initially be N clusters.
Step-2: Take the two closest data points or clusters and merge them to form one cluster, leaving N-1 clusters.
Step-3: Again, take the two closest clusters and merge them to form one cluster, leaving N-2 clusters.
Step-4: Repeat Step 3 until only one cluster is left, containing the entire dataset.
The dendrogram is a tree-like structure that records each merge step performed by the hierarchical clustering algorithm. In the dendrogram plot, the y-axis shows the Euclidean distances at which data points or clusters are merged, and the x-axis shows all the data points of the given dataset.
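A compact sketch of agglomerative clustering and its dendrogram, assuming scikit-learn, SciPy, and Matplotlib are available and using a small synthetic dataset:

```python
# Hypothetical agglomerative clustering sketch with a dendrogram.
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import linkage, dendrogram

X, _ = make_blobs(n_samples=30, centers=3, random_state=1)

# Bottom-up merging: each point starts as its own cluster (Step-1),
# and the closest clusters are merged until the requested number remains.
agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = agg.fit_predict(X)
print("Cluster assignments:", labels)

# The linkage matrix records every merge; the dendrogram visualizes them,
# with merge distances on the y-axis and the data points along the x-axis.
Z = linkage(X, method="ward")
dendrogram(Z)
plt.show()
```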
Fuzzy Clustering
Fuzzy clustering is a type of soft clustering in which a data object may belong to more than one group or cluster. Each data object has a set of membership coefficients that express its degree of membership in each cluster. The Fuzzy C-Means algorithm is the classic example of this type of clustering; it is sometimes also known as the Fuzzy K-Means algorithm.
Comparison to hard clustering
In non-fuzzy clustering (also known as hard clustering), data are divided into distinct clusters, where each data point belongs to exactly one cluster. In fuzzy clustering, data points can potentially belong to multiple clusters. For example, an apple can be red or green (hard clustering), but an apple can also be red AND green (fuzzy clustering): the apple is red to a certain degree and green to a certain degree. Instead of the apple belonging to green [green = 1] and not red [red = 0], it can belong to green [green = 0.5] and red [red = 0.5]. These values are normalized between 0 and 1; however, they do not represent probabilities, so the two values do not need to add up to 1.
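scikit-learn does not ship a fuzzy C-means implementation, so the sketch below is a minimal NumPy version of the standard FCM update equations; the function name, toy data, and parameter values are hypothetical choices for illustration:

```python
# Minimal fuzzy C-means sketch in NumPy (illustrative, not production code).
import numpy as np

def fuzzy_c_means(X, n_clusters=2, m=2.0, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Start from random membership degrees that sum to 1 for each point.
    U = rng.random((X.shape[0], n_clusters))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # Update cluster centers as membership-weighted means.
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        # Update memberships from the distances to every center.
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-10
        U = 1.0 / dist ** (2.0 / (m - 1.0))
        U /= U.sum(axis=1, keepdims=True)
    return centers, U

# Hypothetical toy data: two loose groups along one feature, plus a point in between.
X = np.array([[1.0], [1.2], [0.8], [5.0], [5.3], [4.7], [3.0]])
centers, U = fuzzy_c_means(X, n_clusters=2)
print("Centers:", centers.ravel())
print("Memberships of the in-between point:", U[-1])  # split across both clusters
```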
Advantages:
· A data point can belong to more than one cluster, so overlapping data are handled better than with hard K-Means.
· The membership degrees give a more nuanced picture of how strongly each point is associated with each cluster.
Disadvantages:
· The number of clusters must be specified in advance, and the result depends on the initial membership values, so the algorithm can converge to a local optimum.
· It is computationally more expensive than hard K-Means, since a membership value for every point-cluster pair must be updated at each iteration.
In the real world, you often don't know the "true" parameters, but you do get to observe data. Next, we will explore how we can use data to estimate the model parameters.
It turns out there isn't just one way to estimate the value of parameters. There are two main schools of thought: Maximum Likelihood Estimation (MLE) and Maximum A Posteriori (MAP). Both of these schools of thought assume that your data are independent and identically distributed (IID) samples: X1, X2, …, Xn.
Maximum Likelihood Estimation
Maximum Likelihood Estimation (MLE) is a probabilistic approach to determining values for the parameters of a model. Parameters can be thought of as the blueprint of the model, because the algorithm's behavior is determined by them. MLE is a widely used technique in machine learning, time series analysis, panel data, and discrete data. The aim of MLE is to find the parameter values that maximize the likelihood of producing the observed data.
What is the likelihood?
The likelihood function measures the extent to which the data provide support for different values of the parameter. It indicates how likely it is that a particular parameter value (i.e., a particular population) would produce the observed sample. For example, if we compare the likelihood function at two parameter points and find that the likelihood is greater at the first than at the second, the first parameter value is a more plausible explanation of the data. Both frequentist and Bayesian analyses make use of the likelihood function, but the likelihood function is different from a probability density function.
Difference between likelihood and probability density function
Likelihood describes how plausible different parameter values (distributions) are, given the data that were actually observed, while probability describes the chance of observing particular data, given a fixed distribution. Let's understand the difference between the likelihood and the probability density function with the help of an example.
Consider a dataset containing the weights of customers. Let's say the mean of the data is 70 kg and the standard deviation is 2.5 kg.
When a probability has to be calculated using this dataset, the mean and standard deviation are held constant. For instance, to compute the probability of weight > 70 kg for a random record, the equation involves the weight threshold together with the fixed mean and standard deviation. For the same dataset, if we instead need the probability of weight > 100 kg, only the weight part of the equation changes; the mean and standard deviation stay the same.
But in the case of likelihood, the conditioning flips compared to the probability calculation: the observed data are held fixed, while the mean and standard deviation of the distribution are varied to find the values under which the observed weights are most likely.
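The contrast can be made concrete with a short sketch, assuming SciPy and a small hypothetical sample of observed weights: for probability the parameters stay fixed and the event varies, while for likelihood the data stay fixed and the parameters vary.

```python
# Probability vs likelihood for the customer-weight example (illustrative values).
import numpy as np
from scipy.stats import norm

# Probability: parameters fixed (mean 70, sd 2.5), the event varies.
print("P(weight > 70):", norm(loc=70, scale=2.5).sf(70))    # 0.5
print("P(weight > 100):", norm(loc=70, scale=2.5).sf(100))  # essentially 0

# Likelihood: data fixed (a hypothetical observed sample), the parameters vary.
weights = np.array([68.0, 71.5, 69.2, 72.3, 70.1])
for mu in (65, 70, 75):
    log_lik = norm(loc=mu, scale=2.5).logpdf(weights).sum()
    print(f"log-likelihood at mean={mu}: {log_lik:.2f}")

# The candidate mean with the highest log-likelihood (here 70) is the most
# plausible parameter value for the observed sample.
```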
Working of Maximum Likelihood Estimation
Maximizing the likelihood is the main objective of MLE. Let's understand this with an example: consider a binary classification problem in which we need to classify the data into two categories, 0 or 1, based on a feature called "salary".
MLE computes, for each data point's salary, the probability of belonging to class 1, and uses these probabilities to evaluate the likelihood of the observed labels. The parameters are updated repeatedly until the fitted curve explains the data as well as possible; this process is known as maximization of the likelihood.
In this scenario a threshold of 0.5 is used: if the predicted probability is greater than 0.5, the point is labelled 1, otherwise 0. Let's see how MLE is used for classification in practice.
Maximum likelihood estimation in machine learning
MLE underlies many supervised learning models, one of which is logistic regression. Logistic regression uses the maximum likelihood technique to fit its coefficients. Specific MLE procedures have the advantage that they can exploit the properties of the estimation problem to deliver better efficiency and numerical stability, and they can often provide explicit confidence intervals. The "solver" parameter of a logistic regression implementation is used to select among different optimization strategies for maximizing the likelihood.
The maximum likelihood approach provides a consistent framework for parameter estimation with well-understood mathematical and optimization properties. A hands-on implementation of the concept shows how Maximum Likelihood Estimation works and how it serves as the backbone of logistic regression for classification.
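As an illustration of the above (assuming scikit-learn; the salary values and labels below are made up), logistic regression fits its coefficients by maximizing the likelihood, and the solver parameter chooses how that optimization is carried out:

```python
# Hypothetical logistic regression example: classify 0/1 from a "salary" feature.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up salaries (in thousands) and labels; higher salaries tend toward class 1.
salary = np.array([[20], [25], [30], [35], [40], [55], [60], [65], [70], [80]])
label = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])

# The solver decides how the log-likelihood is maximized (lbfgs is the default).
clf = LogisticRegression(solver="lbfgs")
clf.fit(salary, label)

probs = clf.predict_proba(salary)[:, 1]   # P(class = 1) for each salary
preds = (probs > 0.5).astype(int)         # the 0.5 threshold mentioned above
print("Predicted probabilities:", np.round(probs, 2))
print("Predictions:", preds)
```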
Important Points
"The goal of the maximum likelihood principle is to fit an optimal statistical distribution to some data. This makes the data easier to work with, makes it more general, allows us to see if new data follows the same distribution as the previous data, and lastly, it allows us to classify unlabelled data points."
Maximum a posteriori estimation
In Bayesian statistics, a maximum a posteriori probability (MAP) estimate is an estimate of an unknown quantity that equals the mode of the posterior distribution. The MAP can be used to obtain a point estimate of an unobserved quantity on the basis of empirical data. It is closely related to the method of maximum likelihood (ML) estimation, but employs an augmented optimization objective which incorporates a prior distribution (quantifying the additional information available through prior knowledge of a related event) over the quantity one wants to estimate. MAP estimation can therefore be seen as a regularization of maximum likelihood estimation.
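A tiny worked sketch, with hypothetical flip counts and an assumed Beta prior, shows how the prior shifts the MAP estimate relative to the MLE for a Bernoulli parameter:

```python
# MLE vs MAP for the probability p of heads, with a Beta(alpha, beta) prior.
heads, tails = 7, 3    # hypothetical observed coin flips
alpha, beta = 5, 5     # prior pseudo-counts encoding a belief that p is near 0.5

# MLE: maximize the likelihood alone.
p_mle = heads / (heads + tails)

# MAP: maximize likelihood * prior; with a Beta prior the posterior mode is closed form.
p_map = (heads + alpha - 1) / (heads + tails + alpha + beta - 2)

print("MLE estimate:", p_mle)             # 0.70
print("MAP estimate:", round(p_map, 3))   # ~0.611, pulled toward the prior mean of 0.5
```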
For "MAP", just study it from YouTube, guys; nothing decent is coming up on Google. Also go through MLE vs MAP.
Unit 2
Clustering
Clustering is a type of unsupervised learning method, i.e., a method in which we draw inferences from datasets consisting of input data without labeled responses. It is generally used to find meaningful structure, explanatory underlying processes, generative features, and groupings inherent in a set of examples.
Clustering is the task of dividing the population or data points into a number of groups such that data points in the same group are more similar to each other than to data points in other groups. It is basically a grouping of objects on the basis of the similarity and dissimilarity between them.
For example, data points that are plotted close together can be classified into a single group; in a typical scatter plot of such data, the clusters can be distinguished visually (e.g., three clusters in the original illustration).
Applications of Clustering in Different Fields
· Marketing: It can be used to characterize and discover customer segments for marketing purposes.
· Biology: It can be used for classification among different species of plants and animals.
· Libraries: It is used for clustering books on the basis of topics and information.
· Insurance: It is used to understand customers and their policies and to identify fraud.
· City Planning: It is used to group houses and to study their values based on their geographical locations and other factors.