Types of Clustering Algorithms
Shashank Sharma
K-means clustering algorithm
K-means clustering is the most commonly used clustering algorithm. It's a centroid-based algorithm and the simplest unsupervised learning algorithm.
This algorithm tries to minimize the variance of data points within a cluster. It's also how most people are introduced to unsupervised machine learning.
K-means is best used on smaller data sets because it iterates over all of the data points. That means it takes more time to classify data points when there are a large number of them in the data set. Since this is how k-means clusters data points, it doesn't scale well.
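As a quick illustration, here is a minimal sketch using scikit-learn's KMeans on synthetic data; the generated data set, the number of clusters, and the random seeds are illustrative assumptions, not part of the discussion above.

```python
# Minimal k-means sketch on synthetic data (illustrative values only).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

# n_clusters must be chosen up front; k-means iterates over all points
# on each pass, which is why it slows down on very large data sets.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_)  # one centroid per cluster
print(kmeans.inertia_)          # within-cluster variance being minimized
```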
DBSCAN clustering algorithm
DBSCAN stands for density-based spatial clustering of applications with noise. It's a density-based clustering algorithm, unlike k-means.
This is a good algorithm for finding outliers in a data set. It finds arbitrarily shaped clusters based on the density of data points in different regions. It separates regions by areas of low density so that it can detect outliers between the high-density clusters.
This algorithm is better than k-means when it comes to working with oddly shaped data.
DBSCAN uses two parameters to determine how clusters are defined: minPts (the minimum number of data points that need to be clustered together for an area to be considered high-density) and eps (the distance used to determine if a data point is in the same area as other data points).
Choosing the right initial parameters is critical for this algorithm to work.
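Below is a minimal sketch of DBSCAN with the two parameters discussed above; note that scikit-learn names minPts as min_samples. The data set and parameter values are illustrative assumptions and would need tuning on real data.

```python
# Minimal DBSCAN sketch on oddly shaped data (illustrative values only).
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=400, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5)   # eps and minPts from the text above
labels = db.fit_predict(X)

# Points labeled -1 are treated as noise/outliers rather than forced into a cluster.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters, "| outliers:", np.sum(labels == -1))
```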
Gaussian Mixture Model algorithm
One of the problems with k-means is that the data needs to follow a roughly circular (spherical) shape. The way k-means calculates the distance between data points relies on a circular boundary around each centroid, so non-circular data isn't clustered correctly.

This is an issue that Gaussian mixture models fix. You don't need circular shaped data for them to work well. The Gaussian mixture model uses multiple Gaussian distributions to fit arbitrarily shaped data.

There are several single Gaussian models that act as hidden components in this hybrid model. The model calculates the probability that a data point belongs to a specific Gaussian distribution, and that is the cluster it will fall under.
Or, put another way:
GMMs are a generalization of Gaussian distributions and can be used to represent any data set that can be clustered into multiple Gaussian distributions. The Gaussian mixture model is a probabilistic model that assumes all the data points are generated from a mix of Gaussian distributions with unknown parameters.

A Gaussian mixture model can be used for clustering, which is the task of grouping a set of data points into clusters. GMMs can be used to find clusters in data sets where the clusters may not be clearly defined. Additionally, GMMs can be used to estimate the probability that a new data point belongs to each cluster. Gaussian mixture models are also relatively robust to outliers, meaning that they can still yield accurate results even if some data points do not fit neatly into any of the clusters. This makes GMMs a flexible and powerful tool for clustering data.

A GMM can be understood as a probabilistic model in which a Gaussian distribution is assumed for each group, with a mean and covariance defining its parameters. A GMM therefore consists of two parts: mean vectors (μ) and covariance matrices (Σ). A Gaussian distribution is a continuous probability distribution that takes on a bell-shaped curve; another name for the Gaussian distribution is the normal distribution.
GMM has many applications, such as density estimation, clustering, and image segmentation. For density estimation, GMM can be used to estimate the probability density function of a set of data points. For clustering, GMM can be used to group together data points that come from the same Gaussian distribution. And for image segmentation, GMM can be used to partition an image into different regions.
Gaussian mixture models can be used for a variety of use cases, including identifying customer segments, detecting fraudulent activity, and clustering images. In each of these examples, the Gaussian mixture model is able to identify clusters in the data that may not be immediately obvious. As a result, Gaussian mixture models are a powerful tool for data analysis and should be considered for any clustering task.
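For concreteness, here is a minimal sketch with scikit-learn's GaussianMixture; the synthetic data, the number of components, and the covariance_type are assumptions made for illustration.

```python
# Minimal Gaussian mixture model sketch (illustrative values only).
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=600, centers=3, cluster_std=[1.0, 2.5, 0.5],
                  random_state=7)

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=7)
gmm.fit(X)

hard_labels = gmm.predict(X)        # most likely component for each point
soft_labels = gmm.predict_proba(X)  # probability of each component per point

print(gmm.means_)              # the mean vectors (mu)
print(gmm.covariances_.shape)  # the covariance matrices (Sigma)
```

The soft assignments returned by predict_proba are exactly the "probability that a new data point belongs to each cluster" described above.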
What are the differences between Gaussian mixture models and other types of clustering algorithms such as K-means?
· A Gaussian mixture model is a type of clustering algorithm that assumes the data points are generated from a mixture of Gaussian distributions with unknown parameters. The goal of the algorithm is to estimate the parameters of the Gaussian distributions, as well as the proportion of data points that come from each distribution. In contrast, K-means is a clustering algorithm that does not make any assumptions about the underlying distribution of the data points. Instead, it simply partitions the data points into K clusters, where each cluster is defined by its centroid.
What are the scenarios when Gaussian mixture models can be used?
· Gaussian mixture models can be used in a variety of scenarios, including when data is generated by a mix of Gaussian distributions, when there is uncertainty about the correct number of clusters, and when clusters have different shapes. In each of these cases, the use of a Gaussian mixture model can help to improve the accuracy of results. For example, when data is generated by a mix of Gaussian distributions, using a Gaussian mixture model can help to better identify the underlying patterns in the data. In addition, when there is uncertainty about the correct number of clusters, the use of a Gaussian mixture model can help to reduce the error rate.
What are some real-world examples where Gaussian mixture models can be used?
· Finding patterns in medical datasets: GMMs can be used for segmenting images into multiple categories based on their content or finding specific patterns in medical datasets. They can be used to find clusters of patients with similar symptoms, identify disease subtypes, and even predict outcomes. In one recent study, a Gaussian mixture model was used to analyze a dataset of over 700,000 patient records. The model was able to identify previously unknown patterns in the data, which could lead to better treatment for patients with cancer.
BIRCH
BIRCH (balanced iterative reducing and clustering using hierarchies) is an unsupervised data mining algorithm that performs hierarchical clustering over large data sets. With modifications, it can also be used to accelerate k-means clustering and Gaussian mixture modeling with the expectation-maximization algorithm. An advantage of BIRCH is its ability to incrementally and dynamically cluster incoming, multi-dimensional metric data points to produce the best quality clustering for a given set of resources (memory and time constraints). In most cases, BIRCH only requires a single scan of the database.
Its inventors claim BIRCH to be the "first clustering algorithm proposed in the database area to handle 'noise' (data points that are not part of the underlying pattern) effectively", beating DBSCAN by two months. The BIRCH algorithm received the SIGMOD 10 year test of time award in 2006.
Basic clustering algorithms like k-means and agglomerative clustering are the most commonly used clustering algorithms. But when performing clustering on very large datasets, BIRCH and DBSCAN are advanced clustering algorithms that are useful for performing precise clustering on large datasets. Moreover, BIRCH is very useful because of its easy implementation. BIRCH does not cluster the dataset directly; it first condenses the data into small summaries, and then the summaries themselves are clustered. That is why BIRCH is often used with other clustering algorithms: after the summary is built, it can be clustered by another clustering algorithm.
It is provided as an alternative to MiniBatchKMeans. It converts the data into a tree data structure, with the centroids being read off the leaves. These centroids can be the final cluster centroids or the input for other clustering algorithms such as Agglomerative Clustering.
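The sketch below uses scikit-learn's Birch and passes an AgglomerativeClustering instance as the global clustering step, mirroring the "summarize first, then cluster the summaries" pattern described above; the thresholds and synthetic data are illustrative assumptions.

```python
# Minimal BIRCH sketch: CF-tree summary followed by a global clustering step.
from sklearn.cluster import AgglomerativeClustering, Birch
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=2000, centers=5, random_state=3)

birch = Birch(threshold=0.5, branching_factor=50,
              n_clusters=AgglomerativeClustering(n_clusters=5))
labels = birch.fit_predict(X)

# subcluster_centers_ are the CF-tree leaf centroids that the global
# clustering step (here, agglomerative clustering) actually operates on.
print(birch.subcluster_centers_.shape)
```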
Problems with Previous Clustering Algorithms
Previous clustering algorithms performed less effectively over very large databases and did not adequately consider the case where a dataset was too large to fit in main memory. Furthermore, most of BIRCH's predecessors inspected all data points (or all currently existing clusters) equally for each clustering decision; they did not perform heuristic weighting based on the distance between these data points. As a result, there was a lot of overhead in maintaining high clustering quality while minimizing the cost of additional I/O (input/output) operations.
Stages of BIRCH
BIRCH is often used to complement other clustering algorithms by creating a summary of the dataset that the other clustering algorithm can then use. However, BIRCH has one major drawback: it can only process metric attributes. A metric attribute is an attribute whose values can be represented in Euclidean space, i.e., no categorical attributes should be present. The BIRCH clustering algorithm consists of two stages: first, building the CF tree, which condenses the data into compact in-memory summaries, and second, global clustering, which applies an existing clustering algorithm to the leaf entries of the CF tree.
Due to this two-step process, BIRCH is also called Two-Step Clustering.
Algorithm
The BIRCH algorithm builds a tree structure over the given data called the Clustering Feature tree (CF tree). The algorithm is based on this CF (clustering feature) tree; in other words, it uses a tree-structured summary to create clusters.

In the CF tree, the algorithm compresses the data into sets of CF nodes. Each CF node holds several sub-clusters, called CF subclusters, and these CF subclusters sit in the non-terminal (internal) CF nodes.

The CF tree is a height-balanced tree that gathers and manages clustering features and holds the necessary information about the given data for further hierarchical clustering. This prevents the need to work with the whole data set given as input. Each cluster of data points is summarized as a CF, which is represented by three numbers: (N, LS, SS).
The BIRCH algorithm proceeds in four main phases. Two of them (condensing/resizing the data and refining the clusters) are optional; they come into play when more clarity is required. Scanning the data is essentially loading the data into the model: the algorithm scans the whole data set and fits it into CF trees.

In condensing, it resets and resizes the data for a better fit into the CF tree. In global clustering, it sends the CF trees for clustering using an existing clustering algorithm. Finally, refining fixes the problem of CF trees where points with the same values are assigned to different leaf nodes.
Cluster Features
BIRCH clustering achieves its high efficiency by clever use of a small set of summary statistics to represent a larger set of data points. These summary statistics constitute a CF and represent a sufficient substitute for the actual data for clustering purposes.
A CF is a set of three summary statistics representing a set of data points in a single cluster. These statistics are as follows:

· N — the number of data points in the cluster.
· LS — the linear sum of the data points (a vector of per-dimension sums), from which the centroid can be computed as LS/N.
· SS — the squared sum of the data points, which together with N and LS yields quantities such as the cluster's radius and diameter.
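The following small sketch (not a BIRCH implementation, just an illustration with assumed helper names) shows why the CF triple (N, LS, SS) is a sufficient substitute for the raw points: two CFs can be merged by simple addition, and summaries like the centroid fall out of the three numbers.

```python
# Sketch of the CF triple (N, LS, SS) and its additive merge property.
import numpy as np

def make_cf(points):
    points = np.asarray(points, dtype=float)
    return points.shape[0], points.sum(axis=0), (points ** 2).sum()

def merge_cf(cf1, cf2):
    # CFs merge by component-wise addition, so clusters can be combined
    # without revisiting the original data points.
    return cf1[0] + cf2[0], cf1[1] + cf2[1], cf1[2] + cf2[2]

def centroid(cf):
    n, ls, _ = cf
    return ls / n

cf_a = make_cf([[1.0, 2.0], [2.0, 2.0]])
cf_b = make_cf([[8.0, 9.0]])
print(centroid(merge_cf(cf_a, cf_b)))  # same as the centroid of all three points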
CF Tree
The building process of the CF Tree can be summarized in the following steps, such as:
Step 1:?For each given record, BIRCH compares the location of that record with the location of each CF in the root node, using either the linear sum or the mean of the CF. BIRCH passes the incoming record to the root node CF closest to the incoming record.
Step 2:?The record then descends down to the non-leaf child nodes of the root node CF selected in step 1. BIRCH compares the location of the record with the location of each non-leaf CF. BIRCH passes the incoming record to the non-leaf node CF closest to the incoming record.
Step 3:?The record then descends down to the leaf child nodes of the non-leaf node CF selected in step 2. BIRCH compares the location of the record with the location of each leaf. BIRCH tentatively passes the incoming record to the leaf closest to the incoming record.
Step 4: Perform one of the following, (i) or (ii):
(i) If the chosen leaf can absorb the incoming record while staying within the threshold, the record is assigned to that leaf, and the leaf CF and all of its parent CFs are updated.
(ii) Otherwise, a new leaf CF is created for the record (splitting the leaf node if necessary), and the parent CFs are updated accordingly.
Advantages of BIRCH
It is local in that each clustering decision is made without scanning all data points and existing clusters. It exploits the observation that the data space is not usually uniformly occupied, and not every data point is equally important.
It uses available memory to derive the finest possible sub-clusters while minimizing I/O costs. It is also an incremental method that does not require the whole data set in advance.
CURE (Clustering Using Representatives)
· It is a hierarchical clustering technique that adopts a middle ground between the centroid-based and the all-points extremes. Hierarchical clustering is a type of clustering that starts with single-point clusters and keeps merging clusters until the desired number of clusters is formed.
· It can identify both spherical and non-spherical clusters.
· It is useful for discovering groups and identifying interesting distributions in the underlying data.
· Instead of using a single centroid point, as in most data mining algorithms, CURE uses a set of well-scattered representative points to efficiently handle the clusters and eliminate outliers.
There are six steps in the CURE algorithm, described below.
The Procedure of CURE algorithm
The various stages of the CURE algorithm are described as follows:
§ Draw a random sample s: to handle a large dataset, random sampling is done. It reduces the execution time and works around limited main memory. It also filters out outliers, which improves the quality of the clusters.
§ Partition the sample into p partitions: the size of each partition is s/p.
§ Partially cluster each partition until the number of clusters in each partition reaches s/pq.
§ Eliminate the outliers: outliers form very small clusters, so they are easy to spot, identify, and eliminate.
§ Cluster the partial clusters to form the required number of clusters.
§ Label the data on disk: the remaining data points that were excluded from the random sample are assigned to the cluster whose representative point is closest to them.

To handle large datasets, CURE uses two techniques: sampling and partitioning. A random sample of size s is used to overcome the shortage of main memory; instead of the whole dataset, a sample that fits in memory is selected for partial clustering. The sample is partitioned for faster execution of the algorithm, with the sample space split into p partitions of size s/p each. The algorithm uses two data structures: a k-d tree and a heap. The representative points are stored in the k-d tree, and the heap stores the data points, one from each cluster.
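The sketch below illustrates only CURE's representative-point idea, not the full algorithm: pick well-scattered points from a cluster and shrink them toward the centroid by a factor alpha. The function name, n_reps, and alpha are assumptions chosen for illustration.

```python
# Rough sketch of CURE-style representative points (illustration only).
import numpy as np

def representatives(cluster_points, n_reps=4, alpha=0.3):
    pts = np.asarray(cluster_points, dtype=float)
    centroid = pts.mean(axis=0)
    # First representative: the point farthest from the centroid.
    reps = [pts[np.argmax(np.linalg.norm(pts - centroid, axis=1))]]
    while len(reps) < min(n_reps, len(pts)):
        # Next representative: the point farthest from those chosen so far.
        dists = np.min(
            np.linalg.norm(pts[:, None, :] - np.array(reps)[None, :, :], axis=2),
            axis=1,
        )
        reps.append(pts[np.argmax(dists)])
    # Shrink each representative toward the centroid to damp the effect of outliers.
    return np.array([r + alpha * (centroid - r) for r in reps])

cluster = np.random.RandomState(0).normal(size=(50, 2))
print(representatives(cluster))
```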
EM Algorithm
The EM (Expectation-Maximization) algorithm is one of the most commonly used techniques in machine learning for obtaining maximum likelihood estimates of variables that are sometimes observable and sometimes not. It is applicable to unobserved, or latent, data. It has various real-world applications in statistics, including obtaining the mode of the posterior marginal distribution of parameters in machine learning and data mining applications.
In most real-life applications of machine learning, many relevant features are available, but only a few of them are observable and the rest are unobservable. If a variable is observable, its value can be predicted from the training instances. For variables that are latent, i.e., not directly observable, the Expectation-Maximization (EM) algorithm plays a vital role in estimating their values, provided the general form of the probability distribution governing those latent variables is known. In this topic, we will discuss a basic introduction to the EM algorithm, a flow chart of the EM algorithm, its applications, and the advantages and disadvantages of the EM algorithm.
What is an EM algorithm?
The Expectation-Maximization (EM) algorithm is an iterative method used to determine local maximum likelihood estimates (MLE) or maximum a posteriori (MAP) estimates for unobservable variables in statistical models. In other words, it is a technique for finding maximum likelihood estimates when latent variables are present, and it is the standard tool for fitting latent variable models.
A latent variable model consists of both observable and unobservable variables where observable can be predicted while unobserved are inferred from the observed variable. These unobservable variables are known as latent variables.
Key Points:

· EM finds (local) maximum likelihood or maximum a posteriori estimates when some of the variables are latent.
· It alternates between an expectation step and a maximization step until the estimates converge.
EM Algorithm
The EM algorithm underlies various unsupervised ML algorithms, such as the k-means clustering algorithm. Being an iterative approach, it alternates between two modes. In the first mode, we estimate the missing or latent variables; hence it is referred to as the expectation/estimation step (E-step). The other mode optimizes the parameters of the model so that it can explain the data more clearly; it is known as the maximization step (M-step).
The primary goal of the EM algorithm is to use the available observed data of the dataset to estimate the missing data of the latent variables and then use that data to update the values of the parameters in the M-step.
What is Convergence in the EM algorithm?
Convergence here has the usual intuitive meaning from probability: if two random variables have a very small difference in their probability, they are said to have converged. In other words, whenever the values of the estimates stop changing from one iteration to the next, we say the algorithm has converged.
Steps in EM Algorithm
The EM algorithm is completed mainly in four steps: the initialization step, the expectation step, the maximization step, and the convergence step. These steps are explained as follows:

· Initialization step: initialize the parameter values with rough guesses (for example, random means and variances).
· Expectation step (E-step): using the current parameters, estimate the missing or latent variables (compute the expected assignments, or responsibilities).
· Maximization step (M-step): update the parameters so that they maximize the expected log-likelihood computed in the E-step.
· Convergence step: check whether the estimates have stopped changing; if not, repeat the E-step and M-step.
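A compact sketch of EM for a one-dimensional mixture of two Gaussians follows the four steps above (initialize, E-step, M-step, check convergence). The data, starting values, and tolerance are illustrative assumptions.

```python
# EM sketch for a 1-D two-component Gaussian mixture (illustrative values only).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0, 1, 300), rng.normal(5, 1.5, 200)])

# 1) Initialization: rough guesses for weights, means, standard deviations.
pi, mu, sigma = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

prev_ll = -np.inf
for _ in range(200):
    # 2) E-step: responsibility of each component for each point.
    dens = np.stack([p * norm.pdf(data, m, s) for p, m, s in zip(pi, mu, sigma)])
    resp = dens / dens.sum(axis=0)

    # 3) M-step: re-estimate weights, means, and standard deviations.
    nk = resp.sum(axis=1)
    pi = nk / len(data)
    mu = (resp * data).sum(axis=1) / nk
    sigma = np.sqrt((resp * (data - mu[:, None]) ** 2).sum(axis=1) / nk)

    # 4) Convergence: stop when the log-likelihood barely changes.
    ll = np.log(dens.sum(axis=0)).sum()
    if abs(ll - prev_ll) < 1e-6:
        break
    prev_ll = ll

print(pi, mu, sigma)
```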
Advantages of EM algorithm
· The E-step and M-step are usually straightforward to implement for many problems.
· The likelihood is guaranteed not to decrease after each iteration.
· The M-step often has a closed-form solution.

Disadvantages of EM algorithm
· Convergence can be very slow.
· It converges only to a local optimum, so the result depends on the initialization.
Parameter Estimation
We have learned many different distributions for random variables and all of those distributions had parameters: the numbers that you provide as input when you define a random variable. So far when we were working with random variables, we either were explicitly told the values of the parameters, or, we could divine the values by understanding the process that was generating the random variables.
What if we don't know the values of the parameters and we can't estimate them from our own expert knowledge? What if instead of knowing the random variables, we have a lot of examples of data generated with the same underlying distribution? In this chapter we are going to learn formal ways of estimating parameters from data.
These ideas are critical for artificial intelligence. Almost all modern machine learning algorithms work like this: (1) specify a probabilistic model that has parameters. (2) Learn the value of those parameters from data.
Before we dive into parameter estimation, first let's revisit the concept of parameters. Given a model, the parameters are the numbers that yield the actual distribution. In the case of a Bernoulli random variable, the single parameter was the value p. In the case of a Uniform random variable, the parameters are the?a?and?b?values that define the min and max value. Here is a list of random variables and the corresponding parameters. From now on, we are going to use the notation?θ?to be a vector of all the parameters:
Distribution      Parameters
Bernoulli(p)      θ = p
Poisson(λ)        θ = λ
Uniform(a, b)     θ = [a, b]
In the real world often you don't know the "true" parameters, but you get to observe data. Next up, we will explore how we can use data to estimate the model parameters.
It turns out there isn't just one way to estimate the value of parameters. There are two main schools of thought: Maximum Likelihood Estimation (MLE) and Maximum A Posteriori (MAP). Both of these schools of thought assume that your data are independent and identically distributed (IID) samples: X1, X2, …, Xn.
Maximum Likelihood Estimation
Maximum Likelihood Estimation (MLE) is a probability-based approach to determining values for the parameters of a model. Parameters can be thought of as the blueprint of the model, because the algorithm's behavior depends on them. MLE is a widely used technique in machine learning, time series, panel data, and discrete data. The motive of MLE is to find the parameter values that maximize the likelihood of producing the observed outcomes.
What is the likelihood?
The likelihood function measures the extent to which the data provide support for different values of the parameter. It indicates how likely it is that a particular population would produce the observed sample. For example, if we compare the likelihood function at two parameter points and find that the likelihood is greater for the first parameter than for the second, the first parameter can be interpreted as a more plausible value. Both frequentist and Bayesian analyses make use of the likelihood function. The likelihood function is different from the probability density function.
Difference between likelihood and probability density function
Likelihood describes how well a particular parameter value (or distribution) explains the observed data, while probability describes the chance of observing some outcome given a fixed distribution of the data. Let's understand the difference between the likelihood and the probability density function with the help of an example.
Consider a dataset containing the weight of the customers. Let’s say the mean of the data is 70 & the standard deviation is 2.5.
When a probability has to be calculated for any situation using this dataset, the mean and standard deviation of the dataset are held constant. Say the probability of weight > 70 kg has to be calculated for a random record in the dataset; then the equation will contain the weight, the mean, and the standard deviation. Considering the same dataset, if we now need to calculate the probability of weight > 100 kg, only the weight part of the equation changes and the rest stays the same.
But in the case of likelihood, the roles in the conditional probability flip compared to the probability calculation: the mean and standard deviation are no longer fixed but are varied to find the values that maximize the likelihood of the observed weights (for example, those greater than 70 kg).
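A small illustration of this distinction, assuming the weights are normally distributed with the mean 70 and standard deviation 2.5 from the example above (the sample values below are made up for illustration):

```python
# Probability vs. likelihood under a normal model (illustrative values only).
import numpy as np
from scipy.stats import norm

# Probability: parameters fixed, ask about the data.
p_over_70 = 1 - norm.cdf(70, loc=70, scale=2.5)
print("P(weight > 70 | mu=70, sigma=2.5) =", p_over_70)  # 0.5

# Likelihood: data fixed, ask which parameter values explain it best.
weights = np.array([68.0, 71.5, 69.2, 73.0, 70.4])
for mu in (65, 70, 75):
    log_lik = norm.logpdf(weights, loc=mu, scale=2.5).sum()
    print(f"log-likelihood of the sample under mu={mu}: {log_lik:.2f}")

# For a normal model with known sigma, the MLE of mu is simply the sample mean.
print("MLE of mu:", weights.mean())
```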
Working of Maximum Likelihood Estimation
The maximization of the likelihood is the main objective of MLE. Let's understand this with an example. Consider a binary classification problem in which we need to classify the data into two categories, either 0 or 1, based on a feature called "salary".
MLE will calculate a probability for each data point in salary and then use those probabilities to compute the likelihood of classifying the data points as either 0 or 1. It repeats this process until the fitted model maximizes the likelihood. This process is known as maximization of the likelihood.
In this scenario there is a threshold of 0.5: if the estimated probability for a point comes out greater than the threshold, it is labelled as 1, otherwise 0. Let's see how MLE can be used for classification.
Maximum likelihood estimation in machine learning
MLE is the basis of many supervised learning models, one of which is logistic regression. Logistic regression uses the maximum likelihood technique to classify the data. Let's see how logistic regression uses MLE. Specific MLE procedures have the advantage that they can exploit the properties of the estimation problem to deliver better efficiency and numerical stability, and these methods can often produce explicit confidence intervals. The "solver" parameter of logistic regression selects among different solving strategies, i.e., different ways of carrying out the MLE optimization.
The maximum likelihood approach provides a consistent approach to parameter estimation along with useful mathematical and optimization properties, and it serves as the backbone of logistic regression for classification.
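Here is a minimal sketch of logistic regression, whose coefficients are fit by maximizing the (log-)likelihood; the solver parameter picks the optimization strategy. The synthetic data and the particular solvers compared are illustrative choices.

```python
# Logistic regression with different solvers (illustrative values only).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=5, random_state=1)

for solver in ("lbfgs", "liblinear", "saga"):
    clf = LogisticRegression(solver=solver, max_iter=1000)
    clf.fit(X, y)  # fitting maximizes the log-likelihood of the labels
    # predict_proba returns the modeled probabilities; the 0.5 threshold
    # mentioned above is what predict() applies to label points 0 or 1.
    print(solver, clf.score(X, y))
```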
Important Points
“The goal of the maximum likelihood principle is to fit an optimal statistical distribution to some data. This makes the data easier to work with, makes it more general, allows us to see if new data follows the same distribution as the previous data, and lastly, it allows us to classify unlabelled data points.”
Maximum a posteriori estimation
In Bayesian statistics, a maximum a posteriori probability (MAP) estimate is an estimate of an unknown quantity that equals the mode of the posterior distribution. The MAP can be used to obtain a point estimate of an unobserved quantity on the basis of empirical data. It is closely related to the method of maximum likelihood (ML) estimation, but employs an augmented optimization objective which incorporates a prior distribution (quantifying the additional information available through prior knowledge of a related event) over the quantity one wants to estimate. MAP estimation can therefore be seen as a regularization of maximum likelihood estimation.
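A toy sketch contrasting MLE and MAP for a coin's heads probability, using a Beta prior; the coin-flip counts and the prior parameters are assumptions for illustration. The MAP estimate is the mode of the posterior, which here pulls the estimate toward the prior, acting as the regularization described above.

```python
# MLE vs. MAP for a Bernoulli parameter with a Beta prior (illustrative values).
heads, tails = 7, 3
alpha, beta = 2.0, 2.0  # Beta(2, 2) prior, regularizing toward 0.5

mle = heads / (heads + tails)
# Posterior is Beta(alpha + heads, beta + tails); its mode is the MAP estimate.
map_estimate = (heads + alpha - 1) / (heads + tails + alpha + beta - 2)

print("MLE:", mle)            # 0.70
print("MAP:", map_estimate)   # ~0.667, pulled toward the prior's mode of 0.5
```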