Data Mining with Cluster Analysis

Cluster analysis is a statistical technique and a data mining method that divides unlabeled data points into clusters by identifying patterns and relationships within the dataset. Areas that rely on cluster analysis to gain insight from data include customer segmentation, credit risk management, fraud detection, spam filtering, intrusion detection, and others.

Methodologies

Data are grouped into subsets so that the members of each subset share a common pattern, generally with respect to some measure defined as a distance. In other words, if you have n objects, each measured on m variables, the goal is to group the objects into c classes.

This technique can be applied to two problems:

· Data partitioning: there is a set of data, and we want to group it into subsets that are homogeneous within each subset and heterogeneous between subsets.

· Classification of variables: variables are grouped in order to reduce their number.

We start from an n x m matrix (n is the number of objects and m the number of variables), or an n x n matrix containing the distances between objects.

Let Ξ be a sample of n objects, where our goal is to generate c groups or regions, so that we end up with clusters C1, C2, ..., Cc.
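In standard notation, the usual requirement is that the clusters form a partition of the sample:

C_1 \cup C_2 \cup \dots \cup C_c = \Xi, \qquad C_i \cap C_j = \emptyset \ \text{ for } i \neq j, \qquad C_i \neq \emptyset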

The data matrix provides the values of each variable for each object.
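A generic sketch of this layout, with one row per individual and one column per variable:

\mathbf{X} = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1m} \\ x_{21} & x_{22} & \cdots & x_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nm} \end{pmatrix}

where x_{ji} is the value of the ith variable measured on the jth individual.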

The jth row of the matrix X contains the values of each variable for the jth individual, while the ith column shows the values of the ith variable across all individuals in the sample. Techniques that group individuals are carried out on this matrix, while those that group variables are applied to its transpose. These techniques are necessary because an exhaustive search over all possible partitions is usually too expensive: the number of partitions is given by a Stirling number of the second kind.

If the number of clusters c is known beforehand, this Stirling number counts the possible partitions of the n observations.
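In its standard closed form,

S(n, c) = \frac{1}{c!} \sum_{k=0}^{c} (-1)^{k} \binom{c}{k} (c - k)^{n}

which grows extremely fast: for example, S(10, 3) = 9330, so even ten observations admit thousands of three-group partitions.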

Association Measures

Since cluster analysis groups data based on distances and similarities, we need to define the measures and conditions used to group objects:

The greater the distance, the less similar the individuals are. The higher the proximity value, the more similar the individuals are.

Gower shows that it is possible to go from similarities to distances as long as the matrix of similarity coefficients is positive semi-definite.
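A common form of this transformation is

d_{ij} = \sqrt{s_{ii} + s_{jj} - 2\, s_{ij}}

which reduces to d_{ij} = \sqrt{2 (1 - s_{ij})} when the self-similarities equal 1; positive semi-definiteness of the similarity matrix guarantees that these distances can be embedded in a Euclidean space.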

Binary Variables

The most common type of multivariate categorical data is binary data, where variables indicate the presence or absence of a characteristic or condition (generally represented by 1 and 0, respectively). Many similarity measures have been proposed for such data. For a pair of individuals i and j, four counts can be defined:

a represents the number of variables for which individuals i and j simultaneously take the value 1.

b represents the number of variables for which individual i takes the value 1 and individual j takes the value 0.

c represents the number of variables for which individual i takes the value 0 and individual j takes the value 1.

d represents the number of variables for which individuals i and j simultaneously take the value 0.

Similarity Measures Formula
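As a minimal illustration (the vectors below are made up), the a, b, c, d counts and two widely used coefficients, the simple matching coefficient (a + d)/(a + b + c + d) and the Jaccard coefficient a/(a + b + c), can be computed as follows:

import numpy as np

x_i = np.array([1, 0, 1, 1, 0, 0, 1])  # hypothetical binary profile of individual i
x_j = np.array([1, 1, 1, 0, 0, 0, 0])  # hypothetical binary profile of individual j

a = np.sum((x_i == 1) & (x_j == 1))    # both present
b = np.sum((x_i == 1) & (x_j == 0))    # present in i only
c = np.sum((x_i == 0) & (x_j == 1))    # present in j only
d = np.sum((x_i == 0) & (x_j == 0))    # both absent

simple_matching = (a + d) / (a + b + c + d)  # treats joint absences as agreement
jaccard = a / (a + b + c)                    # ignores joint absences

print(a, b, c, d, simple_matching, jaccard)

The choice between them matters when joint absences carry little information (for instance, rare attributes), in which case the Jaccard coefficient is usually preferred.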

Case of Mixed Variables

Gower (1971) proposed that, instead of computing a separate measure for each variable type and combining them afterwards, a single coefficient be defined directly for mixed data.
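One common way to write Gower's coefficient for m mixed variables is

s_{ij} = \frac{\sum_{k=1}^{m} w_{ijk}\, s_{ijk}}{\sum_{k=1}^{m} w_{ijk}}

where s_{ijk} is the similarity contribution of variable k (for a quantitative variable, 1 - |x_{ik} - x_{jk}| / R_k with R_k the range of that variable; for binary or categorical variables, 1 if the two individuals agree and 0 otherwise) and w_{ijk} equals 1 when the comparison is valid (both values observed) and 0 otherwise. The coefficient is then converted to a distance, typically as d_{ij} = 1 - s_{ij} or d_{ij} = \sqrt{1 - s_{ij}}.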

Type of Clusters

There are hierarchical and non-hierarchical methods.

Hierarchical methods

The objective of these methods is to merge clusters to form a new one, or to split an existing cluster into two, so that a distance function is minimized or some similarity measure is maximized. Hierarchical methods can be divided into agglomerative and divisive.

Agglomerative hierarchical clustering starts at the individual leaves and successively merges clusters together; it is a bottom-up approach. Divisive hierarchical clustering starts at the root and recursively splits the clusters; it is a top-down approach.

In agglomerative hierarchical clustering, the first step has as many clusters as there are points, and clusters are gradually merged until all points end up in a single cluster. At each step, the clusters that are closest together are merged.

To measure how close points are, we calculate the distance between two points using one of these metrics: Euclidean distance, Minkowski distance, Manhattan or city-block distance, Chebyshev or maximum distance, and Mahalanobis distance.
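A minimal sketch of these metrics using SciPy (the two points and the data used to estimate a covariance matrix are made up):

import numpy as np
from scipy.spatial import distance

u = np.array([1.0, 2.0, 3.0])
v = np.array([2.0, 0.0, 4.0])

d_euclidean = distance.euclidean(u, v)
d_minkowski = distance.minkowski(u, v, p=3)  # p=2 gives Euclidean, p=1 gives Manhattan
d_manhattan = distance.cityblock(u, v)       # Manhattan / city-block distance
d_chebyshev = distance.chebyshev(u, v)       # maximum coordinate difference

# Mahalanobis distance needs the inverse covariance matrix of the data.
X = np.random.default_rng(0).normal(size=(50, 3))  # hypothetical data, only to estimate a covariance
VI = np.linalg.inv(np.cov(X, rowvar=False))
d_mahalanobis = distance.mahalanobis(u, v, VI)

print(d_euclidean, d_minkowski, d_manhattan, d_chebyshev, d_mahalanobis)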

We also need to link the clusters using a linkage method, such as single (nearest-neighbour), complete (furthest-neighbour), or average linkage.

Other methods are centroid-based methods, Ward's method, and divisive or dissociative methods.
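A minimal agglomerative clustering sketch with SciPy (synthetic data; Ward linkage is just one of the options mentioned above):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, size=(20, 2)),
               rng.normal(5, 1, size=(20, 2))])  # two artificial groups of points

Z = linkage(X, method="ward")                    # also try "single", "complete" or "average"
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
# dendrogram(Z)  # with matplotlib installed, this draws the merge history
print(labels)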

Non-hierarchical methods

In these methods, observations are iteratively assigned to clusters until an equilibrium is reached. In many cases, the number of clusters is known beforehand. The differences from hierarchical methods are: individuals can be moved between clusters, so an individual misclassified at an earlier stage can be reclassified later; and if the data structure is known in advance, there is no problem in choosing the number of clusters. If it is not known, you can try different partitions or first use a hierarchical technique to help decide the number of clusters.

Types of non-hierarchical methods (a short code sketch follows this list):

· Relocation or partition methods:

a) K-Means

b) K-Medoids

· Density methods:

a) Based on topological density: Mean Shift and DBSCAN

b) Based on probability density: EM-GMM
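A minimal sketch of the partition and density methods above, using scikit-learn (the data and parameter values are illustrative only):

import numpy as np
from sklearn.cluster import KMeans, DBSCAN, MeanShift
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)),
               rng.normal(4, 0.5, size=(50, 2))])

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.7, min_samples=5).fit_predict(X)  # density-based; finds the number of clusters itself
meanshift_labels = MeanShift().fit_predict(X)                  # density-based (mode seeking)
gmm_labels = GaussianMixture(n_components=2, random_state=0).fit_predict(X)  # EM on a Gaussian mixture

# K-Medoids is not part of scikit-learn itself; the scikit-learn-extra package provides a KMedoids estimator.
print(kmeans_labels[:5], dbscan_labels[:5], meanshift_labels[:5], gmm_labels[:5])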

Cluster evaluation statistics

It is important to evaluate the quality of the results from different algorithms and to determine the optimal number of clusters. Internal validation measures generally reflect compactness (how close objects are within the same cluster) and separation of the group partitions (how well separated a cluster is from the other clusters).

Generally, most indexes used for internal cluster validation combine compactness (C) and separation (S) measures, such as the following
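A common generic form, with weighting constants a and b, is

\text{Index} = \frac{a \cdot S}{b \cdot C}

so that greater separation between clusters and smaller within-cluster dispersion both increase the index.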

Below are some of the most commonly used validation coefficients:

Silhouette coefficient by Rousseeuw (1987)
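Rousseeuw's coefficient for observation i can be written as

S(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}

where a(i) is the average dissimilarity of i to the other members of its own cluster and b(i) is the smallest average dissimilarity of i to the members of any other cluster.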

The maximum index value is used to determine the optimal number of groups in the data for hierarchical methods.

Observations with a large value of S(i) are well clustered, a small S(i) (close to 0) indicates that the observation lies between two groups, and observations with a negative S(i) are probably placed in the wrong group.

Gap Statistic Coefficient: published by R. Tibshirani, G. Walther, and T. Hastie (Stanford University, 2001). The basic idea is to compare the change in within-cluster dispersion with that expected under a reference distribution as the number of clusters increases.
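In the authors' notation, the statistic is

\mathrm{Gap}_n(k) = E_n^{*}[\log W_k] - \log W_k

where W_k is the pooled within-cluster dispersion for k clusters and the expectation is taken under the reference distribution (typically estimated by Monte Carlo sampling from a uniform reference); the usual rule selects the smallest k such that \mathrm{Gap}(k) \geq \mathrm{Gap}(k+1) - s_{k+1}, with s_{k+1} the simulation standard error.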

Dunn Index: Dunn (1974) defines the relationship between the minimum inter-cluster distance and the maximum intra-cluster distance.
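A standard way to write it is

D = \frac{\min_{i \neq j} \delta(C_i, C_j)}{\max_{k} \Delta(C_k)}

where \delta(C_i, C_j) is the distance between clusters i and j and \Delta(C_k) is the diameter (maximum intra-cluster distance) of cluster k; larger values indicate compact, well-separated clusters.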

Calinski-Harabasz (CH) Index
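The CH index is the ratio of between-cluster to within-cluster dispersion, corrected for the numbers of clusters and observations; higher values indicate better-defined partitions. A minimal sketch (synthetic data, illustrative range of k) that computes the silhouette and CH scores with scikit-learn:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)),
               rng.normal(4, 0.5, size=(50, 2)),
               rng.normal(8, 0.5, size=(50, 2))])

for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    sil = silhouette_score(X, labels)        # higher is better, bounded by [-1, 1]
    ch = calinski_harabasz_score(X, labels)  # ratio of between- to within-cluster dispersion
    print(k, round(sil, 3), round(ch, 1))

Both scores should peak around k = 3 for this artificial dataset, illustrating how they can suggest the number of clusters.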

Agglomerative Coefficient: a global clustering measure associated with hierarchical agglomerative methods. It does not assess the adequacy of the chosen number of clusters, but it allows comparing dendrogram structures, that is, dendrograms produced by different hierarchical methods.

If for each observation j we define d_j as the quotient between its dissimilarity to the first cluster it is merged with and the dissimilarity of the final merge of the algorithm, the agglomerative coefficient is the average of the values 1 - d_j. The closer it is to 1, the stronger the structure of the dendrogram obtained. It is also the average width of the banner plot (or the filled proportion of that chart). The agglomerative coefficient grows with the number of observations, so it should not be used to compare datasets with very different sample sizes.
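A compact way to write this coefficient for n observations is

AC = \frac{1}{n} \sum_{j=1}^{n} (1 - d_j)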

