Dimensionality Reduction
Curse Of Dimensionality
The curse of dimensionality refers to phenomena that arise when analyzing and organizing data in high-dimensional spaces (often with hundreds or thousands of dimensions) and that do not occur in low-dimensional settings such as the three-dimensional physical space of everyday experience.
Dimensionality Reduction
In machine learning and statistics, dimensionality reduction or dimension reduction is the process of reducing the number of random variables under consideration by obtaining a set of principal variables.
Why is Dimensionality Reduction important?
Data comes in formats like video, audio, images, and text, often with a huge number of features. Are all of these features relevant for gaining insight? No, not all features are important or relevant. Based on business requirements or redundancy in the captured data, we reduce the feature set through feature selection and feature extraction. These techniques not only reduce computation cost but also help avoid misclassification caused by highly correlated variables.
How to overcome the Curse of Dimensionality?
There are a number of dimensionality reduction techniques, falling under feature selection and feature extraction:
- Principal Component Analysis
- Random Projection
- Independent Component Analysis
- Missing Value Ratio
- Low Variance Filter
- Backward Feature Elimination
- Forward Feature Construction
- High Correlation Filter
Consider two dimensions, X1 and X2, which are, say, measurements of a vehicle's travelled distance in kilometres (X1) and miles (X2). If you were to use both of these dimensions in machine learning, they would convey similar information and introduce a lot of noise into the system, so you are better off using just one dimension. By converting the data from 2D (X1 and X2) to 1D (PC1), we make it relatively easier to explain.
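As a minimal sketch of this idea (assuming scikit-learn and a tiny synthetic dataset, where the miles column is just a rescaled copy of the kilometres column):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic vehicle distances: X2 (miles) is just a rescaling of X1 (km),
# so the two columns carry the same information.
km = np.array([10.0, 25.0, 40.0, 55.0, 70.0])
miles = km * 0.621371
X = np.column_stack([km, miles])

# Collapse the two correlated dimensions into a single component (PC1).
pca = PCA(n_components=1)
pc1 = pca.fit_transform(X)

print(pca.explained_variance_ratio_)  # ~[1.0]: one component explains all the variance
print(pc1.ravel())                    # the 1-D representation of the data
```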
Principal Component Analysis
Principal Component Analysis finds components that explain the maximum amount of variance in the features; if we include all features as components, the total explained variance is 1.
PCA transforms a set of interrelated variables into uncorrelated variables. Each uncorrelated variable is a principal component, and each component is a linear combination of the original variables.
Each uncorrelated variable, or component, holds feature information that is expressed as explained variance, and the explained variance ratios of all components add up to 1. Since each principal component is a combination of the original variables, some principal components explain more variance than others.
The variance explained by one principal component is uncorrelated with that of the other components, which means each component captures new information about the features. This raises a question: how many components are needed to explain the maximum variance? There is no textbook method for calculating the number of components for a given number of features or variables, but we can set a threshold on the cumulative variance that the selected components must explain.
Suppose we set a variance threshold of 0.8 and have eight components with explained variance ratios of 0.3, 0.25, 0.15, 0.1, 0.08, 0.06, 0.04, and 0.02. The component with the maximum variance, 0.3, is called the first principal component. Since the threshold is 0.8, we add up components until the cumulative variance reaches 0.8.
By adding the first three components, we have 0.7 of the variance explained, and by including the fourth component we reach 0.8. So we can keep 4 components instead of eight, reducing the dimension from 8 to 4.
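A minimal sketch of this threshold rule with scikit-learn, assuming a stand-in 8-feature matrix X; the variance figures above are illustrative, so the printed ratios will differ for real data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))                    # stand-in for an 8-feature dataset

X_scaled = StandardScaler().fit_transform(X)     # PCA is scale sensitive

pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_components = np.argmax(cumulative >= 0.8) + 1  # smallest k reaching the 0.8 threshold
print(pca.explained_variance_ratio_, n_components)

# Equivalently, scikit-learn selects the count automatically when a float
# threshold is passed as n_components.
X_reduced = PCA(n_components=0.8).fit_transform(X_scaled)
print(X_reduced.shape)
```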
Random Projection
This technique is similar to PCA, but if the number of components is not specified, it is selected automatically while retaining the maximum amount of information. Random projection is a simple and computationally efficient way to reduce the dimensionality of the data by trading a controlled amount of accuracy (as additional variance) for faster processing times and smaller model sizes.
The dimensions and distribution of random projection matrices are controlled so as to preserve the pairwise distances between any two samples of the dataset. This makes random projection a suitable approximation technique for distance-based methods. Where PCA's projection captures the spread of variance, the projection in random projection preserves the distances between points while efficiently reducing the dimension.
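A minimal sketch using scikit-learn's GaussianRandomProjection; the high-dimensional matrix X is synthetic, and n_components="auto" picks the target dimension from the Johnson-Lindenstrauss bound for the requested distortion eps:

```python
import numpy as np
from sklearn.random_projection import GaussianRandomProjection

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10_000))     # 100 samples with 10,000 features

# n_components="auto" chooses the output dimension so that pairwise
# distances are preserved within a factor of (1 +/- eps).
rp = GaussianRandomProjection(n_components="auto", eps=0.25, random_state=0)
X_small = rp.fit_transform(X)

print(X_small.shape)                   # far fewer columns than 10,000
```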
Independent Component Analysis
ICA is a statistical and computational technique for revealing hidden factors that underlie sets of random variables, measurements, or signals. ICA defines a generative model for the observed multivariate data, which is typically given as a large database of samples.
It is widely applied to mixed audio for separating overlapping sounds or removing noise.
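A minimal sketch of this idea with scikit-learn's FastICA, using two synthetic signals mixed together as a stand-in for overlapping sounds:

```python
import numpy as np
from sklearn.decomposition import FastICA

# Two independent source signals (stand-ins for two overlapping sounds).
t = np.linspace(0, 8, 2000)
s1 = np.sin(2 * t)                      # sinusoidal source
s2 = np.sign(np.sin(3 * t))             # square-wave source
S = np.column_stack([s1, s2])

# Observed data: each "microphone" records a different mixture of the sources.
A = np.array([[1.0, 0.5],
              [0.5, 2.0]])              # mixing matrix
X = S @ A.T

# ICA recovers the independent sources (up to sign and scale).
ica = FastICA(n_components=2, random_state=0)
S_estimated = ica.fit_transform(X)
print(S_estimated.shape)                # (2000, 2): the separated signals
```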
Missing Value Ratio
In a dataset, each column contains values, but if some columns contain missing values, we can perform feature selection based on the missing value ratio: we set a threshold on the fraction of missing values a column may contain, and if a column's missing value ratio is greater than the threshold, we drop that feature.
The lower the threshold, the more aggressive the reduction in features.
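A minimal sketch with pandas, assuming a small illustrative DataFrame and a threshold of 0.4 (drop columns with more than 40% missing values):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0, 5.0],        # 20% missing
    "b": [np.nan, np.nan, np.nan, 4.0, 5.0],  # 60% missing
    "c": [1.0, 2.0, 3.0, 4.0, 5.0],           # 0% missing
})

threshold = 0.4
missing_ratio = df.isna().mean()               # fraction of missing values per column
df_reduced = df.loc[:, missing_ratio <= threshold]
print(df_reduced.columns.tolist())             # ['a', 'c']: column 'b' is dropped
```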
Low Variance Filter
This is conceptually similar to PCA: if a column carries very little information, i.e. its variance is lower than a threshold value, we drop the feature. The variance value acts as a filter for feature selection.
Variance is range dependent, so normalization is required before applying this technique.
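A minimal sketch with scikit-learn's VarianceThreshold; the data and the 0.01 cut-off are illustrative, and the columns are min-max normalized first because variance is range dependent:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(loc=100.0, scale=15.0, size=200),  # informative column
    np.full(200, 3.0),                            # constant column: zero variance
    np.r_[np.zeros(199), 1.0],                    # almost always the same value
])

# Variance is range dependent, so bring every column to [0, 1] first.
X_norm = MinMaxScaler().fit_transform(X)

selector = VarianceThreshold(threshold=0.01)      # illustrative cut-off
X_filtered = selector.fit_transform(X_norm)

print(selector.variances_)   # per-column variance after normalization
print(X_filtered.shape)      # (200, 1): the two low-variance columns are dropped
```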
Backward Feature Elimination
In simple terms, a model is trained on all n input features and the error rate is calculated; the model is then retrained on n-1 features and the error rate is calculated again. If the error rate increases only by a small value, the removed feature is dropped from the dataset.
Backward feature elimination can be performed iteratively, removing one feature per round, to obtain a better feature set.
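A minimal sketch using scikit-learn's SequentialFeatureSelector in backward mode (available in scikit-learn 0.24+); the estimator, dataset, and target feature count are illustrative:

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)      # 10 input features

# Repeatedly drop the feature whose removal hurts cross-validated
# performance the least, until only 5 features remain.
selector = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=5, direction="backward", cv=5
)
selector.fit(X, y)

print(selector.get_support())              # boolean mask of the kept features
X_reduced = selector.transform(X)
print(X_reduced.shape)                     # (442, 5)
```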
Forward Feature Construction
In this feature selection process, we train a model with one feature and calculate a performance measure. We keep adding features one by one and recalculate the performance: if the performance decreases when a feature is added, we drop that feature; if the performance increases, we keep it and iteratively add further features to the model.
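The same scikit-learn helper sketches the forward variant, starting from an empty feature set and greedily adding features; again the estimator, dataset, and target count are illustrative:

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)

# Start with no features and add one at a time, keeping the addition
# that yields the best cross-validated score at each step.
selector = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=4, direction="forward", cv=5
)
X_reduced = selector.fit_transform(X, y)
print(X_reduced.shape)                     # (442, 4)
```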
High Correlation Filter
Here, if columns present in the dataset are highly correlated, the information they carry becomes redundant, and we drop these highly correlated variables from the feature set.
We can calculate a correlation measure between columns / variables of either type:
- Numerical columns / variables: use the Pearson product-moment correlation coefficient.
- Nominal columns / variables: use the Pearson chi-squared value.
Before performing the correlation operation, normalize the columns, as correlation is scale sensitive.
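A minimal sketch of the filter for numerical columns with pandas, using the Pearson correlation matrix and an illustrative cut-off of 0.9:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=300)
df = pd.DataFrame({
    "a": a,
    "b": a * 1.6 + rng.normal(scale=0.01, size=300),  # almost a copy of 'a'
    "c": rng.normal(size=300),                        # independent column
})

threshold = 0.9
corr = df.corr(method="pearson").abs()

# Look only at the upper triangle so each pair is inspected once,
# then drop one column from every highly correlated pair.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]

df_reduced = df.drop(columns=to_drop)
print(to_drop)                       # ['b']
print(df_reduced.columns.tolist())   # ['a', 'c']
```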
Note
Both Forward Feature Construction and Backward Feature Elimination are computationally expensive tasks.