Feature Selection and Dimensionality Reduction

1. Introduction

This article consolidates my 7 LinkedIn posts on the topic of Feature Selection and Dimensionality Reduction.

Feature Selection vs Dimensionality Reduction:

Datasets are often high dimensional, containing a large number of features, although the relevance of each feature for analysing the data is not always clear.

One shouldn't just throw everything at a machine learning model and rely on the training process to determine which features are actually useful (I have discussed this in my posts before). It is therefore important to carry out feature selection and/or dimensionality reduction to reduce the number of features in a dataset. Whilst both ‘feature selection’ and ‘dimensionality reduction’ reduce the number of features in a dataset, there is an important difference:

  • Feature selection simply keeps some features and excludes others WITHOUT changing them
  • Dimensionality Reduction, by contrast, transforms the features into a lower dimension

Feature selection identifies the features that best represent the relationships within the feature space and with the target that the model will try to predict. Feature selection methods remove the features that do not influence the outcome. This reduces the size of the feature space, and hence both the resources required to process the data and the model complexity.

2. Feature Selection Methods:

Feature selection methods can be classified into Unsupervised and Supervised Feature Selection. Unsupervised Feature Selection does not consider the relationship with the target variable. Here, one determines the correlation between features: if (say) 2 features are highly correlated, they carry largely the same information, so you usually do not need both.
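As a rough illustration of the unsupervised idea, the sketch below drops one feature from every highly correlated pair. The synthetic data, the feature names and the 0.9 cut-off are assumptions made for this example, not values from the original posts.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 200

# Three synthetic features: x2 is almost a scaled copy of x1, x3 is independent.
x1 = rng.normal(size=n_samples)
x2 = 2.0 * x1 + rng.normal(scale=0.05, size=n_samples)
x3 = rng.normal(size=n_samples)
X = np.column_stack([x1, x2, x3])
names = ["x1", "x2", "x3"]

# Absolute Pearson correlation between every pair of features (columns).
corr = np.abs(np.corrcoef(X, rowvar=False))

threshold = 0.9  # assumed cut-off for "highly correlated"
keep = []
for j in range(X.shape[1]):
    # Keep feature j only if it is not highly correlated with a feature already kept.
    if all(corr[j, i] < threshold for i in keep):
        keep.append(j)

kept_names = [names[j] for j in keep]
X_reduced = X[:, keep]
print(kept_names)
```

Note that the target is never used here, which is what makes the procedure unsupervised. Which member of a correlated pair survives depends only on column order in this sketch.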

Supervised feature selection looks at the target relationship: the relationship between each feature and the target (or label) is used to select features. The methods that fall under supervised feature selection include Filter Methods, Wrapper Methods and Embedded Methods.

Filter Methods:

Filter Methods look at correlations and select the best subset of features to give to your machine learning model. Popular filter methods include Pearson’s correlation, computed both between the features themselves and between each feature and the target label. Thus, in Filter Methods we start with all of the features, select the best subset, pass that subset to the Machine Learning Model, and evaluate the performance of the model on it.

We get a correlation matrix which tells us how the features are related to each other and to the target variable. Each correlation falls in the range -1 to +1, where +1 is a perfect positive correlation and -1 is a perfect negative correlation. Commonly used correlation measures include:

  • Pearson’s correlation
  • Kendall Tau Rank Correlation
  • Spearman’s Rank Correlation
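A minimal sketch of such a correlation matrix, using pandas `DataFrame.corr` with each of the three measures listed above. The data frame, its column names and the noise levels are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 300

# Hypothetical data: x3 is strongly (negatively) tied to x1, x2 is pure noise,
# and the target depends on x1 only.
df = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
})
df["x3"] = -df["x1"] + rng.normal(scale=0.1, size=n)
df["target"] = 3.0 * df["x1"] + rng.normal(scale=0.5, size=n)

# One correlation matrix per measure; every entry lies between -1 and +1.
corr_matrices = {m: df.corr(method=m) for m in ("pearson", "kendall", "spearman")}
pearson = corr_matrices["pearson"]

# Absolute correlation of each feature with the target, strongest first.
target_corr = pearson["target"].drop("target").abs().sort_values(ascending=False)

print(pearson.round(2))
print(target_corr.round(2))
```

Reading the matrix row by row shows both uses described above: the feature-to-feature entries flag redundant features, and the `target` column ranks features by their relationship with the label.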

Figure: Calculating the Error in PCA

3. Univariate Feature Selection in SKLearn:


The classes in the sklearn.feature_selection module can be used for feature selection / dimensionality reduction on sample sets, either to improve estimators’ accuracy scores or to boost their performance on very high-dimensional datasets. SKLearn has the following routines for univariate feature selection:


  • SelectKBest removes all but the k highest scoring features
  • SelectPercentile removes all but a user-specified highest scoring percentage of features
  • GenericUnivariateSelect performs univariate feature selection with a configurable strategy
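The three routines can be tried out as follows. This is only a sketch: the Iris dataset, the ANOVA F-score (`f_classif`) as scoring function, and the values k=2 and percentile=50 are choices made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import (
    GenericUnivariateSelect,
    SelectKBest,
    SelectPercentile,
    f_classif,
)

X, y = load_iris(return_X_y=True)  # 150 samples, 4 features

# Keep the 2 features with the highest ANOVA F-score against the label.
k_best = SelectKBest(score_func=f_classif, k=2)
X_k = k_best.fit_transform(X, y)

# Keep the top 50% of features ranked by the same score.
pct = SelectPercentile(score_func=f_classif, percentile=50)
X_pct = pct.fit_transform(X, y)

# Same selection as SelectKBest, with the strategy passed as a parameter.
generic = GenericUnivariateSelect(score_func=f_classif, mode="k_best", param=2)
X_gen = generic.fit_transform(X, y)

print(k_best.scores_.round(1))   # one score per original feature
print(k_best.get_support())      # boolean mask of the features that were kept
```

Each feature is scored on its own against the target, which is why these routines are called univariate: interactions between features are not taken into account.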

4. Wrapper Methods

Popular Wrapper methods for selecting the best subset of features include:

  • Forward Selection
  • Backward Selection
  • Recursive Feature Elimination

How do the above Methods work?

Forward Selection:

Forward selection is an iterative “greedy” method: we select one feature at a time, pass it to the machine learning model and evaluate the performance. We repeat the process, adding a feature in every iteration, until no improvement in the performance of the model is seen. At this point we take the features selected so far as our subset; because the search is greedy, this is a good subset but not guaranteed to be the best possible one. Hence the name Forward Selection.

Backward Selection:

Backward elimination is just the reverse of forward selection. As one might guess, in backward elimination we start with “all of the features” and evaluate the model performance when **removing** one feature at a time. At each step we drop the feature whose removal gives the best performance, and continue until there is no further improvement.
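Both greedy procedures can be sketched with scikit-learn's `SequentialFeatureSelector` (available from version 0.24). The Iris data, the logistic regression model and the target of 2 features are assumptions for this example. Unlike the description above, this sketch stops at a fixed number of features rather than when the score stops improving.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Forward: start with no features and add the one that most improves
# the 5-fold cross-validated score, until 2 features are selected.
forward = SequentialFeatureSelector(
    model, n_features_to_select=2, direction="forward", cv=5
)
forward.fit(X, y)

# Backward: start with all features and remove the one whose removal
# hurts the cross-validated score least, until 2 features remain.
backward = SequentialFeatureSelector(
    model, n_features_to_select=2, direction="backward", cv=5
)
backward.fit(X, y)

print(forward.get_support())
print(backward.get_support())
```

The two directions need not agree on the final subset, since each makes its greedy choices from a different starting point.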


Recursive feature elimination:

In Recursive Feature Elimination, we use a model to evaluate feature importance. Random Forest Classifier is one of the model types that exposes feature importance. First, we choose the desired number of features and fit the model on all features. The model ranks the features by importance and we discard the least important ones. We refit and repeat until the desired number of features remains. Recursive Feature Elimination often turns out to be the best performing of the three.
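A minimal sketch of this loop using scikit-learn's `RFE` class with a random forest. The dataset, the number of trees and the target of 2 features are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

X, y = load_iris(return_X_y=True)

forest = RandomForestClassifier(n_estimators=100, random_state=0)

# Fit, drop the least important feature (step=1), refit, and repeat
# until only 2 features remain.
rfe = RFE(estimator=forest, n_features_to_select=2, step=1)
rfe.fit(X, y)

X_rfe = rfe.transform(X)
print(rfe.support_)   # True for the features that survived
print(rfe.ranking_)   # 1 = selected; larger numbers were eliminated earlier
```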

5. Embedded Methods for Feature Selection

Embedded Methods are again a supervised approach to feature selection. The method assigns a score to each feature and discards those with lower feature importance. In SKLearn, feature importance is built into tree-based models (e.g., RandomForestClassifier) and is available through the feature_importances_ attribute. We can then use SelectFromModel to select features from the trained model based on the assigned feature importances.

Extracting feature importances:

Once Sklearn has been imported, the data cleaned, and the model instantiated and fitted on the training data, model.feature_importances_ is what you need.

For linear and logistic regression the approach is a bit different: importance is read from the fitted coefficients (coef_) rather than from feature_importances_.
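The sketch below shows both routes: `feature_importances_` with `SelectFromModel` for a tree-based model, and `coef_` for logistic regression. The Iris data, the "median" threshold and averaging the absolute coefficients over classes are choices made for this example, not part of the original posts.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

data = load_iris()
X, y = data.data, data.target

# Tree-based model: importances are available after fitting and sum to 1.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = forest.feature_importances_

# Keep the features whose importance is at least the median importance.
selector = SelectFromModel(forest, prefit=True, threshold="median")
X_selected = selector.transform(X)

# Linear model: use the size of the fitted coefficients instead.
# (Magnitudes are only comparable across features measured on similar scales.)
logreg = LogisticRegression(max_iter=1000).fit(X, y)
coef_magnitude = np.abs(logreg.coef_).mean(axis=0)  # average over the 3 classes

for name, imp, coef in zip(data.feature_names, importances, coef_magnitude):
    print(f"{name:20s} importance={imp:.3f}  mean|coef|={coef:.3f}")
```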

Figure: Measuring Feature Importance


6. Dimensionality Reduction

As described in my earlier posts, datasets are often high dimensional, containing a large number of features, many of which are often not relevant to the problem. It may thus become necessary to carry out feature elimination or dimensionality reduction on the data set. I have discussed some of the techniques for feature elimination, such as Filter Methods (like Pearson’s correlation), Univariate Feature Selection, Wrapper Methods and Embedded Methods, in Part 1 through Part 3. Let’s now look at Dimensionality Reduction.

In contrast to the above methods for feature elimination, which just “knock off” features of less importance WITHOUT changing the remaining features, dimensionality reduction “TRANSFORMS” the features into a lower dimension.

Motivation of Dimensionality Reduction:

The motivation behind dimensionality reduction is twofold:

1) Firstly, as you transform the problem into a lower dimensional space, you reduce the amount of data stored in computer memory, and you speed up the learning algorithm, since a smaller feature space means solving for fewer parameters (weights) in your model.

2) Secondly, Data Visualization: it is not easy to visualize data in more than three dimensions. We can reduce the data to 3 or fewer dimensions in order to plot it, by finding new features z1, z2 (and perhaps z3) that effectively “summarize” all of the other features.

Intuition of Dimensionality Reduction:

Let us say that we collected a dataset with several features; below we plot just two of them. It is clear from the figure (Figure 1 below) that instead of 2 features we can use just one feature along the dimension z1.

Similarly, we can reduce data from 3 dimensions to 2 (see Figure 2 below). **It should be underscored that in a typical example we may have a 1000-dimensional problem which we reduce to (say) 100 dimensions; such problems cannot be visualized, so the intuition is best built using small examples like these.** Here (in Figure 2) we project the data from 3 dimensions onto a 2-dimensional plane as highlighted (along the z1 and z2 directions). Thus, we have transformed the problem from the x1, x2, x3 coordinate system to the z1, z2 coordinate system.

As these simple examples show, dimensionality reduction helps in reducing the computer resources required and, importantly, in optimizing data pipelines.

Figure: An intuitive feel of Dimensionality Reduction

7. Data Visualization:

Reducing computer resources, speeding up the training process and optimizing data pipelines is one motivation for carrying out Dimensionality Reduction. Another is Data Visualisation.

It is not easy to visualize data in more than three dimensions, so we can reduce the dimensions of our data to 3 or fewer in order to plot it.

We need to find new features z1, z2 (and perhaps z3) that effectively summarize all the other features. What the new transformed features denote is not directly interpretable; it is left to the Machine Learning Engineer to deduce that from the physics of the problem.

[One common use case I have come across is in the area of Fracture Mechanics, wherein a component is modelled in finite element software in 3 dimensions, but Fracture Mechanics analytical solutions are available for 2-dimensional plates. Dimensionality Reduction using an algorithm such as PCA (see below) could be well suited to such a use case, for a crack propagation analysis centred on the area of interest.]

Consider another common example, outside the area of Structural Mechanics, to aid data visualization:

Hundreds of features related to a country's economic system may all be combined into one feature that you call "Economic Activity."

Using dimensionality reduction, we can summarize the above into 2 features; as mentioned, it is left to the Engineer to physically interpret the features. See Figures 3, 4 and 5 below.
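To make the idea concrete, here is a small sketch that reduces a made-up table of country indicators to two plottable features with scikit-learn's `PCA`. The data is synthetic (50 hypothetical countries, 20 indicators driven by 2 hidden factors), and standardizing the indicators first is an assumption of this example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_countries, n_indicators = 50, 20

# Hypothetical data: 20 indicators generated from 2 hidden factors plus noise.
latent = rng.normal(size=(n_countries, 2))
mixing = rng.normal(size=(2, n_indicators))
X = latent @ mixing + 0.1 * rng.normal(size=(n_countries, n_indicators))

# Put all indicators on a comparable scale, then keep 2 components.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
Z = pca.fit_transform(X_scaled)  # column 0 is z1, column 1 is z2

explained = pca.explained_variance_ratio_.sum()
print(Z.shape)
print(round(explained, 3))
```

Each row of `Z` is one country expressed in the two new features, ready for a 2D scatter plot. What z1 and z2 mean physically still has to be interpreted by looking at how the original indicators load onto them (`pca.components_`).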

Dimensionality Reduction: Principal Component Analysis

The most popular dimensionality reduction algorithm is Principal Component Analysis (PCA).

As described in Part 4 of this series of posts, given two features x1 and x2, we want to find a single line that effectively describes both features at once. We then map our old features onto this new line to get a new single feature. The same can be done with three features, where we map them to a plane.

The goal of PCA is to minimize the average squared distance of every data point to the projection line. This is the projection error and is illustrated in Figure 6.

For a generic case we do as follows:

Reduce from n dimensions to k dimensions: find k vectors u(1), u(2), …, u(k) onto which to project the data so as to minimize the projection error.
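Written out, the objective could look as follows. This is a sketch in notation assumed for this example only: m is the number of training examples, x^(i) is the i-th mean-normalized example, and the vectors u(j) are taken to be orthonormal.

```latex
\min_{u^{(1)},\dots,u^{(k)}} \; \frac{1}{m}\sum_{i=1}^{m}
  \left\lVert x^{(i)} - x^{(i)}_{\mathrm{approx}} \right\rVert^{2},
\qquad
x^{(i)}_{\mathrm{approx}} = \sum_{j=1}^{k} \left( u^{(j)\top} x^{(i)} \right) u^{(j)},
\qquad
u^{(j)\top} u^{(l)} = \delta_{jl}
```

Here x_approx^(i) is the projection of x^(i) onto the span of the k vectors, so the quantity being minimized is the average squared projection error.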

Figure: An example of Dimensionality Reduction to aid Data Visualization

8. Mathematical interpretation of Principal Component Analysis and the Principal Component Analysis Algorithm

The above sections covered the physical interpretation of the Principal Component Analysis (PCA) method for dimensionality reduction. Let us get into a bit of Linear Algebra and understand some of the mathematical background of PCA!

Principal component analysis (PCA) is a standard tool in modern data analysis used in diverse fields from neuroscience to computer graphics and engineering in general - because it is one of the simplest (hence, coolest!) methods for extracting relevant information from "confusing" data sets.

With minimal effort (three or four lines of Python/Matlab code), PCA provides a roadmap for reducing a complex data set to a lower dimension, revealing the often hidden, simplified structures that underlie it.

The goal of principal component analysis is to identify the most meaningful basis to re-express a data set. The hope is that this new basis will filter out the noise and reveal hidden structure - thus resulting in dimensionality reduction.

PCA Framework: Change of basis

Thus, the question that PCA asks is precisely: is there another basis, which is a linear combination of the original basis, that best re-expresses our data? PCA aims to re-express the original data X as a linear combination of the basis P. That is,

PX = Y

What is the best choice of P in PCA?

In the above equation, P is the rotation and stretch applied to the data so that it aligns with a new basis, the principal directions. For the product PX to make sense, X is arranged here with one training example per column (in the data visualization example, one column per country) and one feature per row; each row of P is then one of the new basis vectors.

We want the new basis to point along the directions in which the spread (variance) of the data about the mean is as large as possible. This enables us to remove those dimensions where the data is almost flat. This decreases the dimension of the data whilst keeping the variance (or spread) of the data as close as possible to that of the original data.


Algorithm to implement PCA: See image / figure below

Mean normalization in PCA:

Before implementing the PCA algorithm, it is important to carry out mean normalization. This matters because PCA is a variance-maximizing exercise, as described above: it projects your original data onto the directions which maximize the variance.
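A minimal NumPy sketch of the usual steps (mean normalization, covariance matrix, singular value decomposition, projection) is given here. It is one common formulation rather than necessarily the exact code shown in the original figure. It assumes X holds one training example per row and one feature per column; the function name and the test data are made up for this example.

```python
import numpy as np


def pca_fit_transform(X, k):
    """Project X (m examples x n features) onto its first k principal directions."""
    mu = X.mean(axis=0)
    X_centered = X - mu                       # mean normalization
    m = X_centered.shape[0]
    sigma = (X_centered.T @ X_centered) / m   # n x n covariance matrix
    U, S, _ = np.linalg.svd(sigma)            # columns of U are the principal directions
    U_k = U[:, :k]                            # keep the k directions of largest variance
    Z = X_centered @ U_k                      # m x k reduced representation
    return Z, U_k, S, mu


# Two strongly related features, so one principal direction captures most of the spread.
rng = np.random.default_rng(1)
x1 = rng.normal(size=500)
X = np.column_stack([x1, 0.5 * x1 + 0.05 * rng.normal(size=500)])

Z, U_k, S, mu = pca_fit_transform(X, k=1)
X_approx = Z @ U_k.T + mu                     # map back to the original 2D space

projection_error = np.mean(np.sum((X - X_approx) ** 2, axis=1))
print(S)                 # variance along each principal direction
print(projection_error)  # average squared projection error
```

Without the mean normalization step, the first direction would tend to point towards the mean of the data rather than along its direction of greatest spread.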

Figure: Coding the Principal Component Analysis algorithm


9. How to choose the number of principal components - number of reduced dimensions?

This is the final post in the series of discussions on “Feature Selection and Dimensionality Reduction”, and the emphasis here is on calculating the error in PCA.

PCA helps in reducing the solution time because, by transforming the features into a lower dimension, one solves for fewer parameters, thus accelerating the learning process. But how do we check that the error in PCA is small, so that no important information is lost whilst projecting the data to reduce the dimensions?

I mentioned in Part 4 that there will always be a projection error when one “transforms” the features into lower dimensions, again shown below in Figure 1. Note that the visualization is straightforward here because you’re simply transforming the problem from 2 dimensions to a single dimension, but this will surely ‘NOT’ be the case for a real business problem!

To calculate the error in PCA, one divides the average squared projection error by the total variation in the data and then limits the ratio to at most 0.01 (that is, at least 99% of the variance is retained). That is,

  • Calculate the average squared projection error. We take the square so that positive and negative numbers do not cancel out, and squaring also puts a high ‘penalty’ on large errors.
  • Calculate the total variation in the data
  • Choose ‘k’ (the number of dimensions) as the smallest value for which the ratio of the average squared projection error to the total variation is <= 0.01
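The steps above can be sketched in NumPy as follows. The sketch relies on a standard identity: for mean-normalized data, the average squared projection error equals the sum of the discarded singular values of the covariance matrix, and the total variation equals the sum of all of them. The function name, the synthetic data and the row-per-example layout are assumptions for this example.

```python
import numpy as np


def choose_k(X, max_error_ratio=0.01):
    """Smallest k whose projection-error / total-variation ratio is within the limit."""
    X_centered = X - X.mean(axis=0)
    m = X_centered.shape[0]
    sigma = (X_centered.T @ X_centered) / m
    _, S, _ = np.linalg.svd(sigma)        # variance along each principal direction
    retained = np.cumsum(S) / np.sum(S)   # fraction of variance kept by the first k
    error_ratio = 1.0 - retained          # avg squared projection error / total variation
    k = int(np.argmax(error_ratio <= max_error_ratio)) + 1
    return k, error_ratio


# Hypothetical data: 10 features that really live in a 3-dimensional subspace.
rng = np.random.default_rng(0)
latent = rng.normal(size=(400, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.01 * rng.normal(size=(400, 10))

k, error_ratio = choose_k(X)
print(k)
print(error_ratio.round(4))
```

Because all the ratios come from a single decomposition, there is no need to rerun PCA for every candidate value of k.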

Common misuse of PCA:

  • PCA to avoid over-fitting: one should not use PCA to avoid overfitting. To tackle over-fitting, go for regularization but ‘NOT’ PCA!

  • Do not ‘pre-plan’ PCA before attempting to solve the problem with the intended features! Solve your problem in logical ‘steps’ and do not ‘oversimplify’ things before getting the bigger picture.


