Why, How and When to apply Feature Selection
Feature Selection is a critical component of a Data Scientist’s workflow. When presented with data of very high dimensionality, models usually choke because
- Training time grows rapidly with the number of features.
- The risk of overfitting increases with the number of features.
Feature Selection methods help with these problems by reducing the dimensionality without much loss of the total information. They also help to make sense of the features, which improves future data collection ability and strategy.
In this article, I discuss the following feature selection techniques and their traits.
- Filter Methods
- Wrapper Methods
- Embedded Methods
Filter Methods
Filter Methods consider the relationship between features and the target variable to compute the importance of each feature.
F Test
F-Test is a statistical test used to compare models and check whether the difference between them is significant.
For feature selection, F-Test performs hypothesis testing on two models, X and Y, where X is a model built with just a constant and Y is a model built with a constant and a single feature.
The least-square errors of the two models are compared to check whether the difference in errors between X and Y is significant or was introduced by chance.
F-Test is useful in feature selection because it tells us the significance of each feature in improving the model.
Scikit-learn provides F-Test based scoring functions for selecting the K best features.
For regression tasks
sklearn.feature_selection.f_regression
For classification tasks
sklearn.feature_selection.f_classif
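As a minimal sketch of how these scoring functions are typically used, the snippet below combines f_regression with SelectKBest on scikit-learn’s built-in diabetes dataset; the dataset and k=5 are purely illustrative choices.

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectKBest, f_regression

# Illustrative data; any (X, y) regression dataset works the same way.
X, y = load_diabetes(return_X_y=True)

# Score every feature with the F-Test and keep the 5 highest-scoring ones.
selector = SelectKBest(score_func=f_regression, k=5)
X_selected = selector.fit_transform(X, y)

print(selector.scores_)                    # F score per feature
print(selector.get_support(indices=True))  # indices of the selected features
```

For classification, swap f_regression for f_classif and use a classification dataset.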
There are some drawbacks to using F-Test to select your features. F-Test checks for, and only captures, linear relationships between features and labels: a highly correlated feature gets a high score and a weakly correlated feature gets a low score.
- Correlation can be highly deceptive, as it doesn’t capture strong non-linear relationships.
- Relying on a summary statistic like correlation can be a bad idea, as illustrated by Anscombe’s quartet.
Mutual Info
Mutual Information between two variables measures the dependence of one variable on the other. If X and Y are two variables, then:
- If X and Y are independent, no information about Y can be obtained by knowing X, or vice versa. Hence their mutual information is 0.
- If X is a deterministic function of Y, then X can be determined from Y and Y from X, and their mutual information is maximal (1 on a normalized scale).
- When Y depends on X only partially, as in Y = f(X, Z, M, N), the (normalized) mutual information lies between 0 and 1.
We can select our features from feature space by ranking their mutual information with the target variable.
The advantage of mutual information over F-Test is that it also does well with non-linear relationships between feature and target variable.
Sklearn offers feature selection with Mutual Information for regression and classification tasks.
sklearn.feature_selection.mutual_info_regression
sklearn.feature_selection.mutual_info_classif
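The snippet below is a small sketch of ranking features by their mutual information with the target, here using mutual_info_classif on the breast cancer dataset; the dataset and the choice of keeping 10 features are assumptions for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Illustrative classification data.
X, y = load_breast_cancer(return_X_y=True)

# Estimate mutual information between each feature and the target,
# then keep the 10 features with the highest scores.
selector = SelectKBest(score_func=mutual_info_classif, k=10)
X_selected = selector.fit_transform(X, y)

print(selector.scores_)                    # estimated mutual information per feature
print(selector.get_support(indices=True))  # indices of the selected features
```

For regression targets, mutual_info_regression plays the same role.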
Variance Threshold
This method removes features whose variance falls below a certain cutoff.
The idea is that a feature which barely varies within itself generally carries very little predictive power.
sklearn.feature_selection.VarianceThreshold
Variance Threshold doesn’t consider the relationship of features with the target variable.
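A minimal sketch of VarianceThreshold on a tiny hand-made matrix (the values are chosen only for illustration); the first column never varies, so it gets dropped.

```python
from sklearn.feature_selection import VarianceThreshold

# Toy data: the first column is constant, so its variance is 0.
X = [[0, 1.0, 0.1],
     [0, 2.0, 0.2],
     [0, 3.0, 0.1]]

# Drop every feature whose variance is not above the threshold (0.0 here).
selector = VarianceThreshold(threshold=0.0)
X_reduced = selector.fit_transform(X)

print(X_reduced)            # the constant column is gone
print(selector.variances_)  # per-feature variances
```

Note that no target y is passed at all, which is exactly the caveat above.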
Wrapper Methods
Wrapper Methods generate models from subsets of the features and gauge their performance against each other.
Forward Search
This method searches for the best feature w.r.t. model performance and adds it to your feature subset, one feature at a time.
For data with n features,
->In the first round, ‘n’ models are created, one per individual feature, and the most predictive feature is selected.
->In the second round, ‘n-1’ models are created, each combining the previously selected feature with one of the remaining features.
->This is repeated until a subset of the ‘m’ best features has been selected (a sketch with scikit-learn follows below).
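Recent versions of scikit-learn ship a ready-made forward search in SequentialFeatureSelector; the sketch below uses it with a plain LinearRegression on the diabetes dataset, where the estimator, the dataset and n_features_to_select=5 are illustrative assumptions.

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)

# Greedily add one feature per round, keeping the one that most improves
# the cross-validated score, until 5 features have been selected.
sfs = SequentialFeatureSelector(LinearRegression(),
                                n_features_to_select=5,
                                direction="forward",
                                cv=5)
sfs.fit(X, y)

print(sfs.get_support(indices=True))  # indices of the selected features
```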
Recursive Feature Elimination
As the name suggests, this method eliminates the worst performing feature at each step, one after the other, until the best subset of features remains.
For data with n features,
->In the first round, models are created on combinations of ‘n-1’ features, i.e. all features except one, and the least useful feature is removed.
->In the second round, models are created on ‘n-2’ features by removing another feature, and so on (a sketch with scikit-learn follows below).
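scikit-learn implements this idea in RFE. Note that RFE ranks features using the fitted estimator’s coefficients or importances at each step rather than training one model per candidate, but the spirit of recursively discarding the weakest feature is the same. The estimator, the dataset and n_features_to_select=10 below are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Repeatedly fit the model and drop the weakest feature (step=1)
# until only 10 features remain.
rfe = RFE(LogisticRegression(max_iter=5000),
          n_features_to_select=10,
          step=1)
rfe.fit(X, y)

print(rfe.support_)   # boolean mask of the kept features
print(rfe.ranking_)   # 1 = kept; larger numbers were eliminated earlier
```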
Wrapper Methods promise you a strong set of features through an extensive greedy search.
But the main drawback of wrapper methods is the sheer number of models that need to be trained. They are computationally very expensive and become infeasible with a large number of features.
Embedded Methods
Feature selection can also be achieved through the insights provided by some Machine Learning models themselves.
LASSO Linear Regression can be used for feature selection. Lasso Regression is performed by adding an extra penalty term, the L1 norm of the coefficients, to the cost function of Linear Regression.
Apart from preventing overfitting, this also shrinks the coefficients of less important features to exactly zero.
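A minimal sketch of Lasso-based selection on the diabetes dataset; the alpha value is an assumption and would normally be tuned, e.g. with cross-validation.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)

# The L1 penalty is scale sensitive, so standardise the features first.
X_scaled = StandardScaler().fit_transform(X)

# alpha controls how aggressively coefficients are pushed towards zero.
lasso = Lasso(alpha=1.0)
lasso.fit(X_scaled, y)

print(lasso.coef_)                    # unimportant features end up at exactly 0
print(np.where(lasso.coef_ != 0)[0])  # indices of the surviving features
```

The same idea is wrapped up in sklearn.feature_selection.SelectFromModel, which keeps only the features with non-zero (or sufficiently large) coefficients.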
Tree based models calculate feature importance because they need to keep the best performing features as close to the root of the tree as possible. Constructing a decision tree involves repeatedly finding the most predictive feature to split on.
The feature importance in tree based models is calculated based on the Gini Index, Entropy or Chi-Square value.
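As a sketch, a random forest exposes these importances through feature_importances_; the dataset and the number of trees below are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# Fit a forest; each tree's impurity (Gini) reductions are averaged
# into a single importance score per feature.
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X, y)

# Indices of the 10 most important features, best first.
top_features = np.argsort(forest.feature_importances_)[::-1][:10]
print(top_features)
print(forest.feature_importances_[top_features])
```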
Feature Selection, like most things in Data Science, is highly context and data dependent, and there is no one-stop solution. The best way forward is to understand the mechanism of each method and use it when required.
I mainly use feature selection techniques to get insights about the features and their relative importance with respect to the target variable. Please comment below on which feature selection technique you use.