Feature Selection for faster analytics
Nishank Arora
Project Manager - Capgemini | Ex-Svam | Microsoft Certified Data Analyst Associate
Feature selection is an important aspect of analyzing datasets with a large number of features.
It is one of the most important concepts in machine learning: the process of selecting the features/attributes (such as columns in tabular data) that are most relevant to the modelling and business objective of the problem, and ignoring the irrelevant features in the dataset.
Selecting relevant features and ignoring the others becomes critically important when dealing with huge datasets, as this directly impacts the time needed to analyse the data.
Benefits of Feature Selection for your Dataset:
- Reduces Overfitting
- Improves Accuracy
- Reduces Training Time
Now that we know how important feature selection is in data science, let's start applying various feature selection techniques:
Method 1: Remove features whose standard deviation is zero (constants).
import pandas as pd
import numpy as np
data = pd.read_csv('./dataset.csv')
print("Original data shape- ",data.shape)
# Remove Features with zero standard deviation
constant_features = [feat for feat in data.columns if data[feat].std() == 0]
data.drop(labels=constant_features, axis=1, inplace=True)
print("Reduced feature dataset shape-",data.shape)
Method 2: Find the features that have low variance and remove them.
from sklearn.feature_selection import VarianceThreshold
sel= VarianceThreshold(threshold=0.18)
sel.fit(df)
mask = sel.get_support()
reduced_df = df.loc[:, mask]
print("Original data shape- ",df.shape)
print("Reduced feature dataset shape-",reduced_df.shape)
print("Dimensionality reduced from {} to {}.".format(df.shape[1], reduced_df.shape[1]))
Method 3: Remove highly correlated features.
Features can be removed using a threshold value, i.e., drop those features whose pairwise correlation coefficient exceeds a chosen cutoff (0.8 is a common rule of thumb; the code below uses 0.5).
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
corr=df_iter.corr()
mask = np.triu(np.ones_like(corr, dtype=bool))
# Add the mask to the heatmap
sns.heatmap(corr, mask=mask, center=0, linewidths=1, annot=True, fmt=".2f")
plt.show()
corr_matrix = df_iter.corr().abs()
# Create a True/False mask and apply it
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
tri_df = corr_matrix.mask(mask)
# List column names of highly correlated features (r >0.5 )
to_drop = [c for c in tri_df.columns if any(tri_df[c] > 0.5)]
# Drop the features in the to_drop list
reduced_df = df_iter.drop(to_drop, axis=1)
print("The reduced_df dataframe has {} columns".format(reduced_df.shape[1]
Method 4: Find the coefficients of the features using logistic regression, and remove those features that have a low coefficient (lr_coef).
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
#calculating the coeff with respect to columns
scaler = StandardScaler()
X_std = scaler.fit_transform(X)
# Perform a 75-25% train test split
X_train, X_test, y_train, y_test = train_test_split(X_std, y, test_size=0.25, random_state=0)
# Create the logistic regression model and fit it to the data
lr = LogisticRegression()
lr.fit(X_train, y_train)
# Calculate the accuracy on the test set
acc = accuracy_score(y_test, lr.predict(X_test))
print("{0:.1%} accuracy on test set.".format(acc))
print(dict(zip(X.columns, abs(lr.coef_[0]).round(2))))
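To act on those coefficients, you can drop the columns whose absolute coefficient falls below a cutoff. A minimal sketch, assuming X is the original DataFrame and using an arbitrary cutoff of 0.1:
# Sketch: drop features whose absolute coefficient is below 0.1 (illustrative cutoff)
coef_abs = abs(lr.coef_[0])
low_coef_features = [col for col, c in zip(X.columns, coef_abs) if c < 0.1]
X_reduced = X.drop(columns=low_coef_features)
print("Dropped {} low-coefficient features, {} remain".format(len(low_coef_features), X_reduced.shape[1]))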
Method 5: Calculating feature importance using XGBoost.
Feature importance gives you a score for each feature of your data; the higher the score, the more important or relevant the feature is to your output variable.
import xgboost as xgb
import matplotlib.pyplot as plt
housing_dmatrix = xgb.DMatrix(X, y)
# Create the parameter dictionary: params
params = {"objective": "reg:squarederror", "max_depth": 4}
# Train the model: xg_reg
xg_reg = xgb.train(params=params, dtrain=housing_dmatrix, num_boost_round=10)
# Plot the feature importances
xgb.plot_importance(xg_reg)
plt.show()
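If you then want to keep only the highest-scoring features, one option is to read the scores back from the booster. This is just a sketch: it assumes X is a pandas DataFrame (so the column names carry through to the booster), and the top-10 cutoff is arbitrary.
# Sketch: keep the 10 highest-scoring features (arbitrary cutoff)
importance = xg_reg.get_score(importance_type='weight')
top_features = sorted(importance, key=importance.get, reverse=True)[:10]
X_reduced = X[top_features]
print("Reduced to {} features".format(X_reduced.shape[1]))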
Method 6: Feature importance using the Extra Trees classifier.
Tree-based estimators (see the sklearn.tree module and the forests of trees in the sklearn.ensemble module) can be used to compute feature importances, which in turn can be used to discard irrelevant features.
X = df.iloc[:,0:370] #independent columns
y = df.iloc[:,-1] #target column
from sklearn.ensemble import ExtraTreesClassifier
import matplotlib.pyplot as plt
model = ExtraTreesClassifier()
model.fit(X,y)
print(model.feature_importances_)
#use inbuilt class feature_importances of tree based classifiers
#plot graph of feature importances for better visualization
feat_importances = pd.Series(model.feature_importances_, index=X.columns)
feat_importances.nlargest(20).plot(kind='barh')
plt.show()
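To turn those scores into a reduced dataset, you can keep only the top-ranked columns. A minimal sketch, reusing the feat_importances series above (keeping 20 features simply matches the bar chart and is an arbitrary choice):
# Sketch: keep the 20 features with the highest importance scores
top_cols = feat_importances.nlargest(20).index
X_reduced = X[top_cols]
print("Reduced from {} to {} features".format(X.shape[1], X_reduced.shape[1]))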
Method 7: Recursive Feature Elimination (RFE)
Given an external estimator that assigns weights to features (e.g., the coefficients of a linear model), the goal of recursive feature elimination (RFE) is to select features by recursively considering smaller and smaller sets of features. First, the estimator is trained on the initial set of features and the importance of each feature is obtained either through a coef_ attribute or through a feature_importances_ attribute. Then, the least important features are pruned from the current set of features. That procedure is recursively repeated on the pruned set until the desired number of features to select is eventually reached.
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
rfe = RFE(estimator=RandomForestClassifier(random_state=0), n_features_to_select=3, step=2, verbose=1)
rfe.fit(X_train,y_train)
mask=rfe.support_
X_new=X.loc[:,mask]
print(X_new.columns)
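Besides the boolean mask, the fitted selector also exposes a ranking. A quick sketch for inspecting it (assuming, as above, that X is a DataFrame and pandas is already imported; rank 1 means the feature survived, higher ranks were eliminated earlier):
# Sketch: inspect the RFE ranking of each feature
ranking = pd.Series(rfe.ranking_, index=X.columns).sort_values()
print(ranking.head(10))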
Method 8: Univariate Feature Selection (ANOVA)
This works by selecting the best features based on univariate statistical tests (ANOVA). The methods based on the F-test estimate the degree of linear dependency between two random variables: they assume a linear relationship between the feature and the target, and that the variables follow a Gaussian distribution.
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import f_classif, f_regression
from sklearn.feature_selection import SelectKBest, SelectPercentile
df= pd.read_csv('./train.csv')
X = df.drop(['ID','TARGET'], axis=1)
y = df['TARGET']
df.head()
# Calculate Univariate Statistical measure between each variable and target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
univariate = f_classif(X_train.fillna(0), y_train)
# Capture P values in a series
univariate = pd.Series(univariate[1])
univariate.index = X_train.columns
univariate.sort_values(ascending=False, inplace=True)
# Plot the P values
univariate.sort_values(ascending=False).plot.bar(figsize=(20,8))
# Select K best Features
k_best_features = SelectKBest(f_classif, k=10).fit(X_train.fillna(0), y_train)
X_train.columns[k_best_features.get_support()]
# Apply the transformed features to dataset
X_train = k_best_features.transform(X_train.fillna(0))
X_train.shape
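SelectPercentile, imported above but not used, works the same way except that you specify the share of features to keep rather than a fixed k. A minimal sketch (fitting on the full X and y here purely for brevity; the 10% cutoff is arbitrary):
# Sketch: keep the top 10% of features by ANOVA F-score
percentile_selector = SelectPercentile(f_classif, percentile=10).fit(X.fillna(0), y)
selected_cols = X.columns[percentile_selector.get_support()]
print("{} features selected".format(len(selected_cols)))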
Dimension Reduction Techniques
PCA (Principal Component Analysis):
The original data has 9 columns. In this section, the code projects the original 9-dimensional data into 2 dimensions. Note that after dimensionality reduction there usually isn't a particular meaning assigned to each principal component; the new components are just the two main dimensions of variation.
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
dt=pd.read_csv('./dataset.csv')
X=dt.iloc[0:,0:-1]
y=dt.iloc[:,-1]
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(X)
principalDf = pd.DataFrame(data=principalComponents,
                           columns=['principal component 1', 'principal component 2'])
print("Dimension of dataframe before PCA",dt.shape)
print("Dimension of dataframe after PCA",principalDf.shape)
print(principalDf.head())
finalDf = pd.concat([principalDf, y], axis = 1)
print("finalDf")
print(finalDf.head())
#Visualize 2D Projection
fig = plt.figure(figsize = (8,8))
ax = fig.add_subplot(1,1,1)
ax.set_xlabel('Principal Component 1', fontsize = 15)
ax.set_ylabel('Principal Component 2', fontsize = 15)
ax.set_title('2 component PCA', fontsize = 20)
targets = [0, 1]
colors = ['r', 'g']
for target, color in zip(targets, colors):
    indicesToKeep = finalDf['Class'] == target
    ax.scatter(finalDf.loc[indicesToKeep, 'principal component 1'],
               finalDf.loc[indicesToKeep, 'principal component 2'],
               c=color, s=50)
ax.legend(targets)
ax.grid()
plt.show()
The explained variance tells you how much information (variance) can be attributed to each of the principal components. This matters because, while you can project a 371-dimensional space down to 2 dimensions, you lose some of the variance (information) when you do so. Using the attribute explained_variance_ratio_, you can see that the first principal component contains 88.85% of the variance and the second principal component contains 0.06% of the variance. Together, the two components contain 88.91% of the information.
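Those figures come straight from the fitted PCA object's explained_variance_ratio_ attribute; a quick check (the exact percentages will of course differ on your own data) looks like this:
# Proportion of variance captured by each principal component
print(pca.explained_variance_ratio_)
print("Total variance retained: {:.2%}".format(pca.explained_variance_ratio_.sum()))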