Feature Selection for faster analytics
Nishank Arora
Project Manager - Capgemini | Ex-Svam | Microsoft Certified Data Analyst Associate
Feature selection is an important aspect of analyzing datasets with a large number of features.
It is one of the most important concepts in machine learning: the process of selecting the features/attributes (such as columns in tabular data) that are most relevant to the modelling and business objective of the problem, and ignoring the irrelevant features in the dataset.
Selecting relevant features and ignoring the others becomes critically important when dealing with huge datasets, as this directly impacts the time needed to analyse the data.
Benefits of Feature Selection for your Dataset:
- Reduces Overfitting
- Improves Accuracy
- Reduces Training Time
Now that we know how important feature selection is in data science, let's start applying various feature selection techniques:
Method 1: Remove features whose standard deviation is zero (constants).
import pandas as pd
import numpy as np
data = pd.read_csv('./dataset.csv')
print("Original data shape- ",data.shape)
# Remove Features with zero standard deviation
constant_features = [feat for feat in data.columns if data[feat].std() == 0]
data.drop(labels=constant_features, axis=1, inplace=True)
print("Reduced feature dataset shape-",data.shape)
Method 2: Find the features that have low variance and remove them.
from sklearn.feature_selection import VarianceThreshold
sel= VarianceThreshold(threshold=0.18)
sel.fit(df)
mask = sel.get_support()
reduced_df = df.loc[:, mask]
print("Original data shape- ",df.shape)
print("Reduced feature dataset shape-",reduced_df.shape)
print("Dimensionality reduced from {} to {}.".format(df.shape[1], reduced_df.shape[1]))
Method 3: Remove highly correlated features.
Features can be removed using a threshold value, i.e., drop those features whose pairwise correlation coefficient exceeds a chosen cutoff (0.8 is a common rule of thumb; the code below uses 0.5).
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
corr=df_iter.corr()
mask = np.triu(np.ones_like(corr, dtype=bool))
# Add the mask to the heatmap
sns.heatmap(corr, mask=mask, center=0, linewidths=1, annot=True, fmt=".2f")
plt.show()
corr_matrix = df_iter.corr().abs()
# Create a True/False mask and apply it
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
tri_df = corr_matrix.mask(mask)
# List column names of highly correlated features (r >0.5 )
to_drop = [c for c in tri_df.columns if any(tri_df[c] > 0.5)]
# Drop the features in the to_drop list
reduced_df = df_iter.drop(to_drop, axis=1)
print("The reduced_df dataframe has {} columns".format(reduced_df.shape[1]
Method 4: Find the coefficients of the features using logistic regression, and remove those features that have a low coefficient (lr_coef).
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
#calculating the coeff with respect to columns
scaler = StandardScaler()
X_std = scaler.fit_transform(X)
# Perform a 75-25% train test split
X_train, X_test, y_train, y_test = train_test_split(X_std, y, test_size=0.25, random_state=0)
# Create the logistic regression model and fit it to the data
lr = LogisticRegression()
lr.fit(X_train, y_train)
# Calculate the accuracy on the test set
acc = accuracy_score(y_test, lr.predict(X_test))
print("{0:.1%} accuracy on test set.".format(acc))
print(dict(zip(X.columns, abs(lr.coef_[0]).round(2))))
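To act on those coefficients, you can drop the columns whose absolute coefficient falls below a cutoff. A minimal sketch, assuming X is the original DataFrame and using an arbitrary cutoff of 0.1:
# Sketch: drop features whose absolute coefficient is below 0.1 (illustrative cutoff)
coef_abs = abs(lr.coef_[0])
low_coef_features = [col for col, c in zip(X.columns, coef_abs) if c < 0.1]
X_reduced = X.drop(columns=low_coef_features)
print("Dropped {} low-coefficient features, {} remain".format(len(low_coef_features), X_reduced.shape[1]))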
Method 5: Calculating feature importance using XGBoost.
Feature importance gives you a score for each feature of your data; the higher the score, the more important or relevant the feature is to your output variable.
import xgboost as xgb
import matplotlib.pyplot as plt
housing_dmatrix = xgb.DMatrix(X, y)
# Create the parameter dictionary: params
params = {"objective": "reg:squarederror", "max_depth": 4}
# Train the model: xg_reg
xg_reg = xgb.train(params=params, dtrain=housing_dmatrix, num_boost_round=10)
# Plot the feature importances
xgb.plot_importance(xg_reg)
plt.show()
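If you then want to keep only the highest-scoring features, one option is to read the scores back from the booster. This is just a sketch: it assumes X is a pandas DataFrame (so the column names carry through to the booster), and the top-10 cutoff is arbitrary.
# Sketch: keep the 10 highest-scoring features (arbitrary cutoff)
importance = xg_reg.get_score(importance_type='weight')
top_features = sorted(importance, key=importance.get, reverse=True)[:10]
X_reduced = X[top_features]
print("Reduced to {} features".format(X_reduced.shape[1]))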
Method 6: Feature importance using the Extra Trees classifier.
Tree-based estimators (see the sklearn.tree module and the forests of trees in the sklearn.ensemble module) can be used to compute feature importances, which in turn can be used to discard irrelevant features.
X = df.iloc[:,0:370] #independent columns
y = df.iloc[:,-1] #target column
from sklearn.ensemble import ExtraTreesClassifier
import matplotlib.pyplot as plt
model = ExtraTreesClassifier()
model.fit(X,y)
print(model.feature_importances_)
#use inbuilt class feature_importances of tree based classifiers
#plot graph of feature importances for better visualization
feat_importances = pd.Series(model.feature_importances_, index=X.columns)
feat_importances.nlargest(20).plot(kind='barh')
plt.show()
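To turn those scores into a reduced dataset, you can keep only the top-ranked columns. A minimal sketch, reusing the feat_importances series above (keeping 20 features simply matches the bar chart and is an arbitrary choice):
# Sketch: keep the 20 features with the highest importance scores
top_cols = feat_importances.nlargest(20).index
X_reduced = X[top_cols]
print("Reduced from {} to {} features".format(X.shape[1], X_reduced.shape[1]))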
Method 7: Recursive Feature Elimination (RFE)
Given an external estimator that assigns weights to features (e.g., the coefficients of a linear model), the goal of recursive feature elimination (RFE) is to select features by recursively considering smaller and smaller sets of features. First, the estimator is trained on the initial set of features and the importance of each feature is obtained either through a coef_ attribute or through a feature_importances_ attribute. Then, the least important features are pruned from the current set of features. That procedure is recursively repeated on the pruned set until the desired number of features to select is eventually reached.
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
rfe = RFE(estimator=RandomForestClassifier(random_state=0), n_features_to_select=3, step=2, verbose=1)
rfe.fit(X_train,y_train)
mask=rfe.support_
X_new=X.loc[:,mask]
print(X_new.columns)
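Besides the boolean mask, the fitted selector also exposes a ranking. A quick sketch for inspecting it (assuming, as above, that X is a DataFrame and pandas is already imported; rank 1 means the feature survived, higher ranks were eliminated earlier):
# Sketch: inspect the RFE ranking of each feature
ranking = pd.Series(rfe.ranking_, index=X.columns).sort_values()
print(ranking.head(10))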
Method 8: Univariate Feature Selection (ANOVA)
This works by selecting the best features based on univariate statistical tests (ANOVA). The methods based on the F-test estimate the degree of linear dependency between two random variables: they assume a linear relationship between the feature and the target, and that the variables follow a Gaussian distribution.
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import f_classif, f_regression
from sklearn.feature_selection import SelectKBest, SelectPercentile
df= pd.read_csv('./train.csv')
X = df.drop(['ID','TARGET'], axis=1)
y = df['TARGET']
df.head()
# Calculate Univariate Statistical measure between each variable and target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
univariate = f_classif(X_train.fillna(0), y_train)
# Capture P values in a series
univariate = pd.Series(univariate[1])
univariate.index = X_train.columns
univariate.sort_values(ascending=False, inplace=True)
# Plot the P values
univariate.sort_values(ascending=False).plot.bar(figsize=(20,8))
# Select K best Features
k_best_features = SelectKBest(f_classif, k=10).fit(X_train.fillna(0), y_train)
X_train.columns[k_best_features.get_support()]
# Apply the transformed features to dataset
X_train = k_best_features.transform(X_train.fillna(0))
X_train.shape
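SelectPercentile, imported above but not used, works the same way except that you specify the share of features to keep rather than a fixed k. A minimal sketch (fitting on the full X and y here purely for brevity; the 10% cutoff is arbitrary):
# Sketch: keep the top 10% of features by ANOVA F-score
percentile_selector = SelectPercentile(f_classif, percentile=10).fit(X.fillna(0), y)
selected_cols = X.columns[percentile_selector.get_support()]
print("{} features selected".format(len(selected_cols)))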
Dimension Reduction Techniques
PCA (Principal Component Analysis):
The original data has 9 columns. In this section, the code projects the original 9-dimensional data into 2 dimensions. Note that after dimensionality reduction there usually isn't a particular meaning assigned to each principal component; the new components are just the two main dimensions of variation.
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
dt=pd.read_csv('./dataset.csv')
X=dt.iloc[0:,0:-1]
y=dt.iloc[:,-1]
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(X)
principalDf = pd.DataFrame(data=principalComponents,
                           columns=['principal component 1', 'principal component 2'])
print("Dimension of dataframe before PCA",dt.shape)
print("Dimension of dataframe after PCA",principalDf.shape)
print(principalDf.head())
finalDf = pd.concat([principalDf, y], axis = 1)
print("finalDf")
print(finalDf.head())
#Visualize 2D Projection
fig = plt.figure(figsize = (8,8))
ax = fig.add_subplot(1,1,1)
ax.set_xlabel('Principal Component 1', fontsize = 15)
ax.set_ylabel('Principal Component 2', fontsize = 15)
ax.set_title('2 component PCA', fontsize = 20)
targets = [0, 1]
colors = ['r', 'g']
for target, color in zip(targets, colors):
    indicesToKeep = finalDf['Class'] == target
    ax.scatter(finalDf.loc[indicesToKeep, 'principal component 1'],
               finalDf.loc[indicesToKeep, 'principal component 2'],
               c=color, s=50)
ax.legend(targets)
ax.grid()
plt.show()
The explained variance tells you how much information (variance) can be attributed to each of the principal components. This matters because, while you can project a 371-dimensional space down to 2 dimensions, you lose some of the variance (information) when you do so. Using the attribute explained_variance_ratio_, you can see that the first principal component contains 88.85% of the variance and the second principal component contains 0.06% of the variance. Together, the two components contain 88.91% of the information.
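Those figures come straight from the fitted PCA object's explained_variance_ratio_ attribute; a quick check (the exact percentages will of course differ on your own data) looks like this:
# Proportion of variance captured by each principal component
print(pca.explained_variance_ratio_)
print("Total variance retained: {:.2%}".format(pca.explained_variance_ratio_.sum()))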