Feature Selection using Filter Methods

Do you have a large number of features in your data? Does that make your model better or worse?

An ideal model is one that gives the best predictive accuracy while remaining simple to interpret and use. With many features, the time and space complexity of any classifier or regressor increases, while performance can actually decrease and the model becomes harder to interpret. Feature selection is one technique for addressing this.

Feature selection can use supervised or unsupervised techniques. In simple terms, supervised methods use the target/dependent variable and unsupervised methods do not.

Supervised feature selection methods may further be classified into three groups:

Classification of supervised Feature Selection

1. Embedded Methods: Here feature selection is implemented by algorithms that have their own built-in feature selection mechanism. These algorithms perform automatic feature selection during training.

2. Wrapper Methods: Wrapper methods evaluate multiple models, using procedures that add and/or remove features to find the combination that maximizes model performance; the model that performs best with an optimal subset of features is chosen. (A brief sketch of the embedded and wrapper approaches follows this list.)

3. Filter Methods: These methods use statistical measures to score and select features.
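
To make the first two families concrete, here is a minimal sketch (not part of the original analysis) using Lasso as an embedded method and recursive feature elimination (RFE) as a wrapper method; the synthetic data is purely illustrative:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.feature_selection import RFE

# Synthetic data purely for illustration: 10 features, 4 of them informative
X, y = make_regression(n_samples=200, n_features=10, n_informative=4, random_state=0)

# Embedded: Lasso shrinks the coefficients of uninformative features towards zero
lasso = Lasso(alpha=1.0).fit(X, y)
print('Embedded (Lasso) keeps features:', np.where(lasso.coef_ != 0)[0])

# Wrapper: RFE repeatedly fits a model and drops the weakest feature
rfe = RFE(LinearRegression(), n_features_to_select=4).fit(X, y)
print('Wrapper (RFE) keeps features:', np.where(rfe.support_)[0])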

Filter Methods

It is a univariate feature selection approach: the predictive power of each input variable is evaluated on its own, based on its relationship with the output variable of interest.

Issue with filter methods:

Filter methods are univariate in nature because the statistical measure is applied to one input variable at a time. This means that interactions between the predictors are not considered. For example, if we use the correlation coefficient and select every feature that is highly correlated with the target, what about the correlation between the features themselves? This can cause multicollinearity, which violates the assumptions of some predictive models (e.g. linear and logistic regression).
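
A quick way to see the problem is to check the correlation between the selected predictors themselves, not just their correlation with the target. A minimal sketch (the DataFrame passed in and the 0.8 threshold are assumptions for illustration):

import pandas as pd

def highly_correlated_pairs(features: pd.DataFrame, threshold: float = 0.8):
    # Flag pairs of predictors whose absolute pairwise correlation exceeds the
    # threshold; a purely target-based filter never looks at these pairs
    corr = features.corr().abs()
    pairs = []
    for i in range(len(corr.columns)):
        for j in range(i + 1, len(corr.columns)):
            if corr.iloc[i, j] > threshold:
                pairs.append((corr.columns[i], corr.columns[j], round(corr.iloc[i, j], 3)))
    return pairs

# Example usage on any DataFrame of selected predictors (name is hypothetical):
# highly_correlated_pairs(selected_features_df)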

Based on the type of input and output variable there are different statistical measures that can be used for Feature Selection.

Statistical Measures

1. Continuous Input and Continuous Output

When the output variable is continuous we can fit a regression model. It becomes necessary to eliminate redundant variables while fitting the model, as they cause problems such as overfitting. A parsimonious model gives good predictions.

About the data: Consider a data set concerning the hardening of cement. The researchers were interested in how the composition of the cement affected the heat evolved during hardening, so they measured and recorded the following data on 13 batches of cement. The variables in this model were:

1. Response y: heat evolved in calories during hardening of cement, on a per-gram basis

2. Predictor x1: % of tricalcium aluminate

3. Predictor x2: % of tricalcium silicate

4. Predictor x3: % of tetracalcium alumino ferrite

5. Predictor x4: % of dicalcium silicate

a. Using the pandas Pearson correlation:

The correlation coefficient (r) is a measure of the linear association between the response and an input variable. The basic idea is that input variables more strongly related to the response carry more information; in statistical terms, the variability in the response is better explained by a model whose inputs have higher |r| values. The value of this measure lies between -1 and +1.

  • If the value is close to +1, there is a strong positive relation between the response and the input variable.
  • If the value is close to -1, there is a strong negative relation between the response and the input variable.
  • If the value is close to 0, there is a weak or no relation between the response and the input variable.

import pandas as pd
import numpy as np

# Load the cement data and compute the pairwise Pearson correlation matrix
cm = pd.read_excel('cement.xlsx')
corr = cm.corr()
cm.head()
Data





Eliminate all features whose absolute correlation with the response is less than 0.6, the threshold chosen here.

# Keep the response (column 0) and every feature whose absolute correlation
# with the response is at least the 0.6 threshold
columns = np.full((corr.shape[0],), True, dtype=bool)
for j in range(1, corr.shape[0]):
    if abs(corr.iloc[0, j]) < 0.6:
        columns[j] = False

selected_columns = cm.columns[columns]
selected_columns.shape


new_cm = cm[selected_columns]
new_cm.head()        





b. Using the f_regression() function from the scikit-learn library

The f_regression() function in scikit-learn's feature_selection module works on these principles:

1. The regressor of interest and the response variable are centered (orthogonalized with respect to a constant term).

2. The Pearson correlation between this regressor and the response is computed.

3. The correlation is converted to an F score and then to a p-value, which are returned.

The SelectKBest class then selects features according to the k highest scores. When there are several input variables, the score is computed for one variable at a time.

from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import train_test_split

def featureSelection(X_train, y_train, X_test):
    # configure to select all features
    fs = SelectKBest(score_func=f_regression, k='all')
    # learn the relationship from the training data
    fs.fit(X_train, y_train)
    # transform the train input data
    X_train_fs = fs.transform(X_train)
    # transform the test input data
    X_test_fs = fs.transform(X_test)
    return X_train_fs, X_test_fs, fs



# Response (column 0) and predictors (remaining columns)
y = cm.iloc[:, 0]
X = cm.iloc[:, 1:]

labels = ['x1', 'x2', 'x3', 'x4']

# Split into train and test sets
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Feature selection
X_train_fs, X_test_fs, fs = featureSelection(x_train, y_train, x_test)

# Print the F scores
for i in range(len(fs.scores_)):
    print('Feature %s: %f' % (labels[i], fs.scores_[i]))
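
As a sanity check on these scores: for a single feature, the F statistic reported by f_regression is just a transformation of its Pearson correlation r with the response, F = r^2 * (n - 2) / (1 - r^2), where n is the number of training samples. A minimal sketch of that check, reusing x_train, y_train and fs from above:

# Recompute each F score from the feature's Pearson correlation with the response
n = len(y_train)
for i, name in enumerate(labels):
    r = np.corrcoef(x_train.iloc[:, i], y_train)[0, 1]
    f_from_r = r**2 * (n - 2) / (1 - r**2)
    print('%s: f_regression score = %.3f, from r = %.3f' % (name, fs.scores_[i], f_from_r))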


A regression model is then built with the selected features and its results are compared with a model that uses all the features.
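
The exact comparison is in the linked notebook; a minimal sketch of the idea, fitting a linear regression on all features versus only the features that passed the 0.6 threshold and comparing test R^2, could look like this:

from sklearn.linear_model import LinearRegression

# Selected feature names, excluding the response column itself
selected = [c for c in selected_columns if c != cm.columns[0]]

full_model = LinearRegression().fit(x_train, y_train)
reduced_model = LinearRegression().fit(x_train[selected], y_train)

print('Test R^2 with all features     :', full_model.score(x_test, y_test))
print('Test R^2 with selected features:', reduced_model.score(x_test[selected], y_test))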


The complete code is available on GitHub:

https://github.com/runaveigas/Feature-Selection-using-Filter-Methods

2. Continuous Input and Categorical Output

Student's t-test is usually used to check whether two samples were drawn from the same population, and ANOVA when more than two groups are involved. These techniques can also be adopted for feature selection.

a. Student's t-test for Feature Selection:

For a binary classification problem, the t-test can be used to select features. The idea is that a large t-statistic with a small p-value provides evidence that the distribution of values differs between the two classes, so the variable may have enough discriminative power to be included in the classification model.

Null Hypothesis: There is no significant difference between the means of the two groups.

Alternate Hypothesis: There is a significant difference between the means of the two groups.
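
Before applying it to the real data, here is a minimal self-contained sketch of the test on two synthetic groups (the numbers are made up for illustration); a feature whose p-value falls below 0.05 would be kept:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
class_0 = rng.normal(loc=10.0, scale=2.0, size=50)   # feature values in class 0
class_1 = rng.normal(loc=12.0, scale=2.0, size=50)   # feature values in class 1

t_stat, p_value = stats.ttest_ind(class_0, class_1)
print('t = %.3f, p = %.4f' % (t_stat, p_value))
if p_value < 0.05:
    print('Reject the null hypothesis: the class means differ, so keep the feature.')
else:
    print('Fail to reject the null hypothesis: the feature is not discriminative.')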

About the data: Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.

from scipy import stats

df = pd.read_csv('data.csv')
df.drop(['id', 'Unnamed: 32'], axis=1, inplace=True)

# Assumed preprocessing step: encode diagnosis as 1 (malignant) / 0 (benign)
# so it can be used as a numeric class label below
df['diagnosis'] = df['diagnosis'].map({'M': 1, 'B': 0})

df.columns

Select the features whose p-value is < 0.05 (i.e., the test rejects the null hypothesis of equal means):

new_features = []
for x in df.columns[1:]:
    # p-value of a two-sample t-test comparing the feature across the two classes
    pvalue = stats.ttest_ind(df.loc[df.diagnosis == 1][x],
                             df.loc[df.diagnosis == 0][x])[1]
    if pvalue < 0.05:
        new_features.append(x)

new_df = df[new_features]


A = new_df.columns
B = df.columns[1:]  # exclude the diagnosis column itself

print('Columns whose p-value was > 0.05 are:\n',
      list(set(A).symmetric_difference(set(B))))

b. Using the ANOVA F-test

Analysis of variance (ANOVA) is a statistical method used to check whether the means of two or more groups are significantly different from each other.

The scikit-learn library provides an implementation of the ANOVA F-test in the f_classif() function. This function can be used in a feature selection strategy such as selecting the top k most relevant features (those with the largest scores) via the SelectKBest class.

from sklearn.feature_selection import f_classif

# Split into input (X) and output (y) variables
X = df.iloc[:, 1:]
y = df.iloc[:, 0]

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

# Keep the 8 features with the highest ANOVA F scores
select = SelectKBest(score_func=f_classif, k=8)
new = select.fit_transform(X_train, y_train)

# Indices of the features that have been selected, via get_support()
cols = select.get_support(indices=True)

# Print the scores of the selected columns
for i in range(len(cols)):
    print('Feature %d: %f' % (cols[i], select.scores_[cols[i]]))




# Create a new dataframe with the selected columns
# (cols indexes the columns of X, not of df, so select from X)
features_df_new = X.iloc[:, cols]

features_df_new.columns

3. Categorical Input and Categorical Output

A categorical variable has a measurement scale consisting of a set of categories; for example, an incoming email can be 'spam' or 'not spam'. When both the input and the output are categorical we have different methods for selecting features.

Chi-Square Test for Independence

A table that cross-classifies two variables, say X and Y, with rows (r) and columns (c), where each cell contains the count of observations, is called a contingency table. The chi-square test compares two variables in a contingency table to check whether they are associated.

Expected frequencies are computed for the variables under the assumption of independence and compared with the observed frequencies. If there is a large difference between the two, the variables are associated with each other.
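
A minimal sketch of that comparison on a small, made-up 2x2 contingency table, using scipy's chi2_contingency (the counts are illustrative only):

import numpy as np
from scipy.stats import chi2_contingency

# Illustrative counts: rows could be a binary feature, columns the binary class
observed = np.array([[90, 60],
                     [30, 120]])

chi2_stat, p_value, dof, expected = chi2_contingency(observed)
print('Expected frequencies under independence:\n', expected)
print('chi2 = %.2f, p = %.4f' % (chi2_stat, p_value))
if p_value < 0.05:
    print('Large gap between observed and expected: the variables appear associated.')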

The scikit-learn library provides the chi2() function, which can be used with the SelectKBest class to select the k best features. It computes the chi-squared statistic between each non-negative feature and the class.

About the data: The Titanic dataset is used to demonstrate the chi-square test for feature selection.

from sklearn.feature_selection import chi2

# Split into input (X) and output (y) variables
# (df here is assumed to be the preprocessed Titanic data: categorical features
# label-encoded as non-negative integers, with the target in the first column)
X = df.iloc[:, 1:]
y = df.iloc[:, 0]

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

# Keep the 3 features with the highest chi-squared scores
select = SelectKBest(score_func=chi2, k=3)
new = select.fit_transform(X_train, y_train)

# Indices of the features that have been selected, via get_support()
cols = select.get_support(indices=True)

# Print the scores of the selected columns
for i in range(len(cols)):
    print('Feature %d: %f' % (cols[i], select.scores_[cols[i]]))

# Create a new dataframe with the selected columns (cols indexes X's columns)
features_df_new = X.iloc[:, cols]

features_df_new.head(3)





The complete code is available on GitHub:

https://github.com/runaveigas/Feature-Selection-using-Filter-Methods

References:

[1] Max Kuhn and Kjell Johnson, Feature Engineering and Selection: A Practical Approach for Predictive Models.

[2] Jason Brownlee, Data Preparation for Machine Learning.

[3] Penn State, Stat 501, Regression Methods, https://online.stat.psu.edu/stat501/

Author: Runa Veigas


