Feature selection using the Boruta algorithm

Feature selection is an important step in building a predictive model. We need to ensure the model is not influenced by irrelevant features in the data set and does not pick up patterns from noise. For this reason, we remove unnecessary or noisy features, which in turn reduces the dimension of the data set. Popular choices for dimensionality reduction include Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA) and t-distributed Stochastic Neighbor Embedding (t-SNE), all of which transform the original features. There is another very interesting approach, the Boruta algorithm, which instead selects a subset of the original features and is probably the easiest to interpret and to implement from scratch (programmatically).

Boruta Algorithm

The Boruta algorithm was released as an R package back in 2010. It is designed as a wrapper around any classification algorithm that can return importance scores for all features. Features that a statistical test deems less relevant for model building are removed iteratively. The following example on a dummy data set will make this easier to explain:

Case study on dummy data of students

Assume student data with three features study_hours_in_a_week, height(in cms), weight(in kg) and a target variable marks. The dummy data set has 5 observations in each column. The code snippet and output for the same are shown below:

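A minimal sketch of this step in Python, assuming pandas (the values below are purely illustrative placeholders, not the original data), could look like this:

    import pandas as pd

    # Dummy student data: 5 observations, three features and a target (marks).
    # The numbers are illustrative placeholders only.
    df = pd.DataFrame({
        'study_hours_in_a_week': [20, 12, 35, 8, 25],
        'height(in cms)':        [165, 172, 158, 180, 170],
        'weight(in kg)':         [58, 70, 52, 80, 65],
        'marks':                 [75, 55, 90, 40, 80],
    })

    X = df.drop(columns=['marks'])   # independent variables
    y = df['marks']                  # dependent variable
    print(df)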

For ease of understanding, let's call study_hours_in_a_week, height(in cms) and weight(in kg) X (independent variables) and marks y (dependent variable). The goal of building a predictive model on this data set is to predict the marks scored by a student based on the available features. Trying to predict a student's academic performance from their weight and height may seem strange, but it helps keep this example simple. Also, a real world data set usually has far more than three features, which makes it difficult to rule them out individually and would require very strong and detailed domain knowledge. This is why we use a feature selection algorithm that can automate the process for us.

A convenient way to go about this would be to pick a classifier that returns feature importances, decide on a qualifying threshold, and select only those features whose importance is greater than that threshold. Since Random Forest is one such classifier, we can combine it with the Boruta algorithm quite effortlessly. Boruta works smoothly without any input from the user, which solves the problem of deciding on a threshold for feature importance.

It works on the following very straightforward and effective principles:

The algorithm checks whether a feature adds value to the predictive power of the model by pitting it against a shuffled version of itself. That is, for every feature in the data set it adds a 'shadow' column containing the same values in a randomized order. To make things clearer, the following is the code and the resulting data set after we shuffle each feature and append the shuffled copies to the data frame column-wise:

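A minimal sketch of the shuffling step, continuing with X from the snippet above and assuming numpy for the permutation:

    import numpy as np

    rng = np.random.default_rng(42)

    # Create a shuffled ("shadow") copy of every original feature and append it
    # column-wise, giving six candidate features in total.
    X_shadow = X.apply(lambda col: rng.permutation(col.values))
    X_shadow.columns = ['shadow_' + c for c in X.columns]
    X_boruta = pd.concat([X, X_shadow], axis=1)
    print(X_boruta)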

The next step is to fit a random forest regressor to this data, not with the aim of making an accurate prediction but to obtain feature importance values for all six features. Once we have the importance of every original feature, we compare it with a threshold decided by the Boruta algorithm: the highest importance value recorded among the shuffled (shadow) features. When the importance of a feature is lower than this threshold, we drop the feature from the model building process. The logic is that a feature adds value to the predictive power of a model only if it is more important than the best performing shuffled feature. This way of calculating the threshold seems fair, because if our features are less important than randomized variables, we are basically feeding garbage to the machine learning model, which works on 'garbage in, garbage out'. The code and output for these steps are shown below:

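A sketch of this step, assuming scikit-learn's RandomForestRegressor and the X_boruta frame built above (the exact importance values depend on the data and random seed):

    from sklearn.ensemble import RandomForestRegressor

    # Fit a random forest on all six columns purely to obtain feature importances.
    forest = RandomForestRegressor(n_estimators=100, max_depth=5, random_state=42)
    forest.fit(X_boruta, y)

    importances = pd.Series(forest.feature_importances_, index=X_boruta.columns)
    original_imp = importances[X.columns]
    shadow_imp = importances[X_shadow.columns]

    # Boruta's threshold: the highest importance among the shadow features.
    threshold = shadow_imp.max()
    print('threshold:', round(threshold, 2))
    print(original_imp[original_imp > threshold])   # features that make the cut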

The threshold comes out to be approximately 17% (the maximum of 10%, 12% and 17%). Hence, from the above result, we can see that two of the three features (study_hours_in_a_week and height(in cms)) made the cut, whereas weight (12%) had a lower importance score than the threshold. This seems fair, because a student's weight does not look like a decisive factor in their academic performance. However, we cannot (yet) draw any conclusions with certainty, since there is a chance this was a statistical fluke and weight is actually a contributing factor in predicting the marks of a student. There is also a possibility that study_hours_in_a_week and height(in cms) simply got lucky in this trial. To gain confidence, we need to check whether we get similar results on a majority of the runs of this process.

Any machine learning project is an iterative process where we try out different models on different subsets of data, after multiple preprocessing techniques, to get the best results. Similarly, we will rely on iterations to become more confident about the result of the previous step. Quite predictably, trusting the outcome of 20 trials is a more reliable measure than drawing insights from a single trial. The following code and output repeat this process 20 times and count, for each of the original features (study_hours_in_a_week, height(in cms) and weight(in kg)), in how many instances its feature importance exceeds the threshold:

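A sketch of the repeated-trials loop, reusing X, y, rng and the imports from the snippets above:

    # Repeat the shadow-feature experiment 20 times and count, for each original
    # feature, how often its importance beats that trial's shadow threshold ("hits").
    n_trials = 20
    hits = pd.Series(0, index=X.columns)

    for trial in range(n_trials):
        X_shadow = X.apply(lambda col: rng.permutation(col.values))
        X_shadow.columns = ['shadow_' + c for c in X.columns]
        X_boruta = pd.concat([X, X_shadow], axis=1)

        forest = RandomForestRegressor(n_estimators=100, max_depth=5, random_state=trial)
        forest.fit(X_boruta, y)

        importances = pd.Series(forest.feature_importances_, index=X_boruta.columns)
        threshold = importances[X_shadow.columns].max()
        hits += (importances[X.columns] > threshold).astype(int)

    print(hits)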




The above output shows that over 20 trials there were 14 instances where study_hours_in_a_week had a feature importance greater than the threshold, 3 instances for height(in cms) and just 1 instance for weight(in kg). Had the counts for weight and height been zero, we could have easily discarded them from the data set. The situation becomes a bit trickier now that all three features have proven to be of importance in at least one trial. To solve this conundrum, we need a threshold of acceptance, i.e. a number of hits above which a feature can be considered for the model building process. This is where it gets interesting, because the algorithm finds a workaround without taking any user input for this threshold.

Consider a scenario where no feature selection algorithm exists and we have to rely on our own knowledge to keep or discard a feature. In that case, the most neutral assumption is that a feature is equally likely to be important or irrelevant for making predictions. This means that in every trial there is a 50% chance of the feature coming out as important, and since every trial has just two possible outcomes for a feature (important or not important), the number of hits over a series of trials follows a binomial distribution (here with n = 20 and p = 0.5). In Python, the probability mass function of a binomial distribution can be computed easily using the scipy library's binom.pmf function. The code in the next block plots the binomial distribution over the trials to help us decide which features should be kept or discarded.

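A minimal sketch of that plot, using scipy's binom.pmf and matplotlib (the exact boundaries of the coloured regions depend on how large a tail one chooses, so they are only described in a comment here):

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.stats import binom

    n_trials = 20
    k = np.arange(n_trials + 1)

    # Probability of a feature beating the threshold in exactly k of 20 trials
    # when each trial is a 50/50 coin flip (i.e. the feature is pure noise).
    pmf = binom.pmf(k, n_trials, 0.5)

    plt.bar(k, pmf)
    plt.xlabel('number of trials in which a feature beats the threshold (hits)')
    plt.ylabel('probability')
    plt.title('Binomial(n=20, p=0.5) distribution of hits')
    # Boruta shades the extreme left tail red ("discard"), the extreme right
    # tail green ("keep") and the mass in between blue ("speculative").
    plt.show()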

In the above plot, we can see three regions highlighted in red, blue and green, as described in the legend. Boruta does not define a single hard threshold on the basis of which we can keep or discard features. Instead, it creates three areas - Discard (red), Speculative (blue) and Keep (green) - to help us identify the important features. These areas are defined by the two extreme ends, the tails, of the distribution. Hence, after completing 20 trials we can draw the following conclusions:

  1. To predict the academic performance of a student, the feature study_hours_in_a_week is critical, whereas the feature weight(in kg) is irrelevant.
  2. Since height(in cms) falls in the blue area, it lies in the speculative region and the algorithm cannot make a statement about it with certainty. The feature can be kept just to be safe, or dropped based on domain knowledge.

These were the final results from implementing the Boruta feature selection algorithm on our dummy data set. The code included is an implementation from scratch (more or less) in Python. However, there is a Python library called BorutaPy that can do all of this in fewer lines of code. Here is the link to the IPython notebook for your reference.
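For reference, a sketch of how BorutaPy is typically used, continuing with the X and y defined earlier and assuming the boruta package is installed:

    from boruta import BorutaPy
    from sklearn.ensemble import RandomForestRegressor

    forest = RandomForestRegressor(max_depth=5, random_state=42)
    selector = BorutaPy(forest, n_estimators='auto', max_iter=20, random_state=42)

    # BorutaPy expects numpy arrays rather than pandas objects.
    selector.fit(X.values, y.values)

    print(dict(zip(X.columns, selector.support_)))   # True for confirmed features
    print(dict(zip(X.columns, selector.ranking_)))   # lower rank = more important
    X_selected = selector.transform(X.values)        # keep only confirmed columns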

Conclusion

Feature selection is a critical step in any machine learning project. If done carefully, it removes noise and prevents the model from picking up on spurious patterns. Boruta helps us do exactly this, and with the repeated-trials exercise it backs up its own results with a test of statistical significance.
