Feature Selection in Machine Learning, Version 1.0 ('Layman Words') !!
What is Feature Selection?
Let's understand it in 'Layman Words' !!
Yesterday I was talking to someone and asked her out for tea and a sutta (a smoke). She answered, "You know, I have a friend in Bangalore, and I am not going out with him either."
Here is my answer to her:
"You already know him, but I am an unknown variable that could be the best selection for your model. Let's explore the features and observe the correlation, and everything else can be taken forward depending on under-fitting or over-fitting!!" (Her reply: "You are mad, you relate everything to Data Science, and I am fed up with hearing all these concepts." ;))
What can we conclude from the above conversation?
Data Science is not new at all; we all use it in some way when we reason about our everyday situations. First try to relate your surroundings to Data Science. The day you understand that is the day you move ahead.
Before going forward, as always there will be some newbies, so let's discuss the difference between predictors and the target variable.
Let me ask you: on which factors does it depend how much your father spends every month? Here the amount spent monthly is the target variable (or dependent variable). It depends on your household expenditure, your tuition fee, your pocket money and many other factors that keep changing according to needs; your dad's salary can be one more such factor. All these factors are known as predictors or independent variables.
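To make the split concrete, here is a minimal sketch in plain Python. The column names (household_expenses, tuition_fee, pocket_money, monthly_spend) and all the numbers are made up for illustration; the only point is that one column is set aside as the target Y and everything else becomes a predictor.

```python
# Hypothetical monthly records; every name and number here is invented.
records = [
    {"household_expenses": 20000, "tuition_fee": 5000, "pocket_money": 1000, "monthly_spend": 26000},
    {"household_expenses": 22000, "tuition_fee": 5000, "pocket_money": 1500, "monthly_spend": 28500},
    {"household_expenses": 18000, "tuition_fee": 7000, "pocket_money": 1200, "monthly_spend": 26200},
]

# The column we want to predict (the target / dependent variable Y).
TARGET = "monthly_spend"

# Every other column is a predictor (independent variable).
predictor_names = [name for name in records[0] if name != TARGET]

# X holds one row of predictor values per record, y holds the matching targets.
X = [[row[name] for name in predictor_names] for row in records]
y = [row[TARGET] for row in records]

print(predictor_names)
print(X[0], "->", y[0])
```

Whatever library you end up using later, this X and y shape (rows of predictors plus one target value per row) is the usual starting point.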
Machine Learning works on a simple rule. If you think about it in a more complicated way, it will only be more confusing to start with. Think of your dustbin: if you put all your garbage (wet and dry both) in together and keep it for a week, it will smell like hell, which can be bad for your health. If you separate dry and wet, it is far more convenient even after a week. In the end we can conclude: if you put garbage in, you will get only garbage out (garbage here means noise in the data).
In machine learning, feature selection is the process of choosing the variables that make sense for your business and are useful in predicting the target variable (Y). Proceeding randomly without knowing your problem statement will lead you nowhere, so it is considered good practice to understand which features are important while building predictive models.
When we deal with real-world datasets, we commonly find columns that are nothing but noise.
Such variables take up space, time and computational resources, and that gets costly, especially with large datasets.
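One very simple kind of "nothing but noise" column is one that holds the same value in every row, since it cannot help tell one row from another. The sketch below drops such columns using plain Python. The table, the column names and the helper drop_constant_columns are all hypothetical, and this is only one of many possible checks, not a complete noise filter.

```python
# Hypothetical table stored column by column; all values are invented.
columns = {
    "tuition_fee": [5000, 5000, 7000, 6000],
    "pocket_money": [1000, 1500, 1200, 900],
    "country_code": [91, 91, 91, 91],  # same value in every row
}

def drop_constant_columns(table):
    """Return a copy of the table without columns that hold one repeated value."""
    return {name: values for name, values in table.items() if len(set(values)) > 1}

kept = drop_constant_columns(columns)
print(list(kept))
```

Here country_code is dropped and the other two columns are kept, so the model has less data to store and process.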
It is always hard to decide when we have a variable that makes sense to the business but we are not sure whether it will help in predicting the target variable (Y). Another crucial point that everyone should be aware of while working on a problem: a feature that is useful in one ML algorithm (say, a random forest) may not be as effective in another (like a decision tree).
It is also possible that a variable which seems to explain little of the response variable (Y) on its own becomes useful when combined with other predictors. In other words, a variable may have a low correlation with Y, but in the presence of other variables it can help explain patterns or unexpected relationships that the other variables cannot explain at all.
In most cases, it is hard to decide whether to include or exclude such variables.
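A tiny toy example shows how a variable can look useless alone and still matter in combination. In the sketch below (plain Python, made-up data), Y is +1 when two predictors a and b have the same sign and -1 otherwise. The pearson helper is written here just for this illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy data, invented for illustration: Y is +1 when the two predictors
# agree in sign and -1 when they disagree.
a = [-1, -1, 1, 1]
b = [-1, 1, -1, 1]
y = [ai * bi for ai, bi in zip(a, b)]

# Each predictor on its own has zero correlation with Y ...
print(pearson(a, y))  # 0.0
print(pearson(b, y))  # 0.0

# ... but the two combined (here, their product) explain Y completely.
combined = [ai * bi for ai, bi in zip(a, b)]
print(pearson(combined, y))  # 1.0
```

Real data is never this clean, but the lesson carries over: throwing away a or b because of its low individual correlation would also throw away the combined signal.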
We are going to discuss strategies that can help you with such problems and, most importantly, let you analyze whether a particular variable is important and how much it contributes to your model.
Note: It is usually best to select variables that relate to the business logic, even when it is hard to find a correlation. You will certainly come across variables that are not correlated with Y on their own but, combined with other predictors, give a meaningful result for your target variable Y.
Advantages of Feature Selection !!
1.) Training of the machine learning model will be faster.
2.) The complexity of your model will be reduced & it will be easier to interpret.
3.) The accuracy of your model can improve if you select the right features, or combine them with other predictors that give meaningful results for your target variable Y.
4.) Last but not least, if you get your hands dirty with feature selection, it can reduce over-fitting. It takes a ton of practice to become an expert at it.
Keep on reading. From tomorrow we will get our hands dirty with code to understand 'feature selection' deeply.
Happy Learning & Keep Supporting!!!