Handling missing values in machine learning datasets
Deepak Kumar
Why read this?
Missing data is a well-known problem in data science. If you want to learn feature engineering methods for handling it, this document will help.
Technical explanation
In statistics, imputation is the process of replacing missing data with substituted values.
Impact of missing data
Missing data can introduce substantial bias, make handling and analysis of the data more arduous, and reduce efficiency.
Benefit of imputation
Imputation preserves all cases by replacing missing data with an estimated value based on other available information.
Imputation Techniques
No single imputation approach fits all problems. Instead, we need to choose the right approach for the problem at hand. The approaches below are commonly useful.
Dropping rows with null values
Ensure that dropping rows will not cause the models we build to lose generalizability.
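A minimal sketch of row dropping with pandas, using a small made-up dataset (the column names and values are illustrative only):

```python
import numpy as np
import pandas as pd

# Toy dataset with missing values (illustrative only)
df = pd.DataFrame({
    "age": [25, np.nan, 31, 47],
    "income": [50_000, 62_000, np.nan, 81_000],
    "label": [0, 1, 0, 1],
})

# Drop any row that contains a null value
clean = df.dropna()
print(len(clean))  # 2 of the 4 rows survive
```

Before committing to this, check how many rows survive: if `dropna()` removes a large fraction of the data, the remaining sample may no longer represent the population.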
Dropping features with high nullity
Before dropping features outright, consider subsetting the part of the dataset where the feature is available and checking its feature importance when a model is trained on that subset. If you discover that the variable is important in the subset where it is defined, consider making an effort to retain it.
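A sketch of measuring per-feature nullity and dropping columns above a threshold; the column names and the 0.5 cutoff are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Toy dataset: "f2" is mostly missing (illustrative only)
df = pd.DataFrame({
    "f1": [1.0, 2.0, np.nan, 4.0, 5.0],
    "f2": [np.nan, np.nan, np.nan, 1.0, np.nan],  # 80% missing
    "y":  [0, 1, 0, 1, 1],
})

# Fraction of nulls per column ("nullity")
nullity = df.isna().mean()

# Drop features whose nullity exceeds the chosen threshold.
# Before doing this, you could train a model on df.dropna(subset=["f2"])
# to check whether "f2" is actually important in the rows where it exists.
threshold = 0.5
reduced = df.drop(columns=nullity[nullity > threshold].index)
print(list(reduced.columns))
```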
Use statistics to approximate values
Mean substitution (see the picture below) and regression-based imputation are two examples of statistical methods. This document lists several such approaches.
References
Thanks to these helping hands:
https://en.wikipedia.org/wiki/Imputation_(statistics)
https://www.kaggle.com/residentmario/simple-techniques-for-missing-data-imputation
https://expertseoinfo.com/missing-data-imputation-feature-engineering/
https://images.app.goo.gl/4QtWY4SvKVJuVQqu8
https://images.app.goo.gl/hZvmaMtzY7hzVmv86