Handling missing values in machine learning datasets


Why read this?

Missing data is a well-known problem in data science. If you want to learn about feature engineering methods for handling it, this document can help.

Technical explanation

In statistics, imputation is the process of replacing missing data with substituted values.

Impact of missing data

Missing data can introduce substantial bias, make the data harder to handle and analyze, and reduce efficiency.

Benefit of imputation

Imputation preserves all cases by replacing missing data with an estimated value based on other available information.

Imputation Techniques

No single imputation approach fits all problems. Instead, we need to choose the right approach based on the problem at hand. Some useful approaches are described below.

Dropping rows with null values

Ensure that dropping these rows does not reduce the generalizability of the models we build.
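
A minimal sketch of this check, using only the Python standard library. The dataset, the column names (age, income, label) and the helper functions are hypothetical and exist only for illustration. The idea is to drop incomplete rows and then compare how much data survives and whether the target distribution has shifted, since a large shift hints that the remaining rows may no longer represent the full population.

```python
# Hypothetical toy dataset: None marks a missing value.
rows = [
    {"age": 25, "income": 50000, "label": 1},
    {"age": None, "income": 62000, "label": 0},
    {"age": 41, "income": None, "label": 1},
    {"age": 33, "income": 58000, "label": 0},
    {"age": 52, "income": 71000, "label": 1},
]


def drop_rows_with_nulls(rows):
    """Keep only the rows in which every field has a value."""
    return [r for r in rows if all(v is not None for v in r.values())]


def positive_rate(rows, target="label"):
    """Share of rows whose target equals 1 (assumes a 0/1 target)."""
    return sum(r[target] for r in rows) / len(rows)


complete = drop_rows_with_nulls(rows)
retained = len(complete) / len(rows)

# Compare the dataset before and after dropping rows.
print(f"rows kept: {len(complete)}/{len(rows)} ({retained:.0%})")
print(f"positive rate before: {positive_rate(rows):.2f}")
print(f"positive rate after:  {positive_rate(complete):.2f}")
```

In a pandas workflow the same drop is usually done with `DataFrame.dropna()`; the comparison before and after is the part that matters here.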

Dropping features with high nullity

Before dropping features outright, consider subsetting the part of the dataset for which the feature is available and checking its feature importance when a model is trained on this subset. If you discover that the variable is important in the subset where it is defined, consider making an effort to retain it.
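
The check above might be sketched as follows. This is an illustration under stated assumptions, not a definitive method: the data and the column names (sensor, target) are hypothetical, and the Pearson correlation with the target is used as a crude stand-in for a model-based feature importance, which would normally come from a model trained on the subset.

```python
import math

# Hypothetical dataset: "sensor" is missing in half of the rows.
rows = [
    {"sensor": 1.0, "target": 2.1},
    {"sensor": None, "target": 3.9},
    {"sensor": 2.0, "target": 4.2},
    {"sensor": None, "target": 1.0},
    {"sensor": 3.0, "target": 5.9},
    {"sensor": None, "target": 7.5},
    {"sensor": 4.0, "target": 8.1},
    {"sensor": None, "target": 2.2},
]


def nullity(rows, column):
    """Fraction of rows in which the column is missing."""
    return sum(r[column] is None for r in rows) / len(rows)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Subset of the data where the feature is defined.
subset = [r for r in rows if r["sensor"] is not None]

missing_share = nullity(rows, "sensor")
corr = pearson([r["sensor"] for r in subset], [r["target"] for r in subset])

print(f"nullity of 'sensor': {missing_share:.0%}")
print(f"correlation with target on the defined subset: {corr:.3f}")
```

A feature that is half empty but strongly related to the target on the rows where it exists, as in this toy data, is a candidate for imputation rather than removal.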

Use statistics to approximate values

Mean substitution (see the image below) and regression imputation are a few examples of statistical methods. This document outlines such approaches.
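
The two statistical methods named above can be sketched in a few lines of standard-library Python. The function names and sample values are hypothetical, and the regression variant assumes a single, fully observed predictor with a roughly linear relation to the column being filled.

```python
from statistics import mean


def mean_impute(values):
    """Replace each missing value (None) with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def regression_impute(xs, ys):
    """Fill missing ys using a least-squares line fitted on complete pairs.

    Assumes every x is observed and at least two distinct x values
    have an observed y.
    """
    pairs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    mx = mean([x for x, _ in pairs])
    my = mean([y for _, y in pairs])
    slope = (
        sum((x - mx) * (y - my) for x, y in pairs)
        / sum((x - mx) ** 2 for x, _ in pairs)
    )
    intercept = my - slope * mx
    return [intercept + slope * x if y is None else y for x, y in zip(xs, ys)]


# Mean substitution: the gap is filled with the mean of 10, 20 and 30.
scores = mean_impute([10, None, 20, 30])

# Regression imputation: the gap is predicted from x using the fitted line.
filled = regression_impute([1, 2, 3, 4], [2.0, 4.0, None, 8.0])

print(scores)
print(filled)
```

Mean substitution is the simpler of the two but pulls every filled value to the same point, which tends to shrink the variance of the column. Regression imputation uses the relationship with another variable, at the cost of assuming that relationship holds for the missing rows too.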

[Image: mean substitution]


Reference
Thanks to these helping hands
https://en.wikipedia.org/wiki/Imputation_(statistics)

https://www.kaggle.com/residentmario/simple-techniques-for-missing-data-imputation

https://expertseoinfo.com/missing-data-imputation-feature-engineering/

https://images.app.goo.gl/4QtWY4SvKVJuVQqu8

https://images.app.goo.gl/hZvmaMtzY7hzVmv86
