Why it's important to handle missing data
Anubhav Shukla
Full-stack Web Developer || GraphQl, MongoDB, Express, React, Node ( G-MERN ) || Python Fanatic || UI/UX Designer
Missing data is troublesome for both humans and machines. Let me explain with an example:
Here’s a hypothetical situation: imagine you found some ancient book (don't ask me from where). You start reading it, and after a few chapters you are hooked. You reach page 31, turn the page and WHAT!! you see that 3 pages are missing. The book jumps from page 31 to page 35. You shrug it off and continue with the book while making assumptions about what could’ve possibly happened on pages 32-34. You keep reading and encounter more missing pages, except now the book jumps from page 68 to page 76. You can’t believe your eyes as you start to read page 77. The story is no longer adding up.
Somehow you manage to reach the end of the book, but then you realize this is not really the end: 12 more pages are missing from the back. And that's how you finish reading the book. Now, if someone asks you for a summary of that book, can you give it confidently? You can, but it will not be completely accurate.
A similar thing happens with our models. A machine learning model reads a dataset the way you read that story, with each row acting like a chapter. If your dataset has missing values, your model won’t be able to fully understand what’s going on and may make inaccurate predictions.
Now let's understand the reasons behind missing data. There are 3 types of missing data:
- Missing Completely At Random (MCAR): This happens when every variable and observation has the same probability of being missing. Example: imagine you are collecting data on online shopping habits, and some users accidentally close the browser before completing the survey. The missing data is completely random and does not depend on any other observed factor.
- Missing At Random (MAR): This happens when the probability of a value being missing is related to other observed variables in the dataset, but not to the missing value itself. Example: suppose a dataset contains health indicators, lifestyle factors, and medical history, including glucose levels, and some individuals have missing glucose values. The missingness may be related to an observed factor such as the frequency of medical check-ups: participants who visit the doctor more often are more likely to have their glucose recorded, while those who visit less often are more likely to have it missing. If the missing glucose values were unrelated to any other observation, this would be MCAR; if they depended on the glucose values themselves, it would be MNAR. Here, however, the missing values are associated with an observed factor (the frequency of medical check-ups) that is available in the dataset, so the missingness is considered random with respect to glucose once check-up frequency is taken into account.
- Missing Not At Random (MNAR): This happens when the probability of a value being missing depends on the value itself, and the reasons can be unknown to us. MNAR is considered the most difficult scenario among the three types of missing data. Example: you are collecting data to see how salary changes with age, but some people decide not to disclose their age. The decision not to disclose is likely connected to their actual age; for instance, individuals in higher age brackets may be more hesitant to share it (no offence). Since the missingness depends on an unobserved factor, the age itself, this is MNAR. The three mechanisms are contrasted in the small simulation sketch below.
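To make the distinction concrete, here is a minimal sketch that simulates each mechanism on a synthetic dataset. Everything in it is an assumption made for illustration: the column names (age, checkups_per_year, glucose), the probabilities, and the thresholds are invented, not taken from any real study.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1_000

# Hypothetical dataset (column names and values are purely illustrative).
df = pd.DataFrame({
    "age": rng.integers(18, 80, n).astype(float),
    "checkups_per_year": rng.integers(0, 6, n),
    "glucose": rng.normal(100, 15, n),
})

# MCAR: every glucose reading has the same 10% chance of being lost,
# independent of everything else (e.g. a browser closed mid-survey).
mcar = df.copy()
mcar.loc[rng.random(n) < 0.10, "glucose"] = np.nan

# MAR: the chance that glucose is missing depends on another OBSERVED
# column (fewer check-ups means more likely missing), not on glucose itself.
mar = df.copy()
p = np.where(mar["checkups_per_year"] <= 1, 0.40, 0.05)
mar.loc[rng.random(n) < p, "glucose"] = np.nan

# MNAR: the chance that age is missing depends on age itself
# (older respondents assumed more reluctant to disclose it).
mnar = df.copy()
p = np.where(mnar["age"] >= 60, 0.40, 0.05)
mnar.loc[rng.random(n) < p, "age"] = np.nan

print("MCAR missing glucose:", mcar["glucose"].isna().mean())
print("MAR  missing glucose:", mar["glucose"].isna().mean())
print("MNAR missing age:    ", mnar["age"].isna().mean())
```

Notice that only in the MNAR case does the missingness depend on the very value that goes missing, which is exactly why it is the hardest to detect and correct for.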
What can you do about the data that’s missing in your dataset?
- Removing the missing data: This is a quick and simple method. But what if your dataset is small? Removing rows can lead to the loss of valuable information.
- Imputation: According to Wikipedia, in statistics, imputation is the process of replacing missing data with substituted values. Both approaches are sketched below.
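Here is a minimal sketch of both options using pandas and scikit-learn. The tiny DataFrame and its values are made up for illustration, and SimpleImputer's mean strategy is only one of several choices (median, most frequent, or model-based imputers are common alternatives).

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Tiny illustrative dataset with two missing glucose values (values are made up).
df = pd.DataFrame({
    "age":     [25, 32, 47, 51, 62],
    "glucose": [90, np.nan, 110, np.nan, 130],
})

# Option 1, removal: drop every row that contains a missing value.
# Quick and simple, but here it discards 2 of only 5 rows.
dropped = df.dropna()

# Option 2, imputation: replace each missing value with a substitute,
# in this case the mean of its column.
imputer = SimpleImputer(strategy="mean")
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(dropped)
print(imputed)
```

On a dataset this small, removal throws away 40% of the rows, which is why imputation is often preferred when data is scarce.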
Importance of Unrelated Variables in Machine Learning
In machine learning, the handling of data goes beyond the immediate task at hand. Consider this: data that seems unrelated or unimportant during initial training may hold unforeseen significance during testing or future model iterations. Here's why it matters:
1. Storage for Future Insights:
Collecting data can be challenging, and discarding seemingly unimportant information might not be the wisest move. Storing such data for later use could unveil new insights, refine models, or contribute to future analyses. A forward-thinking approach to data management ensures adaptability.
2. Dynamic Importance:
The importance of certain features can evolve: what appears less influential during training may become pivotal in testing. Regularly reassess the relevance of features and update models to stay responsive to changing dynamics; one way to run such a check is sketched below.
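As one illustration of what "reassessing relevance" can look like in practice, the sketch below trains a model on synthetic data and measures permutation importance on held-out data. The feature names, the synthetic target, and the choice of a random forest are all assumptions made for this example; the same check can be rerun whenever new data or a new model version arrives.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500

# Synthetic features: "maybe_noise" contributes nothing to the target today,
# but keeping it stored means its importance can be re-checked later.
X = pd.DataFrame({
    "feature_a":   rng.normal(size=n),
    "feature_b":   rng.normal(size=n),
    "maybe_noise": rng.normal(size=n),
})
y = 3 * X["feature_a"] + 0.5 * X["feature_b"] + rng.normal(scale=0.1, size=n)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

# Permutation importance on held-out data: shuffling a truly irrelevant
# column barely changes the score, while shuffling a pivotal one hurts it.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
for name, score in zip(X.columns, result.importances_mean):
    print(f"{name}: {score:.3f}")
```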
Just be careful with the data. Understand the implications of your decisions. If you are discarding data, know what that data means and whether it's wise to let it go. In the data-driven world, informed choices are the pillars of robust models.