Why it's important to handle missing data
Anubhav Shukla
Full-stack Web Developer || GraphQl, MongoDB, Express, React, Node ( G-MERN ) || Python Fanatic || UI/UX Designer
Missing data is troublesome for both humans and machines. Let me explain with an example:
Here’s a hypothetical situation: imagine you found some ancient book (don't ask me from where). You start reading it, and after a few chapters you are hooked. You reach page 31, turn the page and WHAT!! you see that 3 pages are missing. The book jumps from page 31 to page 35. You shrug it off and continue with the book while making assumptions about what could’ve possibly happened on pages 32-34. You keep reading and encounter more missing pages, except now the book jumps from page 68 to page 76. You can’t believe your eyes as you start to read page 77. The story is no longer adding up.
Somehow you manage to reach the end of the book, but then you realize this is not really the end: 12 more pages are missing from the back. And that's how you finish reading the book. Now, if someone asks you for a summary of that book, can you give it confidently? You can, but it will not be completely accurate.
A similar thing happens with our models. A machine learning model reads a dataset the way you read that story, with each row acting like a chapter. If your dataset has missing values, your model won’t be able to fully understand what’s going on and may make inaccurate predictions.
Now let's understand the reasons behind missing data. There are 3 types of missing data:
- Missing Completely At Random (MCAR): This happens when every variable and observation has the same probability of being missing. Example: imagine you are collecting data on online shopping habits, and some users accidentally close the browser before completing the survey. The missing data is completely random and does not depend on any other observed factor.
- Missing At Random (MAR): This happens when the probability of a value being missing is related to other observed variables in the dataset, but not to the missing value itself. Example: suppose a dataset contains health indicators, lifestyle factors, and medical history, including glucose levels, and some individuals have missing glucose values. The missingness may be related to an observed factor such as the frequency of medical check-ups: participants who visit the doctor more often are more likely to have their glucose recorded, while those who visit less often are more likely to have it missing. If the missing glucose values were unrelated to any other observation, this would be MCAR; if they depended on the glucose values themselves, it would be MNAR. Here, however, the missing values are associated with an observed factor (the frequency of medical check-ups) that is available in the dataset, so the missingness is considered random with respect to glucose once check-up frequency is taken into account.
- Missing Not At Random (MNAR): This happens when the probability of a value being missing depends on the value itself, and the reasons can be unknown to us. MNAR is considered the most difficult scenario among the three types of missing data. Example: you are collecting data to see how salary changes with age, but some people decide not to disclose their age. The decision not to disclose is likely connected to their actual age; for instance, individuals in higher age brackets may be more hesitant to share it (no offence). Since the missingness depends on an unobserved factor, the age itself, this is MNAR. The three mechanisms are contrasted in the small simulation sketch below.
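To make the distinction concrete, here is a minimal sketch that simulates each mechanism on a synthetic dataset. Everything in it is an assumption made for illustration: the column names (age, checkups_per_year, glucose), the probabilities, and the thresholds are invented, not taken from any real study.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1_000

# Hypothetical dataset (column names and values are purely illustrative).
df = pd.DataFrame({
    "age": rng.integers(18, 80, n).astype(float),
    "checkups_per_year": rng.integers(0, 6, n),
    "glucose": rng.normal(100, 15, n),
})

# MCAR: every glucose reading has the same 10% chance of being lost,
# independent of everything else (e.g. a browser closed mid-survey).
mcar = df.copy()
mcar.loc[rng.random(n) < 0.10, "glucose"] = np.nan

# MAR: the chance that glucose is missing depends on another OBSERVED
# column (fewer check-ups means more likely missing), not on glucose itself.
mar = df.copy()
p = np.where(mar["checkups_per_year"] <= 1, 0.40, 0.05)
mar.loc[rng.random(n) < p, "glucose"] = np.nan

# MNAR: the chance that age is missing depends on age itself
# (older respondents assumed more reluctant to disclose it).
mnar = df.copy()
p = np.where(mnar["age"] >= 60, 0.40, 0.05)
mnar.loc[rng.random(n) < p, "age"] = np.nan

print("MCAR missing glucose:", mcar["glucose"].isna().mean())
print("MAR  missing glucose:", mar["glucose"].isna().mean())
print("MNAR missing age:    ", mnar["age"].isna().mean())
```

Notice that only in the MNAR case does the missingness depend on the very value that goes missing, which is exactly why it is the hardest to detect and correct for.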
What can you do about the data that’s missing in your dataset?
- Removing the missing data: This is a quick and simple method. But what if your dataset is small? Removing rows can lead to the loss of valuable information.
- Imputation: According to Wikipedia, in statistics, imputation is the process of replacing missing data with substituted values. Both approaches are sketched below.
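Here is a minimal sketch of both options using pandas and scikit-learn. The tiny DataFrame and its values are made up for illustration, and SimpleImputer's mean strategy is only one of several choices (median, most frequent, or model-based imputers are common alternatives).

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Tiny illustrative dataset with two missing glucose values (values are made up).
df = pd.DataFrame({
    "age":     [25, 32, 47, 51, 62],
    "glucose": [90, np.nan, 110, np.nan, 130],
})

# Option 1, removal: drop every row that contains a missing value.
# Quick and simple, but here it discards 2 of only 5 rows.
dropped = df.dropna()

# Option 2, imputation: replace each missing value with a substitute,
# in this case the mean of its column.
imputer = SimpleImputer(strategy="mean")
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(dropped)
print(imputed)
```

On a dataset this small, removal throws away 40% of the rows, which is why imputation is often preferred when data is scarce.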
Importance of Unrelated Variables in Machine Learning
In machine learning, the handling of data goes beyond the immediate task at hand. Consider this: data that seems unrelated or unimportant during initial training may hold unforeseen significance during testing or future model iterations. Here's why it matters:
1. Storage for Future Insights:
Collecting data can be challenging, and discarding seemingly unimportant information might not be the wisest move. Storing such data for later use could unveil new insights, refine models, or contribute to future analyses. A forward-thinking approach to data management ensures adaptability.
2. Dynamic Importance:
The importance of certain features can evolve: what appears less influential during training may become pivotal in testing. Regularly reassess the relevance of features and update models to stay responsive to changing dynamics; one way to run such a check is sketched below.
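As one illustration of what "reassessing relevance" can look like in practice, the sketch below trains a model on synthetic data and measures permutation importance on held-out data. The feature names, the synthetic target, and the choice of a random forest are all assumptions made for this example; the same check can be rerun whenever new data or a new model version arrives.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500

# Synthetic features: "maybe_noise" contributes nothing to the target today,
# but keeping it stored means its importance can be re-checked later.
X = pd.DataFrame({
    "feature_a":   rng.normal(size=n),
    "feature_b":   rng.normal(size=n),
    "maybe_noise": rng.normal(size=n),
})
y = 3 * X["feature_a"] + 0.5 * X["feature_b"] + rng.normal(scale=0.1, size=n)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

# Permutation importance on held-out data: shuffling a truly irrelevant
# column barely changes the score, while shuffling a pivotal one hurts it.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
for name, score in zip(X.columns, result.importances_mean):
    print(f"{name}: {score:.3f}")
```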
Just be careful with the data. Understand the implications of your decisions. If you are discarding data, know what that data means and whether it's wise to let it go. In the data-driven world, informed choices are the pillars of robust models.