How to handle null values and outliers
Brett Long
Psychology Student at ODU | Remote Learning & Development Specialist | Cybersecurity, Data Analytics & Web Dev Instructor | US ARMY Vet | Boosting Course Pass Rates by 30% | SaaS Education
One of the many decisions you must make as a data analyst is how to handle null values and outliers.
Null values are data points that are missing or empty. They can occur for various reasons, such as human error, data collection issues, or technical problems.
Outliers are data points that fall outside of the normal range of values. Errors, unusual events, or fraudulent activity can cause them.
Both null values and outliers can have a significant impact on data analysis. If they are not handled properly, they can lead to inaccurate or misleading results.
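To make that impact concrete, here is a small sketch using pandas and NumPy. The order values are made up for illustration. It shows how one extreme value shifts the mean while the median barely moves, and how a single missing value is treated differently depending on the tool.

```python
import numpy as np
import pandas as pd

# Hypothetical order values; the last one is an extreme outlier.
orders = pd.Series([20.0, 22.0, 19.0, 21.0, 23.0, 2000.0])

mean_with_outlier = orders.mean()      # pulled far upward by one value
median_with_outlier = orders.median()  # barely affected by it

# A single missing value turns a plain NumPy mean into NaN,
# while pandas skips missing values by default (skipna=True).
with_null = pd.Series([20.0, 22.0, np.nan, 21.0])
numpy_mean = np.mean(with_null.to_numpy())
pandas_mean = with_null.mean()

print(mean_with_outlier, median_with_outlier)
print(numpy_mean, pandas_mean)
```

Neither default is wrong, but each one is a decision made on your behalf, so it helps to know which one your tools apply.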
Here are some tips on how to handle null values and outliers:
Null values
Outliers
Using domain knowledge to handle null values and outliers
Domain knowledge is essential for handling null values and outliers effectively. With domain knowledge, you can better understand what the values mean and how they may impact your data analysis.
For example, if you are analyzing a dataset of customer purchase data, you might know that customers are less likely to provide their email address when they make a small purchase. Because the missingness depends on an observed variable (the purchase amount), this knowledge suggests that the email address variable is missing at random (MAR).
An email address is not a numeric value, so it cannot be imputed with a mean or median. You could instead keep the rows and record the missingness explicitly, for example with a "no email provided" indicator. However, if you know that customers who do not provide their email addresses are less likely to be repeat customers, you may decide to remove them from the dataset altogether when repeat customers are the focus of your analysis.
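A minimal sketch of the email example, assuming a pandas DataFrame with hypothetical `amount` and `email` columns. It records the missingness as its own column and then checks whether missing emails go together with smaller purchases.

```python
import pandas as pd

# Hypothetical purchase data; column names are assumptions for illustration.
purchases = pd.DataFrame({
    "amount": [5.0, 8.0, 12.0, 150.0, 220.0, 95.0, 7.0, 300.0],
    "email": [None, None, "a@example.com", "b@example.com",
              "c@example.com", "d@example.com", None, "e@example.com"],
})

# Keep the fact that the email was missing instead of inventing a value.
purchases["email_missing"] = purchases["email"].isna()

# If missingness is related to an observed column, the group averages differ.
avg_amount_by_missing = purchases.groupby("email_missing")["amount"].mean()
print(avg_amount_by_missing)
```

A clear gap between the two group averages is consistent with the domain knowledge described above, although it does not prove the mechanism on its own.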
Domain knowledge can also help you to identify legitimate outliers. For example, if you are analyzing a dataset of sales data, you might know that a particular customer is a high roller and often places large orders. If you see a data point that shows that the customer placed a very large order, you would know that this is a legitimate data point, even though it is an outlier.
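One way to encode this kind of knowledge is to flag outliers statistically and then exclude the ones you can already explain. The sketch below assumes made-up customer IDs and a hypothetical `known_large_buyers` allow-list. The 1.5 * IQR rule is a common convention, not something prescribed by this article.

```python
import pandas as pd

# Hypothetical sales data; customer IDs and the allow-list are assumptions.
sales = pd.DataFrame({
    "customer_id": [f"C{i}" for i in range(1, 13)],
    "order_total": [120.0, 95.0, 110.0, 130.0, 105.0, 9800.0,
                    115.0, 7600.0, 100.0, 125.0, 90.0, 135.0],
})
known_large_buyers = {"C6"}  # customers known to place very large orders

# Flag statistical outliers with the 1.5 * IQR rule (upper side only).
q1, q3 = sales["order_total"].quantile([0.25, 0.75])
upper = q3 + 1.5 * (q3 - q1)
sales["is_outlier"] = sales["order_total"] > upper

# Only outliers that domain knowledge cannot explain are sent for review.
sales["needs_review"] = sales["is_outlier"] & ~sales["customer_id"].isin(known_large_buyers)
print(sales[sales["needs_review"]])
```

Here both large orders are flagged as outliers, but only the one from a customer who is not on the allow-list is marked for review.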
By using your domain knowledge, you can handle null values and outliers in a way that preserves the accuracy and integrity of your data analysis.
Types of missing values
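There are three main types of missing values:
Missing completely at random (MCAR): the chance that a value is missing is unrelated to any variable, observed or not.
Missing at random (MAR): the chance that a value is missing depends only on other variables that are observed.
Missing not at random (MNAR): the chance that a value is missing depends on the missing value itself.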
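The three mechanisms are usually abbreviated MCAR (missing completely at random), MAR (missing at random, where missingness depends on observed variables), and MNAR (missing not at random, where missingness depends on the missing value itself). The simulation below is a sketch on made-up data. It hides values of `income` under each mechanism and compares the mean of what remains with the true mean.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Simulated data (an assumption for illustration): income rises with age.
age = rng.uniform(20, 70, size=n)
income = 20_000 + 800 * age + rng.normal(0, 5_000, size=n)

# MCAR: every value has the same 30% chance of being missing.
mcar_missing = rng.random(n) < 0.3
# MAR: missingness depends only on an observed variable (age).
mar_missing = age > 55
# MNAR: missingness depends on the hidden value itself (high incomes).
mnar_missing = income > 70_000

true_mean = income.mean()
mcar_mean = income[~mcar_missing].mean()
mar_mean = income[~mar_missing].mean()
mnar_mean = income[~mnar_missing].mean()
print(round(true_mean), round(mcar_mean), round(mar_mean), round(mnar_mean))
```

In this setup the MCAR mean stays close to the truth, while the MAR and MNAR means are biased downward. The MAR bias can in principle be corrected by conditioning on age, which is observed. The MNAR bias cannot be corrected from the observed data alone.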
Handling null values
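The best way to handle null values depends on the type of missingness and the goals of the analysis. For MCAR and MAR data, imputation methods such as mean, median, and regression imputation can be used to fill in the missing values. Under MAR, methods that use the observed variables related to the missingness, such as regression imputation, are generally a better fit than a single overall mean or median. For MNAR data, these imputation methods may not be effective because the missingness depends on the unobserved values themselves. Removing the rows or columns with missing values may be necessary, though removal can bias the results as well.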
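As a rough illustration, the sketch below compares an overall median fill with a group-wise median fill in pandas. The `segment` and `order_value` columns are hypothetical. The group-wise version stands in for the idea of conditioning on an observed variable and is simpler than a full regression imputation.

```python
import numpy as np
import pandas as pd

# Hypothetical data; column names are assumptions for illustration.
df = pd.DataFrame({
    "segment": ["retail", "retail", "retail",
                "wholesale", "wholesale", "wholesale"],
    "order_value": [40.0, np.nan, 60.0, 900.0, 1100.0, np.nan],
})

# Simple median imputation: one value for every gap.
overall_median = df["order_value"].median()
df["imputed_overall"] = df["order_value"].fillna(overall_median)

# Group-wise median imputation: condition on an observed column, which is
# closer in spirit to MAR (missingness explained by observed data).
group_median = df.groupby("segment")["order_value"].transform("median")
df["imputed_by_segment"] = df["order_value"].fillna(group_median)

print(df)
```

The overall median places the same mid-range value in both segments, which fits neither. The group-wise fill keeps retail and wholesale orders on their own scales.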
Handling outliers
Outliers can also be handled in a variety of ways. One common approach is to remove them from the dataset. However, this reduces the sample size and therefore the power of the analysis, and it may bias the results if the removed points are legitimate. Another approach is to transform the variable that contains the outliers, for example with a log transformation (which requires positive values). This makes the outliers less influential in the analysis without removing them.
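Both options can be sketched in a few lines of pandas and NumPy. The values are made up, and the 1.5 * IQR cutoff is a common convention chosen here as an assumption. Other thresholds such as z-scores are equally valid.

```python
import numpy as np
import pandas as pd

# Hypothetical, strictly positive order totals with two extreme values.
totals = pd.Series([90.0, 95.0, 100.0, 105.0, 110.0, 115.0,
                    120.0, 125.0, 130.0, 135.0, 7600.0, 9800.0])

# Option 1: drop values that fall outside 1.5 * IQR of the quartiles.
q1, q3 = totals.quantile([0.25, 0.75])
iqr = q3 - q1
in_range = totals.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
trimmed = totals[in_range]

# Option 2: keep every row but compress the scale with a log transform.
# np.log1p computes log(1 + x), so it also handles zeros safely.
logged = np.log1p(totals)

print(len(totals), len(trimmed))
print(totals.max() / totals.median(), logged.max() / logged.median())
```

Trimming loses two rows. The log transform keeps all twelve, and the largest value ends up far closer to the median than it was on the original scale.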
Using the missingno package in Python
The missingno package is a Python library that provides tools for visualizing missing values. The missingno.matrix() function draws a nullity matrix that shows where values are missing in each column, row by row. The missingno.heatmap() function draws a heatmap of the nullity correlation, that is, how strongly the presence or absence of one variable is related to that of another.
You can use the missingno package to help you identify and understand the missing values in your dataset. Once you have identified the missing values, you can use the information you have gained to decide how to handle them.
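A short sketch of how this might look in practice, assuming missingno and matplotlib are installed and using a made-up DataFrame. The pandas lines compute the same kind of nullity correlation that the heatmap displays, so you can inspect the numbers even without plotting. `plot_missingness` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Hypothetical data; "email" and "phone" tend to be missing together.
df = pd.DataFrame({
    "amount": [10.0, 25.0, np.nan, 40.0, 15.0, 60.0],
    "email": ["a@example.com", None, None, "d@example.com", None, "f@example.com"],
    "phone": ["555-0100", None, None, "555-0103", "555-0104", "555-0105"],
})

# Per-column count of missing values.
missing_counts = df.isna().sum()

# Nullity correlation: how strongly missingness in one column tracks another.
nullity_corr = df.isna().astype(int).corr()


def plot_missingness(frame: pd.DataFrame) -> None:
    """Draw the missingno plots; requires missingno and matplotlib."""
    import matplotlib.pyplot as plt
    import missingno as msno

    msno.matrix(frame)   # nullity matrix: where the gaps are, row by row
    msno.heatmap(frame)  # nullity correlation between columns
    plt.show()


print(missing_counts)
print(nullity_corr.round(2))
```

Calling `plot_missingness(df)` in a notebook or script would render the two plots. A high nullity correlation between `email` and `phone` suggests that the two fields tend to be skipped together.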
Treating missing values as an ML problem and subsampling
Another way to handle missing values is to treat them as a machine learning (ML) problem and subsample. This means that you create a new dataset by randomly selecting a subset of the original dataset. The subset should be chosen so that its missing values are distributed similarly to those in the original dataset.
You can then use a machine learning algorithm, trained on the rows where the value is observed, to predict the missing values in the subset. Once the missing values have been predicted, you can merge the predictions back into the original dataset to fill in its gaps.
Subsampling can be a useful way to handle missing values, but because the model sees less data, it can also reduce the power of the analysis. Evaluate the results of the analysis carefully.
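One possible reading of this approach is sketched below. It is an interpretation, not the article's exact procedure. It draws a random subsample of the complete rows, fits a simple least-squares model on it, and uses that model to predict the missing entries. The data is simulated, the column names are hypothetical, and a plain linear fit stands in for whatever ML algorithm you would actually choose.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1_000

# Simulated data (an assumption): order_value depends linearly on items.
items = rng.integers(1, 20, size=n).astype(float)
order_value = 12.0 * items + rng.normal(0, 5.0, size=n)
df = pd.DataFrame({"items": items, "order_value": order_value})

# Hide 20% of the target values to play the role of missing data.
hidden = rng.random(n) < 0.2
truth = df.loc[hidden, "order_value"].copy()
df.loc[hidden, "order_value"] = np.nan

# Train on a random subsample of the complete rows.
complete = df.dropna(subset=["order_value"])
train = complete.sample(n=300, random_state=0)

# Least-squares fit: order_value ~ slope * items + intercept.
X = np.column_stack([train["items"].to_numpy(), np.ones(len(train))])
coef, *_ = np.linalg.lstsq(X, train["order_value"].to_numpy(), rcond=None)
slope, intercept = coef

# Predict only the missing entries and write them back into the dataset.
missing = df["order_value"].isna()
df.loc[missing, "order_value"] = slope * df.loc[missing, "items"] + intercept

mean_abs_error = (df.loc[hidden, "order_value"] - truth).abs().mean()
print(round(slope, 2), round(intercept, 2), round(mean_abs_error, 2))
```

Because the true values were hidden deliberately, the error of the imputed values can be measured here. With real missing data that check is not available, which is one reason to evaluate the downstream results carefully.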
Additional tips for handling null values and outliers
By following these tips, you can handle null values and outliers in a way that produces accurate and reliable results.