Logistics Business - Manage your Data NOW or Dig your Graves #4
#4 - Do you have an Imputation Strategy?
Real-world data contains missing values and poor-quality values for various reasons, including:
1. Manual data capture leading to data errors
2. Inadequate controls for inaccurate data
3. Inadequate automated data-capture tools
4. Indiscipline, or processes not being followed
5. Inadequate training of data entry operators
Incomplete and poor-quality data can be identified using data debt models (see the previous article in this series, #3: “Have you created your Data Balance Sheet”).
Using incomplete datasets, or training a model on a dataset with missing values, can drastically degrade analytical outputs and machine learning model quality.
A strategy for resolving missing data is not an IT problem but a business issue.
The resolution of missing data is referred to as imputation, and every Logistics Business should have a strategy around it.
Imputation is a technique for replacing missing data with substitute values so as to retain most of the data/information.
Businesses need to infer those missing values from the existing part of the data.
There are multiple strategies that can be deployed to address the missing data problem. The sections below elucidate some of them.
1- Complete Case Analysis (CCA):
This is a quite straightforward method of handling missing data: it simply removes the rows that have missing values, i.e. we consider only those rows where the data is complete. This method is also popularly known as “Listwise deletion”. It works well when data is Missing At Random, the missing data is no more than 5%–6% of the dataset, and removing it will not bias the output.
Pros:
· Easy to implement; no values are invented.
Cons:
· Discards information, and can bias results if the data is not Missing At Random.
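As a minimal sketch of listwise deletion, pandas does this in one call; the column names below are illustrative, not from any real dataset:

```python
import numpy as np
import pandas as pd

# Toy shipment data with gaps (hypothetical columns for illustration).
df = pd.DataFrame({
    "weight_kg":   [12.0, np.nan, 7.5, 20.0],
    "distance_km": [150.0, 420.0, np.nan, 310.0],
    "carrier":     ["A", "B", "B", "A"],
})

# Complete Case Analysis / listwise deletion:
# keep only rows where no column is missing.
complete = df.dropna()
```

Of the four rows above, only the two fully populated ones survive, which shows how quickly CCA can shrink a dataset.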
2- Do Nothing:
That’s an easy one. In some cases the application can be coded to handle the missing data and factor it into the logic of the code. In other cases one can simply ignore it.
Pros:
· Easy and fast.
Cons:
· High dependency on identifying the gaps and building logic to handle them.
· Not always accurate.
3- Imputation Using (Mean/Median) Values:
This works by calculating the mean or median of the non-missing values in a column and then replacing the missing values with it.
Pros:
· Easy and fast.
· Works well with small and medium numerical datasets.
Cons:
· Doesn’t factor in the correlations between features.
· Gives poor results on categorical data.
· Not very accurate.
· Doesn’t account for the uncertainty in the imputations.
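A minimal sketch using scikit-learn's `SimpleImputer` (the data is made up for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One numeric column with a single gap.
X = np.array([[10.0], [np.nan], [30.0], [50.0]])

# strategy="mean" fills gaps with the column mean of observed values;
# strategy="median" works the same way with the median.
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X)
```

Here the gap is filled with (10 + 30 + 50) / 3 = 30, regardless of anything the other features might have told us about that row.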
4- Imputation Using (Most Frequent) or (Zero/Constant/Arbitrary) Values:
Most Frequent is a statistical strategy that imputes missing values with the most frequent value in the available data. This works best if data is Missing At Random.
Zero, Constant, or Arbitrary imputation, as the name suggests, replaces the missing values with zero or any constant value you specify. This works best if data is not Missing At Random.
Pros:
· Works well with categorical data.
Cons:
· Also doesn’t factor in the correlations between features.
· Can introduce bias into the data.
· Can distort the original variable distribution.
· Arbitrary values can create outliers.
· Extra caution is required in selecting the arbitrary value.
· The higher the percentage of missing values, the higher the distortion.
· May lead to over-representation of a particular category.
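Both variants are available through `SimpleImputer` as well; a small sketch on a made-up categorical column:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical carrier codes with one missing entry.
carriers = pd.DataFrame({"carrier": ["A", "B", "B", np.nan, "B"]})

# Most frequent: fill with the mode of the observed values ("B" here).
mode_imp = SimpleImputer(strategy="most_frequent")
filled_mode = mode_imp.fit_transform(carriers)

# Constant / arbitrary: flag the gap with a chosen placeholder instead.
const_imp = SimpleImputer(strategy="constant", fill_value="UNKNOWN")
filled_const = const_imp.fit_transform(carriers)
```

The constant variant makes imputed entries easy to audit later, at the cost of inventing a category; the most-frequent variant illustrates the over-representation risk listed above.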
5- Imputation Using k-NN:
k nearest neighbours (k-NN) is an algorithm used for simple classification. It uses ‘feature similarity’ to predict the values of missing data points: a missing value is assigned based on how closely its row resembles other rows in the available dataset. This can be very useful for estimating missing values.
Pros:
· Can be much more accurate than the mean, median or most-frequent imputation methods.
Cons:
· Computationally expensive: it works by storing the whole training dataset in memory.
· k-NN is quite sensitive to outliers in the data.
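A minimal sketch with scikit-learn's `KNNImputer`, on tiny made-up data where the two features are correlated:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Feature 2 is roughly twice feature 1; one value is missing.
X = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, np.nan],
    [8.0, 16.0],
])

# The gap is filled with the average of the same column in the
# 2 rows nearest to it on the observed feature.
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
```

The two nearest rows (feature 1 = 2.0 and 1.0) contribute their second-column values 4.0 and 2.0, so the gap becomes 3.0; unlike mean imputation, the result reflects the row's own neighbourhood.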
6- Imputation Using Deep Learning:
This method also works very well with categorical and non-numerical features. Libraries in this space learn machine learning models using deep neural networks to impute missing values in a data frame.
Pros:
· Quite accurate compared to other methods.
Cons:
· Can be quite slow with large datasets.
· You have to define which columns carry information about the target column to be imputed.
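The article doesn't name a specific library, so as a generic sketch of the idea, the following trains a small neural network (scikit-learn's `MLPRegressor`) on rows where the target is observed and uses its predictions to fill the rest; data and sizes are synthetic:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Synthetic data: target is a noisy linear function of two features.
X = rng.normal(size=(200, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Pretend the last 20 target values are missing.
known, missing = slice(0, 180), slice(180, 200)

# Train a small network on rows where the target is observed...
net = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
net.fit(X[known], y[known])

# ...then impute the missing values with its predictions.
y_imputed = net.predict(X[missing])
```

This matches the "define which columns carry information" caveat above: you explicitly choose the predictor columns the network learns from.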
7- Multiple Imputation:
All of the techniques discussed so far are what one might call "single imputation": each value in the dataset is filled in exactly once. In general, the limitation with single imputation is that because these techniques find maximally likely values, they do not generate entries which accurately reflect the distribution of the underlying data.
Real data naturally shows some variability: extreme values, outliers, and records which do not completely fit the "pattern" of the data. Single imputation tends to smooth this variability away.
In multiple imputation we generate the missing values from the dataset many times. The individual outputs are then pooled together into the final imputed dataset, with the replacement values drawn from the combined results in some way. In other words, multiple imputation breaks imputation into three steps: imputation (performed multiple times), analysis (of each completed dataset), and pooling (integrating the results into the final imputed matrix).
Pros:
· The most accurate of the methods discussed here.
Cons:
· Can be quite slow with large datasets.
· Requires several imputation cycles before highly accurate results are obtained.
· Computationally expensive.
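One way to sketch the impute-many-times-then-pool idea is scikit-learn's `IterativeImputer` with `sample_posterior=True`, so each run produces a slightly different completed dataset; the tiny matrix and the number of draws are arbitrary, and the pooling here is a plain average (a real analysis would fit the downstream model on each draw before pooling):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Column 2 is roughly twice column 1; one value is missing.
X = np.array([
    [1.0, 2.0],
    [2.0, 4.1],
    [3.0, np.nan],
    [4.0, 8.2],
])

# Step 1: impute several times, sampling from the posterior so the
# completed datasets differ and reflect imputation uncertainty.
draws = [
    IterativeImputer(sample_posterior=True, random_state=s).fit_transform(X)
    for s in range(5)
]

# Step 3: pool the draws into the final imputed matrix.
X_pooled = np.mean(draws, axis=0)
```

Observed cells are untouched across draws; only the imputed cell varies, and the spread across draws is exactly the uncertainty that single imputation hides.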
Some other imputation methods that can be followed:
Stochastic regression imputation:
Predicts each missing value by regressing it on other related variables in the same dataset, then adds a random residual value.
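A hedged sketch of that two-part recipe (deterministic regression prediction plus a random residual), on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)

# y depends linearly on x; roughly 20% of y values are missing.
x = np.arange(100, dtype=float).reshape(-1, 1)
y = 2 * x.ravel() + rng.normal(scale=3.0, size=100)
y_obs = np.where(rng.random(100) < 0.2, np.nan, y)

# Fit a regression on the observed rows only.
obs = ~np.isnan(y_obs)
model = LinearRegression().fit(x[obs], y_obs[obs])

# Prediction + a residual drawn from the fit's residual spread:
# the random term is what makes the imputation "stochastic".
resid_std = np.std(y_obs[obs] - model.predict(x[obs]))
preds = model.predict(x[~obs])
y_obs[~obs] = preds + rng.normal(scale=resid_std, size=preds.shape)
```

The added noise preserves the natural scatter around the regression line, which plain regression imputation would flatten.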
Extrapolation and Interpolation:
Estimates values from other observations within the range of a discrete set of known data points (interpolation), or beyond that range (extrapolation).
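For ordered data such as a daily time series, pandas can interpolate gaps directly; the volumes below are made up:

```python
import numpy as np
import pandas as pd

# Hypothetical daily volumes with a two-day gap.
daily_volume = pd.Series([100.0, np.nan, np.nan, 160.0, 180.0])

# Linear interpolation estimates the gaps from the neighbouring points.
filled = daily_volume.interpolate(method="linear")
```

The gap between 100 and 160 is split evenly into 120 and 140, which is sensible only when the series really does move smoothly between observations.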
Hot-Deck imputation:
Works by randomly choosing each missing value's replacement from a set of related and similar records.
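A small sketch of hot-deck imputation, drawing each replacement from observed donors in the same group (the grouping column and values are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# Hypothetical delivery delays, with one gap per region.
df = pd.DataFrame({
    "region":  ["N", "N", "N", "S", "S", "S"],
    "delay_h": [1.0, 2.0, np.nan, 5.0, np.nan, 6.0],
})

def hot_deck(group):
    # Donors are the observed values within the same "deck" (region).
    donors = group.dropna().to_numpy()
    return group.apply(lambda v: rng.choice(donors) if np.isnan(v) else v)

df["delay_h"] = df.groupby("region")["delay_h"].transform(hot_deck)
```

Because replacements are real observed values, the imputed column keeps a plausible distribution within each group.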
In conclusion, there is no perfect way to compensate for missing values in a dataset. Each strategy can perform better on certain datasets and missing-data types but may perform much worse on others. There are some rules of thumb for deciding which strategy to use for particular types of missing values, but beyond that, you should experiment and check which approach works best for your dataset.