Machine Learning – Do we have multiple options for exploring missing data using Python?
It is really important to understand the data before we work on any machine learning task. Is it possible to get invalid, missing or junk values in the data?
Yes, it is very much possible in real-world datasets.
In the Machine Learning workflow, one of the important activities is handling this invalid data.
Missing data can lead to a biased Machine Learning model because we will not be able to analyse the relationships with other variables correctly. This can lead to wrong predictions, so we need to ensure that missing or invalid data is handled properly before it enters the Machine Learning model.
Okay – then how do we handle the invalid data?
There are multiple ways or approaches to explore and correct the invalid data.
Let us explore those ways using “Python”.
Where is the data?
For our understanding, let us start by creating a simple dataframe using Python:
- import pandas as pd
- import numpy as np
- dataValues = {'Serial': pd.Series([1,2,3,4,5,6]),
'Age': pd.Series([19,43,51,np.nan,60,38]),
'Salary': pd.Series([20000,14000,23000,45000,12500,np.nan]),
'Dept' : pd.Categorical([np.nan,'IT','Sales','Operations','IT','Marketing']) }
- df1 = pd.DataFrame(dataValues, columns = ['Serial','Age','Salary','Dept'])
- df1
I’ve added some invalid data (NaN: Not a Number) in the above dataframe. In reality, we may also come across zeros, NaN, NA, None and other junk values.
However, for this article we will explore NaN-type invalid data only.
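As a side note (not covered further here), other junk markers are usually converted to NaN first so that the same NaN-handling tools apply; a hypothetical sketch, assuming the junk marker is the string 'NA':
- df1.replace('NA', np.nan)   # turn the assumed junk marker 'NA' into NaN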
How do we identify NaN-type invalid data?
Since we created a simple dataframe, it is easy to see the invalid data. But if you have a large dataset with hundreds of rows and columns, how do we know whether we have this invalid data?
We have multiple ways to find this out.
- df1.describe()
For the numeric columns (features), the above Python command displays the summary statistics.
Summary Serial Age Salary
count 6.000000 5.000000 5.000000
mean 3.500000 42.200000 22900.000000
std 1.870829 15.417522 13078.608489
min 1.000000 19.000000 12500.000000
25% 2.250000 38.000000 14000.000000
50% 3.500000 43.000000 20000.000000
75% 4.750000 51.000000 23000.000000
max 6.000000 60.000000 45000.000000
Please note that the ‘Serial’ column has a count of 6, while the other two columns, ‘Age’ and ‘Salary’, have a count of 5. This indicates that each of these two columns has one missing value.
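If we want to turn that count difference directly into a per-column total of missing values, one quick way (not shown in the original steps) is:
- len(df1) - df1.count()   # total rows minus non-null rows, per column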
Another way to find the invalid data is to use isnull() on the dataframe.
- df1.isnull()
Serial Age Salary Dept
0 False False False True
1 False False False False
2 False False False False
3 False True False False
4 False False False False
5 False False True False
We can see True values in the ‘Age’, ‘Salary’ and ‘Dept’ columns of the isnull() results. This indicates that each of these three columns has one null value.
If we have a large dataset, it is difficult to scan through the above results looking for True values, isn’t it? What can we do?
We can chain sum() to count the isnull() occurrences per column.
- df1.isnull().sum()
Serial 0
Age 1
Salary 1
Dept 1
dtype: int64
The above result indicates that the ‘Serial’ column has 0 nulls and each of the other three columns has 1 null value.
Do we have other ways?
Yes, we can also use Python’s assert statement.
We will get an AssertionError if there are null values in the dataframe.
- assert pd.notnull(df1).all().all()
Traceback (most recent call last):
File "<ipython-input-20-8be1a459a92f>", line 1, in <module>
assert pd.notnull(df1).all().all()
AssertionError
If we check for null values in a specific column (‘Serial’):
- assert pd.notnull(df1['Serial']).all()
Since the ‘Serial’ column does not have any null or missing data, there is no error.
But if we try the same assertion for the ‘Age’ or ‘Salary’ column, we will get an AssertionError.
- assert pd.notnull(df1['Age']).all()
Traceback (most recent call last):
    assert pd.notnull(df1['Age']).all()
AssertionError
So, using any of the above commands, we can find out whether invalid or missing data exists in the dataset (we have checked only for NaN in this article).
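If we only need a single True/False answer for the whole dataframe (rather than an AssertionError), a minimal one-liner is:
- df1.isnull().values.any()   # True if any cell in the dataframe is NaN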
What can we do with this NaN-type invalid data?
Can we just leave this invalid data as it is? No.
The Machine Learning algorithm might not give accurate results with this invalid data.
What to do then?
There are many options available.
Some well-known options are:
- Remove the rows that contain the invalid data
- Update the invalid data using statistical methods (imputation)
- Predict the invalid/missing data with a machine learning algorithm
Depending on the business need and the data analysis, one or more of the above options can be used.
We are not going into the details of all the above options.
For this article, we will look at the second option only, using Python commands.
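For completeness, the first option (removing the rows that contain invalid data) is usually a one-liner in pandas; a minimal sketch, not pursued further in this article:
- df1.dropna()   # drop every row that has at least one NaN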
Mean imputation is one of the most frequently used methods; it replaces the missing data in a column with the mean of all the known values of that column.
In addition to this, we have other options available, such as forward-fill and backward-fill, to update the missing data (depending on the business need and the data analysis results).
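For reference, forward-fill and backward-fill are also one-liners in pandas (backward-fill appears again later in this article); a minimal sketch:
- df1.ffill()   # fill each NaN with the previous row's value
- df1.bfill()   # fill each NaN with the next row's value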
Let us use “mean” as the selected statistical option for this article.
Yes, as you guessed, we have multiple ways in Python to fill mean values.
The dataframe df1 created earlier still contains the NaN values. Let us apply the below Python command:
- df1.fillna(df1.mean(numeric_only=True), inplace=True)
Please note that the NaN in the ‘Age’ column is now replaced with 42.2 (the mean of the known ages) and the NaN in the ‘Salary’ column with 22900.0 (the mean of the known salaries).
Since the ‘Dept’ column is not numeric, the mean did not get applied to its NaN value. For this column, we can use bfill, which fills the NaN with the value from the next row.
- df1.bfill()
Now the NaN in the ‘Dept’ column (first row) is filled with the next row’s value, ‘IT’.
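Another common choice for a categorical column such as ‘Dept’ (not part of this article’s walkthrough) is to fill the NaN with the most frequent value, i.e. the mode; a minimal sketch:
- df1['Dept'] = df1['Dept'].fillna(df1['Dept'].mode()[0])   # 'IT' is the most frequent Dept here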
Do we have more options to update the missing data?
Yes. We can use SimpleImputer from scikit-learn (the current replacement for the older Imputer class). Let us create another simple numeric dataframe, df2, first.
- df2 = pd.DataFrame(dataValues, columns = ['Serial','Age','Salary'])
- df2
- from sklearn.impute import SimpleImputer
- imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
- dfValues = df2.values
- arrayValues = imputer.fit_transform(dfValues)
- df3 = pd.DataFrame(arrayValues, columns = ['Serial','Age','Salary'])
- df3
In df3, the NaN in ‘Age’ is replaced with 42.2 and the NaN in ‘Salary’ with 22900.0 (all values come back as floats, since the imputer returns a NumPy array).
If we check whether df3 has any missing or null values using the below command:
- df3.isnull().sum()
Serial 0
Age 0
Salary 0
dtype: int64
Obviously, all the NaN values got updated with the ‘mean’ for these columns, hence we get 0 for all the columns.
Here is one more way to fill these columns with the ‘mean’ values.
Let us create a new dataframe as below.
- df4 = pd.DataFrame(dataValues, columns = ['Serial','Age','Salary'])
- df4
Now use the below Python command, which will update the NaN cells with the ‘mean’ values.
- df4.fillna(df4.mean(), inplace=True)
We can also fill the missing data in ‘Age’ with the mean ‘Age’ of each ‘Dept’ (instead of the mean of the whole column).
For this, let us create a new set of data values.
- dataValues5 = {'Serial': pd.Series([1,2,3,4,5,6]),
'Age': pd.Series([20,30,40,np.nan,60,70]),
'Salary': pd.Series([100,200,300,400,500,np.nan]),
'Dept' : pd.Categorical(['Operations','IT','Operations','Operations','IT','IT']) }
- df5 = pd.DataFrame(dataValues5, columns = ['Serial','Age','Salary','Dept'])
- df5
Use the below Python command to update the missing value in ‘Age’ with the mean ‘Age’ of its ‘Dept’.
- df5["Age"] = df5["Age"].fillna(df5.groupby("Dept")["Age"].transform("mean"))
- df5
You can see that the NaN in ‘Age’ (Serial 4, an ‘Operations’ row) was replaced with the mean of the other ‘Operations’ ages: (20 + 40) / 2 = 30.
Similarly, we can update ‘Salary’ grouped by ‘Dept’.
- df5["Salary"] = df5["Salary"].fillna(df5.groupby("Dept")["Salary"].transform("mean"))
- df5
Please note that the NaN in ‘Salary’ (Serial 6, an ‘IT’ row) got updated with the mean ‘Salary’ of the ‘IT’ Dept: (200 + 500) / 2 = 350.
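If several columns need the same per-‘Dept’ mean fill, the two commands above can be combined into a short loop (a sketch, not from the original steps):
- for col in ['Age', 'Salary']:
      df5[col] = df5[col].fillna(df5.groupby('Dept')[col].transform('mean'))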
These are some example Python commands for handling missing data with ‘mean’ imputation.
Please continue to explore the Machine Learning process the Python way.
Happy Learning.