Data processing with Python
Angad Gupta, MIEEE, BITS-Pilani
This article covers understanding the structure of a dataset, finding missing values, and handling missing values in continuous and categorical variables (mean, median, last occurring value, or any chosen value).
Why is data preprocessing required?
Data preprocessing is crucial in any data mining process because it directly affects the success of the project. Real-world data is often unclean, and preprocessing reduces the complexity of the data under analysis. Data is said to be unclean if it is missing attributes or attribute values, contains noise or outliers, or contains duplicate or wrong records. The presence of any of these degrades the quality of the results.
There are several reasons why data may be incomplete, noisy, or contain duplicate records; the example below shows some of them.
Example: identifying and handling missing values
The dataset above is used to explain the example. From it, we can note the following:
- Incompleteness: the Region and Online Shopper features contain many "NaN" values
- Noisy: the Income feature contains the values -10 and -500, which are not possible for a salary
- Inconsistent: the Birthdate column is the same for every row, and if the age is 49, how can the birth date be 01-Jan-202?
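The noisy Income values noted above can be flagged with a simple filter. This is only a minimal sketch: the article's dataset is not reproduced here, so the rows below are hypothetical stand-ins that reuse the two impossible values mentioned.

```python
import pandas as pd

# Hypothetical rows; only the -10 and -500 values come from the article.
data = pd.DataFrame({
    "Region": ["India", "Brazil", "USA", "India"],
    "Income": [86400, -10, 64800, -500],
})

# A salary cannot be negative, so treat negative values as noise.
invalid_income = data[data["Income"] < 0]
print(invalid_income)
```

Whether such rows should be corrected, set to missing, or dropped depends on how the data was collected.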
Let's understand the data and handle it suitably
1. Importing the required libraries
2. Reading the dataset
3. Viewing the structure of the dataset
- data.shape shows the number of rows and columns in the dataset (it is an attribute, not a method, so it is written without parentheses)
- data.info() shows the number of columns, their names, the number of non-null entries, and the data type of each column. From this we can see that Age has 10 entries whereas the dataset has 12 rows in total, which means 2 rows are missing an Age value
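The three steps above might look like the following sketch. The original file is not available, so the CSV text and its values are hypothetical and are read from an in-memory string; with a real file you would pass its path to pd.read_csv instead.

```python
import io

import pandas as pd

# Hypothetical stand-in for the article's dataset (values are illustrative).
csv_text = """Region,Age,Income,Online Shopper
India,49,86400,No
Brazil,32,57600,Yes
USA,35,64800,No
Brazil,43,73200,No
USA,45,,Yes
India,40,69600,Yes
Brazil,,62400,No
India,53,94800,Yes
USA,55,99600,No
India,42,80400,Yes
"""

# Step 2: read the dataset (here from a string instead of a file path).
data = pd.read_csv(io.StringIO(csv_text))

# Step 3: inspect the structure.
print(data.shape)   # (rows, columns); shape is an attribute, not a method
data.info()         # column names, non-null counts, and dtypes
print(data.head())  # first five rows
```

In the output of data.info(), a column whose non-null count is lower than the total number of rows contains missing values.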
More detailed view of the missing values
- data.isnull().values.any(): shows whether any missing value exists. In our case it is True
- data.isnull().sum(): shows the number of null values in each column
- data.isnull(): shows, cell by cell, which values are null
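The three checks listed above can be tried on a small frame. The data below is a hypothetical example, not the article's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing value in each column.
data = pd.DataFrame({
    "Region": ["India", "Brazil", np.nan, "USA"],
    "Age": [49.0, np.nan, 35.0, 43.0],
    "Income": [86400.0, 57600.0, np.nan, 73200.0],
})

has_missing = data.isnull().values.any()  # True if at least one cell is null
missing_per_column = data.isnull().sum()  # number of null cells in each column
missing_mask = data.isnull()              # Boolean frame: True where a cell is null

print(has_missing)
print(missing_per_column)
print(missing_mask)
```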
Dataset summary, which shows some of the required statistics of the continuous columns
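One common way to get such a summary in pandas is describe(); the article's screenshot is not available, so this is an assumption about the call used, and the data is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; Region is categorical, Age and Income are continuous.
data = pd.DataFrame({
    "Region": ["India", "Brazil", "USA", "India"],
    "Age": [49.0, 32.0, np.nan, 43.0],
    "Income": [86400.0, 57600.0, 64800.0, np.nan],
})

# By default describe() summarises numeric columns only and skips nulls.
summary = data.describe()
print(summary)
```

The count row reports non-null entries only, so it gives another quick hint about which columns have missing values.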
Handling of NULL values:
1. Dropping rows with null values: here the rows for China have been deleted because they contain null values, which may cause a loss of data patterns.
This method is commonly used to handle null values. We either delete a row if it has a null value for a particular feature, or delete a column if more than 70-75% of its values are missing. This method is advised only when there are enough samples in the dataset. One has to make sure that deleting the data does not introduce bias. Removing data also loses information, which may lead to worse results than expected when predicting the output.
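A minimal sketch of both kinds of deletion follows. The data and the Notes column are hypothetical, and the 70% cut-off is taken from the lower end of the range mentioned above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: the China row lacks an Age, and Notes is 75% empty.
data = pd.DataFrame({
    "Region": ["India", "Brazil", "China", "USA"],
    "Age": [49.0, 32.0, np.nan, 43.0],
    "Income": [86400.0, 57600.0, 64800.0, 73200.0],
    "Notes": [np.nan, np.nan, np.nan, "vip"],
})

# Drop columns in which more than 70% of the values are missing.
missing_share = data.isnull().mean()
reduced = data.loc[:, missing_share <= 0.70]

# Then drop every row that still contains a null value.
cleaned = reduced.dropna(axis=0, how="any")
print(cleaned)
```

Dropping the sparse column first avoids discarding rows only because of a column that would be removed anyway.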
Pros:
- Complete removal of data with missing values can result in a robust and accurate model
- Deleting a row or a column that carries no specific information is reasonable, since it does not have a high weightage
Cons:
- Loss of information and data
- Works poorly if the percentage of missing values is high (say 30%), compared to the whole dataset
2. Replacing With Mean/Median/Mode
Here we can see that the missing values in the Age column are replaced by the mean (i.e. 134.5) and those in the Income column by the median (i.e. 76800)
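The replacement can be sketched with fillna. The values below are hypothetical, so the resulting mean and median differ from the figures quoted above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing Age and one missing Income.
data = pd.DataFrame({
    "Age": [50.0, 30.0, np.nan, 40.0],
    "Income": [86400.0, np.nan, 64800.0, 73200.0],
})

filled = data.copy()

# mean() and median() ignore nulls, so they use only the observed values.
filled["Age"] = filled["Age"].fillna(filled["Age"].mean())
filled["Income"] = filled["Income"].fillna(filled["Income"].median())
print(filled)
```

The median is less sensitive to outliers than the mean, which can matter for a column such as Income that contains noisy values.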
Pros:
- This is a better approach when the data size is small
- It prevents the data loss that results from removing rows and columns
Cons:
- Imputed approximations can bias the data and distort its variance
- Often works poorly compared with multiple-imputation methods
Handling Missing values with CATEGORICAL data
- Replacing missing values with the most frequently occurring word
- Replacing missing values with a common word such as "Unknown"
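Both options can be sketched as follows. The column names echo the features mentioned earlier, but the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical columns with missing entries.
data = pd.DataFrame({
    "Region": ["India", "Brazil", np.nan, "India", "USA"],
    "Online Shopper": ["Yes", np.nan, "No", np.nan, "Yes"],
})

# Option 1: fill with the most frequent category (mode() ignores nulls).
most_common_region = data["Region"].mode()[0]
by_mode = data["Region"].fillna(most_common_region)

# Option 2: fill with an explicit placeholder label.
by_label = data["Online Shopper"].fillna("Unknown")

print(by_mode.tolist())
print(by_label.tolist())
```

A placeholder such as "Unknown" keeps the fact that a value was missing visible, whereas filling with the mode hides it.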