How to preprocess data used in generative AI: cleaning, transforming and encoding
Python programming with AI libraries
Introduction
Data preprocessing is an essential step to ensure the quality and accuracy of machine learning models.
This process involves transforming data into its most useful and clean format before feeding it to a model, and is carried out in three main steps:
Step 1: Data cleaning.
It consists of handling missing or anomalous values that could skew the results: eliminating duplicates, handling null or absent values, and correcting errors in the data.
A.- Input Data: The dataset describes 10 customers: age, gender, product purchased (clothing, electronics, or household items), and monthly spending.
The data is assigned to the variable “data”, an array called “object”.
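As an illustration, input of this shape could be built as follows. Every value below is invented for the example, and storing the records as a dictionary loaded into a pandas DataFrame is an assumption. The column names Age, Gender, Product_Category and Monthly_Spend follow the ones used later in this article.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values, invented for illustration only.
# Missing entries (np.nan / None) and one repeated row are included on purpose
# so that the cleaning step has something to fix.
data = {
    "Age": [25, 34, np.nan, 45, 52, 23, 34, 41, 29, 38],
    "Gender": ["F", "M", "F", None, "M", "F", "M", "F", "M", "F"],
    "Product_Category": [
        "Clothing", "Electronics", "Household", "Clothing", "Electronics",
        "Household", "Electronics", "Clothing", None, "Household",
    ],
    "Monthly_Spend": [120.0, 340.5, 80.0, np.nan, 410.0, 95.5, 340.5, 150.0, 220.0, 60.0],
}

df = pd.DataFrame(data)

# Quick inspection: size of the table and number of missing values per column
print(df.shape)
print(df.isna().sum())
```

Inspecting the shape and the per-column count of missing values first makes it easier to decide which cleaning rules are needed.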
B.- Filling and cleaning:
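A minimal sketch of this step, assuming pandas and a small invented dataset. The choice of the median for numeric gaps and the most frequent value for categorical gaps is an assumption made for illustration. Other strategies, such as dropping incomplete rows, are equally valid.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: one exact duplicate (rows 1 and 3) and a few missing values
df = pd.DataFrame({
    "Age": [25, 34, np.nan, 34, 45],
    "Gender": ["F", "M", "F", "M", None],
    "Product_Category": ["Clothing", "Electronics", "Household", "Electronics", "Clothing"],
    "Monthly_Spend": [120.0, 340.5, np.nan, 340.5, 80.0],
})

# 1. Remove exact duplicate rows and rebuild a clean 0..n-1 index
df = df.drop_duplicates().reset_index(drop=True)

# 2. Fill numeric gaps with the column median (assumed strategy)
for col in ["Age", "Monthly_Spend"]:
    df[col] = df[col].fillna(df[col].median())

# 3. Fill categorical gaps with the most frequent value (assumed strategy)
for col in ["Gender", "Product_Category"]:
    df[col] = df[col].fillna(df[col].mode()[0])

print(df)
```

Duplicates are removed before filling so that repeated rows do not bias the median or the most frequent value.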
C.- Output Data:
The DataFrame df is now clean:
Step 2: Data Transformation.
It adjusts the scale of the data by normalizing or standardizing it, which makes the model easier to train. Typical adjustments include scaling, normalization, and feature transformation, all of which make the data easier for the model to interpret.
A.- Normalization.- Normalize the Monthly_Spend column:
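One common way to bring a column into the 0 to 1 range is min-max scaling, x' = (x - min) / (max - min). The sketch below assumes that technique and uses invented spending values. It overwrites Monthly_Spend in place.

```python
import pandas as pd

# Hypothetical spending values, for illustration only
df = pd.DataFrame({"Monthly_Spend": [100.0, 250.0, 400.0, 175.0]})

col = df["Monthly_Spend"]
spend_range = col.max() - col.min()

# Min-max scaling: the smallest value maps to 0 and the largest to 1.
# Note: a constant column would give spend_range == 0 and needs separate handling.
df["Monthly_Spend"] = (col - col.min()) / spend_range

print(df)
```

scikit-learn's MinMaxScaler performs the same computation and is a convenient option when the scaling learned on training data has to be reapplied to new data.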
B.- Output Data:
The Monthly_Spend column now has values between 0 and 1, which facilitates comparisons and calculations and makes the data more consistent for machine learning models that are sensitive to scale.
Step 3: Encoding Categorical Data with the One-Hot Encoding Method
A.- Binary encoding: Finally, categorical variables are converted into a binary format that is more suitable for machine learning algorithms, with each category represented as its own column in the dataset. pd.get_dummies is applied to the Gender and Product_Category columns:
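A short sketch of that call on a small invented dataset. The dtype=int argument is an assumption added here so that the new columns hold 0 and 1 rather than booleans.

```python
import pandas as pd

# Hypothetical customers, for illustration only
df = pd.DataFrame({
    "Age": [25, 34, 45],
    "Gender": ["F", "M", "F"],
    "Product_Category": ["Clothing", "Electronics", "Household"],
})

# Each category becomes its own 0/1 column, named <column>_<category>
df_encoded = pd.get_dummies(df, columns=["Gender", "Product_Category"], dtype=int)

print(df_encoded.columns.tolist())
```

Columns not listed in the columns argument, such as Age, pass through unchanged, and each row has exactly one 1 among the columns generated for a given original column.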
B.- Output Data.- The resulting DataFrame, df_encoded, contains:
The final data is the result of cleaning, transforming, and applying One-Hot Encoding to the original dataset.
Final Step.- Meaning of the Output Data.
Each row still represents a customer, but the data has been adjusted to make it useful in applications such as the following:
1.- Improved Comparability:
Normalizing Age and Monthly_Spend ensures that these features are comparable to each other and to the encoded variables.
2.- Preparation for Machine Learning Models:
3.- Multidimensional Analysis:
Practical Use of Output Data
1.- Demographic Analysis:
2.- Customer Segmentation:
3.- Prediction:
In summary, the final data is structured for use in advanced analytics and predictive applications.
Consequences of not cleaning data before training
Poor model performance:
Increased risk of overfitting: