How to preprocess data used in generative AI
     Cleaning, transforming and encoding


Python programming with AI libraries

Introduction

Data preprocessing is an essential step to ensure the quality and accuracy of machine learning models.

This process involves transforming data into its most useful and clean format before feeding it to a model, and is carried out in three main steps:

  • Data cleaning.
  • Data transformation.
  • Data encoding.

Step 1: Data cleaning.

This step handles missing or anomalous values that could skew the results: eliminating duplicates, filling null or absent values, and correcting errors in the data.

A.- Input Data: Here you have records for 10 customers: age, gender, product purchased (clothing, electronics, household items), and monthly spending.

The records are assigned to a variable called “data” (a dictionary-like object) and loaded into a DataFrame.
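The original code was not preserved in this copy of the article; the sketch below shows what the input step could look like, using the column names mentioned in the text (Age, Gender, Product_Category, Monthly_Spend). The specific values are invented for illustration, with a few rows rather than the full ten customers:

```python
import numpy as np
import pandas as pd

# Hypothetical sample of the customer records described above.
# np.nan marks the missing values that Step 1 will fill in.
data = {
    "Age": [25, 32, np.nan, 41, 32],
    "Gender": ["F", "M", "F", "M", "M"],
    "Product_Category": ["Clothing", "Electronics", "Household", "Clothing", "Electronics"],
    "Monthly_Spend": [120.0, np.nan, 310.5, 89.9, 150.0],
}
df = pd.DataFrame(data)
print(df)
```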

B.- Filling and cleaning:

  • Handling missing values: The Age column is filled with the mean, while Monthly_Spend is filled with the median, keeping the data consistent.
  • Removing duplicates: Ensures that each record is unique.
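The two cleaning operations above can be sketched as follows. The sample values are invented; the filling strategy (mean for Age, median for Monthly_Spend) and the duplicate removal follow the text:

```python
import numpy as np
import pandas as pd

# Invented sample with one missing row and one exact duplicate.
df = pd.DataFrame({
    "Age": [25.0, np.nan, 41.0, 41.0],
    "Monthly_Spend": [120.0, 310.5, np.nan, np.nan],
})

# Fill missing Age with the mean, Monthly_Spend with the median.
df["Age"] = df["Age"].fillna(df["Age"].mean())
df["Monthly_Spend"] = df["Monthly_Spend"].fillna(df["Monthly_Spend"].median())

# Remove exact duplicate rows so each record is unique.
df = df.drop_duplicates().reset_index(drop=True)
print(df)
```

After filling, the last two rows become identical, so `drop_duplicates` leaves three unique records with no nulls.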


C.- Output Data:

The DataFrame df is now clean:

  • The Age and Monthly_Spend columns have no null values.
  • There are no duplicate rows, and they are ready for the next stage of transformation.

Step 2: Data Transformation.

This step adjusts the scales of the data to normalize or standardize it, making it easier to train the model. It covers adjustments such as scaling, normalization, and feature transformation that make the data more interpretable by the model.

A.- Normalization.- Normalize the Monthly_Spend column:

  • MinMaxScaler is used to scale the Monthly_Spend values between 0 and 1.


  • This helps all values to be in a uniform range, reducing the impact of magnitudes on subsequent analysis.
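A minimal sketch of the normalization step, using scikit-learn's MinMaxScaler as the text describes. The Monthly_Spend values are invented:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({"Monthly_Spend": [120.0, 310.5, 215.25, 89.9]})

# MinMaxScaler rescales each value to (x - min) / (max - min),
# so the smallest value maps to 0 and the largest to 1.
scaler = MinMaxScaler()
df["Monthly_Spend"] = scaler.fit_transform(df[["Monthly_Spend"]])
print(df)
```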


B.- Output Data:

The Monthly_Spend column now has values between 0 and 1, facilitating comparisons and calculations, making the data more consistent for machine learning models that are sensitive to scale.

Step 3: Encoding Categorical Data with the One-Hot Encoding Method

A.- Binary encoding: Finally, categorical variables are converted into a binary format that is more suitable for machine learning algorithms, representing each category as its own column in the dataset. pd.get_dummies is applied to the Gender and Product_Category columns:

  • Gender: Two binary columns are created: Gender_F (1 if the gender is F, 0 otherwise) and Gender_M (1 if the gender is M, 0 otherwise).
  • Product_Category: As many columns are created as there are unique categories in Product_Category, with binary values for each category.
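A sketch of the encoding step on invented data. Passing `dtype=int` to pd.get_dummies produces 1/0 columns directly rather than True/False:

```python
import pandas as pd

df = pd.DataFrame({
    "Gender": ["F", "M", "F"],
    "Product_Category": ["Clothing", "Electronics", "Household"],
    "Monthly_Spend": [0.2, 1.0, 0.6],
})

# One binary column per unique category; original columns are replaced.
df_encoded = pd.get_dummies(df, columns=["Gender", "Product_Category"], dtype=int)
print(df_encoded.columns.tolist())
```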

B.- Output Data.- The resulting DataFrame, df_encoded, contains:

  • Original columns except the encoded categorical ones.
  • New binary columns representing the categories in Gender and Product_Category.
  • True equals 1 and False equals 0.

The final data represents the result after processing: cleaning, transforming, and applying One-Hot Encoding to the original dataset.

Final Step.- Meaning of the Output Data.

Each row still represents a customer, but the data has been adjusted to make it useful in applications such as the following:

1.- Improved Comparability:

Normalizing Age and Monthly_Spend ensures that these features are comparable to each other and to the encoded variables.

2.- Preparation for Machine Learning Models:

  • Categorical data has been transformed using One-Hot Encoding, which is crucial for statistical and machine learning models that cannot directly handle non-numeric variables.
  • For example, a model can use these columns to predict monthly spending based on product category, region, or age.
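As a hedged illustration of that example, the encoded columns can feed a simple regression model. The data and model choice (scikit-learn's LinearRegression) are my own assumptions, not from the original article:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Invented, already-cleaned records.
df = pd.DataFrame({
    "Age": [25, 32, 41, 29, 35],
    "Product_Category": ["Clothing", "Electronics", "Household", "Electronics", "Clothing"],
    "Monthly_Spend": [120.0, 310.5, 215.25, 280.0, 95.0],
})

# One-hot encode the categorical feature, then fit a linear model
# that predicts Monthly_Spend from Age and product category.
X = pd.get_dummies(df[["Age", "Product_Category"]], dtype=int)
y = df["Monthly_Spend"]
model = LinearRegression().fit(X, y)

# Predict spending for the first customer's feature values.
pred = model.predict(X.iloc[[0]])
```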

3.- Multidimensional Analysis:

  • It can now be observed how different combinations of categories (Product_Category, Region) or genders are associated with specific age and monthly spending patterns.
  • Example: Identify whether women in North America spend more on Electronics.
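A question like that example maps naturally onto a pandas groupby. The data here is invented; the Region column is assumed from the article's mention of it:

```python
import pandas as pd

df = pd.DataFrame({
    "Region": ["North America", "North America", "Europe", "North America"],
    "Gender": ["F", "F", "M", "M"],
    "Product_Category": ["Electronics", "Electronics", "Electronics", "Clothing"],
    "Monthly_Spend": [300.0, 280.0, 150.0, 90.0],
})

# Average spend per (Region, Gender, Product_Category) combination.
avg = df.groupby(["Region", "Gender", "Product_Category"])["Monthly_Spend"].mean()
print(avg)

# e.g. average Electronics spend of women in North America:
na_f_electronics = avg.loc[("North America", "F", "Electronics")]
```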

Practical Use of Output Data

1.- Demographic Analysis:

  • Which regions have customers with the highest monthly spending?
  • Which product categories are more popular among men or women?

2.- Customer Segmentation:

  • Identify groups with similar spending patterns for specific marketing campaigns.

3.- Prediction:

  • Predict monthly spending or the likelihood of a customer purchasing in a specific category.

In summary, the final data is optimally structured to be used in advanced analytics and predictive applications.

Consequences of not cleaning data before training

Poor model performance:

  • If the data is not properly cleaned, the model may learn incorrect or irrelevant patterns, leading to poor performance.
  • This may result in the generation of low-quality or inaccurate content, negatively affecting the usefulness of the generative AI system.

Increased risk of overfitting:

  • Uncleaned data may contain noise and outliers that the model may try to fit, which can lead to overfitting.
  • This means that the model becomes overly complex and overfits the training data, losing its ability to generalize to new, unseen data.
  • As a result, the generative model might fail when faced with real-world situations.


More articles by Carlos Sampson