How to preprocess data used in generative AI
     Cleaning, transforming and encoding


Python programming with AI libraries

Introduction

Data preprocessing is an essential step to ensure the quality and accuracy of machine learning models.

This process involves transforming data into its most useful and clean format before feeding it to a model, and is carried out in three main steps:

  • Data cleaning.
  • Data transformation.
  • Data encoding.

Step 1: Data cleaning.

This step handles missing or anomalous values that could skew the results: eliminating duplicates, filling null or absent values, and correcting errors in the data.

A.- Input Data: Here you have records for 10 customers: age, gender, product purchased (clothing, electronics, household items), and monthly spending.

The records are assigned to a variable called “data” (a dictionary-like object) and loaded into a DataFrame.
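The original code was not preserved in this copy of the article; the sketch below shows what the input step could look like, using the column names mentioned in the text (Age, Gender, Product_Category, Monthly_Spend). The specific values are invented for illustration, with a few rows rather than the full ten customers:

```python
import numpy as np
import pandas as pd

# Hypothetical sample of the customer records described above.
# np.nan marks the missing values that Step 1 will fill in.
data = {
    "Age": [25, 32, np.nan, 41, 32],
    "Gender": ["F", "M", "F", "M", "M"],
    "Product_Category": ["Clothing", "Electronics", "Household", "Clothing", "Electronics"],
    "Monthly_Spend": [120.0, np.nan, 310.5, 89.9, 150.0],
}
df = pd.DataFrame(data)
print(df)
```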

B.- Filling and cleaning:

  • Handling missing values: The Age column is filled with the mean, while Monthly_Spend is filled with the median, keeping the data consistent.
  • Removing duplicates: Ensures that each record is unique.
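The two cleaning operations above can be sketched as follows. The sample values are invented; the filling strategy (mean for Age, median for Monthly_Spend) and the duplicate removal follow the text:

```python
import numpy as np
import pandas as pd

# Invented sample with one missing row and one exact duplicate.
df = pd.DataFrame({
    "Age": [25.0, np.nan, 41.0, 41.0],
    "Monthly_Spend": [120.0, 310.5, np.nan, np.nan],
})

# Fill missing Age with the mean, Monthly_Spend with the median.
df["Age"] = df["Age"].fillna(df["Age"].mean())
df["Monthly_Spend"] = df["Monthly_Spend"].fillna(df["Monthly_Spend"].median())

# Remove exact duplicate rows so each record is unique.
df = df.drop_duplicates().reset_index(drop=True)
print(df)
```

After filling, the last two rows become identical, so `drop_duplicates` leaves three unique records with no nulls.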


C.- Output Data:

The DataFrame df is now clean:

  • The Age and Monthly_Spend columns have no null values.
  • There are no duplicate rows, and they are ready for the next stage of transformation.

Step 2: Data Transformation.

This step adjusts the scales of the data to normalize or standardize it, making it easier to train the model. It covers adjustments such as scaling, normalization, and feature transformation that make the data more interpretable by the model.

A.- Normalization.- Normalize the Monthly_Spend column:

  • MinMaxScaler is used to scale the Monthly_Spend values between 0 and 1.


  • This helps all values to be in a uniform range, reducing the impact of magnitudes on subsequent analysis.
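A minimal sketch of the normalization step, using scikit-learn's MinMaxScaler as the text describes. The Monthly_Spend values are invented:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({"Monthly_Spend": [120.0, 310.5, 215.25, 89.9]})

# MinMaxScaler rescales each value to (x - min) / (max - min),
# so the smallest value maps to 0 and the largest to 1.
scaler = MinMaxScaler()
df["Monthly_Spend"] = scaler.fit_transform(df[["Monthly_Spend"]])
print(df)
```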


B.- Output Data:

The Monthly_Spend column now has values between 0 and 1, facilitating comparisons and calculations, making the data more consistent for machine learning models that are sensitive to scale.

Step 3: Encoding Categorical Data with the One-Hot Encoding Method

A.- Binary encoding: Finally, categorical variables are converted into a binary format that is more suitable for machine learning algorithms, representing each category as its own column in the dataset. pd.get_dummies is applied to the Gender and Product_Category columns:

  • Gender: Two binary columns are created: Gender_F (1 if the gender is F, 0 otherwise) and Gender_M (1 if the gender is M, 0 otherwise).
  • Product_Category: As many columns are created as there are unique categories in Product_Category, with binary values for each category.
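A sketch of the encoding step on invented data. Passing `dtype=int` to pd.get_dummies produces 1/0 columns directly rather than True/False:

```python
import pandas as pd

df = pd.DataFrame({
    "Gender": ["F", "M", "F"],
    "Product_Category": ["Clothing", "Electronics", "Household"],
    "Monthly_Spend": [0.2, 1.0, 0.6],
})

# One binary column per unique category; original columns are replaced.
df_encoded = pd.get_dummies(df, columns=["Gender", "Product_Category"], dtype=int)
print(df_encoded.columns.tolist())
```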

B.- Output Data.- The resulting DataFrame, df_encoded, contains:

  • Original columns except the encoded categorical ones.
  • New binary columns representing the categories in Gender and Product_Category.
  • True equals 1 and False equals 0.

The final data represents the result after processing: cleaning, transforming, and applying One-Hot Encoding to the original dataset.

Final Step.- Meaning of the Output Data.

Each row still represents a customer, but the data has been adjusted to make it useful in applications such as the following:

1.- Improved Comparability:

Normalizing Age and Monthly_Spend ensures that these features are comparable to each other and to the encoded variables.

2.- Preparation for Machine Learning Models:

  • Categorical data has been transformed using One-Hot Encoding, which is crucial for statistical and machine learning models that cannot directly handle non-numeric variables.
  • For example, a model can use these columns to predict monthly spending based on product category, region, or age.
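As a hedged illustration of that example, the encoded columns can feed a simple regression model. The data and model choice (scikit-learn's LinearRegression) are my own assumptions, not from the original article:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Invented, already-cleaned records.
df = pd.DataFrame({
    "Age": [25, 32, 41, 29, 35],
    "Product_Category": ["Clothing", "Electronics", "Household", "Electronics", "Clothing"],
    "Monthly_Spend": [120.0, 310.5, 215.25, 280.0, 95.0],
})

# One-hot encode the categorical feature, then fit a linear model
# that predicts Monthly_Spend from Age and product category.
X = pd.get_dummies(df[["Age", "Product_Category"]], dtype=int)
y = df["Monthly_Spend"]
model = LinearRegression().fit(X, y)

# Predict spending for the first customer's feature values.
pred = model.predict(X.iloc[[0]])
```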

3.- Multidimensional Analysis:

  • It can now be observed how different combinations of categories (Product_Category, Region) or genders are associated with specific age and monthly spending patterns.
  • Example: Identify whether women in North America spend more on Electronics.
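A question like that example maps naturally onto a pandas groupby. The data here is invented; the Region column is assumed from the article's mention of it:

```python
import pandas as pd

df = pd.DataFrame({
    "Region": ["North America", "North America", "Europe", "North America"],
    "Gender": ["F", "F", "M", "M"],
    "Product_Category": ["Electronics", "Electronics", "Electronics", "Clothing"],
    "Monthly_Spend": [300.0, 280.0, 150.0, 90.0],
})

# Average spend per (Region, Gender, Product_Category) combination.
avg = df.groupby(["Region", "Gender", "Product_Category"])["Monthly_Spend"].mean()
print(avg)

# e.g. average Electronics spend of women in North America:
na_f_electronics = avg.loc[("North America", "F", "Electronics")]
```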

Practical Use of Output Data

1.- Demographic Analysis:

  • Which regions have customers with the highest monthly spending?
  • Which product categories are more popular among men or women?

2.- Customer Segmentation:

  • Identify groups with similar spending patterns for specific marketing campaigns.

3.- Prediction:

  • Predict monthly spending or the likelihood of a customer purchasing in a specific category.

In summary, the final data is optimally structured to be used in advanced analytics and predictive applications.

Consequences of not cleaning data before training

Poor model performance:

  • If the data is not properly cleaned, the model may learn incorrect or irrelevant patterns, leading to poor performance.
  • This may result in the generation of low-quality or inaccurate content, negatively affecting the usefulness of the generative AI system.

Increased risk of overfitting:

  • Uncleaned data may contain noise and outliers that the model may try to fit, which can lead to overfitting.
  • This means that the model becomes overly complex and overfits the training data, losing its ability to generalize to new, unseen data.
  • As a result, the generative model might fail when faced with real-world situations.


More articles by Carlos Sampson