Automate Data Cleaning Like a Pro with Pandas!
Kaibalya Biswal
Always a Learner-- || Professor || Tech fanatic ?? || Guiding and Mentoring || Data Science & ML , Tableau, PowerBI, SQL,Statistics (Basic to Advance) , Business Analytics || Kaggle Contributor
Data cleaning is often the most time-consuming step in data science. Handling missing values, removing duplicates, normalizing numerical data, encoding categorical variables, and processing text can be overwhelming. What if you could automate it all? ??
This Pandas-based automation script does it all: ? Fills missing values for numerical, categorical, and textual data. ? Removes duplicates to ensure clean datasets. ? Detects and removes outliers using the IQR method. ? Encodes categorical variables while grouping rare categories into “Other.” ? Normalizes numerical features for better model performance. ? Cleans textual data by removing stopwords, punctuation, and correcting spelling errors. ? Extracts valuable features from date columns for better analysis.
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler, MinMaxScaler
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from textblob import TextBlob
import string
import re
import nltk
nltk.download('stopwords')
nltk.download('punkt')
def clean_and_preprocess_data(df, normalize=True, encoding=True, outlier_removal=True, text_processing=True, feature_engineering=True):
"""
Cleans and preprocesses any DataFrame by:
- Handling missing values
- Removing duplicates
- Encoding categorical features
- Normalizing numerical features
- Removing outliers using IQR method
- Processing textual data by removing stopwords, punctuation, lowercasing, and correcting spelling
- Feature Engineering: Extracting date components and handling high cardinality categorical variables
Parameters:
df (pd.DataFrame): The input DataFrame.
normalize (bool): If True, normalizes numerical columns using StandardScaler.
encoding (bool): If True, encodes categorical columns using LabelEncoder.
outlier_removal (bool): If True, removes outliers using the IQR method.
text_processing (bool): If True, processes textual data.
feature_engineering (bool): If True, performs feature engineering.
Returns:
pd.DataFrame: Cleaned and preprocessed DataFrame.
"""
df = df.copy()
# Remove duplicates
df.drop_duplicates(inplace=True)
# Handle missing values (fill numeric with median, categorical with mode, text with empty string)
for col in df.columns:
if df[col].dtype == np.number:
df[col].fillna(df[col].median(), inplace=True)
elif df[col].dtype == 'O':
df[col].fillna(df[col].mode()[0], inplace=True)
# Remove outliers using IQR
if outlier_removal:
for col in df.select_dtypes(include=[np.number]).columns:
Q1 = df[col].quantile(0.25)
Q3 = df[col].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
df = df[(df[col] >= lower_bound) & (df[col] <= upper_bound)]
# Encode categorical features with high cardinality handling
if encoding:
label_encoders = {}
for col in df.select_dtypes(include=['object']).columns:
if df[col].nunique() > 10:
df[col] = df[col].apply(lambda x: x if df[col].value_counts()[x] > 5 else 'Other')
le = LabelEncoder()
df[col] = le.fit_transform(df[col])
label_encoders[col] = le
# Normalize numerical features
if normalize:
scaler = StandardScaler()
df[df.select_dtypes(include=[np.number]).columns] = scaler.fit_transform(df.select_dtypes(include=[np.number]))
# Process textual data
if text_processing:
stop_words = set(stopwords.words('english'))
def process_text(text):
text = text.lower()
text = re.sub(f"[{string.punctuation}]", "", text)
words = word_tokenize(text)
words = [word for word in words if word not in stop_words and word.isalpha()]
corrected_text = " ".join([str(TextBlob(word).correct()) for word in words])
return corrected_text
for col in df.select_dtypes(include=['object']).columns:
df[col] = df[col].astype(str).apply(process_text)
# Feature Engineering: Extract date components
if feature_engineering:
for col in df.select_dtypes(include=['datetime64']).columns:
df[col + '_year'] = df[col].dt.year
df[col + '_month'] = df[col].dt.month
df[col + '_day'] = df[col].dt.day
return df
# Example Usage:
if __name__ == "__main__":
data = {
'Name': ['Alice', 'Bob', 'Charlie', np.nan, 'Eve'],
'Age': [25, 30, 35, 40, np.nan],
'Salary': [50000, 60000, np.nan, 80000, 90000],
'Gender': ['F', 'M', 'M', 'F', 'F'],
'Description': ['Loves Python!', 'Data scientist at work.', np.nan, 'ML enthusiast!', 'AI & ML expert.'],
'Join_Date': pd.to_datetime(['2021-06-01', '2020-09-15', '2019-12-10', '2022-01-20', '2018-07-07'])
}
df = pd.DataFrame(data)
print("Original Data:")
print(df)
cleaned_df = clean_and_preprocess_data(df)
print("\nCleaned and Preprocessed Data:")
print(cleaned_df)
Intern Digital Marketing & Lead Generation | AI CERTS
1 周Thanks for sharing this insightful post, Kaibalya! If you're interested in further honing your tech skills, I thought you might like this upcoming event. Join AI CERTs for a free webinar on "Mastering AI Development: Building Smarter Applications with Machine Learning" on March 20, 2025. Register at: https://bit.ly/s-ai-development-machine-learning. Participants will also receive a certification of participation.
Aspiring Business Intelligence Professional | AI Enthusiast | Future Entrepreneur in Healthcare, technology & AI | Driven by Innovation
1 周Import pandas as pd??