登录查看更多内容

Automate Data Cleaning Like a Pro with Pandas!

Kaibalya Biswal

Always a Learner-- || Professor || Tech fanatic ?? || Guiding and Mentoring || Data Science & ML , Tableau, PowerBI, SQL,Statistics (Basic to Advance) , Business Analytics || Kaggle Contributor

发布日期: 2025年3月17日

Data cleaning is often the most time-consuming step in data science. Handling missing values, removing duplicates, normalizing numerical data, encoding categorical variables, and processing text can be overwhelming. What if you could automate it all? ??

This Pandas-based automation script does it all: ? Fills missing values for numerical, categorical, and textual data. ? Removes duplicates to ensure clean datasets. ? Detects and removes outliers using the IQR method. ? Encodes categorical variables while grouping rare categories into “Other.” ? Normalizes numerical features for better model performance. ? Cleans textual data by removing stopwords, punctuation, and correcting spelling errors. ? Extracts valuable features from date columns for better analysis.

import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler, MinMaxScaler
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from textblob import TextBlob
import string
import re
import nltk
nltk.download('stopwords')
nltk.download('punkt')

def clean_and_preprocess_data(df, normalize=True, encoding=True, outlier_removal=True, text_processing=True, feature_engineering=True):
    """
    Cleans and preprocesses any DataFrame by:
    - Handling missing values
    - Removing duplicates
    - Encoding categorical features
    - Normalizing numerical features
    - Removing outliers using IQR method
    - Processing textual data by removing stopwords, punctuation, lowercasing, and correcting spelling
    - Feature Engineering: Extracting date components and handling high cardinality categorical variables
    
    Parameters:
    df (pd.DataFrame): The input DataFrame.
    normalize (bool): If True, normalizes numerical columns using StandardScaler.
    encoding (bool): If True, encodes categorical columns using LabelEncoder.
    outlier_removal (bool): If True, removes outliers using the IQR method.
    text_processing (bool): If True, processes textual data.
    feature_engineering (bool): If True, performs feature engineering.
    
    Returns:
    pd.DataFrame: Cleaned and preprocessed DataFrame.
    """
    df = df.copy()
    
    # Remove duplicates
    df.drop_duplicates(inplace=True)
    
    # Handle missing values (fill numeric with median, categorical with mode, text with empty string)
    for col in df.columns:
        if df[col].dtype == np.number:
            df[col].fillna(df[col].median(), inplace=True)
        elif df[col].dtype == 'O':
            df[col].fillna(df[col].mode()[0], inplace=True)
    
    # Remove outliers using IQR
    if outlier_removal:
        for col in df.select_dtypes(include=[np.number]).columns:
            Q1 = df[col].quantile(0.25)
            Q3 = df[col].quantile(0.75)
            IQR = Q3 - Q1
            lower_bound = Q1 - 1.5 * IQR
            upper_bound = Q3 + 1.5 * IQR
            df = df[(df[col] >= lower_bound) & (df[col] <= upper_bound)]
    
    # Encode categorical features with high cardinality handling
    if encoding:
        label_encoders = {}
        for col in df.select_dtypes(include=['object']).columns:
            if df[col].nunique() > 10:
                df[col] = df[col].apply(lambda x: x if df[col].value_counts()[x] > 5 else 'Other')
            le = LabelEncoder()
            df[col] = le.fit_transform(df[col])
            label_encoders[col] = le
    
    # Normalize numerical features
    if normalize:
        scaler = StandardScaler()
        df[df.select_dtypes(include=[np.number]).columns] = scaler.fit_transform(df.select_dtypes(include=[np.number]))
    
    # Process textual data
    if text_processing:
        stop_words = set(stopwords.words('english'))
        def process_text(text):
            text = text.lower()
            text = re.sub(f"[{string.punctuation}]", "", text)
            words = word_tokenize(text)
            words = [word for word in words if word not in stop_words and word.isalpha()]
            corrected_text = " ".join([str(TextBlob(word).correct()) for word in words])
            return corrected_text
        
        for col in df.select_dtypes(include=['object']).columns:
            df[col] = df[col].astype(str).apply(process_text)
    
    # Feature Engineering: Extract date components
    if feature_engineering:
        for col in df.select_dtypes(include=['datetime64']).columns:
            df[col + '_year'] = df[col].dt.year
            df[col + '_month'] = df[col].dt.month
            df[col + '_day'] = df[col].dt.day
    
    return df

# Example Usage:
if __name__ == "__main__":
    data = {
        'Name': ['Alice', 'Bob', 'Charlie', np.nan, 'Eve'],
        'Age': [25, 30, 35, 40, np.nan],
        'Salary': [50000, 60000, np.nan, 80000, 90000],
        'Gender': ['F', 'M', 'M', 'F', 'F'],
        'Description': ['Loves Python!', 'Data scientist at work.', np.nan, 'ML enthusiast!', 'AI & ML expert.'],
        'Join_Date': pd.to_datetime(['2021-06-01', '2020-09-15', '2019-12-10', '2022-01-20', '2018-07-07'])
    }
    df = pd.DataFrame(data)
    print("Original Data:")
    print(df)
    
    cleaned_df = clean_and_preprocess_data(df)
    print("\nCleaned and Preprocessed Data:")
    print(cleaned_df)

Ch Sujata

Intern Digital Marketing & Lead Generation | AI CERTS

1 周

Thanks for sharing this insightful post, Kaibalya! If you're interested in further honing your tech skills, I thought you might like this upcoming event. Join AI CERTs for a free webinar on "Mastering AI Development: Building Smarter Applications with Machine Learning" on March 20, 2025. Register at: https://bit.ly/s-ai-development-machine-learning. Participants will also receive a certification of participation.

1 次回应

Raj Mishra

Aspiring Business Intelligence Professional | AI Enthusiast | Future Entrepreneur in Healthcare, technology & AI | Driven by Innovation

1 周

Import pandas as pd??

1 次回应

查看更多评论

要查看或添加评论，请登录

Kaibalya Biswal的更多文章

Writing Cleaner Python Code: Best Practices and Examples

2025年3月13日

Writing Cleaner Python Code: Best Practices and Examples

Python is praised for its simplicity and readability, yet even experienced developers can produce cluttered…

2 条评论
Why Data Quality Matters in AI

2024年11月13日

Why Data Quality Matters in AI

In the world of AI, data is the foundation on which everything is built. Without clean, accurate data, even the most…
The Power of Python: Transforming Data Gathering Across Industries

2024年11月12日

The Power of Python: Transforming Data Gathering Across Industries

In today's data-driven world, Python has emerged as one of the most powerful tools, revolutionizing the way we gather…

2 条评论

Automate Data Cleaning Like a Pro with Pandas!

Kaibalya Biswal

Always a Learner-- || Professor || Tech fanatic ?? || Guiding and Mentoring || Data Science & ML , Tableau, PowerBI, SQL,Statistics (Basic to Advance) , Business Analytics || Kaggle Contributor

Kaibalya Biswal的更多文章

社区洞察

其他会员也浏览了

How to Find Time Complexity of Any Recursive Algorithm: Part 1

S4: Episode 4: Principal Component Analysis (PCA)

Vectorized Quick Sort Part 2

Which path leads to success? Explore the world of decision trees

Machine Learning in Simple Steps

Pip install autoroc

Extra Tree Classifier / Regressor

The Role of Feature Engineering in Building Robust Machine Learning Models

Basics of LLMs 'chunks'

Data Science Series: The Hidden Pitfalls of Feature Engineering

Kaibalya Biswal的更多文章

Writing Cleaner Python Code: Best Practices and Examples

Why Data Quality Matters in AI

The Power of Python: Transforming Data Gathering Across Industries

社区洞察

其他会员也浏览了

How to Find Time Complexity of Any Recursive Algorithm: Part 1

S4: Episode 4: Principal Component Analysis (PCA)

Vectorized Quick Sort Part 2

Which path leads to success? Explore the world of decision trees

Machine Learning in Simple Steps

Pip install autoroc

Extra Tree Classifier / Regressor

The Role of Feature Engineering in Building Robust Machine Learning Models

Basics of LLMs 'chunks'

Data Science Series: The Hidden Pitfalls of Feature Engineering