Automate Data Cleaning Like a Pro with Pandas!

Automate Data Cleaning Like a Pro with Pandas!

Data cleaning is often the most time-consuming step in data science. Handling missing values, removing duplicates, normalizing numerical data, encoding categorical variables, and processing text can be overwhelming. What if you could automate it all? ??

This Pandas-based automation script does it all: ? Fills missing values for numerical, categorical, and textual data. ? Removes duplicates to ensure clean datasets. ? Detects and removes outliers using the IQR method. ? Encodes categorical variables while grouping rare categories into “Other.” ? Normalizes numerical features for better model performance. ? Cleans textual data by removing stopwords, punctuation, and correcting spelling errors. ? Extracts valuable features from date columns for better analysis.

import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler, MinMaxScaler
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from textblob import TextBlob
import string
import re
import nltk
nltk.download('stopwords')
nltk.download('punkt')

def clean_and_preprocess_data(df, normalize=True, encoding=True, outlier_removal=True, text_processing=True, feature_engineering=True):
    """
    Cleans and preprocesses any DataFrame by:
    - Handling missing values
    - Removing duplicates
    - Encoding categorical features
    - Normalizing numerical features
    - Removing outliers using IQR method
    - Processing textual data by removing stopwords, punctuation, lowercasing, and correcting spelling
    - Feature Engineering: Extracting date components and handling high cardinality categorical variables
    
    Parameters:
    df (pd.DataFrame): The input DataFrame.
    normalize (bool): If True, normalizes numerical columns using StandardScaler.
    encoding (bool): If True, encodes categorical columns using LabelEncoder.
    outlier_removal (bool): If True, removes outliers using the IQR method.
    text_processing (bool): If True, processes textual data.
    feature_engineering (bool): If True, performs feature engineering.
    
    Returns:
    pd.DataFrame: Cleaned and preprocessed DataFrame.
    """
    df = df.copy()
    
    # Remove duplicates
    df.drop_duplicates(inplace=True)
    
    # Handle missing values (fill numeric with median, categorical with mode, text with empty string)
    for col in df.columns:
        if df[col].dtype == np.number:
            df[col].fillna(df[col].median(), inplace=True)
        elif df[col].dtype == 'O':
            df[col].fillna(df[col].mode()[0], inplace=True)
    
    # Remove outliers using IQR
    if outlier_removal:
        for col in df.select_dtypes(include=[np.number]).columns:
            Q1 = df[col].quantile(0.25)
            Q3 = df[col].quantile(0.75)
            IQR = Q3 - Q1
            lower_bound = Q1 - 1.5 * IQR
            upper_bound = Q3 + 1.5 * IQR
            df = df[(df[col] >= lower_bound) & (df[col] <= upper_bound)]
    
    # Encode categorical features with high cardinality handling
    if encoding:
        label_encoders = {}
        for col in df.select_dtypes(include=['object']).columns:
            if df[col].nunique() > 10:
                df[col] = df[col].apply(lambda x: x if df[col].value_counts()[x] > 5 else 'Other')
            le = LabelEncoder()
            df[col] = le.fit_transform(df[col])
            label_encoders[col] = le
    
    # Normalize numerical features
    if normalize:
        scaler = StandardScaler()
        df[df.select_dtypes(include=[np.number]).columns] = scaler.fit_transform(df.select_dtypes(include=[np.number]))
    
    # Process textual data
    if text_processing:
        stop_words = set(stopwords.words('english'))
        def process_text(text):
            text = text.lower()
            text = re.sub(f"[{string.punctuation}]", "", text)
            words = word_tokenize(text)
            words = [word for word in words if word not in stop_words and word.isalpha()]
            corrected_text = " ".join([str(TextBlob(word).correct()) for word in words])
            return corrected_text
        
        for col in df.select_dtypes(include=['object']).columns:
            df[col] = df[col].astype(str).apply(process_text)
    
    # Feature Engineering: Extract date components
    if feature_engineering:
        for col in df.select_dtypes(include=['datetime64']).columns:
            df[col + '_year'] = df[col].dt.year
            df[col + '_month'] = df[col].dt.month
            df[col + '_day'] = df[col].dt.day
    
    return df

# Example Usage:
if __name__ == "__main__":
    data = {
        'Name': ['Alice', 'Bob', 'Charlie', np.nan, 'Eve'],
        'Age': [25, 30, 35, 40, np.nan],
        'Salary': [50000, 60000, np.nan, 80000, 90000],
        'Gender': ['F', 'M', 'M', 'F', 'F'],
        'Description': ['Loves Python!', 'Data scientist at work.', np.nan, 'ML enthusiast!', 'AI & ML expert.'],
        'Join_Date': pd.to_datetime(['2021-06-01', '2020-09-15', '2019-12-10', '2022-01-20', '2018-07-07'])
    }
    df = pd.DataFrame(data)
    print("Original Data:")
    print(df)
    
    cleaned_df = clean_and_preprocess_data(df)
    print("\nCleaned and Preprocessed Data:")
    print(cleaned_df)
        
Ch Sujata

Intern Digital Marketing & Lead Generation | AI CERTS

1 周

Thanks for sharing this insightful post, Kaibalya! If you're interested in further honing your tech skills, I thought you might like this upcoming event. Join AI CERTs for a free webinar on "Mastering AI Development: Building Smarter Applications with Machine Learning" on March 20, 2025. Register at: https://bit.ly/s-ai-development-machine-learning. Participants will also receive a certification of participation.

Raj Mishra

Aspiring Business Intelligence Professional | AI Enthusiast | Future Entrepreneur in Healthcare, technology & AI | Driven by Innovation

1 周

Import pandas as pd??

要查看或添加评论,请登录

Kaibalya Biswal的更多文章

社区洞察

其他会员也浏览了