Advanced Data Cleaning Techniques in Python
1. Load and Inspect the Data
Start by loading the dataset and inspecting its structure to identify issues.
import pandas as pd
# Load the data
df = pd.read_csv("data.csv")
# Quick inspection
print(df.head())
print(df.info())
print(df.describe())
2. Handle Missing Values
Check how many values are missing in each column:
print(df.isnull().sum())
Impute or Drop Missing Data
df = df.dropna(thresh=len(df) * 0.8, axis=1) # Drop columns with >20% missing
df['column_name'] = df['column_name'].fillna(df['column_name'].mean())  # Mean/median imputation
df['column_name'] = df['column_name'].fillna('Unknown')  # Fill with a placeholder
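A minimal, self-contained sketch of both imputation strategies on a toy frame (the `score` and `city` columns are made up for illustration):

```python
import pandas as pd
import numpy as np

# Toy frame with gaps in a numeric and a categorical column
df = pd.DataFrame({
    "score": [1.0, np.nan, 3.0, np.nan, 5.0],
    "city": ["Paris", None, "Lyon", "Paris", None],
})

df["score"] = df["score"].fillna(df["score"].mean())  # numeric: mean imputation
df["city"] = df["city"].fillna("Unknown")             # categorical: placeholder

print(df.isnull().sum().sum())  # 0 -- no missing values remain
```

Note that the mean of the observed scores (1, 3, 5) is 3.0, so both gaps are filled with 3.0.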
3. Fix Data Types
Convert columns to appropriate data types to avoid errors and improve performance.
df['date_column'] = pd.to_datetime(df['date_column'])  # Convert to datetime
df['numeric_column'] = pd.to_numeric(df['numeric_column'], errors='coerce')  # Convert to numeric
df['category_column'] = df['category_column'].astype('category')  # Convert to categorical
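The `errors='coerce'` option is what makes this step robust: unparseable entries become NaN instead of raising. A small sketch with a made-up `raw` column:

```python
import pandas as pd

df = pd.DataFrame({"raw": ["10", "20", "oops", "40"]})
df["num"] = pd.to_numeric(df["raw"], errors="coerce")  # "oops" becomes NaN

print(df["num"].isna().sum())  # 1 -- only the unparseable value
```

The resulting NaN can then be handled with the missing-value techniques from step 2.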
4. Outlier Detection and Handling
Q1 = df['numeric_column'].quantile(0.25)
Q3 = df['numeric_column'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
df = df[(df['numeric_column'] >= lower_bound) & (df['numeric_column'] <= upper_bound)]
Alternatively, filter by z-score (using the absolute value so that outliers on both tails are removed):
from scipy.stats import zscore
df = df[abs(zscore(df['numeric_column'])) < 3]
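To make the IQR rule concrete, here is a self-contained sketch on made-up data with one obvious outlier:

```python
import pandas as pd

df = pd.DataFrame({"numeric_column": [10, 12, 11, 13, 9, 1000]})

Q1 = df["numeric_column"].quantile(0.25)
Q3 = df["numeric_column"].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
df = df[(df["numeric_column"] >= lower) & (df["numeric_column"] <= upper)]

print(len(df))  # 5 -- the row with 1000 is dropped
```

The five clustered values survive because they fall inside [Q1 - 1.5 IQR, Q3 + 1.5 IQR], while 1000 falls far outside it.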
5. Remove Duplicates
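A minimal sketch of duplicate removal with `drop_duplicates` (the `id`/`val` columns are made up):

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 1, 2, 3, 3], "val": ["a", "a", "b", "c", "c"]})

df = df.drop_duplicates()  # drop fully identical rows
# df = df.drop_duplicates(subset=["id"], keep="first")  # or key on specific columns

print(len(df))  # 3 -- two exact-duplicate rows removed
```

Use `subset=` when rows should be considered duplicates based on a key column rather than every column.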
6. Feature Engineering
Scale or normalize features:
from sklearn.preprocessing import StandardScaler, MinMaxScaler
scaler = StandardScaler()  # or MinMaxScaler()
df['scaled_column'] = scaler.fit_transform(df[['numeric_column']])
Create new features:
df['age'] = 2024 - df['birth_year']
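A self-contained sketch of standard scaling on toy values, verifying the defining property (zero mean, unit variance):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"numeric_column": [10.0, 20.0, 30.0]})
scaler = StandardScaler()
df["scaled_column"] = scaler.fit_transform(df[["numeric_column"]])

print(round(df["scaled_column"].mean(), 6))  # 0.0 -- standardized to zero mean
```

`MinMaxScaler` would instead map the values into [0, 1]; which to use depends on the downstream model.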
7. Text Data Cleaning
import re

def clean_text(text):
    text = text.lower()
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Remove non-alphabet characters
    return text

df['text_column'] = df['text_column'].apply(clean_text)
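As a quick check, the cleaner behaves like this on a sample string:

```python
import re

def clean_text(text):
    text = text.lower()
    return re.sub(r'[^a-zA-Z\s]', '', text)  # keep only letters and whitespace

print(repr(clean_text("Hello, World! 123")))  # 'hello world '
```

Punctuation and digits are stripped but whitespace is preserved, so a follow-up `.strip()` or whitespace normalization may be useful depending on the task.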
Tokenization:
from nltk.tokenize import word_tokenize
df['tokens'] = df['text_column'].apply(word_tokenize)
8. Handle Imbalanced Data
from sklearn.utils import resample

# Upsample the minority class
minority_class = df[df['target'] == 1]
majority_class = df[df['target'] == 0]
minority_upsampled = resample(minority_class, replace=True,
                              n_samples=len(majority_class), random_state=42)
df = pd.concat([majority_class, minority_upsampled])
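A self-contained sketch of the upsampling step on a toy imbalanced target (6 zeros vs. 2 ones), showing that the classes end up balanced:

```python
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({"x": range(8), "target": [0, 0, 0, 0, 0, 0, 1, 1]})

minority = df[df["target"] == 1]
majority = df[df["target"] == 0]
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_up])

print(balanced["target"].value_counts().to_dict())  # {0: 6, 1: 6}
```

Because `replace=True`, minority rows are sampled with replacement, so the same row can appear multiple times; do this only on the training split, never before the train/test split.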
9. Validate and Save Cleaned Data
print(df.isnull().sum())
print(df.describe())
df.to_csv("cleaned_data.csv", index=False)
10. Automate and Reuse
Wrap these steps into functions or classes for reusability:
def clean_data(df):
    # Apply the cleaning steps above
    return df
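As an illustration, here is one way such a function might chain a few of the steps above; the column handling is generic and the sample data is made up:

```python
import pandas as pd
import numpy as np

def clean_data(df):
    """Sketch of a reusable pipeline: dedupe, then mean-impute numeric gaps."""
    df = df.drop_duplicates()
    num_cols = df.select_dtypes(include="number").columns
    df[num_cols] = df[num_cols].fillna(df[num_cols].mean())
    return df

raw = pd.DataFrame({"a": [1.0, np.nan, 1.0, 4.0], "b": [1, 2, 1, 3]})
cleaned = clean_data(raw)
print(len(cleaned), cleaned.isnull().sum().sum())  # 3 rows, 0 missing values
```

Keeping each step as its own small function makes the pipeline easy to test and to reorder per dataset.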
Free Python learning and practice opportunities:
Here’s a list of platforms and labs offering free Python learning and practice opportunities:
1. Google's Python Class
2. Kaggle
3. SoloLearn
4. HackerRank
5. Codewars
6. EdX
7. freeCodeCamp
8. Real Python
9. W3Schools Python Tutorial
10. GeeksforGeeks Python Programming
11. PythonAnywhere
12. Project Euler
13. PyBites
14. Python.org Tutorials