Advanced Data Cleaning Techniques in Python


1. Load and Inspect the Data

Start by loading the dataset and inspecting its structure to identify issues.

import pandas as pd

# Load the data

df = pd.read_csv("data.csv")

# Quick inspection

print(df.head())

df.info()  # Prints its summary directly and returns None, so no print() is needed

print(df.describe())
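
As a complement to the calls above, a per-column overview can make problems easier to spot at a glance. The following is a minimal sketch: the helper name `summarize_columns` and the sample data are illustrative assumptions, not part of the original walkthrough.

```python
import pandas as pd

def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing count/share, distinct values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isnull().sum(),
        "pct_missing": df.isnull().mean() * 100,
        "n_unique": df.nunique(),  # NaN is not counted as a distinct value
    })

# Hypothetical sample data standing in for data.csv
sample = pd.DataFrame({
    "age": [25, None, 40, 31],
    "city": ["Paris", "Lyon", None, "Paris"],
})
summary = summarize_columns(sample)
print(summary)
```

Columns with a high `pct_missing` or an unexpected `dtype` are natural candidates for the steps that follow.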


2. Handle Missing Values

  • Identify Missing Values

print(df.isnull().sum())

Impute or Drop Missing Data

  • Drop rows/columns with excessive missing values:

df = df.dropna(thresh=int(len(df) * 0.8), axis=1)  # Keep columns with at least ~80% non-null values (drops columns with more than ~20% missing)

  • Impute missing values:

# Option A, numeric columns: mean imputation (use .median() for skewed data)
df['column_name'] = df['column_name'].fillna(df['column_name'].mean())

# Option B, text or categorical columns: fill with a placeholder
df['column_name'] = df['column_name'].fillna('Unknown')
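
To make the choice between imputation strategies concrete, here is a small sketch on made-up data. The column names `income` and `segment` are hypothetical, and median/mode imputation is shown as one reasonable option rather than a recommendation for every dataset.

```python
import pandas as pd

# Hypothetical sample data with gaps in a numeric and a categorical column
df_demo = pd.DataFrame({
    "income": [30_000.0, None, 52_000.0, 1_000_000.0],
    "segment": ["a", "b", None, "b"],
})

# The median is less sensitive to extreme values than the mean
df_demo["income"] = df_demo["income"].fillna(df_demo["income"].median())

# For categorical data, the most frequent value (mode) is one common choice
most_frequent = df_demo["segment"].mode().iloc[0]
df_demo["segment"] = df_demo["segment"].fillna(most_frequent)

print(df_demo)
```

Assigning the result back to the column, instead of calling `fillna(..., inplace=True)` on a selected column, avoids the chained-assignment warnings that recent pandas versions emit.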


3. Fix Data Types

Convert columns to appropriate data types to avoid errors and improve performance.

# Convert to datetime
df['date_column'] = pd.to_datetime(df['date_column'])

# Convert to numeric
df['numeric_column'] = pd.to_numeric(df['numeric_column'], errors='coerce')

# Convert to categorical
df['category_column'] = df['category_column'].astype('category')
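
The effect of `errors='coerce'` is easiest to see on a tiny example. The sketch below uses invented column names and values, and it assumes dates in `YYYY-MM-DD` form; entries that cannot be parsed become `NaN` or `NaT` instead of raising an error.

```python
import pandas as pd

# Hypothetical raw data where some entries cannot be parsed
raw = pd.DataFrame({
    "price": ["10.5", "n/a", "7"],
    "signup": ["2024-01-15", "not a date", "2024-03-02"],
})

# Unparseable entries become NaN / NaT instead of raising an error
raw["price"] = pd.to_numeric(raw["price"], errors="coerce")
raw["signup"] = pd.to_datetime(raw["signup"], format="%Y-%m-%d", errors="coerce")

# Count how many values failed to convert in each column
print(raw.isnull().sum())
```

Checking the null counts right after conversion shows how much data was silently coerced, which is worth reviewing before imputing or dropping anything.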


4. Outlier Detection and Handling

  • Use IQR for outlier detection:

Q1 = df['numeric_column'].quantile(0.25)
Q3 = df['numeric_column'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
df = df[(df['numeric_column'] >= lower_bound) & (df['numeric_column'] <= upper_bound)]
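
The IQR rule can also be wrapped in a small helper so that outliers are flagged for review rather than dropped immediately. This is a sketch under assumptions: the function name `iqr_bounds` and the sample values are illustrative only.

```python
import pandas as pd

def iqr_bounds(series: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical measurements with one obvious outlier
values = pd.Series([10, 12, 11, 13, 12, 11, 95])
lower, upper = iqr_bounds(values)

# Boolean mask marking values outside the fences
is_outlier = (values < lower) | (values > upper)
print(values[is_outlier].tolist())  # → [95]
```

Keeping the mask separate from the filtering step makes it possible to inspect, cap, or remove the flagged rows depending on what the data represents.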

  • Use Z-score for outlier detection:

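import numpy as np
from scipy.stats import zscore

# Keep rows within 3 standard deviations of the mean, on either side.
# Handle missing values first: NaNs propagate through zscore by default.
df = df[np.abs(zscore(df['numeric_column'])) < 3]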
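
The same z-score filter can be written with pandas alone, which makes the formula explicit. The data below is invented for illustration; note that the absolute value is what catches outliers on both tails.

```python
import pandas as pd

# Hypothetical data: twenty ordinary values and one extreme value
values = pd.Series([10.0] * 10 + [12.0] * 10 + [100.0])

# Population standard deviation (ddof=0) matches scipy.stats.zscore's default
z = (values - values.mean()) / values.std(ddof=0)

# abs() catches outliers on both tails
kept = values[z.abs() < 3]
print(len(values), len(kept))  # → 21 20
```

On very small samples the z-score of a single extreme point is bounded and may never reach 3, so the IQR rule is often the safer choice there.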


5. Remove Duplicates

  • Drop rows that are exact copies of an earlier row:

df = df.drop_duplicates()
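
For the duplicate-removal step named in this section's heading, a minimal sketch with pandas might look like the following. The `orders` table and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical order data containing repeated rows
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 3],
    "email": ["a@x.com", "a@x.com", "b@x.com", "b@x.com", "c@x.com"],
    "amount": [10, 10, 20, 25, 30],
})

# Count fully identical rows before removing them
print(orders.duplicated().sum())  # → 1

# Remove rows that are identical in every column
exact_deduped = orders.drop_duplicates()

# Treat rows as duplicates when key columns match, keeping the last occurrence
key_deduped = orders.drop_duplicates(subset=["customer_id", "email"], keep="last")
print(key_deduped)
```

Whether to deduplicate on all columns or on a key subset depends on what counts as "the same record" in the dataset at hand.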


6. Feature Engineering

  • Standardize/Normalize Data:

from sklearn.preprocessing import StandardScaler, MinMaxScaler

scaler = StandardScaler()  # or MinMaxScaler()
df['scaled_column'] = scaler.fit_transform(df[['numeric_column']])

  • Create New Features:

df['age'] = pd.Timestamp.today().year - df['birth_year']  # Approximate age; avoids hard-coding the current year
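
To show what the two scalers compute, the sketch below reproduces standardization and min-max normalization by hand with pandas. The sample values are made up, and the scikit-learn classes remain the more convenient choice inside a modeling pipeline.

```python
import pandas as pd

# Hypothetical numeric column
df_demo = pd.DataFrame({"numeric_column": [2.0, 4.0, 6.0, 8.0]})
col = df_demo["numeric_column"]

# Standardization: zero mean, unit variance (ddof=0, as StandardScaler uses)
df_demo["standardized"] = (col - col.mean()) / col.std(ddof=0)

# Min-max normalization: rescale to the [0, 1] range
df_demo["normalized"] = (col - col.min()) / (col.max() - col.min())

print(df_demo)
```

Standardization suits features that are roughly bell-shaped, while min-max scaling keeps values in a fixed range but is more sensitive to outliers.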


7. Text Data Cleaning

  • Lowercase, remove punctuation, stop words, and special characters:

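import re

def clean_text(text):
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)  # Remove everything except letters and whitespace
    return text

# Note: this function does not remove stop words; filter tokens against a stop-word list after tokenization.
df['text_column'] = df['text_column'].apply(clean_text)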

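  • Tokenization:

# Requires NLTK's tokenizer data, e.g. nltk.download('punkt') (newer NLTK releases use 'punkt_tab')
from nltk.tokenize import word_tokenize

df['tokens'] = df['text_column'].apply(word_tokenize)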
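
Stop-word removal, mentioned in the first bullet of this section, can be sketched without any extra downloads. The stop-word set and the function name `clean_and_filter` below are illustrative assumptions, and whitespace splitting stands in for a proper tokenizer.

```python
import re

# Small illustrative stop-word set; a real project would likely use a fuller list
STOP_WORDS = {"the", "a", "an", "is", "and", "of", "to", "in"}

def clean_and_filter(text: str) -> list[str]:
    """Lowercase, strip non-letters, split on whitespace, and drop stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", "", text)  # keep letters and whitespace only
    tokens = text.split()                 # simple whitespace tokenization
    return [t for t in tokens if t not in STOP_WORDS]

print(clean_and_filter("The price of the item is $42, and it ships TODAY!"))
# → ['price', 'item', 'it', 'ships', 'today']
```

The same filtering step can be applied to the output of `word_tokenize` when a more careful tokenizer is needed.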


8. Handle Imbalanced Data

  • Upsample minority class or downsample majority class for classification tasks:

from sklearn.utils import resample

# Upsample minority
minority_class = df[df['target'] == 1]
majority_class = df[df['target'] == 0]
minority_upsampled = resample(minority_class, replace=True, n_samples=len(majority_class), random_state=42)
df = pd.concat([majority_class, minority_upsampled])
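
The opposite approach, downsampling the majority class, can be sketched in a similar way. This version uses pandas' `DataFrame.sample` in place of `sklearn.utils.resample`, and the toy data is invented for illustration.

```python
import pandas as pd

# Hypothetical imbalanced data: eight rows of class 0, two rows of class 1
data = pd.DataFrame({
    "feature": range(10),
    "target": [0] * 8 + [1] * 2,
})

minority = data[data["target"] == 1]
majority = data[data["target"] == 0]

# Sample the majority class without replacement down to the minority size
majority_downsampled = majority.sample(n=len(minority), random_state=42)

# Recombine and shuffle so the classes are not grouped together
balanced = pd.concat([majority_downsampled, minority]).sample(frac=1, random_state=42)
print(balanced["target"].value_counts())
```

Downsampling discards data, so it tends to fit cases where the majority class is large enough that losing rows is acceptable. In either direction, resampling is usually applied to the training split only.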


9. Validate and Save Cleaned Data

  • Validate:

print(df.isnull().sum())
print(df.describe())

  • Save:

df.to_csv("cleaned_data.csv", index=False)
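
Beyond printing summaries, validation can be expressed as explicit checks that run before the file is saved. The following is a sketch under assumptions: the function name `validate`, the chosen checks, and the sample data are all illustrative.

```python
import pandas as pd

def validate(df: pd.DataFrame, required_columns: list[str]) -> list[str]:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    missing_cols = [c for c in required_columns if c not in df.columns]
    if missing_cols:
        problems.append(f"missing columns: {missing_cols}")
    null_counts = df.isnull().sum()
    for col, n in null_counts[null_counts > 0].items():
        problems.append(f"{col}: {n} null value(s)")
    n_dupes = int(df.duplicated().sum())
    if n_dupes:
        problems.append(f"{n_dupes} duplicate row(s)")
    return problems

# Hypothetical cleaned data
cleaned = pd.DataFrame({"id": [1, 2, 3], "score": [0.5, 0.7, 0.9]})
issues = validate(cleaned, required_columns=["id", "score"])
print(issues)  # → []
```

Returning a list of problems rather than raising on the first one gives a full picture of what still needs fixing.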

10. Automate and Reuse

Wrap these steps into functions or classes for reusability:

def clean_data(df):
    # Apply cleaning steps
    return df
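
As one possible way to fill in that skeleton, the sketch below chains a few of the earlier steps: duplicate removal, type conversion, median imputation, and IQR filtering. The `numeric_column` parameter and the sample data are assumptions made for illustration; a real pipeline would choose its steps per dataset.

```python
import pandas as pd

def clean_data(df: pd.DataFrame, numeric_column: str) -> pd.DataFrame:
    """Apply a few cleaning steps in sequence and return a new DataFrame."""
    out = df.copy()                      # leave the caller's data untouched
    out = out.drop_duplicates()          # remove exact duplicate rows
    out[numeric_column] = pd.to_numeric(out[numeric_column], errors="coerce")
    out[numeric_column] = out[numeric_column].fillna(out[numeric_column].median())
    q1, q3 = out[numeric_column].quantile([0.25, 0.75])
    iqr = q3 - q1
    in_range = out[numeric_column].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    return out[in_range].reset_index(drop=True)

# Hypothetical raw data: a duplicate, an unparseable entry, and an outlier
raw = pd.DataFrame({"value": ["10", "11", "11", "oops", "12", "500"]})
cleaned = clean_data(raw, numeric_column="value")
print(cleaned["value"].tolist())  # → [10.0, 11.0, 11.5, 12.0]
```

Working on a copy and returning the result keeps the function free of side effects, which makes it easier to test and reuse.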


Free Python learning and practice opportunities:

Here’s a list of platforms and labs offering free Python learning and practice opportunities:

1. Google's Python Class

  • Features: Free tutorials and exercises. Focuses on basic Python concepts and common tasks like file I/O, data handling, and more. Includes downloadable lecture videos and practice materials.
  • Best for: Beginners.

2. Kaggle

  • Features: Python courses focusing on data analysis, machine learning, and visualization. Interactive notebooks for hands-on practice. Free datasets and challenges for applying Python skills.
  • Best for: Data science enthusiasts.

3. SoloLearn

  • Features: Interactive Python tutorials with quizzes. Community-driven discussions for questions. Mobile-friendly app for on-the-go learning.
  • Best for: Beginners seeking quick lessons.

4. HackerRank

  • Features: Python-specific challenges across different levels. Gamified experience with badges and leaderboards. Focus on problem-solving and algorithmic thinking.
  • Best for: Practice with a competitive edge.

5. Codewars

  • Features: Challenges (kata) to improve Python skills. Choose challenges by difficulty level. Collaborative learning through shared solutions.
  • Best for: Intermediate and advanced users.

6. edX

  • Features: Offers free Python courses from institutions like MIT and Microsoft. Audit courses for free (certificate requires payment).
  • Best for: Structured academic learning.

7. freeCodeCamp

  • Features: Comprehensive Python curriculum, including topics like NumPy, pandas, and data visualization. Hands-on projects and interactive lessons.
  • Best for: Beginners to intermediate learners.

8. Real Python

  • Features: Free tutorials on Python basics and advanced topics. Covers tools, libraries, and frameworks. Blog-style lessons with code examples.
  • Best for: Intermediate learners.

9. W3Schools Python Tutorial

  • Features: Simple and beginner-friendly. Interactive Python shell for coding alongside tutorials.
  • Best for: Absolute beginners.

10. GeeksforGeeks Python Programming

  • Features: Beginner to advanced-level topics. Covers theoretical concepts with practical examples.
  • Best for: Comprehensive understanding.

11. PythonAnywhere

  • Features: Free cloud-based Python environment for practicing code. Ideal for experimenting without needing local installations.
  • Best for: Hands-on coding.


12. Project Euler

  • Features: Mathematics-based Python challenges. Ideal for sharpening logic and problem-solving skills.
  • Best for: Advanced users with an interest in algorithms.


13. PyBites

  • Features: Free beginner-level exercises. Bite-sized Python challenges.
  • Best for: Busy learners seeking quick practice.


14. Python.org Tutorials

  • Features: Official Python documentation with a free beginner's guide. Includes examples and exercises.
  • Best for: Learners who prefer an official source.


