Advanced Data Cleaning Techniques in Python
1. Load and Inspect the Data
Start by loading the dataset and inspecting its structure to identify issues.
import pandas as pd
# Load the data
df = pd.read_csv("data.csv")
# Quick inspection
print(df.head())
print(df.info())
print(df.describe())
2. Handle Missing Values
Check how many values are missing in each column:
print(df.isnull().sum())
Impute or Drop Missing Data
df = df.dropna(thresh=len(df) * 0.8, axis=1) # Drop columns with >20% missing
df['column_name'] = df['column_name'].fillna(df['column_name'].mean())  # Mean/median imputation
df['column_name'] = df['column_name'].fillna('Unknown')  # Fill with a placeholder
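A minimal, self-contained sketch of both imputation strategies on a toy frame (the `score` and `city` columns are made up for illustration):

```python
import pandas as pd
import numpy as np

# Toy frame with gaps in a numeric and a categorical column
df = pd.DataFrame({
    "score": [1.0, np.nan, 3.0, np.nan, 5.0],
    "city": ["Paris", None, "Lyon", "Paris", None],
})

df["score"] = df["score"].fillna(df["score"].mean())  # numeric: mean imputation
df["city"] = df["city"].fillna("Unknown")             # categorical: placeholder

print(df.isnull().sum().sum())  # 0 -- no missing values remain
```

Note that the mean of the observed scores (1, 3, 5) is 3.0, so both gaps are filled with 3.0.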
3. Fix Data Types
Convert columns to appropriate data types to avoid errors and improve performance.
df['date_column'] = pd.to_datetime(df['date_column'])  # Convert to datetime
df['numeric_column'] = pd.to_numeric(df['numeric_column'], errors='coerce')  # Convert to numeric
df['category_column'] = df['category_column'].astype('category')  # Convert to categorical
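The `errors='coerce'` option is what makes this step robust: unparseable entries become NaN instead of raising. A small sketch with a made-up `raw` column:

```python
import pandas as pd

df = pd.DataFrame({"raw": ["10", "20", "oops", "40"]})
df["num"] = pd.to_numeric(df["raw"], errors="coerce")  # "oops" becomes NaN

print(df["num"].isna().sum())  # 1 -- only the unparseable value
```

The resulting NaN can then be handled with the missing-value techniques from step 2.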
4. Outlier Detection and Handling
Q1 = df['numeric_column'].quantile(0.25)
Q3 = df['numeric_column'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
df = df[(df['numeric_column'] >= lower_bound) & (df['numeric_column'] <= upper_bound)]
Alternatively, filter by z-score (using the absolute value so that outliers on both tails are removed):
from scipy.stats import zscore
df = df[abs(zscore(df['numeric_column'])) < 3]
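To make the IQR rule concrete, here is a self-contained sketch on made-up data with one obvious outlier:

```python
import pandas as pd

df = pd.DataFrame({"numeric_column": [10, 12, 11, 13, 9, 1000]})

Q1 = df["numeric_column"].quantile(0.25)
Q3 = df["numeric_column"].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
df = df[(df["numeric_column"] >= lower) & (df["numeric_column"] <= upper)]

print(len(df))  # 5 -- the row with 1000 is dropped
```

The five clustered values survive because they fall inside [Q1 - 1.5 IQR, Q3 + 1.5 IQR], while 1000 falls far outside it.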
5. Remove Duplicates
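A minimal sketch of duplicate removal with `drop_duplicates` (the `id`/`val` columns are made up):

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 1, 2, 3, 3], "val": ["a", "a", "b", "c", "c"]})

df = df.drop_duplicates()  # drop fully identical rows
# df = df.drop_duplicates(subset=["id"], keep="first")  # or key on specific columns

print(len(df))  # 3 -- two exact-duplicate rows removed
```

Use `subset=` when rows should be considered duplicates based on a key column rather than every column.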
6. Feature Engineering
Scale or normalize features:
from sklearn.preprocessing import StandardScaler, MinMaxScaler
scaler = StandardScaler()  # or MinMaxScaler()
df['scaled_column'] = scaler.fit_transform(df[['numeric_column']])
Create new features:
df['age'] = 2024 - df['birth_year']
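A self-contained sketch of standard scaling on toy values, verifying the defining property (zero mean, unit variance):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"numeric_column": [10.0, 20.0, 30.0]})
scaler = StandardScaler()
df["scaled_column"] = scaler.fit_transform(df[["numeric_column"]])

print(round(df["scaled_column"].mean(), 6))  # 0.0 -- standardized to zero mean
```

`MinMaxScaler` would instead map the values into [0, 1]; which to use depends on the downstream model.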
7. Text Data Cleaning
import re

def clean_text(text):
    text = text.lower()
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Remove non-alphabet characters
    return text

df['text_column'] = df['text_column'].apply(clean_text)
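As a quick check, the cleaner behaves like this on a sample string:

```python
import re

def clean_text(text):
    text = text.lower()
    return re.sub(r'[^a-zA-Z\s]', '', text)  # keep only letters and whitespace

print(repr(clean_text("Hello, World! 123")))  # 'hello world '
```

Punctuation and digits are stripped but whitespace is preserved, so a follow-up `.strip()` or whitespace normalization may be useful depending on the task.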
Tokenization:
from nltk.tokenize import word_tokenize
df['tokens'] = df['text_column'].apply(word_tokenize)
8. Handle Imbalanced Data
from sklearn.utils import resample

# Upsample the minority class
minority_class = df[df['target'] == 1]
majority_class = df[df['target'] == 0]
minority_upsampled = resample(minority_class, replace=True,
                              n_samples=len(majority_class), random_state=42)
df = pd.concat([majority_class, minority_upsampled])
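A self-contained sketch of the upsampling step on a toy imbalanced target (6 zeros vs. 2 ones), showing that the classes end up balanced:

```python
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({"x": range(8), "target": [0, 0, 0, 0, 0, 0, 1, 1]})

minority = df[df["target"] == 1]
majority = df[df["target"] == 0]
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_up])

print(balanced["target"].value_counts().to_dict())  # {0: 6, 1: 6}
```

Because `replace=True`, minority rows are sampled with replacement, so the same row can appear multiple times; do this only on the training split, never before the train/test split.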
9. Validate and Save Cleaned Data
print(df.isnull().sum())
print(df.describe())
df.to_csv("cleaned_data.csv", index=False)
10. Automate and Reuse
Wrap these steps into functions or classes for reusability:
def clean_data(df):
    # Apply the cleaning steps above
    return df
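As an illustration, here is one way such a function might chain a few of the steps above; the column handling is generic and the sample data is made up:

```python
import pandas as pd
import numpy as np

def clean_data(df):
    """Sketch of a reusable pipeline: dedupe, then mean-impute numeric gaps."""
    df = df.drop_duplicates()
    num_cols = df.select_dtypes(include="number").columns
    df[num_cols] = df[num_cols].fillna(df[num_cols].mean())
    return df

raw = pd.DataFrame({"a": [1.0, np.nan, 1.0, 4.0], "b": [1, 2, 1, 3]})
cleaned = clean_data(raw)
print(len(cleaned), cleaned.isnull().sum().sum())  # 3 rows, 0 missing values
```

Keeping each step as its own small function makes the pipeline easy to test and to reorder per dataset.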
Free Python learning and practice opportunities:
Here’s a list of platforms and labs offering free Python learning and practice opportunities:
1. Google's Python Class
2. Kaggle
3. SoloLearn
4. HackerRank
5. Codewars
6. EdX
7. freeCodeCamp
8. Real Python
9. W3Schools Python Tutorial
10. GeeksforGeeks Python Programming
11. PythonAnywhere
12. Project Euler
13. PyBites
14. Python.org Tutorials