Identifying and Handling Outliers in Pandas: With Python Examples by Fidel V.
Identifying and handling outliers in pandas involves several steps: detecting the outliers, deciding how to handle them (remove, replace, or keep), and implementing the chosen strategy. Below, I'll provide a step-by-step guide, along with code examples using Python and pandas.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Create a sample DataFrame; the value 200 in column 'A' is the intended outlier
data = {'A': [1, 2, 3, 4, 5, 200],
        'B': [10, 20, 30, 40, 50, 60]}
df = pd.DataFrame(data)
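# Calculate the Z-score of every value, column by column
z_scores = (df - df.mean()) / df.std()
# Flag rows where any value lies more than `threshold` standard deviations
# from its column mean, in either direction (hence the absolute value)
threshold = 2
outliers = df[(z_scores.abs() > threshold).any(axis=1)]
print("Outliers detected:")
print(outliers)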
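With only six rows, the choice of threshold matters more than it might seem. For a sample of n values and the sample standard deviation (the pandas default, `ddof=1`), no Z-score can exceed (n - 1) / sqrt(n) in absolute value. The sketch below illustrates this on the values of column A; the variable names are illustrative and not part of the original code.

```python
import numpy as np
import pandas as pd

# Same values as column 'A' in the sample DataFrame
a = pd.Series([1, 2, 3, 4, 5, 200])

# Z-scores using the sample standard deviation (pandas default, ddof=1)
z = (a - a.mean()) / a.std()

# Upper bound on |z| for a sample of size n when ddof=1
n = len(a)
max_possible_z = (n - 1) / np.sqrt(n)

print(z.abs().max())     # largest |z| actually observed
print(max_possible_z)    # theoretical ceiling for n = 6
```

Both numbers come out only slightly above 2, so the commonly used threshold of 3 would flag nothing in this dataset. This is why the example uses a threshold of 2.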
# Replace only the outlying values with their column's median,
# keeping the original df intact for the "before" plot
median = df.median()
df_clean = df.mask(z_scores.abs() > threshold, median, axis=1)
print("DataFrame after handling outliers:")
print(df_clean)
The code below visualizes the data before and after outlier handling using boxplots.
# Boxplot before outlier handling
plt.figure(figsize=(8, 6))
plt.subplot(1, 2, 1)
df.boxplot()
plt.title("Before Outlier Handling")
# Boxplot after outlier handling
plt.subplot(1, 2, 2)
df_clean.boxplot()
plt.title("After Outlier Handling")
plt.tight_layout()
plt.show()
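Boxplots themselves mark outliers with an interquartile range (IQR) rule: by default, matplotlib extends the whiskers up to 1.5 times the IQR beyond the quartiles and draws anything further out as an individual point. The same rule can be applied directly to the DataFrame as an alternative to Z-scores. This is a minimal sketch, assuming the sample data from above; the names `lower`, `upper`, and `iqr_outliers` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 200],
                   'B': [10, 20, 30, 40, 50, 60]})

# Quartiles and interquartile range for each column
q1 = df.quantile(0.25)
q3 = df.quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Flag rows where any value falls outside its column's fences
is_outlier = (df < lower) | (df > upper)
iqr_outliers = df[is_outlier.any(axis=1)]
print(iqr_outliers)
```

Because quartiles are barely affected by a single extreme value, this rule tends to be less sensitive to sample size than the Z-score approach.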
Done! This code identifies outliers using Z-score and replaces them with the median. You can adjust the threshold and choose a different handling strategy based on your data and analysis requirements.
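As an illustration of other handling strategies, the sketch below shows two common alternatives to median replacement: removing the affected rows, and capping (clipping) values at a boundary. The choice of IQR fences as clipping bounds is an assumption made for this example, not something prescribed above, and the names `df_removed` and `df_clipped` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 200],
                   'B': [10, 20, 30, 40, 50, 60]})

# Same Z-score detection as in the main example
z_scores = (df - df.mean()) / df.std()
threshold = 2
is_outlier = z_scores.abs() > threshold

# Strategy 1: remove every row that contains an outlier
df_removed = df[~is_outlier.any(axis=1)]

# Strategy 2: cap values at per-column bounds
# (here 1.5 * IQR beyond the quartiles, an assumed choice)
q1 = df.quantile(0.25)
q3 = df.quantile(0.75)
iqr = q3 - q1
df_clipped = df.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr, axis=1)

print(df_removed)
print(df_clipped)
```

Removal discards the whole row, including the valid value in column B, while clipping keeps every row and only pulls the extreme value back to the boundary.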
Thank you for your attention and for following me.
Best regards,
Fidel Vetino
Solution Architect & Cybersecurity Analyst