Python Data Analysis: A Comprehensive Guide from Basics to Advanced
Introduction
Data analysis is one of the most crucial skills in today's data-driven world. Python, with its rich ecosystem of libraries, provides powerful tools to clean, manipulate, analyze, and visualize data effectively. This newsletter covers a structured approach to Python data analysis, from fundamental techniques to advanced methodologies.
1. Setting Up Your Environment
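Before diving into data analysis, you need a working Python environment with the libraries used in this guide installed: pandas, Matplotlib, Seaborn, SciPy, and scikit-learn.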
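As a quick sanity check, the sketch below reports which of these libraries are missing from the current environment. The `REQUIRED` list and the `missing_packages` helper are illustrative names, not part of any library; note that scikit-learn is imported as `sklearn`.

```python
import importlib.util

# Import names of the libraries used in this guide
# (scikit-learn is imported under the name "sklearn").
REQUIRED = ["pandas", "matplotlib", "seaborn", "scipy", "sklearn"]


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are available.")
```

Anything reported as missing can usually be installed with pip, for example `pip install pandas matplotlib seaborn scipy scikit-learn`.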
2. Data Handling with Pandas
Loading Data
import pandas as pd
# Load CSV file
df = pd.read_csv("data.csv")
# Load Excel file
df = pd.read_excel("data.xlsx")
# Load JSON file
df = pd.read_json("data.json")
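If you do not have a data file at hand, the loaders can be tried on in-memory text. The following is a minimal sketch that assumes a small, made-up CSV (the column names and values are hypothetical) and reads it through `io.StringIO`, which `read_csv` accepts like a file.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for data.csv
csv_text = """name,age,department,salary
Alice,34,Engineering,85000
Bob,28,Marketing,62000
Charlie,45,Engineering,
"""

# read_csv accepts any file-like object, so StringIO behaves like an open file
df_demo = pd.read_csv(io.StringIO(csv_text))

print(df_demo.shape)   # (rows, columns)
print(df_demo.dtypes)  # inferred column types
```

The empty salary field in the last row is read as a missing value (NaN), which is why that column is inferred as a floating-point type.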
Exploring Data
print(df.head()) # First five rows
df.info() # Column types and non-null counts (info() prints directly and returns None)
print(df.describe()) # Summary statistics
Cleaning Data
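# Handling missing values (pick one approach; after dropna() there is nothing left to fill)
df.dropna(inplace=True) # Option A: remove rows that contain missing values
df.fillna(df.mean(numeric_only=True), inplace=True) # Option B: fill numeric gaps with each column's mean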
# Removing duplicates
df.drop_duplicates(inplace=True)
# Renaming columns
df.rename(columns={"old_name": "new_name"}, inplace=True)
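To see how dropping and filling differ, here is a small sketch on a made-up table (the `raw` data is hypothetical). It returns new DataFrames instead of using `inplace=True`, so the two approaches can be compared side by side.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps and one duplicated row
raw = pd.DataFrame({
    "name": ["Alice", "Bob", "Bob", "Dana"],
    "age": [34.0, 28.0, 28.0, np.nan],
    "salary": [85000.0, np.nan, np.nan, 58000.0],
})

# Option A: drop every row that has at least one missing value
dropped = raw.dropna()

# Option B: keep all rows and fill numeric gaps with the column mean
filled = raw.fillna(raw.mean(numeric_only=True))

# Remove exact duplicate rows from the filled table
deduped = filled.drop_duplicates()

print(len(raw), len(dropped), len(deduped))
```

Dropping is simple but can discard most of a small dataset, while mean filling keeps every row at the cost of shrinking the column's variance. Which trade-off is acceptable depends on the analysis.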
3. Data Manipulation
Filtering and Sorting
# Filtering data
df_filtered = df[df["age"] > 25]
# Sorting data
df_sorted = df.sort_values(by="salary", ascending=False)
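Filters are often combined. The sketch below, using a made-up `people` table, shows one way to apply two conditions at once and then sort the result. Each condition needs its own parentheses, and pandas expects `&` and `|` rather than the Python keywords `and` and `or`.

```python
import pandas as pd

# Hypothetical data for illustration
people = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Dana"],
    "age": [34, 22, 45, 29],
    "salary": [85000, 40000, 95000, 52000],
})

# Keep rows where both conditions hold
selected = people[(people["age"] > 25) & (people["salary"] > 50000)]

# Sort by salary, highest first, and renumber the rows
ranked = selected.sort_values(by="salary", ascending=False).reset_index(drop=True)

print(ranked)
```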
Grouping and Aggregation
# Group by department and calculate mean salary
df_grouped = df.groupby("department")["salary"].mean()
# Multiple aggregations
df_agg = df.groupby("department").agg({"salary": ["mean", "median"], "age": "max"})
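The dictionary form of `agg` produces multi-level column names, which can be awkward to work with. As an alternative sketch, named aggregation gives flat, readable columns. The `staff` table and the output column names are made up for illustration.

```python
import pandas as pd

# Hypothetical data for illustration
staff = pd.DataFrame({
    "department": ["Eng", "Eng", "Ops", "Ops", "Ops"],
    "salary": [90000, 70000, 50000, 60000, 55000],
    "age": [30, 40, 25, 35, 45],
})

# Named aggregation: output_column=(input_column, function)
summary = staff.groupby("department").agg(
    mean_salary=("salary", "mean"),
    median_salary=("salary", "median"),
    max_age=("age", "max"),
)

print(summary)
```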
Merging and Joining
df1 = pd.DataFrame({"ID": [1, 2, 3], "Name": ["Alice", "Bob", "Charlie"]})
df2 = pd.DataFrame({"ID": [1, 2, 3], "Salary": [50000, 60000, 70000]})
# Merge on ID column
df_merged = pd.merge(df1, df2, on="ID", how="inner")
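In the example above both tables contain the same IDs, so every join type gives the same result. The sketch below uses made-up tables whose IDs only partly overlap to show how `how="inner"` and `how="outer"` differ; `indicator=True` adds a `_merge` column recording where each row came from.

```python
import pandas as pd

# Hypothetical tables with partly overlapping IDs
people = pd.DataFrame({"ID": [1, 2, 3], "Name": ["Alice", "Bob", "Charlie"]})
pay = pd.DataFrame({"ID": [2, 3, 4], "Salary": [60000, 70000, 80000]})

# Inner join: only IDs present in both tables
inner = pd.merge(people, pay, on="ID", how="inner")

# Outer join: all IDs from either table, with missing values where there is no match
outer = pd.merge(people, pay, on="ID", how="outer", indicator=True)

print(inner)
print(outer)
```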
4. Data Visualization
Matplotlib Basics
import matplotlib.pyplot as plt
# Line plot
plt.plot(df["year"], df["sales"], marker="o")
plt.xlabel("Year")
plt.ylabel("Sales")
plt.title("Yearly Sales Trend")
plt.show()
Seaborn for Advanced Visuals
import seaborn as sns
# Histogram
sns.histplot(df["age"], bins=20, kde=True)
plt.show()
# Boxplot
sns.boxplot(x="department", y="salary", data=df)
plt.show()
5. Statistical Analysis
Correlation and Covariance
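# Correlation matrix (numeric columns only; pandas 2.0+ raises an error if text columns are included)
print(df.corr(numeric_only=True))
# Covariance matrix (numeric columns only)
print(df.cov(numeric_only=True))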
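For a concrete feel of what these matrices contain, here is a small sketch on made-up data in which `score` is exactly twice `hours`, so the two columns are perfectly correlated. The text column `label` is there to show that `numeric_only=True` skips it.

```python
import pandas as pd

# Hypothetical data: score is exactly 2 * hours
data = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5],
    "score": [2, 4, 6, 8, 10],
    "label": list("abcde"),
})

corr = data.corr(numeric_only=True)  # Pearson correlation by default
cov = data.cov(numeric_only=True)    # sample covariance

print(corr)
print(cov)
```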
Hypothesis Testing
from scipy import stats
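# Two-sample t-test: do mean salaries differ between the two groups?
male_salary = df[df["gender"] == "Male"]["salary"]
female_salary = df[df["gender"] == "Female"]["salary"]
t_stat, p_value = stats.ttest_ind(male_salary, female_salary)
print("t-statistic:", t_stat, "p-value:", p_value)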
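By default `ttest_ind` assumes both groups have equal variance. When that is doubtful, Welch's t-test (`equal_var=False`) is a common alternative. The sketch below runs it on synthetic samples; the group means, spreads, and sizes are arbitrary assumptions chosen only for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic samples with different spreads (values are arbitrary)
rng = np.random.default_rng(0)
group_a = rng.normal(loc=50000, scale=5000, size=200)
group_b = rng.normal(loc=52000, scale=8000, size=200)

# Welch's t-test does not assume equal variances
result = stats.ttest_ind(group_a, group_b, equal_var=False)

print(f"t = {result.statistic:.3f}, p = {result.pvalue:.4f}")
```

A small p-value suggests the difference in means is unlikely to be due to chance alone, but it says nothing about how large or practically important that difference is.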
6. Machine Learning Basics with Scikit-Learn
Data Preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
X = df[["age", "salary"]]
y = df["purchase"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit the scaler on the training data only, then apply the same scaling to the test data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
Training a Model
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
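The preprocessing and training steps can also be bundled into a single pipeline, so the scaler is always fitted on the training data only. The following is a minimal sketch on synthetic data from `make_classification`, which stands in for the age, salary, and purchase columns; it is not the dataset used above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-feature classification problem (stand-in data)
X, y = make_classification(
    n_samples=300, n_features=2, n_informative=2, n_redundant=0, random_state=42
)

# stratify=y keeps the class balance similar in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The pipeline fits the scaler and the model together on the training data
clf = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=100, random_state=42),
)
clf.fit(X_train, y_train)

acc = accuracy_score(y_test, clf.predict(X_test))
print("Accuracy:", acc)
```

Tree-based models such as random forests are largely insensitive to feature scaling, so the scaler matters more if you later swap in a model like logistic regression or k-nearest neighbors.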
7. Time Series Analysis
import pandas as pd
import matplotlib.pyplot as plt
# Parse the date column and use it as the index
df["date"] = pd.to_datetime(df["date"])
df.set_index("date", inplace=True)
df["sales"].plot()
plt.show()
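Once the index is a datetime, pandas can aggregate by calendar period and compute moving averages. The sketch below assumes a made-up daily series (`ts`) covering January and February 2024, then builds monthly totals and a 7-day rolling mean.

```python
import pandas as pd

# Hypothetical daily sales: 60 consecutive days starting 2024-01-01
dates = pd.date_range("2024-01-01", periods=60, freq="D")
ts = pd.DataFrame({"date": dates, "sales": range(60)}).set_index("date")

# Monthly totals, labeled by the first day of each month
monthly = ts["sales"].resample("MS").sum()

# 7-day moving average; the first six values are NaN until the window fills
rolling = ts["sales"].rolling(window=7).mean()

print(monthly)
print(rolling.head(10))
```

Rolling averages smooth out day-to-day noise, which often makes an underlying trend easier to see when plotted next to the raw series.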
Conclusion
Python provides an extensive toolkit for data analysis, from data wrangling and visualization to machine learning. Mastering these techniques can enhance your ability to extract insights and make data-driven decisions.
Stay tuned for more deep dives into Python data science topics!