Real-World Data Merging: Inner Joins & Aggregations in Pandas
Mastering Data Integration with CSV Files in Pandas
Merging datasets is a key skill in data analysis and engineering. In real-world scenarios, data often comes from multiple sources, and understanding how to integrate and analyze this information is crucial.
In this article, we’ll cover:
By the end, you’ll be able to confidently integrate and process large datasets, making your analysis more effective and actionable.
Why This Module is Important
Merging and aggregating data enables:
In this module, we introduce practical dataset merging techniques, helping you work with real-world CSV files efficiently.
Merging CSV Files Using Inner Joins
The Real-World Scenario
We are working with two datasets:
1?? Sales Representatives Data (sales_reps_data.csv)
2?? Toyota Sales Data (toyota_sales_data.csv)
The goal is to merge these datasets to match sales reps with their sales data and compute aggregated sales totals.
You can download the datasets from the following GitHub link: GitHub Datasets
Step-by-Step Implementation in Pandas
1?? Loading CSV Files as Pandas DataFrames
We are working with two datasets:
import pandas as pd
# Load the CSV files
sales_reps_data = pd.read_csv("data/car_sales/sales_reps_data.csv")
toyota_sales_data = pd.read_csv("data/car_sales/toyota_sales_data.csv")
# Display the column names of the sales representatives dataset
sales_reps_columns = sales_reps_data.columns
print("Sales Representatives Data Columns:", sales_reps_columns)
# Display the column names of the Toyota sales dataset
toyota_sales_columns = toyota_sales_data.columns
print("Toyota Sales Data Columns:", toyota_sales_columns)
2?? Performing an Inner Join
We match each sales record with the corresponding sales rep by joining on:
# Merge DataFrames
merged_data = pd.merge(
sales_reps_data,
toyota_sales_data,
left_on="rep_id",
right_on="sale_rep_id",
how="inner"
)
print(merged_data.shape) # Check dimensions
print(merged_data)
?? Observations:
3?? Aggregating Sales by Sales Representatives
After merging, let’s calculate the total sales amount per sales rep:
# Group the merged data by sales representative details and sum the sales amount
sales_by_rep = merged_data. \
groupby(["rep_id", "first_name", "last_name", "region"])["sale_amount"]. \
sum()
print(sales_by_rep)
# Group the merged data again, summing the sales amount, but this time reset the index
sales_by_rep = merged_data. \
groupby(["rep_id", "first_name", "last_name", "region"])["sale_amount"]. \
sum(). \
reset_index()
print(sales_by_rep)
4?? Adding a Commission Calculation
Now, let's calculate the commission earned by each sales rep using the commission_percentage field.
merged_data["commission_earned"] = merged_data["sale_amount"] * (merged_data["commission_pct"].fillna(0))
print(merged_data[["rep_id", "first_name", "last_name", "sale_amount", "commission_pct", "commission_earned"]]
?? Handling Missing Data:
? Missing commission percentages are set to 0 using .fillna(0).
? The new column calculates the commission earned per sale.
领英推荐
5?? Handling Missing Data in Sales Records
Missing data can cause problems in calculations. We should fill missing values:
merged_data["commission_pct"].fillna(0, inplace=True)
merged_data["sale_amount"].fillna(0, inplace=True)
This ensures data integrity when performing calculations.
6?? Full Data Processing Pipeline
Here’s a complete example of how the process works from start to finish:
# Load CSV files
sales_reps = pd.read_csv("data/car_sales/sales_reps_data.csv")
toyota_sales = pd.read_csv("data/car_sales/toyota_sales_data.csv")
# Merge DataFrames using Inner Join
merged_data = pd.merge(
sales_reps_data,
toyota_sales_data,
left_on="rep_id",
right_on="sale_rep_id",
how="inner"
)
# Fill Missing Values
merged_data["commission_pct"].fillna(0, inplace=True)
# Calculate Commission Earned
merged_data["commission_earned"] = round(
merged_data["sale_amount"] * merged_data["commission_pct"], 2
)
# Group Sales by Rep
commission_earned_by_rep = merged_data. \
groupby(["rep_id", "email"])["commission_earned"]. \
sum(). \
reset_index()
# Display Results
print(commission_earned_by_rep)
merged_data.isnull().sum()
This script fully integrates & aggregates sales data into a structured report!
Best Practices for Data Merging & Aggregation
Always inspect data after joins
Group and aggregate data carefully
Handle missing data
Practice Assignment
?? Want to practice? Attempt the Real-World Data Joining and Aggregation in Pandas Assignment ?? Click Here.
?? Need help? Leave a comment, and we’ll assist you!
What’s Next?
In the next lecture, we will explore Data Transformations in Pandas. Data transformation is an essential step in the data analysis workflow, helping to clean, standardize, and modify data to make it more useful for analysis.
Click ?? to Enroll in the Python for Beginners: Learn Python with Hands-on Projects. It only costs $10 and you can reach out to us for $10 Coupon.
Conclusion
In this guide, we covered:
Conclusion: Join or Merge Operations Using Pandas
Congratulations on completing the module on working with multiple DataFrames, specifically focusing on join and merge operations using Pandas. You’ve taken a crucial step toward mastering data integration and transformation techniques, which are essential for real-world data engineering and analytics.
Key Takeaways:
Throughout this module, we explored various techniques for combining datasets efficiently:
Keep in Mind:
Working with multiple DataFrames is a core skill in data engineering and analytics. These techniques are directly applicable in key areas like:
By mastering these skills, you will be better equipped to handle complex data workflows and support data-driven decision-making.
Thank you for following along! I hope you found it valuable and engaging. See you in the next module, where we’ll continue building your expertise in Pandas.
?? Engage With Us!
? Authored by Siva Kalyan Geddada , Abhinav Sai Penmetsa
?? Share this article with anyone interested in data engineering, Python, or data analysis. ?? Have questions or need help? Comment below! Let's discuss.
?? Follow us for more hands-on data science tutorials!