Efficient Data Processing with Pandas: Chaining Transformations
In this article, we'll explore chaining transformations in Pandas, a powerful technique for cleaning, transforming, and analyzing data efficiently. By chaining multiple operations, you can improve:
- Efficiency – Avoid unnecessary intermediate variables.
- Readability – Write concise and intuitive code.
- Scalability – Easily extend workflows for complex tasks.
We’ll demonstrate real-world examples using the Toyota Sales Dataset.
Why Use Chaining for Data Transformations?
Instead of writing multiple steps separately, chaining allows us to combine multiple operations into a single workflow.
Benefits:
- Reduces temporary variables, improving memory efficiency.
- Enhances readability, making code more intuitive.
- Facilitates debugging, since each transformation builds on the previous one.
Let’s see how it works in action!
You can download the datasets from the following GitHub link: GitHub Datasets
Step 1: Load the Dataset
import pandas as pd
# Load dataset
toyota_sales_data = pd.read_csv("data/car_sales/Toyota_sales_data.csv")
# Preview dataset
print(toyota_sales_data.head())
Step 2: Traditional Approach vs. Chained Approach
Let's filter completed sales and calculate commission.
Traditional (Step-by-Step) Approach
# Filter completed sales
completed_sales = toyota_sales_data[toyota_sales_data["sale_status"] == "Completed"]
# Compute commission earned
completed_sales["commission_earned"] = completed_sales["sale_amount"] * completed_sales["commission_pct"]
# Display results
print(completed_sales.head())
? Warning: This generates a Pandas SettingWithCopyWarning, indicating potential issues with modifying a slice of the DataFrame.
Step 3: Chaining Transformations
Using Pandas chaining, we can perform the same operations in one seamless workflow.
Chained Approach
sales_with_commission = (
toyota_sales_data
.query("sale_status == 'Completed'") # Filter completed sales
.assign(commission_earned=lambda x: x["sale_amount"] * x["commission_pct"]) # Compute commission
)
print(sales_with_commission.head())
Step 4: Handling Missing Values
What if commission_percentage has missing values (NaN)? We can fill missing values with 0 while chaining.
sales_with_commission = (
toyota_sales_data
.query("sale_status == 'Completed'")
.assign(
commission_pct=lambda x: x["commission_pct"].fillna(0), # Replace NaN with 0
commission_earned=lambda x: x["sale_amount"] * x["commission_pct"] # Compute commission
)
)
print(sales_with_commission.head())
? Handling missing values ensures data integrity.
Step 5: Using round() for Formatting
Let’s round commission values to two decimal places.
sales_with_commission = (
toyota_sales_data
.query("sale_status == 'Completed'")
.assign(
commission_pct=lambda x: x["commission_pct"].fillna(0),
commission_earned=lambda x: round(x["sale_amount"] * x["commission_pct"], 2) # Round to 2 decimals
)
)
print(sales_with_commission.head())
Step 6: Merging Multiple DataFrames with Chaining
Let’s merge Sales Data with Sales Rep Data, then filter & compute commission in a single workflow.
领英推è
sales_reps_data = pd.read_csv("data/car_sales/sales_reps_data.csv")
# Chained transformations: Merge, Filter, Compute Commission
sales_with_commission = (
pd.merge(sales_reps_data, toyota_sales_data, left_on="rep_id", right_on="sale_rep_id", how="inner")
.query("sale_status == 'Completed'")
.assign(
commission_earned=lambda x: round(x["sale_amount"] * x["commission_pct"].fillna(0), 2)
)
)
print(sales_with_commission.head())
Step 7: Debugging Chained Operations
Debugging a chained workflow can be done step-by-step:
# Step 1: Merge Data
merged_data = pd.merge(sales_reps_data, toyota_sales_data, left_on="rep_id", right_on="sale_rep_id", how="inner")
# Step 2: Filter Completed Sales
filtered_data = merged_data.query("sale_status == 'Completed'")
# Step 3: Compute Commission
final_data = filtered_data.assign(
commission_earned=lambda x: round(x["sale_amount"] * x["commission_pct"].fillna(0), 2)
)
# Preview Data
print(final_data.head())
? Break transformations into steps before chaining them together.
Best Practices for Chaining
Ensure Readability
- Use line breaks and parentheses for clarity.
- Use comments to explain transformations.
Use .assign() for New Columns
- Avoid modifying original DataFrame slices.
- .assign() prevents SettingWithCopyWarning.
Use .query() for Filtering
- More readable than df[df["column"] == "value"].
Use .fillna() for Handling Missing Data
- Prevents NaN values from breaking calculations.
Break Down Complex Steps for Debugging
- Test transformations individually before chaining.
Practice Assignment
?? Want to practice? Attempt the Chaining Transformations in Pandas Assignment
?? Click here.
?? Need help? Leave a comment, and we’ll assist you!
What’s Next?
In the next module, we’ll explore advanced chaining techniques that allow you to write cleaner, more efficient, and readable Pandas code by linking multiple operations together seamlessly.
Click ?? to Enroll in the Python for Beginners: Learn Python with Hands-on Projects. It only costs $10 and you can reach out to us for $10 Coupon.
Conclusion
In this module, we explored:
- How to use Pandas chaining to streamline data transformations.
- How to merge, filter, and compute new columns efficiently.
- How to debug complex workflows before chaining.
- Best practices for readability and performance.
?? Engage With Us!
? Authored by Siva Kalyan Geddada , Abhinav Sai Penmetsa
?? Share this article with anyone interested in data engineering, Python, or data analysis. ?? Have questions or need help? Comment below! Let's discuss.
?? Follow us for more hands-on data science tutorials!