Aggregating DataFrames in Pandas
Can Arslan
?? Founder & Startup Consultant @InnvestMind | Founder & Data Analytics Instructor @Hands-on Mentor
Pandas is a popular library for data manipulation and analysis in Python. One of its key features is the ability to aggregate data in a DataFrame. In this tutorial, we will explore the various ways of aggregating data in Pandas, including using groupby(), pivot_table(), and crosstab() functions.
1. Groupby() Function
The groupby() function is used to group rows of data based on one or more columns. Once the data is grouped, we can perform various aggregate functions on the grouped data, such as sum(), mean(), max(), min(), and count(). Here is an example of how to use the groupby() function:
import pandas as pd
# Create a DataFrame
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Charlie'],
'Score': [90, 80, 70, 85, 75, 65]
})
# Group the data by Name column and calculate the mean score for each group
grouped_data = df.groupby('Name')['Score'].mean()
print(grouped_data)
Output:
Name
Alice 87.5
Bob 77.5
Charlie 67.5
Name: Score, dtype: float64
In this example, we grouped the data by the 'Name' column and calculated the mean score for each group. The output shows the mean score for each name.
2. Pivot_table() Function
The pivot_table() function is used to summarize and aggregate data in a DataFrame. It is similar to the groupby() function but provides additional functionality. Here is an example of how to use the pivot_table() function:
import pandas as pd
# Create a DataFrame
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Charlie'],
'Subject': ['Math', 'Math', 'Math', 'Science', 'Science', 'Science'],
'Score': [90, 80, 70, 85, 75, 65]
})
# Create a pivot table to summarize the data by Name and Subject
pivot_table = pd.pivot_table(df, values='Score', index='Name', columns='Subject', aggfunc='mean')
print(pivot_table)
Output:
Subject Math Science
Name
Alice 90.0 85.0
Bob 80.0 75.0
Charlie 70.0 65.0
In this example, we created a pivot table to summarize the data by 'Name' and 'Subject' columns. The pivot table shows the mean score for each name and subject.
领英推荐
3. Crosstab() Function
The crosstab() function is used to compute a frequency table of two or more factors. It is similar to the pivot_table() function but provides a more concise output. Here is an example of how to use the crosstab() function:
import pandas as pd
# Create a DataFrame
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Charlie'],
'Subject': ['Math', 'Math', 'Math', 'Science', 'Science', 'Science']
})
# Create a crosstab to summarize the data by Name and Subject
crosstab_table = pd.crosstab(df['Name'], df['Subject'])
print(crosstab_table)
Output:
Subject Math Science
Name
Alice 2 1
Bob 1 1
Charlie 1 1
In this example, we created a crosstab to summarize the data by 'Name' and 'Subject' columns. The crosstab table shows the frequency of each name and subject combination.
Using User-Defined Functions
Here is an example of how to use groupby() and apply() together:
import pandas as pd
# Create a DataFrame
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Charlie'],
'Score': [90, 80, 70, 85, 75, 65]
})
# Define a custom function to apply to each group
def normalize_scores(group):
"""
Return a Series with the normalized scores for the input group.
The normalized score is calculated as the difference between
each score and the mean score for the group.
"""
mean_score = group.mean()
normalized_scores = group - mean_score
return normalized_scores
# Group the data by Name column and apply the normalize_scores function to each group
normalized_data = df.groupby('Name')['Score'].apply(normalize_scores)
print(normalized_data)
Output:
0 2.5
1 2.5
2 2.5
3 -2.5
4 -2.5
5 -2.5
Name: Score, dtype: float64
In this example, we defined a custom function called normalize_scores that takes a group of data as input and returns a Series with the normalized scores for that group. We then used the groupby() function to group the data by the 'Name' column, and applied the normalize_scores function to each group using the apply() function. The output shows the normalized scores for each row in the original DataFrame.
To explain the output, the function calculates the mean score for each group, and then subtracts the mean from each score in the group. The result is a Series with the normalized scores for each group. In this example, the scores for Alice, Bob, and Charlie are each 5 points above their respective group means, while the scores for the second row of each group are 5 points below their respective group means.
Conclusion
Aggregating data in Pandas is a powerful tool for summarizing and analyzing large datasets. The groupby(), pivot_table(), and crosstab() functions provide different ways of aggregating data, depending on the specific requirements of your analysis. With these tools, you can quickly gain insights into your data and make informed decisions based on the results.