Data Manipulation with Pandas
Ways of Manipulating Data with Pandas in Python with Examples
Pandas is a popular data manipulation library for Python. It provides data structures for efficiently storing and manipulating large datasets. In this blog post, we will introduce some of the most common ways of manipulating data with Pandas, along with examples.
Subset Selection
Subset selection is the process of selecting a subset of rows or columns from a DataFrame based on certain criteria. Here are some examples:
Selecting a Single Column
To select a single column from a DataFrame, you can use the column name as an index. For example:
import pandas as pd
data = {'name': ['John', 'Mary', 'Peter', 'Jeff', 'Lisa'],
'age': [23, 19, 42, 31, 24],
'country': ['USA', 'Canada', 'Australia', 'USA', 'Canada']}
df = pd.DataFrame(data)
print(df['name'])
Output:
0 John
1 Mary
2 Peter
3 Jeff
4 Lisa
Name: name, dtype: object
Selecting Multiple Columns
To select multiple columns from a DataFrame, you can pass a list of column names as an index. For example:
import pandas as pd
data = {'name': ['John', 'Mary', 'Peter', 'Jeff', 'Lisa'],
'age': [23, 19, 42, 31, 24],
'country': ['USA', 'Canada', 'Australia', 'USA', 'Canada']}
df = pd.DataFrame(data)
print(df[['name', 'age']])
Output:
name age
0 John 23
1 Mary 19
2 Peter 42
3 Jeff 31
4 Lisa 24
Selecting Rows Based on a Condition
To select rows from a DataFrame based on a condition, you can use boolean indexing. For example, to select all rows where the age is greater than 30:
import pandas as pd
data = {'name': ['John', 'Mary', 'Peter', 'Jeff', 'Lisa'],
'age': [23, 19, 42, 31, 24],
'country': ['USA', 'Canada', 'Australia', 'USA', 'Canada']}
df = pd.DataFrame(data)
print(df[df['age'] > 30])
Output:
name age country
2 Peter 42 Australia
3 Jeff 31 USA
Loc and Iloc in Pandas: Explained with Examples
Two of the most commonly used methods for data selection and manipulation in Pandas are?loc?and?iloc. Let's introduce these methods and provide examples of how they can be used.
Loc
loc?is a label-based data selection method in Pandas. It allows you to select rows and columns from a DataFrame based on their labels. Here are some examples:
Selecting a Single Row
To select a single row based on its label, you can use the row label as an index. For example:
import pandas as pd
data = {'name': ['John', 'Mary', 'Peter', 'Jeff', 'Lisa'],
'age': [23, 19, 42, 31, 24],
'country': ['USA', 'Canada', 'Australia', 'USA', 'Canada']}
df = pd.DataFrame(data, index=['a', 'b', 'c', 'd', 'e'])
print(df.loc['a'])
Output:
name John
age 23
country USA
Name: a, dtype: object
Selecting Multiple Rows
To select multiple rows based on their labels, you can pass a list of row labels as an index. For example:
import pandas as pd
data = {'name': ['John', 'Mary', 'Peter', 'Jeff', 'Lisa'],
'age': [23, 19, 42, 31, 24],
'country': ['USA', 'Canada', 'Australia', 'USA', 'Canada']}
df = pd.DataFrame(data, index=['a', 'b', 'c', 'd', 'e'])
print(df.loc[['a', 'c', 'e']])
Output:
name age country
a John 23 USA
c Peter 42 Australia
e Lisa 24 Canada
Selecting a Single Column
To select a single column based on its label, you can use the column name as an index. For example:
import pandas as pd
data = {'name': ['John', 'Mary', 'Peter', 'Jeff', 'Lisa'],
'age': [23, 19, 42, 31, 24],
'country': ['USA', 'Canada', 'Australia', 'USA', 'Canada']}
df = pd.DataFrame(data, index=['a', 'b', 'c', 'd', 'e'])
print(df.loc[:, 'name'])
Output:
a John
b Mary
c Peter
d Jeff
e Lisa
Name: name, dtype: object
Selecting Multiple Columns
To select multiple columns based on their labels, you can pass a list of column names as an index. For example:
import pandas as pd
data = {'name': ['John', 'Mary', 'Peter', 'Jeff', 'Lisa'],
'age': [23, 19, 42, 31, 24],
'country': ['USA', 'Canada', 'Australia', 'USA', 'Canada']}
df = pd.DataFrame(data, index=['a', 'b', 'c', 'd', 'e'])
print(df.loc[:, ['name', 'age']])
Output:
name age
a John 23
b Mary 19
c Peter 42
d Jeff 31
e Lisa 24
Selecting Rows and Columns
To select both rows and columns based on their labels, you can pass both row and column labels as indexes. For example:
import pandas as pd
data = {'name': ['John', 'Mary', 'Peter', 'Jeff', 'Lisa'],
'age': [23, 19, 42, 31, 24],
'country': ['USA', 'Canada', 'Australia', 'USA', 'Canada']}
df = pd.DataFrame(data, index=['a', 'b', 'c', 'd', 'e'])
print(df.loc[['a', 'c', 'e'], ['name', 'age']])
Output:
name age
a John 23
c Peter 42
e Lisa 24
Iloc
iloc?is an integer-based data selection method in Pandas. It allows you to select rows and columns from a DataFrame based on their integer positions. Here are some examples:
Selecting a Single Row
To select a single row based on its integer position, you can use the row index as an index. For example:
import pandas as pd
data = {'name': ['John', 'Mary', 'Peter', 'Jeff', 'Lisa'],
'age': [23, 19, 42, 31, 24],
'country': ['USA', 'Canada', 'Australia', 'USA', 'Canada']}
df = pd.DataFrame(data)
print(df.iloc[0])
Output:
name John
age 23
country USA
Name: 0, dtype: object
Selecting Multiple Rows
To select multiple rows based on their integer positions, you can pass a list of row indexes as an index. For example:
import pandas as pd
data = {'name': ['John', 'Mary', 'Peter', 'Jeff', 'Lisa'],
'age': [23, 19, 42, 31, 24],
'country': ['USA', 'Canada', 'Australia', 'USA', 'Canada']}
df = pd.DataFrame(data)
print(df.iloc[[0, 2, 4]])
Output:
name age country
0 John 23 USA
2 Peter 42 Australia
4 Lisa 24 Canada
Selecting a Single Column
To select a single column based on its integer position, you can use the column index as an index. For example:
import pandas as pd
data = {'name': ['John', 'Mary', 'Peter', 'Jeff', 'Lisa'],
'age': [23, 19, 42, 31, 24],
'country': ['USA', 'Canada', 'Australia', 'USA', 'Canada']}
df = pd.DataFrame(data)
print(df.iloc[:, 1])
Groupby and Data Aggregation in Pandas
One of the most important functionalities of Pandas is its ability to group data and perform data aggregation. In this article, Let's 's introduce the?groupby?method and provide examples of how it can be used for data aggregation.
领英推荐
Groupby
The?groupby?method in Pandas allows you to group rows of a DataFrame based on the values in one or more columns. Once the rows are grouped, you can perform various operations on the groups, such as counting, summing, averaging, and more.
Here’s an example of how to use?groupby?to group rows of a DataFrame based on the values in one column:
import pandas as pd
data = {'name': ['John', 'Mary', 'Peter', 'Jeff', 'Lisa', 'Ann', 'Bob', 'Alice'],
'age': [23, 19, 42, 31, 24, 26, 29, 35],
'country': ['USA', 'Canada', 'Australia', 'USA', 'Canada', 'USA', 'USA', 'Australia'],
'gender': ['M', 'F', 'M', 'M', 'F', 'F', 'M', 'F']}
df = pd.DataFrame(data)
grouped = df.groupby('country')
In this example, we create a DataFrame with four columns:?name,?age,?country, and?gender. Then, we use the?groupby?method to group the rows of the DataFrame based on the values in the?country?column. The result is a?DataFrameGroupBy?object that contains the groups.
Data Aggregation
Once you have grouped the rows of a DataFrame, you can perform various operations on the groups. These operations are collectively known as data aggregation. Here are some examples of how to perform data aggregation using the?groupby?method:
Counting Rows in Each Group
To count the number of rows in each group, you can use the?size?method:
grouped.size()
Output:
country
Australia 2
Canada 2
USA 4
dtype: int64
This returns a Pandas?Series?object that contains the number of rows in each group.
Summing Values in Each Group
To sum the values in each group, you can use the?sum?method:
grouped.sum()
Output:
age
country
Australia 77
Canada 43
USA 107
This returns a DataFrame that contains the sum of the?age?column for each group.
Averaging Values in Each Group
To calculate the average of the values in each group, you can use the?mean?method:
grouped.mean()
Output:
age
country
Australia 38.500000
Canada 21.500000
USA 26.750000
This returns a DataFrame that contains the average of the?age?column for each group.
Maximum and Minimum Values in Each Group
To find the maximum and minimum values in each group, you can use the?max?and?min?methods:
grouped.max()
Output:
age gender name
country
Australia 42 M Peter
Canada 24 F Mary
USA 35 M Robert
grouped.min()
Output:
age gender name
country
Australia 35 F Alice
Canada 19 F Lisa
USA 23 F Ann
These methods return DataFrames that contain the maximum and minimum values for each column in each group.
Applying Custom Functions
You can also apply custom functions to the groups using the?apply?method. For example, to find the range of ages in each group, you can define a custom function that calculates the range and apply it to the?age?column:
def age_range(group):
return group.max() - group.min()
grouped['age'].apply(age_range)
Output:
country
Australia 7
Canada 5
USA 12
Name: age, dtype: int64
This returns a Pandas?Series?object that contains the age range for each group.
Pivot Tables in Pandas: A Comprehensive Guide with Examples
One of the most powerful functionalities of Pandas is its ability to create pivot tables. In this article, let's introduce pivot tables and provide examples of how they can be used for data analysis.
What are Pivot Tables?
A pivot table is a table that summarizes and aggregates data from another table. It allows you to group, filter, and aggregate data based on different criteria. Pivot tables are especially useful for analyzing large datasets with many variables.
Here’s an example of how to create a pivot table in Pandas:
import pandas as pd
data = {'name': ['John', 'Mary', 'Peter', 'Jeff', 'Lisa', 'Ann', 'Bob', 'Alice'],
'age': [23, 19, 42, 31, 24, 26, 29, 35],
'country': ['USA', 'Canada', 'Australia', 'USA', 'Canada', 'USA', 'USA', 'Australia'],
'gender': ['M', 'F', 'M', 'M', 'F', 'F', 'M', 'F']}
df = pd.DataFrame(data)
pivot = pd.pivot_table(df, values='age', index='country', columns='gender', aggfunc='mean')
print(pivot)
In this example, we create a DataFrame with four columns:?name,?age,?country, and?gender. Then, we create a pivot table using the?pd.pivot_table()?method. We specify the?values?to aggregate (age), the?index?column (country), the?columns?to group by (gender), and the?aggfunc?to use for aggregation (mean). The result is a pivot table that shows the average age of males and females in each country.
Pivot Table Examples
Pivot Table with Multiple Aggregations
You can also create pivot tables with multiple aggregations. For example, to show the total and average age of males and females in each country:
pivot = pd.pivot_table(df, values='age', index='country', columns='gender', aggfunc=['sum', 'mean'])
print(pivot)
Output:
sum mean
gender F M F M
country
Australia 35 42 35.0 42.0
Canada 43 24 21.5 24.0
USA 49 107 24.5 35.7
This pivot table shows the sum and mean age of males and females in each country.
Pivot Table with Multiple Indexes
You can also create pivot tables with multiple indexes. For example, to show the average age of males and females in each country and state:
data = {'name': ['John', 'Mary', 'Peter', 'Jeff', 'Lisa', 'Ann', 'Bob', 'Alice'],
'age': [23, 19, 42, 31, 24, 26, 29, 35],
'country': ['USA', 'Canada', 'Australia', 'USA', 'Canada', 'USA', 'USA', 'Australia'],
'state': ['NY', 'BC', 'NSW', 'CA', 'ON', 'NY', 'CA', 'VIC'],
'gender': ['M', 'F', 'M', 'M', 'F', 'F', 'M', 'F']}
df = pd.DataFrame(data)
pivot = pd.pivot_table(df, values='age', index=['country', 'state'], columns='gender', aggfunc='mean')
print(pivot)
Output:
gender F M
country state
Australia NSW NaN 42.0
VIC 35.0 NaN
Canada BC 19.0 NaN
ON 24.0 NaN
USA CA NaN 31.0
NY 26.0 23.0
This pivot table shows the average age of males and females in each country and state.
Pivot Table with Custom Aggregation Function
You can also create pivot tables with custom aggregation functions. For example, to show the range of ages of males and females in each country:
def age_range(group):
return group.max() - group.min()
pivot = pd.pivot_table(df, values='age', index='country', columns='gender', aggfunc=age_range)
print(pivot)
Output:
gender F M
country
Australia NaN 7.0
Canada 5.0 NaN
USA 8.0 12.0
This pivot table shows the age range of males and females in each country.
Conclusion
In this document, we have introduced Pandas, a popular data manipulation library for Python. We have covered its basic functionalities, such as data structures, creating DataFrames, and reading and writing data. We have also discussed some common ways of manipulating data with Pandas, such as subset selection and the?loc?and?iloc?methods. Overall, Pandas is a powerful tool for working with data in Python, and its capabilities go far beyond what we have covered here. If you are working with data, Pandas is definitely worth exploring further.
AI Engineer | Generative AI | LLM | LLMOps | Computer Vision
1 年Congrats Can, great examples I loved it! ??