How to “apply” your Panda(s)
Antony Oliver
Network Automation Developer | Network Automation Tool Creator | Network Design | Network Automation Advocate
When working with datasets in Pandas, we often need to restructure the data due to it being collected from different sources and having different formats. An example of this is when collecting interface information from IOS and NXOS devices. IOS devices present IP subnets as subnet masks, whereas NXOS displays subnet values using prefixMask. To ensure consistency within the dataset or database, it's important to standardise the notation of the subnet mask. In this blog, I will explore options for working with data within a dataframe, focusing on standardising the subnet mask notation.
Imagine that you have a dataset of 40,000 interface records, all of which require validation, and any incorrect data needs to be reformatted. This task would normally require the use of a loop, but loops can be very slow. Fortunately, the Pandas Apply method can come to the rescue. The Apply method in Pandas is designed to apply a function along an axis of a DataFrame or a Series, and it's incredibly fast.
This process works by reading the contents of the first row of a defined column, and then sending it to a function for further processing. The function can return a response that overwrites the existing value, or it can be written to a new column, depending on your preference.
Before we start, lets import the data.
import pandas as pd
df_interface = pd.read_csv('blog_post/pandas_slice/dummy_data.csv')
df_interface
? ? deviceName? ? ?ipAddress? ? ?ipSubnetMask? ipPrefixMask? interface? ? ? ? ? ? ? ? ? ? ? ? ? description
0? ?test_lab_1? 192.168.10.1? ? 255.255.255.0? ? ? ? ? ?NaN? ?vlan1000? ?test interface test_lab_1_vlan1000
1? ?test_lab_1? 192.168.11.1? 255.255.255.252? ? ? ? ? ?NaN? ?vlan1001? ?test interface test_lab_1_vlan1001
2? ?test_lab_1? ? 172.16.1.1? 255.255.255.255? ? ? ? ? ?NaN? loopback0? test interface test_lab_1_loopback0
3? ?test_lab_1? ? 172.16.1.2? 255.255.255.255? ? ? ? ? ?NaN? loopback1? test interface test_lab_1_loopback1
4? ?test_lab_2? 192.168.14.1? ? ? ? ? ? ? NaN? ? ? ? ? 24.0? ?vlan1000? ?test interface test_lab_2_vlan1000
5? ?test_lab_2? 192.168.15.1? ? ? ? ? ? ? NaN? ? ? ? ? 29.0? ?vlan1001? ?test interface test_lab_2_vlan1001
6? ?test_lab_2? 192.168.16.1? ? ? ? ? ? ? NaN? ? ? ? ? 28.0? ?vlan1002? ?test interface test_lab_2_vlan1002
7? ?test_lab_2? 192.168.17.1? ? ? ? ? ? ? NaN? ? ? ? ? 24.0? ?vlan1003? ?test interface test_lab_2_vlan1003
I have imported the contents of my .csv file into Pandas, replaced/filled any NaN values, and then set the column types to integer and string.
df_interface['ipPrefixMask'] = df_interface['ipPrefixMask'].fillna('0').astype(int
df_interface['ipSubnetMask'] = df_interface['ipSubnetMask'].fillna('').astype(str))
df_interface
deviceName? ? ?ipAddress? ? ?ipSubnetMask? ipPrefixMask? interface? ? ? ? ? ? ? ? ? ? ? ? ? descriptio
0? ?test_lab_1? 192.168.10.1? ? 255.255.255.0? ? ? ? ? ? ?0? ?vlan1000? ?test interface test_lab_1_vlan1000
1? ?test_lab_1? 192.168.11.1? 255.255.255.252? ? ? ? ? ? ?0? ?vlan1001? ?test interface test_lab_1_vlan1001
2? ?test_lab_1? ? 172.16.1.1? 255.255.255.255? ? ? ? ? ? ?0? loopback0? test interface test_lab_1_loopback0
3? ?test_lab_1? ? 172.16.1.2? 255.255.255.255? ? ? ? ? ? ?0? loopback1? test interface test_lab_1_loopback1
4? ?test_lab_2? 192.168.14.1? ? ? ? ? ? ? ? ? ? ? ? ? ? ?24? ?vlan1000? ?test interface test_lab_2_vlan1000
5? ?test_lab_2? 192.168.15.1? ? ? ? ? ? ? ? ? ? ? ? ? ? ?29? ?vlan1001? ?test interface test_lab_2_vlan1001
6? ?test_lab_2? 192.168.16.1? ? ? ? ? ? ? ? ? ? ? ? ? ? ?28? ?vlan1002? ?test interface test_lab_2_vlan1002
7? ?test_lab_2? 192.168.17.1? ? ? ? ? ? ? ? ? ? ? ? ? ? ?24? ?vlan1003? ?test interface test_lab_2_vlan1003n
For my first example, I'm going to take the values contained within the deviceName column and return each value as uppercase. Each replaced value will overwrite the existing deviceName value.
We have three options available, all of which will produce the same results.
Option 1) calls a regular function that returns a string in uppercase.
Option 2) calls a lambda function that returns a string in uppercase.
?Option 3) calls the str().upper() function, which returns a string in uppercase.
def make_upper_case(device_name):
return device_name.upper()
df_interface['deviceName'] = df_interface['deviceName'].apply(make_upper_case)
df_interface['deviceName'] = df_interface['deviceName'].apply(lambda x: x.upper())
df_interface['deviceName'] = df_interface['deviceName'].str.upper()
As your logic conditions become more complex, I find it easier to write a regular function. However, for simple tasks like this, either the lambda function or calling the function will suffice.
? ? deviceName? ? ?ipAddress? ? ?ipSubnetMask? ipPrefixMask? interface? ? ? ? ? ? ? ? ? ? ? ? ? descriptio
0? ?TEST_LAB_1? 192.168.10.1? ? 255.255.255.0? ? ? ? ? ? ?0? ?vlan1000? ?test interface test_lab_1_vlan1000
1? ?TEST_LAB_1? 192.168.11.1? 255.255.255.252? ? ? ? ? ? ?0? ?vlan1001? ?test interface test_lab_1_vlan1001
2? ?TEST_LAB_1? ? 172.16.1.1? 255.255.255.255? ? ? ? ? ? ?0? loopback0? test interface test_lab_1_loopback0
3? ?TEST_LAB_1? ? 172.16.1.2? 255.255.255.255? ? ? ? ? ? ?0? loopback1? test interface test_lab_1_loopback1n
Next, I need to reformat the contents of the ipPrefixMask column and, in addition, reformat the values found in the ipSubnetMask column to an IP prefix mask. To accomplish this, I will pass the values of both the ipPrefixMask and ipSubnetMask columns and return the result accordingly. The condition should be as follows:
Now that we have defined the logic, let's proceed to write the code.
from ipaddress import IPv4Interface
def reformat_mask(ip_subnet_mask,ip_prefix_mask):
if ip_subnet_mask != '':
ip_prefix_mask = IPv4Interface('/'.join(('0.0.0.0',
str(ip_subnet_mask)))).with_prefixlen.split('/')[1]
if ip_prefix_mask != 0:
ip_subnet_mask = IPv4Interface('/'.join(('0.0.0.0',
str(ip_prefix_mask)))).with_netmask.split('/')[1]
return ip_subnet_mask,ip_prefix_mask
(df_interface['ipSubnetMask'],
df_interface['ipPrefixMask']) = zip(*df_interface.apply(lambda x: reformat_mask(x['ipSubnetMask'],
x['ipPrefixMask']),
axis=1))
If we examine the code, we can see that the two columns to the left of the operator will display the results. The apply method utilises a lambda function, which invokes the reformat_mask function along the horizontal axis. Finally, since the returned results are in the form of a tuple, the zip function (using the *unzip operator) will divide the results across the two columns.
To make things more interesting, I've expanded out my data set to 40,000 rows.
领英推荐
df_interface = pd.read_csv('blog_post/pandas_slice/dummy_data.csv'
df_interface['ipPrefixMask'] = df_interface['ipPrefixMask'].fillna('0').astype(int)
df_interface['ipSubnetMask'] = df_interface['ipSubnetMask'].fillna('').astype(str)
df_interface_large = pd.DataFrame()
for x in range(2000):
df_interface_large = pd.concat([df_interface_large,df_interface]))
len(df_interface_large)
40000
test using pandas apply method:
import time
start_time = int(time.time())
(df_interface_large['ipSubnetMask'],
df_interface_large['ipPrefixMask']) = zip(*df_interface_large.apply(lambda x: reformat_mask(x['ipSubnetMask'],
x['ipPrefixMask']),
axis=1))
print(f'processing time was {int(time.time()) - start_time} seconds')
processing time was 1 seconds
Now, let's consider what would happen if I were to use a standard loop instead. Let's find out and compare the results.
test using loop method:
import numpy as np
def loop_test():
df_interface_large.index = np.arange(1, len(df_interface_large) + 1)
start_time = int(time.time())
for i, row in df_interface_large.iterrows():
result = df_interface_large.at[i,'ipSubnetMask'],df_interface_large.at[i,'ipPrefixMask']
df_interface_large.loc[i,'ipSubnetMask'] = result[0]
df_interface_large.loc[i,'ipPrefixMask'] = result[1]
print(f'processing time was {int(time.time()) - start_time} seconds')
loop_test()
processing time was 36 seconds
36 seconds versus 1 second. This is an important point: it's the same data executed on the same machine but using a different technique. It's crucial to only resort to using a loop as a last option!
The information provided on this blog is for general informational purposes only. It is not intended as professional or technical advice and should not be construed as such. The views expressed on this blog are solely those of the author and do not necessarily represent the views of any other organization or entity. While I make every effort to ensure that the information provided on this blog is accurate and up-to-date, we cannot guarantee its accuracy or completeness. I will not be liable for any losses, injuries, or damages from the use of or reliance on the information provided on this blog. Any reliance you place on such information is therefore strictly at your own risk. I reserve the right to modify or discontinue the blog at any time without notice.
For my first example, I’m going to take the column “ipAddress” and append the ip address to the existing description. The result will overwrite the existing description value.