How to “apply” your Panda(s)

When working with datasets in Pandas, we often need to restructure the data due to it being collected from different sources and having different formats. An example of this is when collecting interface information from IOS and NXOS devices. IOS devices present IP subnets as subnet masks, whereas NXOS displays subnet values using prefixMask. To ensure consistency within the dataset or database, it's important to standardise the notation of the subnet mask. In this blog, I will explore options for working with data within a dataframe, focusing on standardising the subnet mask notation.

Imagine that you have a dataset of 40,000 interface records, all of which require validation, and any incorrect data needs to be reformatted. This task would normally require the use of a loop, but loops can be very slow. Fortunately, the Pandas Apply method can come to the rescue. The Apply method in Pandas is designed to apply a function along an axis of a DataFrame or a Series, and it's incredibly fast.

This process works by reading the contents of the first row of a defined column, and then sending it to a function for further processing. The function can return a response that overwrites the existing value, or it can be written to a new column, depending on your preference.

Before we start, lets import the data.

import pandas as pd

df_interface = pd.read_csv('blog_post/pandas_slice/dummy_data.csv')

df_interface
? ? deviceName? ? ?ipAddress? ? ?ipSubnetMask? ipPrefixMask? interface? ? ? ? ? ? ? ? ? ? ? ? ? description
0? ?test_lab_1? 192.168.10.1? ? 255.255.255.0? ? ? ? ? ?NaN? ?vlan1000? ?test interface test_lab_1_vlan1000
1? ?test_lab_1? 192.168.11.1? 255.255.255.252? ? ? ? ? ?NaN? ?vlan1001? ?test interface test_lab_1_vlan1001
2? ?test_lab_1? ? 172.16.1.1? 255.255.255.255? ? ? ? ? ?NaN? loopback0? test interface test_lab_1_loopback0
3? ?test_lab_1? ? 172.16.1.2? 255.255.255.255? ? ? ? ? ?NaN? loopback1? test interface test_lab_1_loopback1
4? ?test_lab_2? 192.168.14.1? ? ? ? ? ? ? NaN? ? ? ? ? 24.0? ?vlan1000? ?test interface test_lab_2_vlan1000
5? ?test_lab_2? 192.168.15.1? ? ? ? ? ? ? NaN? ? ? ? ? 29.0? ?vlan1001? ?test interface test_lab_2_vlan1001
6? ?test_lab_2? 192.168.16.1? ? ? ? ? ? ? NaN? ? ? ? ? 28.0? ?vlan1002? ?test interface test_lab_2_vlan1002
7? ?test_lab_2? 192.168.17.1? ? ? ? ? ? ? NaN? ? ? ? ? 24.0? ?vlan1003? ?test interface test_lab_2_vlan1003
        

I have imported the contents of my .csv file into Pandas, replaced/filled any NaN values, and then set the column types to integer and string.

df_interface['ipPrefixMask'] = df_interface['ipPrefixMask'].fillna('0').astype(int
df_interface['ipSubnetMask'] = df_interface['ipSubnetMask'].fillna('').astype(str))

df_interface

deviceName? ? ?ipAddress? ? ?ipSubnetMask? ipPrefixMask? interface? ? ? ? ? ? ? ? ? ? ? ? ? descriptio
0? ?test_lab_1? 192.168.10.1? ? 255.255.255.0? ? ? ? ? ? ?0? ?vlan1000? ?test interface test_lab_1_vlan1000
1? ?test_lab_1? 192.168.11.1? 255.255.255.252? ? ? ? ? ? ?0? ?vlan1001? ?test interface test_lab_1_vlan1001
2? ?test_lab_1? ? 172.16.1.1? 255.255.255.255? ? ? ? ? ? ?0? loopback0? test interface test_lab_1_loopback0
3? ?test_lab_1? ? 172.16.1.2? 255.255.255.255? ? ? ? ? ? ?0? loopback1? test interface test_lab_1_loopback1
4? ?test_lab_2? 192.168.14.1? ? ? ? ? ? ? ? ? ? ? ? ? ? ?24? ?vlan1000? ?test interface test_lab_2_vlan1000
5? ?test_lab_2? 192.168.15.1? ? ? ? ? ? ? ? ? ? ? ? ? ? ?29? ?vlan1001? ?test interface test_lab_2_vlan1001
6? ?test_lab_2? 192.168.16.1? ? ? ? ? ? ? ? ? ? ? ? ? ? ?28? ?vlan1002? ?test interface test_lab_2_vlan1002
7? ?test_lab_2? 192.168.17.1? ? ? ? ? ? ? ? ? ? ? ? ? ? ?24? ?vlan1003? ?test interface test_lab_2_vlan1003n        

For my first example, I'm going to take the values contained within the deviceName column and return each value as uppercase. Each replaced value will overwrite the existing deviceName value.

We have three options available, all of which will produce the same results.

Option 1) calls a regular function that returns a string in uppercase.

Option 2) calls a lambda function that returns a string in uppercase.

?Option 3) calls the str().upper() function, which returns a string in uppercase.

def make_upper_case(device_name):
    return device_name.upper()


df_interface['deviceName'] = df_interface['deviceName'].apply(make_upper_case)
df_interface['deviceName'] = df_interface['deviceName'].apply(lambda x: x.upper())
df_interface['deviceName'] = df_interface['deviceName'].str.upper()        

As your logic conditions become more complex, I find it easier to write a regular function. However, for simple tasks like this, either the lambda function or calling the function will suffice.

? ? deviceName? ? ?ipAddress? ? ?ipSubnetMask? ipPrefixMask? interface? ? ? ? ? ? ? ? ? ? ? ? ? descriptio
0? ?TEST_LAB_1? 192.168.10.1? ? 255.255.255.0? ? ? ? ? ? ?0? ?vlan1000? ?test interface test_lab_1_vlan1000
1? ?TEST_LAB_1? 192.168.11.1? 255.255.255.252? ? ? ? ? ? ?0? ?vlan1001? ?test interface test_lab_1_vlan1001
2? ?TEST_LAB_1? ? 172.16.1.1? 255.255.255.255? ? ? ? ? ? ?0? loopback0? test interface test_lab_1_loopback0
3? ?TEST_LAB_1? ? 172.16.1.2? 255.255.255.255? ? ? ? ? ? ?0? loopback1? test interface test_lab_1_loopback1n        

Next, I need to reformat the contents of the ipPrefixMask column and, in addition, reformat the values found in the ipSubnetMask column to an IP prefix mask. To accomplish this, I will pass the values of both the ipPrefixMask and ipSubnetMask columns and return the result accordingly. The condition should be as follows:

  • If the ipPrefixMask value is present, it should be converted to a subnet mask, then return both ipPrefixMask and ipSubnetMask
  • If the ipSubnetMask value is present, it should be converted to prefix mask, then return both ipPrefixMask and ipSubnetMask

Now that we have defined the logic, let's proceed to write the code.

from ipaddress import IPv4Interface


def reformat_mask(ip_subnet_mask,ip_prefix_mask):
    if ip_subnet_mask != '':
        ip_prefix_mask =  IPv4Interface('/'.join(('0.0.0.0',
                                                    str(ip_subnet_mask)))).with_prefixlen.split('/')[1]
    if ip_prefix_mask != 0:
        ip_subnet_mask = IPv4Interface('/'.join(('0.0.0.0', 
                                                 str(ip_prefix_mask)))).with_netmask.split('/')[1]
    return ip_subnet_mask,ip_prefix_mask



(df_interface['ipSubnetMask'],
    df_interface['ipPrefixMask']) = zip(*df_interface.apply(lambda x: reformat_mask(x['ipSubnetMask'],
                                                                                    x['ipPrefixMask']),
                                                                                    axis=1))        

If we examine the code, we can see that the two columns to the left of the operator will display the results. The apply method utilises a lambda function, which invokes the reformat_mask function along the horizontal axis. Finally, since the returned results are in the form of a tuple, the zip function (using the *unzip operator) will divide the results across the two columns.

To make things more interesting, I've expanded out my data set to 40,000 rows.

df_interface = pd.read_csv('blog_post/pandas_slice/dummy_data.csv'
df_interface['ipPrefixMask'] = df_interface['ipPrefixMask'].fillna('0').astype(int)
df_interface['ipSubnetMask'] = df_interface['ipSubnetMask'].fillna('').astype(str)

df_interface_large = pd.DataFrame()
for x in range(2000):
    df_interface_large  = pd.concat([df_interface_large,df_interface]))

len(df_interface_large)
40000
        

test using pandas apply method:

import time
start_time = int(time.time())
(df_interface_large['ipSubnetMask'],
    df_interface_large['ipPrefixMask']) = zip(*df_interface_large.apply(lambda x: reformat_mask(x['ipSubnetMask'],
                                                                                                x['ipPrefixMask']),
                                                                                                axis=1))
print(f'processing time was {int(time.time()) - start_time} seconds')
processing time was 1 seconds        

Now, let's consider what would happen if I were to use a standard loop instead. Let's find out and compare the results.

test using loop method:

import numpy as np


def loop_test():
    df_interface_large.index = np.arange(1, len(df_interface_large) + 1)
    start_time = int(time.time())
    for i, row in df_interface_large.iterrows():
        result = df_interface_large.at[i,'ipSubnetMask'],df_interface_large.at[i,'ipPrefixMask']
        df_interface_large.loc[i,'ipSubnetMask'] = result[0]
        df_interface_large.loc[i,'ipPrefixMask'] = result[1]
    print(f'processing time was {int(time.time()) - start_time} seconds')


loop_test()
processing time was 36 seconds
        

36 seconds versus 1 second. This is an important point: it's the same data executed on the same machine but using a different technique. It's crucial to only resort to using a loop as a last option!


The information provided on this blog is for general informational purposes only. It is not intended as professional or technical advice and should not be construed as such. The views expressed on this blog are solely those of the author and do not necessarily represent the views of any other organization or entity. While I make every effort to ensure that the information provided on this blog is accurate and up-to-date, we cannot guarantee its accuracy or completeness. I will not be liable for any losses, injuries, or damages from the use of or reliance on the information provided on this blog. Any reliance you place on such information is therefore strictly at your own risk. I reserve the right to modify or discontinue the blog at any time without notice.













For my first example, I’m going to take the column “ipAddress” and append the ip address to the existing description. The result will overwrite the existing description value.

要查看或添加评论,请登录

Antony Oliver的更多文章

  • Network Vista Beta Release 0.9

    Network Vista Beta Release 0.9

    I’m excited to announce that Network Vista is now available for download via Docker Hub! What is Network Vista? Network…

    2 条评论
  • Network Time Bombs

    Network Time Bombs

    Network inconsistencies can be likened to time bombs lurking within your infrastructure, waiting to detonate. These…

    3 条评论
  • Nebula's Control List Module

    Nebula's Control List Module

    Exciting news! Today, I'm diving into Nebula's Control List Audit Module. What exactly is a control list? ?? A control…

  • Nebula’s Database and Historical Diff Module

    Nebula’s Database and Historical Diff Module

    When I started designing Nebula, I aimed to address key challenges in network management. Changes within a network can…

    1 条评论
  • Simplifying Meraki Configuration Audits

    Simplifying Meraki Configuration Audits

    In my first blog of 2024, I am excited to showcase the new Meraki Configuration Audit Module, a component integrated…

    1 条评论
  • ACI Terraform Server

    ACI Terraform Server

    It's been a while since I shared my thoughts here, and with a bit of free time on my hands, I'm excited to present my…

    1 条评论
  • How to slice and dice your Panda(s)

    How to slice and dice your Panda(s)

    Why Pandas? Pandas is a Python library that offers extensive data manipulation capabilities, enabling users to easily…

  • IP Address Python Module

    IP Address Python Module

    As network automation engineers, we straddle two worlds. On one side, we must have an expert understanding of…

社区洞察

其他会员也浏览了