登录查看更多内容

How to “apply” your Panda(s)

Antony Oliver

Network Automation Developer | Network Automation Tool Creator | Network Design | Network Automation Advocate

发布日期: 2023年5月22日

When working with datasets in Pandas, we often need to restructure the data due to it being collected from different sources and having different formats. An example of this is when collecting interface information from IOS and NXOS devices. IOS devices present IP subnets as subnet masks, whereas NXOS displays subnet values using prefixMask. To ensure consistency within the dataset or database, it's important to standardise the notation of the subnet mask. In this blog, I will explore options for working with data within a dataframe, focusing on standardising the subnet mask notation.

Imagine that you have a dataset of 40,000 interface records, all of which require validation, and any incorrect data needs to be reformatted. This task would normally require the use of a loop, but loops can be very slow. Fortunately, the Pandas Apply method can come to the rescue. The Apply method in Pandas is designed to apply a function along an axis of a DataFrame or a Series, and it's incredibly fast.

This process works by reading the contents of the first row of a defined column, and then sending it to a function for further processing. The function can return a response that overwrites the existing value, or it can be written to a new column, depending on your preference.

Before we start, lets import the data.

import pandas as pd

df_interface = pd.read_csv('blog_post/pandas_slice/dummy_data.csv')

df_interface
? ? deviceName? ? ?ipAddress? ? ?ipSubnetMask? ipPrefixMask? interface? ? ? ? ? ? ? ? ? ? ? ? ? description
0? ?test_lab_1? 192.168.10.1? ? 255.255.255.0? ? ? ? ? ?NaN? ?vlan1000? ?test interface test_lab_1_vlan1000
1? ?test_lab_1? 192.168.11.1? 255.255.255.252? ? ? ? ? ?NaN? ?vlan1001? ?test interface test_lab_1_vlan1001
2? ?test_lab_1? ? 172.16.1.1? 255.255.255.255? ? ? ? ? ?NaN? loopback0? test interface test_lab_1_loopback0
3? ?test_lab_1? ? 172.16.1.2? 255.255.255.255? ? ? ? ? ?NaN? loopback1? test interface test_lab_1_loopback1
4? ?test_lab_2? 192.168.14.1? ? ? ? ? ? ? NaN? ? ? ? ? 24.0? ?vlan1000? ?test interface test_lab_2_vlan1000
5? ?test_lab_2? 192.168.15.1? ? ? ? ? ? ? NaN? ? ? ? ? 29.0? ?vlan1001? ?test interface test_lab_2_vlan1001
6? ?test_lab_2? 192.168.16.1? ? ? ? ? ? ? NaN? ? ? ? ? 28.0? ?vlan1002? ?test interface test_lab_2_vlan1002
7? ?test_lab_2? 192.168.17.1? ? ? ? ? ? ? NaN? ? ? ? ? 24.0? ?vlan1003? ?test interface test_lab_2_vlan1003

I have imported the contents of my .csv file into Pandas, replaced/filled any NaN values, and then set the column types to integer and string.

df_interface['ipPrefixMask'] = df_interface['ipPrefixMask'].fillna('0').astype(int
df_interface['ipSubnetMask'] = df_interface['ipSubnetMask'].fillna('').astype(str))

df_interface

deviceName? ? ?ipAddress? ? ?ipSubnetMask? ipPrefixMask? interface? ? ? ? ? ? ? ? ? ? ? ? ? descriptio
0? ?test_lab_1? 192.168.10.1? ? 255.255.255.0? ? ? ? ? ? ?0? ?vlan1000? ?test interface test_lab_1_vlan1000
1? ?test_lab_1? 192.168.11.1? 255.255.255.252? ? ? ? ? ? ?0? ?vlan1001? ?test interface test_lab_1_vlan1001
2? ?test_lab_1? ? 172.16.1.1? 255.255.255.255? ? ? ? ? ? ?0? loopback0? test interface test_lab_1_loopback0
3? ?test_lab_1? ? 172.16.1.2? 255.255.255.255? ? ? ? ? ? ?0? loopback1? test interface test_lab_1_loopback1
4? ?test_lab_2? 192.168.14.1? ? ? ? ? ? ? ? ? ? ? ? ? ? ?24? ?vlan1000? ?test interface test_lab_2_vlan1000
5? ?test_lab_2? 192.168.15.1? ? ? ? ? ? ? ? ? ? ? ? ? ? ?29? ?vlan1001? ?test interface test_lab_2_vlan1001
6? ?test_lab_2? 192.168.16.1? ? ? ? ? ? ? ? ? ? ? ? ? ? ?28? ?vlan1002? ?test interface test_lab_2_vlan1002
7? ?test_lab_2? 192.168.17.1? ? ? ? ? ? ? ? ? ? ? ? ? ? ?24? ?vlan1003? ?test interface test_lab_2_vlan1003n

For my first example, I'm going to take the values contained within the deviceName column and return each value as uppercase. Each replaced value will overwrite the existing deviceName value.

We have three options available, all of which will produce the same results.

Option 1) calls a regular function that returns a string in uppercase.

Option 2) calls a lambda function that returns a string in uppercase.

?Option 3) calls the str().upper() function, which returns a string in uppercase.

def make_upper_case(device_name):
    return device_name.upper()


df_interface['deviceName'] = df_interface['deviceName'].apply(make_upper_case)
df_interface['deviceName'] = df_interface['deviceName'].apply(lambda x: x.upper())
df_interface['deviceName'] = df_interface['deviceName'].str.upper()

As your logic conditions become more complex, I find it easier to write a regular function. However, for simple tasks like this, either the lambda function or calling the function will suffice.

? ? deviceName? ? ?ipAddress? ? ?ipSubnetMask? ipPrefixMask? interface? ? ? ? ? ? ? ? ? ? ? ? ? descriptio
0? ?TEST_LAB_1? 192.168.10.1? ? 255.255.255.0? ? ? ? ? ? ?0? ?vlan1000? ?test interface test_lab_1_vlan1000
1? ?TEST_LAB_1? 192.168.11.1? 255.255.255.252? ? ? ? ? ? ?0? ?vlan1001? ?test interface test_lab_1_vlan1001
2? ?TEST_LAB_1? ? 172.16.1.1? 255.255.255.255? ? ? ? ? ? ?0? loopback0? test interface test_lab_1_loopback0
3? ?TEST_LAB_1? ? 172.16.1.2? 255.255.255.255? ? ? ? ? ? ?0? loopback1? test interface test_lab_1_loopback1n

Next, I need to reformat the contents of the ipPrefixMask column and, in addition, reformat the values found in the ipSubnetMask column to an IP prefix mask. To accomplish this, I will pass the values of both the ipPrefixMask and ipSubnetMask columns and return the result accordingly. The condition should be as follows:

If the ipPrefixMask value is present, it should be converted to a subnet mask, then return both ipPrefixMask and ipSubnetMask
If the ipSubnetMask value is present, it should be converted to prefix mask, then return both ipPrefixMask and ipSubnetMask

Now that we have defined the logic, let's proceed to write the code.

from ipaddress import IPv4Interface


def reformat_mask(ip_subnet_mask,ip_prefix_mask):
    if ip_subnet_mask != '':
        ip_prefix_mask =  IPv4Interface('/'.join(('0.0.0.0',
                                                    str(ip_subnet_mask)))).with_prefixlen.split('/')[1]
    if ip_prefix_mask != 0:
        ip_subnet_mask = IPv4Interface('/'.join(('0.0.0.0', 
                                                 str(ip_prefix_mask)))).with_netmask.split('/')[1]
    return ip_subnet_mask,ip_prefix_mask



(df_interface['ipSubnetMask'],
    df_interface['ipPrefixMask']) = zip(*df_interface.apply(lambda x: reformat_mask(x['ipSubnetMask'],
                                                                                    x['ipPrefixMask']),
                                                                                    axis=1))

If we examine the code, we can see that the two columns to the left of the operator will display the results. The apply method utilises a lambda function, which invokes the reformat_mask function along the horizontal axis. Finally, since the returned results are in the form of a tuple, the zip function (using the *unzip operator) will divide the results across the two columns.

To make things more interesting, I've expanded out my data set to 40,000 rows.

领英推荐

The Future of Data Analysis: Unraveling Hidden…

Indian Cyber Security Solutions (GreenFellow IT Security Solutions Pvt Ltd) 1 年前

Selecting and Filtering Data: How to Navigate a…

Benjamin Bennett Alexander 8 个月前

How to index data into Vector DB from highly…

Gaurav Chopra 5 个月前

df_interface = pd.read_csv('blog_post/pandas_slice/dummy_data.csv'
df_interface['ipPrefixMask'] = df_interface['ipPrefixMask'].fillna('0').astype(int)
df_interface['ipSubnetMask'] = df_interface['ipSubnetMask'].fillna('').astype(str)

df_interface_large = pd.DataFrame()
for x in range(2000):
    df_interface_large  = pd.concat([df_interface_large,df_interface]))

len(df_interface_large)
40000

test using pandas apply method:

import time
start_time = int(time.time())
(df_interface_large['ipSubnetMask'],
    df_interface_large['ipPrefixMask']) = zip(*df_interface_large.apply(lambda x: reformat_mask(x['ipSubnetMask'],
                                                                                                x['ipPrefixMask']),
                                                                                                axis=1))
print(f'processing time was {int(time.time()) - start_time} seconds')
processing time was 1 seconds

Now, let's consider what would happen if I were to use a standard loop instead. Let's find out and compare the results.

test using loop method:

import numpy as np


def loop_test():
    df_interface_large.index = np.arange(1, len(df_interface_large) + 1)
    start_time = int(time.time())
    for i, row in df_interface_large.iterrows():
        result = df_interface_large.at[i,'ipSubnetMask'],df_interface_large.at[i,'ipPrefixMask']
        df_interface_large.loc[i,'ipSubnetMask'] = result[0]
        df_interface_large.loc[i,'ipPrefixMask'] = result[1]
    print(f'processing time was {int(time.time()) - start_time} seconds')


loop_test()
processing time was 36 seconds

36 seconds versus 1 second. This is an important point: it's the same data executed on the same machine but using a different technique. It's crucial to only resort to using a loop as a last option!

The information provided on this blog is for general informational purposes only. It is not intended as professional or technical advice and should not be construed as such. The views expressed on this blog are solely those of the author and do not necessarily represent the views of any other organization or entity. While I make every effort to ensure that the information provided on this blog is accurate and up-to-date, we cannot guarantee its accuracy or completeness. I will not be liable for any losses, injuries, or damages from the use of or reliance on the information provided on this blog. Any reliance you place on such information is therefore strictly at your own risk. I reserve the right to modify or discontinue the blog at any time without notice.

For my first example, I’m going to take the column “ipAddress” and append the ip address to the existing description. The result will overwrite the existing description value.

要查看或添加评论，请登录

Antony Oliver的更多文章

Network Vista Beta Release 0.9

2025年3月3日

Network Vista Beta Release 0.9

I’m excited to announce that Network Vista is now available for download via Docker Hub! What is Network Vista? Network…

2 条评论
Network Time Bombs

2024年3月28日

Network Time Bombs

Network inconsistencies can be likened to time bombs lurking within your infrastructure, waiting to detonate. These…

3 条评论
Nebula's Control List Module

2024年2月20日

Nebula's Control List Module

Exciting news! Today, I'm diving into Nebula's Control List Audit Module. What exactly is a control list? ?? A control…
Nebula’s Database and Historical Diff Module

2024年2月14日

Nebula’s Database and Historical Diff Module

When I started designing Nebula, I aimed to address key challenges in network management. Changes within a network can…

1 条评论
Simplifying Meraki Configuration Audits

2024年1月12日

Simplifying Meraki Configuration Audits

In my first blog of 2024, I am excited to showcase the new Meraki Configuration Audit Module, a component integrated…

1 条评论
ACI Terraform Server

2023年12月11日

ACI Terraform Server

It's been a while since I shared my thoughts here, and with a bit of free time on my hands, I'm excited to present my…

1 条评论
How to slice and dice your Panda(s)

2023年5月8日

How to slice and dice your Panda(s)

Why Pandas? Pandas is a Python library that offers extensive data manipulation capabilities, enabling users to easily…
IP Address Python Module

2023年4月26日

IP Address Python Module

As network automation engineers, we straddle two worlds. On one side, we must have an expert understanding of…

See all articles

How to “apply” your Panda(s)

Antony Oliver

Network Automation Developer | Network Automation Tool Creator | Network Design | Network Automation Advocate

领英推荐

Antony Oliver的更多文章

社区洞察

其他会员也浏览了

?? Unlock Time Series Insights Using Python’s KPSS Test ??

Extracting Coordinates from Google Maps URLs and Exporting to Excel

Check Regional Information via Coarsen

Evolution of Data Visualization in R

Understanding Ridge Regression with 2D Data & Custom Implementation

Kruskal-Wallis Test: A Comprehensive Guide for Non-Parametric Analysis

Day #4 of Rust 100 Days Challenge: Data Types

How to speed up tabular data processing by 1053x in pandas/cudf

pClick - the challenges, the possibilities, the insights

领英推荐

Antony Oliver的更多文章

Network Vista Beta Release 0.9

Network Time Bombs

Nebula's Control List Module

Nebula’s Database and Historical Diff Module

Simplifying Meraki Configuration Audits

ACI Terraform Server

How to slice and dice your Panda(s)

IP Address Python Module

社区洞察

其他会员也浏览了

?? Unlock Time Series Insights Using Python’s KPSS Test ??

Extracting Coordinates from Google Maps URLs and Exporting to Excel

Check Regional Information via Coarsen

Evolution of Data Visualization in R

Understanding Ridge Regression with 2D Data & Custom Implementation

Kruskal-Wallis Test: A Comprehensive Guide for Non-Parametric Analysis

Day #4 of Rust 100 Days Challenge: Data Types

How to speed up tabular data processing by 1053x in pandas/cudf

pClick - the challenges, the possibilities, the insights