Scripting Techniques in KPI Monitoring: A Case Study

INTRODUCTION

Network monitoring in an ever-evolving environment poses real challenges. This blog recounts the implementation of a KPI monitoring system for a router, built on a Linux server. Library updates were impossible due to compatibility restrictions, and traditional monitoring methods, such as memory monitoring through SNMP, proved insufficient for our specific needs. These constraints made the case for innovation clear and led us to develop customized scripting solutions that balance new capabilities with operational stability under the existing conditions of the environment.

In this blog, we explore the various facets of this project: tool selection, automation of the data collection process, database configuration, information gathering and processing, and effective visualization of the results.

TOOLS AND TECHNOLOGIES USED

To address specific challenges and contribute effectively to the monitoring system, various tools and technologies were carefully selected:

  • Jenkins was chosen for its versatility in automating tasks, becoming an indispensable tool for scheduling monitoring script executions. Its added value lies in the ability to provide detailed visualizations of task execution, enabling proactive adjustments and continuous optimization.
  • InfluxDB was selected as the database due to its preexisting presence on the Linux server, dictating the need for adaptation and leveraging its efficiency in handling large volumes of data.
  • Grafana was chosen for visualization for its ability to transform complex data into interactive and understandable dashboards, thus facilitating the identification of anomalies and trends in network performance.
  • Python 2.7 was used due to server restrictions and the need for compatibility with previous developments. The most relevant libraries used include Netmiko for SSH connections, ncclient for NETCONF operations, re for parsing command outputs, and the influxdb client for writing to the database.

The following diagram offers a clear view of the process and interaction between the different components of our solution. It illustrates the data journey from its collection point to the visualization phase.

Within this process, the following KPIs are monitored:

  • Controller & line card CPU utilization
  • Controller & line card memory utilization
  • Chassis temperature
  • Interface bandwidth
  • Interface errors & packet discards
  • Interface & circuit delay (TWAMP/IP SLA mechanisms)
  • BGP neighbor status
  • IS-IS neighbor status

For each KPI, a dedicated Python script was created to collect and process the relevant information. However, due to the complexity and volume of data associated with some KPIs, it was necessary to develop more than one script. This strategy, along with others implemented, reduced sampling and execution times for each script, even in cases such as 'Interface Errors and Packet Discards', where the number of subinterfaces exceeded 400.

DATABASE CONFIGURATION

The project used InfluxDB (accessed through version 5.3.1 of the influxdb Python client), a choice dictated by its presence as the data storage solution already on the Linux server and the need for consistency with other solutions in use. Below, we review the initialization procedure and the data structure used.

Database Creation: Through the InfluxDB CLI, the database was established, intended to be the central repository for all collected data.

CREATE DATABASE ScriptDB
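
As a minimal illustration (not part of the original setup), the creation can be verified from the same CLI, and a retention policy can be attached so old samples expire automatically; the policy name and duration below are hypothetical:

SHOW DATABASES

CREATE RETENTION POLICY "thirty_days" ON "ScriptDB" DURATION 30d REPLICATION 1 DEFAULT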

Definition of Measurements and Tags: Specific “measurements” were defined for each KPI. Each measurement encapsulates a set of metrics, with “tags” and “fields” designed to facilitate data filtering and querying. Tags are indexed and used to filter database queries, while fields hold the metric values that Grafana plots.

Example of Data Structure:

Measurement: interface_metrics

  • Tags: Host, Interface, Description.
  • Fields: input_rate, output_rate, input_drop, output_drop.
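
For reference, a single point in this measurement translates to one line of InfluxDB line protocol. The host, interface, and values below are hypothetical:

interface_metrics,Host=BLOG_U01,Interface=ge-0/0/0,Description=uplink input_rate=12000,output_rate=9500,input_drop=0,output_drop=3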

INFORMATION GATHERING

Information gathering was conducted through customized Python scripts, developed for each KPI, allowing direct interaction with the router and the collection of specific data on system performance and health. General and specific extraction methods considered are:

  • Connection and Commands: Netmiko was used to establish secure SSH connections with the router, enabling the execution of specific commands and the collection of real-time data (a minimal sketch combining this with the parsing step follows this list).
  • Data Parsing: Given the diversity of collected information, regular expressions were applied to parse and extract specific data from command outputs, transforming raw data into structured information.
  • Using Netconf for Optimization: Netconf, through ncclient, was crucial for efficiently handling information from hundreds of subinterfaces, significantly improving the data collection process.
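
As a brief illustration of the first two techniques, here is a minimal sketch that opens an SSH session with Netmiko and parses one value out of the command output with a regular expression. The device type, command, and output pattern are assumptions based on a Junos-style router; adapt them to your platform:

import re
from netmiko import ConnectHandler

# Connection details are placeholders; use your router's real values
device = {
    'device_type': 'juniper_junos',
    'host': 'BLOG_U01',
    'username': 'admin',
    'password': 'admin',
}

connection = ConnectHandler(**device)
output = connection.send_command('show chassis routing-engine')
connection.disconnect()

# Assumed output format: a line such as "Memory utilization   35 percent"
match = re.search(r'Memory utilization\s+(\d+)\s+percent', output)
if match:
    print('Memory usage: %s%%' % match.group(1))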

As an illustration, two examples of the various Python scripts developed for this project are presented:

  • Monitoring Memory and CPU: These scripts are designed to assess memory and CPU usage, critical factors for evaluating router performance. The scripts parse command outputs to identify and record key metrics.
  • Collection of Interface Metrics: A specific script was implemented to collect detailed metrics of interfaces, such as input/output rates and error statistics. This approach provides a clear view of the network traffic's state and efficiency (a NETCONF-based sketch follows).
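
For the interface-heavy cases, a single NETCONF RPC can return data for all subinterfaces at once instead of issuing one CLI command per interface. The following is a minimal sketch with ncclient; the Junos get-interface-information RPC and the connection details are assumptions, not the exact production script:

from ncclient import manager
from ncclient.xml_ import to_ele

# Placeholder credentials; NETCONF over SSH normally listens on port 830
conn = manager.connect(host='BLOG_U01', port=830, username='admin',
                       password='admin', hostkey_verify=False)

# One RPC retrieves terse information for every interface and subinterface
reply = conn.dispatch(to_ele('<get-interface-information><terse/></get-interface-information>'))

# reply.xml holds the full XML response; parse it with lxml/ElementTree as needed
print(reply.xml[:500])

conn.close_session()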

WRITING TO THE DATABASE

Each Python script was designed with a specific function to transmit the collected data to InfluxDB. This integration structures the data in JSON format, reflecting our desired structure of measurements, tags, and fields. This method ensures the data are properly prepared for storage, organizing them to facilitate subsequent querying and analysis.

Within each script, we have created the send_to_influxdb function, which establishes a connection to InfluxDB using specific details such as credentials and the port. For example, in CPU monitoring and interface metrics, send_to_influxdb prepares JSON objects that include relevant tags like host, routing_engine, or interface, and fields with detailed metrics. At the end of each script, send_to_influxdb is executed, ensuring the continuous update of the database with the most recent data without the need for manual intervention.

A crucial part of structuring these data is the use of JSON to clearly define the tags and fields, which is essential for precise KPI monitoring. Below is an example of how these data are structured in the send_to_influxdb function:

from influxdb import InfluxDBClient

def send_to_influxdb(host, interface_list):
    # Connect to the local InfluxDB instance; no credentials are set in this example
    client = InfluxDBClient('localhost', 8086, None, None, 'ScriptDB')
    for interface, data in interface_list.items():
        # One point per interface, mirroring the measurement/tags/fields design
        json_body = [
            {
                "measurement": "interface_metrics",
                "tags": {
                    "interface": interface,
                    "host": host,
                    "description": data.get("description", ""),
                },
                "fields": {
                    "input_rate_bytes": data.get("input_rate_bytes", 0),
                    "output_rate_bytes": data.get("output_rate_bytes", 0),
                    "total_input_drops": data.get("total_input_drops", 0),
                    "total_output_drops": data.get("total_output_drops", 0),
                },
            }
        ]
        client.write_points(json_body)
    client.close()
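
As a usage sketch (with hypothetical values), the function is called once per polling cycle with the parsed interface data:

interfaces = {
    "ge-0/0/0": {
        "description": "uplink",
        "input_rate_bytes": 12000,
        "output_rate_bytes": 9500,
        "total_input_drops": 0,
        "total_output_drops": 3,
    }
}
send_to_influxdb("BLOG_U01", interfaces)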

This structured approach enables efficient indexing and quick data retrieval, facilitating complex analyses and detailed visualizations. The structuring within the "interface_metrics" measurement to capture input_rate_bytes, output_rate_bytes, total_input_drops, and total_output_drops is crucial: it optimizes data handling within the database, making filtering, querying, and visualization efficient.

VISUALIZATION WITH GRAFANA

Grafana plays a key role in the dynamic visualization of data stored in InfluxDB, transforming complex metrics into clear and accessible graphs.

We created dashboards in Grafana that clearly display the KPIs, visualizing critical metrics such as CPU usage, memory, latency, among others. This provides us with a comprehensive view of the state of our infrastructure. Also, we formulated specific queries in Grafana to extract and display the data accurately.

  • Memory Monitoring: The following queries show the percentage of memory usage on each routing engine:

SELECT "usage_percentage" FROM "memory_re" WHERE "re" = 'RE0'

SELECT "usage_percentage" FROM "memory_re" WHERE "re" = 'RE1'

Measurement: memory_re

  • Tags: Host, re.
  • Fields: usage_percentage.
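
Both routing engines can also be drawn in a single panel by grouping on the re tag. This variant is a sketch that follows the same pattern (host tag and $Host variable) as the drops query shown next:

SELECT "usage_percentage" FROM "memory_re" WHERE "host" = '$Host' AND $timeFilter GROUP BY "re" fill(null)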

  • Drops in Interfaces: This query allows examining the statistics of input and output drops, facilitating the identification of potential problems in network traffic. The inclusion of the $Host variable allows for dynamic filtering by specific host.

SELECT "input_drop", "output_drop" FROM "link_status_metrics" WHERE "host" = '$Host' AND $timeFilter GROUP BY "interface" fill(null)

Measurement: link_status_metrics

  • Tags: interface, host, and description
  • Fields: input_drop, output_drop

In addition to visualizations, we have set up alerts in Grafana to signal situations requiring immediate attention, such as excessive CPU usage or abnormally high fan speeds.

AUTOMATION WITH JENKINS

Through Jenkins, we have configured an automated task named "ROUTER-KPI," responsible for executing the central script kpi_aggregator.py. This script, in turn, invokes a series of specific scripts designed to collect various KPIs from the router (a sketch of this orchestration appears at the end of this section).

The "ROUTER-KPI" task has been set to run periodically, adapting to our data collection needs. The configuration details include:

  • Scheduled Execution: We have scheduled Jenkins to launch kpi_aggregator.py at regular intervals. For example, for executions every 2 minutes, we configured H/2 * * * * in the Build Triggers → Build periodically section.

  • Execution Command: In Build → Execute shell, we specified how and where kpi_aggregator.py will be executed, including the necessary parameters for connecting to the router and collecting KPIs.

cd /path/to/scripts

sudo python kpi_aggregator.py --host=BLOG_U01 --username=admin --password=admin

This command navigates to the directory containing the script and executes it with the appropriate arguments.

  • Build Management: To ensure system efficiency, we have configured Jenkins to retain only a limited number of old builds, using log rotation policies and build retention.
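
To illustrate the orchestration performed by kpi_aggregator.py, here is a minimal sketch; the per-KPI script names are hypothetical, and the real aggregator may differ:

import subprocess
import sys

# Hypothetical list of per-KPI scripts invoked by the aggregator
KPI_SCRIPTS = ['cpu_monitor.py', 'memory_monitor.py', 'interface_metrics.py']

def main():
    # Forward the --host/--username/--password arguments to every sub-script
    for script in KPI_SCRIPTS:
        ret = subprocess.call(['python', script] + sys.argv[1:])
        if ret != 0:
            print('Warning: %s exited with code %d' % (script, ret))

if __name__ == '__main__':
    main()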


RECOMMENDATIONS FOR OPTIMIZING KPI MONITORING

To ensure both efficient and effective KPI monitoring, we offer the following practical tips:

  • Robust Exception Handling: Incorporate robust exception handling in your scripts. This not only facilitates the identification and resolution of failures but also improves the accurate diagnosis of connection issues with the router.
  • SSH Session Optimization: Adjusting the number of SSH sessions that the router can manage simultaneously can significantly improve efficiency.
  • Adjustments in Execution Periodicity: The frequency at which scripts are executed directly affects data collection. We recommend using Jenkins to establish a periodicity based on empirical measurements of your scripts' execution time. Additionally, if necessary, adapt the script to allow parallel executions on different items within the same KPI (a thread-pool sketch follows this list). This can significantly improve script execution times, optimize the monitoring process, and increase overall system efficiency.
  • Effective Database Management: It is advisable to perform data cleanup after changes in InfluxDB tags or fields, thus avoiding ambiguities in data analysis and visualization.
  • Data Validation: Manually validating the collected and stored data, comparing it with information obtained directly from the equipment, Grafana visualizations, or execution logs, is essential to ensure accuracy in unit conversion and avoid misinterpretations.
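
As an illustration of the points on exception handling and parallelism, here is a minimal sketch that polls several interfaces concurrently with a thread pool; the query_router function and the worker count are hypothetical placeholders:

from multiprocessing.dummy import Pool  # thread pool, suitable for I/O-bound SSH work

def query_router(interface):
    # Placeholder for the real Netmiko/NETCONF collection call
    return {'input_rate_bytes': 0}

def collect_interface(interface):
    try:
        return interface, query_router(interface)
    except Exception as exc:
        # Robust exception handling: log the failure instead of aborting the whole run
        print('Collection failed for %s: %s' % (interface, exc))
        return interface, None

interfaces = ['ge-0/0/0', 'ge-0/0/1', 'ge-0/0/2']
pool = Pool(4)  # number of workers; keep it below the router's SSH session limit
results = dict(pool.map(collect_interface, interfaces))
pool.close()
pool.join()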

HELPFUL LINKS:

Guide to NETCONF XML Management Protocol

https://www.juniper.net/documentation/us/en/software/junos/netconf/topics/task/netconf-session-sample.html

Introduction to Netmiko Library for Network Automation

https://pynet.twb-tech.com/blog/netmiko-python-library.html

InfluxDB Schema Design Best Practices

https://docs.influxdata.com/influxdb/cloud-serverless/write-data/best-practices/schema-design/

Steps to Create a Custom Grafana Dashboard

https://www.njclabs.com/blogs/13-simple-steps-for-creating-a-custom-built-grafana-dashboard

Best Practices for Creating and Managing Grafana Dashboards

https://grafana.com/docs/grafana/latest/dashboards/build-dashboards/best-practices/

Beginner's Guide to Jenkins Pipelines

https://medium.com/@venkatsatyanreddy_92646/beginners-guide-to-jenkins-pipelines-16a6181def97


- AUTHOR: Alex Moran

Networking professional with expertise in ADC, CGN, Wireless, and DevOps Automation. Skilled in routing, switching, Data Center, and Linux technologies. Experienced in working with various vendors, including Juniper, A10, and Cisco, for Service Provider and Enterprise environments.
