Scripting Techniques in KPI Monitoring: A Case Study
INTRODUCTION
Facing the challenges of network monitoring in an ever-evolving environment, this blog recounts the implementation of a KPI monitoring system for a router, built on a Linux server. Updating libraries was impossible due to compatibility restrictions, and traditional monitoring methods such as memory monitoring through SNMP often proved insufficient for our specific needs, so the need for innovation became clear. This led us to develop customized scripting solutions that balanced innovation with operational stability under the existing conditions of the environment.
In this blog, we explore the various facets of this project: tool selection, automation of the data collection process, database configuration, information gathering and processing, and effective visualization.
TOOLS AND TECHNOLOGIES USED
To address specific challenges and contribute effectively to the monitoring system, various tools and technologies were carefully selected:
The following diagram offers a clear view of the process and interaction between the different components of our solution. It illustrates the data journey from its collection point to the visualization phase.
Within this process, the following KPIs are monitored:
For each KPI, a dedicated Python script was created to collect and process the relevant information. However, due to the complexity and volume of data associated with some KPIs, it was necessary to develop more than one script. This strategy, along with others implemented, reduced the sampling and execution time of each script, even in cases such as 'Interface Errors and Packet Discards', where the number of subinterfaces exceeded 400; one possible splitting approach is sketched below.
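The blog does not detail the exact splitting strategy, so the following is only a hedged sketch of the underlying idea: divide the interface set into chunks and poll them concurrently, whether as separate scripts or as worker threads within a single script. Here, poll_interface is a hypothetical placeholder for the real per-interface query against the router:

from concurrent.futures import ThreadPoolExecutor

def poll_interface(ifname):
    # Hypothetical placeholder for the real per-interface query against the router
    return {"input_rate_bytes": 0, "output_rate_bytes": 0}

def chunk(items, n):
    # Split a list into n roughly equal chunks (ceiling division for the chunk size)
    size = max(1, -(-len(items) // n))
    return [items[i:i + size] for i in range(0, len(items), size)]

def collect_all(interfaces, workers=4):
    # Poll chunks of interfaces in parallel to cut total sampling time
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(lambda c: {i: poll_interface(i) for i in c}, chunk(interfaces, workers)):
            results.update(partial)
    return results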
DATABASE CONFIGURATION
The project used InfluxDB, written to through version 5.3.1 of the influxdb Python client library; this choice was dictated by InfluxDB's presence as a data storage solution on the Linux server and the need for consistency with other solutions already in use. Below, we review the initialization procedure and the data structure used.
Database Creation: Through the InfluxDB CLI, the database was created to serve as the central repository for all collected data.
CREATE DATABASE ScriptDB
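Once created, the database can be verified from the same CLI with:

SHOW DATABASES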
Definition of Measurements and Tags: Specific “measurements” were defined for each KPI. Each measurement encapsulates a set of metrics, with “tags” and “fields” designed to facilitate data filtering and querying: tags are used to filter within database queries, while fields hold the metric values that serve as the data source for Grafana.
Example of Data Structure:
Measurement: interface_metrics
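Based on the write function shown later, each point in this measurement carries the tags interface, host, and description, and the fields input_rate_bytes, output_rate_bytes, total_input_drops, and total_output_drops. In InfluxDB line protocol, a sample point would look like this (tag and field values are illustrative; the description tag is omitted here):

interface_metrics,interface=ge-0/0/0,host=BLOG_U01 input_rate_bytes=1024,output_rate_bytes=2048,total_input_drops=0,total_output_drops=0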
INFORMATION GATHERING
Information gathering was conducted through customized Python scripts, developed for each KPI, allowing direct interaction with the router and the collection of specific data on system performance and health. The general and specific extraction methods considered are:
As an illustration, two examples of the various Python scripts developed for this project are presented:
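The original scripts are not reproduced here; as a minimal sketch only, assuming Netmiko for CLI access to a Juniper device (the blog links to Netmiko below) and an assumed output format for 'show chassis routing-engine', a CPU collector could look like this:

import re
from netmiko import ConnectHandler

def collect_re_cpu(host, username, password):
    # Connection parameters; "juniper_junos" is assumed here since the
    # environment described involves Juniper routing engines (RE0/RE1)
    device = {
        "device_type": "juniper_junos",
        "host": host,
        "username": username,
        "password": password,
    }
    conn = ConnectHandler(**device)
    output = conn.send_command("show chassis routing-engine")
    conn.disconnect()
    # Lines such as "Idle   95 percent" are assumed in the CLI output;
    # CPU usage is derived as 100 - idle for each routing engine found
    idle = [int(m.group(1)) for m in re.finditer(r"Idle\s+(\d+)\s+percent", output)]
    return {f"RE{i}": 100 - pct for i, pct in enumerate(idle)}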
WRITING TO THE DATABASE
Each Python script was designed with a specific function to transmit the collected data to InfluxDB. This integration structures the data in JSON format, reflecting our desired structure of measurements, tags, and fields. This method ensures the data are properly prepared for storage, organizing them to facilitate subsequent querying and analysis.
Within each script, we have created the send_to_influxdb function, which establishes a connection to InfluxDB using specific details such as credentials and the port. For example, in CPU monitoring and interface metrics, send_to_influxdb prepares JSON objects that include relevant tags like host, routing_engine, or interface, and fields with detailed metrics. At the end of each script, send_to_influxdb is executed, ensuring the continuous update of the database with the most recent data without the need for manual intervention.
A crucial part of structuring these data is the use of JSON to clearly define the tags and fields, which is essential for precise KPI monitoring. Below is an example of how these data are structured in the send_to_influxdb function:
from influxdb import InfluxDBClient

def send_to_influxdb(host, interface_list):
    # Connect to the local InfluxDB instance: host, port, username, password, database
    client = InfluxDBClient('localhost', 8086, None, None, 'ScriptDB')
    for interface, data in interface_list.items():
        # One point per interface: tags identify the series, fields carry the metrics
        json_body = [
            {
                "measurement": "interface_metrics",
                "tags": {
                    "interface": interface,
                    "host": host,
                    "description": data.get("description", ""),
                },
                "fields": {
                    "input_rate_bytes": data.get("input_rate_bytes", 0),
                    "output_rate_bytes": data.get("output_rate_bytes", 0),
                    "total_input_drops": data.get("total_input_drops", 0),
                    "total_output_drops": data.get("total_output_drops", 0),
                }
            }
        ]
        client.write_points(json_body)
    client.close()
This structured approach enables efficient indexing and quick data retrieval, facilitating complex analysis and detailed visualizations. Structuring the "interface_metrics" measurement to capture input_rate_bytes, output_rate_bytes, total_input_drops, and total_output_drops is crucial: it optimizes data handling within the database, making filtering, querying, and visualization of the data efficient.
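For instance, a query of the following shape retrieves the rate fields for a single subinterface by filtering on its tags (the interface name is illustrative):

SELECT "input_rate_bytes", "output_rate_bytes" FROM "interface_metrics" WHERE "host" = 'BLOG_U01' AND "interface" = 'ge-0/0/0'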
VISUALIZATION WITH GRAFANA
Grafana plays a key role in the dynamic visualization of data stored in InfluxDB, transforming complex metrics into clear and accessible graphs.
We created dashboards in Grafana that clearly display the KPIs, visualizing critical metrics such as CPU usage, memory, and latency. This gives us a comprehensive view of the state of our infrastructure. We also formulated specific queries in Grafana to extract and display the data accurately.
SELECT "usage_percentage" FROM "memory_re" WHERE "re" = 'RE0'
SELECT "usage_percentage" FROM "memory_re" WHERE "re" = 'RE1'
Measurement: memory_re
SELECT "input_drop", "output_drop" FROM "link_status_metrics" WHERE "host" = '$Host' AND $timeFilter GROUP BY "interface" fill(null)
Measurement: link_status_metrics
Here, $Host is a dashboard template variable and $timeFilter is Grafana's built-in time-range macro; both are resolved when the panel is rendered.
In addition to visualizations, we have set up alerts in Grafana to signal situations requiring immediate attention, such as excessive CPU usage or abnormally high fan speeds (RPM).
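As an illustration of how such an alert can be driven (the measurement name cpu_re and the 90% threshold are assumptions; the original alert definitions are not reproduced here), the CPU panel behind the alert could use a query such as:

SELECT mean("usage_percentage") FROM "cpu_re" WHERE $timeFilter GROUP BY time(1m) fill(null)

The Grafana alert condition would then fire when the averaged value stays above the chosen threshold across several consecutive evaluations.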
AUTOMATION WITH JENKINS
Through Jenkins, we have configured an automated task named "ROUTER-KPI," responsible for executing the central script kpi_aggregator.py. This script, in turn, invokes a series of specific scripts designed to collect various KPIs from the router.
The "ROUTER-KPI" task has been set to run periodically, adapting to our data collection needs. The configuration details include:
cd /path/to/scripts
sudo python kpi_aggregator.py --host=BLOG_U01 --username=admin --password=admin
These commands navigate to the directory containing the script and execute it with the appropriate arguments.
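The internals of kpi_aggregator.py are not shown above; as a hedged sketch, assuming each per-KPI collector is a standalone script that accepts the same --host/--username/--password arguments (the script names below are placeholders), the dispatcher could look like this:

import subprocess
import sys

# Placeholder names for the per-KPI collector scripts
KPI_SCRIPTS = [
    "cpu_metrics.py",
    "memory_metrics.py",
    "interface_metrics.py",
]

def main():
    args = sys.argv[1:]  # forward --host, --username and --password to each collector
    for script in KPI_SCRIPTS:
        result = subprocess.run([sys.executable, script] + args)
        if result.returncode != 0:
            print(f"{script} exited with code {result.returncode}", file=sys.stderr)

if __name__ == "__main__":
    main()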
RECOMMENDATIONS FOR OPTIMIZING KPI MONITORING
To ensure both efficient and effective KPI monitoring, we offer the following practical tips:
HELPFUL LINKS:
Guide to NETCONF XML Management Protocol
Introduction to Netmiko Library for Network Automation
InfluxDB Schema Design Best Practices
Steps to Create a Custom Grafana Dashboard
Best Practices for Creating and Managing Grafana Dashboards
Beginner's Guide to Jenkins Pipelines
- AUTHOR: Alex Moran
Networking professional with expertise in ADC, CGN, Wireless, and DevOps Automation. Skilled in routing, switching, Data Center, and Linux technologies. Experienced in working with various vendors, including Juniper, A10, and Cisco, for Service Provider and Enterprise environments.