GrafLI - An out-of-the-box Azure monitoring and visualization platform
In the ever-evolving landscape of cloud services, the importance of monitoring and visualizing data has become increasingly crucial for engineering organizations to enhance application performance, guarantee system reliability, ensure operational efficiency and enable strategic decision-making. At LinkedIn, the rapid expansion of the Azure footprint, coupled with the continuous addition of new services, necessitates a robust and scalable solution that can keep pace with the dynamic nature of cloud environments.
Recognizing these challenges, the Productivity Engineering team at LinkedIn crafted GrafLI, a cloud-native data visualization tool designed to transform the visualization of Azure and on-premises services. In this post, we delve into the intricacies of GrafLI and how it enhances the developer experience and increases engineering velocity. We also discuss how GrafLI leverages Azure’s native monitoring stack and the powerful synergy of Azure Resource Graph, Kusto Query Language (KQL), and Azure Monitor to provide real-time visualization capabilities such as consolidated monitoring views and automated dynamic dashboard generation.
Why GrafLI?
As the portfolio of applications and services hosted on Azure expanded within our organization, it was imperative that our internal monitoring capabilities evolved correspondingly. This encompassed all three fundamental pillars of monitoring: data collection, alerts, and visualization. Each of these components had to be scalable in order to meet our growing Azure footprint.
Our earlier approaches involved manually curating metrics and creating visualizations using tools such as Log Analytics, Azure dashboards, and Grafana. These methods, while functional, posed significant challenges as we began implementing them at scale.
As a result, we explored and implemented more automated and dynamic solutions for visualization. This led to us developing a custom layer on top of existing Azure monitoring components and leveraging native APIs to automate dashboard creation and updates. By enhancing our visualization capabilities, we created more efficient, real-time monitoring that met LinkedIn’s needs. It also improved our ability to respond to the evolving needs of our Azure-based applications and services.
Key features of GrafLI
Out-of-the-box visualizations
GrafLI’s primary feature is dynamic dashboard generation, which enables effortless scaling of visualizations across services and environments. This eliminates the need for manual dashboard creation and ensures instant visualizations for any provisioned application, workload, or service in Azure. Leveraging the power of CI/CD and Infrastructure as Code (IaC) provisioning pipelines, GrafLI significantly reduces time-to-value (TTV) by efficiently provisioning Azure monitoring infrastructure and establishing monitoring data ingestion with pre-defined baselines.
Discoverability and ease of search
GrafLI significantly enhances the developer experience by making it easier to search and access visualizations. By utilizing optimized Azure Resource Graph queries, which guarantee scalability and the rapid compilation of all Azure-provisioned applications, GrafLI allows users to access a complete and up-to-date list of their deployed Azure resources in a matter of seconds. Consequently, developers can quickly explore the relevant metrics for their Azure resources, leading to a more efficient and productive workflow.
Consolidated monitoring dashboards
GrafLI seamlessly compiles metrics from diverse sources to present a unified dashboard for comprehensive observability across entire application stacks. This user-friendly dashboard combines different statuses, metrics, trends, and statistics from application facets and dimensions. These encompass elements like user interactions, server response times, dependencies, exceptions, custom events, and other relevant aspects.
GrafLI also provides more focused insights through drill-downs into Application Performance Metrics (APM), Infrastructure Metrics (IaaS), and Platform Metrics (PaaS) for a more nuanced analysis and understanding of application behavior and performance.
How it works
Over the next few sections, we will cover the different segments of GrafLI and how they come together to provide a single unified monitoring, metrics collection, and visualization platform.
Architecture
Let’s start by taking a high-level look into the intricacies of GrafLI's architecture and service ecosystem from the users’ perspective and their interactions with Azure.
A user’s journey begins when they commit their infrastructure deployment code to our IaC repositories (Azure DevOps or GitHub). This action triggers the Azure CI/CD pipelines, which kickstart the infrastructure provisioning process: foundational deployments such as networking and monitoring, along with the deployment of the resources requested in the Infrastructure as Code (IaC) definitions. These resources can range from VMs and App Services to any Platform as a Service (PaaS) component (such as Azure Functions, Cosmos DB, and more), all provisioned with strict adherence to security principles, naming conventions, and resource tagging with metadata like owners and departments.
Once the resources are provisioned, the user or application owner accesses the GrafLI web application. Each request to the web application triggers several backend API endpoints via the Azure WebApp service. The backend analyzes resource deployment patterns, aggregates relevant metrics from multiple Azure sources, and delivers them to the web frontend in the form of time-series data.
Finally, the frontend transforms this time series data into graphs and visualizations within a single dashboard, serving it to the client's browser for a comprehensive monitoring experience.
Infrastructure provisioning, log ingestion and monitoring strategy
Infrastructure as Code (IaC)
With cloud providers like Azure, infrastructure provisioning pipelines play a crucial role in ensuring consistency across deployed resources. This entails following standardized patterns, including naming conventions, Azure resource tags, location, SKU, and resource group specifications.
Let’s explore how adhering to these established patterns and conventions effectively streamlines the discoverability of related resources at scale within Azure.
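As a quick illustration (the tag names and values here are hypothetical, not our actual convention), once tagging and naming are consistent, a single Azure Resource Graph query can surface every resource that belongs to a given service:

// Hypothetical tags; consistent tagging makes this a one-line filter
resources
| where tags['service'] =~ 'payments-api' and tags['environment'] =~ 'prod'
| project name, type, resourceGroup, location, tags
| order by type asc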
Establishing a Monitoring Baseline
Azure infrastructure provisioners seamlessly allow us to apply different baseline configurations during deployments via ARM templates. From a monitoring and visualization perspective, this translates to:
- Installing Monitoring Agents: Ensuring the essential agents (Azure Monitor agent, Log Analytics agent, etc.) are in place for effective monitoring.
- Enabling Performance Counters: Activating performance counters to gauge system performance efficiently on Azure IaaS (Virtual Machines, Virtual Machine Scale Sets, etc.).
- Installing Monitoring Solutions: Deploying specialized monitoring solutions tailored to specific needs, like Change Tracking, VM Insights, Service Map, etc.
- Enabling Diagnostic Settings: Configuring diagnostic settings for enhanced insights into the behavior of Azure PaaS components and to route their logs to a centralized Azure Monitor Logs workspace (a rough sketch of this step in code follows this list).
- Configuring Alerts and Thresholds: Establishing and customizing alert configurations, defining baseline thresholds for optimal performance in the Azure environment.
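To make the diagnostic-settings step above concrete, here is a minimal sketch using the Azure SDK for Python; the subscription ID, resource IDs, setting name, and log category are illustrative placeholders rather than our actual configuration:

from azure.identity import DefaultAzureCredential
from azure.mgmt.monitor import MonitorManagementClient
from azure.mgmt.monitor.models import (
    DiagnosticSettingsResource, LogSettings, MetricSettings
)

# Illustrative IDs; in practice these come from the IaC provisioning pipeline.
subscription_id = "<subscription-id>"
resource_uri = (
    "/subscriptions/<subscription-id>/resourceGroups/rg-demo"
    "/providers/Microsoft.Web/sites/app-demo"
)
workspace_id = (
    "/subscriptions/<subscription-id>/resourceGroups/rg-monitor"
    "/providers/Microsoft.OperationalInsights/workspaces/law-central"
)

client = MonitorManagementClient(DefaultAzureCredential(), subscription_id)

# Route the resource's platform logs and metrics to the central Log Analytics workspace.
client.diagnostic_settings.create_or_update(
    resource_uri=resource_uri,
    name="send-to-central-law",
    parameters=DiagnosticSettingsResource(
        workspace_id=workspace_id,
        logs=[LogSettings(category="AppServiceHTTPLogs", enabled=True)],
        metrics=[MetricSettings(category="AllMetrics", enabled=True)],
    ),
)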
By combining standard IaC principles with robust monitoring baselines, our approach ensures consistent infrastructure and enables us to proactively set up a comprehensive monitoring system during the provisioning process that provides visibility of all the services deployed in Azure. Once the relevant Azure components are deployed, the next steps involve developing a system that queries the logs and metrics from Azure, at scale.
Querying logs & metrics at scale
When users search for a specific application or service, the Azure Resource Graph yields a list of potential matches. When the user makes a selection, GrafLI precisely identifies the Azure Monitor resource associated with the chosen workload, and queries the logs and metrics that are typically routed to Azure Application Insights or the Log Analytics workspaces.
Azure Resource Graph
With GrafLI, the ability to query at scale has been a game-changer. Azure Resource Graph plays a vital role in facilitating this by simplifying the process of gathering a comprehensive list of provisioned resources, including monitoring resources, tenants, tags, and relevant resource ownership information. This capability is important to GrafLI because it helps precisely target monitoring resources such as Azure Log Analytics and Azure Application Insights to retrieve the relevant logs and metrics associated with deployed infrastructure and/or applications.
In addition to leveraging the power of Azure Resource Graph for resource discovery, GrafLI allows users to effortlessly discover alerts and various resource types through Resource Graph queries. Applications and deployed infrastructure are automatically onboarded onto the Azure monitoring stack as soon as they become discoverable within Azure Resource Graph, enhancing the efficiency of the entire monitoring process.
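For example, a lookup of this shape (the subscription ID, tag name, and service name are placeholders) can locate the Application Insights components backing a selected workload via the Resource Graph SDK for Python; assuming the default objectArray result format, each row comes back as a dictionary of the projected columns:

from azure.identity import DefaultAzureCredential
from azure.mgmt.resourcegraph import ResourceGraphClient
from azure.mgmt.resourcegraph.models import QueryRequest

client = ResourceGraphClient(DefaultAzureCredential())

# Find the Application Insights components tied to a service, using its tags.
request = QueryRequest(
    subscriptions=["<subscription-id>"],
    query="""
        resources
        | where type =~ 'microsoft.insights/components'
        | where tags['service'] =~ 'payments-api'
        | project name, id, resourceGroup
    """,
)
response = client.resources(request)
for row in response.data:
    print(row["name"], row["id"])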
Kusto Query Language (KQL)
To gather monitoring information and other metrics, GrafLI relies heavily on the Kusto Query Language (KQL). We execute a series of predefined query templates crafted in KQL, targeting both the Log Analytics workspace and Application Insights. An example of these templates is shown below:
{
    # Predefined KQL templates; placeholders: {0} = interval in minutes,
    # {1} = start time, {2} = end time, {3} = computer name
    "cpu": """let interval = {0}m; let stime = datetime({1}); let etime = datetime({2});
        InsightsMetrics | where Name contains "UtilizationPercentage" and Computer contains '{3}'
        | make-series Cpu = toint(avg(Val)) on TimeGenerated from stime to etime step interval
        by Computer""",
    "mem": """let interval = {0}m; let stime = datetime({1}); let etime = datetime({2});
        InsightsMetrics | where Namespace contains "Memory" and Name contains "AvailableMB" and
        Computer contains '{3}' | make-series Mem = (100 - toint(avg(Val * 100 /
        toint(parse_json(Tags["vm.azm.ms/memorySizeMB"]))))) on TimeGenerated
        from stime to etime step interval by Computer""",
    "disk": """let interval = {0}m; let stime = datetime({1}); let etime = datetime({2});
        InsightsMetrics | where Namespace contains "LogicalDisk" and Name contains
        "FreeSpacePercentage" and Computer contains '{3}' | make-series Disk = (100 -
        toint(avg(Val))) on TimeGenerated from stime to etime step interval by Computer"""
}
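As a minimal sketch of how one of these templates might be filled in and executed with the Azure Monitor Query client library, assuming the dictionary above is bound to a name such as QUERY_TEMPLATES and using a placeholder workspace ID and computer name:

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

client = LogsQueryClient(DefaultAzureCredential())

# Fill in the template: 5-minute bins over one day for a single VM.
query = QUERY_TEMPLATES["cpu"].format(
    5, "2024-01-01T00:00:00Z", "2024-01-02T00:00:00Z", "vm-demo-001"
)

# The time range is already embedded in the query, so timespan is left as None.
response = client.query_workspace(
    "<log-analytics-workspace-id>", query, timespan=None
)
for table in response.tables:
    for row in table.rows:
        print(row)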
Batch queries
The Azure Monitor Log Analytics API and the Query client SDK both provide a compelling feature: batching queries together. This not only elevates the efficiency of our querying processes but also brings noteworthy advantages when querying multiple logs and metrics for a consolidated single pane of glass.
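A sketch of what this batching looks like with the same client library, reusing the QUERY_TEMPLATES dictionary from earlier and a placeholder workspace ID; the three panels of a dashboard are fetched in a single round trip:

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient, LogsBatchQuery, LogsQueryStatus

client = LogsQueryClient(DefaultAzureCredential())
workspace_id = "<log-analytics-workspace-id>"

# Format the CPU, memory, and disk templates for the same VM and time window.
queries = [
    QUERY_TEMPLATES[key].format(5, "2024-01-01T00:00:00Z",
                                "2024-01-02T00:00:00Z", "vm-demo-001")
    for key in ("cpu", "mem", "disk")
]

# One batched call instead of three separate requests; the time range lives in each query.
requests = [
    LogsBatchQuery(workspace_id=workspace_id, query=q, timespan=None)
    for q in queries
]
results = client.query_batch(requests)

for result in results:
    if result.status == LogsQueryStatus.SUCCESS:
        for table in result.tables:
            print(table.rows)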
Cross resource queries
When dealing with a diverse range of resources and monitoring elements in a complex cloud setup, it's important to fetch monitoring data seamlessly at scale. Cross-resource queries in Log Analytics allow us to easily reference the relevant workspace, app, or resource using the workspace(), app(), or resource() expressions.
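For instance, a query of the following shape (the resource names are hypothetical) correlates request telemetry from an Application Insights app with VM CPU samples from a Log Analytics workspace in a single statement:

// Hypothetical resource names; joins APM requests with IaaS CPU samples
app('appi-payments-api').requests
| summarize Requests = sum(itemCount) by bin(timestamp, 5m)
| join kind=inner (
    workspace('law-central').InsightsMetrics
    | where Name == "UtilizationPercentage"
    | summarize Cpu = avg(Val) by bin(TimeGenerated, 5m)
) on $left.timestamp == $right.TimeGenerated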
Next, we’ll delve into how the metrics collected using these mechanisms are aggregated and visualized.
Metric aggregation and time-series data delivery
In GrafLI, we harness the power of the Azure Monitor Query client library to seamlessly execute read-only queries across Azure Monitor's dual data platform:
- Logs: Our system methodically gathers and organizes logs and performance data from diverse sources, including platform logs from Azure services, performance data from virtual machine agents, and usage statistics from applications. This data is normalized and available within a unified Azure Log Analytics workspace. Leveraging the Kusto Query Language, various data types (like datetime, string, integer, decimals, etc.) can be comprehensively analyzed together, enhancing the depth and efficiency of our monitoring capabilities.
- Metrics: GrafLI leverages Azure metric definitions to discover metrics for Azure Platform as a Service (PaaS) resources. These definitions include a description, unit, supported aggregations, time granularities, and more.
This process involves utilizing the query client library to identify and retrieve metric definitions, allowing GrafLI to offer comprehensive insights into the performance and health of Azure PaaS resources.
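A sketch of this flow with the Azure Monitor Query client library, using a placeholder resource ID and the familiar Percentage CPU metric as an example:

from datetime import timedelta
from azure.identity import DefaultAzureCredential
from azure.monitor.query import MetricsQueryClient, MetricAggregationType

client = MetricsQueryClient(DefaultAzureCredential())
resource_uri = (
    "/subscriptions/<subscription-id>/resourceGroups/rg-demo"
    "/providers/Microsoft.Compute/virtualMachines/vm-demo-001"
)

# Discover which metrics (and aggregations) the resource exposes.
for definition in client.list_metric_definitions(resource_uri):
    print(definition.name, definition.unit)

# Retrieve one metric as a time series of 5-minute averages over the last hour.
result = client.query_resource(
    resource_uri,
    metric_names=["Percentage CPU"],
    timespan=timedelta(hours=1),
    granularity=timedelta(minutes=5),
    aggregations=[MetricAggregationType.AVERAGE],
)
for metric in result.metrics:
    for series in metric.timeseries:
        for point in series.data:
            print(point.timestamp, point.average)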
Once we pinpoint the target monitoring resource and execute the query, the resultant monitoring data is presented as a time-series dataset. The user interface facilitates the application of aggregations to this data, based on time granularity and range, enhancing the depth and precision of the insights derived from the monitoring process.
Charting and visualizations
Utilizing the time-series datasets and the metric definitions retrieved from Azure Monitor, GrafLI employs advanced charting and visualization libraries like ChartJS and D3.js to dynamically generate compelling visualizations in real-time. This seamless integration of data visualization provides users with an intuitive and insightful monitoring experience as soon as the application or the Azure resource is deployed.
Visualizations in GrafLI are classified into four primary categories:
- Service Overview
- Infrastructure Metrics (IaaS)
- Platform Metrics (PaaS)
- Application Performance Metrics (APM)
Service overview
This view provides users with a high-level summary, collating critical information about application availability, performance, infrastructure metrics, and platform-specific metrics. The visualizations here serve as a gateway to delve deeper into the nuances of the monitored components.
Infrastructure (IaaS) metrics
In this view, GrafLI focuses on Infrastructure as a Service (IaaS) metrics, offering detailed insights into the performance and health of the underlying infrastructure components, mostly Virtual Machines and Virtual Machine Scale Sets. Users can easily track and visualize key resource indicators like CPU, memory, and disk utilization, enabling proactive measures and ensuring the seamless operation of the infrastructure deployed on Azure or on-premises.
Application Performance Metrics (APM)
With the APM view, GrafLI offers granular insights into Application Performance Metrics (APM), giving users a detailed understanding of their applications' performance. This section delves into response times, error rates, and other crucial APM metrics, allowing engineering teams to pinpoint issues promptly and optimize application performance for an enhanced end-user experience.
Platform Metrics (PaaS)
The Platform Metrics section is dedicated to providing visibility into the performance of Platform as a Service (PaaS) components. GrafLI captures and visualizes critical metrics related to Azure's PaaS offerings, allowing users to monitor the health of these services, ultimately ensuring the seamless functioning of cloud-based applications.
By leveraging these meticulously curated views within GrafLI, engineering teams gain a comprehensive understanding of their Azure environment. The dynamic visualizations, powered by ChartJS and D3.js, transform raw data into actionable insights, empowering teams and application owners to make informed decisions using these consolidated monitoring dashboards.
GrafLI’s extensible framework
While building GrafLI, we prioritized extensibility and adaptability to accommodate changes seamlessly. This design ethos facilitates the onboarding of applications beyond the Azure ecosystem. Additionally, it empowers users to collect metrics through custom KQL queries, expanding GrafLI’s capabilities beyond the limitations of standard Azure queries.
Hybrid and on-premises workloads
GrafLI relies on Azure resource deployment patterns accessed through Resource Graph. For applications and resources deployed in our on-premises data centers on virtualization platforms, however, onboarding involves provisioning Azure Log Analytics and Application Insights, maintaining a specific naming convention, and configuring monitoring agents. This allows GrafLI to integrate seamlessly with on-premises workloads.
In LinkedIn’s software ecosystem, which is organized into releasable logical units known as 'multiproducts', onboarding to GrafLI is facilitated by provisioning essential Azure monitoring resources and implementing code-based instrumentation using the Azure SDK, which allows applications to send telemetry to Azure Application Insights. Once metrics populate within the Azure Monitor resource, GrafLI can swiftly generate dashboards and visualizations, providing instant insights.
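As one possible shape of that instrumentation (the connection string, span name, and do_work function are placeholders; the OpenTelemetry-based distro shown here is just one of the Azure SDK options for sending telemetry to Application Insights):

from azure.monitor.opentelemetry import configure_azure_monitor
from opentelemetry import trace

# Point the application at its Application Insights resource.
configure_azure_monitor(
    connection_string="InstrumentationKey=<key>;IngestionEndpoint=<endpoint>"
)

tracer = trace.get_tracer(__name__)

def do_work():
    # Hypothetical application logic.
    ...

# Spans recorded here surface as request/dependency telemetry in Application
# Insights, which GrafLI then queries and visualizes.
with tracer.start_as_current_span("multiproduct-sample-operation"):
    do_work()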
Customizing metrics: a tailored query approach
GrafLI goes beyond the constraints of predefined Azure Resource metric definitions, providing users with the flexibility to extend support for custom metrics to support the monitoring requirements unique to their service, application, or team.
Users onboard a templatized KQL query that accommodates time range and granularity specifications, which facilitates the rapid onboarding of virtually any metric. Whether it's custom text-based logs or sophisticated queries on existing logs, this versatile capability allows users to integrate diverse metrics within minutes and scale to practically any application or service onboarded to GrafLI.
The following are a few samples of templated custom queries designed to retrieve time series data:
{
    # Custom KQL templates; placeholders: {0} = interval in minutes, {1} = start time,
    # {2} = end time, {3} = Application Insights app name
    "requests/http_status_code": """let interval = {0}m; let stime = datetime({1});
        let etime = datetime({2}); let Custom = app('{3}').requests; Custom | where client_Type !=
        "Browser" | make-series ['requests/http_status_code'] = sum(itemCount) on timestamp from
        stime to etime step interval by resultCode""",
    "pageViews/browser": """let interval = {0}m; let stime = datetime({1}); let etime =
        datetime({2}); let Custom = app('{3}').pageViews; Custom | extend browser =
        replace(@'(.\d+)', '', client_Browser) | make-series
        ['pageViews/browser'] = count() on timestamp from stime to etime step interval
        by browser""",
    "pageViews/device_model": """let interval = {0}m; let stime = datetime({1}); let etime
        = datetime({2}); let Custom = app('{3}').pageViews; Custom | make-series
        ['pageViews/device_model'] = count() on timestamp from stime to etime step interval
        by client_Model""",
}
Metrics & Impact
Since GrafLI went live, we have found that it significantly enhances developer productivity by providing a multifaceted view of both Azure and hybrid environments through its advanced visualization capabilities. GrafLI transforms complex time-series data into clear, real-time visualizations across multiple dimensions: service overview, IaaS, APM, and PaaS metrics. This allows developers and on-call engineers to quickly analyze application performance, infrastructure health, and platform efficiency.
GrafLI not only boosts developer productivity but also facilitates improved resource utilization and swifter responses to monitoring requirements. This results in tangible cost savings and heightened operational efficiency.
Coverage
GrafLI encompasses 1,500+ dashboards spanning 800+ distinct services deployed in Azure. It provides access to approximately 100K Azure resources and their corresponding metrics, including both Infrastructure as a Service (IaaS) and Platform as a Service (PaaS) components, along with their metric definitions.
Efficiency in Insights
Users experience a notable reduction in the time needed to access insights. Time to insights is slashed from minutes to seconds, fostering quicker decision-making and heightened productivity.
Conclusion and Future Work
The task of effectively monitoring and visualizing metrics within the dynamically changing environment of cloud, hybrid, and containerized workloads poses a significant challenge to modern organizations. GrafLI is a turnkey solution that addresses many of these complexities by building on top of the Azure monitoring stack and unifying monitoring, metrics collection, and visualization in a holistic manner.
Since it was operationalized, GrafLI has offered automated monitoring dashboards and enhanced search and discoverability features, with the ability to integrate metrics from a variety of sources, to hundreds of applications deployed on Azure. We aim to expand on this implementation by adding capabilities to monitor and visualize hybrid and containerized workloads, making GrafLI the de facto visualization platform for LinkedIn’s Productivity Engineering team.
Our focus remains on providing a robust, efficient, and comprehensive visualization solution that keeps pace with the ever-changing technological landscape.
Acknowledgements
Thanks to the team, Raghav Ayyamani and Madhav Bhandari, for making this project possible and contributing from ideation through design and implementation. Our gratitude also extends to our leaders, Balaji Ramaswamy, Balaji Vappala, and Harihara Sudhan, for providing the opportunity, continuous encouragement, and invaluable guidance throughout this endeavor. A special note of thanks to Sudha Prabhunandan for her valuable mentorship, continuous feedback from inception to implementation, and assistance in driving product adoption.