Observability in the Age of Gen AI
3one4 Capital
Early-stage venture capital firm based in Bangalore, India #RaiseTheBar
What is Observability?
A simple way to describe observability is as a measure of how well you can understand a system and its internal state from the outputs it generates.
Extended to IT, software, and cloud computing, observability is how engineers understand the current state of a system from the data it generates. To get that full understanding, you have to proactively collect the right data, visualise it, and apply intelligence to it.
Observability enables a proactive approach to troubleshooting and optimising software systems. It offers a real-time, interconnected view of all the operational data in a software system, enabling on-the-fly inquiries about applications and infrastructure.
In the modern era of complex systems built by distributed teams, observability is essential. It goes beyond traditional monitoring by allowing engineers to understand not only what broke but also why it broke.
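To make the "collect the right data" step concrete, here is a minimal sketch in Python of emitting structured, machine-queryable telemetry instead of free-form log lines. The service and field names (checkout, trace_id, latency_ms) are illustrative choices for this example, not a standard; production systems would typically emit this through an instrumentation framework.

```python
import json
import logging
import time
import uuid

# Minimal sketch: emit structured, machine-queryable events rather than
# free-form log lines. All field names here are illustrative.
logging.basicConfig(level=logging.INFO, format="%(message)s")

def handle_request(path: str) -> None:
    trace_id = uuid.uuid4().hex  # correlates this request across log lines
    start = time.perf_counter()
    # ... the actual request handling would happen here ...
    latency_ms = (time.perf_counter() - start) * 1000
    logging.info(json.dumps({
        "event": "request_handled",
        "service": "checkout",
        "path": path,
        "trace_id": trace_id,
        "latency_ms": round(latency_ms, 2),
    }))

handle_request("/cart")
```

Because every event carries a correlation ID and a numeric latency, the "visualise and apply intelligence" steps become queries over structured data rather than regex archaeology over raw text.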
The Market
The observability market is huge. To give you an idea: "For every $1 you spend on public cloud, you're likely spending $0.25–$0.35 on observability." The observability and monitoring market was valued at $41B at the end of 2022 and is poised to grow to $62B by the end of 2026.
Organisations are willing to pay for the right tools: Datadog posted annual revenues of roughly $2B in FY23, and Coinbase was billed $65M in Datadog expenses alone in 2021, highlighting how much larger these costs can be than one might imagine.
In essence, this is a mature space, with players like Splunk, Datadog, Grafana, New Relic, and others taking up a significant share of software enterprises' wallets. These companies are innovating daily, and the table below helps us understand their strengths and focus areas:
How is Generative AI Impacting the Space?
Traditional ML algorithms, such as decision trees, random forests, and clustering methods, were and still are used in AIOps for tasks like anomaly detection, root cause analysis, and predictive analytics; they provide the foundational machinery for automating IT operations. The advent of LLMs has pushed these capabilities further by enabling far more sophisticated natural language processing and understanding, which is particularly useful for analysing unstructured data such as logs and alerts. Foundation models can effectively "apply intelligence" by querying the system's observability data (think MELT: metrics, events, logs, and traces) to reach the "why" faster. Given the readily available data and the clear need for such solutions, this is one of the first areas where generative AI is seeing immediate application. Alongside traditional machine learning approaches, which retain their own strengths, LLMs create value in a number of new ways.
Some of the key differences between traditional ML models and LLMs are as follows:
These capabilities are very useful for the observability problem statement, since the field in essence revolves around interpreting time-series system data to understand the current and likely future states of a system.
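As a concrete illustration of the two layers working together, here is a minimal sketch, assuming scikit-learn is available: an IsolationForest flags anomalous latency samples (the traditional-ML layer), and logs from the flagged window are handed to an LLM to explain the "why". The ask_llm function is a hypothetical stub standing in for whichever model API you actually use.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def ask_llm(prompt: str) -> str:
    # Hypothetical stub: replace with a call to your LLM provider of choice.
    return "(LLM root-cause summary would appear here)"

# Toy per-request latency samples in milliseconds; one obvious outlier.
latencies_ms = np.array([[102.0], [98.0], [110.0], [105.0], [940.0], [101.0]])

# Traditional-ML layer: flag statistical outliers in the metric stream.
labels = IsolationForest(contamination=0.2, random_state=0).fit_predict(latencies_ms)
anomalous = latencies_ms[labels == -1].ravel().tolist()

if anomalous:
    # Example log line from the anomalous window (illustrative only).
    logs_in_window = "GET /checkout 504 upstream_timeout pool=db-primary"
    # LLM layer: turn raw MELT data into a natural-language "why".
    print(ask_llm(
        f"These request latencies were flagged as anomalous: {anomalous} ms.\n"
        f"Logs from the same window:\n{logs_in_window}\n"
        "Suggest the most likely root cause."
    ))
```

Note the division of labour: the cheap statistical model runs continuously over the metric stream, and the LLM is invoked only on the small anomalous slice, which keeps token costs bounded.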
Challenges in Observability
The most surprising part of Datadog's earnings call was this: "Management projects its core observability and monitoring market will grow at a 10.89% compound annual growth rate (CAGR) from $41 billion at the end of 2022 to $62 billion at the end of 2026. Thus far, Datadog has only captured a 4.36% share of this market, so the company has a long runway for growth."
1. Tool Sprawl: Separate storage backends are typically used for each telemetry data type (logs, metrics, and traces), adding cost and complexity to manage; consolidating at the collection layer can counter this at the source (see the sketch after this list).
2. Increasing TCO: Observability tools are notorious for large bills, as highlighted in the graph below. This is primarily due to high tool sprawl and index storage costs.
Additionally, IT budgets are being rationalised:
3. Shortage of Talent: Resolving incidents requires a wide surface area of unique skills, usually developed only after years of trial and error, so solving complex production problems requires senior SREs, who are hard to hire.
Additionally, layoffs worsen the mean time to resolution (MTTR) for incidents, compounding the SRE problem:
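On the tool-sprawl point above, here is a minimal sketch of what consolidation at the collection layer can look like, using the OpenTelemetry Python API as one example of a single instrumentation surface for multiple signal types. Exporter and backend configuration is omitted; without an SDK configured these calls are no-ops, which is fine for illustration.

```python
from opentelemetry import metrics, trace

# One instrumentation API covering multiple telemetry types, countering
# tool sprawl at the collection layer. Service and attribute names are
# illustrative; exporter/backend setup is deliberately omitted.
tracer = trace.get_tracer("checkout-service")
meter = metrics.get_meter("checkout-service")
orders_counter = meter.create_counter("orders_processed")

def process_order(order_id: str) -> None:
    with tracer.start_as_current_span("process_order") as span:  # trace
        span.set_attribute("order.id", order_id)
        orders_counter.add(1, {"region": "ap-south-1"})          # metric

process_order("ord-42")
```

Collapsing collection into one API does not solve storage sprawl by itself, but it makes it far easier to route every signal type to a single backend instead of three.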
The biggest shift of recent times has been the advent of large language models, which can extract precise information from this data faster and more accurately. On the back of this, we have seen multiple players start up in the space.
Areas of Opportunity in Observability
Building Observability Tools with LLMs: Key Considerations
Despite the crowded landscape, there is more to be done given the paradigm shift in the making with LLMs. We think the following factors should serve as guiding markers for anyone starting up here:
In summary, this is a large addressable spend, and the flexibility of new tools can unlock massive value by simplifying the path to outcomes. We’re ready for some big changes in this space and excited to see what startups build.