From Reactive to Predictive: Entering the AI Era of Observability

The rise of AI presents a breaking point for the current model of observability: the confluence of AI-generated code, non-deterministic applications, and agentic workloads will bring an explosion in complexity and cardinality that today’s observability tools are not equipped to handle.

But as much as AI presents this problem, I’d propose that it also brings the solution. Plenty of customers have already signalled that they’re ready for a seismic shift in this market; some are spending as much as 10% of their total cloud bill on the current generation of observability tools. AI offers the opportunity to rethink what observability software is and does, shifting from a reactive paradigm to a predictive one. In other words, it makes possible a generational leap that goes beyond better dashboards: new tools that are both smarter and more cost-efficient, stopping system problems before they start.

This article dives into how AI-native approaches are emerging, where incumbents are vulnerable, and why at Vertex Ventures US we’re excited about investing in the future of autonomous systems engineering.

How the Cloud Era transformed observability

In the beginning, developers relied on logs. You SSH’d into a machine, tailed a log file, and debugged. Your infrastructure team may have muttered about Ganglia and Nagios alerts, but most developers didn't need to care. Then came the move to the cloud. Infrastructure became more complex and applications became distributed; it was no longer enough just to debug, systems needed to be monitored and optimized. Ops teams, and then developers, needed dashboards, query interfaces, and visualizations. Enter players like Datadog, and solutions like custom application metrics and Application Performance Monitoring (APM). By embedding itself into developer workflows and enabling collaboration with Ops teams, Datadog became the preferred player, winning adoption from the bottom up and redefining observability.

Cloud Era observability architecture

The AI Era: a new paradigm for observability

Now, we’re entering a new era with AI, and complexity is set to increase dramatically. It’s hard to understand any system you didn’t develop yourself, and harder still when AI-generated apps don’t have consistent behaviours. With non-deterministic systems, visibility into how they actually behaved becomes even more important. In today’s systems, a human performs an action that directly impacts a service; with AI-driven workflows and agents, a single human action can trigger a chain reaction of AI processes, scaling up and down dynamically.
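
To make that concrete, here is an illustrative sketch (every span name below is hypothetical) of the fan-out a single user action can produce in an agentic system:

```python
# Illustrative sketch (all spans hypothetical) of how one human action fans
# out under an agentic workflow: one request becomes a dynamic tree of model
# calls and tool invocations, each of which needs to be traced.
trace = {
    "span": "POST /support/ticket",  # the single human action
    "children": [
        {"span": "agent.plan (LLM call)", "children": [
            {"span": "tool.search_kb", "children": []},
            {"span": "agent.summarize (LLM call)", "children": []},
            {"span": "tool.refund_api", "children": [
                {"span": "agent.verify (LLM call)", "children": []},
            ]},
        ]},
    ],
}

def count_spans(node: dict) -> int:
    """Count every span in the tree rooted at `node`."""
    return 1 + sum(count_spans(child) for child in node["children"])

# One click, six spans -- and a different run may produce a different tree,
# which is exactly why visibility into what actually happened matters.
print(count_spans(trace))  # 6
```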

This complexity is in turn triggering an explosion in cardinality - the number of unique combinations of metadata values in the telemetry data being collected. It’s no longer practical to rely on dashboards alone; debugging through a UI feels like searching for a needle in an ever-expanding haystack. One engineering manager at a Fortune 500 company told me: “So often we see incidents where we are already emitting the correct metric to have caught it earlier, but we didn’t add it to the right dashboard.” It’s a common frustration. Observability needs a new interface, and it’s becoming clear that the new interface will be driven by AI.
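
As a back-of-the-envelope illustration of what that means in practice (all counts below are invented assumptions, not measurements from any real system), label combinations multiply into time series like this:

```python
# Sketch of how label combinations multiply into time-series cardinality.
# Every number here is an illustrative assumption.
from math import prod

labels = {
    "service": 200,      # microservices
    "pod": 50,           # pods per service after autoscaling
    "endpoint": 30,      # instrumented HTTP routes per service
    "status_code": 8,    # distinct statuses observed
    "region": 4,
}

# Worst case: one time series per unique combination of label values,
# per metric name.
series_per_metric = prod(labels.values())
print(f"{series_per_metric:,} potential series per metric")  # 9,600,000

# Adding a single high-cardinality label (say, per-run IDs from an agentic
# workflow) multiplies everything again.
agent_run_ids = 10_000
print(f"{series_per_metric * agent_run_ids:,} with a run-id label")
```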

AI Era observability architecture

AI-driven root cause analysis: finding and fixing issues faster

Just as AI-powered coding assistants are transforming how software is written, they will soon transform how systems are monitored and debugged. Today’s observability toil (querying logs, correlating traces, sifting through dashboards) will be automated. AI-enabled root cause analysis (RCA) tools are emerging, promising to surface issues faster and eliminate the manual work of optimizing and debugging systems.
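
As a minimal sketch of one such RCA primitive (the data and thresholds are hypothetical, and real systems reason across logs, traces, deploys, and config changes rather than a single pair of signals), consider correlating an error-rate spike with the most recent change event:

```python
# Minimal sketch of one RCA primitive: correlating an error-rate spike
# with recent change events. All data below is hypothetical.
from datetime import datetime, timedelta

error_rate = [  # (timestamp, errors per minute)
    (datetime(2025, 3, 1, 12, 0), 2),
    (datetime(2025, 3, 1, 12, 5), 3),
    (datetime(2025, 3, 1, 12, 10), 180),  # spike
]
deploys = [  # (timestamp, service)
    (datetime(2025, 3, 1, 11, 20), "checkout"),
    (datetime(2025, 3, 1, 12, 7), "payments"),
]

def spike_start(series, factor=10):
    """Return the first timestamp where the value jumps >= factor x baseline."""
    baseline = series[0][1]
    for ts, value in series:
        if value >= factor * max(baseline, 1):
            return ts
    return None

spike = spike_start(error_rate)
# Suspects: deploys that landed shortly before the spike began.
suspects = [(ts, svc) for ts, svc in deploys
            if timedelta(0) <= spike - ts <= timedelta(minutes=30)]
print(suspects)  # [(datetime(2025, 3, 1, 12, 7), 'payments')]
```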

Incumbents like Datadog and New Relic were built for an era where humans handled debugging, so their platforms prioritize fast, interactive querying. AI-driven RCA systems, however, focus on reducing total response time by automatically identifying root causes, which doesn’t require sub-second query speeds but instead demands deep, long-term pattern analysis. Incumbents struggle to transition because their architecture, user experience, and customer expectations are all optimized for manual debugging, making it difficult to support both human- and AI-centric workflows. As a result, AI-native observability startups have an advantage by building automation-first systems without the constraints of legacy platforms.

Emerging early stage companies

We believe any early stage player that wins here will have a unique combination of deep customer empathy, experience running large-scale systems, and enough expertise in cutting-edge AI. The winning companies will start with battle-tested, repeatable use cases (analyzing logs for errors, metrics for uptime and availability issues, traces for performance issues) while keeping a narrow focus defined by language, application type, or something even more vertical, e.g. debugging Spark pipelines. This product focus needs to be combined with hands-on customer support to build on agent performance, fill any gaps, and work towards a more generalisable product.

It will take time to build trust before any real, end-to-end automation is possible. One senior backend developer I spoke to said she would want the AI system to correctly root-cause and suggest fixes for every incident over a two-year timeline before trusting anything to run autonomously. How to build trust? By working collaboratively with early customers, proving accuracy, failing gracefully, and investing heavily in a smooth UX that reduces friction in an already complex workflow. Early customers will be the companies that feel the most pain - in the sweet spot of running complex infrastructure with strained SREs, and likely where uptime = revenue: industries like software infrastructure, payments, analytics, and e-commerce.

Can incumbents adapt from human-centric to AI-centric RCA?

Incumbents like Datadog and Chronosphere are starting to market their own AI products for automating on-call workflows and searching telemetry data using natural language. They have access to huge amounts of proprietary telemetry and incident data, which likely gives them a significant headstart in automating workflows. Without customer case studies, though, it’s unclear whether these products are being used in production. Public companies prioritize short-term results, and the bar for accuracy with this kind of automation is high, so roll-out and experimentation will likely be slow and steady. They are also already charging up to 10% of a company’s cloud bill, so customers may expect any AI products to be close to free. Plus, given the structure and breadth of the general purpose data these platforms collect, avoiding false negatives and positives has historically been a challenge (see Watchdog, for example), which may burn early customer trust. The incumbents are a threat, but there is an opportunity to move fast and build a better, AI-first leader in observability.

Observability data lakes: decoupling storage from analysis

At the same time, a deeper shift is happening in the infrastructure layer. The disaggregation of the modern data stack, with standardisation across table, file, and storage formats, has revolutionized data engineering and is now playing out in observability. Traditionally, APM vendors provided an all-in-one solution: collect, store, analyze, visualize. Now, infrastructure teams, SREs, and IT Ops are rethinking how they handle observability data. With the rise of microservices and Kubernetes, telemetry cardinality has exploded by 10-15x in recent years, and AI-driven, real-time applications are generating even more granular data. In the past, teams solved this by down-sampling - throwing away data to manage cost and performance. That approach is no longer enough.

Cribl was an early pioneer in observability cost reduction, making it easy for companies to filter, enrich, and route telemetry data before storing and analysing it. Honeycomb was the first to pioneer an observability tool that could handle the explosion of cardinality from microservices. These tools will continue to see growth in the AI era, but observability analytics is now a discipline in itself. It’s no longer about just throwing up a dashboard with key metrics; teams need to slice, dice, and correlate observability data like they would any other critical data they use to run their business.
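
A minimal sketch of that filter-and-route pattern (the field names and rules are hypothetical illustrations, not any vendor’s actual API):

```python
# Sketch of the "filter before you store" pattern pipeline tools popularized:
# drop low-value events and sample the rest before they hit expensive storage.
# Field names, levels, and sampling rules here are hypothetical.
import random

def route(event: dict) -> str | None:
    """Decide where a telemetry event goes; None means drop it."""
    if event.get("level") == "DEBUG":
        return None                 # drop debug noise entirely
    if event.get("level") == "INFO" and random.random() > 0.1:
        return None                 # keep only a 10% sample of INFO
    if event.get("level") in ("ERROR", "FATAL"):
        return "hot_store"          # errors go to the fast, costly tier
    return "cold_store"             # everything else to cheap object storage

events = [
    {"level": "DEBUG", "msg": "cache miss"},
    {"level": "ERROR", "msg": "payment timeout"},
]
print([route(e) for e in events])  # [None, 'hot_store']
```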

Companies are realizing they need new ways to store and query high-cardinality telemetry, separate from the visualization and alerting layer. The key challenges here are 1) cross-correlating between metrics, logs, traces, and events, which are typically stored and processed separately, and 2) fast queries across those data sources - for this analysis to be useful to a developer, it has to be real-time. We’re seeing the rise of observability data lakes and query engines built specifically for large-scale time-series analysis. Newer incumbents like Cribl are already working on their own observability data lake to extend their pipeline products. The early-stage companies that win here will drive greater standardisation (capitalizing on open standards like OpenTelemetry, table formats like Iceberg, and object storage like S3) and deliver value in one of two ways: dramatically reduce cost with a tradeoff in performance, or focus on high performance while bringing in a new persona, e.g. product teams with use cases like per-customer analytics.
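
As a hedged sketch of what querying such a lake can look like (the bucket path, schema, and customer_id column are assumptions for illustration; production systems would typically layer Iceberg tables and OpenTelemetry-derived schemas on top), a columnar engine can read telemetry Parquet straight out of object storage:

```python
# Sketch of the observability data lake idea: telemetry landed as Parquet
# in object storage, queried directly with a columnar engine (DuckDB here).
# Paths and columns are illustrative assumptions.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")  # enables s3:// reads

# Per-customer p99 latency, straight off the lake -- the kind of
# product-facing analytics a dashboard-first tool makes hard.
query = """
SELECT
    customer_id,
    approx_quantile(duration_ms, 0.99) AS p99_ms,
    count(*)                           AS spans
FROM read_parquet('s3://telemetry-lake/traces/dt=2025-03-01/*.parquet')
GROUP BY customer_id
ORDER BY p99_ms DESC
LIMIT 20
"""
print(con.execute(query).df())
```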

Emerging early stage companies

Why enterprises are taking observability in-house

There has been a growing trend among the largest tech companies to roll their own observability stack, often combining internal tooling focused on large-scale data processing with a Datadog or Grafana UI. Beyond FAANG, notable examples include Uber (which open-sourced Jaeger), LinkedIn (Kafka-based), Dropbox, Airbnb, Pinterest, and Shopify; more recently, companies like Palo Alto Networks are applying their data infrastructure expertise to observability. The trend is accelerated by improving data tooling for observability and driven by exploding cardinality and the associated cost of an incumbent vendor.

Incumbents are modernizing their data infrastructure, but what are the limits?

Datadog has invested heavily in its internal data infrastructure, releasing Husky as its event store in 2022 and more recently acquiring Quickwit, a sub-second query engine optimized for telemetry data. Husky is already a highly efficient storage engine, so it’s unlikely a new player will compete directly on performance. Datadog is also positioning itself to offer cross-correlation querying of any telemetry data at high cardinality, but it’s an open question how this will collide with its existing business model, which is based on bulk, general purpose ingest and storage. The market rewards them for collecting and storing large amounts of data, forever, with rich query UIs so loved by developers that they earn an 80% gross margin. Extracting meaningful insights from this bulk data using AI is an extremely hard problem, and with CFOs already pushing back on spend, Datadog will be unlikely to capture more high-margin value without laser-sharp messaging on the value those insights will drive.

Another open question is what role data platforms like Databricks and Snowflake will play in this market. They are already working with large enterprise customers, including banks and telcos, to enable custom observability data lakes for monitoring data workloads. It’s unclear yet whether this will become a more direct product strategy focused on IT operations and software engineering teams, tackling the observability vendors head-on.

The next decade of observability

The future of observability won’t be a better dashboard. Near-term, it will be AI assistants that automate the toil, reason across fragmented telemetry sources, and give engineers superhuman abilities to debug and optimize their systems. Long-term, the model will flip from reactive incident response to preventative development, leveraging the universal context and memory capabilities of LLMs, reinforcement learning, and related AI techniques to finally connect observability with code. AI agents may eventually reach the point where fully-autonomous systems engineering is possible, and both AI RCA and observability data lakes will be critical pillars. As autonomous software engineering advances, autonomous systems engineering will be its necessary counterpart, ensuring reliability, adaptability, and control.

Huge thank you to Ian Nowland, Uma Chingunde, Ilan Rabinovitch, Chris Cholette, Andrew Fong, Rakesh Kothari, Matthew Boyle, Anish Agarwal, Richard Crowley, Rustam X. Lalkaka, Sesh Nalla, and Yotam Yemini for sharing invaluable insights and the time spent helping me shape this article.

If you’re working on something in this space, see the world differently, or want to discuss any of the points above, I would love to continue the conversation. You can reach me at megan at vvus dot com, and if you’re leading engineering teams or building at the frontier of software infrastructure and AI, sign up here to join the next infra.community meet-up.

Ed Sim

boldstart ventures, partnering from Inception with bold technical founders building the future - Snyk, Tessl, Protect AI, BigID, Kustomer, Blockdaemon...

2w

love seeing Grepr in there!

Hannes Lenke

Building Monitoring as Code | Next Gen Synthetic Monitoring | CEO & Co-Founder at Checkly

2w

Great article Megan! I believe that this part of your sum-up is key to the success of any next gen solution: 'and related AI techniques to finally connect observability with code.'

JJ Tang

Reliability Agents @ Rootly | Forbes U30

2w

These are great insights, Megan Reynolds. Generative coding assistants like Cursor are accelerating system complexity by introducing everything from undocumented dependencies to bugs. Even the smartest, hardest-working SRE on your team will never be able to keep up. The obvious answer is now machines helping machines. The suite of agents we’re building at Rootly will help you automatically root cause at machine speed. But it's so much more than that. A significant advantage we have is data — data on how actual humans have resolved incidents, the things they tried, and the things they didn’t. This isn’t just the output you’d find in a postmortem but what actually occurred during the incident. That context exists nowhere else. Owning the entire end-to-end loop for reliability (on-call, response/collaboration, and AI SRE) is the type of platform the market demands, not just another AI tool with partial info. IMO this is why companies like Slack are exclusively turning towards Rootly to help pioneer their bets like the agentic marketplace and to define the category.
