Everything You Ever Heard About Observability is Wrong
Thomas LaRock
Author and data professional with 25+ years of expertise in data advocacy, data science, SQL Server, Python ~ Microsoft MVP ~ Relationship builder with Microsoft & VMware ~ M.S. in Data Analytics (2025) and Mathematics
Oh, hello.
Thank you for subscribing to my newsletter. It has been a while since I wrote the last edition of Data on the Rocks. Hopefully writing a newsletter is like riding a bicycle or updating NULLs to empty strings. One of those, I'm sure.
My unplanned sabbatical continues. I finished my summer class and have a few weeks before my next class. I plan to use this time to reboot the local user group, prepare my sessions for Live 360, and have as many conversations as possible with other data professionals.
One such conversation recently had me reflecting on the past twenty years. I went from software engineer to DBA, built an in-house monitoring system, decided it was better to buy a monitoring product instead, and finally went to work for a vendor, landing as a Technical Advocate for data and databases. It has been a hell of a ride, and 10 out of 10, I would definitely ride again.
Another conversation reminded me how the tech industry keeps inventing new terminology for stuff we've already seen or done. For example, I read a DevOps e-book a few months back where the author was constantly congratulating themselves on their “inventions”. One such amazing invention involved pushing changes to an identical copy of the production system. I found myself screaming in my head that "production parallel" was something we did more than twenty years ago!
Of course, the idea of testing in a non-production environment isn’t anything new. Consider Haggis. No way someone thought to themselves “let me just shove everything I can into this sheep’s stomach, boil it, and serve it for dinner tonight.” You know they first fed it to the neighbor nobody liked. Probably right after they shoved a carton of milk in that same neighbor’s face and asked “does this smell bad to you?”
Another conversation topic was observability: the hours spent trying to define observability, how it is not the same as monitoring, and why observability matters. All time wasted, in my opinion, because it's product marketers looking for ways to explain how their version of observability matters more than yours. All the while not understanding that their target market, system administrators, don't want to read marketing literature or answer sales calls. What sysadmins want is to be productive, to be recognized as having value, and to not be made to feel like a failure every time something is "running slow".
Observability, by itself, can't solve that for sysadmins. Many of the legacy tools on the market can't solve it either. The reason is straightforward: these tools were built for a different era. An era where servers were down the hall from our cubicles. Where most of the traffic was on a LAN, maybe a WAN, but not usually from the outside world. The tools gathered metrics made available by the platforms and operating systems themselves, so you were always at the mercy of the data provided. Often, you didn't have the critical piece of data necessary to troubleshoot an issue.
Sure, you could argue the latest APM tools were built to monitor modern cloud-native systems. That's why we have the golden metrics, and why APM tools are supposedly superior to legacy monitoring systems. Look, if a tool or system works for you, that's great. But all these systems, APM or legacy, are still flawed for two reasons.
Fundamental Problems with Observability
First, these tools rely upon gathering metrics from the outputs of the systems they are monitoring. They aren't flexible enough to plug and play into a system they haven't seen before. So, when your company decides to start using Snowflake next month and you ask your vendor if they support Snowflake, you'll get back a polite "sorry, not yet." Then the vendor will promise you Nirvana; you just need to purchase every possible extension. But you are still at the mercy of whatever their API has decided to collect. Sure, you can start adding tags to your code to help, but that means you need to know in advance what you are looking for. Chances are the unknown unknowns are going to bite you at some point.
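To make the tagging point concrete, here is a minimal sketch in plain Python of what custom instrumentation looks like. The emit_metric helper and the tag names are hypothetical, not any particular vendor's SDK; the point is that every tag has to be chosen before the incident that needs it.

```python
import time

def emit_metric(name, value, tags):
    # Hypothetical stand-in for a vendor SDK call; here we just print the payload.
    print({"metric": name, "value": value, "tags": tags})

def run_query(sql, warehouse):
    start = time.perf_counter()
    # ... execute the query against your data platform here ...
    elapsed = time.perf_counter() - start

    # Every tag below is something we decided to capture in advance.
    # If tomorrow's incident hinges on a dimension we never tagged,
    # that data simply does not exist.
    emit_metric(
        "query.duration_seconds",
        elapsed,
        tags={"warehouse": warehouse, "statement_type": sql.split()[0].upper()},
    )

run_query("SELECT 1", warehouse="reporting")
```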
Anyone who has built their own monitoring system will tell you it is not fun to be in a meeting where you are asked "why didn't your system catch the issue?" That's why I ended up buying a product, because it was easier to blame a vendor than myself. The reality is that no one, not you and not these legacy vendors, is going to capture everything you need, at the precise times you need it, no matter how much tagging or customization you add.
The second reason is that these tools, with their cool custom features for "maximum observability" (whatever that means), are built reactive-first. What I mean by reactive-first is that the tools require a bunch of data to be ingested and analyzed after an issue has already happened. You don’t get the promise of predictive analytics or anomaly detection without first collecting the data upon which to build a prediction or forecast.
In other words, to be proactive you need to be reactive first. How easy it would be for SETI to find intelligent life if only they could (1) collect every piece of data throughout the observable universe first and (2) send a transmission directly towards the precise region of sky where we know life already exists. If that sounds a bit crazy, well, so does 98% of the product marketing I have been reading for the past ten years.
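To put the reactive-first argument in code, here is a minimal sketch, assuming nothing beyond the Python standard library, of a rolling-baseline anomaly check. Notice the None branch: until enough history has been collected (reactively), the "proactive" detector has nothing to say.

```python
from statistics import mean, stdev

def is_anomalous(history, latest, min_samples=30, threshold=3.0):
    # Without a baseline there is no "normal" to compare against.
    if len(history) < min_samples:
        return None  # insufficient data: still in collection mode
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) > threshold * sigma

history = []
for reading in (12, 11, 13, 12, 14, 250):  # pretend metric stream
    print(reading, is_anomalous(history, reading, min_samples=5))
    history.append(reading)
```

The forecast only becomes possible after the reactive collection has already happened, which is exactly the point.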
Summary
The next generation of monitoring and observability tools will be fundamentally different from the legacy tools we know and love. These new tools will ingest streams of data, perform analytics to look for correlations, and use natural language processing to output actionable insights to the end user. So, when you ask this new vendor "hey, do you support Kubernetes?" you won't need to wait for them to provide support. You'll be able to point the tool at whatever outputs already exist and get back the correlations and anomalies the legacy tools would never detect, along with instructions on what actions (if any) are needed.
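As a toy illustration of "point at whatever outputs already exist," here is a sketch (plain Python 3.10+, with made-up numbers) that correlates two metric streams the tool knows nothing about in advance:

```python
from statistics import correlation  # Pearson correlation, Python 3.10+

cpu_pct  = [22, 25, 31, 58, 71, 90, 88, 40]           # stream from one output
query_ms = [110, 120, 130, 400, 650, 900, 870, 200]   # stream from another

r = correlation(cpu_pct, query_ms)
print(f"correlation: {r:.2f}")
if r > 0.8:
    print("These two signals move together; treat them as one incident.")
```

A real tool would do this continuously, across many more signals, and translate the result into plain-language guidance, but the shape of the work is the same.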
This is one of the reasons why I decided to earn an MS degree in data analytics. I’m looking to help companies use data to make better tools and systems so we can have what everyone wants: happy customers.
Thanks
Thanks for reading this far, and for subscribing. I expect this newsletter to evolve in time, and the format to change a bit as well. But for today, this is enough. I had a story to share and felt this was the right place to do so.
If you find yourself with questions about data, data science, data analytics, or database performance, feel free to reach out and book time with me (https://lnkd.in/gE7_ztk9). I'm also happy to field questions regarding SolarWinds products, especially if you are a customer and are wondering if your installations are configured correctly. Or we could talk about product marketing; I have lots of opinions on that subject!
Use the link and find some time for us to chat.
Thanks again for stopping by.
Tom