What sucks about observability is that buying the tools is the easy part.
You want whatever you buy to just work, but much like buying a hammer won’t make you a carpenter, there’s a lot to learn before you’ll get value from your purchase.
Something I see a lot is companies making a push to roll out observability across the org, usually after a series of incidents that made executives say ‘something’ must be done, which eventually lands as ‘add some metrics’.
Great! You add an o11y project to each team’s roadmap, approve budget for the new toolchain, and then you’re done.
Except… you’re not. Teams get to their projects, often with varying levels of skepticism about the value, and proceed to run headfirst into the blank page problem.
“So many tools, so much code. Could do anything, really. Add metrics here, some tracing there, I guess that could do with some telemetry…”
It can be a huge waste of time. Worse, it might generate loads of telemetry data that skyrockets your bill, all of it unused. No value delivered, engineers annoyed, incidents continue.
What you need is a strategy, a good understanding of the problems you want to solve, and a vision of where you want to go. You need patterns that people can apply consistently to add observability to systems, and an example of ‘good’ that people can aim toward.
The internet is short of content about exactly how teams roll out observability and what the end result looks like. Which is why I’m super happy Martha has written a post that shares exactly how we did this at incident, including real screenshots and videos of our dashboards and our philosophy for creating them.
These are patterns that can be adopted by any team with similar aims. If you want to get past the blank page, I strongly recommend you read this post!