Shit Breaks - Dao of Troubleshooting
Steve Mushero
Fractional CTO for Startups - Scaling Product, Processes, and People - AI, SaaS, B2C, Infrastructure, DevOps, Security, Operations and more ...
Shit breaks. And often. Especially at high-load, high-complexity sites. And in ways not easily ‘solved’ with auto-scaling, more containers, restarting services, or fancy scheduling systems. While all those are useful and have their place, they are not where the real work happens, by the grown-up girls.
Fixing these things is made harder by shiny new objects like microservices, serverless, and infinitely divisible, loosely connected pieces and parts spread out everywhere.
Dao is the Way . . .
This leads us to the Dao of Troubleshooting complex systems.
First, Model All the Things. Know what is where, how it’s connected, how it’s configured, and hopefully how it behaves. Have, and actually look at, logical diagrams and, if needed, physical or network diagrams, with layers and groupings that make sense at any scale.
Second, Know All the Knowables. This means knowing the status and configurations of everything, and I assure you this is not exactly what is checked into your code, config, .env, and infrastructure-as-code systems, let alone all the dynamic pieces and parts floating around. Like it or not, the source of truth is what’s really running right now.
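A minimal sketch of that gap between what is checked in and what is running, assuming both can be flattened into simple key/value dictionaries. The function name `config_drift`, the `MISSING` marker, and the example settings are all hypothetical, for illustration only.

```python
# Hypothetical sketch: compare declared config (what is checked in)
# against running config (what the live system actually reports).
MISSING = "<missing>"  # marker for a key present on only one side


def config_drift(declared: dict, running: dict) -> dict:
    """Return {key: (declared_value, running_value)} for every mismatch."""
    drift = {}
    for key in sorted(declared.keys() | running.keys()):
        want = declared.get(key, MISSING)
        have = running.get(key, MISSING)
        if want != have:
            drift[key] = (want, have)
    return drift


# Made-up example values, not from any real system.
declared = {"max_connections": 500, "log_level": "info"}
running = {"max_connections": 200, "log_level": "info", "debug_port": 9229}

for key, (want, have) in config_drift(declared, running).items():
    print(f"{key}: declared={want!r} running={have!r}")
```

Keys that exist only on the running side, like the stray debug port here, are often the most interesting ones, since nothing in the repository would ever tell you about them.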
Third, Rue the Changes. What has changed in the last relevant time period: by whom, when, to what, and with what effect? Who logged into the server, who pushed any code, changed any config, modified the cloud, etc.? Then, what behaviors changed, e.g. whose latency changed, whose correlation dynamics changed, did error rates change, what resource loading or availability changed? And which of these changes mattered?
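One way to picture the "what changed, by whom, when" question is a single merged change log that can be filtered to the window before an incident. This is only a sketch: the `Change` record, the `recent_changes` helper, the 24-hour default lookback, and the sample entries are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Change:
    """One hypothetical change event: who did what, to what, and when."""
    when: datetime
    who: str
    what: str      # e.g. "deploy", "config", "login"
    target: str


def recent_changes(changes, incident_start, lookback=timedelta(hours=24)):
    """Return changes inside the lookback window, most recent first."""
    window_start = incident_start - lookback
    hits = [c for c in changes if window_start <= c.when <= incident_start]
    return sorted(hits, key=lambda c: c.when, reverse=True)


# Made-up sample data.
incident = datetime(2024, 5, 1, 12, 0)
log = [
    Change(datetime(2024, 4, 28, 9, 0), "alice", "deploy", "checkout-service"),
    Change(datetime(2024, 5, 1, 11, 40), "bob", "config", "load-balancer"),
    Change(datetime(2024, 5, 1, 3, 15), "carol", "login", "db-primary"),
]

suspects = recent_changes(log, incident)
for c in suspects:
    print(c.when.isoformat(), c.who, c.what, c.target)
```

Sorting most recent first reflects a common heuristic, not a rule: the change closest to the incident is often the first one worth a look, but the "relevant time period" may need widening when nothing in the window explains the behavior.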
Fourth, Exploit Expertise. Directly or indirectly apply knowledge and experience of how all the things, their relationships, dependencies, and especially dynamics and failure modes interconnect. Apply expertise directly via real live experts, on-site, on-line, or via Ouija, or indirectly, 24x7, via Expert Systems and Rule Engines with encoded expertise.
Fifth, Seek Clarity. Always ponder additional observations to boost the rule engines and expert brains, especially low-risk, quick-answer data whose collection can ideally be automated by the rule engines. There is never enough data, and never time to get it all, but bringing balance brings answers.
Sixth, Explore Effects by making changes or adjustments to the system to observe how they affect things. This is especially useful to increase your exclusion list or uncover previously unknown relationships and stuff that never worked anyway.
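As a toy illustration of encoded expertise, a rule engine can be as small as a list of named predicates evaluated against current observations. Everything below is hypothetical: the `Rule` type, the `diagnose` function, the observation keys, and the thresholds are stand-ins, not taken from any real product.

```python
from typing import Callable, NamedTuple


class Rule(NamedTuple):
    """A hypothetical rule: a name, a predicate, and what to do about it."""
    name: str
    matches: Callable[[dict], bool]
    advice: str


# Illustrative rules with made-up thresholds.
RULES = [
    Rule(
        "db_pool_exhausted",
        lambda o: o.get("db_pool_in_use", 0) >= o.get("db_pool_size", float("inf")),
        "Check for slow queries or leaked connections.",
    ),
    Rule(
        "disk_nearly_full",
        lambda o: o.get("disk_used_pct", 0) >= 90,
        "Find what is filling the disk before it hits 100%.",
    ),
]


def diagnose(observations: dict, rules=RULES) -> list:
    """Return (rule name, advice) for every rule the observations trigger."""
    return [(r.name, r.advice) for r in rules if r.matches(observations)]


# Made-up observations.
observed = {"db_pool_in_use": 50, "db_pool_size": 50, "disk_used_pct": 40}
for name, advice in diagnose(observed):
    print(f"{name}: {advice}")
```

Real expert systems carry far more context than this, but the shape is the same: what an expert would check at 3 a.m. is written down once and then evaluated around the clock.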
Seventh, Exclude Exclusives, by not wasting time on problems you cannot have, as they can suck enormous energy, focus, and resources because they weren’t sufficiently excluded early on. Never lose sight of what the problem is not and rigorously exclude by logic and experience.
Eighth, Test Truths, as Late Stage Troubleshooting can end in contradictions and conundrums, where something you hold true must not be - to paraphrase a line often attributed to Twain, “The problem ain't what you don't know, it's what you know that just ain't so.” Always be willing to challenge your most basic assumptions, facts, and truths, for therein often lies something you know that just ain’t so.
Ninth, Seek Solace, as this stuff is hard, there is never enough time nor tools, and the pressure is always high. Continually step back, revisit what you know and think you know, look at how it’s all connected, cause and effect, and the truth will often reveal itself, sometimes in mysterious ways . . .