Shit Breaks - Dao of Troubleshooting


Shit breaks. And often. Especially at high-load, high-complexity sites. And in ways not easily ‘solved’ with auto-scaling, more containers, restarting services, or fancy scheduling systems. While all of those are useful and have their place, they are not where the real work happens, the work done by the grown-up girls.

Fixing these things is made harder by shiny new objects like micro-services, server-less, infinitely-divisible, loosely-connected pieces and parts spread out over everywhere.

Dao is the Way . . .

This leads us to the Dao of Troubleshooting complex systems.

First, Model All the Things. Know what is where, how it’s connected, how it’s configured, and hopefully how it behaves. Have and actually look at logical diagrams and, if needed, physical or network diagrams, with layers and groupings that make sense at any scale.
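As a minimal sketch of what a usable model can look like, the Python below keeps a tiny dependency map and answers one question: if this component fails, what else is affected? The service names and the `DEPENDS_ON` map are hypothetical, made up purely for illustration; a real model would be generated from discovery or inventory data rather than typed by hand.

```python
from collections import deque

# Hypothetical topology: each key depends on the components in its list.
DEPENDS_ON = {
    "web": ["api"],
    "api": ["db", "cache"],
    "worker": ["db", "queue"],
    "db": [],
    "cache": [],
    "queue": [],
}


def affected_by(failed, depends_on):
    """Return every component that directly or transitively depends on `failed`."""
    # Invert the edges so we can walk from a component to its dependents.
    dependents = {name: [] for name in depends_on}
    for name, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(name)

    # Breadth-first walk up the dependency chain.
    seen = set()
    todo = deque([failed])
    while todo:
        current = todo.popleft()
        for parent in dependents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                todo.append(parent)
    return sorted(seen)


print(affected_by("db", DEPENDS_ON))  # ['api', 'web', 'worker']
```

Even a map this small turns "the database is slow" into a concrete list of places to look.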

Second, Know All the Knowables. This means knowing the status and configurations of everything, and I assure you this is not exactly what is checked into your code, config, .env, and infrastructure-as-code systems, let alone all the dynamic pieces and parts floating around. Like it or not, the source of truth is what’s really running right now.
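One way to make the gap between "what is checked in" and "what is really running" visible is a plain diff of the two. The sketch below assumes both sides have already been collected into flat dictionaries; the function name `config_drift` and the sample keys are hypothetical.

```python
def config_drift(declared, running):
    """Compare declared config with what is actually running.

    Returns {key: (declared_value, running_value)} for every key that
    differs. A key missing on one side shows up as None on that side.
    """
    drift = {}
    for key in sorted(set(declared) | set(running)):
        want = declared.get(key)
        have = running.get(key)
        if want != have:
            drift[key] = (want, have)
    return drift


# Hypothetical sample data: what the repo says vs. what the live system reports.
declared = {"max_connections": 200, "log_level": "info", "timeout_s": 30}
running = {"max_connections": 500, "log_level": "info", "debug_port": 9229}

for key, (want, have) in config_drift(declared, running).items():
    print(f"{key}: declared={want!r} running={have!r}")
```

Here the diff would flag a hand-tuned `max_connections`, a `timeout_s` that never made it to production, and a `debug_port` nobody checked in.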

Third, Rue the Changes. What has changed in the last relevant time period: by whom, when, to what, and with what effect? Who logged into the server, who pushed code, changed any config, modified the cloud, etc.? Then, what behaviors changed, e.g. whose latency changed, whose correlation dynamics changed, did error rates change, what resource loading or availability changed? And which of these changes mattered?
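A minimal sketch of the first half of that question: given a merged change log (logins, deploys, config edits), pull out everything that happened in the window leading up to the incident. The `CHANGES` records, the names in them, and the two-hour default window are all assumptions for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical merged change log from deploy tools, audit logs, cloud events, etc.
CHANGES = [
    {"when": datetime(2024, 1, 10, 9, 0), "who": "alice", "what": "deploy api v2.3"},
    {"when": datetime(2024, 1, 10, 12, 30), "who": "bob", "what": "raise db max_connections"},
    {"when": datetime(2024, 1, 10, 13, 45), "who": "carol", "what": "ssh login to db-1"},
    {"when": datetime(2024, 1, 10, 14, 20), "who": "dave", "what": "restart cache"},
]


def changes_before(incident_at, changes, window=timedelta(hours=2)):
    """Return changes made in the `window` leading up to the incident, newest first."""
    start = incident_at - window
    recent = [c for c in changes if start <= c["when"] <= incident_at]
    return sorted(recent, key=lambda c: c["when"], reverse=True)


incident = datetime(2024, 1, 10, 14, 0)
for change in changes_before(incident, CHANGES):
    print(change["when"].isoformat(), change["who"], change["what"])
```

Deciding which of those changes mattered is still the hard part, but a short, ordered list beats memory and chat scrollback.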

Fourth, Exploit Expertise. Directly or indirectly apply knowledge and experience of how all the things, their relationships, dependencies, and especially their dynamics and failure modes interconnect. Apply expertise directly via real live experts, on-site, on-line, or via Ouija, or indirectly and 24x7 via expert systems and rule engines with encoded expertise.
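Encoded expertise does not have to be fancy. As a rough sketch, a rule engine can be a list of (hint, condition) pairs evaluated against current observations. The rules, thresholds, and observation keys below are invented for illustration and are not taken from any real product.

```python
# Each rule is (hint, condition). Thresholds and keys are hypothetical.
RULES = [
    (
        "disk nearly full",
        lambda o: o.get("disk_used_pct", 0) >= 90,
    ),
    (
        "connection pool may be exhausted",
        lambda o: o.get("pool_in_use", 0) >= o.get("pool_size", float("inf")),
    ),
    (
        "errors started soon after a deploy",
        lambda o: o.get("error_rate", 0) > 0.05
        and o.get("minutes_since_deploy", float("inf")) < 30,
    ),
]


def diagnose(observations, rules):
    """Return the hint of every rule whose condition matches the observations."""
    return [hint for hint, condition in rules if condition(observations)]


observations = {
    "disk_used_pct": 95,
    "pool_in_use": 20,
    "pool_size": 50,
    "error_rate": 0.2,
    "minutes_since_deploy": 10,
}
print(diagnose(observations, RULES))
# ['disk nearly full', 'errors started soon after a deploy']
```

Missing observations simply fail to trigger a rule, which keeps the engine usable when data is incomplete, as it usually is.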

Fifth, Seek Clarity. Always ponder additional observations to feed the rule engines and expert brains, especially low-risk, quick-answer data whose collection can ideally be automated by the rule engines. There is never enough data, and never time to get it all, but bringing balance brings answers.

Sixth, Explore Effects by making changes or adjustments to the system to observe how they affect things. This is especially useful for growing your exclusion list or uncovering previously unknown relationships and stuff that never worked anyway.

Seventh, Exclude Exclusives, by not wasting time on problems you cannot have, as they can suck enormous energy, focus, and resources because they weren’t sufficiently excluded early on. Never lose sight of what the problem is not and rigorously exclude by logic and experience.
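Keeping the exclusion list explicit can be as simple as the sketch below: each hypothesis lists the facts it needs to be true, and anything contradicted by a known fact is set aside. The hypotheses and fact names are hypothetical, and unknown facts deliberately exclude nothing.

```python
# Hypothetical hypotheses, each with the facts that must hold for it to be possible.
REQUIRES = {
    "bad deploy": {"deployed_recently": True},
    "database overload": {"db_cpu_high": True},
    "network partition": {"hosts_reachable": False},
}


def triage(requires, facts):
    """Split hypotheses into (still_possible, excluded) given the known facts.

    A hypothesis is excluded only when a known fact contradicts one of its
    requirements; facts we have not checked yet exclude nothing.
    """
    possible, excluded = [], []
    for name, needed in requires.items():
        contradicted = any(
            key in facts and facts[key] != value for key, value in needed.items()
        )
        (excluded if contradicted else possible).append(name)
    return possible, excluded


facts = {"deployed_recently": False, "hosts_reachable": True}
possible, excluded = triage(REQUIRES, facts)
print("still possible:", possible)  # ['database overload']
print("excluded:", excluded)        # ['bad deploy', 'network partition']
```

Writing the exclusions down, with the fact that justified each one, also makes them easy to revisit later if one of those facts turns out not to be so.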

Eighth, Test Truths, as Late Stage Troubleshooting can end in contradictions and conundrums, where something true must not be - to paraphrase Twain, “The problem ain't what you don't know, it's what you know that just ain't so.” Always be willing to challenge your most basic assumptions, facts, and truths for therein often lies something you know that just ain’t so.

Ninth, Seek Solace as this stuff is hard, there is never enough time nor tools, and the pressure is always high. Continually step back, revisit what you know and think you know, looking at how it’s all connected, cause and effect, and the truth will often reveal itself, often in mysterious ways . . .




Joe Byrne

Global Field CTO at LaunchDarkly

7 years ago

I love the 8th step... taking that approach works much more often than most people would think.

