
Shit Breaks - Dao of Troubleshooting

Steve Mushero

Fractional CTO for Startups - Scaling Product, Processes, and People - AI, SaaS, B2C, Infrastructure, DevOps, Security, Operations and more ...

Published: May 14, 2017

Shit breaks. And often. Especially at high-load, high-complexity sites. And in ways not easily ‘solved’ with auto-scaling, more containers, restarting services, or fancy scheduling systems. While all those are useful and have their place, they are not where the real work happens, by the grown-up girls.

Fixing these things is made harder by shiny new objects like micro-services, server-less, infinitely-divisible, loosely-connected pieces and parts spread out over everywhere.

Dao is the Way . . .

This leads us to the Dao of Troubleshooting complex systems.

First, Model All the Things. Know what is where, how it’s connected, how it’s configured, and hopefully its behavior. Have & view logical and if needed, physical or network diagrams. With layers, and groupings that make sense, at any scale.
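As a rough illustration of what a model buys you, here is a minimal sketch that treats "how it's connected" as a dependency graph and lists everything a broken component relies on. The component names and the `DEPENDS_ON` map are made up for the example; a real model would come from your own inventory or discovery tooling.

```python
from collections import deque

# Hypothetical topology: each component maps to the things it depends on.
DEPENDS_ON = {
    "web": ["api"],
    "api": ["db", "cache"],
    "cache": [],
    "db": ["storage"],
    "storage": [],
}

def upstream_of(component, graph):
    """Return everything `component` depends on, directly or transitively,
    in breadth-first order (nearest dependencies first)."""
    seen, order = set(), []
    queue = deque(graph.get(component, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        queue.extend(graph.get(node, []))
    return order

# If "web" is misbehaving, these are the candidates to look at, nearest first.
print(upstream_of("web", DEPENDS_ON))
```

Even a toy graph like this gives you a suspect list ordered by distance from the symptom, which is most of what you want from a diagram at 3 a.m.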

Second, Know All the Knowables. This means knowing the status and configurations of everything, and I assure you this is not exactly what is checked into your code, config, .env, and infrastructure-as-code systems, let alone all the dynamic pieces and parts floating around. Like it or not, the source of truth is what’s really running right now.
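The gap between what is checked in and what is running can be made visible with a simple diff. The sketch below assumes you have already collected both sides as flat dictionaries; the function name `config_drift` and the sample keys are illustrative only.

```python
def config_drift(declared, running):
    """Compare declared config against what is actually running.

    Returns {key: (declared_value, running_value)} for every mismatch.
    A key missing on one side shows up as None on that side, so a value
    that is legitimately None cannot be told apart from a missing key.
    """
    drift = {}
    for key in sorted(set(declared) | set(running)):
        want, have = declared.get(key), running.get(key)
        if want != have:
            drift[key] = (want, have)
    return drift

# Hypothetical values: what the repo says versus what the live process reports.
declared = {"max_connections": 200, "timeout_s": 30}
running = {"max_connections": 500, "timeout_s": 30, "debug": True}
print(config_drift(declared, running))
```

Here the drift report would flag both the hand-tuned `max_connections` and the `debug` flag that exists only on the live system, which are exactly the things the repository will never tell you about.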

Third, Rue the Changes. What has changed in the last relevant time period, by whom, when, to what, and what was the effect? Who logged into the server, who pushed any code, changed any config, modified the cloud, etc. Then, what behaviors changed, e.g. whose latency changed, whose correlation dynamics changed, did error rates change, what resource loading or availability changed? And which of these changes mattered?
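One way to make "the last relevant time period" concrete is to pull every recorded change inside a window before the incident, most recent first. This is only a sketch: the event shape (`at`, `who`, `what`) and the sample events are assumptions, and in practice the events would be merged from deploy logs, login records, and cloud audit trails.

```python
from datetime import datetime, timedelta

def changes_before(events, incident_at, window):
    """Return events that happened within `window` before `incident_at`,
    newest first, since recent changes are usually the first suspects."""
    start = incident_at - window
    recent = [e for e in events if start <= e["at"] <= incident_at]
    return sorted(recent, key=lambda e: e["at"], reverse=True)

# Hypothetical change log for illustration.
events = [
    {"at": datetime(2017, 5, 14, 9, 0), "who": "alice", "what": "deployed api build"},
    {"at": datetime(2017, 5, 14, 11, 30), "who": "bob", "what": "edited nginx config"},
    {"at": datetime(2017, 5, 13, 8, 0), "who": "carol", "what": "resized db instance"},
]
incident_at = datetime(2017, 5, 14, 12, 0)

for e in changes_before(events, incident_at, timedelta(hours=6)):
    print(e["at"], e["who"], e["what"])
```

The window is a judgment call; widening it is cheap, so start narrow and expand only when the recent suspects have been excluded.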

Fourth, Exploit Expertise. Directly or indirectly apply knowledge and experience of how all the things, their relationships, dependencies, and especially dynamics and failure modes interconnect. Directly apply expertise via real live experts, on-site, on-line, or via Ouija, or indirectly apply it 24x7 via Expert Systems and Rule Engines with encoded expertise.
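Encoded expertise can start very small. The sketch below is a toy rule engine, not any particular product: each rule is a name, a test over a dictionary of observations, and a piece of advice. The rule names, thresholds, and observation keys are all invented for the example.

```python
# Each rule: (name, test over observations, advice). All values are illustrative.
RULES = [
    (
        "disk nearly full",
        lambda o: o.get("disk_used_pct", 0) >= 90,
        "check log rotation and temp files",
    ),
    (
        "connection pool exhausted",
        lambda o: o.get("db_conns_in_use", 0) >= o.get("db_conns_max", float("inf")),
        "look for slow queries or leaked connections",
    ),
]

def diagnose(observations, rules=RULES):
    """Return (name, advice) for every rule whose test matches the observations.
    Missing observations simply fail to match rather than raising."""
    return [(name, advice) for name, test, advice in rules if test(observations)]

print(diagnose({"disk_used_pct": 95, "db_conns_in_use": 10, "db_conns_max": 100}))
```

The value is less in any single rule than in the habit: every time an expert explains why something broke, that explanation can become another entry that runs around the clock.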

Fifth, Seek Clarity. Always ponder additional observations to boost the rule engines and expert brains, especially with low risk, quick answer data that ideally can be automated by the rule engines. There is never enough data, and never time to get it all, but bringing balance brings answers.

Sixth, Explore Effects by making changes or adjustments to the system to observe how they affect things. Especially useful to increase your exclusion list or uncover previously unknown relationships and stuff that never worked anyway.

Seventh, Exclude Exclusives, by not wasting time on problems you cannot have, as they can suck enormous energy, focus, and resources because they weren’t sufficiently excluded early on. Never lose sight of what the problem is not and rigorously exclude by logic and experience.

Eighth, Test Truths, as Late Stage Troubleshooting can end in contradictions and conundrums, where something true must not be - to paraphrase Twain, “The problem ain't what you don't know, it's what you know that just ain't so.” Always be willing to challenge your most basic assumptions, facts, and truths for therein often lies something you know that just ain’t so.

Ninth, Seek Solace as this stuff is hard, there is never enough time nor tools, and the pressure is always high. Continually step back, revisit what you know and think you know, looking at how it’s all connected, cause and effect, and the truth will often reveal itself, often in mysterious ways . . .

Joe Byrne

Global Field CTO at LaunchDarkly

7 years ago

I love the 8th step...taking that approach works much more often than most people would think.
