Spotlights and Floodlights

Kwan Lowe

Published: February 6, 2022

There's an old Internet fable about a plumber charging an obscene amount of money for tapping a pipe with a hammer. When asked to justify the cost, the plumber replies that hitting the pipe was cheap but knowing where to hit was the true reason for the cost. The story actually predates the Internet by several decades but found new life as justification for, among other things, high-priced consultancy fees. I make no judgment on this – deep knowledge is invaluable.

One of the drawbacks of amassing this knowledge, this Lore, is the tendency to apply obscure remediations to pedestrian problems. After all, this is how we learn. When a known solution exists for a given set of symptoms, we recognize those symptoms and use our knowledge to resolve the issue. Problems arise when we focus on the known solution instead of the set of possible solutions. There's even a name for this - the Einstellung Effect.

The Einstellung effect is the counterintuitive finding that prior experience or domain-specific knowledge can under some circumstances interfere with problem solving performance.

(Jessica J. Ellis and Eyal M. Reingold)

Late last year this effect bit me. During an extended period of working from home due to the pandemic, my car had sat unmoved for several months. When I did start it to run a quick errand, it went into a "limp home" mode where it would not leave first gear and the speed stayed under 5 MPH. After watching several YouTube videos I was able to solve the problem by running a sequence of operations including pressing the accelerator at specific intervals, turning the key switch, and possibly dribbling goat blood on the steering wheel. To my elation, it worked.

Fast forward a few months and similar symptoms presented themselves. So sure was I that the ritual would solve it that I tried for two hours to replicate my earlier success. No joy. It wasn't until a knowledgeable person ran through proper diagnostics that the culprit – a failed accelerator position sensor – revealed itself.

I was not immune to making this same mistake in IT. While troubleshooting a particularly irksome issue with a cluster deployment, I found myself looking at DNS, at NTP, at I/O latency on underlying volumes, even at kernel tunings to increase timeout thresholds. Ultimately, all of these were red herrings. After literal days of troubleshooting, the culprit turned out to be a mismatch in a network configuration in this particular environment. How was this mismatch discovered? By automating a set of tests for everything.

However, we ran many troubleshooting-related commands throughout the effort. These included IP address validation, ping latencies, process status, SSL checks, and a host of other CLI-issued commands. We ran these as needed. And to our considerable bewilderment, the results from these tests were almost always as expected. All endpoints were accessible, all processes started as needed, all resource utilization metrics were nominal. Yet the issues persisted.

By running our ad hoc commands we were in essence using a spotlight to locate an elusive squirrel that popped up randomly, chewed through a wire or two, then disappeared. Now, of course, modern computers are not random but chaotic. A combination of variables – VMs appearing on least-busy hosts, a downstream timeout issue caused by a previous error, a pod failure and restart that mostly worked – all contributed to this appearance of randomness.

So we turned on the floodlights.

Instead of ad hoc commands, we looked at everything continuously. For example, should latency from one host to another be useful, we logged it to a file for all hosts. We captured the IPs as they were allocated in case some were somehow being blocked. We captured SSL certificates from each endpoint in case the infrastructure was intercepting and rewriting them. By scripting the troubleshooting suite and running across multiple nodes from multiple endpoints, we gained a holistic view of the environment.
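To make the "log it for all hosts" idea concrete, here is a minimal Python sketch. It is not the tooling we actually used; the function names, the CSV layout, and the choice of TCP connect time as a stand-in for latency are all assumptions made for illustration.

```python
import csv
import socket
import time
from datetime import datetime, timezone


def tcp_connect_latency(host, port, timeout=2.0):
    """Return the TCP connect time in milliseconds, or None if the connect fails."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.monotonic() - start) * 1000.0
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return None


def sweep(targets, out, probe=tcp_connect_latency):
    """Probe every (host, port) pair and write one CSV row per target to `out`.

    Every target is recorded on every pass, including the ones that look
    healthy. That is the floodlight: no target is skipped for being boring.
    """
    writer = csv.writer(out)
    for host, port in targets:
        latency_ms = probe(host, port)
        writer.writerow([
            datetime.now(timezone.utc).isoformat(),
            host,
            port,
            "ok" if latency_ms is not None else "fail",
            "" if latency_ms is None else f"{latency_ms:.1f}",
        ])
```

Run from several nodes on a timer, with the output appended to a per-node file, something like this produces the raw material for later correlation. The `probe` parameter is there so that other checks (certificate capture, for instance) could be swapped in without changing the loop.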

When the floating IP landed on machines attached to a specific switch, those machines disappeared from an external TCP/443 query, even though they remained accessible within the network and even externally via SSH. After correlating the apparently random error with the IPs and underlying physical hosts, the network mismatch revealed itself.
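The correlation step can be sketched in a few lines of Python. The record fields (`switch`, `host`, `ok`) and the sample values are hypothetical; the point is only that once every probe result carries its physical context, grouping by that context makes a pattern visible.

```python
from collections import defaultdict


def failure_counts_by(records, key):
    """Group probe records by `key` and return {group: (failures, total)}."""
    counts = defaultdict(lambda: [0, 0])
    for rec in records:
        bucket = counts[rec[key]]
        bucket[1] += 1          # total probes seen for this group
        if not rec["ok"]:
            bucket[0] += 1      # failed probes for this group
    return {group: tuple(pair) for group, pair in counts.items()}


# Hypothetical sample: external TCP/443 probes tagged with the physical
# host and switch that held the floating IP at the time of the probe.
records = [
    {"switch": "sw-a", "host": "hv1", "ok": True},
    {"switch": "sw-a", "host": "hv2", "ok": True},
    {"switch": "sw-b", "host": "hv3", "ok": False},
    {"switch": "sw-b", "host": "hv4", "ok": False},
    {"switch": "sw-a", "host": "hv1", "ok": True},
]

by_switch = failure_counts_by(records, "switch")
```

In this made-up sample every failure sits behind one switch, which is the kind of signal a single ad hoc command would never show.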

We had defeated the Einstellung Effect by forcing ourselves to look everywhere rather than chasing squirrels.

Among the lessons learned was that if something was worth checking, it was worth adding to a test harness. We used a combination of RobotFramework and ad hoc scripts for this harness, some initiated from scripts and some kicked off manually. The salient point is that instead of looking at specific metrics we looked at the entire set.

To be clear, infrastructure tests already existed. We checked everything from storage space to firewall access to availability of required commands. These tests run prior to deployment and throughout troubleshooting.
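As a rough illustration of the "if it's worth checking, put it in the harness" rule, here is a small Python sketch of two such checks and a runner. Our real harness was RobotFramework plus scripts; the function names and thresholds below are assumptions, not what we ran.

```python
import shutil


def check_free_space(path, min_free_bytes):
    """Pass if the filesystem holding `path` has at least `min_free_bytes` free."""
    free = shutil.disk_usage(path).free
    return free >= min_free_bytes, f"{path}: {free} bytes free"


def check_commands(names):
    """Pass if every named command can be found on PATH."""
    missing = [name for name in names if shutil.which(name) is None]
    detail = ("missing: " + ", ".join(missing)) if missing else "all present"
    return not missing, detail


def run_checks(checks):
    """Run every check and collect (passed, detail) for each.

    The runner never stops at the first failure, and a check that raises is
    recorded as a failure rather than aborting the rest of the run.
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = check()
        except Exception as exc:
            results[name] = (False, f"error: {exc}")
    return results
```

A check gets added to the dictionary passed to `run_checks` the first time someone types it by hand, so the next run covers it whether or not anyone remembers to look.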
