Spotlights and Floodlights

There's an old Internet fable about a plumber charging an obscene amount of money for tapping a pipe with a hammer. When asked to justify the cost, the plumber replies that hitting the pipe was cheap but knowing where to hit was the true reason for the cost. The story actually predates the Internet by several decades but found new life as justification for, among other things, high-priced consultancy fees. I make no judgment on this – deep knowledge is invaluable.

One of the drawbacks of amassing this knowledge, this Lore, is the tendency to apply obscure remediations to pedestrian problems. After all, this is how we learn. When a known solution exists for a given set of symptoms, we recognize those symptoms and use our knowledge to resolve the issue. Problems arise when we focus on the known solution instead of the set of possible solutions. There's even a name for this: the Einstellung Effect.

The Einstellung effect is the counterintuitive finding that prior experience or domain-specific knowledge can under some circumstances interfere with problem solving performance.
(Jessica J. Ellis and Eyal M. Reingold)

Late last year this effect bit me. During an extended period of working from home due to the pandemic, my car had sat unmoved for several months. When I did start it to run a quick errand, it went into a "limp home" mode where it would not leave first gear and the speed stayed under 5 MPH. After watching several YouTube videos I was able to solve the problem by running a sequence of operations including pressing the accelerator at specific intervals, turning the key switch, and possibly dribbling goat blood on the steering wheel. To my elation, it worked.

Fast forward a few months and similar symptoms presented themselves. So sure was I that the ritual would solve it that I tried for two hours to replicate my earlier success. No joy. It wasn't until a knowledgeable person ran through proper diagnostics that the culprit – a failed accelerator position sensor – revealed itself.

I was not immune to making this same mistake in IT. While troubleshooting a particularly irksome issue with a cluster deployment, I found myself looking at DNS, at NTP, at I/O latency on underlying volumes, even at kernel tunings to increase timeout thresholds. Ultimately all of these were red herrings. After literal days of troubleshooting, the culprit turned out to be a mismatch in the network configuration of this particular environment. How was this mismatch discovered? By automating a set of tests for everything.

Before that, though, we had run many troubleshooting-related commands throughout the effort. These included IP address validation, ping latencies, process status, SSL checks, and a host of other CLI-issued commands. We ran these as needed. And to our considerable bewilderment, the results from these tests were almost always as expected. All endpoints were accessible, all processes started as needed, all resource utilization metrics were nominal. Yet the issues persisted.

By running our ad hoc commands we were in essence using a spotlight to locate an elusive squirrel that popped up randomly, chewed through a wire or two, then disappeared. Now, of course, modern computers are not random but chaotic. A combination of variables – VMs appearing on least-busy hosts, a downstream timeout issue caused by a previous error, a pod failure and restart that mostly worked – all contributed to this appearance of randomness.

So we turned on the floodlights.

Instead of ad hoc commands, we looked at everything continuously. For example, if latency from one host to another might be useful, we logged it to a file for all hosts. We captured the IPs as they were allocated in case some were somehow being blocked. We captured SSL certificates from each endpoint in case the infrastructure was intercepting and rewriting them. By scripting the troubleshooting suite and running it across multiple nodes from multiple endpoints, we gained a holistic view of the environment.
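A floodlight sweep along these lines might look like the following Python sketch. It is an illustration under stated assumptions, not the harness described in this article: the probe names, the record fields, and the default port 443 are all hypothetical choices. The idea it shows is simply that every probe runs against every host on every pass, and every result is kept, including the boring ones.

```python
import hashlib
import json
import socket
import ssl
import time
from typing import Callable, Dict, Iterable, List, Optional


def resolve_ip(host: str) -> Optional[str]:
    """Return the IPv4 address the name currently resolves to, or None."""
    try:
        return socket.gethostbyname(host)
    except OSError:
        return None


def tcp_latency_ms(host: str, port: int = 443, timeout: float = 3.0) -> Optional[float]:
    """Return the TCP connect time in milliseconds, or None if the connect fails."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return round((time.monotonic() - start) * 1000, 2)
    except OSError:
        return None


def cert_fingerprint(host: str, port: int = 443, timeout: float = 3.0) -> Optional[str]:
    """Return the SHA-256 fingerprint of whatever certificate the endpoint presents."""
    ctx = ssl.create_default_context()
    # Validation is disabled on purpose: the goal is to record the certificate
    # we are actually handed, even if something in the path has rewritten it.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                der = tls.getpeercert(binary_form=True)
                return hashlib.sha256(der).hexdigest() if der else None
    except OSError:  # ssl.SSLError is a subclass of OSError
        return None


# Hypothetical probe set; anything worth checking by hand gets added here.
PROBES: Dict[str, Callable[[str], object]] = {
    "ip": resolve_ip,
    "latency_ms": tcp_latency_ms,
    "cert_sha256": cert_fingerprint,
}


def sweep(hosts: Iterable[str], probes=None, source=None, now=time.time) -> List[dict]:
    """Run every probe against every host and return one record per host."""
    probes = PROBES if probes is None else probes
    source = socket.gethostname() if source is None else source
    records = []
    for host in hosts:
        record = {"ts": now(), "source": source, "target": host}
        for name, probe in probes.items():
            record[name] = probe(host)
        records.append(record)
    return records


def append_jsonl(path: str, records: Iterable[dict]) -> None:
    """Append records to a JSON-lines file so that successive sweeps accumulate."""
    with open(path, "a", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record) + "\n")
```

Run from several nodes on a timer, with each node appending to its own file, this gives one timestamped record per source, target, and pass. The `source` field matters as much as the target: the same endpoint can look healthy from one vantage point and dead from another.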

As floating IPs appeared on machines on a specific switch, they disappeared from an external TCP/443 query, even though they remained accessible within the network and even externally via SSH. After correlating the apparently random error with the IPs and underlying physical hosts, the network mismatch revealed itself.
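The correlation step can be sketched in a few lines of Python. This assumes each logged sample already carries the physical host, the switch it hangs off, and whether the external TCP/443 check succeeded; the field names (`host`, `switch`, `external_443_ok`) are hypothetical and stand in for whatever the logs actually record.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def failure_rate_by(records: Iterable[dict], key: str) -> Dict[str, Tuple[int, int, float]]:
    """Group samples by one attribute and return (failures, samples, rate) per value."""
    totals = defaultdict(lambda: [0, 0])  # attribute value -> [failures, samples]
    for rec in records:
        bucket = totals[rec[key]]
        bucket[1] += 1
        if not rec["external_443_ok"]:
            bucket[0] += 1
    return {value: (fails, n, fails / n) for value, (fails, n) in totals.items()}


def rank_suspects(records: Iterable[dict], keys: Iterable[str]):
    """Return (key, value, rate, samples) tuples, worst failure rate first."""
    records = list(records)
    ranked = []
    for key in keys:
        for value, (fails, n, rate) in failure_rate_by(records, key).items():
            ranked.append((key, value, rate, n))
    return sorted(ranked, key=lambda row: row[2], reverse=True)
```

Calling `rank_suspects(samples, ["host", "switch"])` lists every host and switch by failure rate. An error that looks random per request can stop looking random once it is grouped by the right attribute, which is why it helps to log attributes you have no particular reason to suspect.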

We had defeated the Einstellung Effect by forcing ourselves to look everywhere rather than chasing squirrels.

Among the lessons learned was that if something was worth checking, it was worth adding to a test harness. We used a combination of RobotFramework and ad hoc scripts for this harness, some initiated from scripts and some kicked off manually. The salient point is that instead of looking at specific metrics we looked at the entire set.

To be clear, infrastructure tests already existed. We checked everything from storage space to firewall access to availability of required commands. These tests ran prior to deployment and throughout troubleshooting.
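For readers who have not built such pre-deployment checks, a minimal Python sketch of the three kinds mentioned here (storage space, firewall access, required commands) follows. It is not the suite described above; the function names, the thresholds, and the example targets in the usage note are assumptions for illustration.

```python
import shutil
import socket
from typing import Callable, Iterable, List, Tuple

CheckResult = Tuple[bool, str]


def check_disk_free(path: str, min_free_bytes: int) -> CheckResult:
    """Pass if the filesystem holding `path` has at least `min_free_bytes` free."""
    free = shutil.disk_usage(path).free
    return free >= min_free_bytes, f"{path}: {free} bytes free (need {min_free_bytes})"


def check_command(name: str) -> CheckResult:
    """Pass if `name` resolves to an executable on PATH."""
    found = shutil.which(name)
    return found is not None, f"{name}: {found or 'not found on PATH'}"


def check_tcp(host: str, port: int, timeout: float = 3.0) -> CheckResult:
    """Pass if a TCP connection to host:port opens, i.e. nothing in the path blocks it."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, f"{host}:{port} reachable"
    except OSError as exc:
        return False, f"{host}:{port} unreachable ({exc})"


def run_checks(checks: Iterable[Tuple[str, Callable[[], CheckResult]]]) -> List[Tuple[str, bool, str]]:
    """Run every check, even after a failure, and return (label, ok, detail) rows."""
    results = []
    for label, check in checks:
        ok, detail = check()
        results.append((label, ok, detail))
    return results
```

A run might be assembled as `run_checks([("disk", lambda: check_disk_free("/var", 10 * 2**30)), ("kubectl", lambda: check_command("kubectl")), ("api", lambda: check_tcp("api.example.internal", 443))])`, where the path, command, and hostname are placeholders. Running every check rather than stopping at the first failure keeps to the floodlight approach: the full set of results is what gets examined.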

