How Long Is Your Tool …… Chain?
Chris Petersen
Do-er of the Difficult, Wizard of Why Not, and Certified IT Curmudgeon
Wishing everyone a happy Valentine’s Day a little early! I hope Cupid’s arrows are the only ones sticking out of your hearts in the coming days!
One of my many … issues? Mantras? Catchphrases? Design constraints? … is that I try to aim for relatively short tool chains, especially when those tools are either mission-critical or key to incident response for mission-critical workloads.
Let’s imagine a very simple scenario. You’ve got a bunch of mission-critical workloads running on-premises in (virtualized) AIX in IPv4 (Internet Protocol version 4) sub-net A, VLAN (Virtual Local-Area Network) A’, and behind LAN switches A’’. You’ve got a trivial log catcher running on-premises in (virtualized) Linux in IPv4 sub-net B, VLAN B’, and behind LAN switches B’’. One can assume (uh-oh!) that there may be core switches between A’’ and B’’ and that they also perform the IP routing function. There might even be no firewalls between A and B (oh, the horrors!).
So, how long is the chain between your log producers and your log consumer? Three physical LAN switches, two virtual LAN switches, one router, two hypervisors (plus a bit), and one small piece of custom code. The message producer and network sender are built into the operating system. The total physical distance between producer and consumer, as the cable runs, is probably under 100 feet (30ish meters), with network latency of around 1 millisecond or less.
You still have to code your application or software component to produce the messages with the appropriate facility and severity codes to get them out of the local box. The syslog (or rsyslog or syslog-ng) service has to be configured to pass them on, and someone has to write the custom code to receive and store those messages. But, that’s about it.
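Just to make that concrete, here’s a minimal sketch of the producer side in Python. The collector hostname, facility choice, and application name are all made up for illustration; a real AIX application might call syslog() directly or lean on the subsystem’s own logging, but the facility/severity idea is the same.

```python
# Minimal sketch: hand a message to syslog with an explicit facility and severity.
# Hostname, facility, and logger name are assumptions, not anyone's real setup.
import logging
import logging.handlers

handler = logging.handlers.SysLogHandler(
    address=("logcatcher.example.com", 514),              # hypothetical log catcher, standard syslog UDP port
    facility=logging.handlers.SysLogHandler.LOG_LOCAL0,   # pick a facility your syslog config actually forwards
)

logger = logging.getLogger("payroll-batch")               # hypothetical application name
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.warning("nightly batch started late")              # logging level maps to syslog severity "warning"
```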
If we’re thinking about old (old old old) school AIX syslog, then it’s using UDP (User Datagram Protocol) as its transport layer, so SSL (Secure Sockets Layer) inspection and all kinds of other man-in-the-middle tools don’t really play a part. There are also no acknowledgements at any protocol layer, so we’ll never know if our messages made it or not. Unfortunately!
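And here’s roughly what that trivial log catcher on the Linux box might look like, again as a hedged sketch with an assumed port and output file. Notice that nothing in it ever acknowledges anything: if this process is down or a datagram gets dropped along the way, the sender never finds out.

```python
# Minimal sketch of a fire-and-forget UDP syslog catcher. Port and output
# file are assumptions; there is no acknowledgement at any layer.
import socket

SYSLOG_PORT = 514  # privileged port, so run with appropriate rights or redirect it

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("0.0.0.0", SYSLOG_PORT))

with open("/var/log/catcher.log", "a") as out:    # hypothetical destination file
    while True:
        data, peer = sock.recvfrom(8192)          # one syslog datagram, take it or lose it
        out.write(f"{peer[0]} {data.decode(errors='replace')}\n")
        out.flush()
```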
If both ends are using enterprise-grade storage across a SAN (Storage-Area Network), that injects a few more possible fault domains. Even in this very simple scenario, we’re looking at a ton of Murphy’s Law potential.
Let’s add one more feature to the mix. The trivial message catcher looks for a few kinds of messages and sends e-mail if it sees them. (Groan!) Well, that’s back out across the LAN, into the on-premises e-mail relay box(es), out through the firewall, across a section of the global Internet, through a service provider (or two or three or four – darn spam filters!), and then into the e-mail provider itself. Depending on who’s doing what to whom that day, there may be DNS (Domain Name System), SMTP (Simple Mail Transfer Protocol), BGP (Border Gateway Protocol – wide-area routing), WAN (Wide-Area Network), or any number of other issues that could crop up. Our friends at Google are getting quite persnickety about whose e-mail they’ll accept lately!
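For the curious, the “send e-mail if it sees certain messages” feature might look something like the sketch below. The relay host, addresses, and trigger patterns are all hypothetical; the point is that one small SMTP call quietly drags the whole e-mail chain described above into your incident-response path.

```python
# Hedged sketch of the "groan" feature: watch for a few message patterns and e-mail someone.
# Relay host, addresses, and patterns are made up for illustration.
import smtplib
from email.message import EmailMessage

ALERT_PATTERNS = ("DISK FULL", "AUTH FAILURE")      # hypothetical triggers

def maybe_alert(syslog_line: str) -> None:
    if not any(pattern in syslog_line for pattern in ALERT_PATTERNS):
        return
    msg = EmailMessage()
    msg["From"] = "logcatcher@example.com"
    msg["To"] = "oncall@example.com"
    msg["Subject"] = "Log alert"
    msg.set_content(syslog_line)
    # Every hop after this call (relay, firewall, Internet, spam filters, the
    # provider itself) is another link in the chain that can fail.
    with smtplib.SMTP("mailrelay.example.com", 25) as relay:
        relay.send_message(msg)
```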
Holy smokes! That got ugly fast! And, that’s a truly trivial example.
What happens if you’re dealing with a cloud-based logging or observability provider? That may go across another section of the global Internet, in through who-knows-how-many routers, switches, software layers, etc. just to get into their message catcher. There’s no telling what their database and storage might be, how many servers they pool together, and all the rest. Trust them! They’re professionals! They do this for money!
Things change when the provider wants to use their own protocol layer(s), encrypt the traffic, write their own sending or routing code, and so on. Of course, not every ISV (Independent Software Vendor) fully supports IBM’s AIX. They never did, but IBM won’t necessarily tell you that. Maybe you’re syslogging out to that same kind of (virtualized) Linux box and then going who-knows-where via who-knows-what, maybe getting SSL (Secure Sockets Layer - largely replaced by TLS = Transport Layer Security) inspected, and on and on and on.
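To give a feel for what “encrypt the traffic” means at the simplest possible level, here’s a sketch of pushing a syslog-style line over TCP wrapped in TLS instead of bare UDP. The collector name is made up, 6514 is just the conventional syslog-over-TLS port, and real agents add framing, buffering, retries, and certificate management on top of this.

```python
# Sketch only: one syslog-style message over TCP, wrapped in TLS.
# Collector hostname is an assumption; 6514 is the conventional syslog-over-TLS port.
import socket
import ssl

context = ssl.create_default_context()                 # verifies the collector's certificate
with socket.create_connection(("collector.example.com", 6514)) as raw:
    with context.wrap_socket(raw, server_hostname="collector.example.com") as tls:
        # <134> = facility local0 (16) * 8 + severity info (6)
        tls.sendall(b"<134>payroll-batch: nightly batch started late\n")
```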
Does that make the cloud-based observability provider (whether they’re really providing next-generation observability or not) a bad idea? Notwithstanding the foregoing (Ha! Even some legalese for you!), I’d say no. Getting that log (and trace and event and …) data the heck out of your on-premises or cloud environment and into someone else’s is a good thing when it comes to audit time. Auditors love to hear “we can’t change or delete it, and it’s not (just) stored in our servers.” If it’s strongly encrypted at rest (on disk), in transit, etc., so much the better.
No, my point is that a really long tool chain with a ton of links that could bend or break in all kinds of interesting ways may not be enough on its own. Remember that BGP mention above? Once your traffic leaves your local network, you may or may not have any idea whose networks it flows through to get to its final destination. Missouri to Virginia via China? It can happen.
You may lose a ton of graphing, reporting, and friendly querying functionality. You may keep your own people busy with what feels like make-work at times: pruning logs, managing servers, and all that. But, you may have priceless data a whole lot closer and easier to get at when things go wrong. The ultimate in “close” is on the originating server’s disk, but that can be tough to fully secure.
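That make-work is real, though it doesn’t have to be fancy. Here’s a sketch of the pruning piece, with a made-up directory and retention window:

```python
# Sketch of local log pruning: delete anything older than a retention window.
# Directory and retention period are assumptions, not a recommendation.
import os
import time

LOG_DIR = "/var/log/catcher"        # hypothetical local log directory
RETENTION_DAYS = 90

cutoff = time.time() - RETENTION_DAYS * 86400
for name in os.listdir(LOG_DIR):
    path = os.path.join(LOG_DIR, name)
    if os.path.isfile(path) and os.path.getmtime(path) < cutoff:
        os.remove(path)             # gone for good, so make sure your retention policy agrees
```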
What if you’re a victim of the infamous fiber-seeking backhoe? Your log data somewhere else may be inaccessible. The log servers may not get new log data, or your logging data could be competing head-to-head with your business transactions for bandwidth over congested backup links that have less capacity and longer network latency. None of those are really good for using that log data to figure out the problem.
While logging is a decent example, it’s far from the only one. If you’ve got distributed components, you may or may not need industrial-grade message brokers running on a bunch of parallel servers with massive complexity behind the scenes (for the consumers, anyway – your infrastructure folks may want a word at raise and bonus time). Such products, open-source or otherwise, can bring a ton of functionality where they’re needed, but they’re not always needed.
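When the producer and consumer really do live close together, the whole “broker” can be something as boring as an in-process queue. A sketch, purely for illustration:

```python
# Minimal producer/consumer hand-off with no broker at all: one in-process,
# thread-safe queue. Plenty of workloads never need more links than this.
import queue
import threading

work = queue.Queue(maxsize=1000)   # bounded, so a slow consumer pushes back on the producer

def consumer() -> None:
    while True:
        item = work.get()
        if item is None:           # sentinel value: time to shut down
            break
        print(f"processed {item}")
        work.task_done()

t = threading.Thread(target=consumer, daemon=True)
t.start()

for i in range(5):
    work.put(f"message {i}")       # the entire "tool chain" is this one call
work.put(None)
t.join()
```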
Simplify. Shorten tool chains where you can. In some cases, look for parallel paths and parallel services so Murphy’s Law doesn’t bite as hard.
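A tiny sketch of that last idea: the same logger feeding two independent collectors, so one bent link doesn’t leave you blind. Both hostnames are, of course, made up.

```python
# Sketch of a parallel-path sender: every message goes to two independent
# collectors over plain syslog/UDP. Hostnames are assumptions.
import logging
import logging.handlers

logger = logging.getLogger("payroll-batch")
logger.setLevel(logging.INFO)

for host in ("logcatcher-a.example.com", "logcatcher-b.example.com"):
    logger.addHandler(
        logging.handlers.SysLogHandler(
            address=(host, 514),
            facility=logging.handlers.SysLogHandler.LOG_LOCAL0,
        )
    )

logger.error("primary link down, failing over")   # lands on both catchers (if UDP cooperates)
```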
One more war story, I suppose. Many a winter moon ago, I was under contract to a large, worldwide organization that was re-working its software and networking layers. A few years earlier, the simpler, easier, shorter tool (and software) chains on ancient hardware had totally gummed up at around 7-10 customer transactions per second. Their new machines were somewhere in the neighborhood of 25-40 times faster, some had multiple processors, and they were using all new programming techniques. However, there were so many hand-offs and message-passes in their new software stack that those shiny new boxes were rumored to top out at about … wait for it … wait for it … 7-10 customer transactions per second.
It doesn’t matter how macro or micro the scale. Shorter tool chains and fewer hand-offs and message-passes can be a very good thing for your system designs…