Down, Slow or No Access
Bill Alderson
Researching Zero-Day Prevention Strategies for Robust, Resilient Enterprises from High Stakes Lessons-Learned Experience
Critical problem resolution case-study history from the Fortune 100 and U.S. military: a discussion for management and technologists.
Consumers are on the phone with their financial institution, healthcare provider or customer support. Tension mounts as the representative attempts to find or load specific account information. Small talk begins: "May I put you on hold for a moment, while I..." The representative is likely making something up to pass the time as the system restarts or reloads, struggling to do their best.
It may be the web site of a mobile provider, cable company or local utility that seems fast to load until it actually has to look something up contextually, like account detail, the last statement or transactions.
A credit card consumer calls to change an address. It takes twenty minutes. Not because the new address is so complex, or because identity has to be checked, but because a simple data entry screen takes so long to load and to submit the changes. If the loaded cost per hour for a representative is $90, an address change takes 20 minutes, and 30% of 5M customers move annually, the labor cost to change addresses is $45M a year. Reducing an address change from 20 minutes to 10 minutes would save $22.5M annually. I call that an "optimization opportunity" and it is huge. Trouble is, the IT department doesn't pay those salaries, so IT has little motivation to spend $1M on performance to save the organization a net $21.5M every year. These ideas need to go to the CEO and Board of Directors.
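The address-change arithmetic above can be reproduced in a few lines. This is only a sketch using the figures from the paragraph; the constant and function names are illustrative, and the $1M improvement cost is treated as a single charge against one year of savings.

```python
# Figures from the address-change example; names are illustrative only.
LOADED_COST_PER_HOUR = 90.0      # dollars per representative hour
CUSTOMERS = 5_000_000            # total customers
ANNUAL_MOVE_RATE = 0.30          # share of customers who move each year
IMPROVEMENT_COST = 1_000_000.0   # assumed cost of the performance work


def annual_labor_cost(minutes_per_change: float) -> float:
    """Yearly labor cost of address changes at a given handling time."""
    changes_per_year = CUSTOMERS * ANNUAL_MOVE_RATE
    cost_per_change = LOADED_COST_PER_HOUR * minutes_per_change / 60
    return changes_per_year * cost_per_change


current = annual_labor_cost(20)          # today's 20-minute change
improved = annual_labor_cost(10)         # after halving the handling time
savings = current - improved
net_savings = savings - IMPROVEMENT_COST

print(f"Current annual cost: ${current:,.0f}")
print(f"Annual savings:      ${savings:,.0f}")
print(f"Net of improvement:  ${net_savings:,.0f}")
```

Changing the handling time or the move rate shows how quickly the opportunity scales with either one.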
Consider the system for entering travel expenses: how much does a company pay an employee per hour to put in for a $45 reimbursement? Companies have no idea how many employees simply eat the expense rather than spend an hour to get reimbursed. Sometimes it seems that legacy slowness is a cost-saving measure.
So what's wrong? Is the network, internet, web server or database slow? What about those WAN optimizers, load balancers, proxy servers, sophisticated virtual host firewalls, single sign-on, Active Directory and other dependencies? What about "our own man-in-the-middle" devices that change highly theoretical things, playing with the mother nature of technology? When they work it's like a miracle, doubling effective bandwidth or better. When they don't, let's just say it's an incredibly complex problem. And now, with so much moving to the cloud and the emergence of Software Defined Networks (SDN), complexity, visibility and documentation will be a challenge. We already have a lot of difficulty capturing packets in the cloud, and SDNs are next. More features, more complexity.
Consider WAN optimization set up in Iraq, Afghanistan and the continental USA across satellite circuits. To work as advertised, WAN-optimized traffic must pass through the same equipment in both directions. That's rare, because complex networks have multiple paths, which breaks WAN, application and web optimization and negates the benefits. Ever wonder why the largest optimization vendor bought an open source analyzer? It takes a lot of analysis to perfect plug-and-play optimization as Riverbed has and, no doubt, Wireshark helps them get it right.
What does this slowness really cost an organization? Why are they willing to suffer to such a great degree?
Actually, they aren't willing to suffer: "Our organization buys only the finest, most sophisticated IT infrastructure from the biggest best-of-breed providers and hires the best and brightest" - but it's still slow at the hidden bottleneck. Identifying that bottleneck requires deep packet analysis and system documentation. Just as Riverbed needed Wireshark, enterprises need to invest in deep packet capabilities and in building protocol analysis expertise.
The IT industry grew so fast over the past 30 years that you could hear the giant sucking sound as it pulled non-technical people into the power corner offices of information management. The budgets were so enormous that organizations wanted finance people to run the show too. It's not always the case, but from what background did the CIOs, CISOs, VPs and Directors who make key decisions migrate? I've found some brilliant CIOs from non-technical backgrounds, but they need technologist buy-in on technical decisions.
Here's the result. Vendors figured out that convincing non-technical decision makers is easier than convincing a technologist. And the vendors have great relationship people to influence the people in those positions. What products do they buy and who do they hire? I read with interest that relationship building and other soft skills are 85% of success while technical skills account for only 15% of one's success. That means CIOs have to leverage that technical 15% to solve complex problems.
Entrepreneurs like Steve Jobs and Bill Gates have fallen out of fashion; few of today's big personalities are technologists. Where does that leave us when things are slow? What solutions are sought? A forklift upgrade sounds easy, but how many of those can be accomplished without overspending? And by the time the forklift upgrade occurs, accountability has been forgotten because the people have changed, so the cycle of costly change continues.
What's the bottom line? Technical people are rarely involved in the diagnosis and decisions to solve a problem. Vendors' feet are not held to the fire to make their products work well or last long. Why? Is root cause a focus of our IT culture? We may talk about it, but it is rarely sought diligently. Accurate root cause diagnosis means pinpoint spending, leading to a high return on investment.
Occasionally clients ask me to find root cause. After they get the indisputable root cause, sometimes it's not what they wished it to be.
A while back the CEO of a top-100 company was involved in an end-of-quarter critical problem. Their key systems were severely degraded, affecting business productivity. I worked on this problem remotely for about two weeks, analyzing their impacted transactions using deep packet analysis. Complex system and security architecture is a challenge even after 30 years of reverse engineering network and application environments, because documentation is often poor, inaccurate or completely missing. Effective problem resolution requires large-scale diagrams of the environment.
In this case the problem was found thanks to the assistance of smart technologists onsite. The culprit was a transparent optical network tap used to copy network streams to multiple security devices. We found a place in their data center where packets were going missing between two central devices, and since the tap was passive and transparent there were no changes in address or hop count to give us much to analyze. The technologist I was working with was tenacious and, after seeing the logic, went into the lights-out data center and followed the physical cables between the systems, finding the tap. Bypassing the tap made everything start working fine. You'd think a celebration would be in order. Although happy the ordeal was over, the engagement sponsor was not pleased that the diagnosis implicated the tap, which was his department's area of responsibility.
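One way to picture "packets missing between two central devices" is to compare what was captured on either side of a segment. The sketch below is not the procedure used in this engagement; it assumes each capture has already been reduced to a list of per-packet keys (here hypothetical IP ID and TCP sequence number pairs) that a passive device would leave unchanged.

```python
from typing import Hashable, Iterable, List


def missing_downstream(upstream: Iterable[Hashable],
                       downstream: Iterable[Hashable]) -> List[Hashable]:
    """Return packets seen at the upstream capture point but not downstream."""
    seen = set(downstream)
    return [pkt for pkt in upstream if pkt not in seen]


# Hypothetical (ip_id, tcp_seq) keys extracted from two capture points.
capture_a = [(101, 1000), (102, 2460), (103, 3920), (104, 5380)]
capture_b = [(101, 1000), (103, 3920)]

lost = missing_downstream(capture_a, capture_b)
loss_rate = len(lost) / len(capture_a)

print(f"{len(lost)} of {len(capture_a)} packets never arrived ({loss_rate:.0%})")
```

If packets vanish between two points and nothing on the documented path explains it, the remaining suspect is something undocumented on the wire, which is why tracing the physical cables was the step that mattered.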
How do you report up to the CEO that the root cause of a two-week problem affecting major systems of a multi-billion-dollar organization was your responsibility? A dose of humility and personal integrity was in order, hard as it may be. I admittedly delivered my report with as many reasonable doubts and other anecdotal problems as I could, leaving the final conclusions to them.
Here are some thoughts:
- Believe that technology can be made to work. Consider using more brain cells than budget. Buying new stuff results in new, different problems. If the team can't fix this technology, what makes us think they will be able to fix new systems?
- Know that without accurate packet flow diagrams, things take longer, system architecture suffers and only a few people can participate meaningfully. Maintain and print diagrams for disaster recovery. At the Pentagon after 9/11, our biggest challenge was working without diagrams because the online documentation system was damaged by the airplane.
- Be kind to technologists. Complexity from client to server, network, server and all points between requires constant learning, making it impossible to know everything.
- Collaboration requires real work and strategy. Consider using "Six Thinking Hats" to remove some ego from the process while still embracing critical thinking.
- Be respectful of management; they are making decisions about technology and trusting technologists and vendors. Tell them what you don't know and speak up on what you do.
- Be open to any diagnosis, or to several at once.
In summary, look for optimization opportunities, document systems well, perform deep packet analysis to identify root cause, avoid ego in interdepartmental problem resolution and speak up on what you know.
Bill Alderson