System Outages: An Outsider's Inside View
The more I think about what it means to be prepared in a lights-out situation, the more I think about what we do here on the ARM team and how we protect a business. ARM stands for Availability (HA replication), Recover (DRaaS), and Migrate. At first, I thought of business continuity and the general preparedness of IT to support a business in an outage or disaster. Then I started to think about how outages have affected me in my daily life. Think about it: when was the last time you went to a store and the card reader was offline, or their computers were down (code for "I have no idea what's going on"), and they could only take cash or, even worse, could not transact business at all?
My family and I went to a ski area where we live, and the entire system for reading member IDs and lift tickets was not working. It would read some cards while others did not work at all. It looked like their payment processing systems were not working either, and on a busy weekend to boot. They wrote down card numbers, the old-school way of processing purchases (the carbon paper method). The lines were long, customers were frustrated, and people were leaving before they could pay, just setting what they were going to buy on a shelf and walking out. Then there was getting onto the mountain. Because everything is now digital, the gates at the lift needed many employees to let pass holders through, without knowing whether each person's pass was valid.
All I could think about was what could have been done to prevent this event from happening. These are the things I think about every day as a veteran of more than twenty-two years in the storage, data protection, business continuity, and general IT operations industry. There is a lot to unpack here, so let's get started.
First, I looked at this scenario from the outsider's perspective (me, in this case) and then at what the business has to manage: how the consumer was affected, how the company was affected, and the risk. From there, you work your way back through the chain of events to understand the impact of an outage end to end. I do not know precisely what happened in this particular case, but it sparked this thought.
In a labor market that has been decimated by the pandemic, bringing in more people to help alleviate customer frustration over offline card readers or systems is a challenge. This factor is an outlier, because it will ultimately change once we have a strong and healthy retail employment market that is not strapped for employees. Still, the risk of a pandemic is something to take back to the business and factor in. Now, how are the employees affected? They are working longer, more arduous hours to accommodate the offline systems. On top of that, many people left their items somewhere other than where they picked them up and walked out, meaning someone has to put that inventory back on the proper shelf. The result is unhappy customers, frustrated employees, products that may go missing, and lost revenue.
Breakdown of what I saw:
1. Systems down or offline
2. Employee problems and added stress
3. Customer frustration
4. Lost revenue
   a. Potential for people who did not have passes to get past the gate check
   b. People who walked out with a product they did not purchase
   c. People who did not buy anything because they saw there were issues
What I thought about after my initial assessment was what failed, where, and when. Was it the network, a server, ransomware, a cloud service, or maybe an application problem? It was the new year, and COVID Omicron was in full swing, so I'll give them the benefit of the doubt, but one cannot help but wonder what the primary holdup was. The outage we experienced happened right after the new year, and we were told the issue would not be addressed until Monday. So now, in my mind, the question was, "Where is the IT on-call or the person responsible for this system?" This question is why being the outsider who knows how the inside works can sometimes be a bit frustrating. Most of the folks I saw that day looked frustrated, so I was no different.
My mind went… ALL OVER THE PLACE… It is funny how we can always set off a chain reaction of plausible scenarios in our minds, and mine started with Information Technology (IT). IT is responsible for these systems and, most importantly, the mountain's IT operations and all that comes with them: card readers, ski-lift card readers, computers used for sales, and many other operational things we don't see. So what could have happened?
Based on my observation, my first thought was a network issue, since some things were operational while others were not. For example, they could ring you up on the computer but not run credit cards. The card readers at the lift worked for some folks and not others, so they had people with badges letting just about everyone onto the mountain. It looked like they had a decent contingency plan, because they were not turning people away; things just slowed down, and lines were longer than usual. Parts of their systems were up and online while others seemed to be down, so the network could have played a part, but I am not sure that is where the breakdown happened.
Since I was still not satisfied, I had to dig deeper, as if I were the IT person responsible for this operation. My mind immediately went to how many network connections there were and how failover was set up for those networks. I could not help but think about how the applications could have been automated to handle a network outage or other failure within the system. Sometimes there are synchronization issues where the server doing the processing reaches out to the cloud for updates. Maybe part of the issue was that decoupled systems could not get the updated user information. It was probably more than just a network outage or server issue. Whatever the cause, their systems were degraded that day, and while the business stayed operational, it did not create the best experience and most certainly impacted revenue.
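To make the idea of an application that tolerates a network outage a little more concrete, here is a minimal sketch of a gate check that falls back to a local cache when the cloud lookup is unreachable. Everything in it is hypothetical: I do not know how the ski area's systems are built, and the names (`GateValidator`, `remote_lookup`, `max_cache_age_s`) and the 24-hour cache window are assumptions made purely for illustration.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class CachedPass:
    valid: bool        # last known answer from the remote service
    fetched_at: float  # when that answer was recorded (seconds)


@dataclass
class GateValidator:
    """Hypothetical lift-gate check with an offline fallback."""
    # Assumed callable that asks a cloud service whether a pass is valid
    # and raises ConnectionError or TimeoutError when it cannot be reached.
    remote_lookup: Callable[[str], bool]
    max_cache_age_s: float = 24 * 3600  # assumed policy: trust the cache for a day
    clock: Callable[[], float] = time.time
    cache: Dict[str, CachedPass] = field(default_factory=dict)

    def check(self, pass_id: str) -> str:
        try:
            valid = self.remote_lookup(pass_id)
        except (ConnectionError, TimeoutError):
            # Network or cloud outage: use the last known answer if it is fresh.
            entry = self.cache.get(pass_id)
            if entry is None or self.clock() - entry.fetched_at > self.max_cache_age_s:
                return "manual-check"  # hand off to a person at the gate
            return "allow-cached" if entry.valid else "deny-cached"
        # Normal path: remember the answer for the next outage.
        self.cache[pass_id] = CachedPass(valid, self.clock())
        return "allow" if valid else "deny"
```

With something along these lines, a pass scanned earlier in the season would still open the gate during an outage, and only unknown or stale passes would need a person to step in. The trade-off is that a pass revoked during the outage could still be accepted until the cache entry ages out, which is a risk the business would have to weigh.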
See, because I work with highly available solutions powered by Double-Take, I frequently think about always-on operation, you know, 24/7/365, and what it means to be down, offline, or, even worse, in a disaster.
This thought process took me down the path of DR (Disaster Recovery) and BCDR (Business Continuity and Disaster Recovery). DR means a lot to many people, and Business Continuity is key to the business staying online and operational in the event of an issue, outage, or disaster. If systems were not getting the data they needed to operate correctly, where did the data transfer break down, and were there offline servers as a result of the outage? The bottom line is that having a good BCDR plan, solutions, and people to back that plan is the difference between success and failure. The business is undoubtedly resilient, given how they addressed the immediate problem. Still, in the improvement phase, it should be noted how the outage affected the bottom line of the business. That way, the proper attention can be given to the overall framework for handling outages.
In closing, the business would need to assess the risks of an outage on a busy holiday weekend and which systems or people failed. With the proper orchestration, tools, and people, outages and downtime can be greatly reduced. This is precisely what OpenText's ARM solutions team can help with. In addition to replication and DR/HA, we also have security, monitoring, and backup options in our business group to help round out the entire IT side of BCDR. For more information, please reach out to [email protected], and together we can discuss how our team can help your organization.
Principal Consultant at OpenText
2 years ago: Knowing how the process works for functional DR and Availability, then seeing what a partial or full failure event looks like from the business side that did not have a functional plan, is always a cringe moment for me. The cost of a functional plan would have been much cheaper than the results of a business event like the one you described, Matt.
Sr. Business Development Executive, at Amazon Business
2 years ago: How many of you can relate...