IT Outage Musings

The lessons from the IT outage

Gary Marcus, a favorite of mine, raised a pertinent point: if a simple update to regular, non-AI software can bring parts of the world to its knees, what will happen when super AI-powered systems go wrong?

Forget about mistakes like what happened with the CrowdStrike security software. Last weekend there was a scheduled downtime of HDFC Bank's IT systems, along with Axis Bank's migration of CitiBank credit card customers, lasting most of Saturday. I have HDFC debit and credit cards and, as a backup, a CitiBank credit card. I took my 5-year-old granddaughter to a children's arcade and realized that I could not pay by debit card, credit card, or Gpay, as even Gpay was linked to my HDFC Bank account. I scraped together some cash from my wallet, but it fell short of the arcade charges. Seeing the disappointment on the child's face, the attendant allowed us to enter with part payment, on a promise to pay the rest as soon as the bank systems were up.

We are completely dependent on technology. I experienced an internet shutdown while on holiday in Udaipur. There was a state government competitive exam for jobs that day and, as a precaution to prevent candidates from cheating, the government decided to shut down data services! It forced all of the tourists to stay indoors, as our over-dependence on mobile data meant we could not pay any vendors outside.

There is a lot of talk about Microsoft, and about how smart a state government in India was to migrate to Linux and open source apps 10 years back, and so on. This can happen to AWS and Google tomorrow. This is not a Microsoft issue.

The whole world is more or less dependent on the top 3 public cloud companies to provide computing and storage. The cloud companies have set up data centers in multiple locations, more for data security purposes, but this also acts as a contingency backup. They have millions of servers and no one cares if a few servers are down, as they keep adding servers to maintain capacity. It is like our human body. Cells keep dying and cells keep growing every second, and we stay alive!

The risk is in software. We have virtualized most of the hardware functions into software. I vividly remember a network hardware expert worrying about network virtualization and how 99.99999% resilience could no longer be guaranteed.
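To put the 99.99999% figure in perspective, here is a minimal sketch that converts an availability percentage into the downtime it allows per year. The function name and the 365-day year are my own assumptions for illustration; they do not come from any vendor's SLA.

```python
# Assumption: a non-leap year of 365 days.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds


def allowed_downtime_seconds(availability_percent: float) -> float:
    """Return the downtime per year, in seconds, that a given
    availability percentage leaves room for."""
    unavailable_fraction = 1 - availability_percent / 100
    return SECONDS_PER_YEAR * unavailable_fraction
```

By this arithmetic, 99.99999% leaves room for only about 3 seconds of downtime in a whole year, while 99.99% leaves room for roughly 52 and a half minutes.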

On top of it, to protect ourselves from hackers, we have also deployed security agents that are the first point of contact for anyone trying to reach the enterprise services. It is like building a massive palace with one small door: if the door is jammed, no one can enter the palace!

For many of us with experience in production support, the first item on the checklist is to ask whether any software patches or updates have happened recently. In most cases, we ask for a rollback of the changes and see if the system is up and running. This is the standard procedure for incident management: restore service first, and investigate the root cause for a fix later.

The security companies hire only ethical hackers, super coders who loathe checklists, processes and risk management. It is very clear that someone very smart in the security company thought this update was too small and too inconsequential to require any form of risk management. For example, they could have worked with Microsoft, put the update on one of the nodes, checked that all was ok for 10-15 minutes, and then deployed it to the other nodes one by one. The issue would then have been found on the first node itself and the impact would have been reduced. The role of QC and the production release processes at the company are also in question. Maybe for minor changes this company relies on the judgment of the super developer and bypasses all the QC processes applied to major updates.
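The node-by-node approach described above can be sketched roughly as follows. This is only an illustration: `staged_rollout` and the `apply_update`, `is_healthy` and `rollback` callbacks are hypothetical names I have made up, not part of any vendor's tooling, and `soak_seconds` stands in for the 10-15 minute observation window.

```python
import time
from typing import Callable, List, Sequence, Tuple


def staged_rollout(
    nodes: Sequence[str],
    apply_update: Callable[[str], None],   # hypothetical: pushes the update to one node
    is_healthy: Callable[[str], bool],     # hypothetical: health check after the update
    rollback: Callable[[str], None],       # hypothetical: restores the previous version
    soak_seconds: float = 0.0,             # e.g. 600-900 for a 10-15 minute watch
) -> Tuple[bool, List[str]]:
    """Update nodes one at a time; on the first unhealthy node,
    roll back every node touched so far and stop."""
    touched: List[str] = []
    for node in nodes:
        apply_update(node)
        touched.append(node)
        time.sleep(soak_seconds)  # observe the node before judging it
        if not is_healthy(node):
            # Undo in reverse order so the most recent change goes first.
            for done in reversed(touched):
                rollback(done)
            return False, touched
    return True, touched
```

With a bad update, the health check fails on the first node, so only that one node is ever touched and the rest of the fleet keeps running the old version.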

Why blame software? Blame humans. Blame the person who assessed the risk as zero and did the update, which affected so many enterprises worldwide and caused huge business and productivity losses.

Now getting back to Gary's worry: maybe AI-based systems will not make these judgement errors and will diligently follow QC procedures for even minor upgrades.

More later,

L Ravichandran

Hasan Suhail Siddiqui

Strategic Advisor - Career, Education, Social Welfare, Sustainability. Certified Professional Coach. Certified for DEI at Workplace.

4 months ago

This must be the reaction of all '80s techies. How can an update move into a production environment without going through a QC test environment? Why did rollback not happen? In spite of so much advancement and AI-led vigilance, the basics failed. Maybe we grey-haired folks should be invited to do QA on these "cloudy" platforms!!
