IMPORTANCE OF RESILIENT ARCHITECTURE & MASSIVE MULTI PLAYER TESTING

You have 24 hours until your flight; you have finished all your client meetings and are packed and ready. You go to sleep knowing that you will wake up, have a good breakfast and head out to the airport for your return home. But it's a VUCA (volatile, uncertain, complex, ambiguous) world: you wake up to a massive outage on Microsoft Windows devices caused by a faulty update from CrowdStrike. Flights worldwide (or at least across a major part of the western world) were disrupted; systems couldn't cope with the failure in one part of the overall ecosystem, and everything ground to a halt.

Resilient Engineering and Resilient Architecture are the key to building multi-tower applications whose features are designed with failure in mind.

Resilient Architecture means building a modern business application that is loosely coupled by design, built on a microservices architecture, with high availability and failover embedded at its core. Many of the systems in operation today were built in the 70s and 80s and have been constantly "upgraded" (I am being sarcastic; "upgrade" is a stretch here) to stay compatible with the latest operating system and database versions. The underlying architecture is still monolithic, in the sense that each part is intricately dependent on another piece of software within the overall landscape, with no safety outlet or alternate strategy.

The focus has also been on testing to make sure these systems never go down. It has not been on the question: 'Imagine a part of your application landscape is down. Now how do you provide business continuity?'

Resilient architecture addresses that question. Your business application needs to incorporate failover, not just at the database level (to synchronize your data and roll back to the last commit), but true failover capabilities to switch to a completely different mode of operation.


Let me illustrate this with an equation. Imagine your passenger system comprises seven blocks (A to G). Now visualize that your systems all run on a hyperscaler's cloud services and all of them have best-in-class HA and DR enabled. So far your strategy seems very sound. You as CIO have also taken pains to ensure that you regularly test the High Availability and Disaster Recovery models. Your passenger onboarding is the result of A+B+C+D+E+F+G.

When all 7 steps complete, that is, when all 7 towers work, everything is fine. Resilient Architecture mandates that you be ready for a scenario in which some of the applications in (A+B+C+D+E+F+G) go down and your enterprise has to switch to a completely new operating model (albeit temporarily). True DR means you should be able to operate the services which are up while switching the rest to a failover mode of input, which could be manual, could be offline (updated in batch on a frequent basis), or could mean reverting to a different DB node on a different hyperscaler or in an in-house data center. The scenarios could run into hundreds and the solutions into thousands.
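To make the A to G idea concrete, here is a minimal Python sketch of onboarding that keeps running when some towers are down. Everything in it is an assumption for illustration: the `Tower` class, the `run_onboarding` function, the `queue_offline` fallback and the passenger ID are hypothetical names, not part of any real passenger system.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Tower:
    """One block of the A-G chain (hypothetical model, for illustration only)."""
    name: str
    primary: Callable[[str], str]
    fallback: Optional[Callable[[str], str]] = None


def run_onboarding(towers: List[Tower], passenger_id: str) -> List[Tuple[str, str]]:
    """Run every tower in order and record which mode each one used."""
    modes = []
    for tower in towers:
        try:
            tower.primary(passenger_id)
            modes.append((tower.name, "primary"))
        except Exception:
            # Broad catch keeps the sketch short; real code would be more specific.
            if tower.fallback is None:
                raise  # no escape route designed for this tower
            tower.fallback(passenger_id)
            modes.append((tower.name, "fallback"))
    return modes


offline_queue: List[str] = []  # stands in for "offline, updated in batch"


def healthy(passenger_id: str) -> str:
    return f"ok:{passenger_id}"


def down(passenger_id: str) -> str:
    raise ConnectionError("tower unavailable")


def queue_offline(passenger_id: str) -> str:
    offline_queue.append(passenger_id)
    return f"queued:{passenger_id}"


towers = [Tower(name, healthy, queue_offline) for name in "ABCDEFG"]
towers[2] = Tower("C", down, queue_offline)  # simulate tower C being down
towers[5] = Tower("F", down, queue_offline)  # simulate tower F being down

result = run_onboarding(towers, "PAX-001")
```

In this sketch towers C and F fail, yet onboarding still finishes: the other five run in primary mode while C and F queue their input for a later batch update. A tower without a fallback would still stop the whole chain, which is exactly the gap resilient design tries to close.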

Microservices architecture at the core means you design or redesign the entire system with loose coupling in mind, and maybe even design a very robust API economy to deal with scenarios like the ones mentioned above. Innovative design also compels us to design escape routes for when interlinked applications shut down while major applications remain up.

Massive Multiplayer testing: Testing remains the most ignored part of the enterprise, as is repeatedly demonstrated whenever an outage happens. Billions were lost to this outage, and I can only imagine the quantum of losses insurance companies and airlines will have to face as passengers need to be accommodated on alternative routes, hotels booked and paid for, and baggage lost and paid for or found and rerouted to its destination, not to mention the millions of productive hours lost by the business executives affected. Leisure travel becomes a pain and leaves you in a state of shock as you start your holiday with the CrowdStrike outage.

Why Multi Player testing? In today's interconnected world you don't control the entire landscape. An enterprise is increasingly dependent on its partners and suppliers to provide services: limos for its business class executives (which need integration with its passenger systems), catering companies which need to know the types of food preparation by flight number, by time and even by seat number (again integration), OTAs (for booking, rescheduling and cancellation of tickets) with access to your core booking engine, baggage handling services, airport systems. I can go on and on.

Multi Player testing brings together and simulates the different parts of the transaction being performed in real time, even as you orchestrate disruptive forces which bring down pieces of your A-G linear model. It will help enterprises incorporate Agile Testing and, more importantly, Agile Recovery, and learn from this war-games model of testing to build better systems.
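A war game of this kind can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the tower names A to G come from the example above, while `onboard`, `war_game`, the `max_failures` parameter and the set of towers that have a fallback are all hypothetical. It brings down every combination of up to two towers and reports the combinations that would halt onboarding.

```python
import itertools
from typing import List, Set, Tuple

TOWERS = "ABCDEFG"


def onboard(down: Set[str], fallbacks: Set[str]) -> str:
    """Classify the outcome when the towers in `down` have failed."""
    if not down:
        return "normal"
    if all(tower in fallbacks for tower in down):
        return "degraded"  # every failed tower has an alternate mode
    return "halted"        # at least one failed tower has no escape route


def war_game(fallbacks: Set[str], max_failures: int = 2) -> List[Tuple[str, ...]]:
    """Fail every combination of 1..max_failures towers; return those that halt."""
    halted = []
    for count in range(1, max_failures + 1):
        for combo in itertools.combinations(TOWERS, count):
            if onboard(set(combo), fallbacks) == "halted":
                halted.append(combo)
    return halted


# Assumption for the example: every tower except G has a fallback mode designed.
gaps = war_game(fallbacks={"A", "B", "C", "D", "E", "F"})
```

Here every halting combination involves tower G, which tells you where to design the next escape route. A real exercise would replace the toy `onboard` function with live partner and supplier systems, but the shape of the game is the same: inject failures on purpose and record what stops working.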

With the advent of GenAI, and AI in general taking up more and more components of service delivery from Design to Deliver, it is even more critical that as architects we demonstrate resilience from the very beginning and incorporate gaming into the testing phase to ensure business continuity. The cost of this is a fraction of what a single outage could cost the enterprise, as demonstrated by this recent event.

Credits - Paul Downey for the diagram

Vivek Sharma

Go-To-Market @ Sogeti part of Capgemini | Sales and Marketing

2 months

This is a very insightful post on the importance of resilience and testing in the era of AI and GenAI. Architects need to anticipate and mitigate the risks of potential outages and disruptions in our service delivery, especially when we rely on multiple external partners and suppliers. The idea of massive multiplayer testing sounds very appealing, as it simulates real-world scenarios and helps us identify and resolve any vulnerabilities or bottlenecks in our systems. I also like the suggestion of using microservices architecture and API economy to design more robust and flexible systems that can handle complex and dynamic interactions. Thank you for sharing your thoughts and experience!

Arif Mujawar

Digital Transformation Business Analyst | Sogeti - Capgemini

2 months

Insightful post, Balaji. The initial anecdote about the CrowdStrike outage sets the stage well for the importance of resilient architecture. I'm curious to know about the integration of GenAI for resilience planning, because AI can both introduce vulnerabilities and mitigate them through predictive analytics and automated recovery. Arun Sahu, Balaji Rajagopalan

Aruna Nagarajan

Sr Manager Enterprise Data & Analytics at Rogers Communications

2 months

Well articulated

Partho Ganguly

Program Manager | Cloud Modernization Lead

2 months

Being resilient is critical in today's dynamic digital landscape. It safeguards systems against disruption, and what a great example. You have summed it up with a great example of microservices with HA and DR. At times, a multi-cloud strategy could also be a good option!
