Reactive Architecture & COVID-19 response : a novel essay on the mutual influences

Or rather, I should title this blog essay, "How are we (the human civilization) fighting nCOVID-19 by applying reactive patterns?"


As I start, let us accept a fact,

We do not live in a perfect world !!!
It is based on fallacies.
Deny them and Fail !!!
Or
Accept and Address them !!!
and you may succeed !!!

Amid this COVID-19 pandemic crisis, our distributed-yet-interconnected world is struggling to stay 'responsive'. With the Observer pattern applied, we in this interconnected world are subjected to several "stream sources of events". Each 'world-event' emitted over these streams represents some old-yet-new pattern of problem resolution that is being re-applied in this evolving new world. I believe that,

Every crisis, every failure, teaches us something new !!! and has potential to evolve us into better human beings.

I am an observer - a dependent sink - for these world-events. As I continue my quest to build better, #resilient, #reactive systems in the midst of this crisis, I have gone back to my learnings from last year and drawn new parallels. Today, I want to present a few similarities between designing reactive systems (on one hand) and the responses of human societies to the Corona outbreak (on the other).


Since the crisis is largely about how to stay resilient, I will confine my focus to the resiliency aspect of the Reactive Manifesto.


Let me start with an excerpt borrowed from the manifesto, with my own bold underscoring applied,

"...We want systems that are Responsive, Resilient, Elastic and Message Driven. We call these Reactive Systems.

Systems built as Reactive Systems are more flexible, loosely-coupled and scalable. This makes them easier to develop and amenable to change. They are significantly more tolerant of failure and when failure does occur they meet it with elegance rather than disaster. Reactive Systems are highly responsive, giving users effective interactive feedback..."

Doesn't the above introduction sound even more relevant now, in the current world context?

Optimistically speaking, in this real world, while dealing with the common issue of the virus spread, all acting governments, NGOs, societies and organizations are working harder (or pretending to) in their own realms of disparate geographies and influence. Each one of them is independently discovering and applying patterns - like preventing the cascading outbreak, social distancing, isolation, quarantining, latency controls, patient supervision, proactive monitoring, faster triaging, timely signaling, rapid diagnosis, failure tracing and externalized system recovery (with ventilators and symptomatic treatments, for now!!!) - to build resiliency in this coming-to-a-standstill, ever more tightly-coupled human society.

Okay !!! Okay !!! We are more tightly-coupled ... only in terms of travel, trade, commerce and physicality, and definitely not in terms of emotions.

Irrespective !!! One can always argue that many of these applied patterns look the same and have already been recognized individually several times before (perhaps centuries ago - read: during the past #plague, #spanishflu and #smallpox pandemics.)


Regardless of which side of argument you are on, the immediate, shared goal for all of us is,

to have our family units, neighborhoods, cities, societies, countries and world-overall stay robust, resilient, yet flexible and better positioned to deal with this crisis.

After all, we are all in this together !!! Don't you think that the above-mentioned goal is equally relevant while building and operationalizing reactive systems, which operate on a distributed, interconnected infrastructure (read cloud, read internet, read network, read cluster)?

Let me start addressing the drawn parallels in a Q&A form. Starting with the primary question,

#1. What does 'building a Resilient system' mean ?

Simply put, it means that the system is designed to stay responsive in the face of failure. This applies not only to highly-available, mission-critical systems: any system that is not resilient will be unresponsive after a failure.

In other words, Resilience is also the ability of a system to handle unexpected situations,

  • Without the user noticing it (best case)
  • With a graceful degradation of service (worst case)

For what it is worth, the million-dollar question is,

Are we really resilient as a society ?

Well! It all depends on how responsive we stay as a whole. Partial breakdowns are okay, graceful degradation of human activities is acceptable, but are we trending towards a complete shutdown - a full outage? While the story is still developing, I will let my readers answer this question from their own vantage points spread across space and time.

For now, let me take a step back and delve into,

#2. What exactly is a 'failure' ?

First and foremost, a Failure is not an error.

Rather,


A failure is an unexpected event within a service or component that prevents it from continuing to function normally. On top of that, a failure will generally prevent responses to the current, and possibly all following, client requests. This is in contrast with an error, which is an expected and coded-for condition - for example, an error discovered during input validation, which will be communicated to the client as part of the normal processing of the message (the mighty 400 HTTP status code.)

So !!! You can contemplate (in your own self-imposed isolation) whether the nCOVID-19 outbreak is an error OR a failure.

Moreover, failures are unexpected and will require intervention before the system can resume at the same level of operation. This does not mean that failures are always fatal, rather that some capacity of the system will be reduced following a failure. On the other hand, Errors are an expected part of normal operations. They are dealt with immediately and the system will continue to operate at the same capacity following an error.
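
To make the distinction concrete, here is a minimal Python sketch (my own illustration, not taken from the manifesto). The names `parse_age`, `handle_request`, `lookup` and `failure_log` are hypothetical: a validation problem is an expected error answered with a 400, while anything unexpected is treated as a failure, recorded for an outside supervisor, and answered with a 503.

```python
class ValidationError(Exception):
    """An expected, coded-for condition: an 'error' in the sense used above."""


def parse_age(raw):
    # Input validation: bad input is anticipated, so it is an error, not a failure.
    if not raw.isdigit():
        raise ValidationError(f"age must be a whole number, got {raw!r}")
    return int(raw)


def handle_request(raw_age, lookup, failure_log):
    """Return an (HTTP status, body) pair for one request.

    `lookup` stands in for any downstream dependency; `failure_log` stands in
    for whatever channel signals failures to a supervisor (both hypothetical).
    """
    try:
        age = parse_age(raw_age)
    except ValidationError as exc:
        # Error: communicated to the client as part of normal processing.
        return 400, str(exc)
    try:
        return 200, lookup(age)
    except Exception as exc:
        # Failure: unexpected, so it is signaled outward instead of being
        # silently patched over inside the component.
        failure_log.append(repr(exc))
        return 503, "service unavailable"
```

After an error the service carries on at full capacity; after a failure someone outside the component has to look at `failure_log` and intervene.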

And yes !!! I think the nCOVID-19 effect is in fact a failure, one that we could not have predicted and coded for.

Although we can continue debating for and against, with the advantage of hindsight I am very firm on my take.

Why do I say so ? Because,

  1. Firstly, this virus outbreak is a #blackswan event, which no probabilistic model could have predicted in advance.
  2. Secondly, the outbreak is for sure preventing us from functioning normally. For that matter, it has reduced our individual, institutional and economic productivity as well as capacity.
  3. Finally, the only way we can now recover (fast) and resume the same level of operation is by having an external master entity intervene on our behalf - by developing a recovery-oriented treatment and by providing us with a vaccine. On top of that, society will also need this supervising delegated-master to administer the vaccination to each and every failing and non-failing entity, with unprecedented assertion and reach.

Till we find one such master (or one reveals itself), all of you, hunker down !!! and stay isolated, while still staying connected.

By the way, in the context of distributed system design, some examples of failures are hardware malfunction, processes terminating due to fatal resource exhaustion, program defects that result in corrupted internal state, a running Docker container failing, a Kubernetes pod getting into the "CrashLoopBackOff" state, a Kubernetes cluster node crashing, an Akka cluster supporting a single service facing a network partition, a cloud availability zone facing a partial or full power outage, a cloud region shutting down, etc.

#3. How is 'Resilience' achieved ?

Resilience is achieved by focusing on three things,

  1. Isolation - of components from each other, and Containment - of failure in each component
  2. Replication - ensuring high-availability and improving chances of a request getting served.
  3. Delegation - of supervisory functions like - monitoring, failure detection, failure escalation and recovering a component from failure - to another external component

Let me elaborate, starting with the first two,

#4. What is 'Isolation' (aka social distancing) ?


Isolation can be defined in terms of decoupling, both in time (sender and receiver do not have to be present at the same time for communication) and space (sender and receiver do not have to run in the same process, even over an application's lifetime.)
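
As a small illustration of decoupling in time, here is a Python sketch of my own, with hypothetical names (`mailbox`, `sender`, `receiver`). The sender finishes before the receiver even exists, and the messages still arrive because a queue sits between them. A message broker would give the same decoupling across processes and machines; an in-process queue is used here only to keep the sketch self-contained.

```python
import queue
import threading

mailbox = queue.Queue()  # the only thing sender and receiver share


def sender():
    # Drops its messages into the mailbox and leaves; it never waits for a reader.
    for i in range(3):
        mailbox.put(f"event-{i}")
    mailbox.put(None)  # sentinel marking the end of the stream


def receiver(out):
    # Drains the mailbox whenever it happens to start running.
    while True:
        message = mailbox.get()
        if message is None:
            break
        out.append(message)


sender()  # the sender has completely finished here

received = []
worker = threading.Thread(target=receiver, args=(received,))
worker.start()  # the receiver only shows up now
worker.join()
```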


Why Isolation and Containment ?

Ask yourself, why Social distancing and quarantining ... and you will get your answer.

Of its many benefits, true Isolation (aka social distancing and quarantining) also gives us compartmentalization and containment of failures, allowing failures to be captured, signaled and managed at a fine-grained level instead of letting them cascade to other components. In short, it,

  1. ensures that the system does not fail as a whole;
  2. avoids cascading of failures.

One positive effect of Isolation worth noting: since it enables loose coupling, it can also lead to systems that are easier to understand, extend, test and evolve.

How is Isolation achieved ?

Strong isolation between components is built on communication over well-defined (asynchronous) protocols and enables loose coupling. It is generally achieved by applying the following four patterns,


  • Use of Bulkheads (aka quarantine chambers),
Use semaphores, processes and thread pools. Deploy in different VMs, instances, containers or processes. Isolate through distinct sets of queues. Assign each client a separate service instance.


  • Ensuring Complete Parameter checking (wear masks, cover your mouth while sneezing and coughing) : Follow Postel's law,
Be conservative in what you send, be liberal in what you accept.


  • Shedding load (with social distancing, reducing traffic and mass-transit movement, limiting in-flights)
Avoid getting overloaded with too many requests. Install a gatekeeper that applies rate-limiting policies, or use a bounded queue to maintain back-pressure.


  • Latency controls (taking vitals proactively, identifying outliers, failing fast and implementing curfews to prevent a cascading effect.)
Use circuit breakers and timeouts to detect and handle late responses. This avoids cascading temporal failures and thus assures QoS.
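
Two of the patterns above can be sketched in a few lines of Python. This is a simplified illustration under my own assumptions, not a production library: the class names `Bulkhead`, `CircuitBreaker` and `Rejected` are hypothetical, the breaker is not thread-safe, and timeouts are left out. The bulkhead caps how many calls may be inside one compartment and sheds the rest; the circuit breaker fails fast once a dependency has failed several times in a row.

```python
import threading
import time


class Rejected(Exception):
    """Raised when a call is shed or short-circuited instead of executed."""


class Bulkhead:
    """Caps concurrent calls into one compartment and sheds load when full."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args):
        # Non-blocking acquire: a full compartment rejects instead of queueing.
        if not self._slots.acquire(blocking=False):
            raise Rejected("bulkhead full, shedding load")
        try:
            return fn(*args)
        finally:
            self._slots.release()


class CircuitBreaker:
    """Opens after `max_failures` consecutive failures, then allows one
    trial call once `reset_after` seconds have passed."""

    def __init__(self, max_failures, reset_after, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, fn, *args):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise Rejected("circuit open, failing fast")
            self._opened_at = None  # half-open: let one trial call through
        try:
            result = fn(*args)
        except Exception:
            self._failures += 1
            if self._failures >= self.max_failures:
                self._opened_at = self._clock()
            raise
        self._failures = 0  # a success closes the circuit again
        return result
```

Wrapping each downstream dependency in its own `Bulkhead` and `CircuitBreaker` keeps one sick dependency from exhausting the callers of all the others, which is the quarantine chamber and the curfew of the analogy.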


#5. What is 'Replication'?

Executing a component or service simultaneously in different places is referred to as replication.


As we face shortages of all kinds of essential items - like ventilators, masks, and cleaning wipes - here is a lesson learned,

Why just in China ?

Have your essential services and manufacturing supported in-house in the USA, in Costa Rica, in India, in Africa - in multiple availability zones. Place orders with multiple units simultaneously and maintain a supply chain from all the zones. Above all, maintain a cache to respond to a sudden surge in demand for essential products.



Ironically, the virus also replicates (that bloody RNA) - to scale, sustain and grow !



Why replication for resilience ?

Apart from offering scalability, replication also offers resilience, where the incoming workload is replicated to multiple instances which process the same requests in parallel. It improves the chances of a request getting served, in spite of one or more replicated workload instances failing, thus hiding the failures from the view of end-user.

When you think of resilience ...
Do think of redundancy !!!

The replication approaches towards ensuring scalability and ensuring resiliency can be mixed, for example, by ensuring that all transactions pertaining to a certain user of the component will be executed by two instances while the total number of instances varies with the incoming load.


How is replication achieved ?

  1. It can mean executing the same workload on different threads or thread pools, processes, containers, Kubernetes pods, network nodes, availability zones, cloud regions, or data centers.
  2. Building some active-active redundancy (at multiple levels of the execution and deployment model) into your system can help in achieving resilience.
  3. Remember, I am not suggesting a load-balancing pattern here; what I am suggesting instead is the same request going to multiple (more than one) actors. Thus, from an instrumentation perspective, you could bring in the fan-out and quickest-response messaging patterns for latency control: fan out a single request to multiple actors, workers, services or pod instances, wait for the quickest response, use it, and discard all other responses. Following this pattern lowers the probability of high latency. But you will already have noticed a trade-off here between wasted resources and latency. Something we can live with in this cost-optimized world.
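
The fan-out and quickest-response idea can be sketched with Python's standard `concurrent.futures` module. This is a minimal illustration under my own assumptions: `replica` and `quickest_response` are hypothetical names, and the replicas are simulated with `time.sleep` rather than being real services.

```python
import concurrent.futures as cf
import time


def replica(name, delay):
    # Stand-in for one replicated instance; `delay` simulates its latency.
    time.sleep(delay)
    return f"answer from {name}"


def quickest_response(calls, timeout=None):
    """Send the same request to every replica and return the first success.

    `calls` is a list of (function, args) pairs, one per replica.
    """
    pool = cf.ThreadPoolExecutor(max_workers=len(calls))
    futures = [pool.submit(fn, *args) for fn, args in calls]
    try:
        for future in cf.as_completed(futures, timeout=timeout):
            if future.exception() is None:
                return future.result()
        raise RuntimeError("all replicas failed")
    finally:
        # Do not wait for the slower replicas; their results are discarded.
        # Their work still runs to completion: the wasted-resources trade-off.
        pool.shutdown(wait=False)
```

For example, `quickest_response([(replica, ("zone-a", 0.3)), (replica, ("zone-b", 0.01))])` returns the answer from `zone-b`, and the slower `zone-a` result is simply dropped.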

#6. What is Delegation?

Delegating a task asynchronously to another component means that the execution of the task will take place in the context of that other component. This delegated context could entail running in a different error-handling context, on a different thread, in a different process, in a different pod, or on a different network node, availability zone or region, to name a few possibilities.

Delegation also comes into play in performing supervisory duties, be it continuous monitoring, failure detection, failure escalation or recovering from a failure.

When you think of the medical staff, of government bodies and NGOs making conscious attempts to prevent the spread, or of research staff working to find a cure - think of them as a "delegated authority", trying to assure resiliency and continuity by constantly monitoring for issues, handling outbreak situations with curfews and quarantine zones, escalating matters as appropriate with the use of ventilators, and working in the background to find a cure for the recurring failure.
You would have noticed some governments calling in reserves to perform essential functions ranging from traffic management to food distribution. All of the above reflects how delegation works in real life, amid a failure scenario.

Why Delegation for Resilience ?

The purpose of delegation is to hand over the processing responsibility of a task to another component so that the delegating component can perform other processing or optionally observe the progress of the delegated task in case additional action is required such as handling failure or reporting progress.

Delegating some failure-recovery responsibilities to another player, or to a set of overarching external players (say, from the control plane), plays a pivotal role in assuring resiliency by adding time and information to handle failures. (Remember those arguments made in support of flattening the curve.)

Delegation gives the failing unit time to either recover on its own OR be recovered by another supervisor, while the core responsibilities are still being addressed in parallel.

In the no-failure case, the delegating authority could, in parallel, gather resources, secure information and perform other supervisory functions, while the core responsibility is being worked on by the delegated unit.

How does Delegation work ?

In the case of an issue in a component, without much deliberation, the control is transferred to another similar component that is running in a completely different context. In case of multiple failures, the control can get transferred to a supervisory actor (a member of the control-plane), who can reassign the task to another entity and make attempts to recover the component facing the issue.

Delegation essentially involves,

  • Monitoring : It involves observing unit (service) behavior and interactions from the outside, thus creating the option of responding automatically to detected failures. It enables failure handling beyond the means of the single failing unit.

  • Failure Handlers : Having a supply of trained first responders, whose single responsibility is to go and fight failures, helps in taking timely next actions. As patients are identified, triage happens and specialists are called in.

  • Escalation : The ability of a system to distribute a work item to a resource or group of resources other than those it has previously been distributed to, in an attempt to expedite the completion of the work item, is a must to assure resiliency. Having multi-level hierarchies of escalation peers adds more time and information to handle failures.
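
The three duties above can be sketched as a tiny supervisor hierarchy in Python. This is my own simplified illustration, loosely inspired by actor-style supervision and not the API of Akka or any other library; `Supervisor`, `Escalated`, `events` and the worker callables are all hypothetical. Each supervisor watches its own workers from the outside, records what it sees, and hands the work item to its parent when none of its workers can finish it.

```python
class Escalated(Exception):
    """Raised when the top of the hierarchy also fails to complete the item."""


class Supervisor:
    def __init__(self, name, workers, parent=None):
        self.name = name
        self.workers = workers  # callables that each try to process an item
        self.parent = parent    # next escalation level, if any
        self.events = []        # monitoring: failures observed from outside

    def handle(self, item):
        # Failure handling: give the item to each worker in turn.
        for worker in self.workers:
            try:
                return worker(item)
            except Exception as exc:
                self.events.append((self.name, repr(exc)))
        # Escalation: distribute the item to resources it has not been given to yet.
        if self.parent is None:
            raise Escalated(f"{self.name}: nobody left to escalate to")
        return self.parent.handle(item)
```

A local clinic that cannot treat a patient passes the case to the district hospital, which may pass it on again; every level adds time and information, and the patient (the client) never has to handle the failure personally.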

The best example of delegation from my experience is what comes with the Kubernetes orchestration solution. For instance,

When an application container hosted in a particular replicated pod (running on a k8s worker node) fails,

  1. the kubelet process running on that worker node detects the failure, either because the container exits or because the liveness probe configured for the deployment fails.
  2. The kubelet tries to recover the failing container by restarting it, and it tries again and again, waiting a little longer between attempts each time.
  3. When the restarts keep failing, the pod is reported in the "CrashLoopBackOff" state. The pod stays on its node, so this status is the signal to escalate the failure to an operator or a higher-level controller. If the pod itself is deleted or evicted, the ReplicaSet controller (a distinguished member of the control-plane clan) instantiates a replacement pod on any available and resourceful, maybe different, cluster node.

Similarly, when a worker node goes down,

  1. the control plane stops receiving the periodic heartbeats from the kubelet process supposedly active on the failed worker, and its node controller marks the node as not ready.
  2. After a grace period, the pods on that node are evicted, and the workload controller (for example, the ReplicaSet controller) spins up the required replacement pods on other available, resourceful worker nodes, thus enforcing the desired state.
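
The desired-state idea behind this recovery can be sketched as a toy reconciliation step in Python. To be clear, this is not the Kubernetes API or its real scheduling logic; `reconcile`, the pod names and the least-loaded placement rule are assumptions made purely for illustration, and the sketch only scales up, never down.

```python
import itertools


def reconcile(desired, nodes, pods):
    """Return a pod placement that moves toward `desired` replicas.

    nodes: dict of node name -> True if healthy
    pods:  dict of pod name -> node name it currently runs on
    """
    # Pods on failed or unknown nodes are considered lost.
    alive = {pod: node for pod, node in pods.items() if nodes.get(node, False)}
    healthy = sorted(node for node, ok in nodes.items() if ok)
    if not healthy:
        return alive  # nowhere to place replacements; the system stays degraded

    names = (f"pod-{i}" for i in itertools.count())
    while len(alive) < desired:
        # Place each replacement on the least-loaded healthy node.
        load = {node: 0 for node in healthy}
        for node in alive.values():
            load[node] += 1
        target = min(healthy, key=lambda node: (load[node], node))
        name = next(n for n in names if n not in alive)
        alive[name] = target
    return alive
```

Running the step again on its own output changes nothing, which is the point of desired-state management: the supervisor keeps comparing what should be with what is, and acts only on the difference.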

In summary,

In a well-designed resilient society, isolation is enforced, along with various latency-control patterns (bulk-heading - setting the borders, applying complete medical checking at check-posts, circuit breaking and shedding additional load, thus maintaining back-pressure on intake), by applying them at multiple levels of organization, thus truly isolating individual components and clusters from each other. Failures are forcefully contained (with quarantining) within each component and cluster, thereby ensuring that parts of the system can fail and recover without compromising the system and its capacity as a whole. Supervision (active monitoring for outbreaks, automatic and prompt response to detected outbreaks, and timely escalation) and recovery (aka building vaccine-driven immunity, and with ventilators) are delegated to another external component (aka the Government master), and high availability is ensured by simultaneous active-active replication of the essential functions across isolated clusters where necessary. Most importantly, the client of a component or service is not burdened with handling the failed component's failures.

And now think of what we are dealing with,

  1. The Failure Outbreak started in Wuhan, China.
  2. Compounded by a mix of issues like the lack of distributed failure logging, false signals, untimely responses, denials, censorship and the absence of fast-enough failure detection mechanisms, the effects of the outbreak started cascading, in and out, up and down, following the synchronous traffic routes of demand and supply. Soon other component clusters started failing and going unresponsive - aka Lombardy, New York, Spain, etc.
  3. Isolation and quarantine strategies got applied; however, the social distancing and virus containment plan failed due to 1) lack of individual discipline, 2) mercantile tight-coupling between failing entities, 3) many cluster-masters (read governments) not having any failure delegation and recovery strategy, and 4) the cascading economic recession that comes with a synchronized, JIT economy going unresponsive one part after the other.
  4. While humans are still highly available, somehow the sources of manufacturing are not replicated enough !!! (They are centralized in the same cluster, aka China, where the root of the problem lies.) So, as China fails, client economies like the USA and Europe are burdened with handling those failures.
  5. The Delegation plan is defunct. Recovery is far off, as the failure cause is new and the delegated masters are not responding with a well-laid-out plan covering failure monitoring, handling and recovery - the testing kits are not there, the testing takes time, ventilators are in short supply, the treatment is not there, the vaccine is not there.
  6. What else ..... !!! Resiliency is achievable, but it is now a game of catching up from this pit.

All said and done - in the midst of this outbreak, building resiliency on the fly, at a rapid pace, is not the best approach to handling failures. However, considering that we live in an imperfect world, striving for resiliency now may appear impractical, yet it is not impossible to attempt. At least !!! we can go all out to achieve some of the tenets, like containment and isolation, by following utmost discipline. At the same time, we can make every effort to build for replication and push human ingenuity to its limit to invent treatments, vaccines and solutions (at a rapid pace), supplied with unmatched intellectual capital and an unbreakable spirit of continuous exploration and failing fast.

Finally, Stay safe !!! Stay healthy !!! Stay connected ... That is what matters most !!! And yes, follow this picture for 20 seconds.


Namaste .... feedback comments are always appreciated (more than likes.)

Kalyan(Kal) Sambhangi

Technology Strategy I Data & AI | Digital Enablement|Cybersecurity & Resilience | Wharton CTO Alum

4 years ago

Enjoyed reading it Gaurav ... A system architects perspective ...it's all about resiliency during these times

Layonmai Sarma

VP Technology @ Retisio. Building Data Platform and Data Products for Retisio Commerce

4 years ago

Great correlation, Gaurav! Explained beautifully.

Shailendra Bade

Engineering Director | AI/ML - Financial/Fraud Risk | Gen AI/RAG |Drive Large Scale Distributed Platforms | Data Engineering & Privacy | "People" first - not "resources"

4 years ago

This is very novel take on Covid and reactive architecture correlation. As novel as the virus is. The insights are pretty strong drawing equivalence to medical fraternity researching the way out. This article is parallel to John Hopkins publication on covid - reactive architecture

Michael Read

Principal Consultant at Akka - Certified Reactive Akkatect

4 years ago

Great correlation between Reactive Architecture and our society! Well done Gaurav!
