Resilience Over Perfection: Life Lessons from Distributed System Design
Ryan Plant
Hands-on Cloud Technology Executive | Chief Architect | Software and Systems Engineer | Team Builder
Software engineering, like any field I suppose, is full of interesting metaphors. Some of these metaphors come from history and are applied to software, while others originate in software and are applied to life. Software design largely models life and real-world constructs, so this shouldn't be a surprise.
Personal Context
I've always been a sucker for metaphors and other figures of speech like parables, analogies, and allegories. For me it started with The Old Man and the Sea, continued with Dickens' works and Shakespeare, was heightened further by Moby Dick, and so on. About nine years into my career as a Software Engineer, I was forever influenced when I heard the brilliant Pat Helland present his Metropolis metaphor on service orientation. From then on, I realized the power of metaphors and analogies as a means to communicate complex and deep concepts in a manner approachable to the masses. As a true mentor, Pat Helland endowed me with a tool to which I credit much of my current career success. (Thank you Pat!)
Distributed System Design
Lately, I have been studying and researching distributed system design in depth. I have been a practicing distributed systems engineer since 2003, but in today's world of technology the scalability requirements of systems continue to grow. Thus I am always looking for patterns, practices, frameworks, and toolkits from the industry to expand my repertoire of innovative solutions to vexing business problems. Most recently I have been immersing myself in functional languages, starting with Scala and complementing it with Akka, a framework that implements the Actor model. In my study I have gained an appreciation for the Reactive Manifesto and the simplicity and succinctness with which it conveys important principles of system design. Within the context of distributed system design and the reactive principles that enable it, I have found a rich set of metaphors that can be applied to life; of the many upon which I could expound, I will focus on the futility of perfection and the pragmatism of resilience.
Forget Perfection
Perfection in software engineering is a mirage. Good engineers have a deep thirst for attaining perfection in their craft - it's part of what motivates them on an ever-lengthening walk of improvement. However, many become disillusioned, burned out, and cynical when they learn, through many failures and compromises, that it doesn't exist. There is no such thing as perfection in software or in the human condition, so don't beat yourself up for not attaining it. There are bugs in our genetic code, our thinking processes, our personalities, attitudes, and so on. There are unexpected inputs that result in unhandled exceptions and error conditions that come from sources outside of our control or anticipation, causing crashes and seemingly infinite loops of malaise that can only be interrupted by a reboot. Despite all we can do, these imperfections are omnipresent and inevitable. Such is software and the pursuit of its creation. Such is life and the pursuit of perfection. So what is the solution?
Strive for Resilience
In software as in life, failure is inevitable. Avoid it as best you can, but instead of letting the fear of it keep you from shipping (code, that is), learn how to tolerate failure and recover from it quickly. This is arguably one of the most aspirational characteristics of a good system design: resilience. Is this far off from life? Through eons of time and varying stages of evolution, life has learned the survivalist trait of adaptation and profound resilience in the face of doom. Part of being a resilient system is capturing as much salvageable context surrounding the failure as possible so that it can be learned from, whether the failure is our own or someone else's. Yes, it is important to recover, but if the reason for the failure cannot be established and used as the basis for adaptation, then failure is bound to strike again; being able to tolerate failure doesn't excuse you from taking steps to avoid it in the future. The following are some points of advice to guide you in designing for resilience.
Don't go it alone
I can't think of a single, truly resilient system without peers or counselors. Like distributed systems with peer nodes grouped in a quorum of like purpose/interest, guided and directed by a superior, we as individuals need companionship and direction. Making connections with others is a vital part of our ability to work through our limitations and be resilient. Family and friends are obvious candidates, but they aren't always the ideal choices as they may not share common goals and interests; in some cases, those whom we call our kin can be the cause of the issues that bring us down and/or the reason we can't recover or adapt. Some separation is required, which is why support groups and interest groups led by available and respectable moderators and experts are a good alternative. In short, find those whom you can look up to as role models and could possibly engage as a mentor or sponsor; join groups of peers who share the same goals and interests and use the strength of numbers to improve your resiliency.
Communicate
In a distributed system, communication between peers and coordinators is critical in ensuring the overall system is resilient. When a member of the group is suffering, it should be known immediately so the situation can be handled. The interaction between members of the group is what makes or breaks the overall integrity of the system and enables each member to share the burden. In life, once you've found your place in a social group of aligned interests, be sure to establish and leverage the proper communication protocol. Sharing your state of mind, health, and emotion is important in enabling trusted peers and counselors to know when and how to help. In a good, resilient group there is typically a stand-out, an appointed member who engages in a special form of communication that elicits more intimate details of the state of an individual. Having this relationship is ideal because it represents a special role in which information can be disclosed in a trusted setting with full discretion and privilege apart from others. Life partners, mentors, and professional counselors are excellent leaders to elect as the special member of your group with whom you can check in and from whom you can get direction. For those fellow systems engineers: I hope you can grok the analogy.
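For the systems engineers, the "check-in" idea can be sketched as a heartbeat monitor. This is only a minimal illustration under stated assumptions: the class name HeartbeatMonitor, the peer names, and the five-second timeout are all hypothetical, and real systems (Akka's cluster tooling, for example) use considerably more sophisticated failure detection.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class HeartbeatMonitor:
    """Tracks the last time each peer checked in (hypothetical helper)."""

    timeout_seconds: float
    last_seen: Dict[str, float] = field(default_factory=dict)

    def check_in(self, peer: str, now: float) -> None:
        # A peer reports that it is alive at time `now`.
        self.last_seen[peer] = now

    def suspected(self, now: float) -> List[str]:
        # Peers that have been silent for longer than the timeout.
        return sorted(
            peer
            for peer, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )


# Illustrative usage with made-up peers and timestamps (in seconds).
monitor = HeartbeatMonitor(timeout_seconds=5.0)
monitor.check_in("alice", now=0.0)
monitor.check_in("bob", now=0.0)
monitor.check_in("alice", now=4.0)
print(monitor.suspected(now=7.0))  # ['bob']
```

The point of the sketch is that silence is itself a signal: a peer that stops checking in is noticed by the group without having to announce that it is in trouble.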
Keep a journal
A key to resilient design is the ability to learn from failure so that it can be avoided in the future.
Those who cannot remember the past are condemned to repeat it. - George Santayana
How does a system record its past (i.e. history)? In some durable record usually called a log. As the system operates, it records important events in a medium that humans or machines can read and transfer. That way, in the event of a failure, a retrospective can identify the issues that led up to the failure condition and, more importantly, become the means by which the system learns to avoid or better handle the situation in the future. The goal here is to learn and improve beyond basic resiliency. As humans, how do we provide such a utility? A journal. Many people scoff at the prospect of keeping a journal, but think about where we would be if no one kept one. Our records of history, exploration, scientific discovery, philosophy, religion, et cetera would simply not exist. Without a durable record, what is there to go on that can be studied, collaborated upon, and ultimately learned from? Practically speaking, a journal can at a minimum help you remember things that are crucial in helping you and your chosen intimates understand the events or triggers that contributed towards a failure condition. To conclude this point, a journal is something that gains value over time, because nothing can be a better source of wisdom than experiences and reflections captured for future generations to learn from. It's immensely valuable and irreplaceable, much like authentic historical data for the purposes of predictive analysis.
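On the system side, the idea of a log plus a retrospective can be sketched in a few lines. This is a minimal illustration, not a production design: the Journal class, its method names, and the sample events are hypothetical, and the entries are kept in memory here for brevity, whereas a real log would be written to durable storage.

```python
import json
from typing import Dict, List


class Journal:
    """Append-only event log (in memory here; a real one would be durable)."""

    def __init__(self) -> None:
        self._lines: List[str] = []

    def record(self, timestamp: float, level: str, message: str) -> None:
        # One JSON object per line, readable by both humans and machines.
        entry = {"ts": timestamp, "level": level, "msg": message}
        self._lines.append(json.dumps(entry))

    def retrospective(self, failure_ts: float, window: float) -> List[Dict]:
        # Entries recorded in the `window` seconds leading up to the failure.
        entries = [json.loads(line) for line in self._lines]
        return [e for e in entries if failure_ts - window <= e["ts"] <= failure_ts]


# Illustrative usage with made-up events.
journal = Journal()
journal.record(1.0, "INFO", "started")
journal.record(8.0, "WARN", "disk nearly full")
journal.record(10.0, "ERROR", "write failed")

recent = journal.retrospective(failure_ts=10.0, window=5.0)
print([e["msg"] for e in recent])  # ['disk nearly full', 'write failed']
```

The retrospective surfaces the warning that preceded the error, which is exactly the kind of context that turns recovery into learning.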
Become a leader
As mentioned before, distributed systems have multiple contributors or peers, each available to perform some task on behalf of the whole. However, there needs to be one among the peers that can stand out and serve as the leader, coordinator, or master. The process of determining a leader is a fascinating discussion in itself, but to summarize, the peer group often elects the leader and the leader serves. (For the truly curious, allow me to point you to the wonderful world of consensus algorithms, namely Paxos.) In the human aspect of these admittedly loose and mixed metaphors, aspiring to serve your peers is one of the best ways to solidify a sense of duty and overall commitment to being resilient, because it makes you responsible for instilling and promulgating the means of resiliency not only for yourself but for the entire system of peers. With the humility that comes from joining others in their pursuit of resiliency and seeking guidance from a leader, combined with a journal of your own personal history and introspection, you will be well positioned to serve, and in so doing will inevitably serve yourself.
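To make "the peer group elects the leader" a little more concrete, here is a deliberately simplified sketch of a single round of majority voting. It is not Paxos and it ignores everything that makes real consensus hard (message loss, retries, competing rounds); the function name elect_leader and the node names are hypothetical.

```python
from collections import Counter
from typing import Dict, Optional


def elect_leader(votes: Dict[str, str]) -> Optional[str]:
    """Return the candidate backed by a strict majority of voters, or None.

    `votes` maps each voting peer to the candidate it supports.
    """
    if not votes:
        return None
    candidate, count = Counter(votes.values()).most_common(1)[0]
    # A quorum here means more than half of all voting peers.
    return candidate if count > len(votes) // 2 else None


# Illustrative usage: five peers, three of which back "n2".
votes = {"n1": "n2", "n2": "n2", "n3": "n1", "n4": "n2", "n5": "n3"}
print(elect_leader(votes))  # n2

# A split vote produces no leader, so the group would have to vote again.
print(elect_leader({"n1": "n1", "n2": "n2"}))  # None
```

Requiring a strict majority is the design choice that matters: two different candidates can never both win the same round.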
Confirming Amdahl's Argument
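To conclude, I will use another computing analogy, roughly applied: Amdahl's Law. Strictly speaking, the law says that the overall gain from improving one part of a system is limited by the portion of the system that is left unimproved. I summarize it loosely by stating that a system can only be improved to the extent of its weakest part. In other words, to use a common paraphrase: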
A chain is only as strong as its weakest link. - Thomas Reid, Essays on the Intellectual Powers of Man
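For readers who want the arithmetic behind Amdahl's Law, the standard formula is speedup = 1 / ((1 - p) + p / s), where p is the fraction of the work that can be improved and s is how much faster that fraction becomes. A small sketch follows; the function name amdahl_speedup and the sample numbers are illustrative only.

```python
def amdahl_speedup(improvable_fraction: float, factor: float) -> float:
    """Overall speedup when `improvable_fraction` of the work runs `factor` times faster."""
    if not 0.0 <= improvable_fraction <= 1.0:
        raise ValueError("improvable_fraction must be between 0 and 1")
    if factor <= 0:
        raise ValueError("factor must be positive")
    unimproved = 1.0 - improvable_fraction
    return 1.0 / (unimproved + improvable_fraction / factor)


# Illustrative numbers: 90% of the work is made 10x faster.
print(round(amdahl_speedup(0.9, 10), 2))  # 5.26

# Even a million-fold improvement of that 90% stays below 10x overall,
# because the untouched 10% sets the ceiling.
print(amdahl_speedup(0.9, 1_000_000) < 10.0)  # True
```

The untouched fraction plays the role of the weakest link: no amount of strengthening elsewhere lifts the whole past the limit it imposes.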
As a larger system or society of individual peers, we are invariably linked. Whatever the scope you wish to apply (familial unit, team, workgroup, organization, division, company, country, and so on), our ability to succeed is based on what we do to support ourselves and others in the pursuit of resiliency.