Software is Ultimately a Probabilistic Game
Hey Jon, I was the stage manager at the big stage in Budapest where you gave a talk. I remember you because I asked if you'd be interested in signing my notebook alongside the other speakers, and you were happy to do so.
Since then I reached out on LinkedIn, you shared your e-mail address with me, and now here I am, writing to ask if you'd be happy to accept an e-mail interview invitation for my Medium blog and for an article on LinkedIn.
Previously, I interviewed Bob Messerschmidt, who designed the heart rate monitor for the Apple Watch, and Professor John Martinis, who led the world's first quantum-supremacy demonstration. As of today I have 54 stories hosted on Medium, all of which are attempts to inspire. (You can find my stories by typing Patrick Pallagi Medium into the search engine of your preference.)
For this interview I prepared One Question for you, and I'd like to think it may inspire you to create and write something that in turn inspires me, my friends, and the people your answer will get to touch.
What could be that one question? Here it is with the introduction:
In this interview with Jon Moore, we’ll be talking about the ethos and best practices of fault protection in software design, drawing close analogies to the work of Robert D. Rasmussen at the Jet Propulsion Laboratory of the California Institute of Technology.
Jon, you really find great ways to inspire people! I read the paper you highlighted in your talk in Budapest earlier this year, intending to find analogies I could ask you more about, and I soon realized that this paper, Guidance, Navigation, and Control Fault Protection Fundamentals, is not simply full of gems and diamonds; it is one BIG UNCUT DIAMOND.
It is super fun to read if you’re looking for inspiration, and my intention is to highlight certain parts of it and have you reflect on them in as many or as few words as you like. So without further ado, let’s begin!
My One Question for you, Jon, is about the decisions that lead to “making things work, even when things aren’t right, which is really the essence of engineering.”
When the paper first focuses on Architectural Integrity, the author writes:
“In the resulting confusion of ad hoc solutions, fault protection systems can become overly complicated, difficult to understand or analyze,
capable of unforeseen emergent behaviours (usually for the worse), impossible to test thoroughly, and so brittle that the suggestion of even trivial change is enough to raise alarms.
These problems are all signs of lost architectural integrity: the absence of conceptual grounding and regular patterns of design that embody and enforce fundamental principles of the discipline. In the worst of circumstances, where these issues are not under control, fault protection may even be detrimental to the reliability of the system in defiance of its very purpose.”
In my understanding, a fault in computer software is seldom the result of the decisions that led to its creation.
I would say, and perhaps you’d agree, that a fault in computer software is instead the result of a lack of decisions leading up to that point.
At the same time, we all know that the speed of shipping code matters a great deal for iterative design and fast testing: the faster we can ship new code, the faster we can test it.
However, I take it that if somebody sends a device into outer space, iterative design practices may not apply, since shipping is then costly and often very hard to repeat.
I’d be interested in your opinion: what is the model, in that case, by which one could think about what it means to be ready to launch and to feel confident about it?
Thank you Jon!
Software is ultimately a probabilistic game: we can never be fully sure that our software is bug-free, whether those bugs are straight-up implementation bugs (like off-by-one errors) or errors in our understanding of the problem domain (such as encountering a situation in production that we did not expect during design).
Even Donald Knuth ended up having bugs he did not expect in his programs!
Therefore, the name of the game is reducing the probability that we've missed something--i.e. increasing our confidence that our software is correct enough. There are many techniques for trying to find bugs in our design, from automated testing (of which there are many variants, such as fuzz testing or generative testing) to code review to formal verification to private beta releases, etc. The trick is that there are diminishing returns to all of these techniques: each additional one we add is likely to find fewer and fewer bugs as we go along.
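To make one of those techniques concrete, here is a minimal sketch of generative (property-based) testing using Python's Hypothesis library. The function last_n and the property are my own hypothetical example, not something from the paper or the talk; last_n contains a deliberate off-by-one bug of the kind mentioned above, and the test states a property that should hold for any input.

from hypothesis import given, strategies as st

def last_n(values, n):
    """Intended behaviour: return the last n items of values."""
    # Deliberate off-by-one bug: the slice starts one position too late.
    return values[len(values) - n + 1:]

@given(st.lists(st.integers()), st.integers(min_value=0, max_value=100))
def test_last_n_returns_the_right_count(values, n):
    # Property: we get exactly n items, unless fewer than n exist.
    assert len(last_n(values, n)) == min(n, len(values))

Run under pytest, Hypothesis generates many random inputs and, when the property fails, shrinks the failure down to a minimal counterexample (here, a single-element list with n = 1), which is exactly the kind of case a handful of hand-picked tests can miss.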
The end result is that we find as many bugs as we can afford. We may be limited by an upcoming launch deadline, or the amount of time we're willing to allocate to testing vs. development (although my experience suggests that we grossly underestimate the time we end up saving by doing testing when considered against the maintenance lifetime of the software). We then consider how costly an unknown bug would be if it surfaces after release.
For an Internet-hosted service where we have a CI/CD pipeline and can push changes easily, the cost of an unknown bug is much smaller than for a deep space probe where the bug renders the probe unusable.
The author of the paper suggests several guidelines for designing a system that either reduce the cost of testing (for example, as I discussed in my talk, a stateless control function is easier to test than a stateful one) or reduce the likelihood of incorrectly modelling reality (e.g. making sure we focus on possible states of the system rather than just expected or desired states of the system).
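To illustrate that first point (this is my own hypothetical thermostat example, not a design from the paper or the talk): a stateless control function takes everything it needs, including any "memory" such as the previous command, as explicit inputs, so each test is a single call with no hidden history to set up.

from dataclasses import dataclass

@dataclass(frozen=True)
class Reading:
    temperature_c: float   # current measured temperature
    heater_on: bool        # heater command issued on the previous cycle

def heater_command(r: Reading, low: float = 18.0, high: float = 22.0) -> bool:
    """Stateless controller: the next command is a pure function of its inputs."""
    if r.temperature_c < low:
        return True               # too cold: turn the heater on
    if r.temperature_c > high:
        return False              # warm enough: turn the heater off
    return r.heater_on            # inside the hysteresis band: hold state

# Tests reduce to a table of (input, expected output) pairs.
assert heater_command(Reading(15.0, False)) is True
assert heater_command(Reading(25.0, True)) is False
assert heater_command(Reading(20.0, True)) is True

A stateful version that kept the previous command inside a controller object would force every test to drive it through a sequence of calls just to reach the situation of interest, which is precisely the extra testing cost being avoided.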
But ultimately, coming up with a comprehensible design for a solution to the problem at hand that, while necessarily incomplete, is good enough for our purposes is the essence of software engineering, and why we all still have jobs!
Hope that helps!
Jon