On Architects, Architecture, and Failures
Let’s consider two things:
1.) Bad things happen to good people
2.) Architects are people
Ergo, bad things happen to good architects.
In other words, at some point, no matter how much effort you and your team put into designing resilient, high-performing, well-architected systems, something is going to blow up spectacularly and make you look silly. Call it Murphy's Law.
Why things fail
When we design systems, we usually try to do it well. We write good code, we write test cases, and we follow frameworks and best practices. All of these things are under our control (and even they aren't always bulletproof). However, the problem comes in when things are not under our control, or when we don't even consider the possibility that something can go wrong (the unknown unknowns).
A couple of examples of why things fail:
Ultimately, if we look beyond the code that we interact with, there's a ton of complexity under the surface – from the hardware that something runs on (yes, that's there, even if you are in the cloud) to operating systems, containers, virtual machines and runtime environments, networks, etc.
Consider the code below.
System.out.println("Hello World!");
A multitude of things have to come together in order for it to print some characters to a console. Now, think beyond "Hello World!" to distributed enterprise systems with multiple components, produced by multiple parties, running on multiple different tech stacks.
As such, we should be wary of overconfidence in our own ability – systems are far from trivial, even if they are “simple” systems.
How to make it less painful when things fail
In order to make it easier to deal with failure, we should first accept that failure at some point is pretty much inevitable. Once you’ve made peace with this and it becomes a nagging concern in the back of your mind when you design something, you can start looking past some of your blind spots.
Do not make things more complex than they need to be
Systems are complex already, so if you design something, consider whether it can be simplified (while still being fit-for-purpose). Unnecessary complexity increases both the likelihood of failure (due to more moving parts) and the difficulty involved in trying to fix a failure. This is a good point to plug in a reminder of Kernighan's Law.
“Everyone knows that debugging is twice as hard as writing a program in the first place. So if you’re as clever as you can be when you write it, how will you ever debug it?”
One of the dangers here comes in the form of resume-driven development. Sure, the shiny tech/framework/approach will look great on your CV, but is it actually necessary?
Microservices have lots of benefits, but if your employee leave tracking system will only ever have 50 users, does "LeaveService + EmployeeService + HolidayService + orchestration + all the overhead that goes with it" really give you any meaningful benefit over "LeaveTrackingSystem.jar"?
There’s already a problem to solve, so be aware of the essential complexity and try to avoid creating additional problems through accidental complexity.
Detect failures early
Since there are many moving parts to a system, see if you can find a way to detect issues early.
For code issues, the first place to catch problems is in your automated testing process. So make sure you have CI/CD pipelines and decent unit and integration test coverage (we can debate what "decent" means, but it's definitely not 100%). This will tell you if something obvious breaks before you put it in production.
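As a minimal sketch of the kind of check such a pipeline runs on every commit, here is a hypothetical JUnit 5 unit test; LeaveCalculator and its allowance rules are invented purely for the example.

import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertThrows;

import org.junit.jupiter.api.Test;

// Hypothetical leave calculator, used purely to illustrate a CI-friendly unit test.
class LeaveCalculatorTest {

    @Test
    void calculatesRemainingLeave() {
        LeaveCalculator calculator = new LeaveCalculator(21); // 21 days annual allowance
        assertEquals(16, calculator.remainingDays(5));        // 5 days already taken
    }

    @Test
    void rejectsNegativeDaysTaken() {
        LeaveCalculator calculator = new LeaveCalculator(21);
        assertThrows(IllegalArgumentException.class, () -> calculator.remainingDays(-1));
    }
}

// Minimal implementation so the example is self-contained.
class LeaveCalculator {
    private final int annualAllowance;

    LeaveCalculator(int annualAllowance) {
        this.annualAllowance = annualAllowance;
    }

    int remainingDays(int daysTaken) {
        if (daysTaken < 0) {
            throw new IllegalArgumentException("Days taken cannot be negative");
        }
        return annualAllowance - daysTaken;
    }
}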
Once your application is deployed, you also start caring about all the other moving parts. This is where tools like log aggregation systems and infrastructure monitoring become critical. There are loads of fancy commercial offerings, but I've also worked on a team where we had our own set of monitoring tools – nothing complex, but enough to let us know if a process didn't kick off when it was supposed to, so that someone could intervene. This is the canary in your coal mine.
What is key, regardless of what kind of tooling you use, is to make sure that failures are visible as soon as they happen. In the pre-pandemic days, a big screen in the office was a great way to do that. Now that WFH has become the norm, messages from the tooling to Slack or Teams might be a better option. Nonetheless, if responses from a service start taking longer than expected or if a database goes down, you want someone to know about it immediately.
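As a rough sketch of that "canary" idea, the snippet below checks a health endpoint and posts to a chat webhook when something looks wrong. The URLs, timeouts, and message format are placeholders, and in practice this responsibility usually sits in dedicated monitoring tooling rather than hand-rolled code.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Very small "canary": check a health endpoint and shout in a chat channel if it fails.
public class HealthCanary {

    // Placeholder URLs - substitute your own service and incoming-webhook addresses.
    private static final String HEALTH_URL = "https://leave-tracker.example.com/health";
    private static final String WEBHOOK_URL = "https://hooks.example.com/alerts";

    private static final HttpClient CLIENT = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(5))
            .build();

    public static void main(String[] args) throws Exception {
        HttpRequest healthCheck = HttpRequest.newBuilder(URI.create(HEALTH_URL))
                .timeout(Duration.ofSeconds(5))
                .GET()
                .build();
        try {
            HttpResponse<String> response = CLIENT.send(healthCheck, HttpResponse.BodyHandlers.ofString());
            if (response.statusCode() != 200) {
                alert("Health check returned HTTP " + response.statusCode());
            }
        } catch (Exception e) {
            alert("Health check failed: " + e.getMessage());
        }
    }

    // Post a plain-text alert to a chat webhook so the failure is visible immediately.
    private static void alert(String message) throws Exception {
        HttpRequest post = HttpRequest.newBuilder(URI.create(WEBHOOK_URL))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString("{\"text\":\"" + message + "\"}"))
                .build();
        CLIENT.send(post, HttpResponse.BodyHandlers.discarding());
    }
}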
Gather as much information as you can
Once you know that something has failed, you want as much information as possible to track down that failure.
There are a couple of ways of doing this.
Make it easy to isolate a failure
Your approach to design can be used to isolate failures. Think of high cohesion and low coupling, as well as single responsibility for different components.
If you use a layered architecture and you apply coding standards that are clear on separation of concerns and what belongs in each layer, your errors and exceptions alone will provide some context as to where something went wrong. Database connection issues? Go check the data access layer. Business-specific exceptions? Go check the service layer.
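To make that concrete, here is a small, entirely hypothetical sketch of layer-specific exception types; the class names are invented, but the point is that the exception type alone tells you which layer to start debugging in.

// Thrown by the data access layer - a connection or query problem points you at infrastructure.
class DataAccessException extends RuntimeException {
    DataAccessException(String message, Throwable cause) {
        super(message, cause);
    }
}

// Thrown by the service layer - a business rule was violated, so look at the logic and inputs.
class LeaveBalanceExceededException extends RuntimeException {
    LeaveBalanceExceededException(String message) {
        super(message);
    }
}

class LeaveService {
    void requestLeave(int daysRequested, int daysRemaining) {
        if (daysRequested > daysRemaining) {
            // The exception type alone narrows down where to look when this shows up in a log.
            throw new LeaveBalanceExceededException(
                    "Requested " + daysRequested + " days but only " + daysRemaining + " remain");
        }
        // ... persist the request via the data access layer, which would wrap driver-specific
        // errors in DataAccessException rather than leaking them upwards.
    }
}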
If you depend on different services and some are not critical, how do you prevent a failure in one of those components from bringing down your entire solution? If a third-party service provides you with a list of public holidays (let's say that is non-critical to your leave tracking system) and it becomes inaccessible for a while, do you really want to render your entire user interface unusable, or do you want to display a message saying something like "we can't display holidays right now" but leave everything else in a working state? The latter option definitely feels preferable in my mind.
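A minimal sketch of that fallback, assuming a hypothetical HolidayClient wrapping the third-party call: catch the failure at the integration point and return a degraded-but-usable result instead of letting the exception take the whole page down.

import java.util.Collections;
import java.util.List;

// Hypothetical client for the third-party public-holiday service.
interface HolidayClient {
    List<String> fetchPublicHolidays(int year);
}

class HolidayService {

    private final HolidayClient client;

    HolidayService(HolidayClient client) {
        this.client = client;
    }

    // Non-critical dependency: if it fails, degrade gracefully instead of failing the whole page.
    List<String> publicHolidays(int year) {
        try {
            return client.fetchPublicHolidays(year);
        } catch (RuntimeException e) {
            // Log it, show "we can't display holidays right now" in the UI,
            // and keep the rest of the leave tracker working.
            return Collections.emptyList();
        }
    }
}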
Alternatively, if you have a set of third parties that you depend on for some or other service (let's say a delivery partner for your e-commerce system), do you really want an outage on their side to knock your business out and prevent you from generating revenue? In this scenario, isolate yourself from failures on the other side by putting an asynchronous messaging layer in between. Assuming that it's not time-critical, your solution can put a message on a queue and something else can process that request once the other party is available again.
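The sketch below fakes that idea with an in-memory BlockingQueue purely to show the shape; in a real system the queue would be durable messaging infrastructure (RabbitMQ, SQS, Kafka, or similar) sitting between you and the delivery partner.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative only: in production a durable message broker plays this role,
// so an outage at the delivery partner doesn't block order capture.
public class DeliveryRequestQueueDemo {

    record DeliveryRequest(String orderId, String address) {}

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<DeliveryRequest> queue = new LinkedBlockingQueue<>();

        // The e-commerce side: accept the order, enqueue the delivery request, and move on.
        queue.put(new DeliveryRequest("ORDER-1001", "1 Example Street"));
        System.out.println("Order accepted; delivery request queued.");

        // A separate consumer (possibly much later, once the partner is reachable again)
        // picks the request up and forwards it.
        DeliveryRequest request = queue.take();
        System.out.println("Forwarding " + request.orderId() + " to the delivery partner.");
    }
}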
Make it easy to recover from failure
What can be a particularly difficult failure to recover from is the kind of failure that brings down an entire environment and requires it to be recreated. This is easier to work around in cloud-based environments, where things like auto-scaling groups can automatically bring up new instances. You also want to make sure that you have a disaster recovery (DR) strategy in place.
Much of this hinges on making sure that it is actually easy to recreate your environments – infrastructure-as-code (IaC) makes this straightforward. Having someone set up servers by hand means that it takes much longer to recreate an environment, with an increased likelihood of human error and more time-consuming debugging. Think cattle, not pets.
What to do after things have failed
Once something has failed, don't just frantically rush to fix it. Do a root cause analysis and figure out exactly why it failed. Then, put something in place to prevent the same kind of failure from happening again.
If it was a NullPointerException in your code, add a unit test to make sure your code can deal with it if it happens again.
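As a tiny, hypothetical example of that kind of regression test: suppose the original bug was a null employee name blowing up a greeting on the dashboard; the test below pins down the now-expected behaviour.

import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.api.Test;

// Regression test for a (hypothetical) NullPointerException that slipped into production:
// a null employee name used to blow up the greeting on the leave dashboard.
class GreetingFormatterTest {

    @Test
    void fallsBackToGenericGreetingWhenNameIsMissing() {
        assertEquals("Hello!", GreetingFormatter.greet(null));
    }

    @Test
    void greetsByNameWhenPresent() {
        assertEquals("Hello, Riaan!", GreetingFormatter.greet("Riaan"));
    }
}

// The fix itself: handle null explicitly instead of dereferencing it.
class GreetingFormatter {
    static String greet(String name) {
        if (name == null || name.isBlank()) {
            return "Hello!";
        }
        return "Hello, " + name + "!";
    }
}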
If a server runs out of disk space and your application crashes, allocate more disk space, put a cleanup job in place to clean out old files, and make sure you have monitoring to alert you when disk space runs low again.
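A minimal sketch of such a cleanup job, assuming the application writes disposable files to a known directory and that anything older than 30 days can safely be deleted; the path and retention period are placeholders.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.FileTime;
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.stream.Stream;

// Simple scheduled cleanup: delete files older than 30 days from an exports directory.
// Directory and retention are assumptions for the example - adjust to your own system.
public class OldFileCleanup {

    public static void main(String[] args) throws IOException {
        Path directory = Paths.get("/var/app/exports");
        Instant cutoff = Instant.now().minus(30, ChronoUnit.DAYS);

        try (Stream<Path> files = Files.list(directory)) {
            files.filter(Files::isRegularFile)
                 .filter(file -> isOlderThan(file, cutoff))
                 .forEach(OldFileCleanup::deleteQuietly);
        }
    }

    private static boolean isOlderThan(Path file, Instant cutoff) {
        try {
            FileTime lastModified = Files.getLastModifiedTime(file);
            return lastModified.toInstant().isBefore(cutoff);
        } catch (IOException e) {
            return false; // If we can't read the timestamp, leave the file alone.
        }
    }

    private static void deleteQuietly(Path file) {
        try {
            Files.delete(file);
            System.out.println("Deleted " + file);
        } catch (IOException e) {
            System.err.println("Could not delete " + file + ": " + e.getMessage());
        }
    }
}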
Also, make sure you learn from failures. Have retrospectives, document the outcomes, and make sure that knowledge is distributed to the rest of your team and your organization. Failure isn't always cheap, so make sure you don't have to pay for the same lesson multiple times.
In summary – accept that architectures are not infallible, and be prepared to deal with failures when they happen. If you have any other thoughts on the topic, let me know in the comments!