On Architects, Architecture, and Failures
Let’s consider two things:
1.) Bad things happen to good people
2.) Architects are people
Ergo, bad things happen to good architects.
In other words, at some point, no matter how much effort you and your team put into designing resilient, high-performing, well-architected systems, something is going to blow up spectacularly and make you look silly. Call it Murphy's Law.
Why things fail
When we design systems, we usually try to do it well. We write good code, we write test cases, and we follow frameworks and best practices. All of these things are under our control (and even they aren't always bulletproof). However, the problem comes in when things are not under our control, or when we don't even consider the possibility that something can go wrong (the unknown unknowns).
A couple of examples of why things fail:
Ultimately, if we look beyond the code that we interact with, there's a ton of complexity under the surface – from the hardware that something runs on (yes, that's there, even if you are in the cloud) to operating systems, containers, virtual machines and runtime environments, networks, etc.
Consider the code below.
System.out.println("Hello World!");
A multitude of things have to come together in order for it to print some characters to a console. Now, think beyond "Hello World!" to distributed enterprise systems with multiple components, produced by multiple parties, running on multiple different tech stacks.
As such, we should be wary of overconfidence in our own ability – systems are far from trivial, even if they are “simple” systems.
How to make it less painful when things fail
In order to make it easier to deal with failure, we should first accept that failure at some point is pretty much inevitable. Once you’ve made peace with this and it becomes a nagging concern in the back of your mind when you design something, you can start looking past some of your blind spots.
Do not make things more complex than they need to be
Systems are complex already, so if you design something, consider whether it can be simplified (while still being fit-for-purpose). Unnecessary complexity increases both the likelihood of failure (due to more moving parts) and the difficulty involved in trying to fix a failure. This is a good point to plug in a reminder of Kernighan's Law.
“Everyone knows that debugging is twice as hard as writing a program in the first place. So if you’re as clever as you can be when you write it, how will you ever debug it?”
One of the dangers here comes in the form of resume-driven development. Sure, the shiny tech/framework/approach will look great on your CV, but is it actually necessary?
Microservices have lots of benefits, but if your employee leave tracking system will only ever have 50 users, does "LeaveService + EmployeeService + HolidayService + orchestration + all the overhead that goes with it" really give you any meaningful benefit over "LeaveTrackingSystem.jar"?
There’s already a problem to solve, so be aware of the essential complexity and try to avoid creating additional problems through accidental complexity.
Detect failures early
Since there are many moving parts to a system, see if you can find a way to detect issues early.
For code issues, the first place to catch problems is in your automated testing process. So make sure you have CI/CD pipelines and decent unit and integration test coverage (we can debate what "decent" means, but it's definitely not 100%). This will tell you if something obvious breaks before you put it in production.
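As a minimal sketch of the kind of check such a pipeline runs on every commit, here is a hypothetical JUnit 5 unit test; LeaveCalculator and its allowance rules are invented purely for the example.

import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertThrows;

import org.junit.jupiter.api.Test;

// Hypothetical leave calculator, used purely to illustrate a CI-friendly unit test.
class LeaveCalculatorTest {

    @Test
    void calculatesRemainingLeave() {
        LeaveCalculator calculator = new LeaveCalculator(21); // 21 days annual allowance
        assertEquals(16, calculator.remainingDays(5));        // 5 days already taken
    }

    @Test
    void rejectsNegativeDaysTaken() {
        LeaveCalculator calculator = new LeaveCalculator(21);
        assertThrows(IllegalArgumentException.class, () -> calculator.remainingDays(-1));
    }
}

// Minimal implementation so the example is self-contained.
class LeaveCalculator {
    private final int annualAllowance;

    LeaveCalculator(int annualAllowance) {
        this.annualAllowance = annualAllowance;
    }

    int remainingDays(int daysTaken) {
        if (daysTaken < 0) {
            throw new IllegalArgumentException("Days taken cannot be negative");
        }
        return annualAllowance - daysTaken;
    }
}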
Once your application is deployed, you also start caring about all the other moving parts. This is where tools like log aggregation systems and infrastructure monitoring become critical. There are loads of fancy commercial offerings, but I've also worked on a team where we had our own set of monitoring tools – nothing complex, but enough to let us know if a process didn't kick off when it was supposed to, so that someone could intervene. This is the canary in your coal mine.
What is key, regardless of what kind of tooling you use, is to make sure that failures are visible as soon as they happen. In the pre-pandemic days, a big screen in the office was a great way to do that. Now that WFH has become the norm, messages from the tooling to Slack or Teams might be a better option. Nonetheless, if responses from a service start taking longer than expected or if a database goes down, you want someone to know about it immediately.
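As a rough sketch of that "canary" idea, the snippet below checks a health endpoint and posts to a chat webhook when something looks wrong. The URLs, timeouts, and message format are placeholders, and in practice this responsibility usually sits in dedicated monitoring tooling rather than hand-rolled code.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Very small "canary": check a health endpoint and shout in a chat channel if it fails.
public class HealthCanary {

    // Placeholder URLs - substitute your own service and incoming-webhook addresses.
    private static final String HEALTH_URL = "https://leave-tracker.example.com/health";
    private static final String WEBHOOK_URL = "https://hooks.example.com/alerts";

    private static final HttpClient CLIENT = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(5))
            .build();

    public static void main(String[] args) throws Exception {
        HttpRequest healthCheck = HttpRequest.newBuilder(URI.create(HEALTH_URL))
                .timeout(Duration.ofSeconds(5))
                .GET()
                .build();
        try {
            HttpResponse<String> response = CLIENT.send(healthCheck, HttpResponse.BodyHandlers.ofString());
            if (response.statusCode() != 200) {
                alert("Health check returned HTTP " + response.statusCode());
            }
        } catch (Exception e) {
            alert("Health check failed: " + e.getMessage());
        }
    }

    // Post a plain-text alert to a chat webhook so the failure is visible immediately.
    private static void alert(String message) throws Exception {
        HttpRequest post = HttpRequest.newBuilder(URI.create(WEBHOOK_URL))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString("{\"text\":\"" + message + "\"}"))
                .build();
        CLIENT.send(post, HttpResponse.BodyHandlers.discarding());
    }
}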
Gather as much information as you can
Once you know that something has failed, you want as much information as possible to track down that failure.
There are a couple of ways of doing this.
Make it easy to isolate a failure
Your approach to design can be used to isolate failures. Think of high cohesion and low coupling, as well as single responsibility for different components.
If you use a layered architecture and you apply coding standards that are clear on separation of concerns and what belongs in each layer, your errors and exceptions alone will provide some context as to where something went wrong. Database connection issues? Go check the data access layer. Business-specific exceptions? Go check the service layer.
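To make that concrete, here is a small, entirely hypothetical sketch of layer-specific exception types; the class names are invented, but the point is that the exception type alone tells you which layer to start debugging in.

// Thrown by the data access layer - a connection or query problem points you at infrastructure.
class DataAccessException extends RuntimeException {
    DataAccessException(String message, Throwable cause) {
        super(message, cause);
    }
}

// Thrown by the service layer - a business rule was violated, so look at the logic and inputs.
class LeaveBalanceExceededException extends RuntimeException {
    LeaveBalanceExceededException(String message) {
        super(message);
    }
}

class LeaveService {
    void requestLeave(int daysRequested, int daysRemaining) {
        if (daysRequested > daysRemaining) {
            // The exception type alone narrows down where to look when this shows up in a log.
            throw new LeaveBalanceExceededException(
                    "Requested " + daysRequested + " days but only " + daysRemaining + " remain");
        }
        // ... persist the request via the data access layer, which would wrap driver-specific
        // errors in DataAccessException rather than leaking them upwards.
    }
}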
If you depend on different services and some are not critical, how do you prevent a failure in one of those components from bringing down your entire solution? If a third-party service provides you with a list of public holidays (let's say that is non-critical to your leave tracking system) and it becomes inaccessible for a while, do you really want to render your entire user interface unusable, or do you want to display a message saying something like "we can't display holidays right now" but leave everything else in a working state? The latter option definitely feels preferable in my mind.
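A minimal sketch of that fallback, assuming a hypothetical HolidayClient wrapping the third-party call: catch the failure at the integration point and return a degraded-but-usable result instead of letting the exception take the whole page down.

import java.util.Collections;
import java.util.List;

// Hypothetical client for the third-party public-holiday service.
interface HolidayClient {
    List<String> fetchPublicHolidays(int year);
}

class HolidayService {

    private final HolidayClient client;

    HolidayService(HolidayClient client) {
        this.client = client;
    }

    // Non-critical dependency: if it fails, degrade gracefully instead of failing the whole page.
    List<String> publicHolidays(int year) {
        try {
            return client.fetchPublicHolidays(year);
        } catch (RuntimeException e) {
            // Log it, show "we can't display holidays right now" in the UI,
            // and keep the rest of the leave tracker working.
            return Collections.emptyList();
        }
    }
}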
Alternatively, if you have a set of third parties that you depend on for some or other service (let's say a delivery partner for your e-commerce system), do you really want an outage on their side to knock your business out and prevent you from generating revenue? In this scenario, isolate yourself from failures on the other side by putting an asynchronous messaging layer in between. Assuming that it's not time-critical, your solution can put a message on a queue and something else can process that request once the other party is available again.
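The sketch below fakes that idea with an in-memory BlockingQueue purely to show the shape; in a real system the queue would be durable messaging infrastructure (RabbitMQ, SQS, Kafka, or similar) sitting between you and the delivery partner.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative only: in production a durable message broker plays this role,
// so an outage at the delivery partner doesn't block order capture.
public class DeliveryRequestQueueDemo {

    record DeliveryRequest(String orderId, String address) {}

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<DeliveryRequest> queue = new LinkedBlockingQueue<>();

        // The e-commerce side: accept the order, enqueue the delivery request, and move on.
        queue.put(new DeliveryRequest("ORDER-1001", "1 Example Street"));
        System.out.println("Order accepted; delivery request queued.");

        // A separate consumer (possibly much later, once the partner is reachable again)
        // picks the request up and forwards it.
        DeliveryRequest request = queue.take();
        System.out.println("Forwarding " + request.orderId() + " to the delivery partner.");
    }
}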
Make it easy to recover from failure
What can be a particularly difficult failure to recover from is the kind of failure that brings down an entire environment and requires it to be recreated. This is easier to work around in cloud-based environments, where things like auto-scaling groups can automatically bring up new instances. You also want to make sure that you have a disaster recovery (DR) strategy in place.
Much of this hinges on making sure that it is actually easy to recreate your environments – infrastructure-as-code (IaC) makes this straightforward. Having someone set up servers by hand means that it takes much longer to recreate an environment, with an increased likelihood of human error and more time-consuming debugging. Think cattle, not pets.
What to do after things have failed
Once something has failed, don't just frantically rush to fix it. Do a root cause analysis and figure out exactly why it failed. Then, put something in place to prevent the same kind of failure from happening again.
If it was a NullPointerException in your code, add a unit test to make sure your code can deal with it if it happens again.
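As a tiny, hypothetical example of that kind of regression test: suppose the original bug was a null employee name blowing up a greeting on the dashboard; the test below pins down the now-expected behaviour.

import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.api.Test;

// Regression test for a (hypothetical) NullPointerException that slipped into production:
// a null employee name used to blow up the greeting on the leave dashboard.
class GreetingFormatterTest {

    @Test
    void fallsBackToGenericGreetingWhenNameIsMissing() {
        assertEquals("Hello!", GreetingFormatter.greet(null));
    }

    @Test
    void greetsByNameWhenPresent() {
        assertEquals("Hello, Riaan!", GreetingFormatter.greet("Riaan"));
    }
}

// The fix itself: handle null explicitly instead of dereferencing it.
class GreetingFormatter {
    static String greet(String name) {
        if (name == null || name.isBlank()) {
            return "Hello!";
        }
        return "Hello, " + name + "!";
    }
}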
If a server runs out of disk space and your application crashes, allocate more disk space, put a cleanup job in place to clean out old files, and make sure you have monitoring to alert you when disk space runs low again.
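A minimal sketch of such a cleanup job, assuming the application writes disposable files to a known directory and that anything older than 30 days can safely be deleted; the path and retention period are placeholders.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.FileTime;
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.stream.Stream;

// Simple scheduled cleanup: delete files older than 30 days from an exports directory.
// Directory and retention are assumptions for the example - adjust to your own system.
public class OldFileCleanup {

    public static void main(String[] args) throws IOException {
        Path directory = Paths.get("/var/app/exports");
        Instant cutoff = Instant.now().minus(30, ChronoUnit.DAYS);

        try (Stream<Path> files = Files.list(directory)) {
            files.filter(Files::isRegularFile)
                 .filter(file -> isOlderThan(file, cutoff))
                 .forEach(OldFileCleanup::deleteQuietly);
        }
    }

    private static boolean isOlderThan(Path file, Instant cutoff) {
        try {
            FileTime lastModified = Files.getLastModifiedTime(file);
            return lastModified.toInstant().isBefore(cutoff);
        } catch (IOException e) {
            return false; // If we can't read the timestamp, leave the file alone.
        }
    }

    private static void deleteQuietly(Path file) {
        try {
            Files.delete(file);
            System.out.println("Deleted " + file);
        } catch (IOException e) {
            System.err.println("Could not delete " + file + ": " + e.getMessage());
        }
    }
}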
Also, make sure you learn from failures. Have retrospectives, document the outcomes, and make sure that knowledge is distributed to the rest of your team and your organization. Failure isn't always cheap, so make sure you don't have to pay for the same lesson multiple times.
In summary – accept that architectures are not infallible, and be prepared to deal with failures when they happen. If you have any other thoughts on the topic, let me know in the comments!