Extending the Metaphor: Firefighting and Software Development
Software engineers talk a lot about their time playing firefighter, usually with notes of both frustration and pride. Pride at being the hero and saving the day. Frustration that the fire happened at all, especially if fires are common.
We use the analogy again when talking about networking. Firewalls are essential, protecting us from the outside, keeping us safe. Firewalls are a common tool in construction as well, used to prevent fires from spreading from building to building.
Firewalls aren't the only tool, however, and I think it's time we start looking at the other things firefighters and architects do to protect buildings and forests.
Fire Partitions
Where firewalls keep fire from the outside from getting in, fire partitions keep the fire contained within a specific area. While the protection doesn't last forever, the idea is to give firefighters more time to put out the fire without damaging the rest of the area.
An excellent example of this technique can be seen in the movie Fight Club - an inferno rages in the protagonist's Ikea apartment while the rest of the building is saved by thick concrete walls.
In software development, we can build these partitions by splitting out business logic into discrete pieces (microservices, components, functions, jobs - the list goes on). Whatever pattern you choose, you should be able to separate your logic in such a way that your SSO provider having an outage or a bug in your admin console doesn't cause the entire site to go down.
This is where versioning really comes in - shared APIs and packages are useful, but to maintain our fire partitions we have to remember to only move one component to the new version at a time. Overarching changes should also be done, as much as possible, one component at a time in a series of small deploys.
This way, we contain the blaze, isolate the problem, and keep a fire on the ninth floor from burning down our whole building.
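As a rough illustration of a fire partition in code, here is a minimal Python sketch. It assumes a page assembled from independent sections; the names render_section, load_profile, and load_catalog are invented for this example and do not come from any particular framework. Each section runs behind its own wall, so a failure in one falls back to a placeholder instead of spreading.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Section:
    name: str
    content: str
    healthy: bool


def render_section(name: str, loader: Callable[[], str], fallback: str) -> Section:
    """Run one component behind its own partition.

    Any exception raised by the loader is contained here, so a failure in
    one section cannot take down the sections rendered next to it.
    """
    try:
        return Section(name, loader(), healthy=True)
    except Exception as exc:  # deliberately broad: this is the partition wall
        # A real service would log and alert; printing keeps the sketch small.
        print(f"[{name}] contained failure: {exc}")
        return Section(name, fallback, healthy=False)


# Hypothetical components, invented for this illustration.
def load_profile() -> str:
    raise ConnectionError("SSO provider timed out")


def load_catalog() -> str:
    return "42 products"


def render_page() -> List[Section]:
    """Assemble the page; each section succeeds or fails on its own."""
    return [
        render_section("profile", load_profile, fallback="Sign-in is temporarily unavailable"),
        render_section("catalog", load_catalog, fallback="Catalog is temporarily unavailable"),
    ]


if __name__ == "__main__":
    for section in render_page():
        print(section.name, "->", section.content)
```

The broad except is a deliberate choice here, since the wall only works if it holds against fires nobody predicted. The trade-off is that it can hide bugs, so the contained failure should always be logged somewhere a person will see it.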
Smoke Barriers
The old saying "Where there's smoke, there's fire" is only true when you have appropriate smoke barriers in place. I've seen this missed a lot with microservices, especially in health and readiness checks.
The reason is understandable - if your service is functionally useless without another service being up, it makes sense to fail your readiness check. The problem is that doing so causes smoke in the wrong place. If the fire is in the other service, a failing health check on yours will send the dev looking in the wrong place.
This is where fault tolerance and graceful degradation come into play. Whenever possible, a fire in another service should either have minimal effect on ours or be clearly identified as the source of the smoke in the system's error handling. If the SSO service does fail, our service shouldn't fail along with it. It should let us know the SSO service is failing.
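One way to picture a smoke barrier in a health check is the Python sketch below. It is only an assumption about how such a report could be shaped: the check_health function and the field names (ready, status, failing_dependencies) are hypothetical, not part of any specific orchestrator's API. Only the service's own checks decide readiness, while a failing dependency is named explicitly and marks the service as degraded.

```python
from typing import Callable, Dict, List


def _passes(check: Callable[[], bool]) -> bool:
    """Treat an exception inside a check the same as a failed check."""
    try:
        return bool(check())
    except Exception:
        return False


def check_health(
    own_checks: Dict[str, Callable[[], bool]],
    dependency_checks: Dict[str, Callable[[], bool]],
) -> dict:
    """Build a health report that keeps the smoke next to the fire.

    Only this service's own checks decide whether it is ready. Dependency
    failures are reported by name and mark the service as degraded instead.
    """
    own_failures: List[str] = sorted(
        name for name, check in own_checks.items() if not _passes(check)
    )
    failing_deps: List[str] = sorted(
        name for name, check in dependency_checks.items() if not _passes(check)
    )

    if own_failures:
        status = "unhealthy"
    elif failing_deps:
        status = "degraded"
    else:
        status = "ok"

    return {
        "ready": not own_failures,
        "status": status,
        "own_failures": own_failures,
        "failing_dependencies": failing_deps,
    }


# Hypothetical dependency probe, invented for this illustration.
def sso_reachable() -> bool:
    raise TimeoutError("SSO did not respond")


report = check_health(
    own_checks={"database_pool": lambda: True},
    dependency_checks={"sso": sso_reachable},
)
print(report)
```

Whether a service should stay ready while a hard dependency is down is a judgment call that depends on how traffic is routed. The point of the sketch is only that the report names the SSO service as the source of the smoke.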
Controlled Burns
In forestry management, one of the tools Indigenous cultures practiced for millennia was controlled burns. This is the practice of setting small blazes in forests in order to clear out deadwood, reduce the risk of uncontrolled forest fires, and encourage new growth. The restricting of this practice is part of why forest fires across Canada and the United States have become so terrifying and unmanageable.
In software development, a similar concept is chaos engineering. We intentionally try to break our systems in order to see what we missed, where our weak points are, and how we can improve our smoke barriers and fire partitions. We control the circumstances, monitor the fire's progress, and keep it from having too much impact.
This can be a hard sell, since taking down even a part of a production system is likely to cost the organisation a great deal of money. The alternative, however, might be an inferno on the busiest day of the year, or right before the launch of a much-lauded new feature.
Even worse, without the knowledge gained in a controlled burn, we might not know what type of fire we're dealing with. In the pressure and chaos, we're likely to default to the most common solutions. Our default reactions, unfortunately, can be like throwing water on a grease fire, increasing the damage and time to resolve the problem exponentially.
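To make the controlled burn idea a little more concrete, here is a small Python sketch of fault injection. It is a toy under stated assumptions, not a substitute for a real chaos engineering tool: the ControlledBurn class and its failure_rate and max_faults parameters are invented for this example. The burn is off until someone lights it, and it stops on its own once the budget of injected faults is spent.

```python
import random
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class InjectedFault(Exception):
    """Raised in place of a real call while an experiment is running."""


class ControlledBurn:
    """Wrap calls to a dependency and fail some of them on purpose."""

    def __init__(
        self,
        failure_rate: float,
        max_faults: int,
        rng: Optional[random.Random] = None,
    ) -> None:
        if not 0.0 <= failure_rate <= 1.0:
            raise ValueError("failure_rate must be between 0 and 1")
        self.failure_rate = failure_rate
        self.max_faults = max_faults  # the firebreak: stop injecting after this many
        self.faults_injected = 0
        self.enabled = False  # off by default; someone has to light the fire
        self._rng = rng or random.Random()

    def call(self, func: Callable[[], T]) -> T:
        """Run func, or raise InjectedFault while the burn is active."""
        burning = self.enabled and self.faults_injected < self.max_faults
        if burning and self._rng.random() < self.failure_rate:
            self.faults_injected += 1
            raise InjectedFault(f"injected fault #{self.faults_injected}")
        return func()


if __name__ == "__main__":
    burn = ControlledBurn(failure_rate=0.5, max_faults=3)
    burn.enabled = True
    for attempt in range(10):
        try:
            print(attempt, burn.call(lambda: "ok"))
        except InjectedFault as fault:
            print(attempt, "handled:", fault)
```

The cap on injected faults plays the role of the firebreak. Even if nobody remembers to switch the experiment off, the damage stays bounded.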
Cooling Down
Fire partitions and smoke barriers can help us reduce fires, and controlled burns can give us the knowledge needed to stop them if they do start. The last tool is also the simplest.
Just like firefighters will tell you to watch your cooking and never leave a space heater unattended, the best way to prevent software fires is to slow down, be thorough in testing and peer reviewing, and focus on doing one thing at a time.
It's fine for dinner to be a little late if it means the burner isn't left on. It's okay to miss a sprint goal if it means your merge doesn't introduce a new point of failure.