10 Principles of High Availability
Inviolable rules that will save your bacon when the chips are down
As systems become more complex and grow in scale, it becomes a challenge to keep them operating amid constant changes to the underlying software. From decades of running (and breaking) high-scale, globally distributed systems, we’ve learned several lessons, condensed here into 10 core principles of availability.
1. Avoid deploying changes everywhere all at once
For globally distributed systems that span multiple geographical regions, do not deploy changes everywhere all at once. A single bug could take out your entire product for all customers. Modern deployment environments generally don’t let you do this unknowingly, yet it remains a fundamental principle of availability that is best not ignored.
2. Start with small regions first when deploying
The size of a region maps to the number of customers in that geography. Starting with small regions limits the blast radius to fewer customers if something goes wrong with the deployment. Rolling back a smaller infrastructure footprint is faster as well.
3. Roll back first, ask questions later
Resist the temptation to debug availability issues that crop up in production after a deployment. Roll back liberally to mitigate the impact on customers first. Follow up with root-cause analysis once the customer impact is mitigated.
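The three deployment principles above can be sketched together. This is a minimal illustration, not a real deployment tool: the `deploy`, `is_healthy`, and `rollback` callables and the region names in the usage note are hypothetical stand-ins for whatever your pipeline provides.

```python
from typing import Callable, Dict, List


def rollout_order(customers_by_region: Dict[str, int]) -> List[str]:
    """Return region names sorted from fewest to most customers."""
    return sorted(customers_by_region, key=lambda region: customers_by_region[region])


def staged_rollout(
    customers_by_region: Dict[str, int],
    deploy: Callable[[str], None],      # hypothetical: pushes the change to one region
    is_healthy: Callable[[str], bool],  # hypothetical: post-deploy health check
    rollback: Callable[[str], None],    # hypothetical: restores the previous version
) -> List[str]:
    """Deploy one region at a time, smallest first.

    Stops at the first unhealthy region and rolls back every region touched
    so far. Returns the regions left on the new version (empty on failure).
    """
    deployed: List[str] = []
    for region in rollout_order(customers_by_region):
        deploy(region)
        deployed.append(region)
        if not is_healthy(region):
            # Mitigate first: undo the change everywhere it has landed.
            # Root-cause analysis happens afterwards, outside this loop.
            for touched in reversed(deployed):
                rollback(touched)
            return []
    return deployed
```

Given, say, `{"us-east": 900, "eu-west": 300, "ap-south": 50}`, the change would reach `ap-south` first and `us-east` last, so a bad build is most likely to be caught where the fewest customers are affected.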
4. Always retry with jitter and exponential backoff
‘Retry storms’ happen when layers of systems retry failed calls indiscriminately. This can cause a single failure to rapidly exhaust infrastructure capacity and bring your product to a screeching halt. Ensure all retry logic for dependency calls has three elements tuned for the application: exponential backoff, jitter, and a timeout. Bake them in as defaults in your client configuration files.
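As a rough sketch of how the three elements fit together, the helper below retries a call with capped exponential backoff and "full jitter" (a uniformly random delay up to the backoff ceiling). The function names, default values, and the choice of an overall deadline as the timeout are all assumptions for illustration; a per-call timeout is expected to be enforced inside `call` itself.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 5.0) -> float:
    """Full-jitter delay: uniform in [0, min(cap, base * 2**attempt)] seconds."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_retry(
    call: Callable[[], T],
    max_attempts: int = 5,           # assumed default, tune per application
    deadline_seconds: float = 10.0,  # overall time budget across all attempts
    base: float = 0.1,
    cap: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `call` until it succeeds, attempts run out, or the deadline passes."""
    start = time.monotonic()
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            delay = backoff_delay(attempt, base, cap)
            out_of_attempts = attempt == max_attempts - 1
            out_of_time = time.monotonic() - start + delay > deadline_seconds
            if out_of_attempts or out_of_time:
                raise  # give up and surface the last error to the caller
            sleep(delay)
    raise RuntimeError("max_attempts must be at least 1")
```

The jitter matters as much as the backoff: without it, clients that failed at the same moment retry at the same moment, and the storm simply arrives in waves. In practice you would also narrow the `except` clause to errors that are actually worth retrying.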
5. Avoid black boxes
Black boxes are closed systems whose code your organization does not own. They are hard to debug, and their performance is generally out of your control. If they are unavoidable, opt for async integration patterns to absorb outages. Consider designing levers that let your product keep operating in a partial state should the black box become unavailable.
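One possible shape for such a lever is a wrapper that serves a fallback when the dependency fails or when an operator switches it off. The class name, the `enabled` flag, and the idea of a static fallback list are assumptions made for this sketch, not a prescribed design.

```python
from typing import Callable, Dict, List


class DegradableFeature:
    """Wraps a call to an external dependency behind an operator-controlled lever."""

    def __init__(self, call_black_box: Callable[[str], List[str]], fallback: List[str]):
        self._call = call_black_box  # hypothetical client for the black box
        self._fallback = fallback    # what to serve in the partial state
        self.enabled = True          # the lever: set to False to bypass the dependency

    def fetch(self, user_id: str) -> Dict[str, object]:
        if self.enabled:
            try:
                return {"items": self._call(user_id), "degraded": False}
            except Exception:
                pass  # fall through to the partial experience
        return {"items": list(self._fallback), "degraded": True}
```

Returning a `degraded` marker lets the rest of the product decide how to present the partial state, and gives monitoring something to count.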
6. Invest in canaries
Like canaries in a coal mine that alert miners to dangerous gases, build canaries for your software system that alert you to problems before the customer notices. These are early warning mechanisms that inform you of an impending problem before it spreads wider. Consider your core product features and devise automated tests that validate their functioning on a recurring basis and raise an alarm when something breaks.
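A canary can be as simple as a set of named checks that run on a schedule. The sketch below shows a single pass; the check names and the `alarm` callable are hypothetical, and the recurring part is assumed to come from whatever scheduler you already run.

```python
from typing import Callable, Dict, List


def run_canary(
    checks: Dict[str, Callable[[], bool]],  # one check per core product feature
    alarm: Callable[[str], None],           # hypothetical hook into your paging system
) -> List[str]:
    """Run every check once; alarm on each failure and return the failed names."""
    failed: List[str] = []
    for name, check in checks.items():
        try:
            healthy = check()
        except Exception:
            healthy = False  # a crashing check counts as a broken feature
        if not healthy:
            failed.append(name)
            alarm(f"canary failed: {name}")
    return failed
```

Each check should exercise the feature the way a customer would, for example signing in with a test account, so that a pass means the feature works end to end and not merely that a server is up.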
7. Fix spare tires
Spare tires are systems that are not used on a day-to-day basis but must work when they are needed. These are backup systems, recovery scripts, failover software environments, and so on. As the systems around them change, spare tires tend to fall into disrepair. A flat spare tire can prolong recovery time dramatically. Audit seldom-used systems regularly and treat the fixes needed to keep them running as top priority.
8. Look for dogs that are not barking
Dogs that don’t bark are issues that lurk, silently worsening over time until they turn into a catastrophe. Examples are slow memory leaks, an ID scheme running out of unique combinations, database fields outgrowing the schema, and so on. Fixing these usually requires significant changes or takes a long time to implement. Audit for them regularly to spot and fix problems while there is still time.
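For the ID exhaustion example, part of the audit can be reduced to simple arithmetic: how many days of runway remain at the current growth rate, and is that longer than the time a fix would take? The function names and the numbers in the usage note are illustrative assumptions, and the projection assumes constant growth.

```python
def days_until_exhaustion(capacity: int, used: int, daily_growth: float) -> float:
    """Days left before `used` reaches `capacity`, assuming constant daily growth."""
    if daily_growth <= 0:
        return float("inf")  # not growing, so no projected exhaustion
    return max(0.0, (capacity - used) / daily_growth)


def needs_attention(days_left: float, fix_lead_time_days: float) -> bool:
    """True when the remaining runway is shorter than the time a fix would take."""
    return days_left <= fix_lead_time_days
```

For instance, a signed 32-bit ID column holds 2,147,483,647 values. With 1.9 billion already used and one million new IDs per day, `days_until_exhaustion(2**31 - 1, 1_900_000_000, 1_000_000)` gives roughly 247 days, which is not much if the migration is expected to take a year.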
9. Make security everyone’s business
Don’t treat security as a bolt-on at the end of a project, or a checkmark before you can launch. Design threat models early in the project, preferably when the high-level design is ready. Partner with your security team throughout the development process, involving them in important code and design reviews. This will minimize last-minute surprises and ensure layers of security are integrated within the product.
10. Avoid snowflakes
Snowflakes are parts of the system that break the usual pattern and typically require special handling. Deployment regions that don’t have all the dependencies, parts of the system that require different build tools, and areas that require manual operations are all examples of snowflakes. As systems change, snowflakes carry the most risk of unintended breakage. Advocate for consistency in architecture. If snowflakes are unavoidable, ensure they get tested first and frequently. Make sure the whole team knows about the snowflakes, and have a plan of action to eliminate them.
Do these ideas resonate with you? What lessons have you learned in managing highly available systems? Drop a note!