10 Principles of High Availability

Arvind Suryakumar

Engineering Leader

Published: April 12, 2024

Inviolable rules that will save your bacon when the chips are down

As systems become more complex and grow in scale, it becomes a challenge to keep them operating amidst constant changes to the underlying software. From decades of running (and breaking) high-scale, globally distributed systems, we’ve learned several lessons, condensed here into 10 core principles of availability.

1. Avoid deploying changes everywhere all at once

For globally distributed systems that span multiple geographical regions, do not deploy changes everywhere at once. A single bug could take out your entire product for all customers. Modern deployment environments generally don’t let you do this unknowingly, yet it is a fundamental principle of availability that is best not ignored.

2. Start with small regions first when deploying

The size of a region maps to the number of customers in that geography. Starting with small regions limits the blast radius to fewer customers if something goes wrong with the deployment. Rolling back a smaller infrastructure footprint is faster as well.
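As a minimal sketch of the smallest-first idea, the snippet below orders regions by customer count and stops the rollout at the first region that fails its health check. The `Region` type, the `deploy` and `is_healthy` callbacks, and the use of customer count as the size measure are all assumptions made for illustration, not part of any particular deployment tool.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Region:
    name: str
    customers: int  # hypothetical proxy for region size


def rollout_order(regions: Iterable[Region]) -> List[Region]:
    """Order regions smallest first, so early failures hit the fewest customers."""
    return sorted(regions, key=lambda r: r.customers)


def staged_rollout(
    regions: Iterable[Region],
    deploy: Callable[[Region], None],
    is_healthy: Callable[[Region], bool],
) -> List[Region]:
    """Deploy one region at a time and stop at the first unhealthy region.

    Returns the regions that were deployed and verified healthy.
    """
    completed: List[Region] = []
    for region in rollout_order(regions):
        deploy(region)
        if not is_healthy(region):
            # Halt here: larger regions never receive the bad change.
            break
        completed.append(region)
    return completed
```

In practice the health check would watch real signals (error rates, latency) for a soak period before moving on; here it is reduced to a single boolean to keep the ordering logic visible.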

3. Rollback first, ask questions later

Resist the temptation to debug availability issues that crop up in production after a deployment. Roll back liberally to mitigate customer impact first. Follow up with root cause analysis once the impact is mitigated.

4. Always retry with jitter and exponential backoff

‘Retry storms’ happen when layers of systems retry failed calls indiscriminately. A single failure can then rapidly exhaust infrastructure capacity and grind your product to a halt. Ensure that all retry logic for dependency calls has three elements tuned for the application: exponential backoff, jitter, and a timeout. Bake them in as defaults in your client configuration files.
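The three elements can be sketched in a few lines. This is an illustration under stated assumptions rather than a drop-in client: the function names, the default values, and the choice of "full jitter" (a uniform random delay up to the exponential bound) are mine, and `deadline` here is an overall time budget for the whole retry loop. Per-call timeouts should still be set on the underlying client.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 5.0) -> float:
    """Full-jitter delay: uniform random in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_retry(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base: float = 0.1,
    cap: float = 5.0,
    deadline: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, retrying failures with exponential backoff and jitter.

    Gives up (re-raising the last error) when attempts run out or when the
    next wait would push the total elapsed time past `deadline` seconds.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    start = time.monotonic()
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            delay = backoff_delay(attempt, base, cap)
            out_of_attempts = attempt == max_attempts - 1
            out_of_time = (time.monotonic() - start) + delay > deadline
            if out_of_attempts or out_of_time:
                raise
            sleep(delay)
    raise AssertionError("unreachable")
```

A real client would usually retry only errors known to be transient (timeouts, throttling) rather than every exception. The jitter matters because, without it, many callers that failed together also retry together.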

5. Avoid Black Boxes

Black boxes are closed systems whose code your organization does not own. They are hard to debug, and their performance is generally out of your control. If they are unavoidable, opt for async integration patterns to absorb outages. Consider designing levers that let your product keep operating in a partial state should the black box become unavailable.
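One way to picture an async integration is an outbox: the product records what it wants to send and returns immediately, while a separate step delivers messages when the dependency is reachable. The sketch below is a simplified in-memory version. The `Outbox` class, its method names, and the `send` callback are hypothetical, and a production version would persist the queue rather than hold it in memory.

```python
from collections import deque
from typing import Callable, Deque


class Outbox:
    """Buffers messages for an unreliable dependency instead of calling it inline."""

    def __init__(self, send: Callable[[str], None], max_pending: int = 1000) -> None:
        self._send = send  # hypothetical call into the black box
        self._max_pending = max_pending
        self._pending: Deque[str] = deque()

    def submit(self, message: str) -> bool:
        """Queue a message without touching the dependency. False if the buffer is full."""
        if len(self._pending) >= self._max_pending:
            return False
        self._pending.append(message)
        return True

    def drain(self) -> int:
        """Deliver queued messages in order; stop at the first failure.

        Returns the number of messages delivered in this pass.
        """
        delivered = 0
        while self._pending:
            message = self._pending[0]
            try:
                self._send(message)
            except Exception:
                break  # dependency is down; keep the message for the next drain
            self._pending.popleft()
            delivered += 1
        return delivered

    def pending_count(self) -> int:
        return len(self._pending)
```

While the black box is down, `submit` keeps succeeding and the product carries on in a partial state; the backlog is delivered by later calls to `drain` once the dependency recovers.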

6. Invest in canaries

Like canaries in a coal mine that alert miners to dangerous gases, build canaries for your software system that alert you to problems before the customer notices. These are early warning mechanisms that inform you of an impending problem before it spreads wider. Consider your core product features and devise automated tests that validate their functioning on a recurring basis and raise an alarm when they break.
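A canary run can be as simple as a named set of checks plus an alarm hook. The sketch below assumes each check is a zero-argument function returning `True` when the feature works. The names `run_canaries` and `alarm` are illustrative, and the recurring schedule (cron, a job scheduler) is left out.

```python
from typing import Callable, Dict, List


def run_canaries(
    checks: Dict[str, Callable[[], bool]],
    alarm: Callable[[str, str], None],
) -> List[str]:
    """Run every check once and raise an alarm for each one that fails.

    A check fails if it returns False or raises. Returns the failed check names.
    """
    failed: List[str] = []
    for name, check in checks.items():
        try:
            ok = check()
            detail = "check returned False"
        except Exception as exc:
            # A crashing canary is treated as a failing canary, not skipped.
            ok = False
            detail = f"check raised {exc!r}"
        if not ok:
            failed.append(name)
            alarm(name, detail)
    return failed
```

Each check would typically exercise a core feature end to end, such as logging in or completing a test purchase, against the production system.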

7. Fix spare tires

Spare tires are systems that are not used on a day-to-day basis but must work when they are needed. These are backup systems, recovery scripts, failover software environments, etc. As the systems around them change, spare tires tend to fall into disrepair. Flat spare tires can prolong recovery time dramatically. Audit seldom-used systems regularly and treat the fixes needed to keep them running as top priority.

8. Look for dogs that are not barking

Dogs that don’t bark are issues that lurk, silently worsening over time until they turn into a catastrophe. Examples are slow memory leaks, ID schemes running out of unique combinations, database fields outgrowing the schema, etc. Fixing these usually requires significant changes or takes a long time to implement. Audit for them regularly to spot and fix problems while there is still time.
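For the ID exhaustion example, a recurring audit can be plain arithmetic: how many IDs are left, and how long they last at the current rate. The sketch below assumes a linear consumption rate. The function names and the figures in the usage note are made up for illustration.

```python
def days_until_exhaustion(current_max_id: int, id_capacity: int, ids_per_day: float) -> float:
    """Estimate days left before an ID space runs out, assuming a steady daily rate."""
    if ids_per_day <= 0:
        raise ValueError("ids_per_day must be positive")
    remaining = max(id_capacity - current_max_id, 0)
    return remaining / ids_per_day


def needs_attention(days_left: float, fix_lead_time_days: float) -> bool:
    """Flag the issue once the runway is shorter than the time a fix would take."""
    return days_left <= fix_lead_time_days


# Illustrative numbers: a signed 32-bit key at 2 billion rows, growing 1M rows/day.
INT32_MAX = 2**31 - 1
runway = days_until_exhaustion(2_000_000_000, INT32_MAX, 1_000_000)
```

With these numbers the runway is about 147 days. If migrating the column takes six months, the dog should already be barking, which is the point of comparing the runway against the fix lead time rather than against zero.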

9. Make security everyone’s business

Don’t treat security as a bolt-on at the end of a project, or a checkmark before you can launch. Design threat models early in the project, preferably when the high-level design is ready. Partner with your security team throughout the development process, involving them in important code and design reviews. This will minimize last-minute surprises and ensure layers of security are integrated within the product.

10. Avoid snowflakes

Snowflakes are parts of the system that break the usual pattern and typically require special handling. Deployment regions that don’t have all the dependencies, parts of the system that require different build tools, and areas that require manual operations are all examples of snowflakes. As systems change, snowflakes carry the most risk of unintended breakage. Advocate for consistency in architecture. If snowflakes are unavoidable, ensure they get tested first and frequently. Make the team widely aware of the snowflakes and have a plan of action to eliminate them.
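Making snowflakes visible is a first step toward eliminating them. As a rough sketch for the "regions that don’t have all the dependencies" case, the function below compares each region’s dependency set against the union across all regions and reports what is missing. The region and dependency names in the usage note are invented, and real inventories would come from your deployment metadata.

```python
from typing import Dict, Set


def find_snowflakes(region_deps: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Return, per region, the dependencies present elsewhere but missing there.

    Regions with nothing missing are omitted from the result.
    """
    all_deps: Set[str] = set().union(*region_deps.values())
    report: Dict[str, Set[str]] = {}
    for region, deps in region_deps.items():
        missing = all_deps - deps
        if missing:
            report[region] = missing
    return report
```

For example, given `{"region-a": {"cache", "queue"}, "region-b": {"cache"}}`, the report names `region-b` as missing `queue`, which makes it a candidate for the "test first and frequently" list.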

Do these ideas resonate with you? What lessons have you learned in managing highly available systems? Drop a note!

Leadership with some Lavazza
