A crisp guide on "Design for failure" to build highly reliable and available software products

Laxmikant Agarwal

Director Software Development | Cybersecurity | Cloud | SASE Firewall | Sophos | Juniper | RSA | Amazon | McAfee | TAPMI | MANIT

Published: February 23, 2025

To be highly reliable and available, a software product has to be resilient to failures, recover from them on its own, and have redundancy built in. It also needs fault tolerance at various layers. The more business-critical the product or system, the greater the need to design for failure.

For cloud-based applications, cloud platforms provide high availability at the infrastructure level through multiple availability zones, load balancers, and seamless auto-scaling.

Three factors from the twelve-factor app methodology are key to designing for failure:

Disposability: Maximize robustness with fast start-up and graceful shutdown. Produce lean container images and strive for processes that can start and stop in a matter of seconds.
Logs: If part of a system fails, troubleshooting is necessary. So ensure material for forensics exists in the form of logs. Standardize logging for the whole system to enable easy correlation of log data from various services.
Dev/prod parity: Keep development, staging, and production as similar as possible.
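
The disposability and logs factors can be illustrated with a minimal Python sketch. It is not taken from the twelve-factor text itself; the `Worker` class, the `service` and `request_id` field names, and the JSON log shape are all assumptions made for illustration. The worker finishes the job in progress and then stops when asked, and every log line has the same structure so lines from different services can be correlated.

```python
import json
import logging
import signal
import threading


class JsonFormatter(logging.Formatter):
    """Render every record as one JSON object so all services log in the same shape."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "service": getattr(record, "service", "unknown"),
            "request_id": getattr(record, "request_id", None),  # correlation key (assumed name)
            "message": record.getMessage(),
        })


def build_logger(name, stream=None):
    """Create a logger that writes JSON lines to `stream` (stderr when None)."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


class Worker:
    """Hypothetical worker: completes the current job, then stops once asked to."""

    def __init__(self, logger):
        self.logger = logger
        self.stop_requested = threading.Event()
        self.processed = []

    def request_stop(self, signum=None, frame=None):
        # Signature matches what signal.signal() expects from a handler.
        self.stop_requested.set()

    def run(self, jobs):
        for job in jobs:
            if self.stop_requested.is_set():
                self.logger.info("stopping before next job", extra={"service": "worker"})
                break
            self.processed.append(job)
            self.logger.info("job done", extra={"service": "worker", "request_id": job})
        return self.processed


def install_signal_handlers(worker):
    """Call once from the main thread at start-up."""
    signal.signal(signal.SIGTERM, worker.request_stop)
    signal.signal(signal.SIGINT, worker.request_stop)
```

A service would call `install_signal_handlers(worker)` at start-up and then `worker.run(jobs)`. Because the stop flag is only checked between jobs, a job that has started is never cut off halfway.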

Design techniques to build self-healing, highly resilient systems

Circuit breaker: Detects failures and prevents the application from performing an action that is doomed to fail (until it is safe to retry).
Retry: When a failure happens, learn its cause, go back to the previous step, and retry, either in the same way a finite number of times or in a different way.
Timeout: Instead of waiting indefinitely or for too long, throw an exception and exit, freeing up the resources/threads.
Rollback: In a failure situation, undo the action that led to the failure to go back to a known-good state.
Partitioning/decoupling: Ensure that a failure in one service does not affect another service.
Degrade/derate: When one part of the system fails, the whole system continues to run, though at a degraded level that does not impact the majority of customers.
Reconciliation/offline service: Protect systems against failures with a reconciliation service that handles failures and data inconsistencies and also analyses and monitors failures. This lets primary services stay focused on time-sensitive, critical jobs.
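
The first three techniques in the list above can be sketched together in a few lines of Python. This is an illustrative sketch under stated assumptions, not a production library: the class and function names, the threshold of three failures, the 30-second cool-down, and the exponential backoff are all choices made for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


class CircuitOpenError(Exception):
    """Raised when the breaker refuses a call without attempting it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open, call skipped")
            # Cool-down elapsed: let one trial call through (half-open state).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        self.failures = 0
        self.opened_at = None
        return result


def retry(func, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call func up to `attempts` times, doubling the delay between tries."""
    for attempt in range(attempts):
        try:
            return func()
        except CircuitOpenError:
            raise  # retrying against an open circuit only adds load
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))


def call_with_timeout(func, seconds):
    """Stop waiting after `seconds`. The caller is released, but the worker
    thread itself is not killed and runs until func returns."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(func)
    try:
        return future.result(timeout=seconds)
    finally:
        pool.shutdown(wait=False)
```

The pieces compose: `retry(lambda: breaker.call(fetch))` retries transient errors but gives up immediately once the breaker opens, where `fetch` stands for any hypothetical call to a dependency.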

Follow these up with robust fault injection: introduce faults/errors into the software to evaluate its resilience, fault tolerance, and self-recovery. Run fault injection experiments in isolated, sandboxed environments, preferably using injection tools such as Chaos Monkey or Gremlin.
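
Before reaching for a dedicated tool, the idea can be tried in-process. The sketch below is a hand-rolled wrapper, not the API of Chaos Monkey or Gremlin; `with_faults`, `fetch_price`, and `price_or_fallback` are hypothetical names. It delays a dependency call and makes it fail at a chosen rate, so the calling code's fallback path can be exercised.

```python
import random
import time


class InjectedFault(Exception):
    """Marks a failure that the experiment introduced on purpose."""


def with_faults(func, failure_rate=0.0, added_latency=0.0, rng=None, sleep=time.sleep):
    """Wrap func so calls are delayed and fail at random, as a real dependency might."""
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        if added_latency > 0:
            sleep(added_latency)  # latency injection
        if rng.random() < failure_rate:
            raise InjectedFault("injected failure")  # service interruption
        return func(*args, **kwargs)

    return wrapper


def fetch_price(item):
    """Hypothetical dependency call used as the injection target."""
    return {"item": item, "price": 10}


def price_or_fallback(fetch, item):
    """Code under test: degrade to a placeholder answer instead of crashing."""
    try:
        return fetch(item)
    except Exception:
        return {"item": item, "price": None, "degraded": True}
```

With `failure_rate=1.0` every call fails, which makes it easy to confirm that `price_or_fallback` returns the degraded response rather than raising.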

Use these fault injection techniques, and measure the results with metrics such as fault detection rate, false positives, and false negatives:

Invalid Inputs: Feed the software with invalid or malformed input data to assess its input validation and error-handling capabilities.
Bit Flipping: Modify specific bits in data, instructions, or configuration files to simulate data corruption or code manipulation.
Random Code Injection: Inject random code snippets or instructions into the application to simulate unpredictable errors.
Memory Exhaustion: Allocate excessive memory or induce memory leaks to observe how the software handles out-of-memory situations.
CPU Overload: Overload the CPU by consuming excessive CPU resources to assess the system's responsiveness and recovery.
Boundary Testing: Test the software with inputs at or beyond the boundary of valid ranges to evaluate boundary condition handling.
Packet Dropping: Simulate network packet loss or network partitioning to assess how the software handles network failures.
Latency Injection: Introduce artificial network latency to evaluate the system's responsiveness and timeout settings.
Clock Manipulation: Alter the system clock to simulate time-related issues, such as expired certificates or time-sensitive operations.
Timeout Testing: Adjust timeouts for network requests or critical operations to test how the software reacts to slow or unresponsive services.
Service Interruption: Temporarily stop or disrupt external services, databases, or dependencies that the software relies on to assess its resilience and error recovery.
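
As one worked instance of the techniques above, the following Python sketch runs a bit-flipping experiment and reports the metrics mentioned earlier. The CRC32-sealed message format, the 50% injection rate, and the function names are assumptions chosen to keep the example small. A missed fault counts as a false negative and a clean message that gets flagged counts as a false positive.

```python
import random
import zlib


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


def seal(payload: bytes) -> bytes:
    """Append a CRC32 checksum so the receiver can detect corruption."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def is_intact(message: bytes) -> bool:
    """Detector under test: recompute the checksum and compare."""
    payload, checksum = message[:-4], message[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == checksum


def run_experiment(trials=200, seed=7):
    """Inject a single bit flip into roughly half of the messages and score the detector."""
    rng = random.Random(seed)
    counts = {"detected": 0, "false_negative": 0, "false_positive": 0, "clean_ok": 0}
    for _ in range(trials):
        message = seal(bytes(rng.getrandbits(8) for _ in range(32)))
        injected = rng.random() < 0.5
        if injected:
            message = flip_bit(message, rng.randrange(len(message) * 8))
        flagged = not is_intact(message)
        if injected and flagged:
            counts["detected"] += 1
        elif injected:
            counts["false_negative"] += 1
        elif flagged:
            counts["false_positive"] += 1
        else:
            counts["clean_ok"] += 1
    faults = counts["detected"] + counts["false_negative"]
    counts["detection_rate"] = counts["detected"] / faults if faults else None
    return counts
```

CRC32 catches every single-bit error, so this detector scores a detection rate of 1.0 with no false positives. Swapping in a weaker check, or flipping several bits at once, is a quick way to see the metrics move.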
