A crisp guide on "Design for failure" to build highly reliable and available software products

For a software product to be highly reliable and available, it has to be resilient to failures, recover from them on its own, and have redundancy built in. It also needs fault tolerance at every layer. The higher the business criticality of the product or system, the greater the need to design for failure.

For cloud-based applications, cloud platforms provide high availability at the infrastructure level through multiple availability zones, load balancers, and auto-scaling.

Three factors from the twelve-factor app methodology are key to designing for failure:

  1. Disposability: Maximize robustness with fast start-up and graceful shutdown. Produce lean container images and strive for processes that can start and stop in a matter of seconds.
  2. Logs: If part of a system fails, troubleshooting is necessary. So ensure material for forensics exists in the form of logs. Standardize logging for the whole system to enable easy correlation of log data from various services.
  3. Dev/prod parity: Keep development, staging, and production as similar as possible.
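
As an illustration of the disposability factor, here is a minimal Python sketch of a process that shuts down gracefully when it receives SIGTERM or SIGINT. The names `Worker`, `request_stop` and `install_handlers` are hypothetical and chosen only for this example; the "work" and "cleanup" steps are placeholders for whatever a real service would do.

```python
import signal
import threading


class Worker:
    """Hypothetical worker loop that stops cleanly when asked to."""

    def __init__(self):
        self._stop = threading.Event()
        self.processed = 0
        self.cleaned_up = False

    def request_stop(self, signum=None, frame=None):
        # Keep the signal handler tiny: only set a flag.
        self._stop.set()

    def run(self, max_iterations=None):
        try:
            while not self._stop.is_set():
                self.processed += 1  # placeholder for one unit of work
                if max_iterations is not None and self.processed >= max_iterations:
                    break
                # Sleep briefly, but wake up at once if a stop is requested.
                self._stop.wait(0.01)
        finally:
            # Placeholder for closing connections, flushing logs, etc.
            self.cleaned_up = True


def install_handlers(worker):
    """Register the stop request for SIGTERM and SIGINT (main thread only)."""
    signal.signal(signal.SIGTERM, worker.request_stop)
    signal.signal(signal.SIGINT, worker.request_stop)
```

In a real service, `install_handlers(worker)` would be called once at start-up before `worker.run()`. The `finally` block is what makes the shutdown graceful: cleanup runs whether the loop ends normally or because a stop was requested.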

Design techniques to build self-healing, highly resilient systems

  • Circuit breaker: Detects failures and prevents the application from attempting an action that is doomed to fail (until it is safe to retry).
  • Retry: When a failure happens, determine its cause, go back to the previous step, and retry, either the same way a finite number of times or in a different way.
  • Timeout: Instead of waiting indefinitely or for too long, throw an exception and exit, freeing up resources and threads.
  • Rollback: In a failure situation, undo the action that led to the failure to return to a known-good state.
  • Partitioning/decoupling: Isolate services so that a failure in one service does not affect another.
  • Degrade/derate: When one part of the system fails, the whole system continues to run, though at a degraded level that does not impact the majority of customers.
  • Reconciliation/offline service: A separate reconciliation service handles failures and data inconsistencies, and also analyses and monitors failures, so that primary services can remain focussed on time-sensitive, critical jobs.
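
The circuit breaker and retry techniques can be sketched in a few lines of Python. This is a simplified illustration rather than a production implementation: the class and function names, the default thresholds and the injectable `clock` and `sleep` parameters are all assumptions made for this example. Real systems would usually rely on an established resilience library.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses a call without attempting it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Reset timeout elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result


def retry(func, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call func up to `attempts` times with exponential backoff."""
    for attempt in range(attempts):
        try:
            return func()
        except CircuitOpenError:
            raise  # retrying is pointless while the breaker fails fast
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

The two pieces compose: `retry(lambda: breaker.call(fetch))` retries transient errors but stops immediately once the breaker opens, where `breaker` is a `CircuitBreaker` instance and `fetch` stands for any hypothetical remote call.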

Follow these up with robust fault injection: introduce faults or errors into the software to evaluate its resilience, fault tolerance, and self-recovery. Run fault injection experiments in isolated, sandboxed environments, preferably using tools such as Chaos Monkey or Gremlin.
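
To make the idea concrete without a dedicated tool, the sketch below wraps an ordinary function so that it sometimes fails or responds slowly. It is only an in-process illustration and does not reflect how Chaos Monkey or Gremlin work; `inject_faults`, `InjectedFault` and all parameters are hypothetical names introduced for this example.

```python
import random
import time


class InjectedFault(Exception):
    """Marks an error that was injected deliberately, not a real failure."""


def inject_faults(func, error_rate=0.0, latency_s=0.0, rng=None, sleep=time.sleep):
    """Wrap func so calls are delayed and fail with probability error_rate."""
    if rng is None:
        rng = random.Random()

    def wrapper(*args, **kwargs):
        if latency_s > 0:
            sleep(latency_s)  # injected latency before the real call
        if rng.random() < error_rate:
            raise InjectedFault(f"injected fault in {func.__name__}")
        return func(*args, **kwargs)

    return wrapper
```

Wrapping a dependency call this way in a sandboxed test environment lets you check whether timeouts, retries and fallbacks in the calling code actually trigger. Passing a seeded `random.Random` as `rng` makes an experiment repeatable.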

Use the following fault injection techniques, and measure the results using metrics such as fault detection rate, false positives, and false negatives:

  • Invalid Inputs: Feed the software with invalid or malformed input data to assess its input validation and error-handling capabilities.
  • Bit Flipping: Modify specific bits in data, instructions, or configuration files to simulate data corruption or code manipulation.
  • Random Code Injection: Inject random code snippets or instructions into the application to simulate unpredictable errors.
  • Memory Exhaustion: Allocate excessive memory or induce memory leaks to observe how the software handles out-of-memory situations.
  • CPU Overload: Overload the CPU by consuming excessive CPU resources to assess the system's responsiveness and recovery.
  • Boundary Testing: Test the software with inputs at or beyond the boundary of valid ranges to evaluate boundary condition handling.
  • Packet Dropping: Simulate network packet loss or network partitioning to assess how the software handles network failures.
  • Latency Injection: Introduce artificial network latency to evaluate the system's responsiveness and timeout settings.
  • Clock Manipulation: Alter the system clock to simulate time-related issues, such as expired certificates or time-sensitive operations.
  • Timeout Testing: Adjust timeouts for network requests or critical operations to test how the software reacts to slow or unresponsive services.
  • Service Interruption: Temporarily stop or disrupt external services, databases, or dependencies that the software relies on to assess its resilience and error recovery.
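
As one example of measuring a technique from the list above, the following Python sketch applies bit flipping to a checksummed message and reports the fault detection rate. The framing scheme (payload followed by a CRC-32 checksum) and the function names are assumptions made for illustration; the system under test would supply its own integrity checks.

```python
import random
import zlib


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


def frame(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 checksum to the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def is_valid(framed: bytes) -> bool:
    """Check that the trailing checksum matches the payload."""
    payload, checksum = framed[:-4], framed[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == checksum


def detection_rate(payload: bytes, trials: int = 200, seed: int = 0) -> float:
    """Flip one random bit per trial and count how often it is detected."""
    rng = random.Random(seed)
    framed = frame(payload)
    detected = 0
    for _ in range(trials):
        corrupted = flip_bit(framed, rng.randrange(len(framed) * 8))
        if not is_valid(corrupted):
            detected += 1
    return detected / trials
```

A false positive in this setup would be an uncorrupted frame reported as invalid, and a false negative would be a corrupted frame that passes `is_valid`. CRC-32 catches every single-bit error, so the more interesting experiments flip several bits or corrupt data that has no checksum at all.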
