Key patterns for resiliency in Microservices Architecture

A few years ago, when I headed the application architecture group, I wrote a set of application architecture principles. I recently found them on my old Miro board.


Most of the above principles are still valid and effective. Most organizations implement microservice architecture in some of their projects, primarily when they focus on modernizing their landscape from legacy systems.

Similarly, a few years back, I wrote up patterns for microservices architecture and grouped them into categories. Look at some of the significant microservices patterns below.



This needs to be refreshed now. I want to go into more detail on resiliency patterns, drawing on what I have learned from implementing microservices architectures.

Resiliency in microservices architecture refers to the system’s ability to handle failures and disruptions gracefully without affecting overall performance or availability. Because microservices are distributed, independent services that communicate over a network, they are more susceptible to failures such as network outages, service crashes, or performance bottlenecks. A resilient microservices system can continue functioning or degrade gracefully when some components fail.

Critical Aspects of Resiliency in Microservices:

  • Fault Tolerance: The ability of a system to continue operating correctly in the event of failure of some of its components. In a resilient microservices architecture, the failure of one microservice should not bring down the entire system.
  • Graceful Degradation: When specific components fail, the system does not crash entirely but continues to provide limited functionality or simplified services.
  • Self-Healing: Resilient systems can recover from failures without human intervention. This might involve restarting failed services, rerouting requests, or retrying operations after a brief delay.
  • Isolation and Containment: Resilient architectures use patterns like bulkheads to isolate services so that failures in one service don’t propagate to others. This prevents cascading failures and allows unaffected parts of the system to continue operating.
  • Redundancy and Failover: Microservices systems often use redundancy (duplicate services or resources) to ensure that another can take over if one service instance fails. This provides failover support, ensuring continuous service.
  • Monitoring and Observability: A resilient system requires real-time tools to monitor health, performance, and failures. Distributed tracing, logging, and health checks help detect problems early and facilitate rapid recovery.

Importance of Resiliency in Microservices:

  • Minimizes Downtime: Resilient systems help avoid downtime, ensuring high availability and better user experience, even when parts of the system fail.
  • Handles Dynamic Traffic: Microservices need to handle traffic spikes and fluctuations. Resiliency patterns help keep the system from crashing even under heavy load.
  • Mitigates Complex Failures: Distributed systems introduce complexity, and failures in one part of the system can have unforeseen impacts. Resiliency patterns limit how far a failure can cascade through the system.
  • Improves Scalability: A resilient architecture can more easily handle the scaling of individual microservices without compromising the entire system’s stability.

How to Achieve Resiliency:

Resiliency in microservices architecture is achieved through various patterns and practices, such as:

  • Circuit Breakers: Keep a failing service from being overwhelmed by temporarily blocking further requests to it.
  • Retries: Automatically retry requests that failed due to transient errors.
  • Timeouts: Set limits on how long to wait for a service response to avoid excessive delays.
  • Bulkheads: Isolate services to prevent one service’s failure from impacting the rest of the system.
  • Fallback Mechanisms: Provide alternative responses when a service fails.

1. Circuit Breaker Pattern

  • Detailed Purpose: In a distributed system, if one service is failing or under heavy load, continually sending requests can worsen the problem. The Circuit Breaker pattern helps avoid this by stopping unnecessary requests to a failing service and giving it time to recover.

How it works: The pattern has three states:

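Closed: Requests flow normally, on the assumption that the service is healthy, while failures are counted.

Open: Once failures cross a threshold, requests are rejected immediately instead of being sent to the failing service.

Half-Open: After a waiting period, a limited number of trial requests are let through. If they succeed, the circuit closes again; if they fail, it reopens.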
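To make the state transitions concrete, here is a minimal, single-threaded sketch in Python. The class name `CircuitBreaker`, the `failure_threshold` and `reset_timeout` parameters, and the injectable `clock` are illustrative assumptions, not the API of Hystrix or any other library. A production breaker would also need locking and metrics.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Hypothetical minimal breaker: closed -> open -> half_open -> closed."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay open before a trial call
        self._clock = clock                         # injectable so the timing can be tested
        self._failures = 0
        self._opened_at = None
        self.state = "closed"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self._clock() - self._opened_at >= self.reset_timeout:
                self.state = "half_open"  # let one trial request through
            else:
                raise CircuitOpenError("circuit is open; request rejected")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._record_failure()
            raise
        # Any success resets the failure count and closes the circuit.
        self._failures = 0
        self.state = "closed"
        return result

    def _record_failure(self):
        self._failures += 1
        # A failed trial call, or too many consecutive failures, opens the circuit.
        if self.state == "half_open" or self._failures >= self.failure_threshold:
            self.state = "open"
            self._opened_at = self._clock()
```

A caller would wrap each outbound request in `breaker.call(...)` and treat `CircuitOpenError` as a signal to return a fallback response rather than wait on a service that is known to be unhealthy.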

Real-life Example: Netflix’s Hystrix library (now in maintenance mode but still widely discussed) used the Circuit Breaker pattern extensively. For instance, if a video recommendation service was slow or down, Hystrix would stop routing requests to it, and users would be presented with a default list of trending content.

2. Retry Pattern

  • Detailed Purpose: Many failures in distributed systems are transient (e.g., network glitches, temporary unavailability). The Retry pattern mitigates these temporary issues by automatically reattempting failed requests after short delays.

How it works:

A configurable delay (or backoff strategy) is used between retries to avoid flooding the service with immediate retries.
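As a rough illustration of retries with a backoff strategy, the sketch below uses exponential backoff with random jitter. The function name `retry`, its parameters, and the choice of which exception types count as transient are assumptions made for this example; real systems should retry only operations that are safe to repeat.

```python
import random
import time


def retry(func, max_attempts=3, base_delay=0.1, max_delay=2.0,
          retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call func(), retrying on transient errors with exponential backoff.

    Hypothetical helper: exceptions not listed in retry_on propagate immediately.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # The delay ceiling doubles each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Random jitter spreads out retries from many clients.
            sleep(random.uniform(0, ceiling))
```

The jitter matters when many clients fail at the same moment: without it they would all retry in lockstep and hit the recovering service with synchronized bursts.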

Real-life Example: In payment systems like Stripe, if a transaction processing service times out due to network instability, the system retries the request a few times before giving up. This helps in cases where temporary outages are expected but should not cause a permanent failure.

3. Timeout Pattern


  • Detailed Purpose: Long-running requests in microservices can lead to resource exhaustion (e.g., holding up threads or connections). The Timeout pattern ensures that requests that take too long are aborted.

How it works:

Each service call is assigned a maximum duration for how long it can take. If the service doesn’t respond within that window, the request is canceled, and the system can either retry or use a fallback.
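One way to sketch this in Python is with `asyncio.wait_for`, which cancels the awaited call when the deadline passes. The helper `call_with_timeout` and the stand-in `slow_payment_service` are hypothetical names used only for illustration, and returning a fallback value is just one of the two options described above.

```python
import asyncio


async def call_with_timeout(coro_func, timeout_seconds, fallback=None):
    """Await coro_func(), giving up after timeout_seconds.

    Hypothetical helper: on timeout the pending call is cancelled by
    asyncio.wait_for and the fallback value is returned instead.
    """
    try:
        return await asyncio.wait_for(coro_func(), timeout=timeout_seconds)
    except asyncio.TimeoutError:
        return fallback


async def slow_payment_service():
    """Stand-in for a remote call that takes too long to answer."""
    await asyncio.sleep(1.0)
    return "payment confirmed"
```

For example, `asyncio.run(call_with_timeout(slow_payment_service, 0.05, fallback="payment delayed"))` gives up after 50 ms instead of holding the caller for the full second.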

Real-life Example: In an online ordering system like Grubhub, if a request to the payment service takes too long, the system times out the request and tries an alternative payment method or notifies the user about the delay. This prevents users from waiting endlessly.

4. Bulkhead Pattern

  • Detailed Purpose: In microservices, failures in one part of the system can lead to cascading failures. The Bulkhead pattern prevents this by isolating resources (like threads, memory, or database connections) between different services, ensuring that failure in one service doesn’t overload others.

How it works:

Resources are partitioned into separate “bulkheads.” For example, each service might be assigned a separate pool of threads or connections.
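A simple way to approximate this partitioning in Python is to give each downstream dependency its own cap on concurrent calls. The `Bulkhead` class below is an illustrative sketch built on `threading.BoundedSemaphore`; its name, the fail-fast behavior when the compartment is full, and the limits shown are assumptions, not a prescribed design.

```python
import threading
from contextlib import contextmanager


class BulkheadFullError(Exception):
    """Raised when a bulkhead has no free slots."""


class Bulkhead:
    """Hypothetical bulkhead: caps concurrent calls to one dependency."""

    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    @contextmanager
    def slot(self):
        # Fail fast instead of queueing, so a saturated dependency
        # cannot tie up the caller's threads.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"bulkhead '{self.name}' is full")
        try:
            yield
        finally:
            self._slots.release()


# Each dependency gets its own compartment (limits are illustrative).
availability_bulkhead = Bulkhead("availability", max_concurrent=10)
payment_bulkhead = Bulkhead("payment", max_concurrent=5)
```

Code that calls the availability service would run inside `with availability_bulkhead.slot():`. If that compartment fills up during a surge, calls made under `payment_bulkhead` still have their own slots available.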

Real-life Example: In a hotel booking system, a surge in requests to the room availability service (e.g., during holiday seasons) might overwhelm its resources. By applying the Bulkhead pattern, the hotel search and payment services won’t be affected, allowing users to continue searching or making payments even if availability checks are delayed.

5. Fallback Pattern

  • Detailed Purpose: In cases where a service fails, it’s often better to return a degraded response (like cached data or a default value) than to throw an error that impacts the user experience.

How it works:

When a service call fails, a pre-defined fallback response is returned.
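The idea can be sketched as a small wrapper that tries the primary call and, on failure, returns a degraded answer. Everything here is hypothetical: the `with_fallback` helper, the fare functions, and the fare figures in `AVERAGE_FARES` are made-up illustrations, not real data or a real API.

```python
def with_fallback(primary, fallback, handled=(Exception,)):
    """Return a function that calls primary, falling back on handled errors."""
    def wrapper(*args, **kwargs):
        try:
            return primary(*args, **kwargs)
        except handled:
            # Degraded but useful answer instead of an error.
            return fallback(*args, **kwargs)
    return wrapper


# Illustrative cached averages per route (made-up values).
AVERAGE_FARES = {("airport", "downtown"): 32.50}


def estimate_fare_live(origin, destination):
    """Stand-in for the live fare service; here it always fails."""
    raise ConnectionError("fare service unavailable")


def estimate_fare_average(origin, destination):
    """Fallback: look up a precomputed average, or None if unknown."""
    return AVERAGE_FARES.get((origin, destination))


estimate_fare = with_fallback(
    estimate_fare_live,
    estimate_fare_average,
    handled=(ConnectionError, TimeoutError),
)
```

Restricting `handled` to connection and timeout errors keeps programming mistakes, such as a `TypeError`, from being silently hidden behind the fallback.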

Real-life Example: In a ride-hailing app like Uber, if the fare estimation service fails, the system might show an average fare for similar routes instead of leaving the user without any information.


Full article on https://medium.com/techartifact-technology-learning/key-patterns-for-resiliency-in-microservices-architecture-992966edbd67




