Resilience4j - Fault Tolerant Microservices - Part I
Aneshka Goyal
AWS Certified Solutions Architect | Software Development Engineer III at Egencia, An American Express Global Business Travel Company
When we develop an application using a microservice architecture, each service is responsible for one concern, and the application is a collection of such services. In this architecture one service might depend on a number of other services to do its job. For example, a service that receives order information for a user might need to connect to another service, the user service, to get details about the user. But a network of microservices brings chances of failures and faults: a service can be down, can respond but take too long, can be overwhelmed by the number of requests, or its REST calls can fail. All of these are deviations from the acceptable ideal scenario, and they are the common faults that can occur.
A service that is able to deal with these faults is said to be fault tolerant. In order to achieve this fault tolerance we can make use of the resilience4j library.
Resilience4j is a lightweight, easy-to-use fault tolerance library inspired by Netflix Hystrix, but designed for Java 8 and functional programming. With Hystrix going into maintenance mode, new and active projects are shifting towards Resilience4j for their fault tolerance needs. One of the major differences between the two is that Hystrix adopted an object-oriented style, while Resilience4j is based on functional programming: functions and Java 8 lambdas can be decorated by other functions that provide fault tolerance capabilities. When we talk about decorators, the decorator pattern comes to mind, which lets us decorate an object with as many decorators as we want. The same holds true for Resilience4j. We will see this when we explore each of the capabilities in detail. Apart from this, Resilience4j decorators are available for sync as well as async calls, so they support the reactive programming paradigm as well.
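To make the idea of decorating functions concrete, here is a minimal sketch in plain Java that does not use Resilience4j itself. The names DecoratorSketch, withFallback and withRetry are hypothetical and exist only for this illustration; the point is that each decorator takes a Supplier and returns a new Supplier, so decorators can be stacked freely.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;
import java.util.function.Supplier;

final class DecoratorSketch {

    // Decorator 1: return a fallback value if the wrapped supplier throws.
    static <T> Supplier<T> withFallback(Supplier<T> supplier, Function<Throwable, T> fallback) {
        return () -> {
            try {
                return supplier.get();
            } catch (RuntimeException e) {
                return fallback.apply(e);
            }
        };
    }

    // Decorator 2: call the wrapped supplier up to maxAttempts times in total.
    static <T> Supplier<T> withRetry(Supplier<T> supplier, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        return () -> {
            RuntimeException last = null;
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                try {
                    return supplier.get();
                } catch (RuntimeException e) {
                    last = e; // remember the failure and try again
                }
            }
            throw last;
        };
    }

    // A supplier that fails on its first two calls, decorated twice.
    static String demo() {
        AtomicInteger calls = new AtomicInteger();
        Supplier<String> flaky = () -> {
            if (calls.incrementAndGet() < 3) {
                throw new IllegalStateException("service down");
            }
            return "Something";
        };
        Supplier<String> decorated = withFallback(withRetry(flaky, 2), e -> "cached value");
        return decorated.get() + " after " + calls.get() + " calls";
    }
}
```

Here DecoratorSketch.demo() returns "cached value after 2 calls": both attempts fail, the retry decorator gives up, and the outer fallback decorator supplies the value. Raising the retry count to 3 would let the third call succeed instead.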
In this article we will talk about two patterns, Circuit Breaker and Bulkhead: how we can leverage Resilience4j in a Spring Boot application to implement them, and what fault tolerance capability we achieve. When we use Resilience4j with Spring Boot, we can either follow the decorator way in three steps (create a config, register a decorator, and use it to decorate the function), or create the configuration in yml and annotate the methods. In this series we will use the annotation way.
Circuit Breaker Pattern
Fault: Consider a scenario where we are trying to connect to a microservice that is taking longer than the threshold time to respond, or is responding with errors. There is no point in continuing to send requests to such a service or waiting for its response.
Tolerance: This is the scenario where the circuit breaker pattern fits best. As the name suggests, the circuit breaker acts like an electric circuit and moves to the open state when the above fault occurs. After a wait, it lets a threshold number of calls through to the service to check whether it is still in an error condition or has recovered. This is the half-open state of the circuit. If the service has started responding without delays or issues (exceptions/errors), the circuit moves from half-open to closed; otherwise it moves back to open. So the circuit moves from closed to open when errors or delays are detected, from open to half-open to check whether the problem has been resolved (allowing only a few calls to reach the faulty service), and from half-open to either open or closed depending on the outcome. While the circuit is not allowing calls to reach the concerned service, we can specify a fallback that fetches results from a cache (or some other means). One difference to note is that Hystrix allows only 1 call to check the transition from the half-open state, while this number is configurable with Resilience4j.
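The state transitions described above can be sketched as a tiny state machine in plain Java. This is not how Resilience4j is implemented: the class TinyCircuitBreaker and its parameters are hypothetical, it counts consecutive failures instead of a failure rate over a sliding window, it does not cap the number of trial calls in the half-open state, and any failure in half-open reopens the circuit. The clock is injected so the behaviour can be exercised without waiting.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

final class TinyCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold; // consecutive failures that open the circuit
    private final int halfOpenCalls;    // successful trial calls needed to close again
    private final long openMillis;      // time to stay OPEN before probing
    private final LongSupplier clock;   // injectable time source in milliseconds

    private State state = State.CLOSED;
    private int failures;
    private int trialSuccesses;
    private long openedAt;

    TinyCircuitBreaker(int failureThreshold, int halfOpenCalls, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.halfOpenCalls = halfOpenCalls;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    // Returns the current state, moving OPEN -> HALF_OPEN once the wait has elapsed.
    synchronized State state() {
        if (state == State.OPEN && clock.getAsLong() - openedAt >= openMillis) {
            state = State.HALF_OPEN;
            trialSuccesses = 0;
        }
        return state;
    }

    <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (state() == State.OPEN) {
            return fallback.get(); // short-circuit: the service is not called at all
        }
        try {
            T result = action.get(); // the call itself is not synchronized
            onSuccess();
            return result;
        } catch (RuntimeException e) {
            onFailure();
            return fallback.get();
        }
    }

    private synchronized void onSuccess() {
        if (state == State.HALF_OPEN) {
            if (++trialSuccesses >= halfOpenCalls) {
                state = State.CLOSED;
                failures = 0;
            }
        } else {
            failures = 0;
        }
    }

    private synchronized void onFailure() {
        if (state == State.HALF_OPEN || ++failures >= failureThreshold) {
            state = State.OPEN;
            openedAt = clock.getAsLong();
            failures = 0;
        }
    }

    // Walks through CLOSED -> OPEN -> HALF_OPEN -> CLOSED with a fake clock.
    static String demo() {
        long[] now = {0};
        TinyCircuitBreaker cb = new TinyCircuitBreaker(2, 2, 1000, () -> now[0]);
        Supplier<String> down = () -> { throw new IllegalStateException("down"); };
        cb.call(down, () -> "fallback");
        cb.call(down, () -> "fallback");
        StringBuilder trace = new StringBuilder().append(cb.state());
        now[0] = 1000; // the wait duration has passed
        trace.append(' ').append(cb.state());
        cb.call(() -> "ok", () -> "fallback");
        cb.call(() -> "ok", () -> "fallback");
        return trace.append(' ').append(cb.state()).toString();
    }
}
```

TinyCircuitBreaker.demo() returns "OPEN HALF_OPEN CLOSED", which mirrors the closed, open, half-open, closed cycle described in the text.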
Code Example: Let's say we have a simple Service called A and we are trying to connect to that service. Let's try to see how we can leverage the circuit breaker pattern here.
We will use Spring Initializr to initialize 2 services.
Service A's POM looks like below
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>2.7.7</version>
        <relativePath/> <!-- lookup parent from repository -->
    </parent>
    <groupId>com.example</groupId>
    <artifactId>service</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <name>A</name>
    <description>Demo project for service A</description>
    <properties>
        <java.version>11</java.version>
    </properties>
    <dependencies>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-test</artifactId>
            <scope>test</scope>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <plugin>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-maven-plugin</artifactId>
            </plugin>
        </plugins>
    </build>
</project>
Service A just has a controller with dummy endpoints
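import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

@RestController
@RequestMapping("/dummy/v1")
public class AController {

    // Responds immediately; fails with an exception when fail=true.
    @GetMapping("/values")
    public String getValue(@RequestParam boolean fail) throws Exception {
        if (fail) {
            System.out.println("Throwing exception");
            throw new Exception();
        }
        return "Something";
    }

    // Same as above, but only after sleeping for 50 seconds.
    @GetMapping("/delayed-values")
    public String getValueWithDelay(@RequestParam boolean fail) throws Exception {
        Thread.sleep(50000);
        if (fail) {
            System.out.println("Throwing exception");
            throw new Exception();
        }
        return "Something";
    }
}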
Controller Advice looks something like below
@ControllerAdvice
public class ExceptionHandler {

    @org.springframework.web.bind.annotation.ExceptionHandler(Exception.class)
    @ResponseStatus(value = HttpStatus.INTERNAL_SERVER_ERROR)
    public void handleError() {
        // just return 500 error code
    }
}
When the dummy/v1/values endpoint of the above service is called with fail=true, it throws an exception and returns a 500 error code.
Let's take a look at the caller service. We call it the Resilience service.
To add Resilience4j and the circuit breaker, we need the following set of dependencies alongside Spring Boot in our POM file
<dependencies>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-actuator</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.cloud</groupId>
        <artifactId>spring-cloud-starter-circuitbreaker-resilience4j</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-aop</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-test</artifactId>
        <scope>test</scope>
    </dependency>
</dependencies>
The actuator dependency is needed because we will access the actuator endpoints for health and event checks (we will see that in a short while). Then we need the circuit breaker dependency. The AOP dependency is needed for annotation-based method decoration. Since annotation-based decoration comes from Spring AOP, it works through proxies, so the caller and the annotated method must belong to different classes for the annotations to take effect.
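Why the caller and the annotated method must live in different classes can be illustrated with a JDK dynamic proxy. This is only an analogy under stated assumptions: Spring AOP builds its own proxies (and may use CGLIB rather than JDK proxies), and the names ProxySketch, Backend and RealBackend are hypothetical. The sketch records which calls the proxy actually intercepts.

```java
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.List;

final class ProxySketch {

    interface Backend {
        String guarded();     // imagine this method carries @CircuitBreaker
        String viaSelfCall(); // calls guarded() on "this"
    }

    static final class RealBackend implements Backend {
        @Override
        public String guarded() {
            return "result";
        }

        @Override
        public String viaSelfCall() {
            // A call on "this" goes straight to the target and never touches the proxy.
            return this.guarded();
        }
    }

    // Wraps the target in a proxy that records the name of every intercepted method.
    static Backend proxied(Backend target, List<String> intercepted) {
        return (Backend) Proxy.newProxyInstance(
                Backend.class.getClassLoader(),
                new Class<?>[] {Backend.class},
                (proxy, method, args) -> {
                    intercepted.add(method.getName());
                    return method.invoke(target, args);
                });
    }

    static List<String> demo() {
        List<String> intercepted = new ArrayList<>();
        Backend backend = proxied(new RealBackend(), intercepted);
        backend.guarded();     // goes through the proxy
        backend.viaSelfCall(); // goes through the proxy once; the inner guarded() does not
        return intercepted;
    }
}
```

ProxySketch.demo() returns [guarded, viaSelfCall]: the guarded() call made from inside viaSelfCall() is not intercepted, which is the same reason an annotated method called from within its own class would run without its circuit breaker.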
Let's now take a look at the Resilience service Controller layer.
@RestController
@RequestMapping("/demo/v1")
public class DemoController {

    private final DemoService demoService;

    // Constructor injection; @Autowired is not needed with a single constructor.
    public DemoController(DemoService demoService) {
        this.demoService = demoService;
    }

    @GetMapping("/circuit-breaker")
    public void circuitBreaker() throws InterruptedException {
        // First set: five calls that make service A fail.
        for (int i = 0; i < 5; i++) {
            demoService.callA(true);
        }
        //Thread.sleep(60000);
        // Second set: five calls that service A would answer successfully.
        for (int i = 0; i < 5; i++) {
            demoService.callA(false);
        }
    }
}
This is a simple controller that makes 5 calls to the service layer to call service A (which we initialized above), followed by another set of 5 calls (at first without any delay in between).
Let's now take a look at the Service layer for our Resilience service.
@Service
public class DemoService {

    private final RestTemplate restTemplate = new RestTemplate();

    // Calls service A through the circuit breaker named backendA.
    @CircuitBreaker(name = "backendA", fallbackMethod = "fallback")
    public void callA(boolean fail) {
        // Service A is assumed to listen on plain HTTP on port 8081.
        restTemplate.getForObject("http://localhost:8081/dummy/v1/values?fail={value}", String.class, fail);
        System.out.println("Successfully completed execution");
    }

    // Same parameters as callA plus the exception that triggered the fallback.
    public void fallback(boolean fail, Exception e) {
        System.out.println(e);
    }
}
The service method callA is decorated with a circuit breaker named backendA and has a fallback method specified (in case of errors from backend A). The fallback method just prints the exception.
Let's now take a look at how we configured the circuit breaker named backendA.
resilience4j:
  circuitbreaker:
    configs:
      default:
        registerHealthIndicator: true
        slidingWindowSize: 10
        permittedNumberOfCallsInHalfOpenState: 3
        slidingWindowType: TIME_BASED
        minimumNumberOfCalls: 5
        waitDurationInOpenState: 50s
        failureRateThreshold: 50
        writableStackTraceEnabled: true
        eventConsumerBufferSize: 100 #default
        recordExceptions:
          - org.springframework.web.client.HttpServerErrorException
    instances:
      backendA:
        baseConfig: default
This defines the default config for a circuit breaker; our instance inherits it (it could also have overridden some params).
Here we have configured a circuit breaker with a time-based sliding window, i.e. it analyses the calls made in the last 10 seconds. The minimum number of calls is 5, which means the failure percentage is calculated only if there are at least 5 calls in a 10 second window. Once the percentage of failed calls is 50 or above, we switch to the open state. The circuit remains open for 50 seconds and then moves to the half-open state. In the half-open state we analyse the result of 3 calls in order to move to the closed or the open state. recordExceptions lists the exceptions that are recorded as failures. Apart from these, we have registered the circuit breaker health indicator with actuator health, and we will make a call to get the circuit breaker event history with a buffer size of 100 (the last 100 events). The full set of configurable parameters is listed in the Resilience4j documentation.
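The interplay of the window size, the minimum number of calls and the failure rate threshold can be sketched in plain Java as follows. TimeWindowSketch is a hypothetical class for illustration, not the Resilience4j sliding window, which aggregates calls differently; only the decision rule described above is modelled.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class TimeWindowSketch {

    private static final class Call {
        final long second;
        final boolean failed;

        Call(long second, boolean failed) {
            this.second = second;
            this.failed = failed;
        }
    }

    private final Deque<Call> calls = new ArrayDeque<>();
    private final long windowSeconds;         // like slidingWindowSize with TIME_BASED
    private final int minimumNumberOfCalls;   // no decision below this many calls
    private final float failureRateThreshold; // percentage, e.g. 50

    TimeWindowSketch(long windowSeconds, int minimumNumberOfCalls, float failureRateThreshold) {
        this.windowSeconds = windowSeconds;
        this.minimumNumberOfCalls = minimumNumberOfCalls;
        this.failureRateThreshold = failureRateThreshold;
    }

    // Records the outcome of a call that finished at the given second.
    void record(long second, boolean failed) {
        calls.addLast(new Call(second, failed));
    }

    // Decides whether the circuit should open at the given point in time.
    boolean shouldOpen(long nowSecond) {
        // Drop calls that have slid out of the window.
        while (!calls.isEmpty() && calls.peekFirst().second <= nowSecond - windowSeconds) {
            calls.removeFirst();
        }
        if (calls.size() < minimumNumberOfCalls) {
            return false; // not enough data to calculate a failure rate
        }
        long failed = calls.stream().filter(c -> c.failed).count();
        float failureRate = 100f * failed / calls.size();
        return failureRate >= failureRateThreshold;
    }

    // Uses the values from the configuration above: 10 seconds, 5 calls, 50 percent.
    static String demo() {
        TimeWindowSketch window = new TimeWindowSketch(10, 5, 50f);
        for (int i = 0; i < 4; i++) {
            window.record(0, true);
        }
        boolean afterFour = window.shouldOpen(0); // only 4 calls recorded
        window.record(1, true);
        boolean afterFive = window.shouldOpen(1); // 5 of 5 calls failed
        boolean later = window.shouldOpen(30);    // every call has left the window
        return afterFour + " " + afterFive + " " + later;
    }
}
```

TimeWindowSketch.demo() returns "false true false": four failures are not enough to decide, the fifth failure trips the threshold, and once the calls age out of the 10 second window there is again too little data.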
Let's now take a look at the output when we hit the endpoint demo/v1/circuit-breaker of our Resilience service.
As shown, after 5 calls the calculated failure rate went above the configured threshold, hence none of the next 5 calls went through and the circuit moved to the OPEN state.
We can also subscribe to the events the circuit breaker publishes when it gets errors or makes a state transition, so these events can be observed and acted upon. The actuator can also deliver the history of the last x events; for this case we configured the buffer size as 100. Metrics for the circuit breaker can be captured as well.
Some extra configuration is needed to get the events history. The additional code in yml looks something like below.
management:
  health:
    circuitbreakers:
      enabled: true
  endpoints:
    web:
      exposure:
        include: '*'
  endpoint:
    circuitbreakerevents:
      enabled: true
When we hit the endpoint actuator/circuitbreakerevents we get the following response
{
    "circuitBreakerEvents": [
        {
            "circuitBreakerName": "backendA",
            "type": "ERROR",
            "creationTime": "2023-01-27T19:39:35.697854+05:30[Asia/Kolkata]",
            "errorMessage": "org.springframework.web.client.HttpServerErrorException$InternalServerError: 500 : [no body]",
            "durationInMs": 6670,
            "stateTransition": null
        },
        {
            "circuitBreakerName": "backendA",
            "type": "ERROR",
            "creationTime": "2023-01-27T19:39:35.709307+05:30[Asia/Kolkata]",
            "errorMessage": "org.springframework.web.client.HttpServerErrorException$InternalServerError: 500 : [no body]",
            "durationInMs": 8,
            "stateTransition": null
        },
        {
            "circuitBreakerName": "backendA",
            "type": "ERROR",
            "creationTime": "2023-01-27T19:39:35.717220+05:30[Asia/Kolkata]",
            "errorMessage": "org.springframework.web.client.HttpServerErrorException$InternalServerError: 500 : [no body]",
            "durationInMs": 7,
            "stateTransition": null
        },
        {
            "circuitBreakerName": "backendA",
            "type": "ERROR",
            "creationTime": "2023-01-27T19:39:35.724860+05:30[Asia/Kolkata]",
            "errorMessage": "org.springframework.web.client.HttpServerErrorException$InternalServerError: 500 : [no body]",
            "durationInMs": 7,
            "stateTransition": null
        },
        {
            "circuitBreakerName": "backendA",
            "type": "ERROR",
            "creationTime": "2023-01-27T19:39:35.732837+05:30[Asia/Kolkata]",
            "errorMessage": "org.springframework.web.client.HttpServerErrorException$InternalServerError: 500 : [no body]",
            "durationInMs": 7,
            "stateTransition": null
        },
        {
            "circuitBreakerName": "backendA",
            "type": "FAILURE_RATE_EXCEEDED",
            "creationTime": "2023-01-27T19:39:35.733789+05:30[Asia/Kolkata]",
            "errorMessage": null,
            "durationInMs": null,
            "stateTransition": null
        },
        {
            "circuitBreakerName": "backendA",
            "type": "STATE_TRANSITION",
            "creationTime": "2023-01-27T19:39:35.741170+05:30[Asia/Kolkata]",
            "errorMessage": null,
            "durationInMs": null,
            "stateTransition": "CLOSED_TO_OPEN"
        },
        {
            "circuitBreakerName": "backendA",
            "type": "NOT_PERMITTED",
            "creationTime": "2023-01-27T19:39:35.742287+05:30[Asia/Kolkata]",
            "errorMessage": null,
            "durationInMs": null,
            "stateTransition": null
        },
        {
            "circuitBreakerName": "backendA",
            "type": "NOT_PERMITTED",
            "creationTime": "2023-01-27T19:39:35.742635+05:30[Asia/Kolkata]",
            "errorMessage": null,
            "durationInMs": null,
            "stateTransition": null
        },
        {
            "circuitBreakerName": "backendA",
            "type": "NOT_PERMITTED",
            "creationTime": "2023-01-27T19:39:35.742878+05:30[Asia/Kolkata]",
            "errorMessage": null,
            "durationInMs": null,
            "stateTransition": null
        },
        {
            "circuitBreakerName": "backendA",
            "type": "NOT_PERMITTED",
            "creationTime": "2023-01-27T19:39:35.743097+05:30[Asia/Kolkata]",
            "errorMessage": null,
            "durationInMs": null,
            "stateTransition": null
        },
        {
            "circuitBreakerName": "backendA",
            "type": "NOT_PERMITTED",
            "creationTime": "2023-01-27T19:39:35.743323+05:30[Asia/Kolkata]",
            "errorMessage": null,
            "durationInMs": null,
            "stateTransition": null
        }
    ]
}
These events give us a picture of when the state transition took place and post that the calls were NOT_PERMITTED.
If we make a slight code change and uncomment Thread.sleep in our DemoController, we see a different response, as below.
This response is expected since the State transition to HALF_OPEN would be made post 50 secs and we have waited for 60 sec and in HALF_OPEN 3 calls succeeded and thus the circuit was again CLOSED. Let's also take a look at the events published by our circuit breaker in this case.
{
    "circuitBreakerEvents": [
        {
            "circuitBreakerName": "backendA",
            "type": "ERROR",
            "creationTime": "2023-01-27T19:54:51.854469+05:30[Asia/Kolkata]",
            "errorMessage": "org.springframework.web.client.HttpServerErrorException$InternalServerError: 500 : [no body]",
            "durationInMs": 66,
            "stateTransition": null
        },
        {
            "circuitBreakerName": "backendA",
            "type": "ERROR",
            "creationTime": "2023-01-27T19:54:51.862414+05:30[Asia/Kolkata]",
            "errorMessage": "org.springframework.web.client.HttpServerErrorException$InternalServerError: 500 : [no body]",
            "durationInMs": 5,
            "stateTransition": null
        },
        {
            "circuitBreakerName": "backendA",
            "type": "ERROR",
            "creationTime": "2023-01-27T19:54:51.868655+05:30[Asia/Kolkata]",
            "errorMessage": "org.springframework.web.client.HttpServerErrorException$InternalServerError: 500 : [no body]",
            "durationInMs": 5,
            "stateTransition": null
        },
        {
            "circuitBreakerName": "backendA",
            "type": "ERROR",
            "creationTime": "2023-01-27T19:54:51.873058+05:30[Asia/Kolkata]",
            "errorMessage": "org.springframework.web.client.HttpServerErrorException$InternalServerError: 500 : [no body]",
            "durationInMs": 4,
            "stateTransition": null
        },
        {
            "circuitBreakerName": "backendA",
            "type": "ERROR",
            "creationTime": "2023-01-27T19:54:51.878572+05:30[Asia/Kolkata]",
            "errorMessage": "org.springframework.web.client.HttpServerErrorException$InternalServerError: 500 : [no body]",
            "durationInMs": 5,
            "stateTransition": null
        },
        {
            "circuitBreakerName": "backendA",
            "type": "FAILURE_RATE_EXCEEDED",
            "creationTime": "2023-01-27T19:54:51.879084+05:30[Asia/Kolkata]",
            "errorMessage": null,
            "durationInMs": null,
            "stateTransition": null
        },
        {
            "circuitBreakerName": "backendA",
            "type": "STATE_TRANSITION",
            "creationTime": "2023-01-27T19:54:51.883987+05:30[Asia/Kolkata]",
            "errorMessage": null,
            "durationInMs": null,
            "stateTransition": "CLOSED_TO_OPEN"
        },
        {
            "circuitBreakerName": "backendA",
            "type": "STATE_TRANSITION",
            "creationTime": "2023-01-27T19:55:51.889326+05:30[Asia/Kolkata]",
            "errorMessage": null,
            "durationInMs": null,
            "stateTransition": "OPEN_TO_HALF_OPEN"
        },
        {
            "circuitBreakerName": "backendA",
            "type": "SUCCESS",
            "creationTime": "2023-01-27T19:55:51.949792+05:30[Asia/Kolkata]",
            "errorMessage": null,
            "durationInMs": 60,
            "stateTransition": null
        },
        {
            "circuitBreakerName": "backendA",
            "type": "SUCCESS",
            "creationTime": "2023-01-27T19:55:51.953897+05:30[Asia/Kolkata]",
            "errorMessage": null,
            "durationInMs": 3,
            "stateTransition": null
        },
        {
            "circuitBreakerName": "backendA",
            "type": "SUCCESS",
            "creationTime": "2023-01-27T19:55:51.956261+05:30[Asia/Kolkata]",
            "errorMessage": null,
            "durationInMs": 2,
            "stateTransition": null
        },
        {
            "circuitBreakerName": "backendA",
            "type": "STATE_TRANSITION",
            "creationTime": "2023-01-27T19:55:51.956592+05:30[Asia/Kolkata]",
            "errorMessage": null,
            "durationInMs": null,
            "stateTransition": "HALF_OPEN_TO_CLOSED"
        },
        {
            "circuitBreakerName": "backendA",
            "type": "SUCCESS",
            "creationTime": "2023-01-27T19:55:51.959390+05:30[Asia/Kolkata]",
            "errorMessage": null,
            "durationInMs": 2,
            "stateTransition": null
        },
        {
            "circuitBreakerName": "backendA",
            "type": "SUCCESS",
            "creationTime": "2023-01-27T19:55:51.962432+05:30[Asia/Kolkata]",
            "errorMessage": null,
            "durationInMs": 2,
            "stateTransition": null
        }
    ]
}
When we talk about circuit breakers, one thing that comes to mind is concurrency and how state transitions are handled when a number of calls run in parallel.
The CircuitBreaker is thread-safe: atomicity is guaranteed, and only one thread is able to update the state or the sliding window at a point in time.
But the CircuitBreaker does not synchronize the function call. That means the function call itself is not part of the critical section. Otherwise a CircuitBreaker would introduce a huge performance penalty and bottleneck: a slow function call would have a huge negative impact on the overall performance/throughput.
If 20 concurrent threads ask for permission to execute a function and the state of the CircuitBreaker is closed, all threads are allowed to invoke the function, even if the sliding window size is 15. The sliding window does not mean that only 15 calls are allowed to run concurrently. If we want to restrict the number of concurrent threads, we can use the Bulkhead pattern (something we will talk about next).
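The difference between an atomic state update and an unsynchronized function call can be sketched with an AtomicReference. This is a simplified illustration in plain Java, not Resilience4j code; AtomicStateSketch and its method are hypothetical names. Every thread runs the protected call freely, while compareAndSet lets exactly one of them perform the CLOSED to OPEN transition.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

final class AtomicStateSketch {
    enum State { CLOSED, OPEN }

    // Starts the given number of threads at once. Each one "invokes the function"
    // (here: increments a counter) and then tries to flip the state CLOSED -> OPEN.
    static String concurrentTransitions(int threads) {
        AtomicReference<State> state = new AtomicReference<>(State.CLOSED);
        AtomicInteger invoked = new AtomicInteger();
        AtomicInteger transitions = new AtomicInteger();
        CountDownLatch start = new CountDownLatch(1);
        CountDownLatch done = new CountDownLatch(threads);
        ExecutorService pool = Executors.newFixedThreadPool(threads);

        for (int i = 0; i < threads; i++) {
            pool.execute(() -> {
                try {
                    start.await();
                    invoked.incrementAndGet(); // the call itself is not in a critical section
                    if (state.compareAndSet(State.CLOSED, State.OPEN)) {
                        transitions.incrementAndGet(); // only one thread wins the update
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    done.countDown();
                }
            });
        }

        start.countDown(); // release all threads together
        try {
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        } finally {
            pool.shutdown();
        }
        return "invoked=" + invoked.get() + " transitions=" + transitions.get();
    }
}
```

AtomicStateSketch.concurrentTransitions(20) returns "invoked=20 transitions=1": all 20 threads were allowed to run the call, but the state changed exactly once.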
Thus here we learnt that the CircuitBreaker is implemented via a finite state machine with three normal states: CLOSED, OPEN and HALF_OPEN and two special states DISABLED and FORCED_OPEN. It can be implemented using a sliding window that can be count based or time based. The count-based sliding window aggregates the outcome of the last N calls. The time-based sliding window aggregates the outcome of the calls of the last N seconds. Here we also saw how we can leverage circuit breaker in a spring boot application.
Bulkhead Pattern
Fault: Consider a scenario where we have a cluster running replicas of a service and all instances have started failing health checks. This happened because all threads (200) of each service instance were consumed in connecting to another service that was experiencing some issue and held the connections indefinitely.
Tolerance: This fault can be handled and tolerated using the Bulkhead pattern. The bulkhead pattern restricts resources and hence reduces resource exhaustion and the associated failures. In the above scenario, had we limited the threads that can be used to connect to the faulty service to just 10, the other threads could have served requests that don't need a connection to the faulty service, and the service would still be up and serving a set of requests without any issues. This would have helped us prevent the failure from cascading. This is what the bulkhead pattern does. Like the circuit breaker, this pattern also comes from a real-life situation: a ship is split into multiple small compartments using bulkheads. Bulkheads seal off parts of the ship to prevent the entire ship from sinking in case of a flood. We try to achieve a similar thing in software design.
Resilience4j provides two ways to implement the bulkhead pattern and restrict concurrent executions: a semaphore-based bulkhead and a thread-pool-based bulkhead.
In this particular example we will see how to implement a bulkhead using a semaphore, followed by the configuration for thread-pool-bulkhead. Again we will connect to service A, which we implemented above in the circuit breaker section. Our caller service, i.e. the Resilience service, needs some code and configuration additions in order to implement the bulkhead logic.
Our POM gets an additional dependency for the bulkhead. This dependency version suits my Spring Boot and Spring Cloud versions best.
<dependency>
    <groupId>io.github.resilience4j</groupId>
    <artifactId>resilience4j-bulkhead</artifactId>
    <version>1.3.0</version>
</dependency>
Next we enhance the controller to add another Get endpoint for bulkhead
@GetMapping("/bulkhead")
public void bulkhead() {
    for (int i = 0; i < 4; i++) {
        CompletableFuture.runAsync(() -> demoService.callAWithBulkhead(false));
    }
}
Here we are trying to make 4 parallel (async) calls to the demo service method callAWithBulkhead and sending failure as false.
Let's take a look at the annotated method that actually attempts to connect to service A.
@CircuitBreaker(name = "backendA", fallbackMethod = "fallback")
@Bulkhead(name = "backendABulk", fallbackMethod = "fallback")
public void callAWithBulkhead(boolean fail) {
    // Service A is assumed to listen on plain HTTP on port 8081.
    restTemplate.getForObject("http://localhost:8081/dummy/v1/values?fail={value}", String.class, fail);
    System.out.println("Successfully completed execution in bulk head " + Thread.currentThread().getName());
}
Here, the circuit breaker config is the same one we used in the circuit breaker section. We have decorated the method with two annotations, circuit breaker and bulkhead. What's new is the bulkhead named backendABulk, with the same fallback method that just prints the exception message. One point to note about fallback methods in general is that they must have the same return type and parameters as the method they act as a fallback for, plus one extra exception parameter, as fallback(boolean fail, Exception e) does here.
Let's now take a look at the configuration for our bulkhead.
resilience4j:
  bulkhead:
    instances:
      backendABulk:
        maxConcurrentCalls: 1
        maxWaitDuration: 1ms
Here, we have configured a semaphore bulkhead with max concurrent calls as 1 and the max duration that a thread will wait as 1ms.
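What these two settings mean can be sketched with a plain java.util.concurrent.Semaphore. This is a minimal illustration under stated assumptions, not the Resilience4j implementation; SemaphoreBulkheadSketch and its constructor parameters are hypothetical names chosen to mirror maxConcurrentCalls and maxWaitDuration.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

final class SemaphoreBulkheadSketch {
    private final Semaphore permits;
    private final long maxWaitMillis;

    SemaphoreBulkheadSketch(int maxConcurrentCalls, long maxWaitMillis) {
        this.permits = new Semaphore(maxConcurrentCalls);
        this.maxWaitMillis = maxWaitMillis;
    }

    // Runs the action if a permit becomes free within maxWaitMillis, else the fallback.
    <T> T call(Supplier<T> action, Supplier<T> fallback) {
        boolean acquired;
        try {
            acquired = permits.tryAcquire(maxWaitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            acquired = false;
        }
        if (!acquired) {
            return fallback.get(); // bulkhead is full: the call is rejected
        }
        try {
            return action.get();
        } finally {
            permits.release(); // always hand the permit back
        }
    }

    // With a single permit, a call made while another call is in flight is rejected.
    static String demo() {
        SemaphoreBulkheadSketch bulkhead = new SemaphoreBulkheadSketch(1, 1);
        return bulkhead.call(
                () -> "outer permitted, inner "
                        + bulkhead.call(() -> "permitted", () -> "rejected"),
                () -> "outer rejected");
    }
}
```

SemaphoreBulkheadSketch.demo() returns "outer permitted, inner rejected": the outer call holds the only permit, so the inner call waits 1 ms and is then rejected, just as a second concurrent caller would be.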
Let's see the results when we hit the bulkhead endpoint.
As shown, the bulkhead rejected all calls except one, because we restricted max concurrent calls to 1. Just like the circuit breaker, the bulkhead also publishes events, and these can be observed or subscribed to. Let's also take a look at the events published, by hitting actuator/bulkheadevents
{
    "bulkheadEvents": [
        {
            "bulkheadName": "backendABulk",
            "type": "CALL_PERMITTED",
            "creationTime": "2023-01-28T19:46:43.435146+05:30[Asia/Kolkata]"
        },
        {
            "bulkheadName": "backendABulk",
            "type": "CALL_REJECTED",
            "creationTime": "2023-01-28T19:46:43.435146+05:30[Asia/Kolkata]"
        },
        {
            "bulkheadName": "backendABulk",
            "type": "CALL_REJECTED",
            "creationTime": "2023-01-28T19:46:43.435146+05:30[Asia/Kolkata]"
        },
        {
            "bulkheadName": "backendABulk",
            "type": "CALL_REJECTED",
            "creationTime": "2023-01-28T19:46:43.435146+05:30[Asia/Kolkata]"
        },
        {
            "bulkheadName": "backendABulk",
            "type": "CALL_FINISHED",
            "creationTime": "2023-01-28T19:46:43.710162+05:30[Asia/Kolkata]"
        }
    ]
}
These also tell us which call was permitted and when it got finished and when calls were rejected. Point to note: By default, the bulkhead is respected before the circuit breaker during execution. But this order can be configured and changed.
If we want to configure a thread pool bulkhead we can specify the configuration like below
resilience4j:
  thread-pool-bulkhead:
    instances:
      backendDBulk:
        coreThreadPoolSize: 1
        queueCapacity: 1
        maxThreadPoolSize: 2
        writableStackTraceEnabled: true
Here we have restricted the core pool size to 1 and the max size to 2, while our queue capacity is 1. writableStackTraceEnabled writes the stack trace for bulkhead exceptions (we could have specified that above as well). To understand the thread pool params, we can follow a basic example: the starting thread pool size is 1, the core pool size is 5, the max pool size is 10 and the queue capacity is 100.
As requests come in, threads are created up to 5, and then tasks are added to the queue until it reaches 100. When the queue is full, new threads are created up to maxPoolSize. Once all the threads are in use and the queue is full, tasks are rejected. As the queue drains, the number of active threads goes down as well.
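The order in which core threads, the queue and the extra threads fill up can be observed with a standard java.util.concurrent.ThreadPoolExecutor. The sketch below assumes the thread pool bulkhead follows the usual ThreadPoolExecutor rules and uses the backendDBulk sizes (core 1, max 2, queue 1); ThreadPoolSketch is a hypothetical name and no Resilience4j code is involved.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class ThreadPoolSketch {

    // Submits the given number of blocking tasks to a pool with core size 1,
    // max size 2 and queue capacity 1, then reports where they ended up.
    static String demo(int tasks) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 2, 20, TimeUnit.MILLISECONDS, new ArrayBlockingQueue<>(1));
        CountDownLatch release = new CountDownLatch(1);
        int rejected = 0;

        for (int i = 0; i < tasks; i++) {
            try {
                pool.execute(() -> {
                    try {
                        release.await(); // keep the worker busy until released
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            } catch (RejectedExecutionException e) {
                rejected++; // all threads busy and the queue is full
            }
        }

        String report = "threads=" + pool.getPoolSize()
                + " queued=" + pool.getQueue().size()
                + " rejected=" + rejected;
        release.countDown(); // let the blocked tasks finish
        pool.shutdown();
        return report;
    }
}
```

ThreadPoolSketch.demo(4) returns "threads=2 queued=1 rejected=1": the first task takes the core thread, the second waits in the queue, the third forces a second thread because the queue is full, and the fourth is rejected.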
Point to note: when using thread-pool-bulkhead with the annotation, we need to specify the type param as well, since the default is the semaphore bulkhead.
@Bulkhead(name = "backendDBulk", type = Bulkhead.Type.THREADPOOL, fallbackMethod = "fallbackT")
Thus here we learnt about how restricting resources can be achieved using bulkhead pattern and thus we can prevent cascading failures and restrict impacts.
We will talk about the other three fault tolerant techniques in the next part.
Sources of knowledge: