Can't cache this!

Introduction

We often see companies publish a short incident recap after a major outage. A couple of recent ones came from Facebook and Cloudflare.

Although knowing the actual root cause is interesting, I think there is a lot to learn from how the engineers managed to troubleshoot the issue.

So I decided to take a stab at it and walk you through the troubleshooting process that took place in one of our recent incidents.

The beginning of it all

A few weeks ago, my team was preparing for the launch of an alpha version of a new product. Engineers were heads down wrapping up the last features and smashing bugs when our QA raised a bug in production.

When users clicked the Buy button, they should have been redirected to the payment page. However, quite often, the underlying request would time out instead.

This was a regression and a major one, hence we decided to jump on it.

Investigation

The first step was to dive into the backend app logs. Here is what we saw:


Judging by the logs, it looked like the request froze at some point during the execution path. This was hard to believe, since the app is simple enough that it shouldn't lead to any deadlocks. Moreover, we should have seen an Exception somewhere, and that wasn't the case here.

What made it even more interesting was that we also saw some requests that executed normally.


Nevertheless, since the bug was fairly easy to reproduce, we needed to keep digging.

Diving into the code

The next step in the investigation was to analyze the log lines in the code and pinpoint the section that was causing the request thread to freeze. This can simply be done by looking at the last log line reported and the next log line that should have been reported.

This exercise brought to light the first suspect. As part of the request, we fetched a large JSON file from an external service. Since this file didn't change very often, we added a caching layer to save some network requests and reduce the latency of the request.

To implement this caching layer, we decided to use a Guava cache. The cache was configured so that it would return old values while the new values were being fetched.

Here is a fragment of code showing how the cache was configured:

var cache = CacheBuilder.newBuilder()
        .refreshAfterWrite(Constants.CACHE_REFRESH_TIME_MS, TimeUnit.MILLISECONDS)
        .build(buildCacheLoader(baseUrl));

private static CacheLoader<String, Data> buildCacheLoader(
      String baseUrl) {
    return new CacheLoader<>() {
      @Override
      public Data load(final String environment) throws Exception {
        log.info("Fetching data");
        var client = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(Constants.TIMEOUT_IN_SECONDS))
            .build();

        final HttpRequest request =
            HttpRequest.newBuilder()
            .uri(URI.create(baseUrl))
            .timeout(Duration.ofSeconds(Constants.TIMEOUT_IN_SECONDS))
            .build();

        final HttpResponse<String> response =
            client.send(request, HttpResponse.BodyHandlers.ofString());

        // Deserialize the JSON payload into our Data model.
        return mapper.readValue(response.body(), Data.class);
      }
    };
}        

At this point, it was clear that the cache wasn't behaving as we expected since:

  • It was causing the request thread to hang.
  • It wasn't logging the potential Exception that would explain what was causing the hang.

Thus, I decided to spend some time trying to understand in depth how Guava caches work.

Setting up the playground

In situations like this one, I usually create a playground project where I can remove all the noise (business logic) and focus on what matters.

I had to come up with a setup as close to production as possible to ensure I was able to reproduce the exact same issue.

To achieve that, I decided to:

  • Interact with the cache in separate threads, since the web server framework we rely on executes each request on a different thread (within a thread pool).
  • Use WireMock to simulate the calls to the external service. That way we can make sure the network stack is exercised and that it is easy to set up edge case conditions (delays, error responses, etc).
  • Use JUnit to set up the different scenarios and run them.
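The stubbing helpers used throughout the tests below (setupRemoteResponse, setupFailedResponse) aren't shown in this article. As a rough, hypothetical illustration of what they do — using only the JDK's built-in com.sun.net.httpserver instead of WireMock, whose actual API differs — the stub could look like this:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the WireMock stubs: serves a configurable
// body, status code and artificial delay to simulate edge cases.
class StubRemoteService {
  private final HttpServer server;
  private volatile String body = "{}";
  private volatile int status = 200;
  private volatile long delayMillis = 0;

  StubRemoteService() throws Exception {
    server = HttpServer.create(new InetSocketAddress(0), 0);
    server.createContext("/", exchange -> {
      try {
        Thread.sleep(delayMillis); // simulate a slow network connection
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
      byte[] bytes = body.getBytes(StandardCharsets.UTF_8);
      exchange.sendResponseHeaders(status, bytes.length == 0 ? -1 : bytes.length);
      try (OutputStream os = exchange.getResponseBody()) {
        if (bytes.length > 0) {
          os.write(bytes);
        }
      }
    });
    server.start();
  }

  // Analogous to setupRemoteResponse(value, delay): a successful,
  // possibly slow, JSON response.
  void setupRemoteResponse(String json, long delayMillis) {
    this.body = json;
    this.status = 200;
    this.delayMillis = delayMillis;
  }

  // Analogous to setupFailedResponse(): an external service outage.
  void setupFailedResponse() {
    this.body = "";
    this.status = 500;
    this.delayMillis = 0;
  }

  String baseUrl() {
    return "http://localhost:" + server.getAddress().getPort() + "/";
  }

  void stop() {
    server.stop(0);
  }
}
```

WireMock offers the same capabilities (response bodies, status codes, fixed delays) through its stubbing DSL; the point here is just that the external service is fully under the test's control.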

To ensure we have visibility into everything that happens during the different test cases, the threads run the following Runnable, which includes several log lines:

@Slf4j
public class CacheReader implements Runnable {
  private LoadingCache<String, Data> cache;

  public CacheReader(LoadingCache<String, Data> cache) {
    this.cache = cache;
  }

  @Override
  public void run() {
    try {
      log.info("Reading value at {}", System.currentTimeMillis());
      log.info("Value {} at {}", this.cache.get("something"), System.currentTimeMillis());
    } catch (ExecutionException e) {
      log.error("Exception while loading value at {}", System.currentTimeMillis(), e);
    }
  }
}        

Mystery 1 - Exception MIA

The first mystery to solve was the missing Exception. There are two potential failures we could have experienced in production that should have reported an Exception:

  1. The cache failed to be populated on the first attempt.
  2. The cache was populated but an attempt to refresh the cache later on failed (either due to an error on the external service or a network timeout).

Test 1 - Given an empty cache and a slow network connection, when fetching the cache, it should throw an Exception

@Test
@SneakyThrows
void givenEmptyCache_whenRemoteServiceTimeout_shouldThrowException() {
    var cache = CacheFactory.buildCacheWithoutExceptionHandling(wireMockServer.baseUrl());

    setupRemoteResponse("value", 2 * Constants.TIMEOUT_IN_SECONDS);

    var thread = new Thread(new CacheReader(cache));
    thread.start();
    thread.join();
}        

Which reports the following logs:

1861 [Thread-8] INFO  org.example.CacheReader [] - Reading value at 1637941200937
1872 [Thread-8] INFO  org.example.CacheFactory [] - Fetching data
2925 [Thread-8] ERROR org.example.CacheReader [] - Exception while loading value at 1637941202006
java.util.concurrent.ExecutionException: java.net.http.HttpTimeoutException: request timed out 
... <omitted stacktrace> ...        

When the cache is empty, the thread loading the cache seems to be the one that receives the Exception.

Since we didn't see any Exception in our logs, this means this wasn't the failure we observed in production. Let's keep going.

Test 2 - Given an empty cache and an external service outage, when fetching the cache, it should throw an Exception

@Test
void givenEmptyCache_whenRemoteServiceReturns500_shouldThrowException() throws InterruptedException {
    LoadingCache<String, Data> cache =
            CacheFactory.buildCacheWithoutExceptionHandling(wireMockServer.baseUrl());

    setupFailedResponse();

    var thread = new Thread(new CacheReader(cache));
    thread.start();
    thread.join();
}        

Which reports the following logs:

1793 [Thread-8] INFO  org.example.CacheReader [] - Reading value at 1637944243058
1807 [Thread-8] INFO  org.example.CacheFactory [] - Fetching data
2070 [Thread-8] ERROR org.example.CacheReader [] - Exception while loading value at 1637944243341
java.util.concurrent.ExecutionException: com.fasterxml.jackson.databind.exc.MismatchedInputException: No content to map due to end-of-input
 at [Source: (String)""; line: 1, column: 0]
... <omitted stacktrace> ...        

In this case, the calling thread is getting a serialization Exception since the response body was not the expected one.

Again, since we didn't see any Exception in our logs, we need to continue the investigation.

Test 3 - Given a populated cache due for refresh and a slow network connection, when fetching the cache, it should get the old value

@Test
@SneakyThrows
  void givenCacheDueForRefresh_whenRemoteServiceTimeout_shouldGetOldValue() {
    var cache = CacheFactory.buildCacheWithoutExceptionHandling(wireMockServer.baseUrl());

    setupRemoteResponse(new Data("old_value"));

    var thread = new Thread(new CacheReader(cache));
    thread.start();
    thread.join();

    waitForCacheToExpire();

    setupRemoteResponse(new Data("new_value"), 2 * Constants.TIMEOUT_IN_SECONDS);

    thread = new Thread(new CacheReader(cache));
    thread.start();
    thread.join();
}        

Which reports the following logs:

1834 [Thread-8] INFO  org.example.CacheReader [] - Reading value at 1637945326283
1846 [Thread-8] INFO  org.example.CacheFactory [] - Fetching data
2099 [Thread-8] INFO  org.example.CacheReader [] - Value Data(value=old_value) at 1637945326554
3114 [Thread-9] INFO  org.example.CacheReader [] - Reading value at 1637945327569
3115 [Thread-9] INFO  org.example.CacheFactory [] - Reloading
3115 [Thread-9] INFO  org.example.CacheFactory [] - Fetching data
com.google.common.cache.LocalCache$Segment$1 run
WARNING: Exception thrown during refresh
java.util.concurrent.ExecutionException: java.net.http.HttpTimeoutException: request timed out
    at com.google.common.util.concurrent.AbstractFuture.getDoneValue(AbstractFuture.java:566)
    at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:527)
... <omitted stacktrace> ...
4296 [Thread-9] INFO  org.example.CacheReader [] - Value old_value at 1637941240139        

In these logs, we see that the thread that attempts to load the value after the cache is due for refresh will block (we will follow up on this later) in an attempt to reload the new value. However, if that fails, it will get the old value, and an Exception will be logged in the background.

When we look into the Guava documentation for the type of cache we are using (refreshAfterWrite), we find the following:

If an exception is thrown while refreshing, the old value is kept, and the exception is logged and swallowed.

This explains why we never saw the Exception in the logs: it was logged by Guava and swallowed, and that log line was being filtered out when our app ships its logs to NewRelic. This was a decision we made to avoid polluting NewRelic with logs we don't care about, but it ended up biting us.
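One way to keep that signal without flooding the log pipeline is to forward just these warnings into the app's own logging. Guava emits them through java.util.logging; here is a hedged sketch, where the logger name is an assumption based on the LocalCache class visible in the stack trace above (verify it against your Guava version), and System.err stands in for the real app logger:

```java
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

// Attach a handler to the java.util.logging logger that Guava's cache
// internals appear to use, so swallowed refresh failures are forwarded
// to a destination our log filtering won't drop.
class GuavaCacheLogForwarder {
  static final Logger GUAVA_CACHE_LOGGER =
      Logger.getLogger("com.google.common.cache.LocalCache");

  static void install() {
    GUAVA_CACHE_LOGGER.addHandler(new Handler() {
      @Override
      public void publish(LogRecord record) {
        if (record.getLevel().intValue() >= Level.WARNING.intValue()) {
          // Forward into the app's logging pipeline; System.err is a stand-in.
          System.err.println("[cache-refresh] " + record.getMessage());
        }
      }

      @Override public void flush() {}
      @Override public void close() {}
    });
  }
}
```

The broader lesson is the same either way: whatever log filtering you apply before shipping to NewRelic, make sure library-level warnings like these survive it.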

Mystery 1 solved.

Mystery 2 - Request thread hanging

While troubleshooting the previous issue, we found that, when the cache is due for refresh, the thread that attempts to fetch the new value will block and, if it fails, it will eventually get the old value.

The next thing to figure out was what happens when many threads attempt to get the value from the cache at the same time while it is due for refresh.

Test 4 - Given a populated cache due for refresh and a slow network connection, when many threads attempt to fetch the cache, they should all get the old value

@Test
@SneakyThrows
  void givenCacheDueForRefresh_whenRemoteServiceTimeout_allThreadsShouldGetOldValue() {
    var cache = CacheFactory.buildCacheWithoutExceptionHandling(wireMockServer.baseUrl());

    setupRemoteResponse(new Data("old_value"));

    var thread = new Thread(new CacheReader(cache));
    thread.start();
    thread.join();

    waitForCacheToExpire();

    setupRemoteResponse(new Data("new_value"), 2 * Constants.TIMEOUT_IN_SECONDS);

    runManyAndWait(cache, 3);
  }        

Which reports the following logs:

1861 [Thread-8] INFO  org.example.CacheReader [] - Reading value at 1637945351638
1871 [Thread-8] INFO  org.example.CacheFactory [] - Fetching data
2118 [Thread-8] INFO  org.example.CacheReader [] - Value Data(value=old_value) at 1637945351899
3132 [Thread-11] INFO  org.example.CacheReader [] - Reading value at 1637945352913
3132 [Thread-10] INFO  org.example.CacheReader [] - Reading value at 1637945352913
3132 [Thread-9] INFO  org.example.CacheReader [] - Reading value at 1637945352913
3132 [Thread-9] INFO  org.example.CacheFactory [] - Reloading
3132 [Thread-10] INFO  org.example.CacheReader [] - Value Data(value=old_value) at 1637945352913
3132 [Thread-9] INFO  org.example.CacheFactory [] - Fetching data
3132 [Thread-11] INFO  org.example.CacheReader [] - Value Data(value=old_value) at 1637945352913
com.google.common.cache.LocalCache$Segment$1 run
WARNING: Exception thrown during refresh
java.util.concurrent.ExecutionException: java.net.http.HttpTimeoutException: request timed out
... <omitted stacktrace> ...
4185 [Thread-9] INFO  org.example.CacheReader [] - Value old_value at 1637942442820        

Here, we observe that the unlucky Thread-9 blocks to fetch the new value while all the other threads get the old value and move on with their lives.

This gives us enough information to conclude what happened in production!

When our QA was running their regression test, their requests would often be handled by the unlucky thread that gets blocked while fetching the new value. Since we have multiple instances of this application, each with its own cache due for refresh, this was easy to reproduce.

Funnily enough, when multiple people were trying at the same time, those lucky enough to hit the same instance right after the refresh had been triggered would see everything work just fine, which is what we saw in the logs.

Mystery 2 solved!

Are we done?

Not quite!

We have learnt that, when the cache is due for refresh, there is always an unlucky thread that has to pay the latency cost of refreshing the value.

To provide the best experience for our users, we should make sure the cache refresh takes place asynchronously and that all requests get the old value while the new one is being fetched.

So ... how do we do that?

Actually fixing the problem

If we look again into Guava's documentation, we can see a couple of comments about asynchronous reloading:

[...] As specified in LoadingCache.refresh(K), refreshing a key loads a new value for the key, possibly asynchronously [...]
[...] If CacheLoader.reload is implemented to be asynchronous, then the query will not be slowed down by the refresh [...]

Unfortunately, they forgot to add an example of how to do it!

It's alright, it happens. A couple of DuckDuckGo searches later, we found that we could achieve this by configuring the cache to use asyncReloading:

public static LoadingCache<String, Data> buildCacheWithAsyncLoading(
      String baseUrl, ExecutorService pool) {
    return CacheBuilder.newBuilder()
        .refreshAfterWrite(Constants.CACHE_REFRESH_TIME_MS, TimeUnit.MILLISECONDS)
        .build(CacheLoader.asyncReloading(CacheFactory.buildCacheLoader(baseUrl), pool));
}        

In this case, the refresh takes place on a thread from a separate thread pool that we have to pass in during instantiation.
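To make the semantics concrete, the behaviour asyncReloading buys us can be sketched independently of Guava as a minimal refresh-behind holder (illustrative only; all names here are made up, and real caches like Guava's handle per-key loading, eviction and much more):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Minimal refresh-behind holder: readers always get the current value
// immediately; once the entry is due, a single background task reloads
// it and swaps the reference when done. Failures keep the old value.
class RefreshBehind<V> {
  private final Supplier<V> loader;
  private final ExecutorService pool;
  private final long refreshMillis;
  private final AtomicReference<V> value = new AtomicReference<>();
  private final AtomicBoolean refreshing = new AtomicBoolean(false);
  private volatile long lastLoaded;

  RefreshBehind(Supplier<V> loader, ExecutorService pool, long refreshMillis) {
    this.loader = loader;
    this.pool = pool;
    this.refreshMillis = refreshMillis;
    this.value.set(loader.get()); // the very first load is synchronous
    this.lastLoaded = System.currentTimeMillis();
  }

  V get() {
    // At most one background refresh at a time; callers never wait on it.
    if (System.currentTimeMillis() - lastLoaded >= refreshMillis
        && refreshing.compareAndSet(false, true)) {
      pool.submit(() -> {
        try {
          value.set(loader.get());
          lastLoaded = System.currentTimeMillis();
        } catch (RuntimeException e) {
          // Keep the old value: log-and-swallow, mirroring refreshAfterWrite.
        } finally {
          refreshing.set(false);
        }
      });
    }
    return value.get(); // never blocks on the reload
  }
}
```

This is exactly the property we wanted in production: the latency cost of the reload lands on a pool thread, never on a request thread.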

To validate this works as expected, we write another test:

@Test
@SneakyThrows
void solution_whenFutureResponsesTimeoutAndAsyncLoading_returnOldValue() {
    ExecutorService pool = Executors.newFixedThreadPool(10);
    try {
      var cache = CacheFactory.buildCacheWithAsyncLoading(wireMockServer.baseUrl(), pool);

      setupRemoteResponse(new Data("old_value"));

      runAndWait(cache);

      waitForCacheToExpire();

      setupRemoteResponse(new Data("new_value"), 2 * Constants.TIMEOUT_IN_SECONDS);

      runManyAndWait(cache, THREAD_COUNT);
    } finally {
      Thread.sleep(10000);
      pool.shutdown();
    }
}        

Which reports the following logs:

1772 [Thread-8] INFO  org.example.CacheReader [] - Reading value at 1637945095606
1783 [Thread-8] INFO  org.example.CacheFactory [] - Fetching data
2032 [Thread-8] INFO  org.example.CacheReader [] - Value Data(value=old_value) at 1637945095871
3049 [Thread-9] INFO  org.example.CacheReader [] - Reading value at 1637945096888
3049 [Thread-10] INFO  org.example.CacheReader [] - Reading value at 1637945096888
3050 [Thread-11] INFO  org.example.CacheReader [] - Reading value at 1637945096889
3050 [Thread-11] INFO  org.example.CacheReader [] - Value Data(value=old_value) at 1637945096889
3050 [Thread-10] INFO  org.example.CacheReader [] - Value Data(value=old_value) at 1637945096889
3053 [pool-2-thread-1] INFO  org.example.CacheFactory [] - Reloading
3053 [pool-2-thread-1] INFO  org.example.CacheFactory [] - Fetching data
3065 [Thread-9] INFO  org.example.CacheReader [] - Value Data(value=old_value) at 1637945096904
4063 [pool-2-thread-1] ERROR org.example.CacheFactory [] - Loading Exception at 1637945097902
... <omitted stacktrace> ...        

We can see that there isn't an unlucky thread anymore and that the Exception is reported by the thread in the pool.

Conclusion

When we encounter an issue in production, depending on its severity and impact, we often perform an RCA (root cause analysis) investigation.

The goal of this process is to find the root cause of the problem, fix it and also to come up with ways we can prevent issues like this happening in the future or, at least, improve our automated monitoring to ensure we catch these issues as soon as they happen.

For this particular problem, apart from fixing the bug, we decided to:

  • Use NewRelic scripted browser synthetics to periodically run a smoke test similar to what QA was doing as part of the regression test. This will allow us to catch user-impacting issues early.
  • Tweak our request latency alerts, with a similar goal as the one above.
  • Come up with a plan to implement browser-based E2E automation. This will allow us to catch regressions in internal environments before they reach production.

And that's the end of it all!

If you are curious to see the playground code, you can find it in:

