Race condition inside .NET Core MemoryCache, or: how we lost 1 GB of RAM within half a day (with a happy ending)

Hi there,

We would like to share an interesting technical case we resolved on one of our projects. It may be useful for many solutions that use MemoryCache in .NET Core.

The solution we work on is a Windows endpoint security service that filters browser traffic on a large number of endpoints and, among other things, uses the well-known MemoryCache class from the .NET Core Extensions.

We got a report from one of our end customers that he was seeing a memory leak on one of his machines: the amount of memory consumed by the application kept growing during its work for no apparent reason. This was quite puzzling, because .NET's garbage collector manages the allocation and release of memory for the application.

OK, let’s start debugging.

The investigation process

We analysed the memory dumps from the machine and found a few hints that helped us locate the problem. We couldn’t pinpoint the exact place, since the solution is obfuscated, so we set out to reproduce the issue on our side and managed to find a scenario that led to significant memory consumption. To achieve that, we used a crawler that generated thousands of redirects between URLs within an hour. After leaving it running for one night, we finally reached a state where memory consumption had grown to more than 1 GB and nothing was released even after all the browsers were closed and the service ‘cooled down’.

We collected a number of locally generated dumps from different stages, started analysing them and traced the problem to the cache mentioned above, which we use internally: it repeatedly (and unsuccessfully) tried to find the expired records, created massive arrays for this and kept all this ‘mass’ in memory across dozens of threads.

It is always hard to blame Microsoft's code, so we double-checked it. Our findings were confirmed: tests with heavy usage of the cache, even without any of our extra logic, behaved in the same way.

We found the source code of this caching extension: it is part of the .NET Core runtime libraries and is available on GitHub and as a public NuGet package (Microsoft.Extensions.Caching.Memory).

We investigated how it searches for expired values and found that it works as follows. A background thread that walks the array of values is launched (via the StartScanForExpiredItems function) on every call to the cache: on SetEntry, TryGetValue, Remove and on EntryExpired. The launch is performed without any locks, which allows multiple scan tasks to run when a single one is all that is needed. This happens because of a race condition inside the StartScanForExpiredItems routine: it lets many threads pass its “if” check, since it doesn’t use any locking mechanism.
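
For illustration, here is a simplified sketch of that check-then-act pattern. This is not the exact runtime source; the class, field names and values are simplified, but the shape of the race is the same:

```csharp
using System;
using System.Threading.Tasks;

// Simplified sketch of the unsynchronized scan trigger (illustrative only,
// not the actual Microsoft.Extensions.Caching.Memory source).
public class ScanTriggerSketch
{
    private readonly TimeSpan _expirationScanFrequency = TimeSpan.FromMinutes(1);
    private DateTimeOffset _lastExpirationScan = DateTimeOffset.MinValue;

    // Called on every cache access (SetEntry, TryGetValue, Remove, ...).
    public void StartScanForExpiredItems()
    {
        var now = DateTimeOffset.UtcNow;
        if (now - _lastExpirationScan > _expirationScanFrequency)
        {
            // Check-then-act race: under load, many threads can read the stale
            // timestamp and pass this check before the assignment below becomes
            // visible to them, so many scan tasks are launched instead of one.
            _lastExpirationScan = now;
            Task.Run(() => ScanForExpiredItems());
        }
    }

    private void ScanForExpiredItems()
    {
        // Walks all entries, collects the expired ones and removes them.
    }
}
```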

On the one hand this design has an upside: if we do not call the cache, it won’t be scanned, because nobody calls StartScanForExpiredItems, which can save some execution time.

The key problem is that under CPU load, when hundreds of threads hit, say, SetEntry, we can end up launching hundreds of tasks that check for expired items, and each of them puts pressure on memory while also holding locks. That works for small caches, but ours is big: thousands of records stored inside a ConcurrentDictionary, which is not ideal from the locking point of view.

As a result, we got a ‘critical mass’ of threads concurrently checking for expired items. Because of the locks, they did not have enough time to check everything before the next cycle started, so their total number grew larger and larger, and we ended up with ~1 GB of RAM consumed within half a day.
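
For reference, a synthetic repro in the same spirit might look like the sketch below. It is illustrative only and assumes an affected version of the Microsoft.Extensions.Caching.Memory package; our real test drove the cache through browser traffic rather than a loop, and the key count and scan frequency here are arbitrary:

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Extensions.Caching.Memory;

class Program
{
    static void Main()
    {
        var cache = new MemoryCache(new MemoryCacheOptions
        {
            // A short scan frequency makes the effect show up faster.
            ExpirationScanFrequency = TimeSpan.FromMilliseconds(10)
        });

        // Hammer the cache from many threads; every Set/TryGetValue call
        // can trigger the expired-items scan described above.
        Parallel.For(0, 1_000_000, i =>
        {
            var key = "key-" + (i % 50_000);
            cache.Set(key, new byte[256], TimeSpan.FromSeconds(1));
            cache.TryGetValue(key, out object _);
        });

        Console.WriteLine($"Working set: {Environment.WorkingSet / (1024 * 1024)} MB");
    }
}
```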

The solution we found

First of all, we pulled the caching part of the runtime library into a separate project. We tried two potential solutions:

  1. Add locks to the cache, slowing it down but giving its functions proper synchronisation. It became too slow, so this wasn’t an option for us.
  2. Transfer the responsibility for launching the scanning thread to the client side, changing the cache design a bit. Now the client calls a public interface once in a while and starts the verification of expired items itself. No locks are needed with this approach, because the verification is no longer concurrent: a single client triggers it periodically (see the sketch after this list).
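
Roughly, the second approach looks like this. The names below are hypothetical, not our production code: the modified cache exposes a public scan method and never starts scan tasks on its own, while a single client-side timer calls it periodically:

```csharp
using System;
using System.Threading;

// Hypothetical interface exposed by the modified cache: it walks the entries
// and evicts the expired ones, but only when explicitly asked to.
public interface IScannableCache
{
    void RemoveExpiredEntries();
}

public sealed class CacheMaintenance : IDisposable
{
    private readonly Timer _timer;

    public CacheMaintenance(IScannableCache cache, TimeSpan interval)
    {
        // A single timer callback is the only caller of the scan, so the
        // trigger no longer races with hundreds of worker threads.
        _timer = new Timer(_ => cache.RemoveExpiredEntries(), null, interval, interval);
    }

    public void Dispose() => _timer.Dispose();
}
```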

After finalising the second approach and a few all-night tests monitoring peak memory consumption, the fix was confirmed.

By the way, this is not the only problem with the cache in a multithreaded environment:

https://github.com/dotnet/runtime/issues/36499

https://blog.novanet.no/asp-net-core-memory-cache-is-get-or-create-thread-safe/
