Race condition inside .NET Core MemoryCache, or: how we lost 1 GB of RAM within half a day (with a happy ending)

Hi there,

We would like to share an interesting technical case we resolved on one of our projects. It may be useful for many solutions that use MemoryCache in .NET Core.

The solution we work on is a Windows endpoint security service that filters browser traffic on a large number of endpoints and, among other things, uses the well-known MemoryCache class from the .NET Core Extensions.

We got a report from one of our end customers that he was seeing a memory leak on one of his machines: the amount of memory consumed by the application kept growing during its work for no apparent reason. This was quite puzzling, because .NET's garbage collector manages the allocation and release of memory for the application.

OK, let’s start debugging.

The investigation process

We analysed the memory dumps from the machine and found a few hints that helped us locate the problem. We couldn’t pinpoint the exact place, since the solution is obfuscated, so we set out to reproduce the issue on our side and managed to find a scenario that led to significant memory consumption. To achieve that, we used a crawler that generated thousands of redirects between URLs within an hour. After leaving it running for one night, we finally reached a state where memory consumption had grown to more than 1 GB and nothing was released even after all the browsers were closed and the service ‘cooled down’.

We collected a number of locally generated dumps from different stages, started analysing them and traced the problem to the cache mentioned above, which we use internally: it repeatedly (and unsuccessfully) tried to find the expired records, created massive arrays for this and kept all this ‘mass’ in memory across dozens of threads.

It is always hard to blame Microsoft's code, so we double-checked it. Our findings were confirmed: tests with heavy usage of the cache, even without any of our extra logic, behaved in the same way.

We found the source code of this caching extension: it is part of the .NET Core runtime libraries and is available on GitHub and as a public NuGet package (Microsoft.Extensions.Caching.Memory).

We investigated how it searches for expired values and found that it works as follows. A background thread that walks the array of values is launched (via the StartScanForExpiredItems function) on every call to the cache: on SetEntry, TryGetValue, Remove and on EntryExpired. The launch is performed without any locks, which allows multiple scan tasks to run when a single one is all that is needed. This happens because of a race condition inside the StartScanForExpiredItems routine: it lets many threads pass its “if” check, since it doesn’t use any locking mechanism.
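
For illustration, here is a simplified sketch of that check-then-act pattern. This is not the exact runtime source; the class, field names and values are simplified, but the shape of the race is the same:

```csharp
using System;
using System.Threading.Tasks;

// Simplified sketch of the unsynchronized scan trigger (illustrative only,
// not the actual Microsoft.Extensions.Caching.Memory source).
public class ScanTriggerSketch
{
    private readonly TimeSpan _expirationScanFrequency = TimeSpan.FromMinutes(1);
    private DateTimeOffset _lastExpirationScan = DateTimeOffset.MinValue;

    // Called on every cache access (SetEntry, TryGetValue, Remove, ...).
    public void StartScanForExpiredItems()
    {
        var now = DateTimeOffset.UtcNow;
        if (now - _lastExpirationScan > _expirationScanFrequency)
        {
            // Check-then-act race: under load, many threads can read the stale
            // timestamp and pass this check before the assignment below becomes
            // visible to them, so many scan tasks are launched instead of one.
            _lastExpirationScan = now;
            Task.Run(() => ScanForExpiredItems());
        }
    }

    private void ScanForExpiredItems()
    {
        // Walks all entries, collects the expired ones and removes them.
    }
}
```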

On the one hand this design has an upside: if we do not call the cache, it won’t be scanned, because nobody calls StartScanForExpiredItems, which can save some execution time.

The key problem is that under CPU load, when hundreds of threads hit, say, SetEntry, we can end up launching hundreds of tasks that check for expired items, and each of them puts pressure on memory while also holding locks. That works for small caches, but ours is big: thousands of records stored inside a ConcurrentDictionary, which is not ideal from the locking point of view.

As a result, we got a ‘critical mass’ of threads concurrently checking for expired items. Because of the locks, they did not have enough time to check everything before the next cycle started, so their total number grew larger and larger, and we ended up with ~1 GB of RAM consumed within half a day.
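
For reference, a synthetic repro in the same spirit might look like the sketch below. It is illustrative only and assumes an affected version of the Microsoft.Extensions.Caching.Memory package; our real test drove the cache through browser traffic rather than a loop, and the key count and scan frequency here are arbitrary:

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Extensions.Caching.Memory;

class Program
{
    static void Main()
    {
        var cache = new MemoryCache(new MemoryCacheOptions
        {
            // A short scan frequency makes the effect show up faster.
            ExpirationScanFrequency = TimeSpan.FromMilliseconds(10)
        });

        // Hammer the cache from many threads; every Set/TryGetValue call
        // can trigger the expired-items scan described above.
        Parallel.For(0, 1_000_000, i =>
        {
            var key = "key-" + (i % 50_000);
            cache.Set(key, new byte[256], TimeSpan.FromSeconds(1));
            cache.TryGetValue(key, out object _);
        });

        Console.WriteLine($"Working set: {Environment.WorkingSet / (1024 * 1024)} MB");
    }
}
```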

The solution we found

First of all, we pulled the caching part of the runtime library into a separate project. We tried two potential solutions:

  1. Add locks to the cache, slowing it down but giving its functions proper synchronisation. It became too slow, so this wasn’t an option for us.
  2. Transfer the responsibility for launching the scanning thread to the client side, changing the cache design a bit. Now the client calls a public interface once in a while and starts the verification of expired items itself. No locks are needed with this approach, because the verification is no longer concurrent: a single client triggers it periodically (see the sketch after this list).
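
Roughly, the second approach looks like this. The names below are hypothetical, not our production code: the modified cache exposes a public scan method and never starts scan tasks on its own, while a single client-side timer calls it periodically:

```csharp
using System;
using System.Threading;

// Hypothetical interface exposed by the modified cache: it walks the entries
// and evicts the expired ones, but only when explicitly asked to.
public interface IScannableCache
{
    void RemoveExpiredEntries();
}

public sealed class CacheMaintenance : IDisposable
{
    private readonly Timer _timer;

    public CacheMaintenance(IScannableCache cache, TimeSpan interval)
    {
        // A single timer callback is the only caller of the scan, so the
        // trigger no longer races with hundreds of worker threads.
        _timer = new Timer(_ => cache.RemoveExpiredEntries(), null, interval, interval);
    }

    public void Dispose() => _timer.Dispose();
}
```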

After finalising the second approach and a few all-night tests monitoring peak memory consumption, the fix was confirmed.

By the way, this is not the only problem with the cache in a multithreaded environment:

https://github.com/dotnet/runtime/issues/36499

https://blog.novanet.no/asp-net-core-memory-cache-is-get-or-create-thread-safe/
