If It Works... Don't Touch It?
Ashutosh Pareek
VP - Lead Software Engineer | .Net Enterprise Cloud Platform Architect | Innovator | C#, AWS, .Net, SQL, Web APIs, IBM MQ, Kafka, Terraform | Banking domain, ATM, EMV (chip card tech)
A few weeks ago, I found myself in the middle of an interesting tech mystery: our UI was randomly throwing 502/504 errors. Diving into the logs, I noticed something bizarre: lightweight calls, like our liveness check, were taking up to 15 seconds. That's like asking someone for the time and them taking a coffee break before answering. I started to suspect our requests were being throttled at Kestrel, queuing up for so long that the load balancer threw a fit and returned those lovely 502/504 errors.
The team mentioned this issue had cropped up with the latest release and had never been seen before. So, I began the fun task of scrutinizing the new code changes for any blocking calls. Surprise! There were none. Instead, several changes had been made to convert blocking calls to non-blocking ones, which really threw a wrench into my initial hypothesis.
Determined to crack the case, I set up the API locally and bombarded it with production load using JMeter. I wielded dotnet-counters like Sherlock's magnifying glass to check for thread pool starvation. Sure enough, the metrics showed that after the initial load, the thread pool was as busy as a one-armed paper hanger. The code was straightforward: one API call led to a database call, followed by spawning 10 to 20 tasks with Task.Run, each making another database call and an HTTP API call. All these were non-blocking, with awaits in the right places, so I expected threads to return to the pool quickly.
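To make the shape of that code concrete, here is a minimal sketch of the pattern described above; the entity, repository, and endpoint names are hypothetical, not the actual production code:

```csharp
using System.Net.Http.Json; // for PostAsJsonAsync

// Illustrative sketch only -- names and counts are made up.
public class OrderHandler
{
    private readonly IOrderRepository _orders;
    private readonly IDetailRepository _details;
    private readonly HttpClient _http;

    public OrderHandler(IOrderRepository orders, IDetailRepository details, HttpClient http)
        => (_orders, _details, _http) = (orders, details, http);

    public async Task HandleAsync(int orderId)
    {
        // First database call on the request path.
        var order = await _orders.GetOrderAsync(orderId);

        // Fan out 10-20 units of work, each queued to the thread pool via Task.Run.
        var tasks = order.Items.Select(item => Task.Run(async () =>
        {
            // Each task makes another database call...
            var detail = await _details.GetDetailAsync(item.Id);

            // ...followed by an outbound HTTP API call.
            await _http.PostAsJsonAsync("https://downstream.example/api/items", detail);
        }));

        // With non-blocking I/O and awaits in the right places, the threads
        // should return to the pool at each await.
        await Task.WhenAll(tasks);
    }
}
```

For the thread pool diagnosis itself, dotnet-counters can be pointed at the running process with something like `dotnet-counters monitor --process-id <pid> System.Runtime`, watching the ThreadPool Thread Count and ThreadPool Queue Length counters.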
But no such luck. After running the load and taking multiple memory dumps, I discovered that all the threads were on an extended coffee break: they were stuck creating Oracle DB connections and adding them to the connection pool. That also explained why we only saw issues during the initial load, while the pool was still being filled. Oracle eventually fixed the slow connection creation, but only after a month.
Still puzzled why this issue was tied to the new code since the Oracle client library version hadn't changed, I revisited the pull request. There it was, hiding in plain sight: a PR review comment stating, "It's best practice to use Task.Run instead of Task.Factory.StartNew." Previously, the team used Task.Factory.StartNew with the long-running option, which didn't use thread pool threads. The change to Task.Run used thread pool threads, effectively putting them on babysitting duty when Kestrel needed them to serve incoming requests.
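In diff terms, the change looked roughly like this (the delegate body is a hypothetical placeholder, not the actual code):

```csharp
// Before: the LongRunning hint tells the default scheduler to give the work
// its own dedicated thread, so a slow or blocking delegate does not tie up
// a thread pool thread.
var before = Task.Factory.StartNew(
    () => DoDatabaseAndHttpWork(),   // hypothetical placeholder
    CancellationToken.None,
    TaskCreationOptions.LongRunning,
    TaskScheduler.Default);

// After: Task.Run always queues the delegate to the shared thread pool --
// the same pool Kestrel uses to dispatch incoming requests.
var after = Task.Run(() => DoDatabaseAndHttpWork());
```

When the delegate spends a long time blocked (here, opening Oracle connections), that seemingly cosmetic swap moves the blocking onto the very threads Kestrel needs.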
Key Takeaways:
Why was the change suggested from Task.Factory.StartNew to Task.Run?
Almost every article will tell you Task.Factory.StartNew is bad and should be replaced with Task.Run for I/O-bound work. Far fewer mention that Task.Run queues that work to the shared thread pool. In a web API, Kestrel and your code share that same pool, and once its threads are all blocked, the pool only injects new threads gradually, leaving Kestrel twiddling its thumbs waiting for a free thread to process incoming requests. So avoid Task.Run for anything that might block a thread for a long time; the sketch below shows how quickly this bites.
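Here is a minimal, self-contained repro sketch (hypothetical endpoints and delays, not the production service): under load, the fan-out endpoint pins thread pool threads with blocking work, and even the trivial health endpoint starts to lag because Kestrel processes requests on the same pool.

```csharp
// Minimal ASP.NET Core repro sketch -- for illustration only.
var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

// Lightweight liveness check -- should return almost instantly.
app.MapGet("/healthz", () => Results.Ok("alive"));

// Fans out work via Task.Run. If the delegate blocks (simulated here with
// Thread.Sleep), each task pins a thread pool thread for the full duration
// instead of releasing it at an await, starving Kestrel of threads.
app.MapGet("/fanout", async () =>
{
    var tasks = Enumerable.Range(0, 20).Select(_ => Task.Run(() =>
    {
        Thread.Sleep(5_000); // stand-in for a blocking connection open
    }));

    await Task.WhenAll(tasks);
    return Results.Ok("done");
});

app.Run();
```

Hit `/fanout` with a handful of concurrent requests and the `/healthz` response time climbs from milliseconds to seconds, which is exactly the kind of symptom a load balancer turns into 502/504s.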
What was the potential benefit of changing from Task.Factory.StartNew to Task.Run?
This change would have increased the app's scalability, since all the calls were non-blocking I/O. But did we need that scalability for the business functionality? Nope. The change was made purely in the name of best practice, which brings us back to our software engineering mantra: "If it works, don't touch it." Always make sure a change is necessary and beneficial, and prove it with rigorous testing; otherwise, you might end up fixing something that wasn't broken in the first place.