A practical use case for OpenTelemetry
Sword Luxembourg Experience
Shaping your data into insightful information
Context:
During an engagement for a customer, on a .NET Core project using Blazor, Entity Framework Core, Kafka and related technologies, I encountered performance issues and high memory usage. Here is what I used to troubleshoot the issue and improve the project, taking into account the limitations imposed by the customer's environment.
Overview:
The programming environments we use on a daily basis have all evolved considerably over the last few decades, bringing many enhancements to the languages, making code more readable and understandable, automating resource allocation and, more importantly, deallocation, along with many other helping hands (IntelliSense and Copilot, for instance)…
Nevertheless, I ended up in a situation where my application was consuming a considerable amount of memory, and it kept increasing over time. Strange, you might say: the Garbage Collector of the .NET runtime should take care of cleaning up unused memory structures for us in a transparent manner… Moreover, no matter how much logging I added, I could not pinpoint the source of the problem.
First attempt at a solution
For such cases, Visual Studio has tools to help developers find issues in their code, for instance the “Performance Profiler”, which can be opened with Alt+F2. The profiler contains several tools, including one that tracks the creation and deletion of objects in memory, so I tried that right away:
The problem is that this feature consumes a crazy amount of disk space (several GB after only a couple of minutes), and the development PC provided by the customer only had a 250 GB SSD, causing the profiler to crash before I could find anything useful… I asked if they could provide additional storage, so they replaced my PC with another one… with the same amount of storage. After months of asking, I finally got approval for the support team to install an additional 1 TB SSD, but as you can imagine, I could not wait that long to have a working application. I will not go into the surprises that came along the way, like security software that prevented writing to the new disk for obscure reasons (it was considered an external drive) and applied very restrictive quotas that made Docker crash after a few hours...
Second attempt at a solution
In the past, I have used third-party services like New Relic and, later, Azure Application Insights. These tools are really wonderful in the way they collect large amounts of useful information about your processes, store it in the cloud, and provide tools to dig into this mountain of data and find issues you weren't even aware of. The timing of function calls, the number of objects created and deleted, the time spent in each routine or SQL call: all really useful tidbits that can help you find where the system is spending more time than it should. Combine this with the possibility of uploading your own logs and having the system correlate them with the insights it generates along the way, and it becomes a breeze (usually) to find where some effort should be spent…
But as you have already guessed, such a solution was not authorized by the customer. They had never used the tool and would need to dedicate a team to evaluating it before allocating resources to use it. They were also not keen on the idea of having some of their data in the cloud…
Third attempt at a solution
So, I tried to find other ways to get information about what was happening in my app, and found several options:
Jaeger
This distributed tracing project is hosted by the CNCF, under the umbrella of the Linux Foundation. It runs under Docker (which, fortunately, is accepted by the customer) and ingests data produced by another OSS project called “OpenTelemetry”.
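If you want to try it, the Jaeger “all-in-one” image can be started like this (the image tag and the OTLP switch may differ between Jaeger versions); port 16686 serves the web UI and port 4317 receives OTLP traffic:

docker run -d --name jaeger \
  -e COLLECTOR_OTLP_ENABLED=true \
  -p 16686:16686 \
  -p 4317:4317 \
  jaegertracing/all-in-one:latest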
OpenTelemetry
This is a project managed by the CNCF, the Cloud Native Computing Foundation (https://cncf.io/). It allows the collection of telemetry (traces, metrics and logs) from any app, in any language, in a common, platform-agnostic format. The latest version of Azure Application Insights is compatible with this library, a testament to its robustness and effectiveness. It would also reduce the effort needed, if any, should the customer want to use Application Insights in the future…
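For reference, here is a minimal sketch of what the wiring looks like at startup of a .NET app, assuming the OpenTelemetry.Extensions.Hosting, OpenTelemetry.Instrumentation.AspNetCore and OpenTelemetry.Exporter.OpenTelemetryProtocol NuGet packages are installed; the service name and endpoint are illustrative:

using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

var builder = WebApplication.CreateBuilder(args);

// Register OpenTelemetry tracing: identify the service, instrument
// incoming HTTP requests, listen to our own activity source, and
// ship the resulting spans to Jaeger's OTLP endpoint.
builder.Services.AddOpenTelemetry()
    .ConfigureResource(resource => resource.AddService("MyBlazorApp"))
    .WithTracing(tracing => tracing
        .AddAspNetCoreInstrumentation()
        .AddSource("MyBlazorApp.Activities") // custom spans, see further below
        .AddOtlpExporter(options =>
            options.Endpoint = new Uri("http://localhost:4317"))); // Jaeger in Docker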
How things are linked together
This solution is composed of the following components: the application, instrumented with the OpenTelemetry SDK; an OTLP exporter that ships the collected traces out of the process; and a Jaeger instance running under Docker, whose web UI is used to inspect them:
The end result
The solution I chose is the following:
When everything is configured and running, we can inspect what Jaeger highlights:
When we find an event that took more time than expected, we can drill into it and see what it is composed of:
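Besides what the built-in instrumentation records, you can create your own spans so that app-specific operations appear in this drill-down. In .NET this is done with the standard System.Diagnostics activity API, which OpenTelemetry listens to; a minimal sketch, with illustrative names that must match the AddSource call shown earlier:

using System.Diagnostics;

public static class Telemetry
{
    // One ActivitySource per app or component; its spans are only exported
    // if its name is registered with AddSource(...).
    public static readonly ActivitySource Source = new("MyBlazorApp.Activities");
}

// Somewhere in the processing code:
using (var activity = Telemetry.Source.StartActivity("SaveState"))
{
    activity?.SetTag("state.file", "state.json"); // illustrative tag
    // ... the work being measured ...
}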
Using this view, I was able to detect the code paths that were taking more time than expected. For instance, my app was saving its state to a JSON file too frequently, and it was doing so on the same thread as the main processing; isolating this action in its own thread, triggered by a timer, solved the performance issues.
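For illustration, here is a minimal sketch of that change; the SaveStateToJson method and the 30-second interval are hypothetical:

using System;
using System.Threading;

public sealed class StateSaver : IDisposable
{
    private readonly Timer _timer;

    public StateSaver()
    {
        // Persist the state periodically on a thread-pool thread,
        // instead of after every change on the processing thread.
        _timer = new Timer(_ => SaveStateToJson(), null,
            TimeSpan.FromSeconds(30), TimeSpan.FromSeconds(30));
    }

    private void SaveStateToJson()
    {
        // Hypothetical: serialize the current state and write it to disk.
    }

    public void Dispose() => _timer.Dispose();
}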
Memory consumption
Regarding memory usage, the .NET dump tool (dotnet-dump) helped me find two areas of improvement, once the dump was analyzed in Visual Studio:
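In case you have not used it yet: the dump itself can be captured from the command line with the dotnet-dump global tool, and the resulting file opened in Visual Studio (the process name here is illustrative):

dotnet-dump collect --name MyBlazorApp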
In this case, event types have a limited number of possible values, but they are used in all internal objects. Two options helped reduce the duplication:
- String interning: the string is stored by the .NET runtime in a shared location, and a reference to it is returned, saving a ton of memory allocations. Of course, this should not be used for all strings; it only makes sense for the ones you know beforehand will be used in many places. An article explaining how this works is available in the Microsoft documentation: https://learn.microsoft.com/en-us/dotnet/fundamentals/runtime-libraries/system-string-intern;
- An alternative for strings you know will hold one of a few fixed values is to store those values in a static class and reference them instead, reaching the same goal as above, but limiting the scope to your own process.
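A minimal sketch of both options; the KafkaEvent type and the values are illustrative:

public sealed class KafkaEvent
{
    public string EventType { get; }

    public KafkaEvent(string eventType)
    {
        // string.Intern returns the single shared copy held by the runtime,
        // so millions of events reference one allocation instead of many.
        EventType = string.Intern(eventType);
    }
}

// Alternative: well-known values kept in a static class and referenced directly.
public static class EventTypes
{
    public const string Created = "created";
    public const string Updated = "updated";
    public const string Deleted = "deleted";
}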
Dotnet-counters
The “System.Diagnostics.Metrics” namespace contains useful classes to track internal counters and gauges, which I used to complement what OpenTelemetry provided:
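A minimal sketch of such instruments; the meter, counter and gauge names are illustrative:

using System.Collections.Concurrent;
using System.Diagnostics.Metrics;

public static class AppMetrics
{
    // One Meter per app or component; its name is what dotnet-counters displays.
    public static readonly Meter Meter = new("MyBlazorApp.Metrics", "1.0");

    // A cumulative counter, incremented from the processing code.
    public static readonly Counter<long> MessagesProcessed =
        Meter.CreateCounter<long>("messages-processed", description: "Kafka messages handled");

    // A gauge, sampled each time dotnet-counters polls the process.
    internal static readonly ConcurrentQueue<object> PendingItems = new();
    public static readonly ObservableGauge<int> QueueDepth =
        Meter.CreateObservableGauge("queue-depth", () => PendingItems.Count);
}

// In the processing code:
AppMetrics.MessagesProcessed.Add(1);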
When running dotnet-counters against the app, we can see the standard meters provided by the .NET runtime, as well as the ones I implemented. The tool accepts either the process ID of the application to monitor or, as in this case, the name of the application. We can find both pieces of information using the “dotnet-counters ps” command:
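For example, with the illustrative names used above:

dotnet-counters ps
dotnet-counters monitor --name MyBlazorApp --counters System.Runtime,MyBlazorApp.Metrics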
Of course, we can also expose these values in our own application if needed, as I did in an admin page of my Blazor app:
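One way to do this is the MeterListener class from the same namespace. A minimal sketch, assuming the meter defined above, that aggregates the measurements into a dictionary the admin page can read:

using System.Collections.Concurrent;
using System.Diagnostics.Metrics;

public sealed class AdminMetricsCache
{
    // Aggregated value per instrument name, read by the admin page.
    public ConcurrentDictionary<string, long> Values { get; } = new();

    private readonly MeterListener _listener = new();

    public AdminMetricsCache()
    {
        // Subscribe only to our own meter.
        _listener.InstrumentPublished = (instrument, listener) =>
        {
            if (instrument.Meter.Name == "MyBlazorApp.Metrics")
                listener.EnableMeasurementEvents(instrument);
        };
        // Counters report deltas, so sum them up; one callback is needed
        // per measurement type (only long is handled here for brevity).
        _listener.SetMeasurementEventCallback<long>((instrument, value, tags, state) =>
            Values.AddOrUpdate(instrument.Name, value, (_, total) => total + value));
        _listener.Start();
    }
}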
Final word
I hope you found this useful and have learned a few new tricks. Don’t hesitate to contact me if something is not clear.
Have fun adding observability to your code and reap the benefits of knowing what really happens under the hood.
Frédéric MAUROY, Consultant .NET