Reactive Programming: Performance and Trade-Offs
Ramo Karahasan-Riechardt
TL;DR: What this is all about
In the following article, the third part in a series (full series) focused on the use of Reactive Programming at Itembase, we discuss the performance characteristics of a Reactive Programming application. We start by introducing the topic of performance measurement: how to do it, how to avoid skewing results with “false” readings, how to properly benchmark code in a networked environment, and so on. Then we actually run the numbers on a reasonably realistic reactive application, discuss the results obtained, and see how the classical approach differs from the reactive one.
In the last part we briefly cover the trade-offs introduced by adopting the Reactive Model and discuss whether, based on the results obtained, it is actually worth implementing.
The conclusion will summarize the topic discussed and provide additional talking points for the next article.
Introduction
Performance measurement is a technique aimed at assessing the capabilities of a system, evaluating how it performs based on a set of predefined metrics we want to track.
Given that we deal with software systems, we are not interested in all the possible metrics of a system, just its software metrics, defined as “a measure of software characteristics which are quantifiable or countable”.
With this definition of a software metric, it follows that we can only measure what we can count or quantify. This means we can act on data that has such properties, like the number of requests in an interval, memory consumption, or I/O time, but not on data that lacks them, like “code legibility”: no matter how much we try to squeeze that property into a number, it remains mostly a question of personal preference and cannot be quantified easily (or at all in certain cases).
So, with the theory out of the way, what are we going to look at today? What metrics will we measure and how are we going to do that?
The answer is in the next chapter, but the gist of it is:
- We will have a look at the core metrics of a web application (requests per second, request completion time at Nth percentile, average request latency, throughput)
- We are going to write two applications, one in the “classic” blocking style with Spring MVC and the other following the Reactive Model with Spring WebFlux, and compare them across different scenarios
- We will craft different load-testing scenarios and see how the two applications differ across the metrics we track
A Discussion on Performance Testing
Discussions on Performance Testing should start with a warning, and we’ll not make an exception here:
Any performance measurement carries an implicit level of inaccuracy due to a number of wildly different factors that affect the system under test.
There is no such thing as a perfect measurement test, only tests that are “accurate enough for the level of detail we care about”.
The factors that influence a test range from low-level ones, like cold or warm L1/L2 CPU caches, hypervisor resource allocation, bare metal vs virtual CPUs, and NUMA design, to higher-level ones, like the kernel used, DMA, vDSO, io_uring, epoll vs kqueue, and many others.
These need to be taken into account when discussing numbers and the conclusions that those numbers suggest.
While it is clearly possible to go low level enough to precisely control and measure the effect of most of these factors, doing so in this article would be extremely time consuming and, for the reader, likely not very useful, as it would require many tweaks to standard software to reproduce the results, with very little added benefit.
For the purpose of our comparison, reactive vs classic style of writing applications, we will ignore most of these factors: we are interested in the general picture, and adding a couple of nanoseconds to a measurement usually expressed in milliseconds would not make much of a difference.
By mostly ignoring the factors discussed above, we can keep the measurement process simple while still being able to meaningfully discuss the performance trend of our application code, checking which style is faster (and at what cost) and examining the characteristics of the system in terms of the metrics produced.
Now that we know how our measurement process can be influenced while gathering the relevant metrics, there is another point that needs to be briefly discussed before diving into the code: statistics.
When discussing numbers and results, statistics is almost always involved because it provides the right framework to put these numbers in context.
Statistics can help us reduce the error bias in our results or normalize the error we introduced into our measurements; it can give us a confidence interval inside which our results are meaningful, and it can also give an estimate of the theoretical trend of our application’s performance.
As before, detailing all these techniques and explaining how to apply them is way beyond the scope of this article, but there’s one key point that we can take away and use during our tests:
Repeated measures greatly reduce outlier effects and errors and thus provide a more accurate view of the real trends of the system. (source)
This is a concept we can grasp intuitively with a real-world example that involves a database, a system that most of us are familiar with.
Imagine you want to measure the reading speed of a database, and you do this by running a simple “SELECT * FROM example_table;”, and the following happens:
- You run this query once and the result is 40ms
- You run this query in a loop 10 times, averaging the results with a simple mean, and the result is 35ms
Which one is more likely to be true? I would say the second, because when we ran the query the first time we might simply have had unlucky timing: the database could have been receiving writes (which cause locking and introduce delays), the thread scheduler might have put our task in a low-priority queue (a delay that does not belong to the database), or the system could have experienced a sudden latency spike that is irrelevant to our measurement, and so on.
Running the query multiple times and averaging the results mitigates these effects, and the more times we run it (up to a limit), the “better” our results will be, because these effects, assuming they happen randomly, will be spread across N iterations and their influence will be much smaller.
If I add 5ms to a single measurement, the whole 5ms skews the result; if I add 5ms to one result that is part of a series of N iterations, the added amount averages out to (5/N)ms which, for a sufficiently large N, is a small number that skews the result far less than in the first case.
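As a tiny illustration of this idea (the query itself is just a placeholder here, not code from our test setup), averaging N timed runs is only a few lines of Java:

```java
public final class RepeatedMeasure {

    // Runs the action N times and returns the mean elapsed time in milliseconds.
    // A single unlucky outlier of X ms only shifts the mean by X/N ms.
    static double meanMillis(Runnable action, int iterations) {
        long totalNanos = 0;
        for (int i = 0; i < iterations; i++) {
            long start = System.nanoTime();
            action.run();
            totalNanos += System.nanoTime() - start;
        }
        return totalNanos / (double) iterations / 1_000_000.0;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for running "SELECT * FROM example_table;"
        Runnable query = () -> { /* execute the query here */ };
        System.out.printf("mean over 10 runs: %.2f ms%n", meanMillis(query, 10));
    }
}
```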
All of this forewarning is to say that one needs to be very, very careful when throwing numbers around and taking business critical decisions based on those numbers.
It’s important to understand the level of detail required for a given measurement, the level of accuracy and confidence that’s needed for a given metric and consider if what is being measured is really relevant, measured correctly and if there are hidden effects at play that influence the result.
When evaluating the performance of a system it’s important to measure the key metrics that make up that performance, but it’s also important to be aware of the methodology (and its flaws) used to gather those metrics and take well-informed decisions on those results.
Alright, now that we have covered this, let’s dive into the code.
Testing Methodology and Code
The test results are split into two parts to better highlight the difference between blocking (classic) code and non-blocking (reactive) code. Link to repository
The first part is a comparison between two web apps that return a static string response (text/plain media type), one written using Spring 4 WebMVC and the other written using the newer Spring 5 WebFlux.
Both applications expose the same endpoint, called “/endpoint”, which is being used in this first part of the test.
Note that there’s a substantial difference when running the applications:
- The first application, written in the blocking style, uses a default of 200 threads to serve requests
- The second application will use 1 thread for the server and $CPU_CORES worker threads, which means that if you have a quad-core processor, you will have 4 worker threads to serve requests
It goes without saying that the CPU and RAM consumption are very different using 200 threads vs just a handful (e.g. 4), which is a key point in what we are going to discuss now.
The tests have been performed using this Gatling scenario (link):
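The scenario itself is not reproduced inline here; a rough sketch of an equivalent setup, written with Gatling’s Java DSL (the scenario linked above lives in the repository and may differ in detail, e.g. it may use the Scala DSL), could look like this:

```java
import static io.gatling.javaapi.core.CoreDsl.*;
import static io.gatling.javaapi.http.HttpDsl.*;

import io.gatling.javaapi.core.ScenarioBuilder;
import io.gatling.javaapi.core.Simulation;
import io.gatling.javaapi.http.HttpProtocolBuilder;
import java.time.Duration;

public class EndpointSimulation extends Simulation {

    HttpProtocolBuilder httpProtocol = http.baseUrl("https://localhost:19080");

    // Each virtual user keeps calling the endpoint for 20 seconds.
    ScenarioBuilder scn = scenario("static endpoint")
            .during(Duration.ofSeconds(20))
            .on(exec(http("GET /endpoint").get("/endpoint")));

    {
        setUp(
                scn.injectClosed(
                        // 8 increments of 256 concurrent users, each level held for
                        // 20 seconds, with 10-second ramps between increments.
                        incrementConcurrentUsers(256)
                                .times(8)
                                .eachLevelLasting(Duration.ofSeconds(20))
                                .separatedByRampsLasting(Duration.ofSeconds(10))))
                .protocols(httpProtocol)
                .maxDuration(Duration.ofMinutes(2));
    }
}
```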
We can see that we are using “https://localhost:19080” as our base URL and the endpoint we want to test will be GET “/endpoint”.
The test runs for 2 minutes and will keep incrementing the number of users (256 at a time) for 8 times, and each user will keep “using” the service (our test application) for 20 seconds, with 10 seconds of space between user increments.
This kind of load testing is close to the reality of a service being exposed to the real world, with traffic that comes in ramps or spikes and, at least for some time, keeps increasing until it reaches a plateau where the number of users does not change significantly.
This is the code used for the blocking (classic) web application:
(The missing closing brace is because there’s another method below that we will discuss later)
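The original listing is not embedded here, but a minimal sketch of what such a blocking Spring MVC handler looks like (class and method names are illustrative, not necessarily the ones from the repository) would be:

```java
import org.springframework.http.MediaType;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class BlockingController {

    // Returns a hardcoded plain-text response: no I/O, no serialization,
    // nothing but the bare request -> response flow.
    @GetMapping(value = "/endpoint", produces = MediaType.TEXT_PLAIN_VALUE)
    public String endpoint() {
        return "Hello from the blocking endpoint";
    }

    // The "/slow" handler discussed later in the article lives below this point.
}
```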
It’s a very simple and straightforward method that is aimed at showing the full capabilities in terms of speed for this blocking model.
No I/O, no operations performed, no serialization involved, nothing that can interfere with the basic request->response flow.
The equivalent non-blocking (reactive) code is:
(The missing closing brace is because there’s another method below that we will discuss later)
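Again as an illustrative sketch (the repository version may use WebFlux’s functional RouterFunction style rather than annotations):

```java
import org.springframework.http.MediaType;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;
import reactor.core.publisher.Mono;

@RestController
public class ReactiveController {

    // Same hardcoded plain-text response, but wrapped in a Mono so the
    // framework can serve it without ever blocking a worker thread.
    @GetMapping(value = "/endpoint", produces = MediaType.TEXT_PLAIN_VALUE)
    public Mono<String> endpoint() {
        return Mono.just("Hello from the reactive endpoint");
    }

    // The reactive "/slow" handler discussed later lives below this point.
}
```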
The endpoint is the same (“/endpoint”) for the reactive handler too.
So, running the above test we see these results (on my machine, Intel Core i7-7700HQ with 16GB of RAM):
For the blocking web application:
For the Reactive web application:
Let’s have a look at the metrics we have obtained with this test and discuss what they are.
The first thing we can see is that the total number of requests processed is greater for the reactive application, with a total of 2 million, compared to the 1.6 million for the blocking web application.
This is a direct consequence of the second metric, requests per second: the more requests the application processes in the same timeframe (our test ran for 2 minutes), the bigger the total.
We can see that the requests per second in this scenario are quite close, around 13 thousand for the blocking web app compared to 16 thousand for the reactive one, but we should keep in mind that the first uses 200 threads while the second uses just 4.
The Mean value indicates the average time for a request to complete, while the Std Dev value indicates how well the Mean represents the overall data sample: the smaller the Std Dev, the better the Mean describes the whole sample.
Again the two Mean values are quite close, 50ms vs 56ms, with the lower (and thus “better” at representing the whole sample) value belonging to the reactive web app.
The real difference comes when looking at the 99th percentile data, the maximum amount of time that 99% of the requests took to complete.
This is the equivalent of saying: 99% of the total requests completed in this time or less.
For the blocking web app this value is 202ms, while for the reactive web app it is less than half of that, at 94ms.
This is important because it indicates that even with 200 threads (and possibly precisely because of them) requests cannot be served as fast as in the non-blocking model.
Keep in mind that we are not doing anything in this code. Literally all we are doing is returning the same hardcoded response for every request, which is the most basic thing you can do with a web app.
These preliminary results start to give us an idea of the capabilities and real-world power of the Reactive Model, where the increasing need for performance and diminished latency is always present.
Using fewer resources to perform the same amount of work (if not more) in the same timeframe means more users served at the same cost, and in the era of cloud computing, time is literally money.
If your applications don’t run in the cloud, time is still money, because you can serve more users with the same hardware instead of upgrading it, which translates into money saved.
The code used to simulate a slow, blocking operation (an artificial 100ms delay on “/slow”) is the following:
For the blocking web app:
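The listing is again not embedded here; a hedged sketch of such a handler (shown as a separate class for readability, while in the article’s repository it is presumably the second method of the same controller) could be:

```java
import org.springframework.http.MediaType;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class BlockingSlowController {

    // Simulates a slow blocking operation (a remote call, a slow query, ...)
    // by putting the serving thread to sleep for 100 ms; the thread is
    // unavailable for the whole duration of the simulated work.
    @GetMapping(value = "/slow", produces = MediaType.TEXT_PLAIN_VALUE)
    public String slow() throws InterruptedException {
        Thread.sleep(100);
        return "Hello after a blocking 100 ms wait";
    }
}
```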
For the equivalent reactive (non-blocking) code:
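And a corresponding reactive sketch, where the 100ms delay is scheduled on Reactor’s timer instead of blocking a thread (again illustrative rather than the repository’s exact code):

```java
import java.time.Duration;
import org.springframework.http.MediaType;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;
import reactor.core.publisher.Mono;

@RestController
public class ReactiveSlowController {

    // Simulates the same 100 ms operation without blocking: the delay runs on
    // a timer and the worker thread is free to serve other requests meanwhile.
    @GetMapping(value = "/slow", produces = MediaType.TEXT_PLAIN_VALUE)
    public Mono<String> slow() {
        return Mono.just("Hello after a non-blocking 100 ms wait")
                   .delayElement(Duration.ofMillis(100));
    }
}
```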
Let’s run our tests with this code and see what the results are.
The Gatling test is exactly the same; only the endpoint is different, so it now tests “/slow” instead of “/endpoint”.
Tests results follow below.
Blocking web app with 100ms delay:
Reactive web app with 100ms delay:
Let’s gather the data and discuss it in a moment:
We can see that the difference in this case is staggering: the 99th percentile value is 10x smaller for the reactive web app, the total number of requests served is 3x greater, and the Mean is 106ms vs 341ms.
In this more realistic case, the Reactive Model shines and shows that the classic way of doing things (blocking threads) starts to show its problems when we require performance and horizontal scalability.
One of the reasons the Reactive Model is able to serve requests that fast is that it offloads the task of “checking if an operation has completed” to the Operating System, which has an optimized way of doing that (epoll on Linux or kqueue on BSD/macOS, for example), instead of dedicating a blocked thread, with all the switching overhead that implies, to every in-flight operation.
Another reason for such a big difference is the overhead of managing threads.
It goes without saying that 200 threads have to “coordinate” themselves to run on a CPU that can accommodate 4 of them in parallel (e.g. on a quadcore system).
There is a cost involved in the context switch for threads that the system is going to pay every time one thread is ready to run and moves from a paused state into a running state.
According to queueing theory (link) there is an upper limit above which adding more resources to a system has the opposite of the desired effect, making the system slower.
This is because threads, like any other resource, have a cost associated with their handling: they occupy memory in the kernel’s thread structures, have associated state (such as TLB entries) that needs to be updated on a switch, and moving a thread from paused to running takes time, an operation that is very frequent in the blocking model when 200 threads compete for the same CPU.
Resource contention is a real problem in the blocking model, and it is one of the reasons why, even with more threads, the throughput is significantly lower than in the reactive model.
The non-blocking I/O used in the Reactive Model allows the worker threads to keep running (so there is no paused->running or running->paused context switch) and simply asks the Operating System for a notification when a given task has completed.
This way of operating allows the CPU to keep a low number of worker threads always busy but with different units of work, in this case with different requests.
As soon as a blocking operation needs to be performed, the blocking model pauses the thread, while the reactive model registers a callback to be notified when the operation completes and the worker thread proceeds with another unit of work that is ready to be processed.
The key point here is that the Reactive Model, by its very nature, is much more performant and better suited to high-throughput applications that want to keep their resource consumption (CPU, RAM) low.
A Look at the Trade-Offs
As we mentioned in our (first article) there is no Silver Bullet and each solution comes with a set of trade-offs that need to be considered before implementing it.
We wrote at length about this topic in our (second article) and we will not repeat what’s been already said, but instead we will approach this topic from a slightly different angle.
In terms of requirements for its use, the Reactive Model needs a shift in how we think about systems and the composability of their modules.
What this means is that it is not sufficient to start writing Mono and Flux everywhere and to wrap blocking calls in Mono.fromCallable, because if we do that, we will just have blocking code in a different style.
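To make this concrete, here is a hedged sketch (fetchFromRemote() is a hypothetical blocking call, not something from our codebase): the first variant merely hides blocking code behind a Mono, while the second is non-blocking end to end.

```java
import org.springframework.web.reactive.function.client.WebClient;
import reactor.core.publisher.Mono;
import reactor.core.scheduler.Schedulers;

public class StyleComparison {

    // "Reactive on the surface": the blocking call still occupies a thread,
    // we have only moved it to a dedicated scheduler and changed the return type.
    Mono<String> wrappedBlockingCall() {
        return Mono.fromCallable(this::fetchFromRemote)
                   .subscribeOn(Schedulers.boundedElastic());
    }

    // Reactive end to end: the HTTP call itself is non-blocking, so no thread
    // sits idle waiting for the response.
    Mono<String> nonBlockingCall() {
        return WebClient.create("https://example.org")
                        .get()
                        .uri("/data")
                        .retrieve()
                        .bodyToMono(String.class);
    }

    // Hypothetical blocking helper, standing in for JDBC, RestTemplate, etc.
    private String fetchFromRemote() {
        return "some payload";
    }
}
```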
What’s needed to write performant and appropriate Reactive Code is a functional mindset, a way of thinking about systems as a composition of functions that are, as much as possible, pure and that each one fulfils a specific business logic need.
In this way, it comes naturally to write our modules based on the responsibilities they have and to define the reactive pipelines with the appropriate source and sink streams.
Putting it like this makes it look very simple and immediate, but the reality is that this shift requires some time spent studying the solution domain (what are pure functions, lambdas, reactive streams and so on) and also time spent making mistakes, understanding where there’s a problem in the pipeline, how it can be fixed and why it occurred in the first place.
These skills don’t come easily and they require patience to learn a new paradigm that, most likely, has never been seen by the developer who is approaching the Reactive Model.
Another trade-off is debuggability.
While the framework already includes many helpers for this, the reality is that the shift from synchronous code to asynchronous/non-blocking code brings a lot of change: the code flow is no longer linear, which means that stack traces are not accurate enough (with some exceptions) and we need to rely on other mechanisms to understand where our code failed.
For example, given that there’s no guarantee that the same task will be completed by the same thread once there’s a blocking operation involved, it’s important to embrace the concept of immutability (data structures that, once created, cannot be modified) as a way of preventing unwanted modification to data and keeping the state space low, allowing the developer to establish invariants about the code and the data that the code should operate on.
There’s also the very important psychological effect of feeling like a beginner again, and this might not be comfortable for everyone, especially if they reached a position of seniority because of their extensive experience with a “classic” technology.
The Reactive Model forces developers to re-evaluate their experience and, more often than not, requires that the design of a system be adapted to fit it rather than the other way around.
It is not possible to rewrite a blocking/classic system into a reactive one without changing at least some parts of the architecture, and this is one of the reasons why, if possible, it is simply easier to start a new greenfield project following the reactive model than to adapt an older system.
It can of course be done, but converting it is going to cause some major pains that could be avoided if the rewrite is not absolutely necessary.
There’s also the topic of Error Handling, where the functional approach baked into the Reactive Model treats errors as a form of result, so the usual exception style of handling errors is no longer appropriate.
Instead, what is required is designing our reactive pipelines in such a way that we can handle errors as a “normal” occurrence and signal this fact to whoever subscribes to us, as opposed to throwing exceptions and catching them somewhere else.
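As a small illustration (loadUserName() is a hypothetical reactive lookup, not code from our services), an error can be turned into a fallback value that flows to the subscriber like any other result:

```java
import reactor.core.publisher.Mono;

public class ErrorHandlingExample {

    // The error travels down the pipeline as a signal; instead of throwing an
    // exception and catching it somewhere else, we resume with a fallback value
    // that the subscriber receives as a "normal" result.
    Mono<String> userNameOrFallback(String id) {
        return loadUserName(id)
                .onErrorResume(error -> Mono.just("anonymous"));
    }

    // Hypothetical reactive lookup, standing in for a repository or WebClient call.
    private Mono<String> loadUserName(String id) {
        return Mono.error(new IllegalStateException("lookup for " + id + " failed"));
    }
}
```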
The list of trade-offs for the Reactive Model is two-sided: it contains both technical points, such as the non-blocking capabilities of the OS and new system design skills, and psychological ones, as developers need to be ready to face some amount of frustration due to the completely new way of doing things and the relative youth of this technology, which means fewer online resources are available compared to the traditional blocking model.
This list should by no means scare the reader away from the Reactive Model; it should instead serve as a reminder that some frustration is expected and is a sign that a change is happening, and that keeping up the effort and the practice will soon lead to efficient, production-ready reactive code that enables substantial performance gains for the application.
Conclusions
Here at Itembase, adopting the Reactive Model allowed us to scale much faster in terms of commerce systems supported.
We have an established way of developing reactive microservices and if you’re interested in knowing what it is, check back next week because the next couple of articles will just talk about that!
Your organization can start adopting the Reactive Model if it is comfortable with a learning phase during which there will inevitably be some delays as developers get used to the shift in thinking we discussed above.
Starting from a simple greenfield project (in terms of business requirements) and adding the relevant testing is an excellent way to get comfortable with this technology, and it will prepare developers to tackle bigger problems when the time comes.
The tests are clear, they indicate a win in terms of performance, and, after the previous article, hopefully the trade-offs are clear too.
The ability to take well-informed decisions makes the difference between a failed and a successful project, so before diving into the Reactive Model because you or your organization are attracted by the seemingly easy performance gains, make sure you fully and thoroughly understand the trade-offs involved and the level of effort required to start using it proficiently.
Stay tuned for more updates on this topic and for new articles we publish weekly!
This article first appeared on the Itembase blog