Snail's tale of virtual function calls
"If you don't measure it, you can't improve it."
Most people using C++ for latency-critical applications know that virtual (indirect) function calls tend to be slower than direct function calls. In high-frequency trading, where nanoseconds can separate a winning trade from a missed opportunity, the cost of a slower function call may lead to a significant drag in performance and PnL. However, knowing something in theory is rarely enough to justify complex optimizations. This realization pushed me to finally measure the impact of virtual function calls on performance.
Before diving into the profiling results, let’s look at the three primary kinds of function calls. For each example below, I consider a simple program where two random integers are generated and the add(int a, int b) function is called to calculate their sum. We are primarily interested in how the call to the add function is resolved.
1. Direct function call
In a direct function call, the function address is known at compile time, and calling the function resolves to a ‘call’ assembly instruction with a fixed address as its argument. Most standard function calls are direct function calls, for example:
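The original snippet isn’t reproduced here, so below is a minimal sketch of the kind of code being discussed, with the approximate x86-64 assembly shown as comments (register usage is illustrative and depends on the compiler and optimization level):

```cpp
#include <cstdlib>

// A plain free function. Without inlining (e.g. at -O0, or if the function
// were too large to inline), the call site resolves to a direct `call`.
int add(int a, int b) {
    return a + b;
}

int main() {
    int a = std::rand();
    int b = std::rand();
    // Roughly what the compiler emits for the call on x86-64:
    //   mov edi, <a>        ; first argument
    //   mov esi, <b>        ; second argument
    //   call add(int, int)  ; jump to add's fixed address in the text segment
    int sum = add(a, b);
    return sum;
}
```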
In the x86 assembly generated above, the two arguments for the function call are moved to registers edi and esi, and the add function is called. The function body is stored in the text segment and the call instruction moves the instruction pointer to the function's address in the text segment (see memory layout of programs for more info on segments). An interesting thing to note here is that large code binaries may lead to an instruction cache miss, although most modern CPUs are good at predicting and preloading the instructions.
What’s the cost of this function call? The call introduces two ‘mov’ instructions and a ‘call’ instruction; the actual add is done inside the function. That is already 3 extra instructions for a simple add. Note that if there were more arguments than available argument registers, some of them would be passed on the stack, so we would also see ‘push’ instructions, and the stack would have to be cleaned up afterwards, introducing ‘pop’ (or stack-adjustment) instructions. Primitive data types (int, char, etc.) are usually returned in a register, while larger structs and classes are typically returned through memory that the caller sets aside on the stack. To summarise, the following occur on a function call:
- arguments are moved into registers (or spilled onto the stack if there are many),
- the ‘call’ instruction saves the return address and jumps to the function’s address,
- the callee may set up and later tear down a stack frame,
- the return value is placed in a register (or written to caller-provided memory) and control returns to the caller.
Performing all of this for a simple function call is unnecessary, and compilers will usually optimize such calls by making them ‘inline’. Let’s look at inlined calls next.
Like the article so far? Continue reading on my Substack for a much nicer experience!
2. Inlined function call
For an inlined function, the compiler copies the code from the function definition directly into the calling function rather than creating a separate set of instructions in memory. In the assembly sketched below, the ‘call’ to the add function is replaced by the ‘add’ instruction in the main program. Calling this an “inlined function call” is a bit of a cheat, as there is no function call anymore.
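Again as a sketch rather than the original snippet, here is the same program with the function inlined; the commented assembly is approximate:

```cpp
#include <cstdlib>

// Small enough that the compiler will typically inline it at -O2/-O3.
inline int add(int a, int b) {
    return a + b;
}

int main() {
    int a = std::rand();
    int b = std::rand();
    // After inlining there is no `call` instruction; the body collapses
    // into the caller, roughly:
    //   add eax, esi        ; sum computed directly in the caller
    int sum = add(a, b);
    return sum;
}
```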
Inlined functions are quite performant, as variables no longer need to be pushed to and popped from the stack and the CPU does not need to jump to a different instruction address as it would for a call instruction. There’s no free lunch though, so inlining does come with a cost. Imagine a function that compiles to 100 lines of assembly and is called 100 times. When inlined, the function body is copied at each call site, which can lead to 100 × 100 = 10,000 lines of assembly. If the function were not inlined, the only additional lines over the function body would be the ones moving, pushing, and popping arguments. Hence, for larger functions called many times, inlining will most likely lead to larger binaries.
3. Indirect function call (virtual function)
An indirect function call is used when the function being called cannot be resolved at compile time. Consider the following code, where ‘AdderBase’ is a base class defining a virtual function add(int, int) and getAdder returns one of two implementations of this base class at random. In this case, the actual ‘add’ function called can only be determined at runtime. This scenario is quite common in C++ when dealing with interfaces.
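The code from the original post isn’t reproduced here; the sketch below follows the same structure (the names of the two concrete adders are my own, and the commented assembly is approximate):

```cpp
#include <cstdlib>
#include <memory>

struct AdderBase {
    virtual ~AdderBase() = default;
    virtual int add(int a, int b) = 0;
};

struct SimpleAdder : AdderBase {
    int add(int a, int b) override { return a + b; }
};

struct OffsetAdder : AdderBase {
    int add(int a, int b) override { return a + b; }  // imagine a different implementation here
};

// Picks one of the two implementations at random, so the concrete type
// cannot be known at compile time.
std::unique_ptr<AdderBase> getAdder() {
    if (std::rand() % 2 == 0)
        return std::make_unique<SimpleAdder>();
    return std::make_unique<OffsetAdder>();
}

int main() {
    auto adder = getAdder();
    int a = std::rand();
    int b = std::rand();
    // The call below is resolved at runtime through the vtable, roughly:
    //   mov rax, QWORD PTR [rdi]   ; load the vtable pointer from the object
    //   call QWORD PTR [rax]       ; call through the first vtable slot (add)
    return adder->add(a, b);
}
```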
The indirect call uses a call instruction with a register as its argument (here rax). Specifically in C++, the function to be called is resolved through a virtual table (vtable) and a virtual pointer (vptr). Looking closely at the generated assembly, the register rax gets its value just a line above the call instruction. The QWORD PTR [rdi] operand loads an 8-byte value from the address held in the rdi register; rdi contains the address of the adder object (the implicit this pointer), the first 8 bytes of the object hold the pointer to its vtable, and the first vtable entry holds the address of the add function.
So, what’s the cost of an indirect function call in C++? On the surface, the primary overhead seems to be the need to resolve the actual function’s address through the vtable, which is an additional step compared to a direct call. However, a deeper issue lies in the limitations this imposes on compiler and CPU optimizations. In a direct function call, the CPU knows exactly which function will be executed, allowing it to optimize performance by prefetching the function’s instructions into the cache, reordering instructions, or even running the function ahead of time. In contrast, an indirect call makes it more difficult for the CPU to perform these optimizations. This is why many compilers attempt to devirtualize calls (e.g., devirtualization in GCC). On top of that, modern CPUs typically include an indirect branch predictor (IBP), a hardware feature designed to predict the target addresses of indirect branches and improve performance.
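As an illustrative aside (not from the original post), one common way to make devirtualization possible is to give the compiler enough type information, for example by marking a class final:

```cpp
struct AdderBase {
    virtual ~AdderBase() = default;
    virtual int add(int a, int b) = 0;
};

// Marking the implementation `final` tells the compiler that no further
// overrides can exist, so calls made through a SimpleAdder reference or
// pointer can be devirtualized into direct calls (and then even inlined).
struct SimpleAdder final : AdderBase {
    int add(int a, int b) override { return a + b; }
};

int addThroughDerived(SimpleAdder& adder, int a, int b) {
    // The static type is SimpleAdder (final): GCC/Clang can typically emit
    // a direct call here instead of a vtable lookup.
    return adder.add(a, b);
}
```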
Measuring function call performance
To profile the performance of different types of function calls, I created a Vector class with three coordinates: x, y, and z. This class includes getter and setter methods for the coordinates, which can be implemented as inline, non-inline, or virtual functions. For non-inline functions, I used the [[gnu::noinline]] attribute to give the compiler a hint, though it’s important to note that this is just a suggestion and not a guarantee. To ensure accuracy, I verified that the function was indeed not inlined for this test by inspecting the assembly output.
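The full code is on GitHub; the sketch below shows roughly what the non-inline variant looks like (the member names and the use of double coordinates are my assumptions):

```cpp
// Sketch of the non-inline variant: [[gnu::noinline]] asks GCC not to inline
// the accessors. The inline and virtual variants differ only in how these
// methods are declared.
class Vector {
public:
    [[gnu::noinline]] double getX() const { return x_; }
    [[gnu::noinline]] double getY() const { return y_; }
    [[gnu::noinline]] double getZ() const { return z_; }
    [[gnu::noinline]] void setX(double x) { x_ = x; }
    [[gnu::noinline]] void setY(double y) { y_ = y; }
    [[gnu::noinline]] void setZ(double z) { z_ = z; }

private:
    double x_ = 0.0, y_ = 0.0, z_ = 0.0;
};
```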
Next, I created 1,000 Vector objects and assigned random values to their coordinates. I then profiled the time it took to compute the sum of all the vectors, which involves summing the x, y, and z coordinates across all vectors and storing the result in a new vector. For more details and the complete code, you can check out the code on GitHub.
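A rough sketch of the benchmark described above (an approximation of the code on GitHub, reusing the Vector class sketched in the previous section; the use of std::chrono for timing is my assumption):

```cpp
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <vector>

// (Uses the Vector class sketched above.)
Vector sumVectors(const std::vector<Vector>& vecs) {
    double x = 0, y = 0, z = 0;
    for (const auto& v : vecs) {   // 3 getter calls per vector -> 3,000 calls for 1,000 vectors
        x += v.getX();
        y += v.getY();
        z += v.getZ();
    }
    Vector result;
    result.setX(x);
    result.setY(y);
    result.setZ(z);
    return result;
}

int main() {
    std::vector<Vector> vecs(1000);
    for (auto& v : vecs) {         // assign random coordinates
        v.setX(std::rand() % 100);
        v.setY(std::rand() % 100);
        v.setZ(std::rand() % 100);
    }

    auto start = std::chrono::steady_clock::now();
    Vector total = sumVectors(vecs);
    auto end = std::chrono::steady_clock::now();

    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
    // Print the result so the computation is not optimized away.
    std::printf("sum = (%f, %f, %f), time = %lld ns\n",
                total.getX(), total.getY(), total.getZ(), static_cast<long long>(ns));
    return 0;
}
```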
Below are the results from running the program 100 times for each type of function call, with the average time taken measured for each. I compiled the code using GCC 14 with optimization level -O3 on an Apple M4 (ARM) chip. To prevent virtual function calls from being inlined, I used the -fno-devirtualize GCC flag. The results indicate that non-inline function calls are roughly 1.5 times slower than inline calls. Meanwhile, a run with virtual function calls is about 400ns slower than direct calls. Since each run includes 1,000 vectors with 3 coordinates each (resulting in 3,000 function calls), this suggests that the overhead of a virtual function call is approximately 0.13ns per call. For most systems, this overhead is minimal and unlikely to cause significant performance issues.
As I mentioned earlier, most modern CPUs use Indirect Branch Prediction (IBP) to optimize virtual function calls, making them faster. In my test above, the same virtual function was called repeatedly through a pointer to the base class. This is not the best representation of virtual calls. In real-world scenarios, different virtual functions are often called via a base class pointer, leading to more variation in the target function addresses.
To simulate this behavior, I extended the virtual call test to call two different objects’ virtual methods in a pattern (object A, then B, then A, and so on), and at random (object A or B at random). The results from this new test suggest that virtual calls, in this more realistic scenario, are roughly twice as slow as direct non-inlined function calls! This highlights the impact of IBP and the additional complexity when predicting targets for virtual function calls in dynamic, unpredictable patterns.
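To make the access pattern concrete, here is a simplified sketch of the alternating and random call patterns (not the actual benchmark code; the class names and the value() method are my own):

```cpp
#include <cstdlib>
#include <vector>

struct Base {
    virtual ~Base() = default;
    virtual int value() const = 0;
};
struct A : Base { int value() const override { return 1; } };
struct B : Base { int value() const override { return 2; } };

// Sum through an arbitrary sequence of base-class pointers. The indirect
// branch target at the call site follows whatever pattern `objects` encodes.
long long sumThrough(const std::vector<const Base*>& objects) {
    long long sum = 0;
    for (const Base* obj : objects) {
        sum += obj->value();  // indirect call; the target varies per element
    }
    return sum;
}

int main() {
    A a;
    B b;
    std::vector<const Base*> alternating, random;
    for (int i = 0; i < 3000; ++i) {
        alternating.push_back(i % 2 == 0 ? static_cast<const Base*>(&a) : &b);
        random.push_back(std::rand() % 2 == 0 ? static_cast<const Base*>(&a) : &b);
    }
    // Both sequences exercise the indirect branch predictor more than
    // repeatedly calling the same target; the random one is the hardest
    // for it to predict.
    return static_cast<int>(sumThrough(alternating) + sumThrough(random));
}
```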
Many standard constructs, such as std::function and std::shared_ptr, might use virtual function calls internally. While the latency difference may not be significant for most systems, it becomes an important consideration when writing latency-critical code, particularly in performance-sensitive hot paths. It's essential to be mindful of these overheads in such cases. In a future article, I’ll dive deeper into profiling branch prediction and instruction cache misses to explore how these factors can further impact performance.
Thanks for reading! Check out my Substack if you found this interesting, and let me know your thoughts in the comments!