In Compiler We Trust
Introduction
In Chapter 11 of his C++ optimization guide [1], Agner Fog advises against using the following function:
float AddFloatV1(float a, float b, float c, float d)
{
return a + b + c + d;
}
and recommends this one instead:
float AddFloatV2(float a, float b, float c, float d)
{
return (a + b) + (c + d);
}
to enable the out-of-order (OoO) execution feature [2] of CPUs, which can execute (a + b) independently of (c + d). In AddFloatV1, by contrast, the dependency chain in the ((a + b) + c) + d computation paralyzes OoO execution. The guide mentions that compilers may not perform this optimization on AddFloatV1 themselves, because reassociating floating-point additions can change the result (precision loss).
Now let's change the argument and return types of these two functions from float to int, which yields the following functions:
int AddIntV1(int a, int b, int c, int d)
{
return a + b + c + d;
}
and
int AddIntV2(int a, int b, int c, int d)
{
return (a + b) + (c + d);
}
which differ from the initial floating-point case in three main ways.
Comparison
Knowing these differences, let's see how different compilers generate machine code for the AddIntV1 and AddIntV2 functions.
C# (.NET 6.0.101)
In the .NET ecosystem, the C# compiler first converts the code to Microsoft Intermediate Language (MSIL) [4], which looks as follows for the functions of interest (regardless of whether optimization is enabled, the MSIL is the same in this case):
.method public hidebysig instance int32
AddIntV1(
int32 a,
int32 b,
int32 c,
int32 d
) cil managed
{
.maxstack 8
// [30 13 - 30 34]
IL_0000: ldarg.1 // a
IL_0001: ldarg.2 // b
IL_0002: add
IL_0003: ldarg.3 // c
IL_0004: add
IL_0005: ldarg.s d
IL_0007: add
IL_0008: ret
} // end of method AddIntV1
and
.method public hidebysig instance int32
AddIntV2(
int32 a,
int32 b,
int32 c,
int32 d
) cil managed
{
.maxstack 8
// [35 13 - 35 38]
IL_0000: ldarg.1 // a
IL_0001: ldarg.2 // b
IL_0002: add
IL_0003: ldarg.3 // c
IL_0004: ldarg.s d
IL_0006: add
IL_0007: add
IL_0008: ret
} // end of method AddIntV2
which are not the same for the two functions. At runtime, however, the Just-In-Time (JIT) compiler generates the same assembly code for both functions, following the Linux Application Binary Interface (ABI), as shown below:
AddInt(int, int, int, int):
add edi, esi
lea eax, [rdi+rdx]
add eax, ecx
ret
with compiler optimization enabled. It is also surprising that the compiler swaps one add instruction of the (a + b) + c summation for a lea instruction, to take advantage of potentially vacant execution units besides the ALUs for lea (which could be the Address Generation Unit, or AGU, in some older CPUs [5]).
So, in C# AddIntV1 and AddIntV2 are the same.
C++ (Clang x86-64 10.0.0 compiler)
Now it is C++'s turn. With the first optimization level (the -O1 switch), the Clang compiler also generates the same assembly code (Linux ABI) for both functions:
AddInt(int, int, int, int): # @AddIntV1/V2(int, int, int, int)
lea eax, [rdi + rsi]
add eax, edx
add eax, ecx
ret
So, in C++ using Clang with -O1 optimization, AddIntV1 and AddIntV2 are the same.
C++ (GCC x86-64 9.3.0 compiler)
Finally, we stay with C++ but switch to another compiler. With the first optimization level (the -O1 switch), the GCC compiler generates the following assembly code (Linux ABI) for each function:
AddIntV1(int, int, int, int):
add edi, esi
add edi, edx
lea eax, [rdi+rcx]
ret
and
AddIntV2(int, int, int, int):
add edi, esi
lea eax, [rdx+rcx]
add eax, edi
ret
Here, AddIntV2 has the potential to run faster, because there is no dependency chain between the first add instruction and the following lea instruction.
The AddIntV2 function is able to run faster, but does the CPU in reality allow that? To answer this question, we need to dive into a microarchitecture. Let's pick Intel's Ivy Bridge as a sample: it is not new, but also not so old as to predate the superscalar era [6]. Following is the simplified execution engine of an Ivy Bridge core [7]:
Now we need to look at the microarchitecture's instruction tables [8] to see how the scheduler dispatches the add and lea instructions to the execution units. Neither instruction involves micro-operation fusion (uOP fusion) [9], so after fetching and decoding, add can go to Port 0, Port 1, or Port 5, and lea can go to Port 0 or Port 1 for execution. This means there is a high probability that the CPU executes these two instructions in parallel.
So, in C++ using GCC with -O1 optimization, AddIntV2 is potentially faster.
Conclusion
By changing the argument type and the return value type of a function, the whole idea behind the function optimization can be ruined. Also the optimization trick for a function in one programming may not be valid for the same function in other languages. We can go further and conclude that even a performance enhancement for a function does not necessarily have same behavior for different compilers in a same language. The story does not end there because the optimized instructions can run differently in various CPUs which can be eliminated sometimes using the -march switch in some compilers if we are certain about our target hardware. So, it is well worth benchmarking our optimization tricks before judging their performance boost.