In Compiler We Trust
Introduction
In Chapter 11 of his C++ optimization guide [1], Agner Fog advises against using the following function:
float AddFloatV1(float a, float b, float c, float d)
{
return a + b + c + d;
}
and recommends this one instead:
float AddFloatV2(float a, float b, float c, float d)
{
return (a + b) + (c + d);
}
to enable the out-of-order (OoO) execution feature [2] of CPUs, which can execute (a + b) independently of (c + d). In AddFloatV1, by contrast, the dependency chain in the ((a + b) + c) + d computation paralyzes OoO execution. The guide mentions that compilers may not perform this optimization on AddFloatV1 themselves, because reassociating floating-point additions can change the result (precision loss).
Now let's change the argument and return types of these two functions from float to int, which yields the following functions:
int AddIntV1(int a, int b, int c, int d)
{
return a + b + c + d;
}
and
int AddIntV2(int a, int b, int c, int d)
{
return (a + b) + (c + d);
}
which differ from the initial floating-point case in three main ways.
Comparison
Knowing these differences, let's see how different compilers generate machine code for the AddIntV1 and AddIntV2 functions.
C# (.NET 6.0.101)
In the .NET ecosystem, the C# compiler first converts the code to Microsoft Intermediate Language (MSIL) [4], which looks as follows for the functions of interest (regardless of whether optimization is enabled, the MSIL is the same in this case):
.method public hidebysig instance int32
AddIntV1(
int32 a,
int32 b,
int32 c,
int32 d
) cil managed
{
.maxstack 8
// [30 13 - 30 34]
IL_0000: ldarg.1 // a
IL_0001: ldarg.2 // b
IL_0002: add
IL_0003: ldarg.3 // c
IL_0004: add
IL_0005: ldarg.s d
IL_0007: add
IL_0008: ret
} // end of method AddIntV1
and
.method public hidebysig instance int32
AddIntV2(
int32 a,
int32 b,
int32 c,
int32 d
) cil managed
{
.maxstack 8
// [35 13 - 35 38]
IL_0000: ldarg.1 // a
IL_0001: ldarg.2 // b
IL_0002: add
IL_0003: ldarg.3 // c
IL_0004: ldarg.s d
IL_0006: add
IL_0007: add
IL_0008: ret
} // end of method AddIntV2
which are not the same for the two functions. At runtime, however, the Just-In-Time (JIT) compiler generates the same assembly code for both functions, following the Linux Application Binary Interface (ABI), as shown below:
AddInt(int, int, int, int):
add edi, esi
lea eax, [rdi+rdx]
add eax, ecx
ret
with compiler optimization enabled. It is also surprising that the compiler swaps one add instruction of the (a + b) + c summation for a lea instruction, to take advantage of potentially vacant execution units besides the ALUs for lea (which could be the Address Generation Unit, or AGU, in some older CPUs [5]).
So, in C# AddIntV1 and AddIntV2 are the same.
C++ (Clang x86-64 10.0.0 compiler)
Now it is C++'s turn. With the first optimization level (the -O1 switch), the Clang compiler also generates the same assembly code (Linux ABI) for both functions:
AddInt(int, int, int, int): # @AddIntV1/V2(int, int, int, int)
lea eax, [rdi + rsi]
add eax, edx
add eax, ecx
ret
So, in C++ using Clang with -O1 optimization, AddIntV1 and AddIntV2 are the same.
C++ (GCC x86-64 9.3.0 compiler)
Finally, we stay with C++ but switch to another compiler. With the first optimization level (the -O1 switch), the GCC compiler generates the following assembly code (Linux ABI) for each function:
AddIntV1(int, int, int, int):
add edi, esi
add edi, edx
lea eax, [rdi+rcx]
ret
and
AddIntV2(int, int, int, int):
add edi, esi
lea eax, [rdx+rcx]
add eax, edi
ret
Here, AddIntV2 has the potential to run faster, because there is no dependency chain between the first add instruction and the following lea instruction.
The AddIntV2 function is able to run faster, but does the CPU in reality allow that? To answer this question, we need to dive into a microarchitecture. Let's pick Intel's Ivy Bridge as a sample: it is not new, but also not so old as to predate the superscalar era [6]. Following is the simplified execution engine of an Ivy Bridge core [7]:
Now we need to look at the microarchitecture's instruction tables [8] to see how the scheduler dispatches the add and lea instructions to the execution units. Neither instruction involves micro-operation fusion (uOP fusion) [9], so after fetching and decoding, add can go to Port 0, Port 1, or Port 5, and lea can go to Port 0 or Port 1 for execution. This means there is a high probability that the CPU executes these two instructions in parallel.
So, in C++ using GCC with -O1 optimization, AddIntV2 is potentially faster.
Conclusion
By changing the argument type and the return value type of a function, the whole idea behind the function optimization can be ruined. Also the optimization trick for a function in one programming may not be valid for the same function in other languages. We can go further and conclude that even a performance enhancement for a function does not necessarily have same behavior for different compilers in a same language. The story does not end there because the optimized instructions can run differently in various CPUs which can be eliminated sometimes using the -march switch in some compilers if we are certain about our target hardware. So, it is well worth benchmarking our optimization tricks before judging their performance boost.