Exploring Bit Scan Forward
Henrique Bucher
Financial Engineer | Substack Bestseller Author | C++/FPGA/Quant/Trading/HFT/ULL | PhD/MSc/BSc Engineering
Do you know what software engineers do when they are bored? They mess around with anything that might conceivably be cool. Wednesday was one of those days: I got bored and set out to explore. And what better to explore than Godbolt's Compiler Explorer, which recently completed 10 years of service?
So I was looking at the ffs ("find first set") operation which, roughly speaking, returns the index of the first non-zero bit counting from right to left (the trailing bit). It is commonly used in the implementation of fast bitsets.
DISCLAIMER: this article will likely not improve your market value. But if you're up for a typical nerd lunch-time diversion, stick around.
Built-ins
GCC implements two built-ins that perform roughly the same operation: __builtin_ffs accepts a signed integer and returns one plus the index of the least significant set bit (or zero when the argument is zero), while __builtin_ctz accepts an unsigned integer and returns the 0-based index of the same bit (and is undefined for a zero argument).
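To make the semantics concrete, here is a tiny usage sketch (the literal values are my own examples, not from the original article):

#include <cstdio>

int main() {
    printf("%d\n", __builtin_ffs(8));   // 8 = 0b1000 -> prints 4 (1-based)
    printf("%d\n", __builtin_ctz(8u));  // 8 = 0b1000 -> prints 3 (0-based)
    printf("%d\n", __builtin_ffs(0));   // zero is well-defined for ffs -> prints 0
    // __builtin_ctz(0) is undefined behavior, so we don't call it here
    return 0;
}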
I'm not aware of the history of these two implementations (if you are, please add it in the comments), but I suspected that, since built-ins are supposed to encapsulate low-level assembly instructions in a nice language wrapper, it has to do with how these instructions are available on the different hardware that GCC supports. The Wikipedia entry for Find First Set has a good survey of hardware support across architectures.
Looking at everyone's favorite Intel x86, we can see that BSF (bit scan forward) has been around since the early days (it dates back to the 386), while TZCNT arrived later with the BMI1 instruction set, introduced first on AMD Piledriver in 2012 and Jaguar (low power) in 2013, and on Intel Haswell also in 2013. The main advantage of TZCNT over BSF is latency: the latter typically takes 3 cycles, and up to 8-10 cycles on early CPUs (Prescott, K7). The newer instruction not only has better typical latency (2 cycles on AMD, 3 on Intel) but also better throughput: on AMD Zen 1-3, two such instructions can execute in the same cycle, which is perfect for bit-heavy code.
Basic Test Code
So, just to get a sense of what to expect from these instructions, I created a baseline on Compiler Explorer. The program cycles through setting each bit in turn, plus an all-zero input, and records the result of each iteration. I compiled it with optimizations on and targeting the baseline x86-64 instruction set, which should work on every 64-bit x86 chip ever designed.
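The baseline looks roughly like the sketch below (function and variable names are my own; the original lives in the Godbolt link). The noinline attribute keeps the compiler from folding the calls away so the generated code stays visible:

#include <cstdio>

__attribute__((noinline)) int pos_ffs(int x)      { return __builtin_ffs(x); }
__attribute__((noinline)) int pos_ctz(unsigned x) { return __builtin_ctz(x); }

int main() {
    for (int i = 0; i <= 32; ++i) {
        unsigned x = (i < 32) ? (1u << i) : 0u;  // each single bit, then zero
        // note: the zero case exercises ctz's undefined input, which is
        // exactly the "garbage" behavior the article observes later
        printf("x=%08x ffs=%d ctz=%d\n", x, pos_ffs((int)x), pos_ctz(x));
    }
    return 0;
}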
x86-64 Assembly
Just as a crash intro to x86-64 assembly: under the System V calling convention, arguments and results are passed in registers as much as possible, in contrast with common 32-bit conventions, where arguments are passed on the stack. Integral (integer-like) arguments go in the DI, SI, DX, CX, R8 and R9 registers, in this order, and results come back in the AX register. The "R" prefix denotes the 64-bit register and "E" the 32-bit one, so RAX and EAX are the 64-bit and 32-bit views of the same register. The biggest advantage of this register passing is that no memory access is involved, not even one through a fast L1 cache.
In our particular case, the argument "x" is passed in EDI, the 32-bit view of the RDI register, and the result is returned in the 32-bit EAX register.
The two built-ins compile down to the same BSF instruction (BSFL in AT&T syntax for 32-bit operands). The only difference is that __builtin_ffs adds a conditional move keyed on the zero flag; in short, it handles a zero argument in an optimized, branchless way.
The CTZ built-in, however, maps exactly onto the bare instruction's behavior, which is very important to keep in mind as we develop algorithms on top of it: if you already know the argument is non-zero, using this built-in instead of FFS saves you a redundant check.
Checking the input argument
We could try to do better than the compiler by checking for the zero manually, but the result is eye-rolling: the assembly comes out exactly the same. Thank the SSA optimizations for that!
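A sketch of such a hand-written check (the name is mine); per the article's observation, GCC lowers this back into the same branchless sequence it emits for __builtin_ffs:

// Manual zero guard on top of ctz, mimicking ffs semantics.
int ffs_manual(int x) {
    if (x == 0) return 0;                   // explicit check for zero
    return __builtin_ctz((unsigned)x) + 1;  // 1-based index, as ffs does
}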
The results of the run show that the two built-ins are pretty much equivalent, with two exceptions: (1) ffs returns the index plus one, and (2) ffs handles a zero argument while ctz returns garbage for it, as expected.
BMI extensions
When we walk up the Intel architecture levels and switch on the BMI instruction set that Haswell introduced, either with "-march=haswell", with "-march=x86-64-v3" and above, or with "-march=bdver2" for the AMD family 15h parts with BMI, we see the compiler giving preference to the TZCNT instruction.
Returning std::pair<>
Now, these C-like functions with semantics encoded in the return value are somewhat outdated. Why don't we try some C++ abstractions and see what happens? First, let's wrap the result in a nice std::pair<bool,int>, where the bool flags whether the argument was non-zero. We can use std::get<1> to retrieve the index; we are going to ignore the bool in the print for now.
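A minimal sketch of such a wrapper, with a name and expression of my own choosing (the ctz-based variant is analogous):

#include <utility>

// The bool flags a non-zero argument; the int holds the 0-based index.
std::pair<bool, int> bitpos(int x) {
    return { x != 0, __builtin_ffs(x) - 1 };
}

In main, std::get<1>(bitpos(x)) retrieves the index.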
It turns out the two implementations are exactly the same, which does not surprise me one bit, because they ultimately boil down to the same graph after SSA squeezing. We can see that the std::pair is returned as a single 64-bit value in the RAX register, with the index in the high 32 bits and the boolean in the low 32 bits. Nice and compact, I like it.
Can we do better, perhaps? What if we invert the order of the fields in the pair? Would it make any difference? We switch the order and make sure to use std::get<0> in main to get the result.
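In sketch form, under the same naming assumptions as before:

// Fields reversed: index first, flag second; main now uses std::get<0>.
std::pair<int, bool> bitpos_rev(int x) {
    return { __builtin_ffs(x) - 1, x != 0 };
}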
No, there is no difference. Apart from swapping the A and C registers, so that the boolean sits in the high 32 bits and the index in the low 32 bits, they are exactly the same six instructions. If you want to mess around with the code above, here is the link.
Returning std::tuple<>
OK, but what about a tuple? Maybe it optimizes better? Let's check. We change std::pair to std::tuple and something fundamental changes: the result is no longer returned by value; it is returned through memory, via a hidden pointer. The implementation has only five instructions, but this can be bad, as calling this function will always result in a memory access.
It looks like the first argument, RDI, is now a pointer to the result value, and the second (phantom) argument, in ESI, is the actual "x" value.
This is effectively equivalent to calling a function with this signature:
void bitpos1( std::tuple<bool,int>& result, int x );
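In source form the tuple variant is a near-trivial edit of the pair sketch (same assumed naming):

#include <tuple>

// Identical logic, but returned as std::tuple<bool,int>; as observed
// above, the result now travels back through a hidden pointer in RDI.
std::tuple<bool, int> bitpos_tuple(int x) {
    return { x != 0, __builtin_ffs(x) - 1 };
}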
So let's call this a win for the syntax sugar! But why does this happen? I'll leave that for you C++20 standard rats to figure out and add to the comments section!
Regarding the actual difference between FFS and CTZ, there is none, as before.
For completeness, let's now reverse the order of the fields in the tuple. Nothing much changes, just the order of the operations.
Returning a struct
But I'm a curious cat, and I want to check what happens if, instead of a std::pair or a std::tuple, I use a plain old-school struct.
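Say, a sketch like this (the field names are my guess at the original):

// A plain aggregate with the same two members as the pair.
struct BitPos {
    bool found;  // was the argument non-zero?
    int  index;  // 0-based index of the first set bit
};

BitPos bitpos_struct(int x) {
    return { x != 0, __builtin_ffs(x) - 1 };
}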
And now, yikes! It added a branch, the source of all performance evil! It explicitly checks for a zero argument and branches on it. As in the std::pair case, the index is returned in the upper 32 bits of the 64-bit RAX register and the bool is available in its lower 32 bits. Again, the implementation is exactly the same for both the FFS and CTZ versions.
For completeness, let's reverse the fields as before.
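That is, something along these lines:

// Same aggregate with the members reversed: index first, flag second.
struct BitPosRev {
    int  index;
    bool found;
};

BitPosRev bitpos_struct_rev(int x) {
    return { __builtin_ffs(x) - 1, x != 0 };
}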
This time the result is 100% equivalent to our best assembly from the std::pair version: the index is returned in the low 32 bits of RAX and the boolean in its upper 32 bits, with no memory access performed.
Would it be possible to transform the struct example above to match exactly either the std::pair or the std::tuple implementations? I'll leave this as a challenge - comment below and share a Godbolt link if you got a solution!
Other architectures
Just for the sake of it, I ran the original baseline code against a few other CPU architectures, mostly for giggles.
On ARM, a conditional instruction (no branch) was generated around the hardware-supported bit scan sequence (RBIT to reverse the bits, followed by CLZ).
On ARM64, it's basically the same but the code is cleaner and more concise.
On the Raspberry Pi (still ARM), the result is almost the same for both calls.
On AVR (Arduino) and MSP430 Texas microcontrollers (not shown), there is no hardware support so software functions are called.
On the Kalray MPPA (Massively Parallel Processor Array), the operation is hardware-supported and the assembly is very concise. If you are not familiar with this architecture, here is a link.
On MRISC32, an open-source 32-bit soft CPU for FPGAs, the operation is hardware-supported.
On 64-bit RISC-V CPUs there is no hardware support for bit scan forward (at least not without the bit-manipulation extensions), so software functions doing bit sorcery are called. The same happens on 32-bit RISC-V.
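For a taste of what such a software routine can look like, here is the classic De Bruijn multiply fallback, a well-known portable technique (not necessarily the exact code libgcc uses):

// Count trailing zeros without hardware support (x must be non-zero).
// Isolate the lowest set bit, multiply by a De Bruijn constant, and
// use the top 5 bits of the product to index a lookup table.
int ctz_soft(unsigned x) {
    static const int table[32] = {
         0,  1, 28,  2, 29, 14, 24,  3, 30, 22, 20, 15, 25, 17,  4,  8,
        31, 27, 13, 23, 21, 19, 16,  7, 26, 12, 18,  6, 11,  5, 10,  9
    };
    return table[((x & -x) * 0x077CB531u) >> 27];
}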
Conclusion
I don't really know what the moral of this article is. In general I do these deep dives to gather information for future coding; it builds instinct. But if one conclusion must be drawn, it is that the abstraction you choose matters: the same bit scan can come back branchless in a register, pick up a branch, or force a round trip through memory, depending only on how you wrap the result.