Exploring Bit Scan Forward
Henrique Bucher
Financial Engineer | Substack Bestseller Author | C++/FPGA/Quant/Trading/HFT/ULL | PhD/MSc/BSc Engineering
Do you know what software engineers do when they are bored? They mess around with anything that might conceivably be cool. Wednesday was one of those days: I got bored and set out to explore. And what better to explore than Godbolt's Compiler Explorer, which recently completed 10 years of service?
So I was looking at the ffs ("find first set") operation which, roughly speaking, returns the index of the first non-zero bit counting from right to left (the trailing bit). It is commonly used in the implementation of fast bitsets.
DISCLAIMER: this article will likely not improve your market value. But if you're up for a typical nerd lunch-time diversion, stick around.
Built-ins
GCC implements two built-ins that perform roughly the same operation: __builtin_ffs accepts a signed integer and returns one plus the index of the least significant set bit (or zero when the argument is zero), while __builtin_ctz accepts an unsigned integer and returns the 0-based index of the same bit (and is undefined for a zero argument).
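To make the semantics concrete, here is a tiny usage sketch (the literal values are my own examples, not from the original article):

#include <cstdio>

int main() {
    printf("%d\n", __builtin_ffs(8));   // 8 = 0b1000 -> prints 4 (1-based)
    printf("%d\n", __builtin_ctz(8u));  // 8 = 0b1000 -> prints 3 (0-based)
    printf("%d\n", __builtin_ffs(0));   // zero is well-defined for ffs -> prints 0
    // __builtin_ctz(0) is undefined behavior, so we don't call it here
    return 0;
}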
I'm not aware of the history of these two implementations (if you are, please add it in the comments), but I suspected that, since built-ins are supposed to encapsulate low-level assembly instructions in a nice language wrapper, it has to do with how these instructions are available on the different hardware that GCC supports. The Wikipedia entry for Find First Set has a good survey of hardware support across architectures.
Looking at everyone's favorite Intel x86, we can see that BSF (bit scan forward) has been around since the early days (it dates back to the 386), while TZCNT arrived later with the BMI1 instruction set, introduced first on AMD Piledriver in 2012 and Jaguar (low power) in 2013, and on Intel Haswell also in 2013. The main advantage of TZCNT over BSF is latency: the latter typically takes 3 cycles, and up to 8-10 cycles on early CPUs (Prescott, K7). The newer instruction not only has better typical latency (2 cycles on AMD, 3 on Intel) but also better throughput: on AMD Zen 1-3, two such instructions can execute in the same cycle, which is perfect for bit-heavy code.
Basic Test Code
So, just to get a sense of what to expect from these instructions, I created a baseline on Compiler Explorer. The program cycles through setting each bit in turn, plus an all-zero input, and records the result of each iteration. I compiled it with optimizations on and targeting the baseline x86-64 instruction set, which should work on every 64-bit x86 chip ever designed.
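The baseline looks roughly like the sketch below (function and variable names are my own; the original lives in the Godbolt link). The noinline attribute keeps the compiler from folding the calls away so the generated code stays visible:

#include <cstdio>

__attribute__((noinline)) int pos_ffs(int x)      { return __builtin_ffs(x); }
__attribute__((noinline)) int pos_ctz(unsigned x) { return __builtin_ctz(x); }

int main() {
    for (int i = 0; i <= 32; ++i) {
        unsigned x = (i < 32) ? (1u << i) : 0u;  // each single bit, then zero
        // note: the zero case exercises ctz's undefined input, which is
        // exactly the "garbage" behavior the article observes later
        printf("x=%08x ffs=%d ctz=%d\n", x, pos_ffs((int)x), pos_ctz(x));
    }
    return 0;
}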
x86-64 Assembly
Just as a crash intro to x86-64 assembly: under the System V calling convention, arguments and results are passed in registers as much as possible, in contrast with common 32-bit conventions, where arguments are passed on the stack. Integral (integer-like) arguments go in the DI, SI, DX, CX, R8 and R9 registers, in this order, and results come back in the AX register. The "R" prefix denotes the 64-bit register and "E" the 32-bit one, so RAX and EAX are the 64-bit and 32-bit views of the same register. The biggest advantage of this register passing is that no memory access is involved, not even one through a fast L1 cache.
In our particular case, the argument "x" is passed in EDI, the 32-bit view of the RDI register, and the result is returned in the 32-bit EAX register.
The two built-ins compile down to the same BSF instruction (BSFL in AT&T syntax for 32-bit operands). The only difference is that __builtin_ffs adds a conditional move keyed on the zero flag; in short, it handles a zero argument in an optimized, branchless way.
The CTZ built-in, however, maps exactly onto the bare instruction's behavior, which is very important to keep in mind as we develop algorithms on top of it: if you already know the argument is non-zero, using this built-in instead of FFS saves you a redundant check.
Checking the input argument
We could try to do better than the compiler by checking for the zero manually, but the result is eye-rolling: the assembly comes out exactly the same. Thank the SSA optimizations for that!
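A sketch of such a hand-written check (the name is mine); per the article's observation, GCC lowers this back into the same branchless sequence it emits for __builtin_ffs:

// Manual zero guard on top of ctz, mimicking ffs semantics.
int ffs_manual(int x) {
    if (x == 0) return 0;                   // explicit check for zero
    return __builtin_ctz((unsigned)x) + 1;  // 1-based index, as ffs does
}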
The results of the run show that the two built-ins are pretty much equivalent, with two exceptions: (1) ffs returns the index plus one, and (2) ffs handles a zero argument while ctz returns garbage for it, as expected.
BMI extensions
When we walk up the Intel architecture levels and switch on the BMI instruction set that Haswell introduced, either with "-march=haswell", with "-march=x86-64-v3" and above, or with "-march=bdver2" for the AMD family 15h parts with BMI, we see the compiler giving preference to the TZCNT instruction.
Returning std::pair<>
Now, these C-like functions with semantics encoded in the return value are somewhat outdated. Why don't we try some C++ abstractions and see what happens? First, let's wrap the result in a nice std::pair<bool,int>, where the bool flags whether the argument was non-zero. We can use std::get<1> to retrieve the index; we are going to ignore the bool in the print for now.
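A minimal sketch of such a wrapper, with a name and expression of my own choosing (the ctz-based variant is analogous):

#include <utility>

// The bool flags a non-zero argument; the int holds the 0-based index.
std::pair<bool, int> bitpos(int x) {
    return { x != 0, __builtin_ffs(x) - 1 };
}

In main, std::get<1>(bitpos(x)) retrieves the index.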
It turns out the two implementations are exactly the same, which does not surprise me one bit, because they ultimately boil down to the same graph after SSA squeezing. We can see that the std::pair is returned as a single 64-bit value in the RAX register, with the index in the high 32 bits and the boolean in the low 32 bits. Nice and compact, I like it.
Can we do better, perhaps? What if we invert the order of the fields in the pair? Would it make any difference? We switch the order and make sure to use std::get<0> in main to get the result.
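In sketch form, under the same naming assumptions as before:

// Fields reversed: index first, flag second; main now uses std::get<0>.
std::pair<int, bool> bitpos_rev(int x) {
    return { __builtin_ffs(x) - 1, x != 0 };
}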
No, there is no difference. Apart from swapping the A and C registers, so that the boolean sits in the high 32 bits and the index in the low 32 bits, they are exactly the same six instructions. If you want to mess around with the code above, here is the link.
Returning std::tuple<>
OK, but what about a tuple? Maybe it optimizes better? Let's check. We change std::pair to std::tuple and something fundamental changes: the result is no longer returned by value; it is returned through memory, via a hidden pointer. The implementation has only five instructions, but this can be bad, as calling this function will always result in a memory access.
It looks like the first argument, RDI, is now a pointer to the result value, and the second (phantom) argument, in ESI, is the actual "x" value.
This is effectively equivalent to calling a function with this signature:
void bitpos1( std::tuple<bool,int>& result, int x );
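In source form the tuple variant is a near-trivial edit of the pair sketch (same assumed naming):

#include <tuple>

// Identical logic, but returned as std::tuple<bool,int>; as observed
// above, the result now travels back through a hidden pointer in RDI.
std::tuple<bool, int> bitpos_tuple(int x) {
    return { x != 0, __builtin_ffs(x) - 1 };
}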
So let's call this a win for the syntax sugar! But why does this happen? I'll leave that for you C++20 standard rats to figure out and add to the comments section!
Regarding the actual difference between FFS and CTZ, there is none, as before.
For completeness, let's now reverse the order of the fields in the tuple. Nothing much changes, just the order of the operations.
Returning a struct
But I'm a curious cat, and I want to check what happens if, instead of a std::pair or a std::tuple, I use a plain old-school struct.
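Say, a sketch like this (the field names are my guess at the original):

// A plain aggregate with the same two members as the pair.
struct BitPos {
    bool found;  // was the argument non-zero?
    int  index;  // 0-based index of the first set bit
};

BitPos bitpos_struct(int x) {
    return { x != 0, __builtin_ffs(x) - 1 };
}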
And now, yikes! It added a branch, the source of all performance evil! It explicitly checks for a zero argument and branches on it. As in the std::pair case, the index is returned in the upper 32 bits of the 64-bit RAX register and the bool is available in its lower 32 bits. Again, the implementation is exactly the same for both the FFS and CTZ versions.
For completeness, let's reverse the fields as before.
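That is, something along these lines:

// Same aggregate with the members reversed: index first, flag second.
struct BitPosRev {
    int  index;
    bool found;
};

BitPosRev bitpos_struct_rev(int x) {
    return { __builtin_ffs(x) - 1, x != 0 };
}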
This time the result is 100% equivalent to our best assembly from the std::pair version: the index is returned in the low 32 bits of RAX and the boolean in its upper 32 bits, with no memory access performed.
Would it be possible to transform the struct example above to match exactly either the std::pair or the std::tuple implementations? I'll leave this as a challenge - comment below and share a Godbolt link if you got a solution!
Other architectures
Just for the sake of it, I ran the original baseline code against a few other CPU architectures, mostly for giggles.
On ARM, a conditional instruction (no branch) was generated around the hardware-supported bit scan sequence (RBIT to reverse the bits, followed by CLZ).
On ARM64, it's basically the same but the code is cleaner and more concise.
On the Raspberry Pi (still ARM), the result is almost the same for both calls.
On AVR (Arduino) and MSP430 Texas microcontrollers (not shown), there is no hardware support so software functions are called.
On the Kalray MPPA (Massively Parallel Processor Array), the operation is hardware-supported and the assembly is very concise. If you are not familiar with this architecture, here is a link.
On MRISC32, an open-source 32-bit soft CPU for FPGAs, the operation is hardware-supported.
On 64-bit RISC-V CPUs there is no hardware support for bit scan forward (at least not without the bit-manipulation extensions), so software functions doing bit sorcery are called. The same happens on 32-bit RISC-V.
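For a taste of what such a software routine can look like, here is the classic De Bruijn multiply fallback, a well-known portable technique (not necessarily the exact code libgcc uses):

// Count trailing zeros without hardware support (x must be non-zero).
// Isolate the lowest set bit, multiply by a De Bruijn constant, and
// use the top 5 bits of the product to index a lookup table.
int ctz_soft(unsigned x) {
    static const int table[32] = {
         0,  1, 28,  2, 29, 14, 24,  3, 30, 22, 20, 15, 25, 17,  4,  8,
        31, 27, 13, 23, 21, 19, 16,  7, 26, 12, 18,  6, 11,  5, 10,  9
    };
    return table[((x & -x) * 0x077CB531u) >> 27];
}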
Conclusion
I don't really know what the moral of this article is. In general I do these deep dives to gather information for future coding; it builds instinct. But if one conclusion must be drawn, it is that the abstraction you choose matters: the same bit scan can come back branchless in a register, pick up a branch, or force a round trip through memory, depending only on how you wrap the result.