How many ways to count set bits do you know?
Counting set bits in an integer can be helpful in computer vision, e.g., for template matching or estimating a blob size. Modern C++ offers a whole range of approaches to the task. Here you will find an overview of how to solve the problem efficiently.
First, we will look at the existing algorithms and utilities and then benchmark their performance. All the code is written for std::uint64_t, though one can easily adapt the routines to integers of other widths.
Let's go.
Brute force
This algorithm comes to mind first: just check the state of each bit in the number individually. The algorithmic complexity is O(n), with n being the number of bits in the number.
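A minimal sketch of the idea (the function name is mine, not necessarily the one from the repo):

```cpp
#include <cstdint>

// Brute force: test all 64 bit positions one by one.
std::uint64_t popcount_naive(std::uint64_t value)
{
    std::uint64_t count = 0;
    for (int bit = 0; bit < 64; ++bit)
        count += (value >> bit) & 1u;  // add the state of the current bit
    return count;
}
```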
Kernighan's algorithm
The main idea of this algorithm is to iteratively clear the lowest set bit of the number and increase a counter until the integer becomes zero. The worst-case complexity is the same as in the brute-force case; however, the running time depends on the number of set bits, not the number of bits in general.
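A sketch of the loop (again, the name is mine): the expression value & (value - 1) clears the lowest set bit, so the body runs once per set bit:

```cpp
#include <cstdint>

// Kernighan's algorithm: each iteration removes exactly one set bit.
std::uint64_t popcount_kernighan(std::uint64_t value)
{
    std::uint64_t count = 0;
    while (value != 0)
    {
        value &= value - 1;  // drop the lowest set bit
        ++count;
    }
    return count;
}
```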
Lookup Table
Instead of counting the set bits at runtime, we can simply memorize their number for each value of a byte and then use the tabulated values. This is precisely what the lookup-table method does. This time the algorithmic complexity is proportional to the number of bytes in the integer, which sounds eight times better than the brute-force approach.
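A possible sketch, with the 256-entry table filled at compile time (the helper names are mine):

```cpp
#include <array>
#include <cstdint>

// table[i] holds the number of set bits in the byte value i.
constexpr std::array<std::uint8_t, 256> make_table()
{
    std::array<std::uint8_t, 256> table{};
    for (std::size_t i = 1; i < table.size(); ++i)
        table[i] = static_cast<std::uint8_t>((i & 1) + table[i / 2]);
    return table;
}

std::uint64_t popcount_table(std::uint64_t value)
{
    static constexpr auto table = make_table();
    std::uint64_t count = 0;
    for (int byte = 0; byte < 8; ++byte)  // one lookup per byte
        count += table[(value >> (8 * byte)) & 0xFF];
    return count;
}
```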
Divide & Conquer
One can first compute the number of set bits in each pair of adjacent bits of the integer. Then, by adding the results from the previous step, we find the sum over four consecutive bits, after that over eight, and so on, until we know the number of set bits integer-wide.
The algorithm's complexity is obviously O(log n). Following the ideas from this book, one could optimize the approach even further and reduce the number of operations at each doubling step. The snippet below shows the basic form.
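For example (a sketch along the usual lines, not necessarily the exact code from the repo):

```cpp
#include <cstdint>

// Parallel bit count: group sums double in width on every step
// (2 -> 4 -> 8 -> 16 -> 32 -> 64 bits).
std::uint64_t popcount_parallel(std::uint64_t v)
{
    v = (v & 0x5555555555555555ULL) + ((v >> 1)  & 0x5555555555555555ULL);
    v = (v & 0x3333333333333333ULL) + ((v >> 2)  & 0x3333333333333333ULL);
    v = (v & 0x0F0F0F0F0F0F0F0FULL) + ((v >> 4)  & 0x0F0F0F0F0F0F0F0FULL);
    v = (v & 0x00FF00FF00FF00FFULL) + ((v >> 8)  & 0x00FF00FF00FF00FFULL);
    v = (v & 0x0000FFFF0000FFFFULL) + ((v >> 16) & 0x0000FFFF0000FFFFULL);
    return (v & 0x00000000FFFFFFFFULL) + (v >> 32);
}
```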
Standard library tools
There are two possible options. The first one has been available since C++98 and is called std::bitset:
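For example (a sketch; the wrapper name is mine):

```cpp
#include <bitset>
#include <cstdint>

std::uint64_t popcount_bitset(std::uint64_t value)
{
    // std::bitset counts its set bits via the count() member function.
    return std::bitset<64>(value).count();
}
```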
The second option, std::popcount, was introduced into the standard library in C++20:
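Its usage is as direct as it gets (again, a sketch with a wrapper name of my own):

```cpp
#include <bit>  // C++20
#include <cstdint>

std::uint64_t popcount_std(std::uint64_t value)
{
    // Works on unsigned integer types directly, no intermediate object.
    return static_cast<std::uint64_t>(std::popcount(value));
}
```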
Both functions are implementation-dependent. The only thing one can say with certainty is that std::popcount works with integer types directly, without constructing an intermediate bitset object.
POPCNT instruction
Next, we will delve into solutions that depend on rather specific features of the processor architecture. The first such solution is a call to the POPCNT processor instruction. Intel and AMD processors have supported the instruction starting from the Nehalem and Barcelona microarchitectures, i.e., for roughly fifteen years now.
The instruction can be accessed directly via inline assembly or through one of the dedicated intrinsics from <immintrin.h>, as I did:
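A sketch of such a call (the wrapper name is mine; gcc and clang need -mpopcnt to accept the intrinsic):

```cpp
#include <immintrin.h>
#include <cstdint>

// Direct use of the POPCNT instruction via the _mm_popcnt_u64 intrinsic.
std::uint64_t popcount_intrinsic(std::uint64_t value)
{
    return static_cast<std::uint64_t>(_mm_popcnt_u64(value));
}
```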
SIMD-acceleration
SIMD vectorization applies to most of the algorithms above, provided they are run over an array of integers (in all honesty, you wouldn't be reading this text if counting set bits weren't a performance-critical operation in your code). As we will see in the next section, the compiler can do this work for you efficiently.
Still, if you are a lucky owner of an AVX-512-capable processor with VPOPCNTDQ support, and it is your target architecture, you can use the vectorized POPCNT instruction and write a highly performant utility to solve the task:
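What follows is a sketch under the assumptions just stated (AVX-512F plus VPOPCNTDQ; the function and buffer names are mine), not the exact code from the repo:

```cpp
#include <immintrin.h>
#include <cstdint>
#include <cstring>
#include <vector>

// Per-element popcount with the AVX-512 VPOPCNTDQ extension.
// Compile with -mavx512f -mavx512vpopcntdq (gcc/clang) or /arch:AVX512 (msvc).
std::vector<std::uint8_t> popcount_avx512(const std::vector<std::uint64_t>& values)
{
    std::vector<std::uint8_t> counts(values.size());
    constexpr std::size_t step = 8;  // 8 x 64 bit = one 512-bit zmm register
    const std::size_t vec_end = values.size() / step * step;

    for (std::size_t i = 0; i < vec_end; i += step)
    {
        const __m512i chunk   = _mm512_loadu_si512(values.data() + i);
        const __m512i popcnts = _mm512_popcnt_epi64(chunk);
        // Narrow the eight 64-bit counts to packed 8-bit integers and store.
        _mm_storel_epi64(reinterpret_cast<__m128i*>(counts.data() + i),
                         _mm512_cvtepi64_epi8(popcnts));
    }

    // Tail handling: run the leftover elements through zero-padded buffers.
    if (vec_end != values.size())
    {
        std::uint64_t in_buf[step] = {};
        std::uint8_t out_buf[step] = {};
        const std::size_t tail = values.size() - vec_end;
        std::memcpy(in_buf, values.data() + vec_end, tail * sizeof(std::uint64_t));
        const __m512i chunk = _mm512_loadu_si512(in_buf);
        _mm_storel_epi64(reinterpret_cast<__m128i*>(out_buf),
                         _mm512_cvtepi64_epi8(_mm512_popcnt_epi64(chunk)));
        std::memcpy(counts.data() + vec_end, out_buf, tail);
    }

    return counts;
}
```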
Unlike the previous algorithms, the snippet above requires at least some explanation. After creating the result container, we walk through the input in chunks of eight integers: each chunk is loaded into a 512-bit-wide zmm register, the set bits are counted with a single vector instruction, the eight 64-bit counts are narrowed to packed 8-bit integers occupying the first 64 bits of a register, and those bytes are stored into the result container. But that is only the first part; we still need to process the tail of the container, which didn't fit into a whole number of zmm registers. To keep the implementation simple and pure, I created two zero-padded buffer arrays of appropriate length and applied the same operations to them as in the main loop.
Performance comparison
Now the fun begins. I have benchmarked performance on a 10,000,000-element vector of std::uint64_t. Each algorithm has been run on an Intel Core i7-11850 processor, averaging the execution time over 400 runs.
The following graphs plot the execution time in milliseconds (note the logarithmic scale) against the algorithm for the three major compilers: msvc (version 19.34.31937), gcc (version 10.3.0), and clang (version 13.0.1).
The first plot represents the execution times for "standard" compilation flags - that is, the flags generated by CMake (I've used CMake ver. 3.23.2) for the Release configuration of the project. Since gcc and clang do not emit AVX-512 and POPCNT instructions in this case, the corresponding algorithms have not been included in the comparison.
All in all, the performance of the algorithms is in good agreement with the expectations: brute force is the slowest, Kernighan's algorithm is about one and a half times faster, and all the remaining algorithms are roughly an order of magnitude faster than brute force.
However, even this stage of benchmarking brings some surprises. For example, the brute-force approach is twice as fast on clang as on the other compilers. A close look at the generated assembly reveals that clang partially unrolls the shift-and-check loop. Besides, it uses the bt (bit test) instruction instead of a shift and a bitwise AND, while the others don't.
Adding the POPCNT flag (-mpopcnt) for gcc and clang allows testing the procedure with the direct call to the instruction and enables the corresponding optimization in the other functions.
The most noticeable speed-up occurs in the gcc and clang code for Kernighan's algorithm. Disassembly reveals an astonishing fact: these two compile the whole procedure into a single POPCNT instruction.
The standard library functions get a 1.5x speed-up on gcc and clang due to an internal compile-time switch to the POPCNT instruction. Since msvc has no dedicated flag for the POPCNT instruction and I didn't change the default enhanced instruction set selection, the execution time of the standard library functions has remained the same. In reality, however, msvc does use the POPCNT instruction even with the default flags but performs a runtime check for its availability; with the direct call, the runtime check is skipped.
At last, by enabling the instructions from the AVX-512 set, we can compare the performance of the vectorized popcount to all the rest of the algorithms. The gain is modest (only about a factor of two) because precious processor ticks must be spent on loading to and storing from the zmm registers.
Another interesting point is the triple speed-up of the brute-force algorithm on clang and gcc. It is explained by the compilers vectorizing the counting loop. Clang behaves more aggressively and applies zmm-wide (512-bit) instructions, while gcc operates only on the ymm (256-bit) registers.
In this last test case, I have enabled the /arch:AVX512 flag for msvc, which also implies the availability of the POPCNT instruction. This has affected the performance of the standard library functions, since they no longer require the runtime check for POPCNT availability. As a result, these implementations now perform the same as the direct instruction call.
Conclusion
The main conclusion of this post is straightforward: given the proper compilation flags, the standard library tools (std::popcount in particular) match the performance of a direct POPCNT call, and only hand-written AVX-512 code squeezes out noticeably more.
All the source code is available in a dedicated repo on my GitHub page.