How many ways to count set bits do you know?
Counting set bits in an integer can be helpful in computer vision, e.g., for template matching or estimating a blob size. Modern C++ offers a whole range of approaches to the task. Here you will find an overview of how to solve the problem efficiently.
First, we will look at the existing algorithms and utilities and then benchmark their performance. All the code is written for std::uint64_t, though one can easily adapt the routines to integers of other widths.
Let's go.
Brute force
This algorithm comes to mind first: just check the state of each bit in the number individually. The algorithmic complexity is O(n), with n being the number of bits in the number.
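A minimal sketch of the idea (the function name is mine, not necessarily the one from the repo):

```cpp
#include <cstdint>

// Brute force: test all 64 bit positions one by one.
std::uint64_t popcount_naive(std::uint64_t value)
{
    std::uint64_t count = 0;
    for (int bit = 0; bit < 64; ++bit)
        count += (value >> bit) & 1u;  // add the state of the current bit
    return count;
}
```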
Kernighan's algorithm
The main idea of this algorithm is to iteratively clear the lowest set bit of the number and increase a counter until the integer becomes zero. The worst-case complexity is the same as in the brute-force case; however, the running time depends on the number of set bits, not the number of bits in general.
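A sketch of the loop (again, the name is mine): the expression value & (value - 1) clears the lowest set bit, so the body runs once per set bit:

```cpp
#include <cstdint>

// Kernighan's algorithm: each iteration removes exactly one set bit.
std::uint64_t popcount_kernighan(std::uint64_t value)
{
    std::uint64_t count = 0;
    while (value != 0)
    {
        value &= value - 1;  // drop the lowest set bit
        ++count;
    }
    return count;
}
```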
Lookup Table
Instead of counting the set bits at runtime, we can simply memorize their number for each value of a byte and then use the tabulated values. This is precisely what the lookup-table method does. This time the algorithmic complexity is proportional to the number of bytes in the integer, which sounds eight times better than the brute-force approach.
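A possible sketch, with the 256-entry table filled at compile time (the helper names are mine):

```cpp
#include <array>
#include <cstdint>

// table[i] holds the number of set bits in the byte value i.
constexpr std::array<std::uint8_t, 256> make_table()
{
    std::array<std::uint8_t, 256> table{};
    for (std::size_t i = 1; i < table.size(); ++i)
        table[i] = static_cast<std::uint8_t>((i & 1) + table[i / 2]);
    return table;
}

std::uint64_t popcount_table(std::uint64_t value)
{
    static constexpr auto table = make_table();
    std::uint64_t count = 0;
    for (int byte = 0; byte < 8; ++byte)  // one lookup per byte
        count += table[(value >> (8 * byte)) & 0xFF];
    return count;
}
```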
Divide & Conquer
One can first compute the number of set bits in each pair of adjacent bits of the integer. Then, by adding the results from the previous step, we find the sum over four consecutive bits, after that over eight, and so on, until we know the number of set bits integer-wide.
The algorithm's complexity is obviously O(log n). Following the ideas from this book, one could optimize the approach even further and reduce the number of operations at each doubling step. The snippet below shows the basic form.
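For example (a sketch along the usual lines, not necessarily the exact code from the repo):

```cpp
#include <cstdint>

// Parallel bit count: group sums double in width on every step
// (2 -> 4 -> 8 -> 16 -> 32 -> 64 bits).
std::uint64_t popcount_parallel(std::uint64_t v)
{
    v = (v & 0x5555555555555555ULL) + ((v >> 1)  & 0x5555555555555555ULL);
    v = (v & 0x3333333333333333ULL) + ((v >> 2)  & 0x3333333333333333ULL);
    v = (v & 0x0F0F0F0F0F0F0F0FULL) + ((v >> 4)  & 0x0F0F0F0F0F0F0F0FULL);
    v = (v & 0x00FF00FF00FF00FFULL) + ((v >> 8)  & 0x00FF00FF00FF00FFULL);
    v = (v & 0x0000FFFF0000FFFFULL) + ((v >> 16) & 0x0000FFFF0000FFFFULL);
    return (v & 0x00000000FFFFFFFFULL) + (v >> 32);
}
```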
Standard library tools
There are two possible options. The first one has been available since C++98 and is called std::bitset:
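For example (a sketch; the wrapper name is mine):

```cpp
#include <bitset>
#include <cstdint>

std::uint64_t popcount_bitset(std::uint64_t value)
{
    // std::bitset counts its set bits via the count() member function.
    return std::bitset<64>(value).count();
}
```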
The second option, std::popcount, was introduced into the standard library in C++20:
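Its usage is as direct as it gets (again, a sketch with a wrapper name of my own):

```cpp
#include <bit>  // C++20
#include <cstdint>

std::uint64_t popcount_std(std::uint64_t value)
{
    // Works on unsigned integer types directly, no intermediate object.
    return static_cast<std::uint64_t>(std::popcount(value));
}
```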
Both functions are implementation-dependent. The only thing one can say with certainty is that std::popcount works with integer types directly, without constructing an intermediate bitset object.
POPCNT instruction
Next, we will delve into solutions that depend on rather specific features of the processor architecture. The first such solution is a call to the POPCNT processor instruction. Intel and AMD processors have supported the instruction starting from the Nehalem and Barcelona microarchitectures, i.e., for roughly fifteen years now.
The instruction can be accessed directly via inline assembly or through one of the dedicated intrinsics from <immintrin.h>, as I did:
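A sketch of such a call (the wrapper name is mine; gcc and clang need -mpopcnt to accept the intrinsic):

```cpp
#include <immintrin.h>
#include <cstdint>

// Direct use of the POPCNT instruction via the _mm_popcnt_u64 intrinsic.
std::uint64_t popcount_intrinsic(std::uint64_t value)
{
    return static_cast<std::uint64_t>(_mm_popcnt_u64(value));
}
```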
SIMD-acceleration
SIMD vectorization applies to most of the algorithms above, provided they are run over an array of integers (in all honesty, you wouldn't be reading this text if counting set bits weren't a performance-critical operation in your code). As we will see in the next section, the compiler can do this work for you efficiently.
Still, if you are a lucky owner of an AVX-512-capable processor with VPOPCNTDQ support, and it is your target architecture, you can use the vectorized POPCNT instruction and write a highly performant utility to solve the task:
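What follows is a sketch under the assumptions just stated (AVX-512F plus VPOPCNTDQ; the function and buffer names are mine), not the exact code from the repo:

```cpp
#include <immintrin.h>
#include <cstdint>
#include <cstring>
#include <vector>

// Per-element popcount with the AVX-512 VPOPCNTDQ extension.
// Compile with -mavx512f -mavx512vpopcntdq (gcc/clang) or /arch:AVX512 (msvc).
std::vector<std::uint8_t> popcount_avx512(const std::vector<std::uint64_t>& values)
{
    std::vector<std::uint8_t> counts(values.size());
    constexpr std::size_t step = 8;  // 8 x 64 bit = one 512-bit zmm register
    const std::size_t vec_end = values.size() / step * step;

    for (std::size_t i = 0; i < vec_end; i += step)
    {
        const __m512i chunk   = _mm512_loadu_si512(values.data() + i);
        const __m512i popcnts = _mm512_popcnt_epi64(chunk);
        // Narrow the eight 64-bit counts to packed 8-bit integers and store.
        _mm_storel_epi64(reinterpret_cast<__m128i*>(counts.data() + i),
                         _mm512_cvtepi64_epi8(popcnts));
    }

    // Tail handling: run the leftover elements through zero-padded buffers.
    if (vec_end != values.size())
    {
        std::uint64_t in_buf[step] = {};
        std::uint8_t out_buf[step] = {};
        const std::size_t tail = values.size() - vec_end;
        std::memcpy(in_buf, values.data() + vec_end, tail * sizeof(std::uint64_t));
        const __m512i chunk = _mm512_loadu_si512(in_buf);
        _mm_storel_epi64(reinterpret_cast<__m128i*>(out_buf),
                         _mm512_cvtepi64_epi8(_mm512_popcnt_epi64(chunk)));
        std::memcpy(counts.data() + vec_end, out_buf, tail);
    }

    return counts;
}
```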
Unlike the previous algorithms, the snippet above requires at least some explanation. After creating the result container, we walk through the input in chunks of eight integers: each chunk is loaded into a 512-bit-wide zmm register, the set bits are counted with a single vector instruction, the eight 64-bit counts are narrowed to packed 8-bit integers occupying the first 64 bits of a register, and those bytes are stored into the result container. But that is only the first part; we still need to process the tail of the container, which didn't fit into a whole number of zmm registers. To keep the implementation simple and pure, I created two zero-padded buffer arrays of appropriate length and applied the same operations to them as in the main loop.
Performance comparison
Now the fun begins. I have benchmarked performance on a 10,000,000-element vector of std::uint64_t. Each algorithm has been run on an Intel Core i7-11850 processor, averaging the execution time over 400 runs.
The following graphs plot the execution time in milliseconds (note the logarithmic scale) against the algorithm for the three major compilers: msvc (version 19.34.31937), gcc (version 10.3.0), and clang (version 13.0.1).
The first plot represents the execution times for "standard" compilation flags - that is, the flags generated by CMake (I've used CMake ver. 3.23.2) for the Release configuration of the project. Since gcc and clang do not emit AVX-512 and POPCNT instructions in this case, the corresponding algorithms have not been included in the comparison.
All in all, the performance of the algorithms is in good agreement with the expectations: brute force is the slowest, Kernighan's algorithm is about one and a half times faster, and all the remaining algorithms are roughly an order of magnitude faster than brute force.
However, even this stage of benchmarking brings some surprises. For example, the brute-force approach is twice as fast on clang as on the other compilers. A close look at the generated assembly reveals that clang partially unrolls the shift-and-check loop. Besides, it uses the bt (bit test) instruction instead of a shift and a bitwise AND, while the others don't.
Adding the POPCNT flag (-mpopcnt) for gcc and clang allows testing the procedure with the direct call to the instruction and enables the corresponding optimization in the other functions.
The most noticeable speed-up occurs in the gcc and clang code for Kernighan's algorithm. Disassembly reveals an astonishing fact: these two compile the whole procedure into a single POPCNT instruction.
The standard library functions get a 1.5x speed-up on gcc and clang due to an internal compile-time switch to the POPCNT instruction. Since msvc has no dedicated flag for the POPCNT instruction and I didn't change the default enhanced instruction set selection, the execution time of the standard library functions has remained the same. In reality, however, msvc does use the POPCNT instruction even with the default flags but performs a runtime check for its availability; with the direct call, the runtime check is skipped.
At last, by enabling the instructions from the AVX-512 set, we can compare the performance of the vectorized popcount to all the rest of the algorithms. The gain is modest (only about a factor of two) because precious processor ticks must be spent on loading to and storing from the zmm registers.
Another interesting point is the triple speed-up of the brute-force algorithm on clang and gcc. It is explained by the compilers vectorizing the counting loop. Clang behaves more aggressively and applies zmm-wide (512-bit) instructions, while gcc operates only on the ymm (256-bit) registers.
In this last test case, I have enabled the /arch:AVX512 flag for msvc, which also implies the availability of the POPCNT instruction. This has affected the performance of the standard library functions, since they no longer require the runtime check for POPCNT availability. As a result, these implementations now perform the same as the direct instruction call.
Conclusion
The main conclusion of this post is straightforward: given the proper compilation flags, the standard library tools (std::popcount in particular) match the performance of a direct POPCNT call, and only hand-written AVX-512 code squeezes out noticeably more.
All the source code is available in a dedicated repo on my GitHub page.