To Int or To UInt, This Is The Question
Alex Dathskovsky
Director of Software Engineering @ Speedata.io | C++ Guru and Speaker | ISO C++ standardization group member
During our work we use integral types to perform arithmetic calculations, but we usually don't think about how the selection of the data type will affect performance and compiler optimizations.
In this short article we will see the importance of selecting the correct data type for the job and how it affects compiler optimizations. We will also look at the overall performance implications for the application.
Throughout this article we will be using Clang 13.0.1 and -O2 as the optimization level.
Let's start with a very simple example:
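The original snippet is not reproduced here; the code below is a minimal reconstruction of the two functions described next (the function names follow the prose):

```cpp
// Unsigned version: add two numbers and divide the result by two.
unsigned add_and_devide_u(unsigned a, unsigned b) {
    return (a + b) / 2;
}

// Signed version: the same logic with signed integers.
int add_and_devide_s(int a, int b) {
    return (a + b) / 2;
}
```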
In this code snippet we can see the use of different data types for the same simple logic. These simple functions take two numbers as input, add them up, and divide the result by two. add_and_devide_u is the unsigned version and add_and_devide_s is the signed version. (Sorry for the spelling mistake in the function names.)
Now let's see what the compiler will do for each one of these functions:
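This is roughly what Clang 13 with -O2 emits for the unsigned version (a hand-copied sketch; exact register choices may differ between compiler versions):

```asm
add_and_devide_u(unsigned int, unsigned int):
        lea     eax, [rdi + rsi]        # eax = a + b
        shr     eax                     # logical shift right by 1  ==  divide by 2
        ret
```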
Here you can see that the compiler was smart enough to understand that, because we are working with unsigned numbers, instead of the latency-heavy division instruction it can use SHR (shift logical right), which is a very simple bit instruction for the CPU. The assembly is very simple: it calculates the addition of a and b and then shifts the result right by one bit.
Side note: why can the compiler optimize the division into a shift, and why is the division instruction latency-heavy?
Latency: this is the delay that an instruction generates in a dependency chain, measured in core clock cycles (not the reference clock cycles given by the time stamp counter). Published numbers are minimum values; cache misses, misalignment, and exceptions may increase the clock counts considerably. Where hyperthreading is enabled, the use of the same execution units by the other thread leads to inferior performance. Denormal numbers, NaNs, and infinities do not increase the latency.
For example, the add instruction has a latency of 1 (not pipelined) and imul has a latency of 3 (pipelined), while the latency of the div and idiv instructions is at least 20 times higher (depending on the architecture).
Back to our Example:
Now let's look at what the compiler produces for the signed version. Think about it for a minute - will it be able to optimize the assembly in the same way?
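This is roughly the code Clang 13 with -O2 generates for the signed version (again a hand-written sketch; the exact instruction order may differ):

```asm
add_and_devide_s(int, int):
        add     edi, esi                # edi = a + b
        mov     eax, edi
        shr     eax, 31                 # eax = 1 if the sum is negative, 0 otherwise
        add     eax, edi                # bias negative sums by +1 ...
        sar     eax                     # ... so the arithmetic shift rounds toward zero
        ret
```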
From this snippet we can see that the compiler cannot optimize the code in the same way. It is still smart enough not to perform the division itself, but in this case it cannot just add and shift right as before. You may ask yourself what happened here. The answer is in the snippet itself: as you may know, signed integers may represent negative numbers. SAR (arithmetic shift right) uses a different rounding mode: it rounds down (toward negative infinity), so to achieve the round-toward-zero behavior required by integer division the compiler must add one when the number is negative and only then shift.
What is the difference between SHR and SAR operations:
SHR: logical shift right shifts the bits to the right and the MSB becomes 0.
Example: shr 10110111 = 01011011
SAR: arithmetic shift right shifts the bits to the right and the MSB stays the same as in the original number.
Example: sar 10110101 = 11011010
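The same two bit patterns in C++, shifted with the built-in >> operator (a small sketch; note that a right shift of a negative signed value is only guaranteed to be arithmetic since C++20, although mainstream compilers have always behaved this way):

```cpp
#include <cstdint>
#include <iostream>

int main() {
    std::uint8_t u = 0b10110111;                            // 183
    std::int8_t  s = static_cast<std::int8_t>(0b10110101);  // -75 on two's-complement machines

    // Unsigned shift compiles to SHR: the MSB becomes 0.
    std::cout << (u >> 1) << '\n';        // 0b01011011 = 91

    // Signed shift compiles to SAR: the sign bit is preserved.
    std::cout << (s >> 1) << '\n';        // 0b11011010 = -38
}
```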
From the first example we can already see that using the correct data type can enable better optimizations and improve performance.
The first example was pretty simple and the gain will not be that high. Next let's look at a more complex example:
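The following is a reconstruction of the two functions as described below (the exact integer width used in the original listing is an assumption):

```cpp
#include <cstdint>

std::int64_t mat_signed(std::int64_t n) {
    std::int64_t sum = 0;
    for (std::int64_t i = 1; i <= n; ++i)    // sum the series 1..n
        sum += i;
    return sum;
}

std::uint64_t mat_unsigned(std::uint64_t n) {
    std::uint64_t sum = 0;
    for (std::uint64_t i = 1; i <= n; ++i)
        sum += i;
    return sum;
}
```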
In this example we see more complex code. The functions implement the sum of an arithmetic series, where the only difference between the two functions is that mat_signed sums a series of signed numbers and mat_unsigned sums a series of unsigned numbers.
Arithmetic Series
Let's start the discussion by understanding what is an Arithmetic Series.
Arithmetic Series is a series of numbers where the difference between any two sequential numbers is constant.
For example : 1,2,3,4,5,6,7,8,9,10,...,n is an Arithmetic Series where the difference between any two sequential numbers is 1.
The summation of any arithmetic series can be calculated with the following equation:
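Sn = n × (a1 + an) / 2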
where n is the number of elements in the series, a1 is the first element of the series, and an is the last element of the series.
Back to the example:
Our functions calculate the summation of an arithmetic series that spans from 1 to n, where n is provided by the user.
Now let's explore the assembly generated by the compiler for each function:
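For the signed version, Clang 13 with -O2 produces something along these lines (an illustrative, hand-simplified sketch rather than verbatim compiler output):

```asm
mat_signed(long):
        test    rdi, rdi
        jle     .zero                   # n <= 0: the loop body never runs, return 0
        lea     rax, [rdi - 1]          # rax = n - 1
        lea     rcx, [rdi - 2]          # rcx = n - 2
        mul     rcx                     # rdx:rax = (n - 1) * (n - 2)
        shld    rdx, rax, 63            # rdx = (n - 1) * (n - 2) / 2
        lea     rax, [rdx + 2*rdi]      # + 2n
        dec     rax                     # - 1   ==>   rax = n * (n + 1) / 2
        ret
.zero:
        xor     eax, eax
        ret
```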
From this snippet we can see that the compiler was able to figure out that we actually want the sum of the series, so it removed the loop and used the equation presented above to calculate the sum directly.
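For the unsigned version the picture is very different (again an illustrative sketch of the generated code):

```asm
mat_unsigned(unsigned long):
        xor     eax, eax                # sum = 0
        test    rdi, rdi
        je      .done                   # n == 0: skip the loop
        mov     ecx, 1                  # i = 1
.loop:
        add     rax, rcx                # sum += i
        inc     rcx                     # ++i
        cmp     rcx, rdi
        jbe     .loop                   # repeat while i <= n (unsigned comparison)
.done:
        ret
```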
Surprisingly, here the compiler was not able to optimize the code. You can see the loop that computes the sum one element at a time: it adds rcx to rax, increments rcx by 1, and then compares rcx with rdi, branching back to the top of the loop until rcx becomes greater than n. Unlike in the previous example, here the use of unsigned integers made the compiler's job much harder, and it couldn't remove the loop. We will get to the reason soon, but let's look at performance first.
Performance Evaluation:
Now, let's look at the performance of both functions using Quick Bench micro-benchmarks.
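The benchmarks on quick-bench.com look roughly like this (a reconstructed sketch; the benchmark names are illustrative, and benchmark::DoNotOptimize is the real Google Benchmark facility that Quick Bench exposes):

```cpp
#include <benchmark/benchmark.h>   // added implicitly on quick-bench.com
#include <cstdint>

// mat_signed / mat_unsigned as sketched earlier in the article.
std::int64_t mat_signed(std::int64_t n) {
    std::int64_t sum = 0;
    for (std::int64_t i = 1; i <= n; ++i) sum += i;
    return sum;
}

std::uint64_t mat_unsigned(std::uint64_t n) {
    std::uint64_t sum = 0;
    for (std::uint64_t i = 1; i <= n; ++i) sum += i;
    return sum;
}

static void bench_mat_signed(benchmark::State& state) {
    for (auto _ : state) {
        auto sum = mat_signed(100);          // sum of the 100-element series
        benchmark::DoNotOptimize(sum);       // keep the result from being removed
    }
}
BENCHMARK(bench_mat_signed);

static void bench_mat_unsigned(benchmark::State& state) {
    for (auto _ : state) {
        auto sum = mat_unsigned(100);
        benchmark::DoNotOptimize(sum);
    }
}
BENCHMARK(bench_mat_unsigned);
```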
The benchmarks are very similar; the only difference between them is the summing function that is called (mat_signed/mat_unsigned). Each benchmark function calculates the sum of a 100-element series and stores the result in a variable which is never used, yet it is not optimized away thanks to the special DoNotOptimize function provided by Quick Bench.
Results:
The mat_signed function is roughly 34x faster than the same implementation with unsigned numbers. The reason is, of course, the loop that was produced in the unsigned assembly: mat_signed just evaluates an equation, while the unsigned version has to go over all the numbers one by one. The number of calculations the CPU has to do, even ignoring the branching, is much greater in this case: the application has to execute at least 100 add instructions, far more than in the signed version. Another problem is the added branches, which usually hurt performance despite the CPU's branch predictor; predictions can still be wrong, forcing the CPU to flush work it has already done in the pipeline and execute the correct path of instructions. As the benchmark results show, using the wrong data type can degrade the overall performance of the application.
What happened? Why couldn't the compiler optimize the unsigned code?
This is a very good question. Let's first understand the difference between signed and unsigned numbers: unsigned integer overflow is well defined by the standard and simply wraps around modulo 2^N, while signed integer overflow is undefined behavior.
Now that we understand the difference, we can understand why the compiler couldn't apply the same optimization: since unsigned overflow is well-defined behavior, the compiler must preserve it exactly. With the i <= n loop condition, an unsigned counter can wrap around instead of ever exceeding n, so the compiler cannot prove how many times the loop runs and cannot replace it with the closed-form equation. With signed integers, overflow is undefined behavior, so the compiler may assume it never happens, conclude the loop runs exactly n times, and emit the optimized formula. The unsigned implementation is therefore left with the plain loop.
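A small snippet that demonstrates the difference (the variable names are mine):

```cpp
#include <cstdint>
#include <iostream>
#include <limits>

int main() {
    // Unsigned overflow is well defined: the value wraps around modulo 2^32,
    // and the compiler must preserve this behavior.
    std::uint32_t u = std::numeric_limits<std::uint32_t>::max();
    ++u;
    std::cout << u << '\n';   // prints 0

    // Signed overflow is undefined behavior, so the compiler may assume it
    // never happens and optimize accordingly.
    std::int32_t s = std::numeric_limits<std::int32_t>::max();
    // ++s;                   // undefined behavior: don't do this
    (void)s;
}
```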
Why Signed Integers Are Preferred:
If a value is out of range for an unsigned type, to store it we divide it by one greater than the maximum value of that type and keep only the remainder.
Let's look at a 16-bit unsigned number:
65538 is too large to fit in a 16-bit unsigned integer, because the 16-bit range is 0 to 65535. One greater than the largest number in this range is 65536. Therefore, we divide 65538 by 65536 and store the remainder, which is 2.
Another way to think about this calculation is that any larger number is simply wrapped around (modulo wrapping): if the number is too big to fit, it is reduced modulo 2^N and only the remainder is kept.
Wrapping doesn't happen only on overflow; it also happens when we try to store a negative number in an unsigned integer.
Let's look at a 16-bit unsigned integer:
Let's try to assign -1 to it: the value wraps around to the top of the range and 65535 is saved in the variable. The compiler detects an assignment of a signed integer to an unsigned variable; the signed value has to be converted to unsigned, and since -1 doesn't exist in the unsigned range it wraps around to the highest number in that range.
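Both cases in code (compilers typically warn about these conversions, but both are well defined):

```cpp
#include <cstdint>
#include <iostream>

int main() {
    std::uint16_t a = 65538;   // out of range: 65538 % 65536 == 2 is stored
    std::uint16_t b = -1;      // wraps to the top of the range
    std::cout << a << ' ' << b << '\n';   // prints "2 65535"
}
```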
Why should we avoid using unsigned integers when we don’t really need the non-negative semantics of the data type?
Another example:
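The original listing is not reproduced here; the following is a minimal sketch of the kind of bug described below, with hypothetical names:

```cpp
#include <cstddef>

// If count is 0, count - 1 wraps around to 18446744073709551615
// (SIZE_MAX on a 64-bit platform) and the loop runs that many times
// instead of not running at all.
void clear_backwards(char* buf, std::size_t count) {
    for (std::size_t i = count - 1; i > 0; --i) {
        buf[i] = 0;    // far out of bounds once the index has wrapped
    }
}
```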
In this example the prefix -- operator is used in the loop, and if i is not initialized properly the program nominally has to run 18446744073709551615 times, which is definitely unplanned.
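The decoding example described below, again as a hedged reconstruction (the function name and the destination buffer size are assumptions):

```cpp
#include <cstddef>

void decode(const char* buffer, int size) {
    char decoded[256];                          // destination buffer of predefined size

    // i is unsigned and size is signed, so size is converted to unsigned for
    // the comparison; size == -1 becomes a huge value (-Wsign-compare warns here).
    for (std::size_t i = 0; i < size; ++i) {
        decoded[i] = buffer[i] ^ 0xC;           // eventually writes far past 'decoded'
    }
}
```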
In this code snippet we can see an example of a very simplistic decoding function that receives a buffer of chars and the size of the buffer as a signed integer. The function prepares a buffer of predefined size for the decoded message. The for loop compares signed and unsigned integers, so the signed integer is converted to unsigned (the size variable is converted to an unsigned integer). The decoding process then just writes each char from the buffer xor'ed with 0xC. If -1 is passed as the size, this code can cause a buffer overflow or stack corruption, because the loop will run far longer than the buffer that was provided.
What can be done?
C++20 helps us with such comparisons by providing dedicated utilities in the <utility> header (a short usage sketch follows the list):
std::cmp_equal: ==
std::cmp_not_equal: !=
std::cmp_less: <
std::cmp_less_equal: <=
std::cmp_greater: >
std::cmp_greater_equal: >=
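Here is how they behave compared with the built-in operator (requires C++20):

```cpp
#include <iostream>
#include <utility>

int main() {
    int          s = -1;
    unsigned int u = 1;

    // The built-in comparison converts s to unsigned, so -1 becomes a huge value.
    std::cout << (s < u) << '\n';              // prints 0 (surprising)

    // std::cmp_less compares the mathematical values regardless of signedness.
    std::cout << std::cmp_less(s, u) << '\n';  // prints 1
}
```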
You can read more about these functions on cppreference.
Another common pitfall is the use of an auto variable that’s initialized using an integer literal.
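A minimal sketch of the problem (the container and loop are reconstructed, not the original listing):

```cpp
#include <array>
#include <iostream>

int main() {
    std::array<int, 5> arr{1, 2, 3, 4, 5};

    // The literal 0 has type int, so i is deduced as a signed int, while
    // arr.size() returns an unsigned std::size_t: every iteration performs a
    // signed/unsigned comparison (-Wsign-compare warns here).
    for (auto i = 0; i < arr.size(); ++i) {
        std::cout << arr[i] << ' ';
    }
}
```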
In this example auto is used for the loop index. In C++, an integer literal without a suffix is deduced as a signed int, so i's type will be int. However, arr.size() returns an unsigned integer (std::size_t), and the comparison of the signed i with the unsigned size() is error prone.
There are a few ways to fix this. For instance, you may use the correct data type to initialize the variable: std::size_t i = 0. If you have access to C++20 you may use std::ssize(arr) to get a signed size, and last but not least, C++23 adds a std::size_t literal suffix, so using auto i = 0uz will be possible.
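The three fixes side by side (a sketch; the commented-out variant needs a C++23 compiler):

```cpp
#include <array>
#include <cstddef>
#include <iostream>
#include <iterator>

int main() {
    std::array<int, 5> arr{1, 2, 3, 4, 5};

    // Option 1: spell out the matching unsigned type.
    for (std::size_t i = 0; i < arr.size(); ++i) std::cout << arr[i] << ' ';

    // Option 2 (C++20): compare against a signed size.
    for (std::ptrdiff_t i = 0; i < std::ssize(arr); ++i) std::cout << arr[i] << ' ';

    // Option 3 (C++23): the uz suffix makes the deduced type std::size_t.
    // for (auto i = 0uz; i < arr.size(); ++i) std::cout << arr[i] << ' ';
}
```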
Should we use Unsigned numbers?
The answer is yes! As shown in the first example, bit manipulations are faster and safer when applied to unsigned integers, and the compiler can find better optimizations for simple mathematical operations.
Conclusion: using the correct data type is really important for performance, compiler optimizations, and even application safety. As a best practice, we should always use signed integers unless an unsigned integer is actually needed, e.g. for bit manipulation. If we mix signed and unsigned integers, it is preferable to use the new standard's safe comparison functions. Last but not least, if auto is used with literals, the literals should always have the correct type.
Good one!
Very good writing Alex! Enjoyed reading it. I'm sooo on the fence here, for many years. On one hand, there was a (mistake? some say) in the std where all size() are unsigned. It is now fixed with 23(?) ssize() available in containers. I know for some time (btw https://subscription.packtpub.com/book/programming/9781800208117 also mentions it in the very first few chapters) that int is faster, because the compiler does not need to "deal" with wrap around. There is no wrap around in signed, so it "cannot happen" ;) I know many (Bjarne, Chandler etc...) advocate for signed int for the same reasons you indicated. But for me, besides a possible perf loss, there is no wrap around for int, it's UB (I think this might change in the future as all impls are going to be 2's complement??). Plus, I like to work with correct semantics. If something cannot be negative, why not express it in the type system? This is one reason we have a strong type system. Again - I understand the points that you and others bring to the table, these are good points, but there are other points that also need to be looked at, at least IMO.
Lecturer at the Academic College of Tel-Aviv-Yaffo, Co-Organizer of Core C++
Very nice article! It is interesting to note that if in the arithmetic series example we change the loop to: i=0; i<n; then clang is able to optimize the unsigned version as well! (And gcc doesn't optimize either of the two, even with -O3.) See: https://godbolt.org/z/hxxE44vn8