To Int or To UInt, This Is The Question
Alex Dathskovsky
Director of Software Engineering @ Speedata.io | C++ Guru and Speaker | ISO C++ standardization group member
During our work we use integral types to perform arithmetic calculations, but we usually don't think about how the selection of the data type will affect performance and compiler optimizations.
In this short article we will see the importance of selecting the correct data type for the job and how it affects compiler optimizations. We will also look at the overall performance implications for the application.
Throughout this article we will be using Clang 13.0.1 and -O2 as the optimization level.
Let's start with a very simple example:
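The original snippet is not reproduced here; the code below is a minimal reconstruction of the two functions described next (the function names follow the prose):

```cpp
// Unsigned version: add two numbers and divide the result by two.
unsigned add_and_devide_u(unsigned a, unsigned b) {
    return (a + b) / 2;
}

// Signed version: the same logic with signed integers.
int add_and_devide_s(int a, int b) {
    return (a + b) / 2;
}
```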
In this code snippet we can see the use of different data types for the same simple logic. These simple functions take two numbers as input, add them up, and divide the result by two. add_and_devide_u is the unsigned version and add_and_devide_s is the signed version. (Sorry for the spelling mistake in the function names.)
Now let's see what the compiler will do for each one of these functions:
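This is roughly what Clang 13 with -O2 emits for the unsigned version (a hand-copied sketch; exact register choices may differ between compiler versions):

```asm
add_and_devide_u(unsigned int, unsigned int):
        lea     eax, [rdi + rsi]        # eax = a + b
        shr     eax                     # logical shift right by 1  ==  divide by 2
        ret
```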
Here you can see that the compiler was smart enough to understand that, because we are working with unsigned numbers, instead of the latency-heavy division instruction it can use SHR (shift logical right), which is a very simple bit instruction for the CPU. The assembly is very simple: it calculates the addition of a and b and then shifts the result right by one bit.
Side note: why can the compiler optimize the division into a shift, and why is the division instruction latency-heavy?
Latency: this is the delay that an instruction generates in a dependency chain, measured in core clock cycles (not the reference clock cycles given by the time stamp counter). Published numbers are minimum values; cache misses, misalignment, and exceptions may increase the clock counts considerably. Where hyperthreading is enabled, the use of the same execution units by the other thread leads to inferior performance. Denormal numbers, NaNs, and infinities do not increase the latency.
For example, the add instruction has a latency of 1 (not pipelined) and imul has a latency of 3 (pipelined), while the latency of the div and idiv instructions is at least 20 times higher (depending on the architecture).
Back to our Example:
Now let's look at what the compiler produces for the signed version. Think about it for a minute - will it be able to optimize the assembly in the same way?
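This is roughly the code Clang 13 with -O2 generates for the signed version (again a hand-written sketch; the exact instruction order may differ):

```asm
add_and_devide_s(int, int):
        add     edi, esi                # edi = a + b
        mov     eax, edi
        shr     eax, 31                 # eax = 1 if the sum is negative, 0 otherwise
        add     eax, edi                # bias negative sums by +1 ...
        sar     eax                     # ... so the arithmetic shift rounds toward zero
        ret
```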
From this snippet we can see that the compiler cannot optimize the code in the same way. It is still smart enough not to perform the division itself, but in this case it cannot just add and shift right as before. You may ask yourself what happened here. The answer is in the snippet itself: as you may know, signed integers may represent negative numbers. SAR (arithmetic shift right) uses a different rounding mode: it rounds down (toward negative infinity), so to achieve the round-toward-zero behavior required by integer division the compiler must add one when the number is negative and only then shift.
What is the difference between SHR and SAR operations:
SHR: logical shift right shifts the bits to the right and the MSB becomes 0.
Example: shr 10110111 = 01011011
SAR: arithmetic shift right shifts the bits to the right and the MSB stays the same as in the original number.
Example: sar 10110101 = 11011010
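The same two bit patterns in C++, shifted with the built-in >> operator (a small sketch; note that a right shift of a negative signed value is only guaranteed to be arithmetic since C++20, although mainstream compilers have always behaved this way):

```cpp
#include <cstdint>
#include <iostream>

int main() {
    std::uint8_t u = 0b10110111;                            // 183
    std::int8_t  s = static_cast<std::int8_t>(0b10110101);  // -75 on two's-complement machines

    // Unsigned shift compiles to SHR: the MSB becomes 0.
    std::cout << (u >> 1) << '\n';        // 0b01011011 = 91

    // Signed shift compiles to SAR: the sign bit is preserved.
    std::cout << (s >> 1) << '\n';        // 0b11011010 = -38
}
```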
From the first example we can already see that using the correct data type can enable better optimizations and improve performance.
The first example was pretty simple and the gain will not be that high. Next let's look at a more complex example:
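The following is a reconstruction of the two functions as described below (the exact integer width used in the original listing is an assumption):

```cpp
#include <cstdint>

std::int64_t mat_signed(std::int64_t n) {
    std::int64_t sum = 0;
    for (std::int64_t i = 1; i <= n; ++i)    // sum the series 1..n
        sum += i;
    return sum;
}

std::uint64_t mat_unsigned(std::uint64_t n) {
    std::uint64_t sum = 0;
    for (std::uint64_t i = 1; i <= n; ++i)
        sum += i;
    return sum;
}
```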
In this example we see more complex code. The functions implement the sum of an arithmetic series, where the only difference between the two functions is that mat_signed sums a series of signed numbers and mat_unsigned sums a series of unsigned numbers.
Arithmetic Series
Let's start the discussion by understanding what is an Arithmetic Series.
Arithmetic Series is a series of numbers where the difference between any two sequential numbers is constant.
For example : 1,2,3,4,5,6,7,8,9,10,...,n is an Arithmetic Series where the difference between any two sequential numbers is 1.
The summation of any arithmetic series can be calculated with the following equation:
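Sn = n × (a1 + an) / 2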
where n is the number of elements in the series, a1 is the first element of the series, and an is the last element of the series.
Back to the example:
Our functions calculate the summation of an arithmetic series that spans from 1 to n, where n is provided by the user.
Now let's explore the assembly generated by the compiler for each function:
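For the signed version, Clang 13 with -O2 produces something along these lines (an illustrative, hand-simplified sketch rather than verbatim compiler output):

```asm
mat_signed(long):
        test    rdi, rdi
        jle     .zero                   # n <= 0: the loop body never runs, return 0
        lea     rax, [rdi - 1]          # rax = n - 1
        lea     rcx, [rdi - 2]          # rcx = n - 2
        mul     rcx                     # rdx:rax = (n - 1) * (n - 2)
        shld    rdx, rax, 63            # rdx = (n - 1) * (n - 2) / 2
        lea     rax, [rdx + 2*rdi]      # + 2n
        dec     rax                     # - 1   ==>   rax = n * (n + 1) / 2
        ret
.zero:
        xor     eax, eax
        ret
```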
From this snippet we can see that the compiler was able to figure out that we actually want the sum of the series, so it removed the loop and used the equation presented above to calculate the sum directly.
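For the unsigned version the picture is very different (again an illustrative sketch of the generated code):

```asm
mat_unsigned(unsigned long):
        xor     eax, eax                # sum = 0
        test    rdi, rdi
        je      .done                   # n == 0: skip the loop
        mov     ecx, 1                  # i = 1
.loop:
        add     rax, rcx                # sum += i
        inc     rcx                     # ++i
        cmp     rcx, rdi
        jbe     .loop                   # repeat while i <= n (unsigned comparison)
.done:
        ret
```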
Surprisingly, here the compiler was not able to optimize the code. You can see the loop that computes the sum one element at a time: it adds rcx to rax, increments rcx by 1, and then compares rcx with rdi, branching back to the top of the loop until rcx becomes greater than n. Unlike in the previous example, here the use of unsigned integers made the compiler's job much harder, and it couldn't remove the loop. We will get to the reason soon, but let's look at performance first.
Performance Evaluation:
Now, let's look at the performance of both functions using Quick Bench micro-benchmarks.
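The benchmarks on quick-bench.com look roughly like this (a reconstructed sketch; the benchmark names are illustrative, and benchmark::DoNotOptimize is the real Google Benchmark facility that Quick Bench exposes):

```cpp
#include <benchmark/benchmark.h>   // added implicitly on quick-bench.com
#include <cstdint>

// mat_signed / mat_unsigned as sketched earlier in the article.
std::int64_t mat_signed(std::int64_t n) {
    std::int64_t sum = 0;
    for (std::int64_t i = 1; i <= n; ++i) sum += i;
    return sum;
}

std::uint64_t mat_unsigned(std::uint64_t n) {
    std::uint64_t sum = 0;
    for (std::uint64_t i = 1; i <= n; ++i) sum += i;
    return sum;
}

static void bench_mat_signed(benchmark::State& state) {
    for (auto _ : state) {
        auto sum = mat_signed(100);          // sum of the 100-element series
        benchmark::DoNotOptimize(sum);       // keep the result from being removed
    }
}
BENCHMARK(bench_mat_signed);

static void bench_mat_unsigned(benchmark::State& state) {
    for (auto _ : state) {
        auto sum = mat_unsigned(100);
        benchmark::DoNotOptimize(sum);
    }
}
BENCHMARK(bench_mat_unsigned);
```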
The benchmarks are very similar; the only difference between them is the summing function that is called (mat_signed/mat_unsigned). Each benchmark function calculates the sum of a 100-element series and stores the result in a variable which is never used, yet it is not optimized away thanks to the special DoNotOptimize function provided by Quick Bench.
Results:
The mat_signed function is roughly 34x faster than the same implementation with unsigned numbers. The reason is, of course, the loop that was produced in the unsigned assembly: mat_signed just evaluates an equation, while the unsigned version has to go over all the numbers one by one. The number of calculations the CPU has to do, even ignoring the branching, is much greater in this case: the application has to execute at least 100 add instructions, far more than in the signed version. Another problem is the added branches, which usually hurt performance despite the CPU's branch predictor; predictions can still be wrong, forcing the CPU to flush work it has already done in the pipeline and execute the correct path of instructions. As the benchmark results show, using the wrong data type can degrade the overall performance of the application.
What happened? Why couldn't the compiler optimize the unsigned code?
This is a very good question. Let's first understand the difference between signed and unsigned numbers: unsigned integer overflow is well defined by the standard and simply wraps around modulo 2^N, while signed integer overflow is undefined behavior.
Now that we understand the difference, we can understand why the compiler couldn't apply the same optimization: since unsigned overflow is well-defined behavior, the compiler must preserve it exactly. With the i <= n loop condition, an unsigned counter can wrap around instead of ever exceeding n, so the compiler cannot prove how many times the loop runs and cannot replace it with the closed-form equation. With signed integers, overflow is undefined behavior, so the compiler may assume it never happens, conclude the loop runs exactly n times, and emit the optimized formula. The unsigned implementation is therefore left with the plain loop.
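A small snippet that demonstrates the difference (the variable names are mine):

```cpp
#include <cstdint>
#include <iostream>
#include <limits>

int main() {
    // Unsigned overflow is well defined: the value wraps around modulo 2^32,
    // and the compiler must preserve this behavior.
    std::uint32_t u = std::numeric_limits<std::uint32_t>::max();
    ++u;
    std::cout << u << '\n';   // prints 0

    // Signed overflow is undefined behavior, so the compiler may assume it
    // never happens and optimize accordingly.
    std::int32_t s = std::numeric_limits<std::int32_t>::max();
    // ++s;                   // undefined behavior: don't do this
    (void)s;
}
```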
Why Signed Integers Are Preferred:
If a value is out of range for an unsigned type, to store it we divide it by one greater than the maximum value of that type and keep only the remainder.
Let's look at a 16-bit unsigned number:
65538 is too large to fit in a 16-bit unsigned integer, because the 16-bit range is 0 to 65535. One greater than the largest number in this range is 65536. Therefore, we divide 65538 by 65536 and store the remainder, which is 2.
Another way to think about this calculation is that any larger number is simply wrapped around (modulo wrapping): if the number is too big to fit, it is reduced modulo 2^N and only the remainder is kept.
Wrapping doesn't happen only on overflow; it also happens when we try to store a negative number in an unsigned integer.
Let's look at a 16-bit unsigned integer:
Let's try to assign -1 to it: the value wraps around to the top of the range and 65535 is saved in the variable. The compiler detects an assignment of a signed integer to an unsigned variable; the signed value has to be converted to unsigned, and since -1 doesn't exist in the unsigned range it wraps around to the highest number in that range.
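Both cases in code (compilers typically warn about these conversions, but both are well defined):

```cpp
#include <cstdint>
#include <iostream>

int main() {
    std::uint16_t a = 65538;   // out of range: 65538 % 65536 == 2 is stored
    std::uint16_t b = -1;      // wraps to the top of the range
    std::cout << a << ' ' << b << '\n';   // prints "2 65535"
}
```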
Why should we avoid using unsigned integers when we don’t really need the non-negative semantics of the data type?
Another example:
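The original listing is not reproduced here; the following is a minimal sketch of the kind of bug described below, with hypothetical names:

```cpp
#include <cstddef>

// If count is 0, count - 1 wraps around to 18446744073709551615
// (SIZE_MAX on a 64-bit platform) and the loop runs that many times
// instead of not running at all.
void clear_backwards(char* buf, std::size_t count) {
    for (std::size_t i = count - 1; i > 0; --i) {
        buf[i] = 0;    // far out of bounds once the index has wrapped
    }
}
```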
In this example the prefix -- operator is used in the loop, and if i is not initialized properly the program nominally has to run 18446744073709551615 times, which is definitely unplanned.
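The decoding example described below, again as a hedged reconstruction (the function name and the destination buffer size are assumptions):

```cpp
#include <cstddef>

void decode(const char* buffer, int size) {
    char decoded[256];                          // destination buffer of predefined size

    // i is unsigned and size is signed, so size is converted to unsigned for
    // the comparison; size == -1 becomes a huge value (-Wsign-compare warns here).
    for (std::size_t i = 0; i < size; ++i) {
        decoded[i] = buffer[i] ^ 0xC;           // eventually writes far past 'decoded'
    }
}
```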
In this code snippet we can see an example of a very simplistic decoding function that receives a buffer of chars and the size of the buffer as a signed integer. The function prepares a buffer of predefined size for the decoded message. The for loop compares signed and unsigned integers, so the signed integer is converted to unsigned (the size variable is converted to an unsigned integer). The decoding process then just writes each char from the buffer xor'ed with 0xC. If -1 is passed as the size, this code can cause a buffer overflow or stack corruption, because the loop will run far longer than the buffer that was provided.
What can be done?
C++20 helps us with such comparisons by providing dedicated utilities in the <utility> header (a short usage sketch follows the list):
std::cmp_equal: ==
std::cmp_not_equal: !=
std::cmp_less: <
std::cmp_less_equal: <=
std::cmp_greater: >
std::cmp_greater_equal: >=
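Here is how they behave compared with the built-in operator (requires C++20):

```cpp
#include <iostream>
#include <utility>

int main() {
    int          s = -1;
    unsigned int u = 1;

    // The built-in comparison converts s to unsigned, so -1 becomes a huge value.
    std::cout << (s < u) << '\n';              // prints 0 (surprising)

    // std::cmp_less compares the mathematical values regardless of signedness.
    std::cout << std::cmp_less(s, u) << '\n';  // prints 1
}
```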
You can read more about these functions on cppreference.
Another common pitfall is the use of an auto variable that’s initialized using an integer literal.
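A minimal sketch of the problem (the container and loop are reconstructed, not the original listing):

```cpp
#include <array>
#include <iostream>

int main() {
    std::array<int, 5> arr{1, 2, 3, 4, 5};

    // The literal 0 has type int, so i is deduced as a signed int, while
    // arr.size() returns an unsigned std::size_t: every iteration performs a
    // signed/unsigned comparison (-Wsign-compare warns here).
    for (auto i = 0; i < arr.size(); ++i) {
        std::cout << arr[i] << ' ';
    }
}
```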
In this example auto is used for the loop index. In C++, an integer literal without a suffix is deduced as a signed int, so i's type will be int. However, arr.size() returns an unsigned integer (std::size_t), and the comparison of the signed i with the unsigned size() is error prone.
There are a few ways to fix this. For instance, you may use the correct data type to initialize the variable: std::size_t i = 0. If you have access to C++20 you may use std::ssize(arr) to get a signed size, and last but not least, C++23 adds a std::size_t literal suffix, so using auto i = 0uz will be possible.
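The three fixes side by side (a sketch; the commented-out variant needs a C++23 compiler):

```cpp
#include <array>
#include <cstddef>
#include <iostream>
#include <iterator>

int main() {
    std::array<int, 5> arr{1, 2, 3, 4, 5};

    // Option 1: spell out the matching unsigned type.
    for (std::size_t i = 0; i < arr.size(); ++i) std::cout << arr[i] << ' ';

    // Option 2 (C++20): compare against a signed size.
    for (std::ptrdiff_t i = 0; i < std::ssize(arr); ++i) std::cout << arr[i] << ' ';

    // Option 3 (C++23): the uz suffix makes the deduced type std::size_t.
    // for (auto i = 0uz; i < arr.size(); ++i) std::cout << arr[i] << ' ';
}
```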
Should we use Unsigned numbers?
The answer is yes! As shown in the first example, bit manipulations are faster and safer when applied to unsigned integers, and the compiler can find better optimizations for simple mathematical operations.
Conclusion: using the correct data type is really important for performance, compiler optimizations, and even application safety. As a best practice, we should always use signed integers unless an unsigned integer is actually needed, e.g. for bit manipulation. If we mix signed and unsigned integers, it is preferable to use the new standard's safe comparison functions. Last but not least, if auto is used with literals, the literals should always have the correct type.
Good one!
Very good writing Alex! Enjoyed reading it. I'm sooo on the fence here, for many years. On one hand, there was a (mistake? some say) in the std where all size() are unsigned. It is now fixed with 23(?) ssize() available in containers. I know for some time (btw https://subscription.packtpub.com/book/programming/9781800208117 also mentions it in the very first few chapters) that int is faster, because the compiler does not need to "deal" with wrap around. There is no wrap around in signed, so it "cannot happen" ;) I know many (Bjarne, Chandler etc...) advocate for signed int for the same reasons you indicated. But for me, besides a possible perf loss, there is no wrap around for int, it's UB (I think this might change in the future as all impls are going to be 2's complement??). Plus, I like to work with correct semantics. If something cannot be negative, why not express it in the type system? This is one reason we have a strong type system. Again - I understand the points that you and others bring to the table, these are good points, but there are other points that also need to be looked at, at least IMO.
Lecturer at the Academic College of Tel-Aviv-Yaffo, Co-Organizer of Core C++
Very nice article! It is interesting to note that if in the arithmetic series example we change the loop to: i=0; i<n; then clang is able to optimize the unsigned version as well! (And gcc doesn't optimize either of the two, even with -O3.) See: https://godbolt.org/z/hxxE44vn8