Representation of data types in C/C++. Part 2 — Floating point numbers. How they are stored and the nuances of working with them

Introduction

Continuing the series of articles about the representation of and operations with “natural” data types in C/C++, I decided to talk about floating point numbers and related types. Surveys have shown that quite a significant part of C/C++ developers don't fully understand how these data types are stored in memory and what the nuances of working with them are.

Today, I’ll try to pull back the curtain and explain the magic of storing floating point numbers, and answer the questions “How does the accuracy of a given number depend on its whole part?” and “Why is it not a good idea to check floating point numbers for equality?”.

Representation of floating point numbers

First of all, let's agree that the whole explanation will be based on the single-precision floating point (float) data type, which usually has a size of 4 bytes.

// sizeof(float) * 8 bits = 32 bits        
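If you want to make this assumption explicit, it can be checked at compile time; a minimal one-line sketch (the message text is just illustrative):

static_assert(sizeof(float) * 8 == 32, "this article assumes a 32-bit float");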

In this case the float type is represented in memory as a 32-bit container (Picture 1) that consists of:

  • S (Sign) — 1 bit
  • E (Exponent) — 8 bits
  • M (Mantissa) — 23 bits

Picture 1 - Float type representation

From the mathematical point of view, the link between the “float” type and its binary representation is expressed by the main float representation formula:

value = (-1)^S * 2^(E - 127) * (1 + M / 2^23)

where 127 is the exponent bias for single precision.

Here I propose to stop and consider an example that will help us later to go through all the mathematical formulas and better understand what's going on. Let's print the representation of the number 12.258f:

#include <iostream>
#include <bitset>

using namespace std;

void printRepresentationOf(const float & f)
{
  const unsigned int BitsInByte = 8u;

  // Print the bytes starting from the highest address, so that the most
  // significant byte comes first (this ordering is valid for little endian).
  for (unsigned int i = sizeof(f); i > 0; i--)
    cout << bitset<BitsInByte>(reinterpret_cast<const unsigned char*>(&f)[i - 1]);
  cout << endl;
}

int main (void)
{
  float f = 12.258f;
  printRepresentationOf(f);

  return 0;
}
        

Please pay attention that this code is applicable only for little endian machines.

The result after execution will be the following:

0 10000010 10001000010000011000101
_ ________ _______________________
^    ^                ^
S    E                M        
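
If you prefer to extract the three fields programmatically instead of reading them off the printed string, the same information can be obtained with a small sketch like the one below. The helper name decomposeFloat is purely illustrative and not part of any standard API; the sketch assumes a 32-bit IEEE 754 float:

#include <cstdint>
#include <cstring>
#include <iostream>

// Illustrative helper: splits a 32-bit float into its S, E and M fields.
void decomposeFloat(float f)
{
  std::uint32_t bits = 0u;
  std::memcpy(&bits, &f, sizeof(bits)); // copy the raw bit pattern

  const std::uint32_t s = (bits >> 31) & 0x1u;  // 1 sign bit
  const std::uint32_t e = (bits >> 23) & 0xFFu; // 8 exponent bits
  const std::uint32_t m = bits & 0x7FFFFFu;     // 23 mantissa bits

  std::cout << "S = " << s << ", E = " << e << ", M = " << m << std::endl;
}

int main (void)
{
  decomposeFloat(12.258f); // prints: S = 0, E = 130, M = 4464837
  return 0;
}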

And now let's split our mathematical representation into several parts and figure out how the value is calculated.

Sign

The Sign part is represented as a 1-bit field, so it can only be equal to 0 or 1. In the formula it appears as the multiplier (-1)^S. Basic math tells us that any number raised to the power of 0 equals 1, and any number raised to the power of 1 equals itself.

Considering this, the conclusion can be made that if S = 0 then this part of the formula equals 1, and if S = 1 then it equals -1.

Binary representation:

// 0b 0        

Exponent

The Exponent part is represented by an 8-bit field, and in the calculation it contributes the following multiplier:

2^(E - 127)

For a better understanding of what the Exponent means, imagine the ranges between neighboring powers of 2, for instance: [1...2), [2...4), [4...8), [8...16), [16...32) and so on.

To clarify what value our Exponent will be equal to, it is necessary to figure out which of these ranges the number falls into. In the case of our example, 12.258f falls into the range [8...16).

All right. Now let's take the power of 2 corresponding to the lowest value of this range (8 = 2^3) and calculate our Exponent:

E = 3 + 127 = 130

So, for our example with 12.258f this part of the formula becomes:

2^(E - 127) = 2^(130 - 127) = 2^3 = 8

Binary representation:

// 0b 1000 0010        
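
As a quick cross-check, the biased Exponent of a positive normalized float can be computed from its logarithm; a minimal sketch (this shortcut assumes a positive, normalized value and only illustrates the rule above):

#include <cmath>
#include <iostream>

int main (void)
{
  const float f = 12.258f;

  // The lower bound of the range [8...16) is 2^3, so the unbiased power is 3.
  const int power = static_cast<int>(std::floor(std::log2(f)));
  const int biasedExponent = power + 127; // 127 is the single-precision bias

  std::cout << biasedExponent << std::endl; // prints 130
  return 0;
}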

Mantissa

The Mantissa part determines the location of a number within the range defined by the Exponent. It is represented by 23 bits, so it can take 2^23 = 8388608 different values.

Basically, it is the number of steps by which the value is shifted from the start of the range defined by the Exponent.

And for our example with 12.258f and the [8...16) interval defined by the Exponent, the Mantissa can be calculated in the following way:

M = (12.258 / 8 - 1) * 2^23 ≈ 0.53225 * 8388608 ≈ 4464837

So, this is the place of our number inside the interval [8...16). And now the last step that should be performed in order to represent the Mantissa in binary form is to convert the obtained integer 4464837 to base 2.

And now let's calculate the value of the last multiplier in the main float representation formula:

1 + M / 2^23 = 1 + 4464837 / 8388608 ≈ 1.53225

Binary representation:

// 0b 100 0100 0010 0000 1100 0101        
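
The same Mantissa value can be reproduced in code from the formula above; a minimal sketch (std::lround mimics the round-to-nearest behavior closely enough for this example):

#include <cmath>
#include <iostream>

int main (void)
{
  const double value = 12.258;   // the number being encoded
  const double lowerBound = 8.0; // 2^(E - 127) for E = 130

  // Position inside the [8...16) range, scaled to the 23-bit Mantissa grid.
  const double fraction = value / lowerBound - 1.0;        // about 0.53225
  const long mantissa = std::lround(fraction * 8388608.0); // 2^23 = 8388608

  std::cout << mantissa << std::endl; // prints 4464837
  return 0;
}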

Final calculation

Taking into account all the calculations above, we can gather all the intermediate multipliers together:

(-1)^0 * 2^(130 - 127) * (1 + 4464837 / 8388608) = 1 * 8 * 1.53225 ≈ 12.258

And finally, for the example number 12.258f we have:

  • S = 0
  • E = 130
  • M = 4464837

We also have the binary representation, which fully matches the result received after executing our code example:

0b 0 10000010 10001000010000011000101
   _ ________ _______________________
   ^    ^                ^
   S    E                M        

Now we know how the float data type is stored in memory, how to represent it in binary form and how to calculate its value back.
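
To close the loop, the main float representation formula can be applied directly to the three fields in order to rebuild the original value. Below is a minimal sketch for normalized numbers only (special encodings such as zero, infinity and NaN are out of scope here); std::ldexp from <cmath> computes x * 2^n:

#include <cmath>
#include <cstdint>
#include <iostream>

// Rebuilds the value of a normalized float from its S, E and M fields.
float decodeFloat(std::uint32_t s, std::uint32_t e, std::uint32_t m)
{
  const float sign = (s == 0u) ? 1.0f : -1.0f;                      // (-1)^S
  const float fraction = 1.0f + static_cast<float>(m) / 8388608.0f; // 1 + M / 2^23
  return sign * std::ldexp(fraction, static_cast<int>(e) - 127);    // * 2^(E - 127)
}

int main (void)
{
  std::cout << decodeFloat(0u, 130u, 4464837u) << std::endl; // prints 12.258
  return 0;
}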

Accuracy

Do you remember the questions that were asked in the introduction to this article? I mean the questions related to the accuracy of floating point numbers.

In order to figure out how the accuracy depends on the whole part of the number, let's consider an example with two different numbers:

  • 3.88f — falls into [2...4)
  • 258.74f — falls into [256...512)

Basically, we have the same number of Mantissa bits for both numbers, but the length of the range for the first number is 2, while for the second one it is 256.

We know that the accuracy is given by the length of the range divided by the number of possible Mantissa values:

accuracy = range length / 2^23

So, let's calculate the possible accuracy for both of these numbers:

  • 3.88f: 2 / 2^23 ≈ 0.00000024
  • 258.74f: 256 / 2^23 ≈ 0.0000305

From the given example the conclusion can be made that the greater the whole part of the number is, the lower the accuracy becomes.
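
This difference is easy to observe in code by measuring the distance from a value to the next representable float above it; a small sketch using std::nextafter from <cmath>:

#include <cmath>
#include <iostream>

int main (void)
{
  const float a = 3.88f;   // falls into [2...4)
  const float b = 258.74f; // falls into [256...512)

  // Distance between the value and the next representable float above it.
  std::cout << std::nextafter(a, 1000.0f) - a << std::endl; // about 2.4e-07
  std::cout << std::nextafter(b, 1000.0f) - b << std::endl; // about 3.05e-05
  return 0;
}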

Comparing floating point numbers

Due to the varying accuracy of different numbers and due to the accuracy lost during calculations, results that are expected to be equal may differ by a very small delta. For instance:

float x = 1.0f;
float y = (0.3f * 3) + 0.1f;

if (x == y)
{
  std::cout << "The numbers are equal" << std::endl;
}
else
{
  std::cout << "The numbers are not equal" << std::endl;
}        

Here there is no guarantee that the output will be “The numbers are equal”.

The calculated values may look like:

// x = 1.0
// y = 0.999999999999        

In?this case the result will be?unexpected.

There is a common practice that can help to solve this issue. We can choose some delta value and compare the absolute value of the difference between “x” and “y” with this delta:

const float delta = 0.0001f; // the chosen accuracy (the value is illustrative)

if (std::fabs(x - y) < delta) // std::fabs requires <cmath>
{
  std::cout << "The numbers are equal" << std::endl;
}
else
{
  std::cout << "The numbers are not equal" << std::endl;
}        

Using this approach guarantees that the comparison result will be correct within the chosen accuracy.
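
Note that a single fixed delta does not fit all magnitudes: a delta that is reasonable for numbers around 1.0 may be far too small for numbers around 10^6 and far too large for numbers around 10^-6. A common refinement combines an absolute and a relative tolerance; the sketch below assumes neither value is NaN or infinity, and the helper name almostEqual as well as the tolerance values are only illustrative:

#include <algorithm>
#include <cmath>
#include <iostream>

// Illustrative comparison: true when the values differ by less than an absolute
// tolerance, or by less than a relative tolerance scaled to their magnitude.
bool almostEqual(float x, float y,
                 float absTolerance = 1e-6f,
                 float relTolerance = 1e-5f)
{
  const float diff = std::fabs(x - y);
  if (diff < absTolerance)
    return true; // handles values close to zero

  const float magnitude = std::max(std::fabs(x), std::fabs(y));
  return diff < magnitude * relTolerance;
}

int main (void)
{
  const float x = 1.0f;
  const float y = (0.3f * 3) + 0.1f;

  std::cout << (almostEqual(x, y) ? "The numbers are equal"
                                  : "The numbers are not equal") << std::endl;
  return 0;
}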

Conclusion

The main purpose of this article was to give an understanding of how floating point data types are represented in memory. I didn't find this approach to the explanation anywhere on the Internet, so I decided to try to explain everything from my own point of view and based on my own experience. I hope it was clear and helpful for you.


