## [Math] IEEE 754 round-off errors

(The following text has been copy-pasted into the Math Recipes book.)

### The example from Wikipedia

This fragment from the Wikipedia article caught my attention:

An example of the error caused by floating-point round-off is illustrated by the following C code.

#include <stdio.h>

int main()
{
    double a;
    int i;

    a = 0.2;
    a += 0.1;
    a -= 0.3;

    for (i = 0; a < 1.0; i++)
        a += a;

    printf("i=%d, a=%f\n", i, a);

    return 0;
}

It appears that the program should never terminate: after the three assignments, a should be exactly zero, and doubling zero forever never reaches 1.0. Yet the output is:

i=54, a=1.000000


How is that possible?

First of all, I wrote Wolfram Mathematica code to unpack IEEE 754 single/double values, but with absolute precision (bignum-style rationals), not bounded by 32 or 64 bits.

DoubleExpLen = 11;

DoubleMantissaLen = 52;

DoubleBits = DoubleExpLen + DoubleMantissaLen + 1;

DoubleMaxExponent = BitShiftLeft[1, DoubleExpLen - 1] - 1;

DoubleExpMask = BitShiftLeft[1, DoubleExpLen] - 1;

DoubleMantissaMask = BitShiftLeft[1, DoubleMantissaLen] - 1;

convertDoubleSubnormal[sign_, mantissa_] :=
 (-1)^sign*(mantissa/
    BitShiftLeft[1, (DoubleBits - (DoubleExpLen + 1)) + DoubleMaxExponent - 1])

convertDoubleNaN[sign_, exp_, mantissa_] :=
 If[mantissa == 0, If[sign == 0, "Inf", "-Inf"], "NaN"]

convertDoubleNormal[sign_, exp_, mantissa_] :=
 (-1)^sign*(BitShiftLeft[1, DoubleMantissaLen] + mantissa)/
  2^(DoubleMantissaLen - (exp - DoubleMaxExponent))

convertDouble[sign_, exp_, mantissa_] :=
 If[exp == DoubleExpMask, convertDoubleNaN[sign, exp, mantissa],
  If[exp == 0, convertDoubleSubnormal[sign, mantissa],
   convertDoubleNormal[sign, exp, mantissa]]]

convertDoubleQWORD[dw_] :=
 convertDouble[BitShiftRight[dw, DoubleBits - 1],
  BitAnd[BitShiftRight[dw, DoubleMantissaLen], DoubleExpMask],
  BitAnd[dw, DoubleMantissaMask]]

The closest value to 0.1 is...

In[]:= N[convertDoubleQWORD[16^^3fb999999999999a], 100]
Out[]= 0.1000000000000000055511151231257827021181583404541015625


What are the previous and the next representable numbers? Here I decrease/increase the mantissa by one:

In[]:= N[convertDoubleQWORD[16^^3fb999999999999a - 1], 100]
Out[]= 0.09999999999999999167332731531132594682276248931884765625

In[]:= N[convertDoubleQWORD[16^^3fb999999999999a + 1], 100]
Out[]= 0.10000000000000001942890293094023945741355419158935546875


You see, the value is not exact, but it is the best approximation possible in the double-precision IEEE 754 format.

You can't store 0.1 as an exact value, for the same reason you can't represent 1/3 in any binary format. Well, maybe an unbalanced ternary computer could, but it would have problems representing the value 1/2.

Now I souped up the code from Wikipedia:

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <math.h>
#include <errno.h>
#include <fenv.h>

const int DoubleExpLen = 11;
const int DoubleMantissaLen = 52;
const int DoubleBits = DoubleExpLen + DoubleMantissaLen + 1;

void print_double_in_binary (double a)
{
    uint64_t t;
    memcpy (&t, &a, sizeof(double));

    // sign
    printf ("%d ", (int)((t>>(DoubleBits-1))&1));

    // exponent
    for (int i=(DoubleBits-2); i!=(DoubleBits-2)-DoubleExpLen; i--)
        printf ("%d", (int)((t>>i)&1));
    printf (" ");

    // mantissa
    for (int i=(DoubleBits-2)-DoubleExpLen; i>=0; i--)
        printf ("%d", (int)((t>>i)&1));
    printf (" 0x%016llx", (unsigned long long)t);
}

// https://en.cppreference.com/w/c/numeric/fenv/feround
void show_fe_current_rounding_direction(void)
{
    printf("current rounding direction:  ");
    switch (fegetround()) {
    case FE_TONEAREST:  printf ("FE_TONEAREST");  break;
    case FE_DOWNWARD:   printf ("FE_DOWNWARD");   break;
    case FE_UPWARD:     printf ("FE_UPWARD");     break;
    case FE_TOWARDZERO: printf ("FE_TOWARDZERO"); break;
    default:            printf ("unknown");
    }
    printf("\n");
}

int main()
{
    double a;
    int i;

    printf ("0.1: "); print_double_in_binary (0.1); printf ("\n");
    printf ("0.2: "); print_double_in_binary (0.2); printf ("\n");
    printf ("0.3: "); print_double_in_binary (0.3); printf ("\n");

    a = 0.2;
    printf ("a = 0.2;\n");
    printf ("%f %e ", a, a); print_double_in_binary(a); printf ("\n");
    show_fe_current_rounding_direction();
#if 0
    fesetround(FE_DOWNWARD);        // works OK
    fesetround(FE_UPWARD);          // not zero
    fesetround(FE_TONEAREST);       // not zero
    fesetround(FE_TOWARDZERO);      // works OK
#endif
    printf ("a += 0.1;\n");
    feclearexcept(FE_ALL_EXCEPT);
    a += 0.1;
    if (fetestexcept(FE_INEXACT))
        printf ("FE_INEXACT\n");
    printf ("%f %e ", a, a); print_double_in_binary(a); printf ("\n");

    printf ("a -= 0.3;\n");
    feclearexcept(FE_ALL_EXCEPT);
    a -= 0.3;
    if (fetestexcept(FE_INEXACT))
        printf ("FE_INEXACT\n");
    printf ("%f %e ", a, a); print_double_in_binary(a); printf ("\n");

    if (a==0.0)
        printf ("a==0\n");
    else
        printf ("a!=0\n");

    for (i = 0; a < 1.0; i++)
    {
        printf ("i=%d ", i);
        printf ("%f %e ", a, a); print_double_in_binary(a); printf ("\n");
        a += a;
    }

    printf("stop. i=%d, a=%f %e\n", i, a, a);

    return 0;
}

0.1: 0 01111111011 1001100110011001100110011001100110011001100110011010 0x3fb999999999999a
0.2: 0 01111111100 1001100110011001100110011001100110011001100110011010 0x3fc999999999999a
0.3: 0 01111111101 0011001100110011001100110011001100110011001100110011 0x3fd3333333333333
a = 0.2;
0.200000 2.000000e-01 0 01111111100 1001100110011001100110011001100110011001100110011010 0x3fc999999999999a
current rounding direction:  FE_TONEAREST
a += 0.1;
FE_INEXACT
0.300000 3.000000e-01 0 01111111101 0011001100110011001100110011001100110011001100110100 0x3fd3333333333334
a -= 0.3;
0.000000 5.551115e-17 0 01111001001 0000000000000000000000000000000000000000000000000000 0x3c90000000000000
a!=0
i=0 0.000000 5.551115e-17 0 01111001001 0000000000000000000000000000000000000000000000000000 0x3c90000000000000
i=1 0.000000 1.110223e-16 0 01111001010 0000000000000000000000000000000000000000000000000000 0x3ca0000000000000
i=2 0.000000 2.220446e-16 0 01111001011 0000000000000000000000000000000000000000000000000000 0x3cb0000000000000
...
i=52 0.250000 2.500000e-01 0 01111111101 0000000000000000000000000000000000000000000000000000 0x3fd0000000000000
i=53 0.500000 5.000000e-01 0 01111111110 0000000000000000000000000000000000000000000000000000 0x3fe0000000000000
stop. i=54, a=1.000000 1.000000e+00


After the summation of 0.2 and 0.1, the FPU reports the FE_INEXACT exception in its status word. It warns: the result can't "fit" into an IEEE 754 double exactly/precisely. The sum, as a 64-bit value, is 0x3fd3333333333334. But at the beginning of our program we dumped 0.3, and it is 0x3fd3333333333333. The difference is the lowest bit of the mantissa.

Let's see what Mathematica can say about the exact sum:

In[]:= a = N[convertDoubleQWORD[16^^3fb999999999999a], 100]
Out[]= 0.1000000000000000055511151231257827021181583404541015625

In[]:= b = N[convertDoubleQWORD[16^^3fc999999999999a], 100]
Out[]= 0.200000000000000011102230246251565404236316680908203125

In[]:= a + b
Out[]= 0.3000000000000000166533453693773481063544750213623046875


That is the result the FPU tries to shove into a double-precision IEEE 754 register, but it can't fit.

... while the best approximation of 0.3 is:

In[]:= N[convertDoubleQWORD[16^^3fd3333333333333], 100]
Out[]= 0.299999999999999988897769753748434595763683319091796875


And the result produced by our code (lowest bit of mantissa incremented):

In[]:= N[convertDoubleQWORD[16^^3fd3333333333334], 100]
Out[]= 0.3000000000000000444089209850062616169452667236328125


Since the default rounding mode is FE_TONEAREST, the FPU rounded the result to the nearest representable value. In fact, the exact sum lies precisely halfway between its two representable neighbors, so the ties-to-even rule decides: it picks 0x3fd3333333333334, whose mantissa ends in a zero (even) bit.

Now, what is that difference of one lowest mantissa bit?

In[]:= N[convertDoubleQWORD[16^^3fd3333333333334], 100] -
N[convertDoubleQWORD[16^^3fd3333333333333], 100]

Out[]= 5.5511151231257827021181583404541015625*10^-17


This is the 'noise' left in the register after the subtraction; it isn't equal to zero:

0.000000 5.551115e-17 0 01111001001 0000000000000000000000000000000000000000000000000000 0x3c90000000000000
a!=0
i=0 0.000000 5.551115e-17 0 01111001001 0000000000000000000000000000000000000000000000000000 0x3c90000000000000
i=1 0.000000 1.110223e-16 0 01111001010 0000000000000000000000000000000000000000000000000000 0x3ca0000000000000
i=2 0.000000 2.220446e-16 0 01111001011 0000000000000000000000000000000000000000000000000000 0x3cb0000000000000
i=3 0.000000 4.440892e-16 0 01111001100 0000000000000000000000000000000000000000000000000000 0x3cc0000000000000
...


When that 'noise' is summed with itself, it grows:

...
i=49 0.031250 3.125000e-02 0 01111111010 0000000000000000000000000000000000000000000000000000 0x3fa0000000000000
i=50 0.062500 6.250000e-02 0 01111111011 0000000000000000000000000000000000000000000000000000 0x3fb0000000000000
i=51 0.125000 1.250000e-01 0 01111111100 0000000000000000000000000000000000000000000000000000 0x3fc0000000000000
i=52 0.250000 2.500000e-01 0 01111111101 0000000000000000000000000000000000000000000000000000 0x3fd0000000000000
i=53 0.500000 5.000000e-01 0 01111111110 0000000000000000000000000000000000000000000000000000 0x3fe0000000000000
stop. i=54, a=1.000000 1.000000e+00


However, you can switch rounding mode to FE_TOWARDZERO:

a = 0.2;
0.200000 2.000000e-01 0 01111111100 1001100110011001100110011001100110011001100110011010 0x3fc999999999999a
current rounding direction:  FE_TONEAREST
a += 0.1;
FE_INEXACT
0.299999 2.999999e-01 0 01111111101 0011001100110011001100110011001100110011001100110011 0x3fd3333333333333
a -= 0.3;
0.000000 0.000000e+00 0 00000000000 0000000000000000000000000000000000000000000000000000 0x0000000000000000
a==0
i=0 0.000000 0.000000e+00 0 00000000000 0000000000000000000000000000000000000000000000000000 0x0000000000000000
i=1 0.000000 0.000000e+00 0 00000000000 0000000000000000000000000000000000000000000000000000 0x0000000000000000
i=2 0.000000 0.000000e+00 0 00000000000 0000000000000000000000000000000000000000000000000000 0x0000000000000000
i=3 0.000000 0.000000e+00 0 00000000000 0000000000000000000000000000000000000000000000000000 0x0000000000000000
...


The sum is still FE_INEXACT, but now the code works, because the rounding goes in the other direction: the sum is ...33333 (not ...33334), i.e. exactly the stored value of 0.3, so the subtraction leaves zero. (FE_DOWNWARD would work as well.)

But in real-world code it's problematic to set the rounding mode for each operation...

### Conversion from double-precision to single-precision and back

Now this is even more arcane. We add 0.1 to zero and then subtract it, but the result is not zero:

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <math.h>
#include <errno.h>
#include <fenv.h>

const int FloatExpLen = 8;
const int FloatMantissaLen = 23;
const int FloatBits = FloatExpLen + FloatMantissaLen + 1;

void print_float_in_binary (float a)
{
    uint32_t t;
    memcpy (&t, &a, sizeof(float));

    // sign
    printf ("%d ", (int)((t>>(FloatBits-1))&1));

    // exponent
    for (int i=(FloatBits-2); i!=(FloatBits-2)-FloatExpLen; i--)
        printf ("%d", (int)((t>>i)&1));
    printf (" ");

    // mantissa
    for (int i=(FloatBits-2)-FloatExpLen; i>=0; i--)
        printf ("%d", (int)((t>>i)&1));
    printf (" 0x%08x", (unsigned)t);
}

int main()
{
    float a;
    //double a; // this will fix it
    int i;

    printf ("0.1: "); print_float_in_binary(0.1); printf ("\n");

    a = 0.0;
    printf ("a = 0.0;\n");
    printf ("%f %e ", a, a); print_float_in_binary(a); printf ("\n");

    feclearexcept(FE_ALL_EXCEPT);
    a += 0.1;
    if (fetestexcept(FE_INEXACT))
        printf ("FE_INEXACT\n");
    printf ("a += 0.1;\n");
    printf ("%f %e ", a, a); print_float_in_binary(a); printf ("\n");

    feclearexcept(FE_ALL_EXCEPT);
    a -= 0.1;
    if (fetestexcept(FE_INEXACT))
        printf ("FE_INEXACT\n");
    printf ("a -= 0.1;\n");
    printf ("%f %e ", a, a); print_float_in_binary(a); printf ("\n");

    if (a==0.0)
        printf ("a==0\n");
    else
        printf ("a!=0\n");

    for (i = 0; a < 1.0; i++)
    {
        printf ("i=%d ", i);
        printf ("%f %e ", a, a); print_float_in_binary(a); printf ("\n");
        a += a;
    }

    printf("i=%d, a=%f %e\n", i, a, a);

    return 0;
}

0.1: 0 01111011 10011001100110011001101 0x3dcccccd
a = 0.0;
0.000000 0.000000e+00 0 00000000 00000000000000000000000 0x00000000
FE_INEXACT
a += 0.1;
0.100000 1.000000e-01 0 01111011 10011001100110011001101 0x3dcccccd
FE_INEXACT
a -= 0.1;
0.000000 1.490116e-09 0 01100001 10011001100110011001101 0x30cccccd
a!=0
i=0 0.000000 1.490116e-09 0 01100001 10011001100110011001101 0x30cccccd
i=1 0.000000 2.980232e-09 0 01100010 10011001100110011001101 0x314ccccd
i=2 0.000000 5.960465e-09 0 01100011 10011001100110011001101 0x31cccccd
i=3 0.000000 1.192093e-08 0 01100100 10011001100110011001101 0x324ccccd
...
i=25 0.050000 5.000000e-02 0 01111010 10011001100110011001101 0x3d4ccccd
i=26 0.100000 1.000000e-01 0 01111011 10011001100110011001101 0x3dcccccd
i=27 0.200000 2.000000e-01 0 01111100 10011001100110011001101 0x3e4ccccd
i=28 0.400000 4.000000e-01 0 01111101 10011001100110011001101 0x3ecccccd
i=29 0.800000 8.000000e-01 0 01111110 10011001100110011001101 0x3f4ccccd
i=30, a=1.600000 1.600000e+00


How is it possible that 'noise' appears if we just add a constant value and then subtract it?

Luckily, I have some assembly knowledge, so I can dig into the x86-64 code:

...
cvtss2sd        xmm1, DWORD PTR -4[rbp]
cvtss2sd        xmm0, DWORD PTR -4[rbp]
lea     rdi, .LC5[rip]
mov     eax, 2
call    printf@PLT
mov     eax, DWORD PTR -4[rbp]
movd    xmm0, eax
call    print_float_in_binary
mov     edi, 10
call    putchar@PLT
mov     edi, 61
call    feclearexcept@PLT
cvtss2sd        xmm1, DWORD PTR -4[rbp]
movsd   xmm0, QWORD PTR .LC6[rip]
cvtsd2ss        xmm0, xmm0
movss   DWORD PTR -4[rbp], xmm0
mov     edi, 32
call    fetestexcept@PLT
test    eax, eax
je      .L8
lea     rdi, .LC7[rip]
call    puts@PLT
.L8:
lea     rdi, .LC8[rip]
call    puts@PLT
cvtss2sd        xmm1, DWORD PTR -4[rbp]
cvtss2sd        xmm0, DWORD PTR -4[rbp]
lea     rdi, .LC5[rip]
mov     eax, 2
call    printf@PLT
mov     eax, DWORD PTR -4[rbp]
movd    xmm0, eax
call    print_float_in_binary
mov     edi, 10
call    putchar@PLT
mov     edi, 61
call    feclearexcept@PLT
cvtss2sd        xmm0, DWORD PTR -4[rbp]
movsd   xmm1, QWORD PTR .LC6[rip]
subsd   xmm0, xmm1
cvtsd2ss        xmm0, xmm0
movss   DWORD PTR -4[rbp], xmm0
...
.LC6:
.long   2576980378
.long   1069128089
...


But even without that knowledge, we can say something about this code. It uses the 'cvtss2sd' and 'cvtsd2ss' instructions, which convert a single-precision value to double precision and back.

These are: "Convert Scalar Single-Precision Floating-Point Value to Scalar Double-Precision Floating-Point Value" and "Convert Scalar Double-Precision Floating-Point Value to Scalar Single-Precision Floating-Point Value".

The 'addsd' and 'subsd' instructions are: "Add Scalar Double-Precision Floating-Point Values" and "Subtract Scalar Double-Precision Floating-Point Value".

(The whole operation happens in SIMD registers.) Also, the 0.1 constant is encoded as a 64-bit double-precision value (.LC6).

In other words, everything here is calculated in double precision, but stored as single-precision values.

The first FE_INEXACT exception is raised when the double-precision zero is summed with the double-precision 0.1 value and the result is shoved into a single-precision variable.

In[]:= N[convertFloatDWORD[16^^3dcccccd], 100]
Out[]= 0.100000001490116119384765625


This is the best possible approximation of 0.1 in single precision. Now convert it to double precision: you'll get 0.100000001490116119384765625 exactly, since the widening conversion is lossless. But this is not the best possible approximation of 0.1 in double precision! The best possible approximation is:

In[]:= N[convertDoubleQWORD[16^^3fb999999999999a], 100]
Out[]= 0.1000000000000000055511151231257827021181583404541015625


You subtract $\approx 0.1000000000000000055511151...$ (the best possible double approximation) from $\approx 0.10000000149011611...$ and get $\approx 0.00000000149011611...$ Of course, the result gets normalized: the mantissa is shifted and the exponent adjusted, leaving $\approx 1.490116 \cdot 10^{-9}$. After conversion back to single precision, this remains as the 'noise':

...
0.000000 1.490116e-09 0 01100001 10011001100110011001101 0x30cccccd
a!=0
...


Again, when the 'noise' is summed with itself, it grows.

What if it's compiled for the 80x87 FPU, as was done before SIMD?

% gcc -mfpmath=387 2.c -S -masm=intel -lm

...

fld     DWORD PTR -8[rbp]
fld     QWORD PTR .LC8[rip]
faddp   st(1), st
fstp    DWORD PTR -8[rbp]

...

fld     DWORD PTR -8[rbp]
fld     QWORD PTR .LC8[rip]
fsubp   st(1), st
fstp    DWORD PTR -8[rbp]
...

.LC8:
.long   2576980378
.long   1069128089


You see, one operand of the faddp/fsubp (add/sub) instruction is loaded as a 32-bit value (single precision), but the other (the constant) as a 64-bit value (double precision).

The solution: switching 'float' to 'double' in this code eliminates the problem.

### The moral of the story

IEEE 754 is hard. Mathematically flawless expressions and algorithms can go crazy out of the blue.

A well-known introduction is "What Every Computer Scientist Should Know About Floating-Point Arithmetic" by David Goldberg.

The Numerical Recipes book is also recommended, as well as D.Knuth's TAOCP.

The field is called numerical analysis.

### All the files

(Including the Wolfram Mathematica notebook I used.)

UPD: at reddit, HN.
