[Math] IEEE 754 round-off errors

(The following text has been copypasted to the Math Recipes book.)

The example from Wikipedia

This fragment from the Wikipedia article caught my attention:

An example of the error caused by floating point roundoff is illustrated in the following C code.

int main(){
   double a;
   int i;

   a = 0.2; 
   a += 0.1; 
   a -= 0.3;

   for (i = 0; a < 1.0; i++) 
       a += a;

   printf("i=%d, a=%f\n", i, a);

   return 0;
}

It appears that the program should not terminate. Yet the output is :

i=54, a=1.000000

How is that possible?

First of all, I wrote a Wolfram Mathematica code to unpack IEEE 754 single/double values, but with absolute precision (like bignum), not bounded by 32 or 64 bits.

DoubleExpLen = 11;

DoubleMantissaLen = 52;

DoubleBits = DoubleExpLen + DoubleMantissaLen + 1;

DoubleMaxExponent = BitShiftLeft[1, DoubleExpLen - 1] - 1;

DoubleExpMask = BitShiftLeft[1, DoubleExpLen] - 1;

DoubleMantissaMask = BitShiftLeft[1, DoubleMantissaLen] - 1;

convertDoubleSubnormal[sign_, mantissa_] := (-1)^
   sign*(mantissa/(BitShiftLeft[
      1, ((DoubleBits - (DoubleExpLen + 1)) + DoubleMaxExponent - 1)]))

convertDoubleNaN[sign_, exp_, mantissa_] := 
 If[mantissa == 0 , If[sign == 0, "Inf", "-Inf"], "NaN"]

convertDoubleNormal[sign_, exp_, mantissa_] := (-1)^
   sign*(BitShiftLeft[1, DoubleMantissaLen] + 
     mantissa)/(2^(DoubleMantissaLen - (exp - DoubleMaxExponent)))

convertDouble[sign_, exp_, mantissa_] := 
 If[exp == DoubleExpMask, convertDoubleNaN[sign, exp, mantissa], 
  If[exp == 0, convertDoubleSubnormal[sign, mantissa], 
   convertDoubleNormal[sign, exp, mantissa]]]

convertDoubleQWORD[dw_] := 
 convertDouble[BitShiftRight[dw, DoubleBits - 1], 
  BitAnd[BitShiftRight[dw, DoubleMantissaLen], DoubleExpMask], 
  BitAnd[dw, DoubleMantissaMask]]

The closest value to 0.1 is...

In[]:= N[convertDoubleQWORD[16^^3fb999999999999a], 100]
Out[]= 0.1000000000000000055511151231257827021181583404541015625

What are the previous and the next numbers? I decrease/increase mantissa by one bit here:

In[]:= N[convertDoubleQWORD[16^^3fb999999999999a - 1], 100]
Out[]= 0.09999999999999999167332731531132594682276248931884765625

In[]:= N[convertDoubleQWORD[16^^3fb999999999999a + 1], 100]
Out[]= 0.10000000000000001942890293094023945741355419158935546875

You see, the first is not exact, but the best approximation possible, in the double-precision IEEE 754 format.

You can't store 0.1 as exact value by the same reason you can't represent 1/3 value in any binary format. Well, maybe unbalanced ternary computer could, but it will have problems in representing 1/2 value.

Now I souped up the code from Wikipedia:

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <math.h>
#include <errno.h>
#include <fenv.h>

const int DoubleExpLen=11;
const int DoubleMantissaLen=52;
const int DoubleBits = DoubleExpLen + DoubleMantissaLen + 1;

void print_double_in_binary (double a)
{
        uint64_t t;
        memcpy (&t, &a, sizeof(double));

        // sign
        printf ("%ld ", (t>>(DoubleBits-1))&1); // sign

        // exponent
        for (int i=(DoubleBits-2); i!=(DoubleBits-2)-DoubleExpLen; i--)
                printf ("%ld", (t>>i)&1);
        printf (" ");

        // mantissa
        for (int i=(DoubleBits-2)-DoubleExpLen; i>=0; i--)
                printf ("%ld", (t>>i)&1);
        printf (" 0x%016lx", t);
};
// https://en.cppreference.com/w/c/numeric/fenv/feround
void show_fe_current_rounding_direction(void)
{
        printf("current rounding direction:  ");
        switch (fegetround()) {
                case FE_TONEAREST:  printf ("FE_TONEAREST");  break;
                case FE_DOWNWARD:   printf ("FE_DOWNWARD");   break;
                case FE_UPWARD:     printf ("FE_UPWARD");     break;
                case FE_TOWARDZERO: printf ("FE_TOWARDZERO"); break;
                default:            printf ("unknown");
        };
        printf("\n");
}
int main(){
        double a;
        int i;

        printf ("0.1: "); print_double_in_binary (0.1); printf ("\n");
        printf ("0.2: "); print_double_in_binary (0.2); printf ("\n");
        printf ("0.3: "); print_double_in_binary (0.3); printf ("\n");
        a = 0.2;
        printf ("a = 0.2;\n");
        printf ("%f %e ", a, a); print_double_in_binary(a); printf ("\n");
        show_fe_current_rounding_direction();
#if 0
        fesetround(FE_DOWNWARD);        // works OK
        fesetround(FE_UPWARD);          // not zero
        fesetround(FE_TONEAREST);       // not zero
        fesetround(FE_TOWARDZERO);      // works OK
#endif
        printf ("a += 0.1;\n");
        feclearexcept(FE_ALL_EXCEPT);
        a += 0.1;
        if(fetestexcept(FE_INEXACT))
                printf ("FE_INEXACT\n");
        printf ("%f %e ", a, a); print_double_in_binary(a); printf ("\n");

        printf ("a -= 0.3;\n");
        feclearexcept(FE_ALL_EXCEPT);
        a -= 0.3;
        if(fetestexcept(FE_INEXACT))
                printf ("FE_INEXACT\n");
        printf ("%f %e ", a, a); print_double_in_binary(a); printf ("\n");

        if (a==0.0)
                printf ("a==0\n");
        else
                printf ("a!=0\n");

        for (i = 0; a < 1.0; i++)
        {
                printf ("i=%d ", i);
                printf ("%f %e ", a, a); print_double_in_binary(a); printf ("\n");
                a += a;
        };

        printf("stop. i=%d, a=%f %e\n", i, a, a);

        return 0;
}
0.1: 0 01111111011 1001100110011001100110011001100110011001100110011010 0x3fb999999999999a
0.2: 0 01111111100 1001100110011001100110011001100110011001100110011010 0x3fc999999999999a
0.3: 0 01111111101 0011001100110011001100110011001100110011001100110011 0x3fd3333333333333
a = 0.2;
0.200000 2.000000e-01 0 01111111100 1001100110011001100110011001100110011001100110011010 0x3fc999999999999a
current rounding direction:  FE_TONEAREST
a += 0.1;
FE_INEXACT
0.300000 3.000000e-01 0 01111111101 0011001100110011001100110011001100110011001100110100 0x3fd3333333333334
a -= 0.3;
0.000000 5.551115e-17 0 01111001001 0000000000000000000000000000000000000000000000000000 0x3c90000000000000
a!=0
i=0 0.000000 5.551115e-17 0 01111001001 0000000000000000000000000000000000000000000000000000 0x3c90000000000000
i=1 0.000000 1.110223e-16 0 01111001010 0000000000000000000000000000000000000000000000000000 0x3ca0000000000000
i=2 0.000000 2.220446e-16 0 01111001011 0000000000000000000000000000000000000000000000000000 0x3cb0000000000000
...
i=52 0.250000 2.500000e-01 0 01111111101 0000000000000000000000000000000000000000000000000000 0x3fd0000000000000
i=53 0.500000 5.000000e-01 0 01111111110 0000000000000000000000000000000000000000000000000000 0x3fe0000000000000
stop. i=54, a=1.000000 1.000000e+00

After summation of 0.2 and 0.1, FPU reports FE_INEXACT exception in FPU status word. It warns: the result can't "fit" into double IEEE 754 exactly/precisely. The sum in form of 64-bit value is 0x3fd3333333333334. But at the beginning of our program, we dumped 0.3, and it is 0x3fd3333333333333. The difference is one lowest bit of mantissa.

Let's see what Mathematica can say about the exact sum:

In[]:= a = N[convertDoubleQWORD[16^^3fb999999999999a], 100]
Out[]= 0.1000000000000000055511151231257827021181583404541015625

In[]:= b = N[convertDoubleQWORD[16^^3fc999999999999a], 100]
Out[]= 0.200000000000000011102230246251565404236316680908203125

In[]:= a + b
Out[]= 0.3000000000000000166533453693773481063544750213623046875

That is the result FPU tries to shove into double-precision IEEE 754 register, but can't.

... while the best approximation of 0.3 is:

In[]:= N[convertDoubleQWORD[16^^3fd3333333333333], 100]
Out[]= 0.299999999999999988897769753748434595763683319091796875

And the result produced by our code (lowest bit of mantissa incremented):

In[]:= N[convertDoubleQWORD[16^^3fd3333333333334], 100]
Out[]= 0.3000000000000000444089209850062616169452667236328125

Since the rounding mode set by default is FE_TONEAREST, the FPU rounded it to the nearest value.

Now what is that difference by one lowest bit in mantissa?

In[]:= N[convertDoubleQWORD[16^^3fd3333333333334], 100] - 
 N[convertDoubleQWORD[16^^3fd3333333333333], 100]

Out[]= 5.5511151231257827021181583404541015625*10^-17

This is the 'noise' left in register after subtraction, that isn't equal to zero:

0.000000 5.551115e-17 0 01111001001 0000000000000000000000000000000000000000000000000000 0x3c90000000000000
a!=0
i=0 0.000000 5.551115e-17 0 01111001001 0000000000000000000000000000000000000000000000000000 0x3c90000000000000
i=1 0.000000 1.110223e-16 0 01111001010 0000000000000000000000000000000000000000000000000000 0x3ca0000000000000
i=2 0.000000 2.220446e-16 0 01111001011 0000000000000000000000000000000000000000000000000000 0x3cb0000000000000
i=3 0.000000 4.440892e-16 0 01111001100 0000000000000000000000000000000000000000000000000000 0x3cc0000000000000
...

When that 'noise' gets summed with itself, it grows:

...
i=49 0.031250 3.125000e-02 0 01111111010 0000000000000000000000000000000000000000000000000000 0x3fa0000000000000
i=50 0.062500 6.250000e-02 0 01111111011 0000000000000000000000000000000000000000000000000000 0x3fb0000000000000
i=51 0.125000 1.250000e-01 0 01111111100 0000000000000000000000000000000000000000000000000000 0x3fc0000000000000
i=52 0.250000 2.500000e-01 0 01111111101 0000000000000000000000000000000000000000000000000000 0x3fd0000000000000
i=53 0.500000 5.000000e-01 0 01111111110 0000000000000000000000000000000000000000000000000000 0x3fe0000000000000
stop. i=54, a=1.000000 1.000000e+00

However, you can switch rounding mode to FE_TOWARDZERO:

a = 0.2;
0.200000 2.000000e-01 0 01111111100 1001100110011001100110011001100110011001100110011010 0x3fc999999999999a
current rounding direction:  FE_TONEAREST
a += 0.1;
FE_INEXACT
0.299999 2.999999e-01 0 01111111101 0011001100110011001100110011001100110011001100110011 0x3fd3333333333333
a -= 0.3;
0.000000 0.000000e+00 0 00000000000 0000000000000000000000000000000000000000000000000000 0x0000000000000000
a==0
i=0 0.000000 0.000000e+00 0 00000000000 0000000000000000000000000000000000000000000000000000 0x0000000000000000
i=1 0.000000 0.000000e+00 0 00000000000 0000000000000000000000000000000000000000000000000000 0x0000000000000000
i=2 0.000000 0.000000e+00 0 00000000000 0000000000000000000000000000000000000000000000000000 0x0000000000000000
i=3 0.000000 0.000000e+00 0 00000000000 0000000000000000000000000000000000000000000000000000 0x0000000000000000
...

Still FE_INEXACT value, but that works, because rounding gets other direction. The sum is ....33333 (not ....33334). (FE_DOWNWARD would work as well.)

But in a real-world code it's problematic to set rounding mode for each operation...

Conversion from double-precision to single-precision and back

Now this is even more arcane. We add 0.1 to zero and then subtract it, but the result is not zero:

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <math.h>
#include <errno.h>
#include <fenv.h>

const int FloatExpLen = 8;
const int FloatMantissaLen = 23;
const int FloatBits = FloatExpLen + FloatMantissaLen + 1;

void print_float_in_binary (float a)
{
        uint32_t t;
        memcpy (&t, &a, sizeof(float));

        // sign
        printf ("%d ", (t>>(FloatBits-1))&1);

        // exponent
        for (int i=(FloatBits-2); i!=(FloatBits-2)-FloatExpLen; i--)
                printf ("%d", (t>>i)&1);
        printf (" ");

        // mantissa
        for (int i=(FloatBits-2)-FloatExpLen; i>=0; i--)
                printf ("%d", (t>>i)&1);
        printf (" 0x%08x", t);
};
int main(){
        float a;
        //double a; // this will fix it
        int i;

        printf ("0.1: "); print_float_in_binary(0.1); printf ("\n");

        a = 0.0;
        printf ("a = 0.0;\n");
        printf ("%f %e ", a, a); print_float_in_binary(a); printf ("\n");
        feclearexcept(FE_ALL_EXCEPT);
        a += 0.1;
        if(fetestexcept(FE_INEXACT))
                printf ("FE_INEXACT\n");
        printf ("a += 0.1;\n");
        printf ("%f %e ", a, a); print_float_in_binary(a); printf ("\n");
        feclearexcept(FE_ALL_EXCEPT);
        a -= 0.1;

        if(fetestexcept(FE_INEXACT))
                printf ("FE_INEXACT\n");
        printf ("a -= 0.1;\n");
        printf ("%f %e ", a, a); print_float_in_binary(a); printf ("\n");

        if (a==0.0)
                printf ("a==0\n");
        else
                printf ("a!=0\n");

        for (i = 0; a < 1.0; i++)
        {
                printf ("i=%d ", i);
                printf ("%f %e ", a, a); print_float_in_binary(a); printf ("\n");
                a += a;
        };

        printf("i=%d, a=%f %e\n", i, a, a);

        return 0;
}
0.1: 0 01111011 10011001100110011001101 0x3dcccccd
a = 0.0;
0.000000 0.000000e+00 0 00000000 00000000000000000000000 0x00000000
FE_INEXACT
a += 0.1;
0.100000 1.000000e-01 0 01111011 10011001100110011001101 0x3dcccccd
FE_INEXACT
a -= 0.1;
0.000000 1.490116e-09 0 01100001 10011001100110011001101 0x30cccccd
a!=0
i=0 0.000000 1.490116e-09 0 01100001 10011001100110011001101 0x30cccccd
i=1 0.000000 2.980232e-09 0 01100010 10011001100110011001101 0x314ccccd
i=2 0.000000 5.960465e-09 0 01100011 10011001100110011001101 0x31cccccd
i=3 0.000000 1.192093e-08 0 01100100 10011001100110011001101 0x324ccccd
...
i=25 0.050000 5.000000e-02 0 01111010 10011001100110011001101 0x3d4ccccd
i=26 0.100000 1.000000e-01 0 01111011 10011001100110011001101 0x3dcccccd
i=27 0.200000 2.000000e-01 0 01111100 10011001100110011001101 0x3e4ccccd
i=28 0.400000 4.000000e-01 0 01111101 10011001100110011001101 0x3ecccccd
i=29 0.800000 8.000000e-01 0 01111110 10011001100110011001101 0x3f4ccccd
i=30, a=1.600000 1.600000e+00

How is that possible the 'noise' gets appeared if we just add a constant value and subtract it?

Luckily, I have some assembly knowledge, and I can get into the x86-64 code:

...
        cvtss2sd        xmm1, DWORD PTR -4[rbp]
        cvtss2sd        xmm0, DWORD PTR -4[rbp]
        lea     rdi, .LC5[rip]
        mov     eax, 2
        call    printf@PLT
        mov     eax, DWORD PTR -4[rbp]
        movd    xmm0, eax
        call    print_float_in_binary
        mov     edi, 10
        call    putchar@PLT
        mov     edi, 61
        call    feclearexcept@PLT
        cvtss2sd        xmm1, DWORD PTR -4[rbp]
        movsd   xmm0, QWORD PTR .LC6[rip]
        addsd   xmm0, xmm1
        cvtsd2ss        xmm0, xmm0
        movss   DWORD PTR -4[rbp], xmm0
        mov     edi, 32
        call    fetestexcept@PLT
        test    eax, eax
        je      .L8
        lea     rdi, .LC7[rip]
        call    puts@PLT
.L8:
        lea     rdi, .LC8[rip]
        call    puts@PLT
        cvtss2sd        xmm1, DWORD PTR -4[rbp]
        cvtss2sd        xmm0, DWORD PTR -4[rbp]
        lea     rdi, .LC5[rip]
        mov     eax, 2
        call    printf@PLT
        mov     eax, DWORD PTR -4[rbp]
        movd    xmm0, eax
        call    print_float_in_binary
        mov     edi, 10
        call    putchar@PLT
        mov     edi, 61
        call    feclearexcept@PLT
        cvtss2sd        xmm0, DWORD PTR -4[rbp]
        movsd   xmm1, QWORD PTR .LC6[rip]
        subsd   xmm0, xmm1
        cvtsd2ss        xmm0, xmm0
        movss   DWORD PTR -4[rbp], xmm0
...
.LC6:
        .long   2576980378
        .long   1069128089
...

But even without that knowledge, we can say something about that code. It uses 'cvtss2sd' and 'cvtsd2ss' instructions, which converts single-precision value to double-precision and back.

These are: "Convert Scalar Single-Precision Floating-Point Value to Scalar Double-Precision Floating-Point Value" and "Convert Scalar Double-Precision Floating-Point Value to Scalar Single-Precision Floating-Point Value".

The 'addsd' and 'subsd' instructions are: "Add Scalar Double-Precision Floating-Point Values" and "Subtract Scalar Double-Precision Floating-Point Value".

(The whole operation is happens in SIMD registers.) Also, the 0.1 constant is encoded as a 64-bit double-precision value (.LC6).

In other words, everything calculates here in double-precision, but stored as single-precision values.

The first FE_INEXACT exception raises when double-precision zero is summed with double-precision 0.1 value and shoved into single-precision register.

In[]:= N[convertFloatDWORD[16^^3dcccccd], 100]
Out[]= 0.100000001490116119384765625

This is the best possible approximation of 0.1 in single-precision. Now you convert it to double-precision. You'll get something like 0.10000000149011611... But this is not the best possible approximation of 0.1 in double-precision! The best possible approximation is:

In[]:= N[convertDoubleQWORD[16^^3fb999999999999a], 100]
Out[]= 0.1000000000000000055511151231257827021181583404541015625

You subtract $ \approx 0.1000000000000000055511151...$ (best possible approximation) from $ \approx 0.10000000149011611...$ and get $ \approx 0.00000000149011611...$ Of course, it gets normalized, the mantissa is shifted and exponent aligned, leaving $\approx 1.490116 \cdot 10^{-9}$. After converting this to a single-precision value, it left as the 'noise':

...
0.000000 1.490116e-09 0 01100001 10011001100110011001101 0x30cccccd
a!=0
...

Again, if summing the 'noise' with itself, it grows.

What if it's compiled for 80x87 FPU support? As it was done before SIMD?

% gcc -mfpmath=387 2.c -S -masm=intel -lm

...

        fld     DWORD PTR -8[rbp]
        fld     QWORD PTR .LC8[rip]
        faddp   st(1), st
        fstp    DWORD PTR -8[rbp]

...

        fld     DWORD PTR -8[rbp]
        fld     QWORD PTR .LC8[rip]
        fsubp   st(1), st
        fstp    DWORD PTR -8[rbp]
...


.LC8:
        .long   2576980378
        .long   1069128089

You see, one operand for faddp/fsubp instruction (add/sub) loaded as a 32-bit value (single-precision), but another (which is constant) as a 64-bit value (double-precision).

The solution: switching 'float' to 'double' in that code would eliminate problem.

The moral of the story

IEEE 754 is hard. Mathematically flawless expressions and algorithms can go crazy out of the blue.

Wikipedia: Round-off error.

A well-known introduction is What Every ComputerScientist Should Know About Floating-Point Arithmetic by David Goldberg.

The Numerical Recipes book is also recommended, as well as D.Knuth's TAOCP.

The field is numerical analysis.

All the files

(Including the Wolfram Mathematica notebook I used.)


Please drop me email about bug(s) and/or suggestion(s): blog@yurichev.com. List of other blog posts. BTW, I'm teaching. Follow me in social networks: Twitter, Telegram, GitHub, Discord, Facebook.