(The following text has been copy-pasted to the Math Recipes book.)
This fragment from the Wikipedia article caught my attention:
An example of the error caused by floating point roundoff is illustrated in the following C code.

int main(){
    double a;
    int i;

    a = 0.2;
    a += 0.1;
    a -= 0.3;

    for (i = 0; a < 1.0; i++)
        a += a;

    printf("i=%d, a=%f\n", i, a);
    return 0;
}

It appears that the program should not terminate. Yet the output is:

i=54, a=1.000000
How is that possible?
First of all, I wrote some Wolfram Mathematica code to unpack IEEE 754 single/double values, but with arbitrary precision (like bignum), not bounded by 32 or 64 bits.
DoubleExpLen = 11;
DoubleMantissaLen = 52;
DoubleBits = DoubleExpLen + DoubleMantissaLen + 1;
DoubleMaxExponent = BitShiftLeft[1, DoubleExpLen - 1] - 1;
DoubleExpMask = BitShiftLeft[1, DoubleExpLen] - 1;
DoubleMantissaMask = BitShiftLeft[1, DoubleMantissaLen] - 1;

convertDoubleSubnormal[sign_, mantissa_] :=
  (-1)^sign*(mantissa/(BitShiftLeft[1, ((DoubleBits - (DoubleExpLen + 1)) + DoubleMaxExponent - 1)]))

convertDoubleNaN[sign_, exp_, mantissa_] :=
  If[mantissa == 0, If[sign == 0, "Inf", "-Inf"], "NaN"]

convertDoubleNormal[sign_, exp_, mantissa_] :=
  (-1)^sign*(BitShiftLeft[1, DoubleMantissaLen] + mantissa)/(2^(DoubleMantissaLen - (exp - DoubleMaxExponent)))

convertDouble[sign_, exp_, mantissa_] :=
  If[exp == DoubleExpMask,
    convertDoubleNaN[sign, exp, mantissa],
    If[exp == 0,
      convertDoubleSubnormal[sign, mantissa],
      convertDoubleNormal[sign, exp, mantissa]]]

convertDoubleQWORD[dw_] :=
  convertDouble[
    BitShiftRight[dw, DoubleBits - 1],
    BitAnd[BitShiftRight[dw, DoubleMantissaLen], DoubleExpMask],
    BitAnd[dw, DoubleMantissaMask]]
The closest value to 0.1 is...
In[]:= N[convertDoubleQWORD[16^^3fb999999999999a], 100]
Out[]= 0.1000000000000000055511151231257827021181583404541015625
What are the previous and the next representable numbers? I decrease/increase the mantissa by one here:
In[]:= N[convertDoubleQWORD[16^^3fb999999999999a - 1], 100]
Out[]= 0.09999999999999999167332731531132594682276248931884765625

In[]:= N[convertDoubleQWORD[16^^3fb999999999999a + 1], 100]
Out[]= 0.10000000000000001942890293094023945741355419158935546875
You see, the first value is not exact, but it is the best approximation possible in the double-precision IEEE 754 format.
You can't store 0.1 as an exact value for the same reason you can't represent the value 1/3 exactly in any binary format. Well, maybe an unbalanced ternary computer could, but it would have problems representing the value 1/2.
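In binary, 1/10 is the infinitely repeating fraction 0.0001100110011..., so any finite mantissa has to cut it off somewhere. If you don't have Mathematica at hand, you can also see the stored value from C itself; a minimal sketch of mine, not part of the programs below:

#include <stdio.h>

int main(void)
{
    // %.60f asks for more decimal digits than a double actually carries,
    // so it exposes the exact binary value stored for the literal 0.1
    printf("%.60f\n", 0.1);
    // %a prints the same value as a hexadecimal floating-point literal
    printf("%a\n", 0.1);
    return 0;
}

On glibc it should print the same 0.1000000000000000055511151231257827021181583404541015625 value (padded with trailing zeros) and 0x1.999999999999ap-4, matching the Mathematica output above.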
Now I souped up the code from Wikipedia:
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <math.h>
#include <errno.h>
#include <fenv.h>

const int DoubleExpLen=11;
const int DoubleMantissaLen=52;
const int DoubleBits = DoubleExpLen + DoubleMantissaLen + 1;

void print_double_in_binary (double a)
{
    uint64_t t;
    memcpy (&t, &a, sizeof(double));

    // sign
    printf ("%ld ", (t>>(DoubleBits-1))&1); // sign

    // exponent
    for (int i=(DoubleBits-2); i!=(DoubleBits-2)-DoubleExpLen; i--)
        printf ("%ld", (t>>i)&1);
    printf (" ");

    // mantissa
    for (int i=(DoubleBits-2)-DoubleExpLen; i>=0; i--)
        printf ("%ld", (t>>i)&1);

    printf (" 0x%016lx", t);
};

// https://en.cppreference.com/w/c/numeric/fenv/feround
void show_fe_current_rounding_direction(void)
{
    printf("current rounding direction: ");
    switch (fegetround())
    {
        case FE_TONEAREST:  printf ("FE_TONEAREST"); break;
        case FE_DOWNWARD:   printf ("FE_DOWNWARD"); break;
        case FE_UPWARD:     printf ("FE_UPWARD"); break;
        case FE_TOWARDZERO: printf ("FE_TOWARDZERO"); break;
        default:            printf ("unknown");
    };
    printf("\n");
}

int main(){
    double a;
    int i;

    printf ("0.1: "); print_double_in_binary (0.1); printf ("\n");
    printf ("0.2: "); print_double_in_binary (0.2); printf ("\n");
    printf ("0.3: "); print_double_in_binary (0.3); printf ("\n");

    a = 0.2;
    printf ("a = 0.2;\n");
    printf ("%f %e ", a, a); print_double_in_binary(a); printf ("\n");

    show_fe_current_rounding_direction();

#if 0
    fesetround(FE_DOWNWARD);   // works OK
    fesetround(FE_UPWARD);     // not zero
    fesetround(FE_TONEAREST);  // not zero
    fesetround(FE_TOWARDZERO); // works OK
#endif

    printf ("a += 0.1;\n");
    feclearexcept(FE_ALL_EXCEPT);
    a += 0.1;
    if(fetestexcept(FE_INEXACT)) printf ("FE_INEXACT\n");
    printf ("%f %e ", a, a); print_double_in_binary(a); printf ("\n");

    printf ("a -= 0.3;\n");
    feclearexcept(FE_ALL_EXCEPT);
    a -= 0.3;
    if(fetestexcept(FE_INEXACT)) printf ("FE_INEXACT\n");
    printf ("%f %e ", a, a); print_double_in_binary(a); printf ("\n");

    if (a==0.0)
        printf ("a==0\n");
    else
        printf ("a!=0\n");

    for (i = 0; a < 1.0; i++)
    {
        printf ("i=%d ", i);
        printf ("%f %e ", a, a); print_double_in_binary(a); printf ("\n");
        a += a;
    };

    printf("stop. i=%d, a=%f %e\n", i, a, a);
    return 0;
}
0.1: 0 01111111011 1001100110011001100110011001100110011001100110011010 0x3fb999999999999a
0.2: 0 01111111100 1001100110011001100110011001100110011001100110011010 0x3fc999999999999a
0.3: 0 01111111101 0011001100110011001100110011001100110011001100110011 0x3fd3333333333333
a = 0.2;
0.200000 2.000000e-01 0 01111111100 1001100110011001100110011001100110011001100110011010 0x3fc999999999999a
current rounding direction: FE_TONEAREST
a += 0.1;
FE_INEXACT
0.300000 3.000000e-01 0 01111111101 0011001100110011001100110011001100110011001100110100 0x3fd3333333333334
a -= 0.3;
0.000000 5.551115e-17 0 01111001001 0000000000000000000000000000000000000000000000000000 0x3c90000000000000
a!=0
i=0 0.000000 5.551115e-17 0 01111001001 0000000000000000000000000000000000000000000000000000 0x3c90000000000000
i=1 0.000000 1.110223e-16 0 01111001010 0000000000000000000000000000000000000000000000000000 0x3ca0000000000000
i=2 0.000000 2.220446e-16 0 01111001011 0000000000000000000000000000000000000000000000000000 0x3cb0000000000000
...
i=52 0.250000 2.500000e-01 0 01111111101 0000000000000000000000000000000000000000000000000000 0x3fd0000000000000
i=53 0.500000 5.000000e-01 0 01111111110 0000000000000000000000000000000000000000000000000000 0x3fe0000000000000
stop. i=54, a=1.000000 1.000000e+00
After the summation of 0.2 and 0.1, the FPU reports an FE_INEXACT exception in its status word. It warns us that the result can't "fit" into an IEEE 754 double exactly/precisely. The sum, as a 64-bit value, is 0x3fd3333333333334. But at the beginning of our program we dumped 0.3, and it is 0x3fd3333333333333. The difference is the lowest bit of the mantissa.
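This one-bit difference is also behind the classic surprise that 0.1 + 0.2 == 0.3 is false in C. A tiny standalone check (separate from the program above), using the %a hexadecimal format to show the bit patterns:

#include <stdio.h>

int main(void)
{
    // the sum is 0x3fd3333333333334, the literal 0.3 is 0x3fd3333333333333,
    // so the comparison is false
    printf("%d\n", 0.1 + 0.2 == 0.3);   // prints 0
    printf("%a\n", 0.1 + 0.2);          // 0x1.3333333333334p-2
    printf("%a\n", 0.3);                // 0x1.3333333333333p-2
    return 0;
}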
Let's see what Mathematica can say about the exact sum:
In[]:= a = N[convertDoubleQWORD[16^^3fb999999999999a], 100]
Out[]= 0.1000000000000000055511151231257827021181583404541015625

In[]:= b = N[convertDoubleQWORD[16^^3fc999999999999a], 100]
Out[]= 0.200000000000000011102230246251565404236316680908203125

In[]:= a + b
Out[]= 0.3000000000000000166533453693773481063544750213623046875
That is the result the FPU tries to shove into a double-precision IEEE 754 register, but can't.
... while the best approximation of 0.3 is:
In[]:= N[convertDoubleQWORD[16^^3fd3333333333333], 100]
Out[]= 0.299999999999999988897769753748434595763683319091796875
And the result produced by our code (the lowest bit of the mantissa incremented):
In[]:= N[convertDoubleQWORD[16^^3fd3333333333334], 100]
Out[]= 0.3000000000000000444089209850062616169452667236328125
Since the rounding mode set by default is FE_TONEAREST, the FPU rounded the exact sum to the nearest representable value. In fact, the exact sum lies precisely halfway between 0x3fd3333333333333 and 0x3fd3333333333334, and the ties-to-even rule of FE_TONEAREST picks the candidate whose lowest mantissa bit is zero, which is 0x3fd3333333333334.
Now, how big is that difference of one lowest mantissa bit?
In[]:= N[convertDoubleQWORD[16^^3fd3333333333334], 100] - N[convertDoubleQWORD[16^^3fd3333333333333], 100]
Out[]= 5.5511151231257827021181583404541015625*10^-17
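That value is one ULP (unit in the last place) at 0.3, i.e. 2^-54. Without Mathematica, nextafter() from <math.h> gives the same number; a small sketch of mine, not part of the programs above:

#include <stdio.h>
#include <math.h>

int main(void)
{
    // distance from 0.3 to the next representable double towards 1.0:
    // one unit in the last place (ULP) of 0.3
    double ulp = nextafter(0.3, 1.0) - 0.3;
    printf("%e\n", ulp);              // 5.551115e-17
    printf("%d\n", ulp == 0x1p-54);   // 1: exactly 2^-54
    return 0;                         // (link with -lm if your libc needs it)
}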
This is the 'noise' left in the register after the subtraction, and it isn't equal to zero:
0.000000 5.551115e-17 0 01111001001 0000000000000000000000000000000000000000000000000000 0x3c90000000000000
a!=0
i=0 0.000000 5.551115e-17 0 01111001001 0000000000000000000000000000000000000000000000000000 0x3c90000000000000
i=1 0.000000 1.110223e-16 0 01111001010 0000000000000000000000000000000000000000000000000000 0x3ca0000000000000
i=2 0.000000 2.220446e-16 0 01111001011 0000000000000000000000000000000000000000000000000000 0x3cb0000000000000
i=3 0.000000 4.440892e-16 0 01111001100 0000000000000000000000000000000000000000000000000000 0x3cc0000000000000
...
When that 'noise' gets summed with itself, it doubles on each iteration. It starts at exactly 2^-54, so after 54 doublings it reaches 2^0 = 1.0, and the loop stops at i=54:
...
i=49 0.031250 3.125000e-02 0 01111111010 0000000000000000000000000000000000000000000000000000 0x3fa0000000000000
i=50 0.062500 6.250000e-02 0 01111111011 0000000000000000000000000000000000000000000000000000 0x3fb0000000000000
i=51 0.125000 1.250000e-01 0 01111111100 0000000000000000000000000000000000000000000000000000 0x3fc0000000000000
i=52 0.250000 2.500000e-01 0 01111111101 0000000000000000000000000000000000000000000000000000 0x3fd0000000000000
i=53 0.500000 5.000000e-01 0 01111111110 0000000000000000000000000000000000000000000000000000 0x3fe0000000000000
stop. i=54, a=1.000000 1.000000e+00
However, you can switch the rounding mode to FE_TOWARDZERO:
a = 0.2;
0.200000 2.000000e-01 0 01111111100 1001100110011001100110011001100110011001100110011010 0x3fc999999999999a
current rounding direction: FE_TONEAREST
a += 0.1;
FE_INEXACT
0.299999 2.999999e-01 0 01111111101 0011001100110011001100110011001100110011001100110011 0x3fd3333333333333
a -= 0.3;
0.000000 0.000000e+00 0 00000000000 0000000000000000000000000000000000000000000000000000 0x0000000000000000
a==0
i=0 0.000000 0.000000e+00 0 00000000000 0000000000000000000000000000000000000000000000000000 0x0000000000000000
i=1 0.000000 0.000000e+00 0 00000000000 0000000000000000000000000000000000000000000000000000 0x0000000000000000
i=2 0.000000 0.000000e+00 0 00000000000 0000000000000000000000000000000000000000000000000000 0x0000000000000000
i=3 0.000000 0.000000e+00 0 00000000000 0000000000000000000000000000000000000000000000000000 0x0000000000000000
...
The result is still FE_INEXACT, but now the code behaves as expected, because the rounding goes in the other direction: the sum is ...33333 (not ...33334), so subtracting 0.3 leaves exactly zero and the loop never terminates. (FE_DOWNWARD would work as well.)
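Here is a minimal sketch of that experiment, reduced from the bigger program above. Note that without #pragma STDC FENV_ACCESS ON a compiler may constant-fold the arithmetic under the default rounding mode, so compile it without optimizations (or with -frounding-math):

#include <stdio.h>
#include <fenv.h>

int main(void)
{
    // round towards zero instead of the default round-to-nearest
    fesetround(FE_TOWARDZERO);

    double a = 0.2;
    a += 0.1;   // rounded towards zero: 0x3fd3333333333333, the same double the literal 0.3 gives
    a -= 0.3;   // so this subtraction is exact and leaves 0.0

    printf("%e\n", a);        // 0.000000e+00
    printf("%d\n", a == 0.0); // 1
    return 0;
}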
But in real-world code it's problematic to set the rounding mode for each operation...
Now this is even more arcane. We add 0.1 to zero and then subtract it, but the result is not zero:
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <math.h>
#include <errno.h>
#include <fenv.h>

const int FloatExpLen = 8;
const int FloatMantissaLen = 23;
const int FloatBits = FloatExpLen + FloatMantissaLen + 1;

void print_float_in_binary (float a)
{
    uint32_t t;
    memcpy (&t, &a, sizeof(float));

    // sign
    printf ("%d ", (t>>(FloatBits-1))&1);

    // exponent
    for (int i=(FloatBits-2); i!=(FloatBits-2)-FloatExpLen; i--)
        printf ("%d", (t>>i)&1);
    printf (" ");

    // mantissa
    for (int i=(FloatBits-2)-FloatExpLen; i>=0; i--)
        printf ("%d", (t>>i)&1);

    printf (" 0x%08x", t);
};

int main(){
    float a;
    //double a; // this will fix it
    int i;

    printf ("0.1: "); print_float_in_binary(0.1); printf ("\n");

    a = 0.0;
    printf ("a = 0.0;\n");
    printf ("%f %e ", a, a); print_float_in_binary(a); printf ("\n");

    feclearexcept(FE_ALL_EXCEPT);
    a += 0.1;
    if(fetestexcept(FE_INEXACT)) printf ("FE_INEXACT\n");
    printf ("a += 0.1;\n");
    printf ("%f %e ", a, a); print_float_in_binary(a); printf ("\n");

    feclearexcept(FE_ALL_EXCEPT);
    a -= 0.1;
    if(fetestexcept(FE_INEXACT)) printf ("FE_INEXACT\n");
    printf ("a -= 0.1;\n");
    printf ("%f %e ", a, a); print_float_in_binary(a); printf ("\n");

    if (a==0.0)
        printf ("a==0\n");
    else
        printf ("a!=0\n");

    for (i = 0; a < 1.0; i++)
    {
        printf ("i=%d ", i);
        printf ("%f %e ", a, a); print_float_in_binary(a); printf ("\n");
        a += a;
    };

    printf("i=%d, a=%f %e\n", i, a, a);
    return 0;
}
0.1: 0 01111011 10011001100110011001101 0x3dcccccd
a = 0.0;
0.000000 0.000000e+00 0 00000000 00000000000000000000000 0x00000000
FE_INEXACT
a += 0.1;
0.100000 1.000000e-01 0 01111011 10011001100110011001101 0x3dcccccd
FE_INEXACT
a -= 0.1;
0.000000 1.490116e-09 0 01100001 10011001100110011001101 0x30cccccd
a!=0
i=0 0.000000 1.490116e-09 0 01100001 10011001100110011001101 0x30cccccd
i=1 0.000000 2.980232e-09 0 01100010 10011001100110011001101 0x314ccccd
i=2 0.000000 5.960465e-09 0 01100011 10011001100110011001101 0x31cccccd
i=3 0.000000 1.192093e-08 0 01100100 10011001100110011001101 0x324ccccd
...
i=25 0.050000 5.000000e-02 0 01111010 10011001100110011001101 0x3d4ccccd
i=26 0.100000 1.000000e-01 0 01111011 10011001100110011001101 0x3dcccccd
i=27 0.200000 2.000000e-01 0 01111100 10011001100110011001101 0x3e4ccccd
i=28 0.400000 4.000000e-01 0 01111101 10011001100110011001101 0x3ecccccd
i=29 0.800000 8.000000e-01 0 01111110 10011001100110011001101 0x3f4ccccd
i=30, a=1.600000 1.600000e+00
How is it possible that the 'noise' appears if we just add a constant value and then subtract it?
Luckily, I have some assembly knowledge, and I can get into the x86-64 code:
...
        cvtss2sd        xmm1, DWORD PTR -4[rbp]
        cvtss2sd        xmm0, DWORD PTR -4[rbp]
        lea     rdi, .LC5[rip]
        mov     eax, 2
        call    printf@PLT
        mov     eax, DWORD PTR -4[rbp]
        movd    xmm0, eax
        call    print_float_in_binary
        mov     edi, 10
        call    putchar@PLT
        mov     edi, 61
        call    feclearexcept@PLT
        cvtss2sd        xmm1, DWORD PTR -4[rbp]
        movsd   xmm0, QWORD PTR .LC6[rip]
        addsd   xmm0, xmm1
        cvtsd2ss        xmm0, xmm0
        movss   DWORD PTR -4[rbp], xmm0
        mov     edi, 32
        call    fetestexcept@PLT
        test    eax, eax
        je      .L8
        lea     rdi, .LC7[rip]
        call    puts@PLT
.L8:
        lea     rdi, .LC8[rip]
        call    puts@PLT
        cvtss2sd        xmm1, DWORD PTR -4[rbp]
        cvtss2sd        xmm0, DWORD PTR -4[rbp]
        lea     rdi, .LC5[rip]
        mov     eax, 2
        call    printf@PLT
        mov     eax, DWORD PTR -4[rbp]
        movd    xmm0, eax
        call    print_float_in_binary
        mov     edi, 10
        call    putchar@PLT
        mov     edi, 61
        call    feclearexcept@PLT
        cvtss2sd        xmm0, DWORD PTR -4[rbp]
        movsd   xmm1, QWORD PTR .LC6[rip]
        subsd   xmm0, xmm1
        cvtsd2ss        xmm0, xmm0
        movss   DWORD PTR -4[rbp], xmm0
...
.LC6:
        .long   2576980378
        .long   1069128089
...
But even without that knowledge, we can say something about that code. It uses the 'cvtss2sd' and 'cvtsd2ss' instructions, which convert a single-precision value to double-precision and back.
These are: "Convert Scalar Single-Precision Floating-Point Value to Scalar Double-Precision Floating-Point Value" and "Convert Scalar Double-Precision Floating-Point Value to Scalar Single-Precision Floating-Point Value".
The 'addsd' and 'subsd' instructions are: "Add Scalar Double-Precision Floating-Point Values" and "Subtract Scalar Double-Precision Floating-Point Value".
(The whole operation happens in SIMD registers.) Also, the 0.1 constant is encoded as a 64-bit double-precision value (.LC6).
In other words, everything here is calculated in double precision, but stored as single-precision values.
The first FE_INEXACT exception is raised when the double-precision zero is summed with the double-precision 0.1 value and the result is shoved into a single-precision register.
In[]:= N[convertFloatDWORD[16^^3dcccccd], 100]
Out[]= 0.100000001490116119384765625
This is the best possible approximation of 0.1 in single-precision. Now you convert it to double-precision. You'll get something like 0.10000000149011611... But this is not the best possible approximation of 0.1 in double-precision! The best possible approximation is:
In[]:= N[convertDoubleQWORD[16^^3fb999999999999a], 100]
Out[]= 0.1000000000000000055511151231257827021181583404541015625
You subtract $\approx 0.1000000000000000055511151...$ (the best possible approximation) from $\approx 0.10000000149011611...$ and get $\approx 0.00000000149011611...$ Of course, it gets normalized: the mantissa is shifted and the exponent adjusted, leaving $\approx 1.490116 \cdot 10^{-9}$. After converting this to a single-precision value, it is left as the 'noise':
...
0.000000 1.490116e-09 0 01100001 10011001100110011001101 0x30cccccd
a!=0
...
Again, when the 'noise' is summed with itself, it grows.
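The same effect can be reproduced in a few lines of plain C (my own minimal sketch, not the instrumented program above): the literal 0.1 is a double, so the float variable is widened, the arithmetic is done in double precision, and the result is narrowed back on every assignment. Writing the constant as 0.1f (or declaring the variable as double, as the comment in the program suggests) keeps everything in one precision and the 'noise' disappears:

#include <stdio.h>

int main(void)
{
    float a = 0.0f;
    a += 0.1;            // 0.1 is a double: a is widened, added, then narrowed back to float
    a -= 0.1;            // the same mixed-precision round trip
    printf("%e\n", a);   // about 1.490116e-09, not zero

    float b = 0.0f;
    b += 0.1f;           // 0.1f is a float: everything stays in single precision
    b -= 0.1f;
    printf("%e\n", b);   // exactly 0.000000e+00
    return 0;
}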
What if it's compiled for the 80x87 FPU, as it was done before SIMD?
% gcc -mfpmath=387 2.c -S -masm=intel -lm
...
        fld     DWORD PTR -8[rbp]
        fld     QWORD PTR .LC8[rip]
        faddp   st(1), st
        fstp    DWORD PTR -8[rbp]
...
        fld     DWORD PTR -8[rbp]
        fld     QWORD PTR .LC8[rip]
        fsubp   st(1), st
        fstp    DWORD PTR -8[rbp]
...
.LC8:
        .long   2576980378
        .long   1069128089
...
You see, one operand of the faddp/fsubp instructions (add/sub) is loaded as a 32-bit value (single precision), but the other (the constant) as a 64-bit value (double precision). On the x87 stack, fld extends both to the internal 80-bit format, the operation is done there, and fstp stores the result back as a 32-bit single-precision value, so the same mixed-precision round trip happens here too.
The solution: switching 'float' to 'double' in that code would eliminate the problem.
IEEE 754 is hard. Mathematically flawless expressions and algorithms can go crazy out of the blue.
A well-known introduction is What Every Computer Scientist Should Know About Floating-Point Arithmetic by David Goldberg.
The Numerical Recipes book is also recommended, as well as D. Knuth's TAOCP.
The field is numerical analysis.
(Including the Wolfram Mathematica notebook I used.)