
Why is 1 + 1 not always 2?

You may have come across this before. I divide a pie into tenths, take seven tenths and then three tenths. How much is left? The computer says a small piece is left, 0.00000001! Where did it get that? Is it defective? Unfortunately, the fault is not in the computer, but on the other side of the screen. Let's discuss the issue of float accuracy here.

Binary representation of a number

The first problem is the number system. Nature (or the Anunnaki?) blessed us with ten fingers, which are so handy for counting sheep that we adopted them as the base of our number system. That has rather messed us up, because 10 is hard to keep dividing into halves. Yet dividing by two is the most convenient way of expressing information, because we make do with just 2 states: YES and NO, black and white, ON and OFF, 1 and 0.

Computers work in binary. It is therefore natural that numbers, too, are most conveniently expressed in binary. And to convert our way of counting into a form understandable to computers, we have to convert numbers from decimal to binary form.

For integers, it is still fairly easy: we decompose the number into multiples of powers of 2. For example, we decompose the number 753 as:

753 = 1*2^9 + 0*2^8 + 1*2^7 + 1*2^6 + 1*2^5 + 1*2^4 + 0*2^3 + 0*2^2 + 0*2^1 + 1*2^0

This is 1011110001 in binary notation, or 0x2F1 in HEX format.
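
A few lines of C can check the decomposition; this sketch prints the number's binary digits by testing its bits from the highest used power of 2 downward:

	#include <stdio.h>

	int main(void)
	{
		unsigned n = 753;
		// print bits from 2^9 (= 512, the highest power used) down to 2^0
		for (int bit = 9; bit >= 0; bit--)
			putchar((n >> bit) & 1 ? '1' : '0');
		printf("\n0x%X\n", n);	// prints 1011110001 and 0x2F1
		return 0;
	}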

We can decompose a fractional number similarly, except that we use negative exponents of powers of 2. For example, we decompose the number 0.1 as:

0.1 = 0*2^-1 + 0*2^-2 + 0*2^-3 + 1*2^-4 + 1*2^-5 + 0*2^-6 + 0*2^-7 + 1*2^-8 + 1*2^-9 ...
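
The repeating pattern is easy to generate; the following sketch computes the binary digits of 1/10 with exact integer arithmetic (no floats involved), by repeatedly doubling the remainder:

	#include <stdio.h>

	int main(void)
	{
		// each doubling of the remainder yields the next binary digit
		unsigned num = 1, den = 10;
		printf("0.1 in binary: 0.");
		for (int i = 0; i < 20; i++) {
			num *= 2;
			if (num >= den) { putchar('1'); num -= den; }
			else            { putchar('0'); }
		}
		printf("...\n");	// prints 0.00011001100110011001... - the group 1100 repeats forever
		return 0;
	}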

But oops: as we can see, the conversion never ends and we never reach the exact value of 0.1! A number with a finite number of digits in decimal representation can require an infinite number of digits in binary representation, and such a number cannot be expressed exactly. In the float format (single precision, 4 bytes per number), the value 0.1 looks like this:

0xCD 0xCC 0xCC 0x3D

The exponent is 0x7B (123, i.e. -4 after subtracting the bias of 127), the mantissa is 0xCCCCCD (with the hidden "1" added). In an exact representation, the number 0.1 would contain an infinite run of the digit 0xC in the mantissa, but due to the limited length of the number, the last digit of the mantissa is rounded up to 0xD.
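
We can verify this in C by reinterpreting the bits of the float; on a little-endian machine the four bytes appear in memory exactly in the order shown above:

	#include <stdio.h>
	#include <string.h>
	#include <stdint.h>

	int main(void)
	{
		float f = 0.1f;
		uint32_t bits;
		memcpy(&bits, &f, sizeof bits);	// reinterpret the 4 bytes of the float

		printf("raw:      0x%08X\n", (unsigned)bits);		// 0x3DCCCCCD
		printf("sign:     %u\n", (unsigned)(bits >> 31));	// 0
		printf("exponent: 0x%02X = %d\n",
			(unsigned)((bits >> 23) & 0xFF),
			(int)((bits >> 23) & 0xFF) - 127);		// 0x7B = -4 after the bias
		printf("mantissa: 0x%06X\n",
			(unsigned)((bits & 0x7FFFFF) | 0x800000));	// 0xCCCCCD with the hidden 1
		return 0;
	}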

If we display such a number, we get the expected result of 0.1 only when rounding to fewer digits than the full precision of the format. Example in C:

printf("Rounded: %.8f\n", 0.1f);
printf("Full precision: %.9f\n", 0.1f);

Prints:

Rounded: 0.10000000
Full precision: 0.100000001

Seemingly we entered a "nice" number with a finite number of digits, and yet, without performing any operation, the computer is no longer able to return that exact number. This is not a bug, it is a feature we have to take into account.

Binary numbers versus BCD numbers

The remedy for this problem could be a BCD representation of numbers. In BCD format, a number is not stored in memory in binary form, but as the decimal digits 0 to 9. This is how calculators used to calculate (and still do). The processor calculates the way a human does: it does not convert numbers into binary form, it just stores the number as a sequence of digits and calculates with those digits (a small sketch of such packing follows below). This approach brings several pitfalls with it.
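
As an illustration, packing a number into BCD digits might look like this sketch (to_bcd is a made-up helper; real formats also store a sign and an exponent):

	#include <stdio.h>
	#include <stdint.h>

	// pack a number as BCD, two decimal digits per byte, lowest byte first
	void to_bcd(unsigned n, uint8_t *out, int bytes)
	{
		for (int i = 0; i < bytes; i++) {
			out[i]  = n % 10;		// low digit
			n /= 10;
			out[i] |= (n % 10) << 4;	// high digit
			n /= 10;
		}
	}

	int main(void)
	{
		uint8_t bcd[2];
		to_bcd(753, bcd, 2);
		printf("0x%02X 0x%02X\n", (unsigned)bcd[1], (unsigned)bcd[0]);	// prints 0x07 0x53
		return 0;
	}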

The first pitfall, higher memory requirements, is addressed by the newer IEEE 754 standard formats with decimal arithmetic: decimal32, decimal64 and decimal128. They exist in two variants: the classical one with binary BCD encoding, where 2 digits occupy 1 byte, and DPD (Densely Packed Decimal) encoding, where 3 decimal digits are packed into 10 bits. With DPD encoding, the storage efficiency of a number is comparable to the binary format.

However, working with DPD encoding is not easy. Packing and unpacking the numbers is costly, which makes it even slower than the already slow BCD mode. Wider deployment of the DPD format still awaits hardware support in a mathematical coprocessor and therefore remains rather theoretical. As a reference standard, Intel has released a library supporting these new formats: the Intel Decimal Floating-Point Math Library https://software.intel.com/content/www/us/en/develop/articles/intel-decimal-floating-point-math-library.html . The DPD format together with the Intel library has been used e.g. in the DM42 calculator ( https://www.swissmicros.com/product/dm42 ), which supports 128-bit BCD numbers with a precision of 34 digits and an exponent of +-6143. As usual with calculators, the low speed of calculations is not much of a concern there, especially when compensated by the powerful STM32L476VGT6 processor.

However, it is questionable whether, in these days of rapidly falling memory prices, the higher memory requirements of the BCD format are worth worrying about. The difference in memory requirements is not so abysmal as to justify the time-consuming operations of the DPD format. Reasonable use will only come when the format is fully supported by a hardware coprocessor, without loss of speed.

Low speed is the biggest pitfall of the BCD format. For comparison, here is what the addition of 2 mantissa bytes in binary format looks like on an ATmega processor:

	adc	r1,r10
And here, roughly, is what the addition of the corresponding 2 BCD digits looks like in BCD format (taken out of context, don't examine the details, it's just a hint :-) ):
	andi	r21,0x0f	; mask lower digit 1
	add	r23,r21		; add input carry

	; load low digit 2
	mov	r25,r24		; save byte
	andi	r24,0x0f	; mask lower digit of mantissa
	andi	r25,0xf0	; mask higher digit of mantissa

	; add low digit
	add	r24,r23		; add 1st digit with carry
	clr	r23		; clear carry
	cpi	r24,10		; carry?
	brcs	2f		; no carry, save result
	subi	r24,10		; carry correction
	ldi	r23,1		; new carry

	; compose digit
2:	or	r24,r25		; compose digits
	swap	r21		; swap digits
	swap	r24		; swap digits

	; load high digit
	mov	r25,r24		; save byte
	andi	r24,0x0f	; mask lower digit of mantissa
	andi	r25,0xf0	; mask higher digit of mantissa

	; add high digits
	add	r24,r23		; add carry
	clr	r23		; clear carry
	cpi	r24,10		; carry?
	brcs	2f		; no carry, save result
	subi	r24,10		; carry correction
	ldi	r23,1		; new carry

	; compose digit
2:	or	r24,r25		; compose digits
	swap	r24		; swap digits

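For comparison, the same two-digit BCD addition written in C; again just a sketch (real implementations often exploit the processor's half-carry flag instead):

	#include <stdio.h>
	#include <stdint.h>

	// add two packed BCD bytes (two digits each) with an input carry;
	// returns the BCD result and updates *carry
	uint8_t bcd_add(uint8_t a, uint8_t b, uint8_t *carry)
	{
		unsigned lo = (a & 0x0F) + (b & 0x0F) + *carry;
		unsigned hi = (a >> 4) + (b >> 4);
		if (lo > 9) { lo -= 10; hi++; }		// decimal carry from the low digit
		*carry = 0;
		if (hi > 9) { hi -= 10; *carry = 1; }	// decimal carry out
		return (uint8_t)((hi << 4) | lo);
	}

	int main(void)
	{
		uint8_t carry = 0;
		uint8_t r = bcd_add(0x47, 0x85, &carry);	// 47 + 85 = 132
		printf("%u%02X\n", (unsigned)carry, (unsigned)r);	// prints 132
		return 0;
	}
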
The lower speed of calculation in the BCD format is somewhat compensated by the higher speed of conversion between the text and internal representations of a number. Displaying a BCD number is very simple: just take the required digits from the mantissa. In binary format, the conversion is laborious. First, the number must be split into a decimal mantissa and exponent, by repeatedly dividing or multiplying the number by powers of 10 until the mantissa falls into the interval 1 to 10. The individual digits of the mantissa are then obtained by repeatedly multiplying the mantissa by 10 and taking off the integer part of the number.
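
In C, that conversion might look like the following sketch (positive, nonzero numbers only, 6 digits; rounding is handled crudely by adding half of the last printed digit, corner cases omitted):

	#include <stdio.h>

	int main(void)
	{
		double x = 753.0;	// positive number to convert
		int exp10 = 0;

		// normalize the mantissa into the interval [1, 10)
		while (x >= 10) { x /= 10; exp10++; }
		while (x < 1)   { x *= 10; exp10--; }

		x += 0.5e-5;	// round to the 6 digits printed below

		// peel off digits by repeated multiplication by 10
		printf("mantissa digits: ");
		for (int i = 0; i < 6; i++) {
			int digit = (int)x;	// the integer part is the next digit
			putchar('0' + digit);
			x = (x - digit) * 10;
		}
		printf("\nexponent: %d\n", exp10);	// 753 = 7.53000 * 10^2
		return 0;
	}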

For the above reasons, the BCD format is used in calculators and small devices, where low calculation speed is not a concern and fast, frequent conversions between the internal and text formats are preferred. The binary format is preferred in computers, where conversions to text format are infrequent and high calculation speed matters more.

Accumulation of error

Since numbers in the internal format have limited precision, error will always accumulate. Deviation from the correct value cannot be avoided, and the deviation grows with repeated calculations. Float numbers are normally displayed with some reserve and rounding, but after many operations the error can reach the displayed digits. Example:

	for (float n = 0; n != 0.4f; n += 0.1f) printf("n=%.9f\n", n);

The loop variable is a float incremented by 0.1f. The loop terminates when the value 0.4f is reached. The program prints, as expected:

n=0.000000000
n=0.100000001
n=0.200000003
n=0.300000012

Now let's raise the threshold to 0.7f. The program starts and never stops. The accumulated error has caused a deviation from the expected value of 0.7f, so the exact comparison never succeeds.

When working with floating-point numbers, we cannot compare them against an exact value; due to the deviations, a number may differ slightly. The solution is to compare numbers within an interval, with an epsilon tolerance. In our case, we would modify the program as follows:

	for (float n = 0; fabsf(n - 0.7f) > 0.0001f; n += 0.1f) printf("n=%.9f\n", n);

The program runs as long as the deviation from the end value is greater than the chosen minimum deviation. Note that fabsf from math.h must be used here; the integer abs would truncate the fractional difference to zero.
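
Such a tolerance comparison is worth wrapping in a helper. A possible sketch (nearly_equal and nearly_equal_rel are made-up names; the relative variant scales the tolerance with the magnitude of the operands):

	#include <stdio.h>
	#include <math.h>
	#include <stdbool.h>

	// compare two floats with an absolute tolerance
	bool nearly_equal(float a, float b, float eps)
	{
		return fabsf(a - b) <= eps;
	}

	// tolerance relative to the larger operand, so the check
	// also behaves sensibly for very large or very small numbers
	bool nearly_equal_rel(float a, float b, float rel_eps)
	{
		float m = fmaxf(fabsf(a), fabsf(b));
		return fabsf(a - b) <= m * rel_eps;
	}

	int main(void)
	{
		float n = 0;
		for (int i = 0; i < 7; i++) n += 0.1f;
		printf("%d\n", n == 0.7f);				// 0 - exact comparison fails
		printf("%d\n", nearly_equal(n, 0.7f, 0.0001f));		// 1
		printf("%d\n", nearly_equal_rel(n, 0.7f, 1e-5f));	// 1
		return 0;
	}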

In the case of programmable calculators, the calculator performs this check itself. During a comparison operation, the numbers are subtracted from each other. If the calculator computes to 13 digits but displays only 10, it knows it has 3 digits available for its internal use, for inaccuracies. If during the comparison the difference of the numbers falls outside the displayed range (the difference between the exponents of the operands and of the result is greater than 10), it knows this is not user input but an inaccuracy deviation, and in that case it considers the numbers identical.
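
The same idea can be sketched in C (calc_equal is a made-up helper; a float carries about 7 significant digits, of which we pretend to display 5 and keep the rest as guard digits):

	#include <stdio.h>
	#include <math.h>
	#include <stdbool.h>

	#define DISPLAY_DIGITS 5	// digits shown to the user

	// consider a and b equal when their difference lies more than
	// DISPLAY_DIGITS decimal orders below the operands, i.e. when the
	// deviation can only have come from the hidden guard digits
	bool calc_equal(double a, double b)
	{
		if (a == b) return true;
		double diff = fabs(a - b);
		double mag  = fmax(fabs(a), fabs(b));
		return log10(mag / diff) > DISPLAY_DIGITS;
	}

	int main(void)
	{
		float n = 0;
		for (int i = 0; i < 7; i++) n += 0.1f;
		printf("%d\n", n == 0.7f);		// 0 - accumulated error
		printf("%d\n", calc_equal(n, 0.7f));	// 1 - deviation only in guard digits
		return 0;
	}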

Miroslav Nemecek
