Floating-point number formats
In this article we will show floating point number formats in various computers and calculators - Busicom 141-PF, TI-59, ZX Spectrum, IEEE754 and others.
We start with one of the first desktop calculators, the Busicom 141-PF from 1971, built around the Intel 4004 processor. Although the calculator does not yet use an exponent, it does store the position of the decimal point, which is the first hint of an exponent. Example of displaying the number -75.43:

The number register contains 20 BCD digits (i.e. 4-bit cells with values 0 to 15). The first 16 digits are the mantissa, stored from the lowest digit to the highest; the last 4 digits are status cells. M0 is located at the lowest RAM address, S3 at the highest. Cell S0 contains the sign flag: 0 is a positive number, 15 is a negative number. Cell S1 contains the position of the decimal point, i.e. the number of digits after the decimal point. Online emulator of the calculator: https://dutchen18.gitlab.io/emu-rs/ .
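The register layout described above can be sketched in a few lines of Python. This is only an illustration, not the calculator's firmware: the function name busicom_register is made up for this example, and the contents of cells S2 and S3 are not described in the text, so they are simply assumed to be 0.

```python
def busicom_register(digits: str, decimals: int, negative: bool) -> list[int]:
    """Return the 20 cells M0..M15, S0..S3 (index 0 = lowest RAM address).

    digits   -- significant digits without sign or point, e.g. "7543"
    decimals -- number of digits after the decimal point
    """
    if not digits.isdigit() or len(digits) > 16:
        raise ValueError("mantissa must be 1 to 16 decimal digits")
    if not 0 <= decimals <= 15:
        raise ValueError("decimal point position must fit in one cell")
    mantissa = [int(ch) for ch in reversed(digits)]  # M0 holds the lowest digit
    mantissa += [0] * (16 - len(mantissa))           # unused high digits are 0
    # S0 = sign flag, S1 = decimal point position; S2 and S3 are ASSUMED to be 0
    status = [15 if negative else 0, decimals, 0, 0]
    return mantissa + status


cells = busicom_register("7543", 2, negative=True)   # the number -75.43
print(cells[:16])   # mantissa M0..M15
print(cells[16:])   # status cells S0..S3
```

For -75.43 the mantissa cells start with 3, 4, 5, 7 (lowest digit first), S0 is 15 and S1 is 2.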

In 1973, Texas Instruments released its Datamath TI-2500 calculator with the TMS0800 processor. Unlike the Intel processor, this processor was aimed specifically at calculators and performed integer operations with a single instruction, using a mask to determine which digits to perform the operation with. Example of displaying the number -75.43:

The number register contains 11 BCD digits. Cell S10 contains the sign: 0 positive, 14 negative. This is followed by the 8 digits of the mantissa, with the highest digit in S9 and the lowest digit in S2. S1 and S0 contain the position of the decimal point, i.e. the number of digits after the decimal point. When displayed, the number is right-aligned with the sign, i.e. the extra zeros after the decimal point are dropped. Suppression of insignificant leading zeros is provided by the hardware. During calculations, the sign cell is used as a 9th digit.
Online emulator of the calculator together with a detailed description: http://files.righto.com/calculator/TI_calculator_simulator.html .

Clive Sinclair had the idea of selling a cheap calculator that could do scientific and technical calculations. He wanted to use the TMS0800 processor (used in the previous Datamath calculator) because he could get it cheaply from Texas Instruments. The engineers at Texas Instruments laughed at him, saying it was impossible: the processor had no such facilities and only had space for 320 instructions. Incredibly, Sinclair did it, and in 1974 he started selling a cheap calculator that could do sine, cosine, tangent, arcsine, arccosine, arctangent, logarithm, and exponential. Example of displaying the number -75.43:

For better code optimization, the numbers in the registers are arranged differently than in the Datamath. The first cell, S10, holds the sign of the mantissa. The second cell, S9, holds the sign of the exponent. The next 2 digits, S8 and S7, are the exponent. The mantissa follows. A mantissa of 6 digits is used during calculations and starts with the digit S6. During display, the mantissa has 5 digits and starts at position S5. The calculator always displays numbers in the form: the sign of the mantissa, 5 digits of the mantissa (with a decimal point after the first digit), the sign of the exponent, and 2 digits of the exponent.
The Sinclair Scientific became the first low-cost scientific calculator built on a single chip. However, it was not a great success, and sales were discontinued in 1979 due to falling prices of competing calculators. It is currently enjoying a revival as a demonstration of almost miraculous code optimization in a small space. An online emulator with a detailed description of the calculator is available at http://files.righto.com/calculator/sinclair_scientific_simulator.html , and many replicas have been built: https://simpleavr.github.io/tms0800/index.html , https://hackaday.com/2018/06/22/your-own-sinclair-scientific-calculator/ .

The TI-57 programmable calculator from Texas Instruments, with the TMC1501 processor, 1977, works on similar principles as the Datamath. The number is stored in 14 BCD digits. The highest digit, D13, contains the flags: B0 negative mantissa, B1 negative exponent, B2 inverse. Inversion means that the mantissa has been temporarily changed to a binary complement. Digits D12 to D2 contain the 11 digits of the mantissa (D12 contains the highest digit). Digits D1 and D0 contain the exponent 00 to 99 (D1 contains the highest digit). Example of displaying the number -75.43:

Online emulator of the calculator: https://www.pcjs.org/machines/ti/ti57/rev0/ .

The popular TI-59 calculator (1978) from Texas Instruments uses the TMC0501E processor, has more memory than the TI-57, and uses larger registers to store numbers. The number is stored in 16 BCD digits. Digits D15 to D3 contain 13 digits of mantissa (the highest digit in D15). Digits D2 and D1 contain the exponent 00 to 99 (the higher digit in D2). Cell D0 contains flags: B0 negative mantissa, B1 negative exponent, B2 inverse. Example of displaying the number -75.43:
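As a rough illustration of the TI-59 register layout, the following Python sketch fills the 16 digits D15 to D0. The function name ti59_register is hypothetical. Two things are assumptions not stated in the text: that the mantissa is normalized with the decimal point after the first digit (so -75.43 becomes 7.543 with exponent 1), and that a negative exponent is stored as its magnitude plus the B1 flag.

```python
def ti59_register(digits: str, exponent: int, negative: bool = False) -> list[int]:
    """Return digits D0..D15 as a list (index = digit number).

    digits   -- mantissa digits, first digit before the decimal point (ASSUMED)
    exponent -- decimal exponent, -99 to 99
    """
    if not digits.isdigit() or len(digits) > 13:
        raise ValueError("mantissa must be 1 to 13 decimal digits")
    if not -99 <= exponent <= 99:
        raise ValueError("exponent must be in the range -99 to 99")
    reg = [0] * 16
    flags = 0
    if negative:
        flags |= 0b001            # B0: negative mantissa
    if exponent < 0:
        flags |= 0b010            # B1: negative exponent (magnitude stored, ASSUMED)
    reg[0] = flags                # D0: flags (B2 "inverse" left clear)
    reg[1] = abs(exponent) % 10   # D1: low digit of the exponent
    reg[2] = abs(exponent) // 10  # D2: high digit of the exponent
    for i, ch in enumerate(digits.ljust(13, "0")):
        reg[15 - i] = int(ch)     # D15 holds the highest mantissa digit
    return reg


reg = ti59_register("7543", 1, negative=True)   # -7.543 * 10^1 = -75.43
print(list(reversed(reg)))                       # printed from D15 down to D0
```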


From calculators, we are moving on to 8-bit computers. We are moving from BCD code to a binary representation of a number, which is more advantageous for 8-bit processors because of faster computations.
The popular ZX-81 (1981) and ZX-Spectrum (1982) computers store numbers in 5 bytes in binary code. The first byte, D0, located at the lowest address, contains an exponent with a bias (=offset) of 0x80 (128). Unlike calculators operating in BCD code, where the exponent is a direct power of base 10, here it is a binary exponent. It represents the number of bit shifts (or multiplication and division by 2) by which we must shift the mantissa to bring it into the interval 0.5 (inclusive) to 1.0 (exclusive). Add the bias 0x80, or the mean value of the exponent, to the number of shifts. This is so that we can work with the exponent as a positive number - after all, an unsigned number is easier to work with in processors than a signed number (the shift operations generate a carry flag that can be used to check for overflow).
Note that the interpretation of the bias is different from the IEEE754 standard. For the ZX Spectrum, the mantissa m is considered normalized in the interval 0.5 <= m < 1.0 (the bias is given as 0x80), while for the float number of the IEEE754 standard, the normalized mantissa is in the interval 1.0 <= m < 2.0 (bias 0x7F). The number 1.0 is stored in the ZX Spectrum as 0x81 0x00 0x00 0x00 0x00 (the exponent is 0x81) and in the IEEE754 standard as 0x00 0x00 0x80 0x3F (the exponent is 0x7F). Thus, the stored exponent differs by 2 even though the stated bias differs by only 1.
The following 4 bytes contain the mantissa. The first byte (at the lowest address) contains the high byte of the mantissa. The mantissa is used in normalized form, i.e. the number is shifted upward until the highest bit of the mantissa has a value of "1". Since the number is always normalized, the highest bit of the mantissa would always be "1". We can therefore omit this bit and put the sign bit in its place.
The exceptional value is the number 0. We couldn't express this with an exponent and a mantissa (for the mantissa, we dropped the highest bit "1", which would otherwise be used to indicate a zero). The special case of the number 0 is expressed by an exponent with a value of zero. Thus, the exponent will take the values 0x01 to 0xFF for valid non-zero numbers and the value 0x00 for the number 0.
The binary exponent has a range of +-127 (0x01-0x80 to 0xFF-0x80). To find the range of the resulting decimal exponent, multiply the binary exponent by log10(2). Thus, 127 * 0.30103 = 38.2. The decimal exponent has a range of +-38.
We calculate the precision of the mantissa by counting each bit as log10(2) = 0.30103 decimal digits. The mantissa has 32 bits (including the hidden "1"), so its precision is 32 * 0.30103 = 9.6 digits. The computer performs internal calculations to 9.6 digits, but because some precision is lost during intermediate calculations, it displays the result rounded to 8 digits for the user.
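The same two rules of thumb apply to every binary format in this article, so they can be written once as a small Python sketch. The function names are chosen only for this example.

```python
import math

LOG10_2 = math.log10(2)   # about 0.30103 decimal digits per bit


def decimal_exponent_range(max_binary_exponent: int) -> float:
    """Approximate decimal exponent range for a given binary exponent range."""
    return max_binary_exponent * LOG10_2


def decimal_digits(mantissa_bits: int) -> float:
    """Approximate decimal precision of a mantissa, hidden bit included."""
    return mantissa_bits * LOG10_2


# ZX-81 / ZX Spectrum: exponent range +-127, 32-bit mantissa
print(round(decimal_exponent_range(127), 1))   # 38.2
print(round(decimal_digits(32), 1))            # 9.6
```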
The ZX Spectrum goes even further in the format (this does not apply to the ZX-81). It uses the exponent value 0 for yet another purpose: integers. If the first byte, D0, has a value of 0, the number is stored as a 16-bit integer: byte D1 holds the sign (0x00 for a positive number, 0xFF for a negative one), bytes D2 and D3 hold the value with the low byte first, and byte D4 is 0. Thus, a number with a value of 0 is expressed by zero content in all of these bytes. Using an integer as part of the format has the great advantage that precision is not lost during float operations and that calculations with integers can be faster. During calculations, the Spectrum first checks that both operands are integers and that the result again fits in an integer. If so, an accelerated calculation using integers is performed. If not, the two operands are converted to float numbers and the calculation is performed in the usual way using float arithmetic.
The combination of float and integer format is particularly important in languages that use a single numeric format, as BASIC typically does. If a float number is used, for example, as a loop counter, it may deviate from the correct value after many iterations, and the number of loop passes may be off. This is understandable from the point of view of the binary representation of float numbers, but hard for the user to understand. In this case, the Spectrum will use integers for the counter and the error will not occur.
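The loop-counter problem is easy to reproduce on any machine with binary floats. The sketch below uses Python's double-precision floats rather than the Spectrum's 5-byte format, so the exact values differ from what a Spectrum would compute, but the effect is the same kind of accumulated representation error.

```python
# Counting with a float step: ten additions of 0.1 do not land exactly on 1.0.
total = 0.0
float_steps = 0
while total != 1.0 and float_steps < 20:   # safety limit so the loop always ends
    total += 0.1
    float_steps += 1

# Counting whole tenths with integers: the end value is hit exactly.
tenths = 0
int_steps = 0
while tenths != 10 and int_steps < 20:
    tenths += 1
    int_steps += 1

print(float_steps, int_steps)   # 20 10
```

The float loop misses its end value and only stops at the safety limit, while the integer loop stops after exactly 10 passes.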

In the example you can see how the number with the value -75.43 looks in ZX Spectrum. In hexadecimal form, the values are 0x87 0x96 0xDC 0x28 0xF6. How did we arrive at such a number? Let's take an unsigned number, 75.43. We repeatedly divide the number by 2 until it falls in the interval 0.5 (inclusive) to 1.0 (exclusive). We need 7 shifts to do this, so divide by 128. After adding the bias 0x80, the first byte with the exponent will be 0x80 + 7 = 0x87.
Next, we convert the resulting mantissa 0.589296875 to an integer expression. The result should be a 4-byte number, so we multiply the mantissa by 2^32. So 0.589296875 * 2^32 = 2531010805.76. After rounding to the integer 2531010806, in hex code 0x96DC28F6. Remove the redundant high bit "1", the number will be 0x16DC28F6. And since the number is negative, we set the high bit again and get the resulting value of the next 4 bytes, 0x96 0xDC 0x28 0xF6 (we store in memory starting from the highest bytes).
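The conversion steps above can be sketched in Python as follows. This is a minimal illustration of the 5-byte floating-point form only; the function name zx_float is made up for this example, and the Spectrum's special integer form is not handled.

```python
import math


def zx_float(value: float) -> bytes:
    """Encode a value in the 5-byte floating-point form (exponent + 4 mantissa bytes)."""
    if value == 0:
        return bytes(5)                          # exponent 0x00 marks zero
    mantissa, exponent = math.frexp(abs(value))  # mantissa in 0.5 <= m < 1.0
    m = round(mantissa * 2**32)                  # mantissa as a 32-bit integer
    if m == 2**32:                               # rounding overflowed the top bit
        m >>= 1
        exponent += 1
    biased = exponent + 0x80                     # add the bias
    if not 1 <= biased <= 0xFF:
        raise OverflowError("exponent out of range")
    m &= 0x7FFFFFFF                              # remove the hidden high bit "1"
    if value < 0:
        m |= 0x80000000                          # the sign bit takes its place
    return bytes([biased]) + m.to_bytes(4, "big")  # highest mantissa byte first


print(zx_float(-75.43).hex(" "))   # 87 96 dc 28 f6
```

math.frexp performs the repeated division by 2 in one step: for 75.43 it returns the mantissa 0.589296875 and the shift count 7.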
Other examples of numbers: 1.0 = 0x81 0x00 0x00 0x00 0x00 (in floating-point form), pi = 0x82 0x49 0x0F 0xDA 0xA2, 10 = 0x00 0x00 0x0A 0x00 0x00 in the Spectrum's integer form (for the ZX-81, 10 = 0x84 0x20 0x00 0x00 0x00).
Note the byte order. In the ZX Spectrum, the number is stored starting with the most significant bytes, beginning with the exponent. This is because testing the exponent for zero is the most common operation on a number, so it is preferable to have it at the beginning of the number. The second most common operation is the sign test, and the sign is located in the second byte of the number.

AMOS from 1986 is the operating system for IQ 151 computers. The representation of "real" numbers in AMOS Pascal is almost identical to the modern single-precision (float) form of numbers. Example of the number -75.43:

The number is stored in 4 bytes. The highest byte, D3, contains the exponent. The exponent is an unsigned 8-bit number with a bias (offset) of 0x7F (127). The exponent represents the number of binary shifts (multiplications or divisions by 2) needed to bring the mantissa into the range 1.0 (inclusive) to 2.0 (exclusive). The mantissa is stored in bytes D0 to D2; the high bit "1" is hidden and replaced by the sign bit. The decimal exponent has a range of +-38, and the mantissa has a precision of 7.2 digits and is displayed to a maximum of 6 digits. Other examples of numbers (from the lowest byte): 1.0 = 0x00 0x00 0x00 0x7F, -1.0 = 0x00 0x00 0x80 0x7F, pi (3.1415927) = 0xDB 0x0F 0x49 0x80.
The sign bit is not stored in the highest bit of the number (above the exponent), as in the IEEE754 standard, but in the highest bit of the mantissa, in place of the hidden "1" bit. This is more convenient for software implementation: the exponent can be easily manipulated as a single byte, and the sign bit can be replaced by the hidden bit during calculations.
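Because the AMOS form differs from an IEEE754 float only in where the sign bit sits, one can be derived from the other by moving that bit. The Python sketch below does this for ordinary numbers; the function name amos_real is hypothetical, and how AMOS treats zero or other special values is not covered by the text, so those cases are outside this sketch.

```python
import struct


def amos_real(value: float) -> bytes:
    """Rearrange an IEEE754 single into the 4-byte layout described above.

    Returned bytes are D0..D3, lowest address first.
    """
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    sign = bits >> 31               # IEEE: sign is bit 31
    exponent = (bits >> 23) & 0xFF  # IEEE: exponent is bits 30..23, bias 0x7F
    mantissa = bits & 0x7FFFFF      # 23 stored mantissa bits
    # AMOS: exponent fills the whole top byte, sign replaces the hidden bit
    word = (exponent << 24) | (sign << 23) | mantissa
    return word.to_bytes(4, "little")


print(amos_real(3.1415927).hex(" "))   # db 0f 49 80
```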

Modern computers follow the IEEE754 standard, dating from 1985, as the standard for floating-point numbers. As a typical representative we will show the double-precision number format.
(image source: Wikipedia)
The double number has a size of 64 bits (8 bytes). The lowest 52 bits contain the mantissa. The mantissa is stored without the most significant bit "1", so it is actually 53 bits long. The sign bit is not stored with the mantissa in place of the hidden bit "1", but in the highest bit of the number, bit 63. The rationale for this is the ease of comparing and sorting numbers: numbers can be compared in much the same way as integers. At the same time, unfortunately, it is a bit of an inconvenience for software implementation of calculations: during calculations it is useful to temporarily restore the hidden bit "1" by substituting it for the sign bit, which cannot be done with this format. But it is not that big a problem, since it is better to do internal calculations with higher precision anyway (so as not to accumulate intermediate calculation error) and convert between the formats. This is also what the math coprocessor does: it uses 80 bits as its internal "extended precision" format, with the hidden bit restored during calculations. The achievable mantissa precision is 53 * 0.30103 = 16.0 digits.
The exponent is 11 bits, with a bias (offset) of 0x3FF (1023). The exponent represents the number of bit shifts (multiplications or divisions by 2) by which we must shift the mantissa to bring it into the interval 1.0 (inclusive) to 2.0 (exclusive). Add the bias 0x3FF, or the mean value of the exponent, to the number of shifts. To calculate the range of the exponent, we multiply the binary exponent by log10(2), so 1023 * 0.30103 = 308. The decimal exponent has a range of +-308.
The double-precision format uses some exponent values for special cases. Normal numbers use the exponent values 0x001 to 0x7FE. The special exponent value 0x000 indicates both zero (when the mantissa has zero content; zero can have a sign, oddly enough) and numbers with a very small exponent, the "subnormals". Subnormals are small unnormalized numbers for which the hidden high bit "1" is not assumed. In this case, the highest set bit need not be in the highest position of the mantissa.
The opposite end, exponent 0x7FF, is reserved for special values: infinity when the mantissa is zero and NaN ("not a number") when it is non-zero, shown for example as 1#INF, -1#INF, 1#NAN, -1#NAN.
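The role of the exponent field can be checked directly by splitting a double into its three fields. The following Python sketch uses only the standard struct module; the function names double_fields and classify are made up for this example.

```python
import struct


def double_fields(value: float) -> tuple[int, int, int]:
    """Split a double into (sign, biased exponent, stored mantissa)."""
    bits = int.from_bytes(struct.pack("<d", value), "little")
    sign = bits >> 63                   # bit 63
    exponent = (bits >> 52) & 0x7FF     # bits 62..52, bias 0x3FF
    mantissa = bits & ((1 << 52) - 1)   # bits 51..0, hidden bit not stored
    return sign, exponent, mantissa


def classify(value: float) -> str:
    """Name the class of a double from its exponent and mantissa fields."""
    _, exponent, mantissa = double_fields(value)
    if exponent == 0x000:
        return "zero" if mantissa == 0 else "subnormal"
    if exponent == 0x7FF:
        return "infinity" if mantissa == 0 else "nan"
    return "normal"


for x in (1.0, -0.0, 5e-324, float("inf"), float("nan")):
    print(x, double_fields(x)[:2], classify(x))
```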
Example numbers (bytes in order from the lowest address, reversed from the bits in the figure): 1.0 = 0x00 0x00 0x00 0x00 0x00 0x00 0xF0 0x3F, pi (3.1415926535897932) = 0x18 0x2D 0x44 0x54 0xFB 0x21 0x09 0x40.
Single-precision numbers, or "floats", are 32 bits (4 bytes) in size. The lowest 23 bits contain the mantissa, stored without the hidden high bit "1". The achievable mantissa precision is 24 * 0.30103 = 7.2 digits. The sign is stored in the highest bit of the number, bit 31. Between the mantissa and the sign are 8 bits of exponent, with a bias of 0x7F (127). The decimal exponent has a range of 127 * 0.30103 = +-38.
(image source: Wikipedia)
Examples of numbers (from the lowest byte): 1.0 = 0x00 0x00 0x80 0x3F, pi (3.1415927) = 0xDB 0x0F 0x49 0x40.
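The byte sequences for both IEEE754 formats can be reproduced with Python's standard struct module, which packs a value as a little-endian float ("<f") or double ("<d"). This is just a quick way to check the examples on a modern machine.

```python
import struct

for name, value in (("1.0", 1.0), ("pi", 3.141592653589793)):
    single = struct.pack("<f", value)   # 4 bytes, lowest address first
    double = struct.pack("<d", value)   # 8 bytes, lowest address first
    print(name, "float :", single.hex(" "))
    print(name, "double:", double.hex(" "))
```

For pi this prints db 0f 49 40 for the float and 18 2d 44 54 fb 21 09 40 for the double.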
If we are not bound to standards, we can make up any float number format we want. The ET-58 calculator is a replica based on the popular TI-58 that extends the features of the original calculator. Numbers are stored here in 10 bytes, in binary format. The first 2 bytes contain the exponent with a bias of 0x8000 (32768). The range of the decimal exponent is +-9863. The next 8 bytes are the mantissa, stored from higher bytes to lower bytes, with the hidden bit "1" replaced by the sign. The precision of the mantissa is 19.3 digits. Example of the number -75.43:

To convert a number to the internal format, the number is repeatedly divided by 2 until it falls in the interval 1.0 (inclusive) to 2.0 (exclusive). For the number 75.43, this takes 6 shifts. The exponent with a bias of 0x8000 will have a value of 0x8006; this is the first 2 bytes of the number. After dividing by 2^6 = 64, the number will be 1.17859375. To get the mantissa into the required form, expressed as an 8-byte integer, we need to multiply it by 2^63 (the mantissa will have 64 bits, 1 bit of which is already accounted for by the interval 1..2). We get 1.17859375 * 2^63 = 10870608636561808424.96, which rounds to the integer 10870608636561808425. In hex form, this is 0x96DC28F5C28F5C29. We remove the hidden high "1": 0x16DC28F5C28F5C29. And we set the high bit again because the number is negative: 0x96DC28F5C28F5C29. So the next 8 bytes of the number will be: 0x96 0xDC 0x28 0xF5 0xC2 0x8F 0x5C 0x29.
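The same steps can be followed in Python. A 64-bit mantissa is more than a double can hold, so this sketch starts from the decimal text and uses exact fractions. The function name et58_number is made up for this example, it is not the ET-58 firmware, zero is not handled, and storing the high byte of the exponent first is an assumption.

```python
from fractions import Fraction


def et58_number(text: str) -> bytes:
    """Encode a decimal string as 2 exponent bytes + 8 mantissa bytes."""
    value = Fraction(text)
    if value == 0:
        raise ValueError("zero is not covered by this sketch")
    magnitude = abs(value)
    exponent = 0
    while magnitude >= 2:          # shift down into 1.0 <= m < 2.0
        magnitude /= 2
        exponent += 1
    while magnitude < 1:           # or shift up for small numbers
        magnitude *= 2
        exponent -= 1
    m = round(magnitude * 2**63)   # 64-bit integer mantissa, top bit = hidden "1"
    if m == 2**64:                 # rounding overflowed the top bit
        m >>= 1
        exponent += 1
    m &= (1 << 63) - 1             # remove the hidden high bit "1"
    if value < 0:
        m |= 1 << 63               # the sign bit takes its place
    # bias 0x8000; high byte first is ASSUMED for the exponent
    return (exponent + 0x8000).to_bytes(2, "big") + m.to_bytes(8, "big")


print(et58_number("-75.43").hex(" "))   # 80 06 96 dc 28 f5 c2 8f 5c 29
```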
website of the project: ../../hw/et58/

Miroslav Nemecek