Real Number, Floating Point

This page was translated by a robot.

In mathematics, the set of real numbers comprises all numbers with an arbitrary number of digits after the decimal point. Since only a limited number of bits is available in a computer, there are special encodings that can map a precisely defined part of the real number range. All of these encodings are loosely referred to as real numbers, and processors accordingly provide special machine instructions that can handle them. While encodings such as fixed point or binary coded decimals (BCD) were used in earlier times, the versatile floating point encoding has prevailed for several decades.

Floating point calculations are processed in the so-called floating point unit (FPU). It stores real numbers in exponential representation, i.e. the number is split into a mantissa and an exponent. By choosing a suitable exponent, the separator (the point or the comma) between the integer part and the fractional part can be moved in such a way that both very large and very small numbers can be stored with the same encoding. This shifting of the separator gave the encoding the name floating point number. See also the comment on terminology at the bottom of the page.

Details

The floating point encoding is defined by the IEEE 754 standard. It is motivated by the exponential representation of decimal numbers. For example, the number 0.00456 can be written in exponential representation as 4.56 * 10^-3, or briefly 4.56e-3. Here, the part 4.56 is called the mantissa and the part -3 is called the exponent. The corresponding base 10 is given implicitly and is simply omitted. This split into mantissa and exponent allows both very large and very small numbers to be specified in compact notation. Floating point encoding does the same.
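The %e format of printf, for example, displays exactly this exponential representation (a minimal sketch):

#include <cstdio>

int main() {
    double x = 0.00456;
    printf("%e\n", x);   // prints 4.560000e-03: mantissa 4.56, exponent -3
    return 0;
}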

The base of the exponential representation is also called the radix. While the radix-10 representation is common for humans, most FPUs calculate with radix-2, i.e. in the binary system. There are also radix-10 FPUs in circulation, but the following explanation assumes an FPU that calculates in radix-2.

A floating point number always has 1 sign bit, which is 0 for positive numbers and 1 for negative numbers. The remaining bits are divided between the mantissa and the exponent. The IEEE 754 standard defines two bit distributions in particular: a 32-bit floating point number with 23 mantissa bits and 8 exponent bits for single precision, and a 64-bit floating point number with 52 mantissa bits and 11 exponent bits for double precision. Processor manufacturers can, in principle, implement any additional encoding (and they do). The following explanation uses the bit sizes of single precision (1 sign bit, 23 mantissa bits, 8 exponent bits):

The exponent is encoded as an unsigned integer with a so-called bias. Unlike a complement representation, negative exponents are obtained simply by subtracting the bias from the unsigned integer. If 8 bits are reserved for the exponent and the bias is set to 127, then the stored exponents [0, 255], i.e. the actual exponents [-127, 128], can be encoded. However, the smallest and the largest exponent are reserved for special numbers (see below), so the exponents [-126, 127] remain representable.
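A minimal sketch of the bias arithmetic (assuming 8 exponent bits and a bias of 127):

#include <cstdio>

int main() {
    const int bias = 127;
    unsigned storedExponent = 124;                    // value read from the 8 exponent bits
    int actualExponent = (int)storedExponent - bias;  // 124 - 127 = -3
    printf("stored %u -> actual %d\n", storedExponent, actualExponent);
    return 0;
}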

The mantissa is interpreted as an unsigned integer. For a mantissa with 23 bits, this results in the (integer) value range [0, 8388607]. If this range is divided by the number 8388608 (= 2^23), the result is the (real) value range [0, 1). Since the number 8388608 covers ALMOST but not quite all numbers from 0 to 9999999 (the largest 7-digit decimal number), the real value range [0, 1) can only be resolved to about 6 decimal places. The smallest distinguishable unit of the mantissa is 1 / 8388608 = 1.192093e-7. This value is called the machine epsilon.
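A small check of this value against the FLT_EPSILON constant from the float.h / cfloat header (a sketch, not a formal proof):

#include <cstdio>
#include <cfloat>

int main() {
    double eps = 1.0 / 8388608.0;                        // 1 / 2^23
    printf("1 / 2^23    = %e\n", eps);                   // 1.192093e-07
    printf("FLT_EPSILON = %e\n", (double)FLT_EPSILON);   // same value
    return 0;
}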

When encoding the mantissa, a so-called normalization is also carried out, which means the following: The exponential representation is not unique. For example, the number 4.56e-3 is identical to the number 45.6e-4. This ambiguity is unsuitable for an encoding. To remedy it, only that exponential representation is defined as valid which has exactly one digit in front of the separator, and this digit must not be zero. For a radix-2 FPU this means: the most significant bit must be 1. Since this bit is always 1, it can simply be omitted and implicitly assumed in calculations by adding the integer value of this bit (2^23 = 8388608). According to the above calculation, this corresponds to simply adding 1 to the real-valued mantissa in the range [0, 1). If the mantissa of a number cannot be normalized because of the limited exponents, it is a denormalized number (see below).

Finally, the following (pseudo) calculation results for a floating point coded radix-2 number:

Decimal number = (-1)^SignBit * (1 + Mantissa / 2^MantissaBits) * 2^(Exponent - Bias)
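The following minimal sketch applies this calculation to a concrete bit pattern. It assumes the single-precision layout described above (1 sign bit, 8 exponent bits, 23 mantissa bits, bias 127) and a normalized number:

#include <cstdio>
#include <cstdint>
#include <cstring>
#include <cmath>

int main() {
    float f = -6.25f;

    uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);          // reinterpret the 32-bit pattern

    uint32_t sign     = (bits >> 31) & 0x1;       // 1 sign bit
    uint32_t exponent = (bits >> 23) & 0xFF;      // 8 exponent bits
    uint32_t mantissa =  bits        & 0x7FFFFF;  // 23 mantissa bits

    double value = (sign ? -1.0 : 1.0)
                 * (1.0 + mantissa / 8388608.0)          // 1 + Mantissa / 2^23
                 * std::pow(2.0, (int)exponent - 127);   // 2^(Exponent - Bias)

    printf("decoded: %f (original: %f)\n", value, (double)f);
    return 0;
}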

+-Infinity, Denormalized Numbers, +-Zero, NaN

Since the value range of floating point numbers is limited, overflows and underflows can occur. An overflow occurs when the absolute value of a floating point number is too large to be represented with the exponent. When this happens, the exponent is set to the maximum value and the mantissa to 0. Together with the sign, this number is interpreted as +-infinity, i.e. plus/minus infinity.

An underflow occurs when the absolute value of a floating point number is too small to be represented by the exponent. In this case the exponent is set to the minimum value, but the mantissa is kept. Because this mantissa has no implicit 1 before the separator, it is a denormalized number (sub-normal in English). The IEEE 754 standard specifies that denormalized numbers are to be interpreted as equally spaced numbers towards the value zero. However, these numbers are considered extremely imprecise and are supported by FPUs only for the sake of completeness, often with severe performance losses. Because of this, some FPUs can be configured to simply set the mantissa of denormalized numbers to 0 and round the number to 0.
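A minimal sketch of how overflow and underflow can be provoked (assuming IEEE 754 float and that the FPU has not been set to flush denormals to zero):

#include <cstdio>
#include <limits>

int main() {
    float biggest  = std::numeric_limits<float>::max();   // largest finite float
    float smallest = std::numeric_limits<float>::min();   // smallest normalized float

    float overflow  = biggest * 2.0f;    // exponent too large -> +inf
    float underflow = smallest / 4.0f;   // exponent too small -> denormalized number

    printf("overflow : %e\n", overflow);
    printf("underflow: %e\n", underflow);
    printf("underflow is denormal: %d\n", underflow > 0.0f && underflow < smallest);
    return 0;
}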

The denormalized numbers are comparable to what is known in some scientific fields as a singularity : any number that falls below a certain threshold can no longer be represented exactly.

The number zero must be treated specially in floating point encoding. Since there is no normalized representation of the number zero, it also belongs to the denormalized numbers. Since every number, including a denormalized one, always has a sign, the numbers +0 and -0 can arise. The special treatment of the number zero is that +0 and -0 must be considered equal.
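A minimal sketch: +0 and -0 compare equal, yet they are not the same bit pattern, which becomes visible when dividing by them:

#include <cstdio>

int main() {
    double pz = +0.0;
    double nz = -0.0;

    printf("+0 == -0: %d\n", pz == nz);    // 1: they must compare equal
    printf("1 / +0 = %f\n", 1.0 / pz);     // inf
    printf("1 / -0 = %f\n", 1.0 / nz);     // -inf
    return 0;
}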

Furthermore, floating point encoding defines another special number: NaN (Not a Number). This number results from an undefined calculation, for example when the square root of a negative number is taken or zero is divided by zero. For a NaN, the exponent is set to the maximum possible value and the mantissa to a non-zero value. The exact specification of the mantissa can be found in other sources.
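A minimal sketch that produces a NaN in the two ways mentioned above; note that a NaN even compares unequal to itself:

#include <cstdio>
#include <cmath>

int main() {
    double nan1 = std::sqrt(-1.0);   // square root of a negative number
    double zero = 0.0;
    double nan2 = zero / zero;       // zero divided by zero

    printf("nan1 = %f\n", nan1);
    printf("nan1 == nan1: %d\n", nan1 == nan1);        // 0: NaN is not equal to itself
    printf("nan2 is NaN : %d\n", (int)std::isnan(nan2));
    return 0;
}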

Calculation Errors, Numerical Accuracy

Floating point numbers are usually not a problem for everyday programming. They are able to cover the real number range with sufficient accuracy. However, due to the limited number of bits, they are not always completely exact.

A precise description of the problems that can occur with floating point numbers would go far beyond the scope of this page. It is left to the reader to find out from other sources why 3 * 1 / 3 yields the number 1, whereas 1 / 3 * 3 yields the number 0.9999999.
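As one further illustration of this limited accuracy (a minimal sketch; the decimal fraction 0.1 has no exact radix-2 representation):

#include <cstdio>

int main() {
    double sum = 0.1 + 0.2;
    printf("%.17f\n", sum);                    // 0.30000000000000004
    printf("sum == 0.3: %d\n", sum == 0.3);    // 0: not exactly equal
    return 0;
}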

Floating Point Numbers in C and C++

In some programming languages there is a data type called Real. In C and C++, on the other hand, the types float, double and long double are used. They are encoded as floating point numbers with varying precision. The type float is single precision as defined in the IEEE 754 standard and the type double is double precision. The type long double is a so-called extended type with a few bits more precision. However, the exact definition of this type is a chapter in itself, which is why the reader is referred to other sources.
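A minimal sketch of the sizes (assuming a typical platform where float and double correspond to the two IEEE 754 formats; the size of long double varies between compilers):

#include <cstdio>

int main() {
    printf("float      : %zu bits\n", sizeof(float)       * 8);   // typically 32 (single)
    printf("double     : %zu bits\n", sizeof(double)      * 8);   // typically 64 (double)
    printf("long double: %zu bits\n", sizeof(long double) * 8);   // varies (extended type)
    return 0;
}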

In C and C++, the type double is considered the default type. If, for example, a floating point number carries no type specification (see also Type Promotion), the type double is assumed automatically. Literal values in the program code are also assumed to be of type double by default, unless the number is given a suffix: f for float and l for long double.
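A minimal sketch of these suffixes:

int main() {
    float       f  = 1.5f;   // suffix f  -> float
    double      d  = 1.5;    // no suffix -> double (the default)
    long double ld = 1.5l;   // suffix l  -> long double
    (void)f; (void)d; (void)ld;   // suppress unused-variable warnings
    return 0;
}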

The type float can normally be processed significantly faster by an FPU than double, and a float needs only 32 bits, whereas a double needs 64. The precision of the float type is perfectly sufficient for many everyday applications. In science, however, it quickly reaches its limits, which is why programming there is mainly done with the type double.

The specifications for the encoding of these types are laid down in the float.h header. It should be noted that the number of mantissa bits specified there includes the implicitly added bit.
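A minimal sketch that prints some of these specifications (the constant names are from float.h / cfloat):

#include <cstdio>
#include <cfloat>

int main() {
    // FLT_MANT_DIG / DBL_MANT_DIG count the mantissa bits including the implicit bit.
    printf("float : %d mantissa bits, epsilon %e\n", FLT_MANT_DIG, (double)FLT_EPSILON);
    printf("double: %d mantissa bits, epsilon %e\n", DBL_MANT_DIG, DBL_EPSILON);
    return 0;
}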

Fixed Point Arithmetic

In earlier times, numbers with decimal places were encoded as so-called fixed point numbers: it was simply defined how many bits were used for the digits before the separator and how many for the digits after it. The part before the separator (magnitude or integer part) was then interpreted as an integer (e.g. using the two's complement) and the part after the separator as a fraction (fractional part). Fractions are mathematically referred to as rational numbers, which is why bit size specifications of fixed point types are often given with the character Q.

Fixed point numbers are hardly standardized, but they are easy to define and easy to implement yourself. For example, there is a fixed point type with 1 sign bit, 15 integer bits and 16 fraction bits, or a fixed point type with 1 sign bit and 31 fraction bits. Theoretically, an 8-bit type with 1 sign bit, 4 integer bits and 3 fraction bits is also conceivable, for school grades, for example. However, what the sign means should be discussed with the teacher beforehand.
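As a minimal sketch (assuming the first variant mentioned above: 1 sign bit, 15 integer bits, 16 fraction bits, often written Q15.16), such a type can be emulated with an ordinary 32-bit integer:

#include <cstdio>
#include <cstdint>

// Q15.16: the value is raw / 2^16, stored in a signed 32-bit integer.
typedef int32_t q15_16;

static q15_16 to_fixed(double x)   { return (q15_16)(x * 65536.0); }
static double from_fixed(q15_16 x) { return x / 65536.0; }

static q15_16 fixed_mul(q15_16 a, q15_16 b) {
    // Multiply in 64 bits, then shift the separator back by 16 bits.
    return (q15_16)(((int64_t)a * (int64_t)b) >> 16);
}

int main() {
    q15_16 a = to_fixed(3.25);
    q15_16 b = to_fixed(1.5);
    printf("3.25 + 1.5 = %f\n", from_fixed(a + b));            // addition works directly
    printf("3.25 * 1.5 = %f\n", from_fixed(fixed_mul(a, b)));  // 4.875
    return 0;
}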

By specifying the bits, the value range and thus the behaviour of the arithmetic are defined very precisely, just as with integers, but they are also severely limited. For example, it is hardly possible to represent numbers such as physical constants with sufficient accuracy, since they are often either very large or very small (e.g. the mass of an atom: 1.66 * 10^-27). Some technologies still use fixed point arithmetic today, but for everyday programming needs floating point numbers have prevailed.

Binary Coded Decimals (BCD)

Since computers usually work with radix-2, both floating point numbers and fixed point numbers have the problem that they cannot represent decimal fractions (tenths, hundredths, thousandths, ...) exactly. However, in some areas of programming, such as the financial sector, the exact handling of decimal fractions is critical to success.

Accordingly, some processors have supported BCD encoding. In this encoding, a number is split into its digits and each decimal digit is encoded individually into 8 or 4 bits:

Digit   8 Bits     4 Bits
0       00000000   0000
1       00000001   0001
2       00000010   0010
3       00000011   0011
4       00000100   0100
5       00000101   0101
6       00000110   0110
7       00000111   0111
8       00001000   1000
9       00001001   1001

The remaining possible combinations of the 4 or 8 bits are simply not used. With 8 bits per digit, this is quite a waste of memory; the 4-bit variant is therefore also called packed or dense in English. The resulting binary numbers could be processed by the processor like normal integers, but a BCD correction had to be applied before or after each individual calculation. These instructions had catchy names like ASCII adjust after addition.
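A minimal sketch (a hypothetical helper function, not one of the processor instructions mentioned above) that packs an unsigned decimal number into 4-bit BCD digits:

#include <cstdio>
#include <cstdint>

// Pack an unsigned decimal number into packed BCD: one decimal digit per 4-bit nibble.
static uint32_t to_packed_bcd(unsigned value) {
    uint32_t bcd = 0;
    for (int shift = 0; value != 0; shift += 4) {
        bcd |= (uint32_t)(value % 10) << shift;   // encode the lowest decimal digit
        value /= 10;
    }
    return bcd;
}

int main() {
    // 1234 becomes 0x1234: each hexadecimal digit now shows one decimal digit.
    printf("0x%X\n", (unsigned)to_packed_bcd(1234));
    return 0;
}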

Today, BCDs are probably as good as extinct, which is why the correction instructions are not examined more closely here.

Comment on the Terminology

Note that the word real is not entirely correct. From a purely mathematical point of view, such a number is called a reelle Zahl in German, which corresponds to the English term real number. Through Germanization, the hybrid term Realzahl was formed and has become naturalized; it is sometimes pronounced the German way and sometimes the English way.

Furthermore, there are heated discussions about which of the German terms Gleitkommazahl, Fließkommazahl, Gleitpunktzahl, Fließpunktzahl, etc. is correct. All of them are more or less thoughtless translations of the English floating point. Strictly speaking, it would have to be:

Exponentenmultiplikationsradixtrennzeichenpositionsverschiebungscodierung

Since this is a bit too complicated, the above terms have become common. Depending on the situation, professor, location or institution, one or the other may apply. Due to the author's long-standing habit, the term floating point number is used consistently on this page.

Next Chapter: Integer Arithmetics