Floating-Point Numbers

In a computer, floating-point numbers are represented by a binary form of scientific notation. Since fixed-size representations are used, the numbers that they represent are limited in range and precision. The algorithms for floating point arithmetic are similar to algorithms for scientific notation.

The floating-point representations used in early processors varied considerably in their details. Modern processors use single and double precision representations that adhere to the IEEE 754 standard.

For a normalized floating-point number, the bit pattern, which consists of a sign bit S followed by an exponent field Exp and a fraction field Fraction, is interpreted as the number

(-1)^S × (1 + Fraction) × 2^{Exp - Bias}

Here Fraction is the value of the fraction bits read as a binary fraction between 0 and 1. The bias depends on the particular representation.

The IEEE 754 standard defines two commonly used floating point formats, a single precision format and a double precision format.

A floating-point format has limited range and precision. These limitations can be understood by considering scientific notation with a limited range of exponents and a limited number of digits in the mantissa.

Range limitation: A fixed number of "Exp" bits is comparable to limiting the size of the exponent in scientific notation.
Precision limitation: A fixed number of "Fraction" bits is comparable to limiting the number of digits in the mantissa in scientific notation.

The limitations in decimal can be estimated from the bit limitations using the fact that log₁₀(2) ≈ 0.30103. This implies the following.

2 raised to a power is about the same as 10 raised to 3 ⁄ 10 of that power.
Each bit (binary digit) is about 3 ⁄ 10 of a decimal digit.

As an example of the first bullet, 2¹⁰ = 1024 ≈ 10³.
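
The rule of thumb above can be checked numerically. The following Python sketch converts bit counts into approximate decimal digits. The helper name bits_to_decimal_digits is made up for illustration, and the bit counts used (24 and 53 significand bits, counting the implicit leading 1, and maximum binary exponents of 127 and 1023) are assumed from the IEEE 754 single and double precision formats.

```python
import math

LOG10_2 = math.log10(2)  # about 0.30103

def bits_to_decimal_digits(bits):
    """Approximate number of decimal digits equivalent to `bits` binary digits."""
    return bits * LOG10_2

# 2**10 is close to 10**3
print(2**10, 10**3)                              # 1024 1000

# Precision: significand bits -> decimal digits
print(round(bits_to_decimal_digits(24), 2))      # 7.22
print(round(bits_to_decimal_digits(53), 2))      # 15.95

# Range: largest binary exponent -> decimal exponent
print(round(bits_to_decimal_digits(127), 1))     # 38.2
print(round(bits_to_decimal_digits(1023), 1))    # 308.0
```

The results line up with the range and precision figures usually quoted for these formats.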

The bias for an IEEE 754 single precision number is 127.
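
As an illustration, the interpretation formula can be applied to the bits of a single precision number. The sketch below is in Python and assumes the usual single precision layout of 1 sign bit, 8 exponent bits, and 23 fraction bits; the function name decode_single is hypothetical. It handles normalized numbers only, so zero, denormalized numbers, infinities, and NaNs are outside its scope.

```python
import struct

BIAS = 127  # bias for IEEE 754 single precision

def decode_single(x):
    """Split x (stored as single precision) into S, Exp, Fraction and rebuild its value."""
    # Pack x as a 32-bit float, then reread the same 4 bytes as an unsigned integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    s = bits >> 31                         # 1 sign bit
    exp = (bits >> 23) & 0xFF              # 8 exponent bits
    fraction = (bits & 0x7FFFFF) / 2**23   # 23 fraction bits as a value in [0, 1)
    value = (-1) ** s * (1 + fraction) * 2.0 ** (exp - BIAS)
    return s, exp, fraction, value

print(decode_single(-6.25))  # (1, 129, 0.5625, -6.25)
```

Here -6.25 is -1.5625 × 2², so the stored exponent is 2 + 127 = 129 and the fraction is 0.5625.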

Properties

The IEEE 754 single precision format has the following properties.

C, C++, Java type	Range	Precision
float	10⁻³⁸ to 10⁺³⁸	7 decimal digits
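
The precision limit can be observed directly. Python has no built-in single precision type, so the sketch below assumes that round-tripping a value through the struct module's "f" format is an acceptable stand-in for storing it in a C float; the helper name to_single is made up for illustration.

```python
import struct

def to_single(x):
    """Round x to the nearest single precision value, returned as a Python float."""
    return struct.unpack("f", struct.pack("f", x))[0]

# 16777216 is 2**24. The next integer needs 25 significant bits,
# which is one more than single precision can hold.
print(to_single(16777216.0))  # 16777216.0
print(to_single(16777217.0))  # 16777216.0

# 0.1 stored as single precision agrees with 0.1 to only about 7 digits.
print(abs(to_single(0.1) - 0.1) < 1e-7)  # True
print(to_single(0.1) == 0.1)             # False
```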

The bias for an IEEE 754 double precision number is 1023.

Properties

The IEEE 754 double precision format has the following properties.

C, C++, Java type	Range	Precision
double	10⁻³⁰⁸ to 10⁺³⁰⁸	16 decimal digits
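
These properties can be inspected from a program. The sketch below uses Python and assumes that the Python float type is an IEEE 754 double, which holds on the platforms in common use but is not guaranteed by the language.

```python
import sys

info = sys.float_info
print(info.mant_dig)  # number of significand bits, including the implicit leading 1
print(info.max)       # largest finite double, about 1.8 x 10**308
print(info.min)       # smallest positive normalized double, about 2.2 x 10**-308
print(info.dig)       # decimal digits that always survive a round trip

# 2**53 + 1 needs 54 significant bits, so it cannot be stored exactly.
print(float(2**53 + 1) == float(2**53))  # True
```

Note that info.dig is a conservative count of the decimal digits that are always preserved, so it can be smaller than the estimate obtained by multiplying the number of significand bits by log₁₀(2).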

Although single precision numbers have adequate precision and range for most purposes, there are two considerations that may require double precision.

Some quantities in physics are too large or too small for the single precision range.
Vector and matrix operations, which are common in scientific applications, can involve repeated additions of hundreds or thousands of numbers. Since errors can accumulate, double precision may be necessary.
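
The accumulation of errors can be illustrated with a small experiment that adds 0.1 ten thousand times, once rounding every intermediate result to single precision and once in double precision. This is only a sketch: it assumes Python floats are doubles and simulates single precision by round-tripping through the struct "f" format. The helper names to_single, sum_single, and sum_double are hypothetical.

```python
import struct

def to_single(x):
    """Round x to the nearest single precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def sum_single(value, count):
    """Add `value` to a running total `count` times, rounding to single precision each step."""
    v = to_single(value)
    total = 0.0
    for _ in range(count):
        total = to_single(total + v)
    return total

def sum_double(value, count):
    """Add `value` to a running total `count` times in double precision."""
    total = 0.0
    for _ in range(count):
        total += value
    return total

n = 10000
err_single = abs(sum_single(0.1, n) - 1000.0)
err_double = abs(sum_double(0.1, n) - 1000.0)
print("single precision error:", err_single)
print("double precision error:", err_double)
```

Neither sum is exactly 1000, but the single precision error is larger by several orders of magnitude.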

Since most modern processors can do a double precision operation as fast as a single precision operation, and since memory today is inexpensive, the use of double precision is common even when it is not necessary.

Each floating point arithmetic operation involves separate operations on the sign, exponent, and fraction parts.

The algorithms for floating point addition and subtraction are similar to algorithms for adding and subtracting in scientific notation:

if the exponents are different, denormalize the number with the smaller exponent, making the exponents the same
add or subtract the mantissas
if necessary, normalize the result

These steps must be performed in order.
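
The three steps can be sketched in Python on a toy representation. This is not how hardware implements addition: the sketch assumes non-negative operands, a 24-bit integer mantissa m with value m × 2^e, and truncation instead of proper rounding. All names (P, from_float, to_float, normalize, fp_add) are made up for illustration.

```python
import math

P = 24  # mantissa bits in this toy format (assumption)

def from_float(x):
    """Convert a non-negative float to (m, e) with value m * 2**e and m below 2**P."""
    frac, exp = math.frexp(x)          # x == frac * 2**exp, with 0.5 <= frac < 1
    return int(frac * (1 << P)), exp - P

def to_float(num):
    m, e = num
    return math.ldexp(m, e)

def normalize(m, e):
    """Shift m until it has exactly P significant bits (truncating), adjusting e to match."""
    if m == 0:
        return 0, 0
    while m >= (1 << P):
        m >>= 1
        e += 1
    while m < (1 << (P - 1)):
        m <<= 1
        e -= 1
    return m, e

def fp_add(a, b):
    (ma, ea), (mb, eb) = a, b
    # Step 1: denormalize the operand with the smaller exponent.
    if ea < eb:
        ma >>= eb - ea
        ea = eb
    elif eb < ea:
        mb >>= ea - eb
        eb = ea
    # Step 2: add the mantissas.
    m = ma + mb
    # Step 3: normalize the result if necessary.
    return normalize(m, ea)

print(to_float(fp_add(from_float(1.5), from_float(0.25))))        # 1.75
print(to_float(fp_add(from_float(16777216.0), from_float(1.0))))  # 16777216.0
```

In the second call, aligning the exponents shifts every bit of the smaller operand out of the mantissa, so the sum equals the larger operand.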

The algorithms for floating point multiplication and division are similar to algorithms for multiplying and dividing in scientific notation:

XOR the sign bits
multiply/divide the mantissas
add/subtract the exponents
if necessary, normalize the result

The first three steps can be done in parallel.
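
A similar toy sketch in Python illustrates the multiplication steps. It assumes a sign bit, a 24-bit integer mantissa m, and an exponent e with value (-1)^sign × m × 2^e, and it truncates instead of rounding. Division is omitted, and the names (P, from_float, to_float, normalize, fp_mul) are hypothetical.

```python
import math

P = 24  # mantissa bits in this toy format (assumption)

def from_float(x):
    """Convert a float to (sign, m, e) with value (-1)**sign * m * 2**e."""
    sign = 1 if math.copysign(1.0, x) < 0 else 0
    frac, exp = math.frexp(abs(x))     # abs(x) == frac * 2**exp, with 0.5 <= frac < 1
    return sign, int(frac * (1 << P)), exp - P

def to_float(num):
    sign, m, e = num
    return (-1.0) ** sign * math.ldexp(m, e)

def normalize(m, e):
    """Shift m until it has exactly P significant bits (truncating), adjusting e to match."""
    if m == 0:
        return 0, 0
    while m >= (1 << P):
        m >>= 1
        e += 1
    while m < (1 << (P - 1)):
        m <<= 1
        e -= 1
    return m, e

def fp_mul(a, b):
    sa, ma, ea = a
    sb, mb, eb = b
    # The first three steps do not depend on one another.
    sign = sa ^ sb   # XOR the sign bits
    m = ma * mb      # multiply the mantissas
    e = ea + eb      # add the exponents
    # The product can have up to 2 * P bits, so normalize it back to P bits.
    return (sign,) + normalize(m, e)

print(to_float(fp_mul(from_float(-1.5), from_float(2.0))))   # -3.0
print(to_float(fp_mul(from_float(-0.5), from_float(-0.5))))  # 0.25
```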