In a computer, floating-point numbers are represented by a binary form of scientific notation. Since fixed-size representations are used, the numbers that they represent are limited in range and precision. The algorithms for floating point arithmetic are similar to algorithms for scientific notation.
The floating-point representations used in the early processors varied considerably in the details. Modern processors use single and double precision representations that adhere to the IEEE 754 standards.
For a normalized floating-point number, the bit pattern (a sign bit S, followed by an exponent field Exp and a fraction field Fraction) is interpreted as the number
(-1)^{S} × (1 + Fraction) × 2^{Exp - Bias}
The bias depends on the particular representation.
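This interpretation can be sketched concretely. The following Python fragment (Python is used here only for brevity) decodes a bit pattern by hand and checks the result against the machine's own interpretation; the field widths assume the IEEE 754 single precision layout of 1 sign bit, 8 exponent bits, 23 fraction bits, and a bias of 127, described below.

```python
import struct

def decode_single(bits):
    """Decode a 32-bit IEEE 754 single precision pattern by hand."""
    s = (bits >> 31) & 0x1      # 1-bit sign
    exp = (bits >> 23) & 0xFF   # 8-bit biased exponent
    frac = bits & 0x7FFFFF      # 23-bit fraction
    # Normalized number: (-1)^S * (1 + Fraction) * 2^(Exp - Bias), Bias = 127
    return (-1.0) ** s * (1 + frac / 2 ** 23) * 2.0 ** (exp - 127)

# 0x40490FDB is the single precision pattern closest to pi.
bits = 0x40490FDB
hardware = struct.unpack(">f", struct.pack(">I", bits))[0]
assert decode_single(bits) == hardware
```

The same function also decodes simple patterns such as 0x3F800000 (1.0) and 0xC0000000 (-2.0).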
The IEEE 754 standard defines two commonly used floating point formats, a single precision format and a double precision format.
A floating-point format has limited range and precision. These limitations can be understood by imagining scientific notation with a limited range of exponents and a limited number of digits in the mantissa.
The limitations in decimal can be estimated from the bit limitations using the fact that log_{10}(2) ≈ 0.30103. This implies the following.

- A significand of n bits provides about 0.3 n decimal digits of precision.
- An exponent field of n bits provides a decimal range of roughly 10^{±0.3 × 2^{n-1}}.

As an example of the first bullet, 2^{10} = 1024 ≈ 10^{3}.
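These estimates are easy to check numerically; a quick sketch:

```python
import math

# log10(2) ≈ 0.30103: each binary digit is worth about 0.3 decimal digits.
assert abs(math.log10(2) - 0.30103) < 1e-5
# For example, 2^10 = 1024 is close to 10^3.
assert 2 ** 10 == 1024
# A 24-bit single precision significand (23 fraction bits plus the
# implied leading 1) gives about 24 * 0.30103 ≈ 7.2 decimal digits.
assert round(24 * math.log10(2)) == 7
# An 8-bit exponent spans roughly 2^±127 ≈ 10^±38.
assert round(127 * math.log10(2)) == 38
```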
The bias for an IEEE 754 single precision number is 127.
The IEEE 754 single precision format has the following properties.
| C, C++, Java type | Range | Precision |
|---|---|---|
| float | 10^{-38} to 10^{+38} | 7 decimal digits |
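The precision limit can be demonstrated with a short sketch. Python has no single precision type of its own, so the fragment below rounds values through the 32-bit format using the standard struct module:

```python
import struct

def to_single(x):
    """Round a Python float (a double) to single precision and back."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

# About 7 decimal digits survive the round trip:
assert abs(to_single(3.14159265) - 3.1415927) < 1e-7
# The 8th decimal digit does not: adding 1e-8 to 1.0 is invisible
# in single precision but visible in double precision.
assert to_single(1.0 + 1e-8) == 1.0
assert 1.0 + 1e-8 != 1.0
```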
The bias for an IEEE 754 double precision number is 1023.
The IEEE 754 double precision format has the following properties.
| C, C++, Java type | Range | Precision |
|---|---|---|
| double | 10^{-308} to 10^{+308} | 16 decimal digits |
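A corresponding sketch for double precision (Python's float is an IEEE 754 double, so no conversion is needed):

```python
import sys

# Double precision machine epsilon is 2^-52 ≈ 2.2e-16,
# corresponding to about 16 decimal digits.
assert sys.float_info.epsilon == 2 ** -52
# The 15th decimal digit is kept, the 17th is lost:
assert 1.0 + 1e-15 != 1.0
assert 1.0 + 1e-17 == 1.0
# The range extends to about 10^308; beyond that, values overflow.
assert sys.float_info.max > 1e308
assert float("1e309") == float("inf")
```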
Although single precision numbers have adequate precision and range for most purposes, there are two considerations that may require double precision.
Since most modern processors can do a double precision operation as fast as a single precision operation, and since memory today is inexpensive, the use of double precision is common even when it is not necessary.
The algorithms for floating point arithmetic are similar to algorithms for scientific notation. Each operation involves separate operations on the sign, exponent, and fraction parts.
The algorithms for floating point addition and subtraction are similar to the algorithms for adding and subtracting in scientific notation:

1. Align the binary points: shift the fraction of the number with the smaller exponent right until the two exponents match.
2. Add or subtract the fractions.
3. Normalize the result, adjusting the exponent as needed.
4. Round the fraction to the available number of bits, renormalizing if rounding overflows the fraction.

These steps must be performed in order, since each one uses the result of the previous one.
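The sequential nature of the algorithm can be sketched with a toy decimal version. The following fragment works on (integer significand, exponent) pairs representing s × 10^e; signs and proper rounding are omitted to keep the sketch short, and truncation stands in for rounding:

```python
def add_scientific(s1, e1, s2, e2, digits=4):
    """Add s1*10^e1 + s2*10^e2, keeping `digits` significant digits.
    Toy sketch: align exponents, add significands, normalize."""
    # Step 1: align -- shift the significand with the smaller exponent
    # right (each shift discards one digit; rounding is ignored here).
    if e1 < e2:
        s1, e1, s2, e2 = s2, e2, s1, e1
    while e2 < e1:
        s2 //= 10
        e2 += 1
    # Step 2: add the significands.
    s, e = s1 + s2, e1
    # Step 3: normalize -- keep the significand within `digits` digits
    # (normalization after cancellation is not handled in this sketch).
    while abs(s) >= 10 ** digits:
        s //= 10
        e += 1
    return s, e

# 99.99 + 0.1610 with 4-digit significands: 9999*10^-2 + 1610*10^-4.
# True sum is 100.151; with truncation the result is 1001*10^-1 = 100.1.
assert add_scientific(9999, -2, 1610, -4) == (1001, -1)
```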
The algorithms for floating point multiplication and division are similar to the algorithms for multiplying and dividing in scientific notation:

1. Add the exponents (for multiplication) or subtract them (for division). With biased exponents, the bias must be subtracted from the sum or added back to the difference.
2. Multiply or divide the fractions.
3. Compute the sign of the result as the XOR of the two sign bits.
4. Normalize and round the result.

The first three steps are independent of one another and can be done in parallel.
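A corresponding sketch for multiplication, again in toy decimal form with unbiased exponents; note that the exponent, significand, and sign computations do not depend on one another:

```python
def mul_scientific(sign1, s1, e1, sign2, s2, e2, digits=4):
    """Multiply two numbers of the form (-1)^sign * s * 10^e.
    The first three steps are independent; normalization follows."""
    e = e1 + e2           # add the exponents (no bias in this sketch)
    s = s1 * s2           # multiply the significands
    sign = sign1 ^ sign2  # XOR the sign bits
    # Normalize: truncate back to `digits` significant digits.
    while s >= 10 ** digits:
        s //= 10
        e += 1
    return sign, s, e

# (1.110*10^10) * (-9.200*10^-5): significands 1110*10^7 and 9200*10^-8.
# True product is -1.0212*10^6, i.e. sign 1, 1021*10^3 after truncation.
assert mul_scientific(0, 1110, 7, 1, 9200, -8) == (1, 1021, 3)
```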