In a computer, floating-point numbers are represented by a binary form of scientific notation. Because the representations have a fixed size, the numbers they can represent are limited in both range and precision. The algorithms for floating-point arithmetic are similar to the algorithms for scientific notation.

The floating-point representations used in early processors varied considerably in their details. Modern processors use single and double precision representations that adhere to the IEEE 754 standard.

For a normalized floating-point number, the bit pattern

  S | Exp | Fraction

is interpreted as the number

(-1)^S × (1 + Fraction) × 2^(Exp - Bias)

The bias depends on the particular representation.
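
For example, with a bias of 127 (the single precision bias given below), the pattern S = 1, Exp = 124, Fraction = 0.25 is interpreted as

(-1)^1 × (1 + 0.25) × 2^(124 - 127) = -1.25 × 2^-3 = -0.15625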

The IEEE 754 standard defines two commonly used floating-point formats: a single precision format and a double precision format.

A floating-point format has limited range and precision. These limitations can be understood by imagining scientific notation with a limited range of exponents and a limited number of digits in the mantissa.

The limitations in decimal can be estimated from bit limitations using the fact that log10(2) ≈ 0.30103. This implies the following.

  * n bits in the fraction provide about 0.3 × n decimal digits of precision.
  * a binary exponent range of ±n corresponds to a decimal exponent range of about ±0.3 × n.

As an example of the first bullet, 2^10 = 1024 ≈ 10^3.
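
Applying these estimates to the formats below: the largest single precision value is just under 2^128, and 128 × 0.30103 ≈ 38.5, matching the 10^+38 range in the table below; a 24-bit significand (23 fraction bits plus the implicit leading 1) gives 24 × 0.30103 ≈ 7.2, about 7 decimal digits. The corresponding double precision estimates are 1024 × 0.30103 ≈ 308 and 53 × 0.30103 ≈ 16.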

The bias for an IEEE 754 single precision number is 127.

Properties

The IEEE 754 single precision format has the following properties.

C, C++, Java type    Range                  Precision
float                10^-38 to 10^+38       7 decimal digits
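
As a quick check of the interpretation formula, the following C sketch decodes a single precision bit pattern by hand using the bias of 127. The pattern 0x41C80000 is chosen arbitrarily for illustration and encodes 25.0; the sketch handles only normalized numbers (it ignores the reserved Exp values 0 and 255).

    #include <stdio.h>
    #include <stdint.h>
    #include <math.h>

    int main(void)
    {
        uint32_t bits = 0x41C80000u;           /* example pattern: 25.0f */

        uint32_t s    = bits >> 31;            /* 1 sign bit       */
        uint32_t exp  = (bits >> 23) & 0xFFu;  /* 8 exponent bits  */
        uint32_t frac = bits & 0x7FFFFFu;      /* 23 fraction bits */

        /* Apply (-1)^S x (1 + Fraction) x 2^(Exp - 127).  Fraction is a
         * binary fixed-point value in [0, 1), so divide by 2^23. */
        double value = (s ? -1.0 : 1.0)
                     * (1.0 + frac / (double)(1u << 23))
                     * pow(2.0, (double)exp - 127.0);

        printf("S=%u Exp=%u Fraction=0x%06X -> %g\n",
               (unsigned)s, (unsigned)exp, (unsigned)frac, value);
        return 0;
    }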

The bias for an IEEE 754 double precision number is 1023.

Properties

The IEEE 754 double precision format has the following properties.

C, C++, Java type    Range                  Precision
double               10^-308 to 10^+308     16 decimal digits
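
The range and precision in both tables can be checked against the limits that C publishes in <float.h>. Note that FLT_DIG and DBL_DIG are the guaranteed decimal digits (6 and 15), one less than the rounded estimates of 7 and 16 above.

    #include <stdio.h>
    #include <float.h>

    int main(void)
    {
        /* Smallest and largest positive normalized values, and the
         * number of decimal digits guaranteed to survive a round trip. */
        printf("float : %g to %g, %d decimal digits\n",
               (double)FLT_MIN, (double)FLT_MAX, FLT_DIG);
        printf("double: %g to %g, %d decimal digits\n",
               DBL_MIN, DBL_MAX, DBL_DIG);
        return 0;
    }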

Although single precision numbers have adequate precision and range for most purposes, there are two considerations that may require double precision.

Since most modern processors can do a double precision operation as fast as a single precision operation, and since memory today is inexpensive, the use of double precision is common even when it is not necessary.

The algorithms for floating-point arithmetic mirror those for scientific notation: each operation involves separate operations on the sign, exponent, and fraction parts.

The algorithms for floating-point addition and subtraction are similar to those for adding and subtracting in scientific notation:

  1. if the exponents are different, denormalize the number with the smaller exponent, making the exponents the same
  2. add or subtract the mantissas
  3. if necessary, normalize the result

These steps must be performed in order.
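
As a minimal sketch of these three steps, assuming both operands are positive, normalized, and nonzero, the following C program keeps a 24-bit significand in 1.23 fixed point, mirroring the hidden-bit layout of single precision. The SimpleFloat type and fp_add function are illustrative names, not part of any library, and rounding of the bits shifted out in step 1 is ignored.

    #include <stdio.h>
    #include <stdint.h>

    /* Toy format: value = sig * 2^(exp - 23), so the significand is in
     * 1.23 fixed point and 1.0 is represented as 1 << 23. */
    typedef struct {
        int32_t  exp;  /* unbiased exponent */
        uint32_t sig;  /* significand in [1 << 23, 2 << 23) */
    } SimpleFloat;

    static SimpleFloat fp_add(SimpleFloat a, SimpleFloat b)
    {
        /* Step 1: denormalize the operand with the smaller exponent by
         * shifting its significand right until the exponents match. */
        if (a.exp < b.exp) { SimpleFloat t = a; a = b; b = t; }
        uint32_t d = (uint32_t)(a.exp - b.exp);
        b.sig = (d < 32) ? (b.sig >> d) : 0;

        /* Step 2: add the significands, now aligned to one exponent. */
        SimpleFloat r = { a.exp, a.sig + b.sig };

        /* Step 3: if the sum reached 2.0 or more, normalize with one
         * right shift and a matching exponent increment. */
        if (r.sig >= (2u << 23)) {
            r.sig >>= 1;
            r.exp += 1;
        }
        return r;
    }

    int main(void)
    {
        SimpleFloat a = { 3, 3u << 22 };  /* 1.5 * 2^3 = 12 */
        SimpleFloat b = { 1, 1u << 23 };  /* 1.0 * 2^1 =  2 */
        SimpleFloat r = fp_add(a, b);     /* expect 1.75 * 2^3 = 14 */
        printf("significand=0x%06X exponent=%d\n",
               (unsigned)r.sig, (int)r.exp);
        return 0;
    }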

The algorithms for floating-point multiplication and division are similar to those for multiplying and dividing in scientific notation:

  1. XOR the sign bits
  2. multiply/divide the mantissas
  3. add/subtract the exponents
  4. if necessary, normalize the result

The first three steps can be done in parallel.
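
Continuing in the same toy representation, with an explicit sign bit added, the sketch below walks through the four multiplication steps; rounding is omitted (the low product bits are truncated), and SignedFloat and fp_mul are again illustrative names. Division would follow the same outline with the mantissas divided and the exponents subtracted.

    #include <stdio.h>
    #include <stdint.h>

    typedef struct {
        uint32_t sign;  /* 0 = positive, 1 = negative */
        int32_t  exp;   /* unbiased exponent */
        uint32_t sig;   /* significand, 1.23 fixed point: 1.0 == 1 << 23 */
    } SignedFloat;

    static SignedFloat fp_mul(SignedFloat a, SignedFloat b)
    {
        SignedFloat r;

        /* Steps 1-3 are independent, which is why hardware can do them
         * in parallel. */
        r.sign = a.sign ^ b.sign;              /* 1: XOR the sign bits  */
        uint64_t p = (uint64_t)a.sig * b.sig;  /* 2: multiply mantissas */
        r.exp = a.exp + b.exp;                 /* 3: add the exponents  */

        /* The 1.23 x 1.23 product has 46 fraction bits; drop 23. */
        p >>= 23;

        /* Step 4: the product of two values in [1, 2) lies in [1, 4),
         * so at most one right shift is needed to renormalize. */
        if (p >= (2u << 23)) {
            p >>= 1;
            r.exp += 1;
        }
        r.sig = (uint32_t)p;
        return r;
    }

    int main(void)
    {
        /* (-1.5 * 2^1) * (1.25 * 2^2) = -1.875 * 2^3 = -15 */
        SignedFloat a = { 1, 1, 3u << 22 };  /* -1.5  * 2^1 */
        SignedFloat b = { 0, 2, 5u << 21 };  /*  1.25 * 2^2 */
        SignedFloat r = fp_mul(a, b);
        printf("sign=%u significand=0x%06X exponent=%d\n",
               (unsigned)r.sign, (unsigned)r.sig, (int)r.exp);
        return 0;
    }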