In a computer, floating-point numbers are represented by a binary form of scientific notation. Since fixed-size representations are used, the numbers that they represent are limited in range and precision. The algorithms for floating point arithmetic are similar to algorithms for scientific notation.

The floating-point representations used in early processors varied considerably in the details. Modern processors use single and double precision representations that adhere to the IEEE 754 standard.

For a normalized floating-point number, the bit pattern (a sign bit S, an exponent field Exp, and a fraction field Fraction) is interpreted as the number

(-1)^S × (1 + Fraction) × 2^(Exp - Bias)

The bias depends on the particular representation: it is 127 for IEEE 754 single precision and 1023 for double precision.
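In Python, where a float is an IEEE 754 double, the fields can be extracted with a pack/unpack round trip. This is a minimal sketch; decode_double is a hypothetical helper name, and it assumes a normalized input (it does not handle zeros, subnormals, infinities, or NaN).

```python
import struct

def decode_double(x):
    """Decode the IEEE 754 double-precision fields of x.

    Layout: 1 sign bit, 11 exponent bits (bias 1023), 52 fraction bits.
    Assumes x is a normalized number.
    """
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63
    exp = (bits >> 52) & 0x7FF
    fraction = (bits & ((1 << 52) - 1)) / (1 << 52)
    # Reconstruct (-1)^S * (1 + Fraction) * 2^(Exp - Bias).
    value = (-1) ** sign * (1 + fraction) * 2.0 ** (exp - 1023)
    return sign, exp, fraction, value

print(decode_double(-6.5))   # -6.5 = -1.625 * 2^2, so Exp = 1025, Fraction = 0.625
```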

A floating-point format has limited range and precision. These limitations can be understood by considering scientific notation with a limited range of exponents and a limited number of digits in the mantissa.

• Range limitation: A fixed number of "Exp" bits is comparable to limiting the size of the exponent in scientific notation.
• Precision limitation: A fixed number of "Fraction" bits is comparable to limiting the number of digits in the mantissa in scientific notation.

The limitations in decimal can be estimated from bit limitations using the fact that log10(2) ≈ 0.30103. This implies the following.

• 2 raised to a power is about the same as 10 raised to 3 ⁄ 10 of that power.
• Each bit (binary digit) is about 3 ⁄ 10 of a decimal digit.

As an example of the first bullet, 2^10 = 1024 ≈ 10^3.
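The rule of thumb can be checked directly. The sketch below assumes the usual significand widths of 24 bits for single precision (1 implicit bit plus 23 fraction bits) and 53 bits for double precision.

```python
import math

# Each bit carries about 3/10 of a decimal digit, since log10(2) ≈ 0.30103.
print(math.log10(2))

# Single precision: 24 significant bits → about 7 decimal digits.
print(24 * math.log10(2))
# Double precision: 53 significant bits → about 16 decimal digits.
print(53 * math.log10(2))

# The first bullet's example: 2^10 = 1024 ≈ 10^3.
print(2 ** 10, 10 ** 3)
```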

Although single precision numbers have adequate precision and range for most purposes, there are two considerations that may require double precision.

• Some quantities in physics are too large or too small for the single precision range.
• Vector and matrix operations, which are common in scientific applications, can involve repeated additions of hundreds or thousands of numbers. Since errors can accumulate, double precision may be necessary.
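The accumulation effect in the second bullet can be demonstrated by emulating single precision in Python, where the native float is a double. The pack/unpack round trip below (to_f32 is a hypothetical helper name) rounds each intermediate sum to the nearest single-precision value.

```python
import struct

def to_f32(x):
    """Round x to the nearest single-precision value by a pack/unpack round trip."""
    return struct.unpack("f", struct.pack("f", x))[0]

n = 100_000
term = 0.1    # exact value is 10000.0 after n additions

# Double-precision accumulation (Python floats are IEEE 754 doubles).
total64 = 0.0
for _ in range(n):
    total64 += term

# Emulated single-precision accumulation: round after every addition.
total32 = 0.0
for _ in range(n):
    total32 = to_f32(total32 + to_f32(term))

print(abs(total64 - 10000.0))   # small double-precision error
print(abs(total32 - 10000.0))   # much larger accumulated single-precision error
```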

Since most modern processors can do a double precision operation as fast as a single precision operation, and since memory today is inexpensive, the use of double precision is common even when it is not necessary.

The algorithms for floating point addition and subtraction are similar to algorithms for adding and subtracting in scientific notation:

1. if the exponents are different, denormalize the number with the smaller exponent, making the exponents the same
2. add or subtract the mantissas
3. if necessary, normalize the result

These steps must be performed in order.
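The three steps can be sketched on toy numbers of the form m × 2^e with an integer mantissa, an assumption made to keep the example short. fp_add is a hypothetical helper; rounding, signed zeros, and special values (infinity, NaN) are ignored.

```python
def fp_add(m1, e1, m2, e2, precision=8):
    """Sketch of floating-point addition on toy numbers m * 2**e,
    where m is an integer mantissa of at most `precision` bits.
    Illustrative only; real hardware also rounds and handles specials."""
    # Step 1: denormalize the number with the smaller exponent,
    # shifting its mantissa right until the exponents match.
    if e1 < e2:
        m1, e1, m2, e2 = m2, e2, m1, e1    # ensure e1 >= e2
    m2 >>= (e1 - e2)                       # low-order bits are lost here
    # Step 2: add the mantissas.
    m = m1 + m2
    e = e1
    # Step 3: if necessary, normalize the result back into `precision` bits.
    while abs(m) >= (1 << precision):
        m >>= 1
        e += 1
    return m, e

# 192 * 2^0 + 128 * 2^-4 = 192 + 8 = 200
print(fp_add(192, 0, 128, -4))
```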

The algorithms for floating point multiplication and division are similar to algorithms for multiplying and dividing in scientific notation:

1. XOR the sign bits to obtain the sign of the result
2. add (for multiplication) or subtract (for division) the exponents
3. multiply or divide the mantissas
4. if necessary, normalize the result
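Multiplication can be sketched on the same kind of toy numbers, here written (-1)^s × m × 2^e with an unsigned integer mantissa. fp_mul is a hypothetical helper; as before, rounding and special values are ignored.

```python
def fp_mul(s1, m1, e1, s2, m2, e2, precision=8):
    """Sketch of floating-point multiplication on toy numbers
    (-1)**s * m * 2**e with a `precision`-bit unsigned mantissa.
    Illustrative only; real hardware also rounds and handles specials."""
    # Step 1: XOR the sign bits.
    s = s1 ^ s2
    # Step 2: add the exponents and multiply the mantissas.
    e = e1 + e2
    m = m1 * m2
    # Step 3: normalize the (up to double-width) product
    # back into `precision` bits, truncating low-order bits.
    while m >= (1 << precision):
        m >>= 1
        e += 1
    return s, m, e

# (3 * 2^0) * (-5 * 2^2) = -60 = (-1)^1 * 15 * 2^2
print(fp_mul(0, 3, 0, 1, 5, 2))
```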