Topics: Numerical Analysis - Rounding Error - Normalised Form of Real Numbers


The floating point form of a number in its normalised form can be obtained by terminating the fractional part of the number at $k$ digits. There are two ways to do this:

  • Truncate: keep the digits of the number up to the position that the word size allows (i.e. $k$) and discard the rest
  • Round: look at $d_{k+1}$, the first digit beyond position $k$
    • If $d_{k+1} \geq 5$, we add $1$ to $d_k$ ("rounding up")
    • If $d_{k+1} < 5$, we ignore the digits from $d_{k+1}$ onward ("rounding down")

$k$ is dependent on the size of the computer word.
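
The two termination rules can be sketched in Python. This is only an illustration under assumptions: it works in base 10 with the standard `decimal` module, takes the number of kept digits `k` as a parameter, and the helper names `to_k_digits`, `chop` and `round_k` are hypothetical, not part of the notes.

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP


def to_k_digits(x, k, rounding):
    """Keep k significant decimal digits of x using the given rounding mode."""
    d = Decimal(x)
    if d == 0:
        return d
    # adjusted() is the exponent of the leading digit, so this quantum
    # sits at the position of the k-th significant digit.
    quantum = Decimal(1).scaleb(d.adjusted() - k + 1)
    return d.quantize(quantum, rounding=rounding)


def chop(x, k):
    """Truncate: discard every digit after the k-th one."""
    return to_k_digits(x, k, ROUND_DOWN)


def round_k(x, k):
    """Round: add 1 to the k-th digit when the next digit is 5 or more."""
    return to_k_digits(x, k, ROUND_HALF_UP)


print(chop("3.14159265", 5))     # 3.1415
print(round_k("3.14159265", 5))  # 3.1416
```

Here the sixth digit is 9, so truncation and rounding give different results. Passing the number as a string keeps the example exact in decimal, which a binary `float` input would not guarantee.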

Bit Assignation and Range

Floating point numbers use specific bits of the computer word for specific purposes. Normally, there’s 1 bit dedicated to the sign, several bits for the exponent, and some more for the mantissa (the fractional part of the number).
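
As a concrete illustration of this bit layout, the sketch below splits a Python `float` into its three fields. It assumes the IEEE 754 double-precision format (1 sign bit, 11 exponent bits, 52 mantissa bits), which is what Python floats use on common platforms; the notes themselves do not fix a word size, and the function name `split_double` is hypothetical.

```python
import struct


def split_double(x):
    """Split a float into (sign, exponent_field, mantissa_field) bit fields."""
    # Reinterpret the 8 bytes of the double as an unsigned 64-bit integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                       # top bit
    exponent_field = (bits >> 52) & 0x7FF   # next 11 bits
    mantissa_field = bits & ((1 << 52) - 1)  # lowest 52 bits
    return sign, exponent_field, mantissa_field


print(split_double(1.0))   # (0, 1023, 0)
print(split_double(-2.0))  # (1, 1024, 0)
```

The exponent field is stored as a non-negative integer, which is why `1.0` shows 1023 rather than 0.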

Range and Exponent

The size of the word defines the range of the exponent of the floating point number:

Total digits | Range of the exponent | Exponent
to

…where:

  • : number of digits for the exponent
  • : number of digits for the mantissa
  • : decimal form of the number in the exponent part of the word

Final Number

The final number as represented on the computer is given by:

…where:

  • : the bit corresponding to the sign
  • : the exponent (i.e. )
  • : the decimal number in the mantissa section
  • : the base for the numbering system
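
To make the roles of the sign, exponent, mantissa and base concrete, here is a minimal sketch that rebuilds a value from its fields. It assumes the IEEE 754 double-precision convention for normal numbers, that is, value = (-1)^sign × base^(exponent field − 1023) × (1 + mantissa fraction); the formula used in these notes may follow a different convention (for example another bias, or a mantissa without the implicit leading 1). The names `decode`, `BIAS` and `MANTISSA_BITS` are hypothetical.

```python
from fractions import Fraction

BIAS = 1023          # assumption: IEEE 754 double-precision exponent bias
MANTISSA_BITS = 52   # assumption: IEEE 754 double-precision mantissa width


def decode(sign, exponent_field, mantissa_field, base=2):
    """Rebuild the exact value of a normal number from its three fields."""
    # Interpret the mantissa bits as a fraction in [0, 1).
    fraction = Fraction(mantissa_field, 1 << MANTISSA_BITS)
    scale = Fraction(base) ** (exponent_field - BIAS)
    return (-1) ** sign * scale * (1 + fraction)


print(float(decode(1, 1025, 5 << 49)))  # -6.5
```

The sketch covers normal numbers only; zero, subnormal numbers, infinities and NaN use reserved exponent fields and are not handled.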