Topics: Numerical Analysis - Rounding Error - Normalised Form of Real Numbers

The floating point form of a number in its normalised form can be obtained by terminating the fractional part of the number at $k$ digits. There are two ways to do this:

Truncate: the digits of the number up to the position that the word size allows (i.e. $k$ )
Round:
- If $d^{k + 1} \geq 5$ , we add $1$ to $d^{k}$ , ”rounding up”
- If $d^{k + 1} < 5$ , we ignore the digits from $k + 1$ onward, ”rounding down”

$k$ is dependant on the size of the computer word.

Bit Assignation and Range

Floating point numbers use specific bits of the computer word for specific purposes. Normally, there’s 1 bit dedicated to the sign, several bits for the exponent, and some more for the mantissa (the fractional part of the number).

Range and Exponent

The size of the word defines the range of the exponent of the floating point number:

Total digits	Range of the exponent	Exponent
$N = p + q + 1$	$- (2^{p - 1} - 1)$ to $2^{p - 1}$	$c - (2^{p - 1} - 1)$

…where:

$p$ : number of digits for the exponent
$q$ : number of digits for the mantissa
$c$ : decimal form of the number in the exponent part of the word

Mantissa Bits

Please note that the mantissa section of the word must always have $1$ in its first position (left to right). The mantissa section cannot be $0$ in total.

Final Number

The final number as represented on the computer is given by:

(- 1)^{s} B^{e} (1 + f)

…where:

$s$ : the bit corresponding to the sign
$e$ : the exponent (i.e. $c - (2^{p - 1} - 1)$ )
$f$ : the decimal number in the mantissa section
$B$ : the base for the numbering system

Example

For instance, take the following number as represented in bits:

$s$ Exponent Mantissa
1 1 0 0 1 1 1 0 0 1 0 1 0 0 0 0

In this case, we have that:

$s = 1$

$e = 19 - 15 = 4$

$f = 0.100101000 0_{2} = 0.57812 5_{10}$

Thus, the number is:
$(- 1)^{1} 2^{4} (1 + 0.578125)$ $= - 16 (1.578125) = - 25.25$

Notkesto

Navigation

Recently Created

A Camera is a Simple Device

Good Feedback is Constructive and Actionable

Photography Gear is the Medium

Photography is More About Art than Technology

Discounted Policy Improvement Method for MDPs

Floating Point Form of Real Numbers

Bit Assignation and Range

Range and Exponent

Final Number

Graph View

Table of Contents

Backlinks

Notkesto

Navigation

Recently Created

A Camera is a Simple Device

Good Feedback is Constructive and Actionable

Photography Gear is the Medium

Photography is More About Art than Technology

Discounted Policy Improvement Method for MDPs

Floating Point Form of Real Numbers

Bit Assignation and Range §

Range and Exponent §

Final Number §

Graph View

Table of Contents

Backlinks

Bit Assignation and Range

Range and Exponent

Final Number