Sunday, 8 January 2012

A Word About Floating Point Numbers + 2 Lecture Recordings

On my last lecture at the Telerik Academy on January 5th, 2012, the students were presented with various numeral systems (the usual base 10, the binary base 2 and the hexadecimal base 16).
In my humble opinion, this presentation came a bit too late, since the students have already been handling variables of all primitive types (after the "Primitive data types and variables" lecture) and manipulating them quite on a low binary level (after the "Operators and Expressions" lecture).
If anything, this lecture, which includes explanations of the basic count systems and memory representation of primitive types, should have been presented before learning about the various types themselves.

The Complexity Of Floating Point Numerals
One part of the numerical systems lecture, regarding floating point numbers and their memory representation, was not well explained in the PowerPoint slides, and in the lack of pre-built live samples, I felt a deeper explanation of the IEEE 754 standard may be in order. This is not quite light to grasp, especially for beginners.

The original lecture redirected the curious students to Wikipedia's "Floating point" article, which I felt may complicate things even more (I mentioned the possibility of brain explosions in such case). It's possible to also read Berkeley's actual IEE 754 specifications (PDF link), which I suspect wouldn't really do much good either.

I felt the best way to convey the subject would probably be using the power a step-by-step breakdown of a binary floating number, to see which parts it consists of and how it's constructed, and so I did.

Once the subject becomes more clear, I can recommend websites such as IEEE-754 Analysis, which converts any given real number to its binary representation, or the Floating Point IEEE 754 Converter, which can do the opposite (construct a number from a given binary structure) using a Java applet.

Break-Down By Example
Here is the example I provided in class, and its step by step break-down process. I decided to share this in a post, for the interest of the students who want to review it.

Particles of a Floating Point Number
The trick of representing any real number (or a number close-enough to it), is breaking the representation into 3 main parts:
(-1)Symbol bitx2(Exponent - 127)x(1 + Mantissa)
± Symbol Exponent Fraction

The floating point number would normally be split into 3 segments:
  • The ± symbol, represented by 1 bit, rendering the number positive (the default, when the bit is set to 0), or negative (if the bit is set to 1).
  • An exponent  which roughly produces the main part of the number.
    The exponent is always subtracted 127.
    This "always" is usually referred to as a "bias", and the exponent is called a "biased exponent" (because it is always affected).
  • The fraction, which helps flow the point and make the number more accurate, usually referred to by geeks as the Mantissa.
    The mantissa is a number between 0 and 1 (smaller than 1), to which an integer of 1 is forcefully added. This forceful addition is referred to as "normalization".
    This value of 1 could have been stored in the binary representation, but since it's implicitly added, there is no need to store it. It's referred to as the "hidden bit".
  • The hidden bit is always 1, thus the fraction is calculated as "1.Mantissa", unless the biased exponent is 0. In that case the hidden bit would have a value of 0, and the mantissa is not added anything.
    When the hidden bit is 0, the number is referred to as "denormalized". This may produce very small numbers, close to 0.
We'll start with a 32-bit number: 11000001100000000000000001100001.
Following IEEE-754 standard, we have:
  • 1 Bit for the symbol
  • 8 Bits for the exponent (would be 11 bits if we had a 64 bit number)
  • 23 Bits for the mantissa (would be 52 bits  if we had a 64 bit number)  
  • A hidden bit, added to the mantissa. Its value is 1 in our case.

Thus we can represent it by the following bit structure:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
1 1 1 1 1 1 1
S Exponent Mantissa

As can be seen in the broken-down particles, the exponent (bits 1..8 in our sample) part is treated as-is, then biased with a subtraction of 127.
The mantissa part, however (bits 9..31 in our sample) is parsed from right to left, each bit representing 2 in the power of the negative position (2-1, 2-2, 2-3 and so on), so we get:
Bit 9 Bit 10 Bit 11   ...  
2-1 = 1 / 21 = 0.5 2-2 = 1 / 22 = 0.25 2-3 = 1 / 23 = 0.125   ...  

As pointed above, we are adding the hidden bit to the mantissa, as a value of an integer 1.

Putting everything in place
So now our number 11000001100000000000000001100001 can be broken down:
  • The symbol bit is 1, so we are clearly dealing with a negative number.
  • The exponent is 10000011.
  • The mantissa is 00000000000000001100001.
  • The hidden bit is 1, thus we will add a value of 1 to the mantissa.
Putting the parts in place we get:
-(2 (10000011 - 127) x (1 + 00000000000000001100001))

Parsing the particles
Our 8 bit exponent 10000011 is equal to decimal 131.
Our 23 bits mantissa has 1's in its 17th, 18th and 23rd places. Thus its calculation is:
2-17 + 2-18 + 2-23 = (0.00000762 + 0.00000318 + 0.000000119) = 0.00001156

And our number becomes a bit more clear:
-(2 (131 - 127) x (1 + 0.00001156)) = - (2(131 - 127) x (1.00001156))

Which means:
- (24 x (1.00001156)) = -(16 x 1.00001156)

And our binary number 11000001100000000000000001100001 is decimal:
-16.00018496 = ~ -16.000185

Special cases
There are a few cases where the fraction or mantissa binary parts have specific values. Those produce special floating-point number values:
Sign bit Exponent bits Mantissa bits Value
0 All zeros All zeros +0 (positive zero)
1 All zeros All zeros -0 (negative zero)
0 All ones All zeros +∞ (positive infinity)
0 All ones All zeros -∞ (negative infinity)
Whatever All ones Non zero NaN (not a number)

Lecutre Recordings
Recordings of the last 2 lectures:

Multidimensional arrays
This one covers the topic of complex multidimensional arrays, jagged arrays and how to handle them in C#.

Numerical systems
As written, and in relation to the example above, this short lecture covers the basics of non-decimal counting systems and explores how various numeric data is represented in memory.


  1. Great Explanation!!!!

  2. Need to correct table. -infinity has the same condition as +infinity; that cannot be correct.