Number Systems (PC Magazine Vol 4 No 13, June 25, 1985, PC Tutor)

One of the first things newcomers notice when they start reading computer magazines is the frequency with which they encounter hexadecimal (base 16) numbers. There are sound reasons for using hex notation in preference to decimal (base 10) notation. Even binary (base 2) notation should be used in certain instances. The figure below shows (a) the symbols used to denote the different digits and (b) several examples of numeric equivalences for the four number bases commonly used in computer writing and programming.

- - - - -

(a)

Name           Base    Numeral Set
----           ----    -----------
hexadecimal     16     0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F
decimal         10     0,1,2,3,4,5,6,7,8,9
octal            8     0,1,2,3,4,5,6,7
binary           2     0,1

(b)

hex     decimal      octal      binary
---     -------      -----      ------
1       1            1          1
8       8            10         1000
A       10           12         1010
F       15           17         1111
10      16           20         10000
3E8     1000         1750       1111101000
400     1024 (1K)    2000       10000000000
FFFF    65535        177777     1111111111111111

- - - - -

One thing evident is the "obvious" truth that the more different symbols you have available to represent a number, the shorter the number can be written. FFFF (hex), 65535 (decimal), 177777 (octal), and 1111111111111111 (binary) are all the same number, but the first representation takes 4 places and the last takes 16. Humans have trouble keeping track of long strings of digits, but for a machine this system is much easier than keeping track of a lot of differently valued symbols. Two is the minimum number of symbols needed to represent 1 and not-1 (i.e., 0), and all machine code consists of nothing but strings of binary numbers. As a concession to human weakness, references to "computer numbers" are conventionally translated (even by a utility like DEBUG, which works directly on machine code) into hex form. Each hex digit is the equivalent of four binary digits (bits):

hex   binary        hex   binary
0     0000          8     1000
1     0001          9     1001
2     0010          A     1010
3     0011          B     1011
4     0100          C     1100
5     0101          D     1101
6     0110          E     1110
7     0111          F     1111

Two hex digits thus represent 1 byte of information, and from the table above you can always translate down into binary to see which bits are 1s and which are 0s. Bytes need not have been defined as containing 8 bits. Earlier in computer history, grouping bits 3 at a time was popular, giving rise to octal notation. (That is the last mention of octal.) But the "powers of x" idea is as strong in computer design (where x is 2) as in "human" design (where x is 10), because it is the most efficient way to handle place value. Mathematically, if a number in base b is written as the sequence of digits:

   dn ... d3 d2 d1 d0

then the value of that number is:

   dn*b^n + ... + d3*b^3 + d2*b^2 + d1*b^1 + d0*b^0

A few examples of working with hex in terms of the PC's 8088 microprocessor will show how much clearer hex makes things than does decimal. The 8088 has 20 address lines. Each line can be a 0 or a 1 (this is digital; ones are +5 volts and zeros are 0 volts). Thus, the highest address the 8088 can generate is:

   11111111111111111111 binary (FFFFF hex, or 1,048,575 decimal)

which gives it 1,048,576 addressable characters (bytes). We generally say that the 8088 has a 1-megabyte memory capacity. The exact number means little in base 10, but it is very meaningful in either binary or hex. Both are much closer to the way the computer thinks. Thus, if you work in hex, just by knowing that the 8088 has 20 address lines you immediately know its directly addressable memory capacity.
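If you want to verify a few of these equivalences for yourself, the short sketch below does so in IBM BASIC. Line 20 spells out the place-value sum for 3E8 hex by hand (E is 14 decimal), and the loop lets BASIC's own HEX$ and OCT$ functions translate a few of the decimal values taken from table (b):

10 'a quick check of the figure above, sketched in IBM BASIC
20 PRINT 3*16^2 + 14*16 + 8      'place-value sum for 3E8 hex (E is 14 decimal); prints 1000
30 FOR I=1 TO 5                  'now let BASIC translate a few rows of table (b)
40 READ N
50 PRINT N, HEX$(N), OCT$(N)     'HEX$ and OCT$ are built into IBM BASIC
60 NEXT I
70 DATA 1, 8, 10, 15, 1000

As far as I know, BASIC offers no binary counterpart to HEX$, so for the binary column you still need the hex-to-binary table above.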
Similarly, if you tell me that an integer is stored at 0305 hex, I know immediately that (1) the machine uses 16-bit addresses, and (2) when the integer is accessed, address lines 9, 8, 2, and 0 should be high and the rest low. This is because 0305 hex = 0000001100000101 binary. (The lines are numbered from the rightmost bit, starting with 0.)

Consider next the way the 8088 computes the address of objects. It does this by combining the contents of two registers (normally, DS and DX), each of which contains a 16-bit quantity (4 hex digits). If we refer to DS:DX, for example, the actual address in decimal would be calculated via:

   effective address = DS * 16 + DX

This looks a bit strange to me in decimal notation, and it is not easy to calculate -- try it, assuming DS is 4096 and DX is 256, for example. If we switch to hex, however, the problem becomes:

   effective address = DS * 10 + DX

which looks a lot simpler. It says: put a 0 at the end of the number in DS and add it to the number in DX. Now, using the same numbers as above (in hex), we have:

   DS                1000
   DX               + 100
   DS * 10          10000
   DS * 10 + DX     10100
                    =====

Thus, the effective address is 10100 hex. If you did the arithmetic in decimal, you'd come out with 65792 -- the same address, but a far less memorable number. (A two-line BASIC check of this calculation appears a few paragraphs below.) Conclusion: when referring to the IBM PC, it often makes more sense to state computer references in hex or binary.

Now let's see why computer languages use binary internally for calculations. The internal representation of a number is the way a computer encodes it into its internal memory (registers). There is actually no intrinsic reason that a computer or computer language could not use decimal notation internally, but there are good mathematical reasons why it usually doesn't.

We should start by understanding why computer chips have small registers. Why does the 8088 on the IBM PC have 16-bit registers rather than 100-bit registers, which would produce greater accuracy? Today's digital logic requires one line (an electrical path) for each binary digit (bit) in a number. This is a consequence of recognizing only two voltage levels: low (0) and high (1). For each line on a chip, there are probably 5 to 10 more supporting chips on the computer motherboard for buffering and control. Thus, the fewer outside lines a microprocessor chip has, the cheaper it is to design and build both the chip and its support. The 8088 is relatively cheap since it provides only 8 data lines to the outside world, even though internally it "thinks" with 16 data lines. (The 8088 has an 8-bit external data bus and a 16-bit internal data bus.) This means that sometimes a numeric transfer requires two successive accesses to memory: one for the low 8 bits and one for the high 8 bits. The 80286 used by the PC AT, on the other hand, has a full 16-bit data bus, so programs with a lot of memory-based arithmetic automatically run more quickly, even ignoring such other speedups as a higher clock rate. The 8086 (used on the AT&T 6300 and the Compaq Plus, among others) also has a 16-bit data bus. Even though the 8086 was available when the PC was developed, IBM opted for the 8088 because of the dollar savings resulting from using 8 rather than 16 data lines. Thus, to reduce cost in computer designs the hardware people use as few data bits as possible, both internally and externally. Microcomputers are usually 8- or 16-bit designs, while mainframes are typically in the 60- to 80-bit range.
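Here is the BASIC check of the DS:DX arithmetic promised above. The values are the ones from the example; &H simply marks a hex constant, so BASIC does the base conversion for us and we only have to supply the multiply-by-16:

10 DS = &H1000 : DX = &H100     'the example register contents, written as hex constants
20 PRINT DS * 16 + DX           'prints 65792, which is 10100 hex

Line 20 is just the effective-address formula with the arithmetic done in decimal; the shortcut of appending a hex 0 to DS is something only we, reading in hex, get to enjoy.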
While the hardware designers have an interest in reducing the number of bits in a register, the language writers, on the other hand, want to minimize the errors that result from arithmetic roundoff and overflow. One way they do this is by using binary arithmetic internally. Let's compare the arithmetic strength -- that is, the accuracy and range -- of an 8088 language that represents numbers in binary form internally (which we read as hex) with one that uses decimal internally. A single hex digit has a maximum value of 15 decimal (1111 binary). Hence it requires 4 bits in order to represent the full complement of hex digits. A decimal digit also requires 4 bits, however. (Try to represent 8 or 9 with 3 bits, if you're skeptical.) An 8088 register holds 16 bits of information. Thus, the 8088 can manipulate either (but no more than) 4 hex digits or 4 decimal digits (16 bits total). Now consider a number, x, that equals 1237 decimal (4D5 in hex). A decimal language stores x as:

   1     2     3     7
   0001  0010  0011  0111

A binary language stores x as:

   0     4     D     5
   0000  0100  1101  0101

In this example, the only difference lies in the encoding method, but a hint is given by the fact that the binary/hex version needed only three digits to represent the decimal quantity 1237. If you now return to the last example shown in the figure above, you'll see that since the 8088 has 16-bit registers it can easily manipulate any number up to FFFF (65535 decimal) by using hex encoding. On the other hand, if the 8088 uses decimal encoding it can only easily manipulate numbers up to 9999. This means that by thinking internally in hex (or binary), a language running on the PC can increase the usable range of its arithmetic by 6.5 times, in comparison with decimal. The disparity gets even larger with bigger word sizes. A machine with 32-bit internal arithmetic increases its arithmetic range by 43:1 if it uses hex, not decimal, internal representation.

(continuation of the above) (PC Magazine Vol 4 No 14, July 9, 1985, PC Tutor)

The situation with floating-point numbers is harder to describe, but the end result is essentially the same. In floating-point arithmetic, the average error involved in representing a floating-point number is lower if the computer thinks in binary (hex). This may need some explanation, however, for one result of using binary notation was that BASIC had a bit of "trouble" when it was first released. While one would expect that 10*0.3 = 3.0, this does not necessarily happen when using binary languages. The reason for the seeming inaccuracy in floating-point calculations can be easily demonstrated using decimal notation. Consider the equation:

   1 - 3 * (1/3) = x

You would think x could only be 0, but consider the way a 4-digit decimal language might do the arithmetic:

   1/3 = .3333
   3 * (1/3) = 3 * .3333 = .9999
   1 - 3 * (1/3) = 1 - .9999 = 0.0001

In fact, if the floating-point routines were written poorly, that last statement could even be as bad as:

   1 - 3 * (1/3) = 1 - .9999 = 1.000 - 0.999 = 0.001

This would occur if the third step required that each number be represented by a 4-digit number, which would drop off the last "9" digit. In reality, it takes a tougher test to cause the binary floating-point routines in IBM's current BASIC to conk out. If you're interested in doing so, however, you might try:

   PRINT 5 - 25 * (2/25 + 3/25)

We, who can scan the algebra, can see that the answer should be 0, but the computer cannot. (Some high-level computer languages do support algebraic reduction, but that's a separate issue.)
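If you want to run that test yourself, the sketch below tries the same expression in single precision and then again with the constants forced to double precision (the # suffix). I won't promise exactly what digits your interpreter will print -- that is rather the point -- but comparing the two lines lets you see what the extra mantissa bits of double precision buy you:

10 PRINT 5 - 25 * (2/25 + 3/25)       'single precision: look for a small nonzero residue
20 PRINT 5 - 25 * (2#/25 + 3#/25)     'the same expression carried out in double precision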
Conclusion: computer languages generally process numbers internally in binary. This increases the range of the arithmetic, decreases roundoff errors, and reduces the number of overflow errors. Binary representation, however, may sometimes cause unexpected results, since noncomputer people think in decimal. To avoid this last difficulty, some business languages (notably C-BASIC) use decimal notation internally, thus ensuring that dollars and cents line up. Although that makes sense at first blush, consider some arithmetic such as:

   purchase 1 item at 3 for $1.00 -- is that $.33 or $.34 apiece?

In any problem involving division, the greater number of digits made available by using binary rather than decimal representation internally will yield potentially greater accuracy.

As we've seen, an integer can be represented internally in many different forms. Here we must discuss two standards: the IBM BASIC representation and the proposed IEEE standard (used by the 8087 coprocessor implementation). In each case, there are different levels of precision, which allow tradeoffs between storage and accuracy requirements, just as there are when you select between single and double precision in BASIC or FORTRAN. The IEEE (Institute of Electrical and Electronics Engineers) is an august body that, among other things, tries to set standards for the industry. This is a worthwhile task, since it promotes transportability between different manufacturers' hardware and different language writers' compilers and interpreters. The 8087 as implemented in the IBM PC is almost completely IEEE compatible, both in its data formats and in the algorithms it uses to handle rounding, overflow, and truncation errors. I know this because, among other reasons, I once had a number of sophisticated numerical-analysis programs running on an Amdahl mainframe computer in APL. When these applications were moved to the IBM PC, they worked without modification. What is more, the answers were identical to the 14th decimal place, because the language on the Amdahl and the language on the PC used the same floating-point format. For some reason, however, IBM BASIC does not adhere to the IEEE standard, so we need to know two different formats to understand floating-point math on a PC. The various formats are:

                 Number                 Mantissa   Exponent
   Name          of Bits    Range       Bits       Bits
   ----          -------    -------     --------   --------
   IBM BASIC:
   Integer         16       32767          -          -
   Single          32       1E+38         24          8
   Double          64       1E+38         56          8

   IEEE (8087):
   Word Int.       16       32767          -          -
   Short Int.      32       1E+9           -          -
   Long Int.       64       1E+19          -          -
   Packed BCD      80       1E+18          -          -
   Short Real      32       1E+38         24          8
   Long Real       64       1E+308        53         11
   Temp. Real      80       1E+4932       64         15

As you examine the above, you will see that the IEEE short real format has the same overall characteristics as IBM BASIC's single precision (although the two are not absolutely identical). The IEEE long real format, however, is very different from double-precision BASIC, even though both representations use the same number of bits. The IEEE long real uses more bits for the exponent and fewer for the mantissa. Thus, IBM's double-precision BASIC is more precise (in digits of precision) but has less range than the IEEE long real. You may wish to note in passing that for business applications the IEEE defines a data type called packed BCD (Binary Coded Decimal). This is our old friend the decimal-language representation. Packed BCD consists of an 18-digit fixed-point number (which requires 9 bytes to store and process), with the 10th byte reserved for the sign [+ or -] of the number.
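Incidentally, you can translate the mantissa column of that table into decimal digits of precision: each binary digit is worth about .3 decimal digits (the base-10 logarithm of 2). The sketch below does the arithmetic for the three real formats; the figures are rough, but they show why IBM's 56-bit double precision out-resolves the IEEE 53-bit long real, and why both land in the same neighborhood as packed BCD's 18 digits:

10 D = LOG(2)/LOG(10)               'LOG is the natural log, so this is log10 of 2, about .30103
20 PRINT "24-bit mantissa:"; 24*D   'single precision / short real: about 7 decimal digits
30 PRINT "56-bit mantissa:"; 56*D   'IBM double precision: about 17 decimal digits
40 PRINT "53-bit mantissa:"; 53*D   'IEEE long real: about 16 decimal digits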
Returning to floating-point notation, one useful question to consider is this: What does a known mantissa, sign, and exponent in binary notation translate into in decimal terms? The equation for this is:

   x = sign * .mantissa * (2 ^ exponent)

The period before the mantissa signifies that the mantissa comes after the radix point (the base-2 equivalent of the decimal point). Thus, for example, suppose we have a sign of +1, a mantissa of C000 hex, and an exponent of 12 hex. The number in hex then is:

   + 1 * .C000 * (2 ^ 12)

or (in decimal):

   + 1 * .75 * (2 ^ 18) = 196608

Why is .C000 equal to .75? Consider that in decimal .3 is 3/10; in hex .C is C/10 (hex), or 12/16 = .75 in decimal. Since the sign can be only plus 1 or minus 1, it is usually encoded as a single bit (a 0 bit for plus, a 1 bit for minus). Further, in base 2 the mantissa must begin with a 1. (Otherwise you could shift the mantissa one digit to the left, decrease the exponent by 1, and gain a bit of precision.) Thus, sometimes the leading 1 is omitted from the representation entirely. On the 8088, numbers are stored low byte first, so byte 0 (the exponent byte in what follows) ends up at the highest address and the rightmost, least significant byte at the lowest address. Further, in order to be able to represent negative exponents, the actual exponent is calculated by taking the stored binary (unsigned) value and subtracting a bias. The bias values are:

   BASIC:
   Single-Precision      128
   Double-Precision      128

   IEEE:
   Short Real            127
   Long Real            1023
   Temporary Real      16383

The following short program allows us to see just how the various pieces of a floating-point number fit together when we're using IBM single-precision BASIC.

5 'Assignment
10 J=0:N=0                 'so VARPTR works, set all scalars
20 FLOAT=.xxx              'surprise (actually, I used 0.12)
30 J=VARPTR(FLOAT)         'where is the number?
40 FOR N=3 TO 0 STEP -1    'since bytes are stored backwards
50 PRINT PEEK(N+J)         'read the bytes that make up FLOAT
60 NEXT N
70 END

Suppose that the printout from the program gives us the following decimal numbers:

   125  117  194  143

What is the value of xxx? First, we know that the exponent is stored in byte 0 (single precision), so the exponent is 125 - 128 = -3 decimal (the bias is 128). Thus, the first part of the puzzle is:

   2 ^ -3 = .125   (exponent portion)

The sign bit is the first bit of byte 1. Since 117 decimal = 75 hex, we can deduce that the first bit is 0, so we can tell that the result is positive. How? If the first hex digit were 8 or greater, the first bit in the byte would be a 1, and the sign would be negative. The second piece thus becomes:

   + 1 * 2 ^ -3

Finally, we remove the first bit (0 anyway) and add to that byte the leading 1 that was implicit, and we get the bytes:

   byte 1 = 75 + 80 (hex) = F5 hex
   byte 2 = 194 (decimal) = C2 hex
   byte 3 = 143 (decimal) = 8F hex

In this way, our mantissa becomes .F5C28F hex. Note the radix point: .F5C28F = F5C28F/1000000 (hex). The mantissa takes up 6 hex digits (24 bits), so when we put everything together, we have, finally:

   + 1 * (F5C28F/1000000) * (2 ^ -3)   (all in hex)

In decimal, this is 0.1199999...., which is fortunate since, in point of fact, I had actually set FLOAT to 0.12 originally to get the numbers for the example. (A short program that automates this byte-by-byte decoding follows at the end of this column.) As you can see, understanding how IBM BASIC handles even single-precision floating-point arithmetic is no piece of cake. However, if interest warrants, I'll go into more detail about the 8087 routines in a future column.
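Here is the sketch promised above. It takes the four bytes the program printed (hard-coded here in memory order, PEEK(J) through PEEK(J+3)) and reassembles the value by exactly the steps of the worked example: strip the bias from the exponent byte, peel off the sign bit, restore the implied leading 1, and scale the 24-bit mantissa. Only the byte values and the bias of 128 come from the example; the variable names are my own.

100 'rebuild FLOAT from the four bytes printed by the program above
110 B0=143 : B1=194 : B2=117 : B3=125         'PEEK(J) through PEEK(J+3); B3 is what the text calls byte 0
120 E = B3 - 128                              'remove the bias: 125 - 128 = -3
130 S = 1 : IF B2 >= 128 THEN S = -1          'top bit of the next byte is the sign (0 means plus)
140 H = (B2 MOD 128) + 128                    'replace the sign bit with the implied leading 1: 75 + 80 hex = F5 hex
150 M# = (H*65536# + B1*256 + B0)/16777216#   'the mantissa .F5C28F hex, expressed as a fraction
160 PRINT S * M# * 2^E                        'prints roughly .12 (0.1199999..., as in the text)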