Thursday, April 3, 2008

Rikki Don't Lose That NUMBER

Everything you know is wrong. The more things change, the more they stay the same. Money makes the world go around. Let's see how many aphorisms I can spout while thinking about the Oracle NUMBER datatype.

The Oracle NUMBER datatype has a couple of attributes, precision and scale, that can be hard to grasp. The documentation spends some time explaining how these attributes affect the storage requirements and accuracy for numeric data. But what does it all mean, really?

Instead of storing integers in 16-, 32-, or 64-bit packages, or floating-point values in IEEE format, the Oracle database uses a different format that trades off some space and performance to avoid rounding errors of the sort that would foul up financial calculations.
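
Here's a quick sketch of the kind of rounding error I mean; the exact digits you get back depend on your release and client formatting, but the idea holds:

    -- 0.1 and 0.2 have no exact binary representation, so the
    -- floating-point sum picks up a tiny error:
    SELECT TO_BINARY_DOUBLE(0.1) + TO_BINARY_DOUBLE(0.2) FROM dual;
    -- comes back as something like 3.0000000000000004E-001

    -- NUMBER keeps decimal digits, so the pennies add up exactly:
    SELECT 0.1 + 0.2 FROM dual;
    -- comes back as 0.3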

The precision has to do with how big your numbers could get, in terms of decimal digits rather than bits. Need to count something with values from 0 to 99? That's a precision of 2. Need to score tests from 0 to 100? That's a precision of 3; the 3 digits let you go up to 999, even if you don't use all those values.
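
A quick sketch, with made-up column names:

    CREATE TABLE exam_results (
      coin_count  NUMBER(2),  -- room for 0 through 99
      test_score  NUMBER(3)   -- room for 0 through 999
    );

    -- Try to sneak in an extra digit and the insert fails with an error
    -- along the lines of ORA-01438: value larger than specified precision
    -- allowed for this column.
    INSERT INTO exam_results (coin_count) VALUES (100);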

When you declare a column type of INTEGER, again you're not saying you want values up to 32,767 or 2,147,483,647 like you might expect in binary terms. You're saying you want as many decimal digits as possible, which in the case of Oracle is 38 -- 99,999,999,999,999,999,999,999,999,999,999,999,999. INTEGER is just a synonym for NUMBER(38). If you don't really need that many, you can declare a smaller numeric type like NUMBER(10); since Oracle stores each NUMBER in a variable-length format sized to the actual value, that's more a sanity check on the data than a way to avoid buying hard drives.
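
Spelled out, with another hypothetical table:

    CREATE TABLE head_counts (
      galaxy_population  INTEGER,     -- same thing as NUMBER(38)
      city_population    NUMBER(10)   -- tops out at 9,999,999,999
    );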

OK, but banks and stores and stuff sometimes hand out these little sub-dollar units, coins I think they're called, or "change" as the kids say these days. No problem, that's the scale part. The precision gives the total number of decimal digits, and the scale says how many of those digits fall to the right of the decimal point. In terms of storage space in the database, the range 0-999 takes up the same amount as 0-9.99; it's just that the former is declared NUMBER(3,0) and the latter is NUMBER(3,2), meaning shift the decimal point 2 places in from the end.
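
A dollars-and-cents sketch, names hypothetical as usual:

    CREATE TABLE snack_prices (
      vote_count  NUMBER(3,0),  -- whole numbers, 0 to 999
      soda_price  NUMBER(3,2)   -- 0.00 to 9.99: same 3 digits, point shifted 2 places
    );

    -- Extra decimal places just get rounded to fit the declared scale:
    INSERT INTO snack_prices (soda_price) VALUES (1.234);  -- stored as 1.23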

If you are a scientist, banker, or soap marketer, you probably have rules for how many decimal places of accuracy you need. That asteroid has only a 0.0000000062% probability of hitting the Earth, not a 0.00000001% chance like it said in the paper. We've made a breakthrough in purity and now it's 99.46%. Your savings account earned $000000.00 in interest this month. That sort of thing.

When I first heard that NUMBER had a special storage format, I thought Oracle had resurrected the idea of "binary coded decimal", which was used in old microcomputers like the Commodore 64. Instead of storing 8- or 16-bit integers, you could fit one decimal digit 0-9 in each nibble of memory, so you could get 0-99 in a byte instead of 0-255. The 6502 processor could do addition and subtraction in this mode. It bought you some speed doing things like implementing a scoreboard in a video game. When someone scored 15,850 points in a pinball game, instead of doing a lot of slow divide-by-10 operations to figure out what digits to display on the screen, you could use the BCD values as indexes in a lookup table to grab the bitmaps for "1", "5", and so on.

But no, the NUMBER storage format is different. More compact than BCD, but still slower than native machine operations. If all this business with decimal digits has no bearing on your results, that's what BINARY_FLOAT, BINARY_DOUBLE, and PLS_INTEGER are for -- raw speed. That last one, PLS_INTEGER, can't be stored in a table. It's only for doing arithmetic within PL/SQL, which can speed up your triggers and stored procedures, in case you were declaring loop indexes as INTEGER thinking that was like an "int" in C. These days, BINARY_INTEGER is a synonym for PLS_INTEGER, but it used to be slower, so I use PLS_INTEGER in case my code ever gets run on an older database release.
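
A minimal sketch of the kind of loop counter I mean, with a made-up variable:

    DECLARE
      tally PLS_INTEGER := 0;  -- native integer arithmetic, no NUMBER overhead
    BEGIN
      FOR i IN 1 .. 1000000 LOOP
        tally := tally + 1;
      END LOOP;
      DBMS_OUTPUT.PUT_LINE(tally);
    END;
    /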

Most languages these days have done one thing or another to slow down the '80s style of pure machine arithmetic. In Java, for example, although you've still got the C-style int, many of the data structures require the wrapper type java.lang.Integer, which is just an object containing an int. Manipulating these objects requires extra levels of indirection behind the scenes before you can actually add, subtract, or compare them.

I say the '80s were the int's heyday because everyone was going to so much trouble to avoid any kind of floating-point arithmetic. Want to draw a line? Use Bresenham's line algorithm. Want to draw a circle? Use Bresenham's circle algorithm. Bezier curves?!

OK, the Bresenham algorithms are from the '60s, but it was in the '80s that everyone wanted to write their own version of Flight Simulator. I used these algorithms to implement an Amiga-style windowing and graphics system on Atari STs.

It was in the '90s that things changed. I trace the moment to the release of the IBM RS/6000 processor. (Sorry, RISC System/6000. So many other companies had trademarks on "RS" terms that we were told not to use the acronym in anything official.) Its floating-point unit was so good that we would advise FORTRAN programmers to do calculations on integer values by converting from integer to floating point, doing all the multiplies and divides, and then converting back to integer.
