Monday, April 21, 2008

A CASE of Functional Programming

Chris mentioned in response to my post on functional programming that the CASE construct is useful in such situations. You can use CASE expressions either in PL/SQL subprograms or in SQL queries.

I agree, although I'm not sure if my agreement is more because of developer convenience or the pragmatic (performance) aspect. When you use a CASE expression, you skip the step of putting the IF/THEN test into its own function, so a purist might say it's bad that you end up duplicating code if you use the same CASE in several different queries. However, I have found CASE to be faster than I expected in queries producing big result sets, so I use it sometimes in preference to setting up a real function and making a function-based index.

As Chris mentioned, CASE lets you turn a PL/SQL function into essentially a one-liner, just a RETURN statement that covers all the possible bases. CASE expressions force you to use the ELSE clause, so you're guaranteed to return a value no matter how strange the input arguments. [See correction/digression at end.]

In a SQL query, CASE can take the place of a function call, either for transformation of the output results:

select
   case region_code
      when 'NL' then 'Newfoundland'
      when 'ON' then 'Ontario'
      when 'CA' then 'California'
      else 'Some other place'
   end
from ...

or for a condition in the WHERE clause:

select ...
where vacation_days >
   case
      when emp_type = 'VP' then 10
      when years_service > 5 then 5
      else 2
   end
...

or (my favorite) in some clause where you wouldn't normally think of it, such as putting a dynamic ORDER BY clause into a query inside a PL/SQL procedure:

for item in (select ... where ...
             order by case
                -- All branches of one CASE must return the same datatype,
                -- hence the TO_CHAR calls.
                when param = 'by_date' then to_char(date_added, 'YYYYMMDDHH24MISS')
                when param = 'by_name' then last_name
                else to_char(credit_card_number)
             end
            )


Correction

As Boneist pointed out, my original assertion that CASE forces you to cover all possibilities is not correct for CASE expressions. For example:

select nvl(case when 1 = 0 then 'foo' end, 'bar') from dual;

returns 'bar', demonstrating that if the CASE expression "falls through" without matching any WHEN clauses, the result is null.

It is true for a CASE statement rather than an expression:

begin
   case
      when 1 = 0 then null;
   end case;
end;

That will raise the predefined CASE_NOT_FOUND exception (ORA-06592), because of the lack of an ELSE clause. I can never remember whether that behavior applies to the statement or the expression, because it's the opposite of what I would like. If I'm selecting actions, I'd like a default ELSE condition of "do nothing", so I could leave it out. But if I'm selecting values, I'd prefer to be forced to consider all cases.

I fretted about the CASE statement vs. expression difference when I first wrote it up in the PL/SQL User's Guide. Is there enough cross-linking so that people will see both forms? Is there too much cross-linking so that people will get the expression and the statement mixed up?

Also, do we really need to use a special term "searched case" for one variety of syntax? I suppose it is a long-standing term, since I see it used in SQL Server and MySQL contexts too. I just can't imagine going into a code review and arguing with someone that they should or shouldn't use "searched case", without having to define it first and clarify whether it means the form that dives straight into the WHEN clauses, or the one that starts with CASE followed by an expression.

Anyway, knowing about the statement as opposed to the expression does have some point in the functional programming environment. Let's say you wanted to rewrite the COMPARE function from my earlier post using CASE, but if something went wrong, it should raise an exception rather than return a value. Instead of coding that as a single statement starting "RETURN CASE ...", you'd write:

function compare(x number, y number) return number is
begin
   case
      when x > y then return 1;
      when x < y then return -1;
      when x = y then return 0;
      -- If there is some unexpected combination of parameters, NULLs or NaNs or
      -- what have you, need to handle that case also.
      else raise some_package.weird_math_error;
   end case;
end;

Sunday, April 20, 2008

A Word about Functional Programming

In all the back-and-forth about procedural programming and object-oriented programming, don't forget about another possibility: functional programming. It offers valuable lessons that can help with both hardcore SQL database operations, and the most modern and hip OO coding styles.

In some quarters, functional programming has a bit of a stigma from the Lisp days. In its purest form, functional programming does away with variables, turns even the most complicated program into a one-liner (with tons of brackets to nest properly), and has performance that varies depending on the characteristics of the language compiler.

These days, the introductory Comp Sci course at UC Berkeley (61A) uses Scheme as the teaching language. It's more accessible than Lisp, but still I found it more pleasant to follow along at home doing the exercises in Javascript.

On a pragmatic level, the upside of functional programming goes something like this. Once you write a function that can reliably read the color value for the pixel at coordinates (x,y), and another that can reliably write a color value for the pixel at (x,y), writing Photoshop just boils down to some data structures and a zillion calls to these functions. In a web development context, if you can make a popup window appear reliably for Firefox, Internet Explorer, Opera, Safari, etc., it doesn't really matter how unappealing the code underneath is, you can build more and more layers that call the same few functions. You can see the downside on many popular web sites; once people learned how to make ads appear and dance across pages, or pull code from other sites to track people's browsing behavior, they built on top of those functions without considering the ultimate performance hit or memory consumption to do all that for hundreds of browser pages/tabs at a time.

If you just want to introduce a bit of functional programming into your everyday life, you can do that without switching languages. Here are some ways you could use its precepts to improve PL/SQL or other database code.

With functional programming, every program unit is a function (i.e. has a return value). This lets you map out the outline of a program in an essentially pseudo-code style. Instead of:

if x > y then x := 0; end if;

you would write something like:

if exceeds_limit() then reset_value(); end if;

When you've tried this before, you might have found it cumbersome to pass all the input values as parameters -- it's not any better to write is_greater_than(x,y) rather than x > y. But PL/SQL lets you write nested subprograms that access the variables of the parent subprogram, so you can make calls without parameters and let the subprograms get and set the values they need.
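
Something like this, as a sketch -- the LIMITS table and its columns are invented, and note that PL/SQL calls parameterless subprograms without parentheses:

procedure check_and_reset is
   x number;
   y number;

   -- The nested subprograms share the parent's X and Y, so no parameters needed.
   function exceeds_limit return boolean is
   begin
      return x > y;
   end exceeds_limit;

   procedure reset_value is
   begin
      x := 0;
   end reset_value;
begin
   select curr_value, max_value into x, y from limits where rownum = 1;

   -- The main logic reads like the pseudo-code outline above.
   if exceeds_limit then
      reset_value;
   end if;
end check_and_reset;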

OK, it's true that reset_value() above is a procedure, not a function. But that's why I said full-on functional programming isn't strictly required. In traditional FP style, instead of doing assignments, you'd issue RETURN statements, and all the assignments would happen at outer layers of the program.

These pseudo-code outlines make it easier to implement bug-free versions of the named subprograms, since they're generally dead simple to write. You're also protected if you need to write several different variations of the main program that call different combinations of subprograms or do things in a different order; you don't wind up duplicating all the code for the detailed logic.

Functional programming does instill a certain discipline to consider all the possible cases of execution:

function compare(x number, y number) return number is
begin
   if x > y then return 1; end if;
   if x < y then return -1; end if;
   if x = y then return 0; end if;
   -- If there is some unexpected combination of parameters, NULLs or NaNs or
   -- what have you, need to handle that case also.
   return null; -- Or raise an exception...
end;

When you're comfortable handling all the "halting conditions" for straightforward functions like that, it gives you confidence to tackle recursive calls. Typically, a recursive call will do a little bit of manipulation or testing to see if the function has enough information to return a final value, and if not, will call itself with a slightly modified (simpler) version of the input parameter. For example, N-1 for an integer parameter, or SUBSTR(param,2) for a string parameter. Something that trends towards zero, the empty string, the empty list, etc.
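
For example, here's a toy function (just a sketch) that reverses a string by recursing on SUBSTR(s, 2); the halting condition is the empty string, which Oracle treats as NULL:

function reverse_string(s varchar2) return varchar2 is
begin
   -- Halting condition: the empty string reverses to itself.
   if s is null then
      return null;
   end if;
   -- Recurse on a slightly smaller input: everything after the first character,
   -- then tack the first character onto the end.
   return reverse_string(substr(s, 2)) || substr(s, 1, 1);
end;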

PL/SQL can do recursion, but its more complex data structures don't jump out as being ideally suited to recursive calls. For example, if you have a collection, it's not a simple matter to construct a new collection consisting of elements 2-N, to chop off the first one, or to "concatenate" collections to pass the results back through recursive calls. That's a common operation in doing parsing operations, and other scenarios often tackled through recursion. Perhaps with the enhancements to collection operations over the last couple of releases, recursive operations involving collections are now straightforward. If so, I'd love to hear from people who have done so.
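
For what it's worth, here's a sketch of how far the nested-table syntax gets you (the STR_LIST type and the values are invented). Building the "rest of the list" is still a manual copy loop, but the 10g MULTISET operators do make concatenation a one-liner:

declare
   type str_list is table of varchar2(100);
   whole str_list := str_list('a', 'b', 'c', 'd');
   rest  str_list := str_list();
   both  str_list;
begin
   -- Elements 2..N: copy them one at a time into a new collection.
   for i in 2 .. whole.count loop
      rest.extend;
      rest(rest.count) := whole(i);
   end loop;

   -- Concatenation: MULTISET UNION ALL glues two nested tables together.
   both := whole multiset union all rest;
end;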

SQL coders have been dealing with the performance implications of functions for some time now, so they could swap war stories over beer with Lisp coders. If you aren't careful, a SQL query can make an astronomical number of calls to expensive functions. But if you are careful, then you can be like one of those Lisp programmers who looks at any problem and says, "I've been able to solve that in one line since 1985!".

Oracle subselects and inline views are very good for doing "composition of functions", that is, producing a set of values and passing those values through a series of functions.

select line from
(
   select howmany || ': ' || label line from
   (
      select howmany, label from
      (
         select count(*) howmany, title label from t1
         group by title
      )
      order by howmany desc
   )
)
where rownum < 11
;

In the example above, we're producing a large number of values in the innermost section, and in the outer layers we're sorting the results, concatenating different fields into a single string, and finally limiting the number of results that come out at the end. Along the way, we're renaming some of the expressions and columns of the inner layers to be more descriptive and generic.

Performance-wise, it's best to push the most expensive operations and the limiting step as far inside as possible. If the innermost query only produces 10 rows, it doesn't matter how much transforming and sorting we do in the outer parts, because those steps only happen 10 times.
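
For example, the query above can be restructured into the classic top-N pattern (a sketch of the same idea), so the ORDER BY sits right next to the GROUP BY and the ROWNUM filter trims the rows before the outer layer does its string formatting:

select howmany || ': ' || label line from
(
   select count(*) howmany, title label from t1
   group by title
   order by howmany desc
)
where rownum < 11
;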

For example, consider these variations on the same idea:

select distinct(substr(lower(colname),2)) from t;
select distinct(substr(lower(colname),2)) from (select distinct colname from t);

Let's say the table has 10,000 rows. The first query is relatively inefficient, because we're running LOWER 10,000 times, then running SUBSTR 10,000 times on the results, then finally doing DISTINCT which might prune the final results down to 1000. The second query runs DISTINCT right at the beginning (the innermost DISTINCT), which prunes the results let's say to 1500; then LOWER gets run 1500 times, and SUBSTR gets run 1500 times, and then the second DISTINCT brings the final results from 1500 rows down to 1000 rows.

I actually ran the first query above in a moment of Lisp-like "see, I can do this in one line", and it processed 22,000 rows in a fraction of a second. In a real business situation, the functions might be things like IS_GOOD_CREDIT_RISK() or DESERVES_A_RAISE_THIS_YEAR(), things that might even query other tables, and anyway take a long time to compute for every row. In which case it really is important to make the result set small in the inner layers, and save the expensive calls for the outer ones.

A whole class of Oracle features goes toward optimizing queries that call functions.

There is the function-based index, where the database pre-computes the results of a function for each row, and stores the results in the index rather than in a separate column. The function must be declared DETERMINISTIC, that is, guaranteed to give the same result every time it's passed the same input parameters. (The database already knows that built-in functions like LOWER() and SUBSTR() are deterministic, so you can build function-based indexes that incorporate calls to those functions.) This approach can be better than maintaining a separate column, because what if the logic breaks down and that column doesn't get updated when the other columns do? The stored values quietly go out of date. If you only need the function results for use in WHERE clauses, the function-based index protects against that case. It actually protects a little too well in my experience, because when you recompile the function, the index becomes invalid and prevents DML operations on the table. So you need to separate out the code for that function so it doesn't get recompiled except when you really change it. Either that or add an ALTER INDEX...REBUILD statement after the CREATE OR REPLACE statement for the function.
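
A minimal sketch, reusing the LOWER/SUBSTR transformation from earlier; the CUSTOMERS table and all the names are invented:

create or replace function normalized_name(name varchar2)
   return varchar2
   deterministic
is
begin
   return substr(lower(name), 2);
end;

-- The index stores the pre-computed function results for each row.
create index cust_norm_name_idx on customers (normalized_name(cust_name));

-- Queries that repeat the same expression in the WHERE clause can use the index.
select * from customers where normalized_name(cust_name) = 'mith';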

In 11g, there is the function result cache, which devotes some memory to the results from functions. The idea being that if you have a function that translates 'CA' to 'California', 'IA' to 'Iowa', etc., the database should be smart enough to keep those 50 or so values cached in memory and skip the actual call to your function. That way, you could query large amounts of data without displaying cryptic verbatim column values, and without paying a performance penalty for the transformations. I haven't personally tried the result cache, but I'm itching to -- got those DETERMINISTIC keywords locked in and ready to go!
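
For reference, the 11g syntax looks something like this -- an untested sketch with an invented STATES lookup table; the RELIES_ON clause names the table the cached results depend on:

create or replace function state_name(code varchar2)
   return varchar2
   result_cache relies_on (states)
is
   v_name varchar2(100);
begin
   select full_name into v_name from states where abbrev = code;
   return v_name;
end;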

When you make a nested expression like regexp_replace(lower(substr(to_char(...)))), that's known as "composition of functions", and although that seems like a straightforward notion, there are things you can do with it. If you take that whole expression and make it into a function:

function transform_it(what varchar2)
   return varchar2
   deterministic
is
begin
   return regexp_replace(lower(etc. etc.

then you can reference this transformation from many queries; if you later realize there is a faster way to do it, you can rearrange the nesting and all those queries will speed up; and you can use a function-based index or the 11g function result cache to supercharge it.

Saturday, April 12, 2008

Deleting Data, Part 2

In a previous post, I listed a few different techniques for deleting Oracle data. Just to recap:

DELETE statement in SQL (or PL/SQL).
FORALL loop construct wrapped around a DELETE statement in PL/SQL.
TRUNCATE TABLE statement.
Temporary tables.

Different themes emerge. Some ways are more flexible (DELETE statement), some are faster for large amounts of data (FORALL, TRUNCATE), and some are more convenient (temporary tables).

As I thought about it some more, I realized there must be 50 ways to... delete your data. Here are some of the others. I'm sure there are plenty more that I haven't had direct experience with.

ALTER TABLE DROP PARTITION falls in between a normal DELETE and a TRUNCATE. You've specified some criteria to divide up the values in a large table, for example by year or by area code, and with one statement you can get rid of all the data in one of those buckets. Perhaps every Jan. 1, you drop the partition holding the data from five years ago. Dropping a partition simplifies the behind-the-scenes bookkeeping for the physical storage and the indexes, as opposed to issuing a DELETE against a regular table.
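
A sketch, with an invented table range-partitioned by year:

create table sales_history
(
   sale_id   number,
   sale_date date,
   amount    number(10,2)
)
partition by range (sale_date)
(
   partition p_2003 values less than (to_date('2004-01-01', 'YYYY-MM-DD')),
   partition p_2004 values less than (to_date('2005-01-01', 'YYYY-MM-DD')),
   partition p_2005 values less than (to_date('2006-01-01', 'YYYY-MM-DD'))
);

-- Every Jan. 1, one statement empties the oldest bucket:
alter table sales_history drop partition p_2003;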

If you add a foreign key constraint to a column, you can include the ON DELETE CASCADE clause. The foreign key references another table, meaning you can't insert data unless the value in that column matches one of the key values in a different table. ON DELETE CASCADE means that when data is deleted from the other table, the matching key values are deleted from your table also, saving you from having to issue multiple sets of DELETE statements for interrelated tables.

You can add a foreign key constraint via the CREATE TABLE statement or the ALTER TABLE statement. Here's where it becomes difficult to dream up different "tasks" for different ways to achieve the same end. Putting a foreign key constraint in a CREATE TABLE statement suggests you are planning ahead and have constructed a detailed database design. Putting a foreign key constraint in an ALTER TABLE statement could mean that you are troubleshooting a problem once the database is in operation; perhaps it was overoptimistic to assume that every programmer would remember to DELETE from all the right tables. Or it could just mean that you were writing a setup script, found it easier to guess the right constraint syntax for ALTER TABLE than for CREATE TABLE, and so wrote a simple vanilla CREATE TABLE statement immediately followed by one or more ALTER TABLEs.
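
Both forms, sketched with made-up ORDERS and ORDER_ITEMS tables:

create table orders
(
   order_id number primary key
);

-- Planned ahead: the constraint goes right into the CREATE TABLE statement.
create table order_items
(
   item_id  number primary key,
   order_id number
      constraint fk_items_orders
      references orders (order_id)
      on delete cascade
);

-- Or retrofitted later with ALTER TABLE (same effect, different moment):
alter table order_items
   add constraint fk_items_orders_2
   foreign key (order_id)
   references orders (order_id)
   on delete cascade;

-- Deleting a parent row now deletes its matching child rows automatically.
delete from orders where order_id = 42;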

Personally, I use foreign key constraints a little less than I might like. In a data warehousing situation, such as with a search engine, I might be fiddling with or reloading data to get everything just right, and want to avoid unexpected side-effects in other tables. For example, if you reload the set of US state abbreviations into a 50-row table, you wouldn't want all the customer data for California to disappear because the 'CA' row was removed for a moment.

Another space-saving technique that applies to PL/SQL only is the SERIALLY_REUSABLE pragma. It lets the database free up memory space used by package variables, more frequently than happens by default. Then you don't need to feel so guilty about having your package code load the entire contents of a table into a PL/SQL collection data structure. But the memory gets freed up so often that the pragma seems only useful for specialized situations with long-running sessions.
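
The pragma has to appear in both the package spec and the body. A sketch, caching an invented STATES lookup table:

create or replace package state_cache is
   pragma serially_reusable;
   type name_tab is table of varchar2(100) index by varchar2(2);
   names name_tab;
end state_cache;

create or replace package body state_cache is
   pragma serially_reusable;
begin
   -- Load the whole lookup table into a collection. The memory is handed back
   -- at the end of each database call instead of living for the whole session.
   for r in (select abbrev, full_name from states) loop
      names(r.abbrev) := r.full_name;
   end loop;
end state_cache;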

Let's broaden the objective a little bit. Some of these techniques make data disappear without requiring action on the programmer's part, keeping bad old data out of query results and freeing up space on disk. Taking off the DBA hat and putting on the developer one, perhaps we wouldn't care quite as much about the disk space. What about techniques to make data disappear from queries, even if it doesn't really disappear from the database?

The CREATE VIEW statement can create a rolling window based on dates or some other value in the data stream. For example, you might use a condition WHERE date_col >= SYSDATE - 7 to only examine the last 7 days worth of data. Because Oracle dates include a time component, that would mean everything from 1:25 PM last Saturday up to 1:25 PM this Saturday, using the example of my own clock as I type this. If you wanted only full-day periods, you would use a condition such as WHERE date_col BETWEEN TRUNC(SYSDATE) - 7 AND TRUNC(SYSDATE). That would include everything from the start of last Saturday, up to the start of today, and would give you consistent results if you ran the same query again during the same day.
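
A sketch with an invented PAGE_HITS table:

-- Rolling 7x24-hour window, anchored to the moment each query runs:
create or replace view recent_hits as
   select * from page_hits
   where date_col >= sysdate - 7;

-- Full-day window: from the start of last Saturday through the start of today,
-- so repeated runs during the same day return consistent results.
create or replace view last_week_hits as
   select * from page_hits
   where date_col between trunc(sysdate) - 7 and trunc(sysdate);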

Other views might restrict queries to the last 1000 web page views or financial transactions. I think of such views broadly under the "deleting data" umbrella, because just based on the passage of time or activity within an application, older data stops appearing in query results without any inconvenience to the programmer.

Another technique along the same lines is the CREATE MATERIALIZED VIEW statement. This has 2 distinct uses: either you have a table normally accessed through a database link that is so slow you just want to have a local copy that you can query, or you have such an expensive/complicated query against a big table that you want to keep the query results in their own table so you can retrieve or subset them much faster. The "deleting" part comes because the materialized view can be automatically refreshed, either on a schedule or whenever the data in the underlying table changes. So as with other views, you can sit tight and old data will stop showing up in your queries.
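
A sketch of the "expensive query, refreshed on a schedule" case, with invented names; the START WITH/NEXT clause sets up a daily refresh:

create materialized view sales_summary
   build immediate
   refresh complete
   start with sysdate next sysdate + 1
as
   select region, sum(amount) total_amount
   from sales
   group by region;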

I've started making some headway with materialized views, but am not yet a master of the refreshing techniques. That seems to require perusing both the SQL Reference and the Data Warehousing Guide. All those Data Warehousing examples with quarterly sales figures etc. don't resonate much with me when I'm learning about MVs, analytic functions, or partitioned tables.

I think those are all the related techniques that I use or consider using for such purposes. Of course, you can always manufacture other situations or add layers on top of these techniques. You could set up a scheduled job that deleted old data, you could have a trigger that acted like a foreign key constraint by deleting from a related table -- the possibilities are endless!

Tuesday, April 8, 2008

Unlikely Combos: Deleting Data

One of the challenges I see in understanding how the Oracle Database works is when features for similar tasks don't fall into neatly organized categories. It's obvious that this feature should be documented over here, and that one over there, but where can you discuss them side-by-side? Maybe the features are from different functional areas, and it's tough to find a single reviewer who can vouch for both sides of the comparison. Maybe the feature is being used in ways not originally intended, which prevents a technique from getting officially endorsed.

Well, that's where a blog comes in. Personal opinions all the way!

I'll write some posts with features that I think of as addressing similar tasks, falling along a spectrum of convenience, capacity, and complexity. First up: deleting data.

The basic method for deleting data is the SQL DELETE statement. Well duh! You can delete one or a zillion rows in a table, or a subset of rows based on simple or complicated conditions. Everyone knows that.

delete from the_table;
delete from the_table where col_value > 10;
delete from the_table where col_value in (select some_other_col from some_other_table);

But as the volume of data grows or the conditions get more tangled, other techniques start to look attractive.

Next up is the PL/SQL FORALL statement. You can use this loop-like technique in situations where normally you would issue a sequence of regular DELETE statements, because it's too complicated to construct an IN list or what have you. You'd put the construct inside an anonymous block, stored procedure, or trigger.

forall i in 1..the_count
   delete from the_table
   where col_value = collection_element(i);

In real life, your WHERE condition would probably be more complicated, with different clauses referencing the i'th value of different collections.

forall i in 1..the_count
   delete from the_table
   where col_value between lower_bound(i) and upper_bound(i)
   and col_string not like '%' || string_pattern(i) || '%';

When I would normally write a little Perl script to spit out thousands of DELETE statements, instead I make the script generate an anonymous block that sets up one or more big PL/SQL collections, then runs a little FORALL-DELETE loop at the end. FORALL batches all the requests, so there's a lot less network traffic and other overhead. You can use the same technique to speed up large groups of related INSERT or UPDATE statements too.
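
The generated block ends up looking something like this (a sketch; a real script would emit thousands of the collection assignments):

declare
   type num_tab is table of number index by pls_integer;
   collection_element num_tab;
   the_count pls_integer;
begin
   -- The script spits out one assignment per value to delete...
   collection_element(1) := 1001;
   collection_element(2) := 1002;
   collection_element(3) := 1003;
   the_count := 3;

   -- ...and a single FORALL at the end sends them all to the database in one batch.
   forall i in 1..the_count
      delete from the_table where col_value = collection_element(i);

   commit;
end;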

Both DELETE and FORALL do all the nice Oracle behind-the-scenes bookkeeping that we all know and love. But sometimes, all you want is for a lot of data to be gone, ASAP. For example, in a data warehousing scenario, you might want to keep emptying a table and immediately re-filling it. That's where the TRUNCATE statement comes in. It skips most of the per-row bookkeeping -- no undo is generated for each row, and the optimizer statistics aren't maintained, so they can go stale. But it's fast.

truncate table the_table;

The opposite scenario is when you only have a little data to delete, but you want to delete it frequently. Maybe a stored procedure needs a little work area that isn't needed after the COMMIT, or after a web session finishes generating an HTML page and disconnects. That's when the temporary table comes in. It empties itself with no fuss, no muss, either when the session commits or disconnects, your choice.

Oracle temporary tables are faster to populate and clear than normal tables, although they don't totally avoid behind-the-scenes overhead. The key thing to remember, if you're a SQL Server expert, is that creating and dropping them is just as expensive as for a regular table, so you want to create a single temporary table as part of application setup, and let it live forever. Oracle takes care of deleting the data when you're through with it, and also keeps different sessions from seeing each other's data. So several sessions can be messing with the same temporary table at once, and each only sees the data it put there.

-- To clear the table as soon as the transaction ends...
create global temporary table t1 (scratch_value number) on commit delete rows;
-- To clear the table as soon as the session ends...
create global temporary table t2 (key_value varchar2(16)) on commit preserve rows;

OK, back to TRUNCATE. Sometimes, you want a lot of data to disappear quickly, but then the lack of bookkeeping bites you later. I've had tables where I truncated a couple of hundred thousand rows before loading new data, and then anything involving the new data was slow. I could go through the process of generating new stats, but then the whole operation takes as long as using DELETE for the whole table. Sometimes -- again usually in data warehousing situations where nobody else is using a table -- the practical thing to do is DROP TABLE followed by CREATE TABLE to recreate it.
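
For reference, regenerating the stats after a reload boils down to one DBMS_STATS call (a sketch; THE_TABLE is just a placeholder name):

begin
   dbms_stats.gather_table_stats(
      ownname => user,
      tabname => 'THE_TABLE',
      cascade => true   -- refresh the index statistics too
   );
end;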

Another kind of table that you can think of like a temporary table is the external table. The data comes from an outside text file, so you can imagine that the data ceases to exist (within the Oracle sphere of control, anyway) as soon as you finish querying it. It takes up space on the filesystem, but not in the normal tablespace and datafile regime. I haven't personally used external tables, but I've considered it for data that I thought of as "read once and forget".
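
The setup looks roughly like this -- an untested sketch with invented names, where a directory object points at the file's location and the ACCESS PARAMETERS clause describes its layout:

create directory data_dir as '/u01/app/feeds';

create table visits_ext
(
   visit_day  varchar2(20),
   page_url   varchar2(200),
   hit_count  number
)
organization external
(
   type oracle_loader
   default directory data_dir
   access parameters
   (
      records delimited by newline
      fields terminated by ','
   )
   location ('visits.csv')
);

-- Query it like any other table; the data "lives" only in the text file.
select page_url, sum(hit_count) from visits_ext group by page_url;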

Thursday, April 3, 2008

Rikki Don't Lose That NUMBER

Everything you know is wrong. The more things change, the more they stay the same. Money makes the world go around. Let's see how many aphorisms I can spout while thinking about the Oracle NUMBER datatype.

The Oracle NUMBER datatype has a couple of attributes, precision and scale, that can be hard to grasp. The documentation spends some time explaining how these attributes affect the storage requirements and accuracy for numeric data. But what does it all mean, really?

Instead of storing integers in 16-, 32-, or 64-bit packages, or floating-point values in IEEE format, the Oracle database uses a different format that trades off some space and performance, to avoid rounding errors of the sort that would foul up financial calculations.

The precision has to do with how big your numbers could get, in terms of decimal digits rather than bits. Need to count something with values from 0 to 99? That's a precision of 2. Need to score tests from 0 to 100? That's a precision of 3; the 3 digits let you go up to 999, even if you don't use all those values.

When you declare a column type of INTEGER, again you're not saying you want values up to 32,767 or 2,147,483,647 like you might expect in binary terms. You're saying you want as many decimal digits as possible, which in the case of Oracle is 38 -- 99,999,999,999,999,999,999,999,999,999,999,999,999. INTEGER is just a synonym for NUMBER(38). If you don't really need that many, you can declare a smaller numeric type like NUMBER(10), and you won't have to buy as many hard drives.

OK, but banks and stores and stuff sometimes hand out these little sub-dollar units, coins I think they're called, or "change" as the kids say these days. No problem, that's the scale part. The "longest" number of decimal digits is given by the precision, and the scale just says how far over the decimal point should go, starting from the right. In terms of storage space in the database, the range 0-999 takes up the same amount as 0-9.99, it's just that the former is declared NUMBER(3,0) and the latter is NUMBER(3,2), meaning shift the decimal point 2 places from the end.
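
To make that concrete, a throwaway table (all names invented):

create table number_demo
(
   counter    number(2),    -- 0 through 99 (or -99 through 99)
   test_score number(3),    -- up to 999, even if you only ever store 0-100
   price      number(3,2),  -- up to 9.99, in steps of 0.01
   big_count  integer       -- the same as number(38)
);

insert into number_demo (price) values (9.99);   -- fits
insert into number_demo (price) values (12.50);  -- fails with ORA-01438:
                                                 -- value larger than specified precision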

If you are a scientist, banker, or soap marketer, you probably have rules for how many decimal places of accuracy you need. That asteroid has only a 0.0000000062% probability of hitting the Earth, not a 0.00000001% chance like it said in the paper. We've made a breakthrough in purity and now it's 99.46%. Your savings account earned $000000.00 in interest this month. That sort of thing.

When I first heard that NUMBER had a special storage format, I thought Oracle had resurrected the idea of "binary coded decimal", which was used in old microcomputers like the Commodore 64. Instead of storing 8- or 16-bit integers, you could fit one decimal digit 0-9 in each nibble of memory, so you could get 0-99 in a byte instead of 0-255. The 6502 processor could do addition and subtraction in this mode. It bought you some speed doing things like implementing a scoreboard in a video game. When someone scored 15,850 points in a pinball game, instead of doing a lot of slow divide-by-10 operations to figure out what digits to display on the screen, you could use the BCD values as indexes in a lookup table to grab the bitmaps for "1", "5", and so on.

But no, the NUMBER storage format is different. More compact than BCD, but still slower than native machine operations. If all this business with decimal digits has no bearing on your results, that's what BINARY_FLOAT, BINARY_DOUBLE, and PLS_INTEGER are for -- raw speed. That last one, PLS_INTEGER, can't be stored in a table. It's only for doing arithmetic within PL/SQL, which can speed up your triggers and stored procedures, in case you were declaring loop indexes as INTEGER thinking that was like an "int" in C. These days, BINARY_INTEGER is a synonym for PLS_INTEGER, but it used to be slower, so I use PLS_INTEGER in case my code ever gets run on an older database release.
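
So in PL/SQL-only arithmetic, the declaration is the whole trick -- a trivial sketch:

declare
   total pls_integer := 0;  -- machine arithmetic, not NUMBER arithmetic
begin
   for i in 1..1000000 loop
      total := total + 1;
   end loop;
   dbms_output.put_line(total);
end;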

Most languages these days have done one thing or another to slow down the '80s style of pure machine arithmetic. In Java, for example, although you've still got the C-style int, many of the data structures require the type java.lang.Integer, which is just an object wrapping an int. Manipulating these objects requires extra levels of indirection behind the scenes before you can actually add or subtract them.

I say the '80s were the int's heyday because everyone was going to so much trouble to avoid any kind of floating-point arithmetic. Want to draw a line? Use Bresenham's line algorithm. Want to draw a circle? Use Bresenham's circle algorithm. Bezier curves?!

OK, the Bresenham algorithms are from the '60s, but it was in the '80s that everyone wanted to write their own version of Flight Simulator. I used these algorithms to implement an Amiga-style windowing and graphics system on Atari STs.

It was in the '90s that things changed. I trace the moment to the release of the IBM RS/6000 processor. (Sorry, RISC System/6000. So many other companies had trademarks on "RS" terms that we were told not to use the acronym in anything official.) Its floating-point unit was so good that we would advise FORTRAN programmers to do calculations on integer values by converting from integer to floating point, doing all the multiplies and divides, and then converting back to integer.