Sunday, April 20, 2008

A Word about Functional Programming

In all the back-and-forth about procedural programming and object-oriented programming, don't forget about another possibility: functional programming. It offers valuable lessons that can help with both hardcore SQL database operations, and the most modern and hip OO coding styles.

In some quarters, functional programming has a bit of a stigma from the Lisp days. In its purest form, functional programming does away with variables, turns even the most complicated program into a one-liner (with tons of brackets to nest properly), and has performance that varies depending on the characteristics of the language compiler.

These days, the introductory Comp Sci course at UC Berkeley (61A) uses Scheme as the teaching language. It's more accessible than Lisp, but still I found it more pleasant to follow along at home doing the exercises in Javascript.

On a pragmatic level, the upside of functional programming goes something like this. Once you write a function that can reliably read the color value for the pixel at coordinates (x,y), and another that can reliably write a color value for the pixel at (x,y), writing Photoshop just boils down to some data structures and a zillion calls to these functions. In a web development context, if you can make a popup window appear reliably for Firefox, Internet Explorer, Opera, Safari, etc., it doesn't really matter how unappealing the code underneath is, you can build more and more layers that call the same few functions. You can see the downside on many popular web sites; once people learned how to make ads appear and dance across pages, or pull code from other sites to track people's browsing behavior, they built on top of those functions without considering the ultimate performance hit or memory consumption to do all that for hundreds of browser pages/tabs at a time.

If you just want to introduce a bit of functional programming into your everyday life, you can do that without switching languages. Here are some ways you could use its precents to improve PL/SQL or other database code.

With functional programming, every program unit is a function (i.e. has a return value). This lets you map the outline of a program using essentially pseudo-code style. Instead of:

if x > y then x := 0; end if;

you would write something like:

if exceeds_limit() then reset_value(); end if;

When you've tried this before, you might have found it cumbersome to pass all the input values as parameters -- it's not any better to write is_greater_than(x,y) rather than x > y. But PL/SQL lets you write nested subprograms that access the variables of the parent subprogram, so you can make calls without parameters and let the subprograms get and set the values they need.

OK, it's true that reset_value() above is a procedure, not a function. But that's why I said full-on functional programming isn't strictly required. In traditional FP style, instead of doing assignments, you'd issue RETURN statements, and all the assignments would happen at outer layers of the program.

These pseudo-code outlines make it easier to implement bug-free versions of the named subprograms, since they're generally dead simple to write. You're also protected if you need to write several different variations of the main program, that call different combinations of subprograms or do things in a different order; you don't wind up duplicating all the code for the detailed logic.

Functional programming does instill a certain discipline to consider all the possible cases of execution:

function compare(x number, y number) return number is
begin
if x > 0 then return 1; end if;
if x < 0 then return -1; end if;
if x = 0 then return 0; end if;
-- If there is some unexpected combination of parameters, NULLs or NaNs or
-- what have you, need to handle that case also.
return null; -- Or raise an exception...
end;

When you're comfortable handling all the "halting conditions" for straightforward functions like that, it gives you confidence to tackle recursive calls. Typically, a recursive call will do a little bit of manipulation or testing to see if the function has enough information to return a final value, and if not, will call itself with a slightly modified (simpler) version of the input parameter. For example, N-1 for an integer parameter, or SUBSTR(param,2) for a string parameter. Something that trends towards zero, the empty string, the empty list, etc.

PL/SQL can do recursion, but its more complex data structures don't jump out as being ideally suited to recursive calls. For example, if you have a collection, it's not a simple matter to construct a new collection consisting of elements 2-N, to chop off the first one, or to "concatenate" collections to pass the results back through recursive calls. That's a common operation in doing parsing operations, and other scenarios often tackled through recursion. Perhaps with the enhancements to collection operations over the last couple of releases, recursive operations involving collections are now straightforward. If so, I'd love to hear from people who have done so.

SQL coders have been dealing with performance implications of functions for some time now, so they could swap war stories over beer with Lisp coders. If you aren't careful, a SQL query can make astronomical calls to expensive functions. But if you are careful, then you can be like one of those Lisp programmers who looks at any problem and says, "I could solve that in one line, since 1985!".

Oracle subselects and inline views are very good for doing "composition of functions", that is, producing a set of values and passing those values through a series of functions.

select line from
(
select howmany || ': ' || label line from
(
select howmany, label from
(
select count(*) howmany, title label from t1
group by title
)
order by howmany desc
)
)
where rownum < 11
;

In the example above, we're producing a large number of values in the innermost section, and in the outer layers we're sorting the results, concatenating different fields into a single string, and finally limiting the number of results that come out at the end. Along the way, we're renaming some of the expressions and columns of the inner layers to be more descriptive and generic.

Performance-wise, it's best to push the most expensive operations and the limiting step as far inside as possible. If the innermost query only produces 10 rows, it doesn't matter how much transforming and sorting we do in the outer parts, because those steps only happen 10 times.

For example, consider these variations on the same idea:

select distinct(substr(lower(colname),2)) from t;
select distinct(substr(lower(distinct(colname)),2));

Let's say the table has 10,000 rows. The first query is relatively inefficient, because we're running LOWER 10,000 times, then running SUBSTR 10,000 times on the results, then finally doing DISTINCT which might prune the final results down to 1000. The second query runs DISTINCT right at the beginning (the innermost DISTINCT), which prunes the results let's say to 1500; then LOWER gets run 1500 times, and SUBSTR gets run 1500 times, and then the second DISTINCT brings the final results from 1500 rows down to 1000 rows.

I actually ran the first query above in a moment of Lisp-like "see, I can do this in one line", and it processed 22,000 rows in a fraction of a second. In a real business situation, the functions might be things like IS_GOOD_CREDIT_RISK() or DESERVES_A_RAISE_THIS_YEAR(), things that might even query other tables, and anyway take a long time to compute for every row. In which case it really is important to make the result set small in the inner layers, and save the expensive calls for the outer ones.

A whole class of Oracle features go into the idea of optimizing queries that call functions.

There is the function-based index, where the database pre-computes the results of a function for each row, and stores the results in the index rather than as a separate column. The function must be declared DETERMINISTIC, that is, it is guaranteed to give the same result every time when passed the same input parameters. (The database already knows that built-in functions like LOWER() and SUBSTR() are deterministic, so you can build function-based indexes that incorporate calls to those functions.) This approach can be better than having a separate column, because what if the logic breaks down and that column doesn't get updated when the other columns do? The function results will get out-of-date. If you only need the function results for use in WHERE clauses, the function-based index protects against that case. It actually protects a little too well in my experience, because when you recompile the function, the index becomes invalid and prevents DML operations on the table. So you need to separate out the code for that function so it doesn't get recompiled except when you really change it. Either that or add an ALTER INDEX...REBUILD statement after the CREATE OR REPLACE statement for the function.

In 11g, there is the function result cache, which devotes some memory to the results from functions. The idea being that if you have a function that translates 'CA' to 'California', 'IA' to 'Indiana', etc. the database should be smart enough to keep those 50 or so values cached in memory and skip the actual call to your function. That way, you could query large amounts of data without displaying cryptic verbatim column values, and without paying a performance penalty for the transformations. I haven't personally tried the result cache, but I'm itching to -- got those DETERMINISTIC keywords locked in and ready to go!

When you make a nested expression like regexp_replace(lower(substr(to_char(...)))), that's known as "composition of functions", and although that seems like a straightforward notion, there are things you can do with it. If you take that whole expression and make it into a function:

procedure transform_it(what varchar2)
return varchar2
deterministic
is
begin
return regexp_replace(lower(etc. etc.

then you can reference this transformation from many queries; if you later realize there is a faster way to do it, you can rearrange the nesting and all those queries will speed up; and you can use a function-based index or the 11g function result cache to supercharge it.

1 comment:

Chris said...

While PL/SQL lacks a lot of essential constructs for true functional programming (especially first class functions and closures), avoidance of side effects and abstraction, as you mostly wrote about in the first part, still make better programs.

Therefore I'd like to mention one enhancement of PL/SQL 9 that you didn't write about, case. A more "functional" version of your compare routine (similar to lisp's cond) looks like this:

function compare(x number, y number) return number is
begin
return
case
when x > 0 then 1
when x < 0 then -1
when x = 0 then 0
else null
end;
end;