Friday, July 20, 2007

PL/SQL Web Programming 1.0 - The URL is the API

Back in the transition between 8i and 9i, I almost had the chance to write a book about developing database-driven web apps for Oracle. But with so much web stuff moving to Java and the middle tier, it was too big and amorphous of a subject. I wrote about specific features, like PL/SQL Server Pages, but not a whole soup-to-nuts treatment.

Ironically, now that things are more sorted out, I know the subject a million times better from writing code, but I'm no longer writing doc. So my (unofficial) thoughts on the subject go here on the blog.

The notion that will really get you started with PL/SQL web programming is, realize that with typical PL/SQL web apps, your API is really the URL. You write stored procedures or packages as normal, and a component in the web server translates between URL format and the PL/SQL parameter types. If you have a full-blown Application Server installation, that component is the mod_plsql module. If you take the simpler way to web-enable your database, e.g. for a departmental application, that component is called the Embedded PL/SQL Gateway and comes with the database.

What do I mean by "the URL is the API"? Well, by looking at the URL, you can see how the application code is structured. Consider this URL that I own on Oracle.com:

http://www.oracle.com/pls/db102/homepage

/pls is a common convention for the virtual path under which all the URLs represent PL/SQL stored procedures, similar to the way you see /cgi-bin for Perl or other types of scripts.

db102 means the stored procedure is owned by the DB102 schema.

homepage means that's the name of the stored procedure. It could also be a synonym pointing to the stored procedure. If the stored procedure was part of a package, the last part of the URL would be pkg_name.proc_name instead of just proc_name.

I find it a little annoying the way dots are significant in the last part of the URL, because it keeps me from using URLs that end with "filename.ext", as is typical of other web development environments. There's always the Apache module mod_rewrite to mangle the paths of URLs into some friendlier form, but that introduces complexity, and involves coordination with other groups if you don't own the web server.

But the path part of the URL is just the appetizer. The real API is the query string, the part of the URL after the "?" character. For example, in a URL like this:

http://www.oracle.com/pls/db102/search?remark=quick_search&word=john+russell&tab_id=&format=ranked

the part after the ? represents the parameters to the procedure. (The top-level subprograms that you call to generate a page in a PL/SQL web app are always procedures, not functions.) We can see that the procedure takes a REMARK parameter, a WORD, parameter, a TAB_ID parameter, and a FORMAT parameter.

Perhaps there are other parameters, but if so, they must have default values, which makes them optional for the URL. We can't infer anything about the order of the parameters; they can be specified in the URL in a different order than they're declared in the procedure, and mod_plsql takes care of it.

Each parameter is part of a name/value pair separated by the = sign. The & character is the separator between each name/value pair. Since there's nothing between the = sign and the & character following TAB_ID=, that parameter is getting passed a null.

The + in the parameter value "john+russell" above represents a space, in the standard web URL encoding scheme. Because we need a way to pass special characters like &, =, space, /, characters outside the ASCII range, etc. we can represent them as %xx, where xx is 2 hex digits representing the character code. Space is represented by either + or %20 (ASCII 32). And it follows that real % and + characters are represented by their own %xx sequences.

Your PL/SQL procedure can receive VARCHAR2 values with all of these characters with no trouble, the mod_plsql module translates all the values for you when it invokes your procedure. But if you generate a page of HTML from a PL/SQL procedure, and that page includes links to other PL/SQL procedures including parameters, then you do have to URL-encode the value part of each name/value pair.

Encoding most special characters is a relatively straightforward matter, but you have to take care with the characters that are special within the URL. For example, you wouldn't URL-encode the part of the URL that normally includes / characters, or up to the ? character that begins the parameters. (That last portion of the URL is known as the "query string".) Within the query string, you wouldn't URL-encode the = characters in the name/value pairs, and you especially wouldn't URL-encode the & characters that separate the name/value pairs. In fact, to be strictly correct, you must turn each & separator character into the sequence &, so that the browser or whatever user agent parses your HTML file doesn't find a lone unescaped & character in the middle of a string literal. That makes for invalid XML and will trip you up when you start using XHTML content-type declarations, strict file validators or link checkers, and so on.

The & business is the trickiest part of basic PL/SQL web development. Normally the & character is used by SQL*Plus for substitution variables. So even to use it in a program compiled by SQL*Plus, you either have to issue SET DEFINE OFF to turn off substitution variables entirely, or do something like

define amp = &
define urlamp = '&'

and put &. in your strings everywhere you want a normal &, and &urlamp. everywhere you want &. So your search-and-replace code that replaces & with & winds up looking like:

expanded_separators := replace(uneexpanded_separators,'&.','&urlamp.');

When a link coded like <a href="proc?parm1=val1&parm2=val2"> is displayed on a web page, any preview you see in the status bar will just show you regular & characters. There is a bit of a Catch-22 in that some older crawler software will fail to parse the & characters correctly, so your Apache log might fill up with invalid URL requests even though you coded the links according to the standards.

You must order any substitutions so that once you have introduced & or % characters into your encoded string, you don't substitute those characters a second time.

So, let's deconstruct an imaginary URL:

http://www.example.com/pls/my_schema/my_proc?parm3=val3&parm1=weird+string%28value%29%26%3D%2F%3C%3E

That means somewhere in your code, you have CREATE OR REPLACE PROCEDURE MY_PROC...

We put param3 first in the URL because the order of parameters in the URL doesn't need to match the order of parameters in the procedure; you just have to fill in values for all parameters that don't have default values.

The fact that we omitted parm2=val2 means the procedure will just use the default value for that parameter.

The value for parm1 includes some unusual characters that get quoted as + (space) or %xx (punctuation and non-ASCII characters). Your procedure doesn't have to decode the escape sequences, but it does have to encode them if you write out such a link.

The above URL with the & characters is what you would literally open up in your browser. But if you were writing out a link tag in your own page, you would write it as:

http://www.example.com/pls/my_schema/my_proc?parm3=val3&parm1=weird+string%28value%29%26%3D%2F%3C%3E

with & instead of & as the separator.

What about datatypes, you ask? mod_plsql will helpfully convert values to PL/SQL datatypes where possible. Most parameters can be declared VARCHAR2 if you want. NUMBER and INTEGER are also OK.

I stay away from anything with too exotic a representation, e.g. DATE. Better to pass it as a VARCHAR2, then call TO_DATE(param_value, 'format_string') inside the procedure. If there is a problem with the conversion inside the procedure, you can trap the exception and handle the error yourself. But if the error happens while mod_plsql is converting the value to the type of your parameter, you can't trap the exception and you will see an unhelpful (to the user) PL/SQL stack dump instead of the regular page.

Boolean is one common type that isn't supported for parameters in a mod_plsql URL. Use INTEGER and pass a 0 or 1 instead. The non-support for Boolean is handy in that you can put a required Boolean parameter on a procedure, which ensures that it can't be called unexpectedly by someone constructing a mod_plsql URL, trying to hack into your pages. Although I wouldn't rely on that as the primary security for a procedure, since mod_plsql could always add support for Boolean parameters.

You also commonly see URLs with the same parameter repeated multiple times, such as:

http://www.example.com/pls/my_schema/my_proc?name=Joe&name=Fred&name=Sue

It feels intuitive that the procedure is getting an array-like data structure with 3 elements. But what type should it be exactly? That's not quite so intuitive. If you use a TYPE declaration to make a new kind of VARRAY in your own procedure, mod_plsql can't see it early enough for you to declare parameters of that type. So what you really need is a package that declares a type that's a VARRAY or nested table of VARCHAR2s, then declare the procedure parameter as YOUR_PACKAGE.YOUR_TYPE. Luckily, there is a predefined type like this, OWA_UTIL.IDENT_ARR. But the elements of that type aren't especially long, so if you need to pass sizeable values, you'll need to make your own type.

A further complication comes if you want to make a multiple-occurrence parameter optional, that is, to make both of these URLs work:

http://www.example.com/pls/my_schema/my_proc?name=Joe&name=Fred&name=Sue
http://www.example.com/pls/my_schema/my_proc?nickname=Wheezy

You need to declare a default value in your procedure for the NAME parameter, but how do you do that for a type that you just declared? You can't just make the default be NULL. You need to make a real but empty instance of that type.

The answer is to declare inside the same package as the type, an empty instance of that type (for index-by tables a.k.a. associative arrays), or a function that returns an empty instance of that type (for nested tables or varrays). Something like:

-- In the package spec.
type type1_t is table of varchar2(100);

type type2_t is table of varchar2(100) index by pls_integer;
empty_type2 type2_t;

-- In the package body.
function empty_type1 return type1_t is
begin
return type1_t();
end;

Then you can declare your procedure like so:

create or replace procedure my_proc
(
name my_package.type1_t := my_package.empty_type1(), -- Construct nested table with no elements
nickname varchar2(100) := null
)
...

or

create or replace procedure my_proc
(
name my_package.type2_t := my_package.empty_type2, -- Construct associative array with no elements
nickname varchar2(100) := null
)
...

With parameters set up like this, you can call the procedure with one or more NAME values, or leave out the NAME parameter and pass NICKNAME instead.

A useful pattern for procedures that generate a web page is to have null or empty defaults for all parameters, and make the procedure display a minimal page (typically with HTML form elements) when invoked with no parameters, and display a more elaborate page when invoked with parameters. For example, a search page invoked with no parameters can display just the search box. When invoked with parameters, it can display results too. Instead of pointing the form tag to a different URL, point it to the same procedure. For example, if the procedure name is SEARCH, generate a form tag like so:

<form action="search" ...>

This example also illustrates how procedures within the same schema can link to each other without elaborate paths or qualifiers in the URLs. Just put the procedure name in the action= attribute of a form tag, or the procedure name + the query string in the href= attribute of an anchor ("a") tag.

Planning and preparation like this is all well and good, but how do you actually generate the HTML for the page? Ah, that's another post coming up!

Friday, July 6, 2007

Search Engine Optimization for Programming Languages

Why didn't language X take over the world? Why does language Y enjoy broad adoption despite clear shortcomings? While there are lots of factors, for example the degree of strictness vs. whimsy and how that matches up with the mindset of programmers, I am convinced that it largely comes down to how easy it is to search for sample code in a given language.

For example, once Java became popular, if you ran across a mention of some class or method name, you could always find lots of example by searching for those names. Or if you wanted to see how to use a certain GUI method in combination with a certain container type, you could search for pages with both names. If any name was ambiguous, you could add some other keywords -- class, extends, etc. -- to make sure you really found a full code sample and not some general discussion.

So, Java's adoption was helped by a confluence of factors: a standard set of relatively verbose names, and a bunch of students posting their homework solutions online.

Perl, for me, didn't fulfill its promise and I lament sometimes that it didn't take over the world like it could have. When your source code is full of idioms like $_ and <>, it's hard to search for code samples. (By search, I mean using Google et al, rather than using vi, emacs, or grep on your own source.) When I look at my Perl code, the words inside are mostly generic like for, if, print, and die. Searching for them, you would come up with lots of non-programming discussion, or source code from other languages. For example, what does "unset local $/" do? Google will find instances of the phrase "unset local", but not the full line of code.

So, Perl's adoption was hindered by factors related to searchability. The best Perl source on the web is posted as CGIs, where a search engine crawler can only spider the output, not the underlying source. It's hard to find sample code generally. And the way Perl modules are documented, when you search for popular module or method names, the top hits tend to be dozens of copies of the same Perldoc files. (And if the Perldoc was sufficient, you wouldn't be doing the search in the first place.)

I see SQL and PL/SQL as somewhere in between Java and Perl in terms of searchability. There are lots of unique names from PL/SQL packages and methods. Some of the words that might otherwise be too common are grouped into phrases like "connect by" and "order by". PL/SQL has some words like begin and end that aren't especially searchable by themselves, but if you combine them with others like declare, exception, etc. you can devise searches that home in on PL/SQL source code.

Where SQL and PL/SQL get into trouble with searchability is in assigning specialized behavior to single, common words. PL/SQL has a function called TABLE(); good luck in finding that by searching given that TABLE is used in so many other contexts like CREATE TABLE or "Doing xyz to a Table". There also used to be a SQL function called THE(), now happily gone, but I still get requests from time to time "how come I can't search for the THE function?".

When I get some free time, I'm going to look into the idea of auto-classifying source code via Bayesian analysis. Oracle Text has the CTX_CLS package, where you feed it training data consisting of already-categorized documents, and it deduces rules so that it can analyze unknown documents and match them up with those same categories. The Oracle doc library has tons of files with source code in a single known language, multiple languages in the same file, or source code from one language in a book that is primarily for a different language. I'd like to compare and contrast the CTX_CLS approach with a "homegrown" one similar to Peter Norvig's spelling corrector.