Wednesday, December 31, 2008

Oracle PeopleSoft hosted docs

Just in time for the new year! The Oracle PeopleSoft group now have their docs on Oracle.com in HTML format, hooked up to a Tahiti search:

Oracle PeopleSoft Enterprise Hosted PeopleBooks

Sunday, December 21, 2008

Coding Horror: Hardware is Cheap, Programmers are Expensive


You've probably heard both sides of this argument: throw hardware at performance problems, no no, improve the code. Jeff Atwood at Coding Horror comes down on the "more hardware" side in this post:

Coding Horror: Hardware is Cheap, Programmers are Expensive

Usually I agree with Jeff, but I don't agree with his thesis here. The commenters on this page present some good convincing counterarguments and examples that tuning or modifying algorithms can be very cost-effective.

But sometimes I do like to throw hardware at a performance problem. Wait, haven't I just contradicted myself?

It's all in the context and the human elements. Let's consider the possibilities:

  1. You're in charge, you make the decision. Obviously, you'll consider all the pros and cons and make the right choice. Sometimes the right choice for return on investment is to beef up the hardware, sometimes it's to tune the code, change the algorithm.

  2. Someone else is making the decision, and they're just like you, or they always listen to your good advice. Again, sometimes the choice will go one way, sometimes another, but probably the right path will present itself.

  3. Someone else is making the decision, and they always prefer one approach or the other. Sometimes they'll make the right choice, sometimes the wrong one. Plenty of examples of both in the article comments.

  4. Someone else is making the decision, and they are open to either approach, but they don't have enough information to make an informed decision. For example, they may not know the price/performance of available hardware, or may not know the code base or the team's coding skills well enough to estimate the potential for improvement vs. the time involved.


If all situations were equally likely, we could say don't worry, a majority of the time it'll turn out OK. In the real world, scenarios 3 and 4 occur more often than you might like. You'll need to have strong arguments for "more hardware" and "more tuning" at the ready at all times, not just have one dictum for all situations.

Thursday, December 18, 2008

The Humble COUNT( ) Function



Here's another ode to a small but fundamental aspect of Oracle, following the same theme as The Humble IF Statement. This time, let's look at the COUNT( ) function. I think when you look at it the right way, it opens up the whole story about database performance.

What's the first thing you do when poking around an unfamiliar system? I'll bet it involves SELECT COUNT(*) queries in one way or another. SELECT COUNT(*) FROM USER_OBJECTS to see how much "stuff" is in the schema. SELECT COUNT(*) FROM data_table to see how much data is around. SELECT COUNT(*) FROM audit_table to see how much usage the application gets. You can run SELECT COUNT(*) on various other data dictionary views, and with some WHERE clauses on regular tables, to answer questions like "how many...?", "are there any...?", and "confirm there are no...".

You can also do SELECT COUNT(column_name) FROM table_name. That brings back a count of the number of rows that have non-null values in that column. What's that for?

  1. It demonstrates that you at least know something about the table, the name of a column. That's a small stake in the ground when you're establishing your credibility with experienced DBAs and developers.

  2. It suggests that you'll probably want to use that column in a WHERE clause test, later in a bigger query. If you use that column in an =, >=, BETWEEN, LIKE, etc. test, that comparison can only match non-null values. COUNT(column_name) represents an upper limit on the rows that could come back from such a query.

  3. It demonstrates that you're thinking on a higher plane. Not "how many physical rows total in this table", rather "how many significant rows with data in them". If you're a DBA trying to reclaim space, COUNT(*) might make more sense. If you're a developer tuning query performance, COUNT(column_name) is probably the right one to use.

  4. The COUNT( ) query has a better chance of using an index if you put in a column name rather than *. COUNT(*) can take a long long time on tables with millions of rows.


Let's think about that last point. A typical index doesn't represent null values. The default index is a tree data structure (a B-tree) that would be right at home in a Java programming course. You could have a table with 50 million rows, but if only 1 million rows had something filled in for column X, you could build an index on that column and the index would take a relatively small amount of space -- disk space to store the index, and memory to manipulate pointers to the real data. Any query that tested that column could immediately ignore the other 49 million rows as irrelevant -- and that's why the index doesn't try to record which rows have nulls.

That means if we had a kind of index that indexed nulls, or one that guaranteed there weren't any nulls in the table, COUNT(*) could run really fast. What do you know, we do have indexes like that. Bitmap indexes represent a string of zeros and ones for every different value in a column, including null, to say very quickly whether a row contains a particular value. So when you run COUNT(*) on a table with a bitmap index, it could just count the number of zeros and ones for some value from that column. And a NOT NULL constraint on a column, which also comes for free with a primary key, means COUNT(*) could just count the values in the index, knowing that all the rows have some value for that column, so the number of values in the index matches the number of rows in the table.

That leads us to a thought experiment that's also practical. In SQL*Plus, you can issue SET AUTOTRACE ONto show the explain plan for every SQL query you run. If you want to get a preview of query performance, you can turn on autotrace, wander around your database doing COUNT( ) queries to see how many rows would be returned by your real queries, without the overhead of actually bringing back all the data.

If you do some tests this way, you'll quickly see the results in the explain plans. Index range scans, fast full index scans, and other things with the word "index" in them are generally cause for happiness. Full table scans are generally cause to look closer at the query; for small tables they're not so bad, for big tables they're often a warning signal.

Comparing COUNT(*) and COUNT(column_name) this way, you'll see illustrations of what I've said above. COUNT(indexed_column) produces a happy explain plan. If the table has any bitmap indexes, a primary key, or a NOT NULL constraint, COUNT(*) will take advantage of that fact and also use an efficient explain plan.

That said, sometimes you have a table with, say, 35 millions rows and a primary key, and COUNT(*) still takes a long time to come back. What's up? Well, an index with 35 million entries will take a lot of I/O to read and scan through. That's the basis of all the histograms, cardinality, and other arithmetic that revolves around database performance. (Like "don't use bitmap indexes if the number of different values exceeds 20% of the total number of rows", or "don't use bitmap indexes for small sets of values like Male/Female, where the rows with different values are all mixed together".) Sometimes the theoretical speedup from reading the fancy data structures and dereferencing pointers to get to the disk data, is outweighed by the practical disk and memory considerations to go through the index. Maybe the explain plan for a COUNT( ) query will be different from the same query bringing back real data, for that reason. And that's why there are different levels of expertise when it comes to database performance and query tuning.

Now, I'm just a guy who took the SQL Tuning course, read a lot of Oracle Concepts and Performance docs long before having any practical use for them, more than once drummed his fingers waiting for COUNT(*) to come back with an answer, and in a previous life wrote about close-to-the-metal performance considerations for IBM's POWER and PowerPC chips. This post doubtless doesn't capture all the nuances of query performance. This is just the way I think about that simple little COUNT( ) function.

Tuesday, December 16, 2008

Reformatting in vim


Things never really change. Even with XML, sometimes long lines are a big hassle. For example, if you have a wide code example that runs off the side of the page, and your formatter doesn't offer a way to adjust the page margins for one element, or to auto-shrink the font size.

I was spoiled back in the day by the ability of editors like XEDIT to apply formats to ranges of lines, so I hadn't really explored those capabilities too much in my favourite cross-platform vi clone, vim. When faced with the need, I would usually record a macro (q<letter>) to rearrange the contents of one line and move to the next line, play back the macro (@<letter>), and keep doing @@ to apply the same transform to a range of lines.

vim has language constructs to do conditional logic, but sometimes the old ways are best. Needing to wrap a bunch of wide code examples, without disturbing the other long lines in the file, I adapted this tip from the vim wiki. The essence of the tip is a command like:

:g/.\{80,\}/ .!par w70

Two little tweaks make it work better for me. OS X doesn't have a par command, so I change the last part of the command to fold -w70. And the g// command doesn't have to be global for the file, it can work on a range of lines. My habit when editing interactively is to set a mark a for the top of a range (ma) and a mark b for the bottom of a range (mb). So I set those marks at the top and bottom of a code example, then:

:'a,'b g/.\{75,\}/ .!fold -w75

(Numbers adjusted for my particular situation.)

Once this command has proved itself, the next step is to bind it to a function key (another practice that has gotten harder as editors have evolved; thanks CUA):

map <F5> :'a,'b g/.\{75,\}/ .!fold -w75^M

(With the ^M a real carriage return, generated by the key sequence <Ctrl-V><Enter>.)

At the point where I identify all the contexts where this might be useful, e.g. inside Docbook <programlisting>...</programlisting> tags, I would add one more layer of abstraction to the macro, to search for the start tag, set the mark, search for the end tag, set the mark, then run the command to apply the transformation between the 2 marks.

vim has kind of a weird regular expression syntax for doing OR-style searches. Don't know if it's new since vi, or if I just never used that particular capability in vi. To search for let's say "<programlisting>" or "<screen>", you'd enter:

/<programlisting\|<screen>

In this case, the \ doesn't mean interpret the character | literally, it means interpret that character as a special search function. For more info, within vim enter:

help pattern
help magic

Tuesday, December 2, 2008

Don't Be Cruel


Ran across this old paper from Edsger Dijkstra, discussed on Slashdot:

On the Cruelty of Really Teaching Computer Science, as a PDF of scanned handwriting or the text transcript.

I'm not going to attack or defend the overall thesis, but anyway this paragraph I will agree with:

My next linguistical suggestion is more rigorous. It is to fight the "if-this-guy-wants-to-talk-to-that-guy" syndrome: never refer to parts of programs or pieces of equipment in an anthropomorphic terminology, nor allow your students to do so. This linguistical improvement is much harder to implement than you might think, and your department might consider the introduction of fines for violations, say a quarter for undergraduates, two quarters for graduate students, and five dollars for faculty members: by the end of the first semester of the new regime, you will have collected enough money for two scholarships.

The problem is the false assumptions and analogies that get introduced by these lines of thinking. If a network is "this guy talking to that guy", your thinking will be constrained by what you know about human conversation. If there's a problem, someone can talk louder, slower, etc. and the analogy holds. But if the solution involves something that has no equivalent in the day-to-day world, how are you going to conceptualize it?

My pet peeve, that descends from this same idea, is from the teaching of object orientation. A car is a type of vehicle; a truck is a type of vehicle; they both have wheels; maybe the number of wheels is different; maybe each has a different turning radius or procedure for backing up.

Great, now you understand inheritance, polymorphism, member functions, etc. However, in practice, you use object orientation to avoid zillions of "if" statements, special case code, large blocks of almost-but-not-quite duplicated code. These are concrete issues that you can see while writing programs, that have concrete effects like making things run faster or slower, making the code bulkier or more compact, more or less readable and maintainable.

In my experience, someone who learned OO by simple analogies is likely to be locked into thinking "I have to write every program by making a tree of classes like with the cars and trucks". Rather than "there are different ways to structure the code, and a good OO way will get rid of a lot of branching, magic numbers, and redundant code, without slowing things down too much or obfuscating what's really happening".

I remember seeing such analogies many times in the early days of Java. I don't know to what extent they made it into textbooks and university classroom settings, but I still see them here and there, for example this Q&A about how to implement an association relationship in Java.

Saturday, November 22, 2008

The No-Output Debugging Technique


I run into this situation now and then by accident. If I could just figure out how to bottle it, all the world's software problems would be solved!

Here's the scenario:

- Notice a slight bug in some program output (typically a web page).
- Make the obvious code change, reload the page, bug is still there.
- Make more aggressive code change, making code more modular. Still no dice.
- Add several assertions, improve default assignments. Make control flow easier to follow. Comment the heck out of everything. Still the same bad output.
- Put in some impossible-to-ignore condition: remove the procedure, or always raise an exception, or print some big banner with debug info. Still the same page as before!
- Notice I'm loading the production version of the page instead of the test version. Try the right version with the updated code. Bug was fixed by very first edit.

This is a great way to trick yourself into writing better code.

Tuesday, November 18, 2008

Mystery of the FIRST_ROWS hint


I've always been intrigued by the FIRST_ROWS hint, so I paid special attention when we reached it in the 11g SQL Tuning class. But I'm still puzzled.

The course notes said that although you shouldn't be using hints generally, when you do, FIRST_ROWS is the most useful of the hints (in the form FIRST_ROWS(n) where you specify how many rows to optimize for). Also that it's not effective when the query block contains any "sorting or grouping" operations.

Now, I always assumed that this hint would be used like so:

select /*+ FIRST_ROWS(10) */ * from
(
select count(*) howmany, x from t group by x order by count(*) desc
)
where rownum < 11;

which would get you the top 10 items from a particular table. Notice that the ORDER BY is in an inner block, so presumably not covered by the "no sorts in the block with the hint" restriction.

However, when you put a WHERE ROWNUM < (N+1) clause in the query, the plan will show a line with COUNT STOPKEY that presumably means the optimizer knows only N rows are needed. So shouldn't the hint be a no-op in that case? I always figured there must be some other case where it's needed. I've been trying to find some authoritative information.

If the WHERE clause uses a variable, e.g. ROWNUM < N, maybe the hint is just a way of suggesting what the likely value of N is, for purposes of COUNT STOPKEY. But that's just speculation on my part.

I got one suggestion that the FIRST_ROWS hint makes the optimizer spend less time looking for the ideal query plan, since it's not going to traverse the whole table anyway, speeding up the hard parse phase. Could that be the meaning of "optimized for throughput"? One of the SQL internals guys didn't think so.

I got another suggestion that it could really change the execution plan, but you would use it not in a top N situation, rather where you wanted arbitrary values. E.g. I wonder if certain values in my table are uppercase or lowercase, so I'll just look at a few, don't care which ones (and in real life might be some devilishly complicated join):

select /*+ FIRST_ROWS */ my_col from my_table where rownum < 6;

In this scenario, I could imagine a change in the execution plan producing different results, depending on how exactly the query ran through the index. But I don't have confirmation that could ever happen.

If you are getting the top N values from an inner query block, maybe the hint helps figure out how to efficiently process the inner results, which after all aren't a real table with indexes and such. That's just my speculation again.

The SQL Reference states: The optimizer ignores this hint ... in SELECT statement blocks that include any blocking operations, such as sorts or groupings. However, I find explicit advice and examples featuring ORDER BY here in the Oracle Text docs. The PSOUG page on hints is a little more specific in its language -- it mentions a bunch of things that render FIRST_ROWS inoperative, including GROUP BY and DISTINCT but not ORDER BY; and it also uses ORDER BY in its examples.

So, after much thought... I'm back where I started. FIRST_ROWS is useful for doing top N queries... or for filtering top N queries done inside an inner block... or only for getting arbitrary values, not top N at all. It does this by speeding up parsing... or choosing a different explain plan (which might or might not change the order since no sorts are allowed)... or just confirming to the COUNT STOPKEY operation how many results are likely to come back from WHERE ROWNUM < N.

Any performance mavens who can say from hands-on experience which, if any, of these likely stories are true?

Addendum



I will update the original post, as new answers (or new questions) come in, and with responses to suggestions in the comments.

One other top-N-almost-but-not-quite scenario I sometimes encounter is this: display the top N results, but if there is a tie for Nth place, keep going until you've exhausted all the tied values. It's common for traffic analysis on new or lightly visited sites, where many pages may be tied with 1 visitor, and the value 1 makes the top N rankings. The intuitive, strictly correct thing to do is to include AND howmany >= (subquery to find the value of Nth item; then either bump up the value of N in WHERE ROWNUM < (N+1), or leave out ROWNUM entirely and just stop fetching after getting the top N values plus all successive values tied for Nth. I've never delved deep into what is the best performing way to do this, since like I say it usually happens where relatively small amounts of data are being analyzed. But I could surmise that FIRST_ROWS might help in cases where you omit WHERE ROWNUM < (N+1), yet you stop fetching before exhausting the whole result set.

One of the reasons I stopped including FIRST_ROWS hints in my new queries was that mod_plsql always processes the whole procedure before returning any HTML output. So if there is a query that takes 5 minutes, it doesn't matter if FIRST_ROWS starts sending back output after a few seconds; the user won't see anything until the full 5 minutes are elapsed. (Unlike the client-side GUI situation mentioned by one commenter, where presumably the output could start appearing before all the result set rows were available.)

The Jonathan Lewis blog post referenced by a commenter provides some useful information, but still leaves me wanting more information. I feel like I need a FIRST_ROWS FAQ, with short direct answers -- even if some of them are "it depends". Here are my FIRST_ROWS FAQs and the best answers I have:

Q: Any version dependencies?
A: The optimizer is smarter in 9i, meaning less need for the hint. It's smarter again in 10g (especially as regards WHERE ROWNUM < (N+1) so even less need for the hint.

Q: Any effect on parse time, does FIRST_ROWS result in faster hard parses?
A: Have heard both "think so" and "don't think so".

Q: Does ORDER BY totally negate the performance benefits?
A: It depends. Sometimes the query results come back from an index already in the ORDER BY order, in which case ORDER BY doesn't hurt the potential FIRST_ROWS optimization. If the FIRST_ROWS hint goes on the outer block and ORDER BY is on a subquery... don't know if FIRST_ROWS might still help, or if the sorting in the subquery has already lost the chance to make the outer query any faster.

Q: Are those longstanding Oracle Text examples with ORDER BY wrong or obsolete?
A: Don't know for sure, see above. Might depend on results already being sorted and ORDER BY being a no-op. Might be obsolete based on optimizer improvements in 9i and 10g. Might be immaterial when using Oracle Text for a web-based application through mod_plsql, due to the lack of incremental output.

Q: Any difference when a query includes WHERE ROWNUM < (N+1) using a variable N instead of a constant?
A: Don't know if bind variable peeking plays any role here. I know that I am tempted to use e.g. FIRST_ROWS_10 in cases where the number of items is a variable, but the default or most common case is top 10. However, given all the other factors above (particularly the interaction with ORDER BY), this might just be superstition on my part.

The comments in the Jonathan Lewis blog post discuss different approaches to displaying "top N results" combined with "page X of Y" like in a search engine. I personally prefer the technique "figure out the top N * pages results, then skip all pages prior to the current one". What I do in the Tahiti search engine is just grab the ROWIDs of those top N * pages results, find the desired range of values, then query based on ROWID; so I'm only retrieving the actual data for the 10 or so on the page that's really displayed.

Saturday, November 8, 2008

Things That Make Us Go


I've just finished taking the (EM-based) Performance Tuning course and the SQL Tuning course. It's always a strange feeling taking a course where I already have hands-on experience, or I was around in the early days when standards were being hammered out.

Sometimes that odd feeling (I know this subject backwards and forwards, but I don't know THAT) is a disincentive to take a class. That's probably why I never took a C++ class, after doing the intro to the language with John Vlissides, hearing the early design discussions for the STL, and watching my department mate (and future manager) transcribe the original C++ reference manual to publish with IBM's first C++ compiler.

Other times, behind-the-scenes experience gives a dreamlike quality to the class. Small details become major epiphanies -- THAT'S what that person meant by that remark I didn't believe when they asked me to include it in the documentation. Discussions of timelines kick off a parallel track in my head -- yes, this feature first appeared in that release, but it almost made it in the previous release, or maybe it was there but never exposed, or it was almost called XYZ but changed at the last minute to ABC, and that's why the syntax is the way it is.

My mindset during the performance courses was a mixture of those feelings. I could handwave my way through 95% of the SQL stuff, but didn't know some of the reasons why technique X is just as fast as technique Y; or still believed urban myth Z, which stopped being true some time ago. (Like when SELECT DISTINCT stopped always returning results in sorted order.) I didn't know as much of the EM stuff, but it helped put into perspective why someone might need to know 50 pages worth of internal details just so they could decide whether or not to select one specific checkbox in some dialog.

Just to make things interesting, I took the Enterprise Manager course before the SQL course. The recommended order is the reverse. Sometimes, it's better to wade through the tough stuff first, being thrown into the deep end, and then feel like the easy stuff is a cinch. Like when I first read all those famous science fiction series, for one reason or another I usually read them out of order: Lord of the Rings (3, 1, 2), Dune (2, 3, 1), Riverworld (4, 1, then after that it's too complicated to explain).

My expectation with the SQL Tuning course, was that I would find out I had made dumb newbie decisions about bitmap indexes and function-based indexes all those years ago, and learn that index-organized tables were ideal for some new stuff I'm doing now. Actually, I learned that my original index choices were spot on, and my latest data structures might not be appropriate for IOTs after all.

I will revisit some of my ancient dealings with performance in light of the course material, and write some blog posts about what I find. Back in the late 8i days, I was in charge of the "Indexes" chapter in "Application Developer's Guide - Fundamentals". I was never satisfied with the presentation of bitmap, function-based, reverse key, etc. indexes, but I didn't know enough at the time to be dangerous.

Sunday, November 2, 2008

Don't Fence Me In: All About Constraints


Constraints are simultaneously one of my most favorite and least favorite Oracle Database features. They're great for keeping bad data out of the database. They're a terrible imposition on object-oriented, agile, or <insert your favorite buzzword here> coding style. They save a ton of repetitive coding, writing the same logic in different languages. Hey, we already wrote all that redundant code in 10 languages and we'd hate for all those weeks of debugging to be wasted. Etc. etc.

If you're coming from a Java background, you might think of constraints like a straitjacket that forces your exception handling code to be organized a certain way. But that's the wrong way to think of it. Robust database code must TRY/CATCH for unexpected conditions anyway, in case the power goes out or the hard drive fills up just as you ask to do a completely innocuous INSERT or UPDATE. What if some maverick team member fires up SQL*Plus to make bulk updates to one table, bypassing all the Java application code and its error checking, without understanding the conventions that everyone agreed on for the data values, or the corresponding changes that must be made in other tables? The exception from a constraint is just another error that you expect never to happen, but you don't rely on it not happening.

If you're coming from a SQL, PL/SQL, or maybe more of a scripting background, or if your team programs in different languages or consists of fallible human beings, you can think of constraints like your very own bridge-keeper from "Monty Python and the Holy Grail". Whoever would insert or update table T must answer me these questions three, else an exception raised shall be.

For example, on a command like INSERT INTO GRAIL_KNIGHTS SELECT * FROM ROUND_TABLE_KNIGHTS...

  • If Sir Lancelot and Sir Robin both give the same answers, a UNIQUE constraint will catapult Sir Robin into the Gorge of Eternal Peril.
  • If Sir Galahad can't name a favorite color, a NOT NULL constraint will send him to the same fate.
  • If King Arthur gives the wrong airspeed velocity of an unladen swallow, a CHECK constraint will spell his end.

As a bonus...

  • If you don't count correctly when intoning the ritual, the Holy Hand Grenade of Antioch won't work because of its PRIMARY KEY constraint. (It's probably best to count to 3 via a sequence, and not write a psalm about that special value. In the database world, the ritual is different every time.)
  • If you are a French knight and receive a gift of a Trojan Rabbit, a FOREIGN KEY constraint will keep it out of your fort.

When I first joined Oracle, almost the first thing I had to do was document constraints in 8i, in the Application Developer's Guide - Fundamentals. That was a frustrating experience for me as an Oracle newbie. There was information about "benefits of constraints" -- but who exactly were we trying to convince, or was that just marketing fluff? You could set up constraints with CREATE TABLE or ALTER TABLE -- why one or the other? Who exactly would be issuing all these DDL statements to work with constraints, a developer or a DBA? Here was technology leaps and bounds ahead of the competition, but as a result, there was maybe less depth of understanding about how that feature would be used in the real world.

Years later, now I (think I) know some of the answers, but it's too late to rearrange and rewrite the documentation to my whims. (I make suggestions with the reader comment forms in the 11g library, same as anyone else.)

The "benefits" information is aimed at the crowd who like to put all their error-checking logic in the middle tier. If anything, I'd like to have more simple examples to extol the benefits, particularly for scenarios involving Java client code, to get a feel for best practices for exception handling.

The CREATE TABLE syntax is used by the experienced enterprise types who plan everything out in advance. The ALTER TABLE syntax is used by those who came around to the idea of constraints after their coding was already underway, or who want to experiment before deciding. (I call this the "try it and see" style of development.) I'd like to be able to bridge from one style to the other: here is the ALTER TABLE syntax, and if you decide this is a worthy change, here's the equivalent CREATE TABLE syntax to make in your installation script.

There's at least one counterintuitive case, where careful advance planning requires you to use ALTER TABLE instead of CREATE TABLE. If you want to periodically turn off constraint checking for a primary key, you must create the table without the primary key clause, create an index on the primary key column, and do an ALTER TABLE to add the primary key. The primary key will then use your index, which must not be a UNIQUE index so that later you can "defer" the primary key checking.

The extra keywords in the CREATE TABLE / ALTER TABLE syntax for constraints indicate how dogmatic you are about keeping bad data out of your tables. DEFERRABLE means you may some time in the future allow bad data in the table for a brief period, but you're confident you can fix any problems before committing. INITIALLY DEFERRED means you expect to do this cleanup the very first time you insert data, or that the table already contains bad data and you'll clean it up immediately after creating the constraint. NOVALIDATE means you don't want to test the data that's already in the table -- either you're really confident that it's good, or you don't care if there are problems -- but you do want to check future data that's inserted or updated.

Once upon a time, you would do ALTER SESSION SET CONSTRAINTS = DEFERRED. These days, you could instead use SQL, as in SET CONSTRAINTS ALL DEFERRED or SET CONSTRAINT MY_TABLE_FK DEFERRED.

When/why would you use all of those keywords? I guess it boils down to a combination of convenience and performance. Consider a development scenario, where you insert personal data that has minor problems with the data, perhaps bad capitalization in some of the names. You could run a fixer function as the data goes into the table, either by calling the function in the INSERT statement, coding it into a SQL*Plus datafile, or by setting up a trigger. But (a) each of those is less convenient than just loading the data and then running 1-2 UPDATE statements afterward, and (b) maybe you want to start crunching numbers that don't rely on the textual columns ASAP, so you don't want the cleanup processing to slow down the initial data load.

My favorite analogy here is to XML processing. We've all been given XML documents that were invalid for some ticky-tacky reason... an unquoted attribute like width=100, or an uppercase tag name like <P> in an XHTML document that requires lowercase tags. It's cumbersome to do anything with those documents as XML, the "constraint checking" is so tightly integrated into the parsers. Oracle SQL lets you choose how "constraining" such easily detected/fixed data inconsistencies should be -- fatal error, try again, or don't check until I've cleaned up the data.

The ability to rename constraints is another offshoot of the "try it and see" development methodology. If you create a constraint without giving it a name, it might be years before bit rot sets in and somebody actually tries to insert bad data. At that point, the error message will cite some unintelligible name like SYS_C000123. After some detective work querying the USER_/DBA_/ALL_CONSTRAINTS views, you'll track down that this name refers to some primary key, check constraint, or what have you, and you'll use the ALTER TABLE ... RENAME CONSTRAINT syntax to give the constraint a descriptive name, so future error messages will be easier to understand. Again, I haven't found a clear path in the documentation that leads from troubleshooting examples of ALTER TABLE syntax, back to the equivalent CREATE TABLE examples or syntax so you can plan ahead next time.

As to who actually needs the information about constraints, that's still fuzzy for me. The world is full of application developers who don't take enough advantage of stored procedures, triggers, and constraints. It's also full of DBAs who aren't allowed to poke around with business logic in application code. So anywhere the constraint best practices are documented, it never feels to me like they are reaching quite the right receptive and empowered audience. There's no "Oracle Database Jack-of-All-Trades User's Guide" for the person who fills both developer and DBA roles.

Further reading:

Thursday, October 30, 2008

Local Angles from OOW

TCHO, a San Francisco chocolate company, had a booth. As an East Bay booster, I have to stay loyal to Scharffen Berger, one of whose founders recently died, and Charles Chocolates. Both just down the street from me, both offering factory tours and sometimes free samples.

Enterprising panhandlers stock up on conference swag. Hey, I think I saw one of those guys Wednesday afternoon in Moscone South. When they say Oracle is cheap, I presume that means it's hard to pick up unauthorized swag. True dat. You're more likely to see me in a 10-year-old DB2 t-shirt or IBM Toronto polo shirt than an Oracle logoed top. The employee OOW registration doesn't include the nice bag and reusable water bottle. Last time at HQ I needed to cart around a borrowed laptop, the closest bag at hand was not Oracle logoed and raised eyebrows in the elevator -- "what were you doing at THAT conference?". :-)

Former San Francisco mayor Willie Brown visits OpenWorld. I noticed the wheelie bags for laptops too. Glad to find out they're for ergonomic reasons, not to rush straight from Moscone to the airport with one carry-on.

Sunday, September 28, 2008

Thoughts on Debugging PL/SQL Web Applications

At OOW, I ran into Stephen Feuerstein after seeing him demonstrate Quest Software's "Quest Code Tester" product. Considering how I might use a product like that for testing web-based applications, I suggested a couple of enhancements.

The biggest, most important procedures that I test in PL/SQL are those that generate entire web pages. For that kind of testing, you can't look at whether data has been changed in a table, you have to look at the HTML output of the procedure. In a testing scenario, that output would be in the internal buffer used by the HTP package and the others in the PL/SQL web toolkit.

An important procedure could generate a big web page. For that reason, I'd like to be able to compare at a finer granularity than whether the generated web page matches exactly some version that was stored for testing purposes. I think the ideal technique would be to run a regular expression test over each line of output, and be able to check "does any part of the page match this pattern?". It's that kind of flexibility that's missing in a lot of test environments, e.g. causing anxiety over the prospect of changing some text in an error message it might break some text case that does an exact match on all output, not just looking for the error number.

The contents of a web page could be unpredictable. For example, a page of search results might not be exactly the same after the search index has been refreshed. And Web 2.0-style pages could have random elements like "Tip of the Day" or a list of online friends, Twitter messages, or some set of recently viewed links. Even just personalized text like "Hello John".

In testing, I would like to ignore all those things and just focus on the parts that vary according to the parameters. For example, in a search of Oracle documentation, if the search term is "oracle", I expect that somewhere on the page will be a "Next>>" link. If I pass in the right parameters to retrieve page 1 of results, I expect that nowhere on the page will be a "<<Previous" link. If I pass in a nonsensical search term, I expect that the page will contain a particular message saying there aren't any results. For intentional misspellings, I might want to confirm that the right "Did you mean?" message comes up.

In addition, I might want to test certain invisible properties of the page, like the attributes of meta tags, links to stylesheets, or instead of showing exceptions to the user I'll catch them but embed little coded messages inside HTML comment tags.

Now, of course, someone could write a set of test cases in Perl or Python, even shell scripts, to retrieve URLs using some combination of parameters, then do this scanning of the content. But a PL/SQL-aware tool could be more convenient by hooking into the data dictionary to see the available parameters and the procedure source, which is why I'm intrigued by the Code Tester product.

Of course, any big procedure is composed of little parts. It's those little parts that do things like returning a string in a certain format, computing what number to display on a page, opening a cursor for a query. Those procedures and functions are the easy ones to unit test in an automated way, which is what the demo focused on. If you pass this number in, do you get this number out? If you pass zero, a negative value, a too-high number, do you get back the appropriate result or the expected exception? And so on for strings, collections, and other datatypes.

The wrinkle I throw into that testing scenario is internal procedures and functions. If something is directly callable, it'll be a top-level procedure or function, or inside a package. If it's reusable, chances are high it'll be in a package. But if something is abstracted purely so that my procedure body can be written like pseucode:

prepare_for_action();
run_query();
while results_left_to_process() loop
process_results();
end loop;

then I'll nest those procedures and functions inside the current one. No danger of them being called by accident (or by a malicious attacker, in a web scenario with mod_plsql). No name conflict if I want to use the same step name in a different procedure. They can access all the variables from the current procedure, so I don't need to load down the calls with lots of parameters.

Automated testing for those internal procedures and functions could be tricky. They can't be called directly from unit tests outside the original procedure. Stephen suggested using conditional compilation. The first idea that jumped to my mind is to generate instrumentation code and use some SQL*Plus hackery to embed it in the main procedure:

procedure test_me is
procedure step1 is...
procedure step2 is...
function func1...
-- This source file could be generated by a testing tool.
-- But using @ halfway through a procedure is a SQL*Plus-ism
-- that's doubtless unsupported.
@unit_tests_for_test_me;
begin
-- Conditionally compile this block under non-testing circumstances...
step1();
step2();
-- Conditionally compile this block under testing circumstances...
run_unit_tests();
end;

Saturday, September 27, 2008

HP Oracle Database Machine

Here's the HP Oracle Database Machine talked about in Larry's keynote. On the way out, some audience members said they were drooling over it. I don't know if that's a good idea; didn't see anything about moisture resistance in the tech specs.

I noticed that the box was about the same height as Larry. In the same way we talk about pizza boxes, 5U vs. 10U servers, etc., will we one day measure the form factor of big servers in Larrys?

Thursday, September 11, 2008

The Roads Must Roll

I have this theory. It's a variation on the Sapir-Whorf hypothesis, which says that the language someone speaks influences their thought patterns. That is, if a language has lots of words for XYZ, the people who speak it might be preoccupied with XYZ. Science fiction fans would naturally glom onto the idea that Klingon has lots of words for war and fighting; but that would be the Sapir-Worf hypothesis.

I was reminded of this theory the other day, reading this post comparing the usability of Apple's campus in Cupertino with Microsoft's campus in Bellevue. It seems logical that the founders who planned out a company headquarters would leave some mark of their own personality. And that environment would reinforce the same company culture on future generations of workers, forming a closed loop.

My theory, and I hesitate to name it after myself because maybe I'll have some better theory later, is that software (particularly networking) companies are defined by the roads around their headquarters.

Now I'm from the sparsely populated east coast of Canada. In my home town, there's not much excitement on the roads except once or twice a year, when a regatta or air show backs things up so much that it takes an hour to get out of the parking lots. There's only one major east-west highway across the province, and many communities are only reachable by boat. So it's not surprising that network-oriented companies from Newfoundland concentrate mostly on marine navigation, sonar, radar, etc.

I spent some time in Toronto. Very grid-like structure for the surface streets. Most highways are also essentially east-west or north-south. In fact, take one wrong turn or miss one exit and you'll never make it to your destination. There's also a deeply nested system of lanes on the major '401' highway, with "collector" lanes on the outside and "express" lanes in the middle. But that's a trap! The collector lanes are the ones that flow freely, the express lanes are jammed up with big trucks. Don't get too deeply nested or it'll take forever to get somewhere.

What software technology do we associate with Toronto and area? SGML and XML, via Tim Bray, Yuri Rubinsky (RIP) and SoftQuad. Lots of nesting; error detection but not a lot of error correction.

I've only visited Boston a few times, but I remember driving through the Big Dig tunnel when an ambulance sped by, siren blaring. And a dozen cars were tailgating the ambulance, passing all the drivers who kindly got out of the way. Who's in Boston? Right, Akamai, with big pipes transferring traffic as fast as they can.

Now the Bay Area has some geographical quirks that are reflected in the roadways. You can get on the wrong highway entirely, and still get to your destination. Try to figure out the interconnections between 101 and 280 through San Francisco. Cross the Bay Bridge and you're on 80, 880, and 580 simultaneously. Take 580 east from Oakland -- it's the one that goes south -- and even if you meant to take 880, in 20 miles you'll see a sign saying "880 this way", and everything is OK again. Go south on the east side of the Bay or the west side, and you'll end up in San Jose either way. It's impossible to go through the 92/880N interchange without fantasizing some better way to organize it.

Which explains why Google is so hot for directed acyclic graphs. And why Google maps will doubtless keep getting more and more features revolving around traffic routing.

Wednesday, September 3, 2008

First Thoughts about Google's Chrome Browser

I haven't tried Google's new Chrome browser yet. (Mac version if you please!)

But that doesn't stop me from having opinions. :-) For me, the most interesting aspect is the architecture behind the tabs. This whole issue has been driving me crazy for some time now. All the tabbed browsers have one shortcoming or another. Let's see if Chrome can fix some things.

For example, have you noticed that the best way to quit Firefox is to run 'kill -9 ' from the command line? That's fast. Using the real Quit menu item takes forever. When I want to reboot my Mac, OS X asks Firefox nicely to do its Quit action; but it takes so long that the reboot process times out, every time. Killing the operating system process is instantaneous. It's all the same in the end, all the memory gets released. Yet I trust 'kill' to be more thorough about releasing memory, given all the leaks and fragmentation while Firefox is running. Also, Firefox is more reliable about offering to restore all the closed tabs upon restart, if it was killed rather than Quit. Usually I get 2 offers to restore the tabs, from Firefox and from SessionSaver. It's a complicated dance to know which one to accept. Skip the first one, maybe the second one won't happen this time because a Firefox upgrade has made certain extensions incompatible, or because you don't get the same offer after a Quit. So for me it's kill kill kill.

I'm typically restoring ~100 tabs whenever I restart Firefox (or Safari for that matter). What does that mean? No performance or responsiveness until they're pretty much all loaded. Is it too much to ask to devote more attention to the frontmost tab in each window, and particularly the active window that I'm reading? Page down, right click, hello...? Safari tries to do this a little, for example when it defers loading things like Flash until you actually switch to that tab. The worst is when a tab has some auto-play movie or ad with audio; you hear it but you can't find the right tab to stop it. Sometimes half a dozen tabs try to play audio at once, producing complete babble.

So, it makes sense to favor some tabs over others. But which ones? Like I say, usually the frontmost tab deserves the most CPU. But just because a tab is deselected doesn't mean it shouldn't run. I want this background tab to keep running because it's playing a teaser ad before a video clip, and I'm reading something else until the ad is finished. I want that tab to get little to no CPU, because it's just playing ads with animated GIFs and sending AJAX requests for more ads in an endless loop. I'm hoping that Chrome comes with the equivalent of "Task Manager for tabs", where I can freeze certain ones, maybe blacklist sites so their tabs always freeze when backgrounded, see which tabs have gone nuts with memory and CPU, maybe clamp an upper limit on memory use for some tabs.

The question is, will Chrome simplify or complicate my browsing equation, which currently goes like this:

- Playing a Flash-based game on Facebook? Use Safari because Flash in Firefox tends to stall for a few seconds every now and then. No good for 3-minute rounds of Scramble on Facebook.

- Reading Slashdot? Use Firefox, because Safari has a Javascript bug that causes crashes sometimes when closing or even just leaving a page on that site. It's unpredictable which page, but happens consistently for the same page. Because I occasionally slip up, I usually have half a dozen old Slashdot tabs that I have to leave open until I'm done with every other single tab and can let the browser crash without consequence. Since Chrome uses the same Webkit renderer as Safari, that's something for their team to take note.

- Using Windows? Stay away from Firefox once it starts to get slow; switch to Safari or Opera instead. When doing Ctrl-click to open links in new tabs, sometimes there's a wait of 30 seconds or more between clicking the link and when the click is recognized. In the meantime, you have to sit there like a fool with your finger on the Ctrl key; otherwise, when the click is finally processed, the browser will forget it was Ctrl-click and will replace the current window. I don't know if this sluggishness in registering meta keys is something inherent to Windows event processing, a problem in Firefox coding, or a combination (since I don't see the same effect on OS X).

- Doing research opening several related sites in adjacent tabs? Use Firefox because it has better tools for bookmarking a group of tabs at once. I hear there's a plugin for Safari that offers the same feature, but I'm keeping Safari plugin-free to avoid performance issues. Also, that particular plugin appears to be several years old.

- Using some business application like for net meetings, that only runs on IE? Well, guess I have to run IE. Using Remote Desktop connection to a Windows machine. Not exactly the most efficient use of network bandwidth to have animation and audio streamed to a remote computer and then streamed again to a local one, but hey.

- Firefox grinds to a halt because its virtual memory size exceeds the amount of RAM on my computer? Start up Safari. One browser can apparently be in its death throes, while the machine still has plenty of CPU and real memory to run a different one at full speed.

- Debugging some web code? Run Firefox for the Firebug debugger. Curse various sites that load up the error log with tons of Javascript warnings and errors from background tabs. Often, I can't tell which background tabs are causing the continuous stream of errors. Even if I can recognize which site is misbehaving, I can't necessarily find the relevant tab to shut it down. (That would be a nice addition to Firebug.)

- Reading ad-heavy sites? Use Firefox for its combination of ad-blocking and Flash-blocking plugins.

- Reading certain other sites, like the Dilbert comic strip? Use Safari. Sometimes all the ad- and Flash-blocking prevents the desired content from showing up. Even with whitelisting and the ability to selectively load Flash, I can't read the Flash-based Dilbert strips anymore in Firefox.

- Switching a lot between computers? Use Safari. Its bare-bones bookmarking features mean I don't even bother with bookmarks for it, preferring Del.icio.us instead. I will use Firefox to bookmark groups of tabs, but it's too much bother to try and synchronize them between machines. Also, I find that the Firefox auto-update system and compatibility checking for plugins means that at any one time, several machines that from my perspective should all be on the same level of Firefox, have different combinations of disabled plugins that I can never predict. One machine will check for plugin updates and re-enable something, while a different machine will report no compatible plugins found. (Might be a Mac vs. PC thing, but why would that make a difference for Firefox plugins?)

Sunday, August 24, 2008

I'm Not Dead Yet...

This piece on the always-dying-but-never-quite-dead mainframes led me down memory lane:

Interop Systems - The Open Enterprise - The mainframe isn’t dead after all

My earliest full-time job was at IBM, developing and publishing in a mainframe environment on VM/CMS. (TOROLAB4 to be precise.) Coming from a UNIX background at university, I knew there was something better. Then IBM tried moving off mainframes to OS/2. Then Windows ate OS/2's lunch. The last time I had a pleasant thought about Windows, it was Windows 95. Now I think the mainframe wasn't so bad after all.

Let's look at some of the things that were OK in the mainframe, things that have gone downhill since then and the industry is still trying to recapture.

The storage was amazingly fast. The job queue system could take a snapshot of your entire user account in a couple of seconds, after which you could continue working on your hundreds of interlinked source files without worrying that ongoing edits would mess up the build. You might make some fixes, submit another 1/2-hour job, rinse and repeat several times while the first job was being processed. When the output files started arriving in the inbox, you would just throw out all but the ones from the last build. Today, you might have a dozen terminal windows open at once, but that's little consolation if all of them are tied up packing or transferring files, and you have to hold off on changes until that operation is finished. Those little progress indicators in sftp/scp are hypnotic. (Don't forget that scp is faster than sftp.)

Storage was also somewhat abstracted from us as users. Need more? Sure, you just have to request it. We were never sure how much it really was, as it was all in terms of cylinders. A newbie would get 10 cylinders; for a big project, maybe you'd need 40. These days, I spend far to much time running "du -k" and "df -k" -- and that's just on my home machines! I remember in the early days of DB2 on workstations, users didn't like that the performance wizard asked for the RPMs of their hard drive, that was considered arcane information; these days, any tech-savvy 12-year-old knows how many RPMs in their desktop, laptop, and MP3 player.

In the early days of AIX, they carried over some of that abstraction. You could have tons of free space on a drive, yet all filesystems showed as full. You could add space from that free pool to any filesystem, it was just a deliberate decision that couldn't be undone, so the temptation was to only allocate as much as needed. Contrast that with the need to repartition on typical PC-scale OSes.

Having a small set of storage areas (equivalent to C:, D:, etc. drives) without any subdirectories was a pain, but enforced discipline when it came to naming conventions. Today, there's plenty of wasted time trying to track down duplicate DLLs in an execution path, or arrange Java class files in a 5-level-deep directory tree. In my experience, whenever people are given the chance to make something into a hierarchy, they'll always use 1 or 2 levels too many.

Editing on VM/CMS was pretty cool too. I was never an ISPF guy, that was for the MVS crowd; for me, XEDIT was the bomb. I was heavily into automating the hide/show settings for ranges of lines, which was much more productive than a standard tree view in today's editors. The integrated REXX language allowed infinite customization and automation for editing chores.

Strangely, when REXX came to the Amiga in the form of ARexx, it didn't interest me the same way. And Applescript, don't get me started. The major mainframe software packages that could be automated through REXX just had a more interesting or extensive set of actions.

One thing you could always do was customize the function keys. That's why they were called programmable function keys. "set pf3 = save" etc. Very convenient to have 24 custom actions available in any app. (I think you could set up both regular and shifted function keys.) The biggest problem was, too much freedom! With all those custom function keys, no standardization, not enough of an assembly line feel. Thus the rise of GUI standards in the form of Common User Access (CUA). So things everybody knows now for usability and accessibility, like "F1 = help", "F10 = activate menu bar", "tab between all controls", got their start on the mainframe.

That must have been the apex of function key usage. X Windows never seemed to do much with 'em. These days, outside of of course Photoshop, the only function key I use much is F8 (on Windows) to recall commands in a DOS window, or on OS X to activate features like Expose and Spaces. Since F8 is also the default to bring up the Spaces display, I can't even rely on that when remotely connecting to Windows from OS X!

The publishing software on VM/CMS, BookMaster, deserves its own whole blog post so I won't go into that here. Let's just say that SGML clearly suffers from the "second system" effect when compared with BookMaster, the same way XML could be said to suffer the second system effect when compared with SGML. (I.e. does a lot more, but is needlessly complicated and overstuffed with features.)

The one aspect of BookMaster that I've never seen replicated anywhere else, is its tag-based language for doing Gantt charts. Say what you like about the waterfall model versus other forms of software development. Any good projects run out of IBM in the '90s were because the project schedules were all managed via text files printed out as Gantt charts. I had my own REXX-based extension to that tag language too, with shortcuts for moving any combination of planned, actual, start, and end dates, or confirming when activities started on time, late, or (very occasionally) early. When IBM moved us all off the mainframe, and people started investigating other options like MS Project, that was the first time I thought someone had taken an activity that you used to be able to assign to a summer student, and turned it into something for which you'd need a PhD in a related field.

If I could recapture the best points of the mainframe, I'd use the Hercules emulator, the THE editor or one of its relatives, and write stuff in BookMaster and then downconvert to Docbook, DITA, or what have you. Unfortunately, although IBM released the B2H HTML converter, I've never encountered any free BookMaster workalike that (a) would do the nice hardcopy formatting, since this was pre-PDF days, or (b) included Gantt charting capability.

IBM has opened up a lot of former internal packages from the mainframe days. However, my 2 big contributions, LOOKIT and SCHEDIT, are still locked up because I don't have a current IBM employee who could be designated as the owner. So if you are at IBM and still have access to the old VMTOOLS and TXTTOOLS repositories, and wouldn't mind being listed as the contact for those packages, please let me know.

Saturday, August 16, 2008

How PL/SQL Could Be More Like Python

In previous posts, I suggested a couple of syntactic or thematic items from Perl and PHP that I thought would make PL/SQL friendlier and more accessible. Hey, maybe Python has some things too!

One thing to like about Python is the way its composite data structures build on top of each other. A list or a tuple can hold a bunch of numbers or strings in an array-like structure, but those things can also hold heterogeneous data items like an assortment of numbers, strings, other lists, other tuples, etc.

Another powerful idea is that a composite data structure can be used as the lookup key for a hash table / associative array / Python dictionary.

Finally, complex data structures with different numbers of elements can be passed as parameters or function return values, without the need to declare explicit types.

Underlying all this functionality, at least from a user's perspective, is that these complex data structures can be represented nicely as strings. Here's a little interactive Python output showing some list and tuple values. The >>> lines are what I entered, the other lines are how Python mangles the escape and quote characters for output.

>>> ['i do not care']
['i do not care']
>>> ['i don\'t care']
["i don't care"]
>>> ['i don\'t care about "that"']
['i don\'t care about "that"']
>>> ('i','don\'t','care')
('i', "don't", 'care')
>>> [ 1, 2, ("a", "b", ["x", 3] ) ]
[1, 2, ('a', 'b', ['x', 3])]

So we can see that there is a fairly well-defined way of making these data types into strings.

PL/SQL gained a lot of power when it got the ability to use string values as keys for INDEX-BY tables, which soon became known as associative arrays to be more familiar to people who used such things in other languages. It would be even more powerful if you could use VARRAYs, nested tables, etc. as lookup keys. You could solve problems that require a higher order of thinking than just "have I already encountered this integer" or "I will assign the symbolic name 'XYZ' to such-and-such a DATE value". You could say, "I know my web site is under attack when the counter corresponding to this combination of IP address, URL, and time interval is greater than 100".

PL/SQL can express something like this already, using nested collections. But it requires a lot of type definitions, which aren't readily reusable across different procedures without a lot of advance planning. There is a point where foresight crosses the threshold into rigidity. When dealing with in-memory data structures, I like the flexible approach of Python.

I don't think this kind of processing really requires much, if any, change in PL/SQL itself. All you would need are functions that could turn a VARRAY, nested table, etc. into a single string as in the examples above, so the string could be used as a key into an associative array. And the contents of the associative array element could be another string that represented a tuple or other data type that isn't directly supported by PL/SQL.

With a set of functions to go back and forth between strings and built-in types, you could emulate some features that are limited now by PL/SQL's type strictness. For example, it's difficult to write a function that accepts a VARRAY with 5 elements, and returns a modified VARRAY with 4, or 6, or 3 elements. So certain classes of recursive operations are not practical.

If the data structures were all passed around as strings, and unpacked into associative arrays behind the scenes, that would be powerful but slow. Perhaps the best compromise is to use a small number of uniform associative array types at the top level:

type python_dictionary_t is table of varchar2(32767) index by varchar2(32767);
type python_list_t is table of varchar2(32767) index by pls_integer;

Each string element could encode a single value, a list, a tuple, a dictionary, or whatever. But the basic datum being passed around could be one of these user-defined types.

Notice that a list in this scheme has numeric indices, yet isn't fixed size like a VARRAY. You could pass to a function a "list" of 4 elements with indices 0..3, and get back another "list" of 5, 1, 10, or whatever elements. There's nothing intrinsically forcing the index values to be sequential, that would have to be enforced behind the scenes.

Imagining how all this could be coded up inside the existing PL/SQL language, it occurs to me that the part that would be a real performance roadblock would be reassigning all the numeric index values when doing list operations like pop, push, etc. Element 0 becomes element 1, element 1 becomes element 2, element N becomes element N+1, or vice versa. So perhaps the biggest change needed to PL/SQL itself is a fast way of changing the key for an associative array element, without actually copying the element itself under a new key. Something like:

element.key(i) := j;

There could be a performance or memory hit when stringifying entire complex datatypes to use as keys for hash tables, if every complex value was passed through TO_CHAR() while evaluating it as a key. Perhaps the right approach is another built-in function like TO_KEY(), which could take something like a varray of gigantic VARCHAR2s, or an object type with member functions, and create a composite key with a bunch of MD5 or similar checksums, instead of the entire source text of the strings or member functions.

Friday, August 1, 2008

Ruminations on Metadata

Metadata is a great subject for web development and publishing, but when it comes to music, that's more a subject for my more personal blog. Thoughts here.

Monday, July 14, 2008

Feeling Researchy

For some reason, lately I'm feeling a mood I can only describe as "researchy". Related trivia that caught my eye lately... I ran across a research group at UC Berkeley and attended one of their public seminars; found that former co-worker Kevin Stoodley from the IBM Toronto Lab is an IBM fellow; and reminisced over some history on IBM's research group, with extra acronyms thrown in.

Saturday, July 12, 2008

Javascript Performance Considered as a Helix of Semi-Precious Hash Tables

I had a discussion about Javascript performance the other day, which reminded me that the more things change, the more they stay the same. Some of the things casual Javascript programmers are grappling with today, were big research issues for IBM's
TOBEY compiler technology a few years ago. (And maybe still, for all I know.)

Let's start with the basics. Whenever you're referencing objects in Javascript, you're getting pointers, not deep copies. So:

var x = a.b.c.d.e;
x.something = 0;
x.something_else = 'foo';

is actually modifying attributes of the original a.b.c.d.e object.

Intuitively, you can see that you're saving bytes in the program, which means less time to transmit the script file, less time to parse, and saving space in the browser cache which could avoid re-fetching some page, image, or other resource later. But there are performance implications at a deeper level too.

Javascript's object orientation is like Perl's, based all on hash tables, except it doesn't hit you over the head with that fact. So x.y.z involves looking up the key 'y' in x's hash table, which returns another hash table, in which you look up the key 'z'.

Array notation and object notation are effectively interchangeable. x.foo works the same as x['foo']. x[0], x[1], x[2], etc. are not necessarily contiguous in memory, they're just entries in a hash table whose keys happen to be integers. Looping through an array involves a hash table lookup each time, not just incrementing a pointer.

The scoping and closure features in Javascript, plus its interpreted nature, mean it has to do more lookups to figure out how to resolve each variable reference. For example:

function foo()
{
var a, x;

function bar()
{
var a, y;

for (...)
{
var a, z;

a = 1;
// a is found in current scope, requiring just one hash table lookup.
x = 'hello';
// x is not found in current scope;
// x is not found in parent scope;
// x _is_ found in grandparent scope.
// Meaning 3 hash table lookups to figure out where x is.

When all these slight variations in efficiency get put inside loops or frequently called functions, that's when the performance can start to drag. That's why you see people doing things like:

for (var i = 0, limit = some_object.length; i < limit; i++) ...

If i is declared outside the loop, each reference to it to test or increment means having to find it in some outer scope. If some_object.length is referenced in the loop bounds test, each time through the loop has to locate some_object in the right scope, plus do another hash table lookup to retrieve the length property. The length property could be modified by code inside the loop, so even though you consider some_object.length to be a constant value, Javascript can't assume the same.

Getting back to the notion of our first example, when you store a pointer to some deeply nested object and refer directly to that object, multiple times:

var it = x.y.z.style;
it.marginTop = "0em";
it.marginBottom = "1em";
it.marginLeft = "1em";
it.marginRight = "2em";

you're saving all those lookups, both for scope and traversing the hash tables of object members, for each reference.

At the DOM level, you've got a property element.childNodes and a method element.hasChildNodes(). Seems intuitive that checking for the existence of child nodes should be faster than returning the actual nodes, right?

Remember, childNodes is just giving you a pointer to a data structure that is being maintained continuously, not making a copy of anything, or assembling the structure on-demand. So the overhead in each case is constant. That data structure could be empty, which is why the has*() method could be useful. There's a distinction sometimes between "this object exists" and "this object contains something useful", so you wind up doing things like:

els = document.getElementsByTagName('a');
if (els && els.length > 0) ...
// Must test that the assignment succeeded,
// _and_ that what came back wasn't an empty structure

Saturday, June 21, 2008

For Want of a %s...

Today's post involves a three-way grudge match among usability, security, and inertia.

One of the basic tenets of usability is to answer user questions clearly and concisely. One concrete technique is to use placeholder strings liberally in messages (typically coded in C-like languages as "%s" for strings or "%d" for numbers).

This notion pops into my head whenever I'm doing anything significant in SQL*Plus. Here's a condensed excerpt from a typical log file of mine. The questions that run through my mind aren't that deep -- what table is being updated? What index, package, procedure, etc. is being created? It seems redundant for me to have to put that information into PROMPT commands.

34094 rows updated.
Commit complete.
Index created.
Table truncated.
Package created.
Package body created.
No errors.
Package created.
No errors.
Function created.
No errors.
Procedure created.
No errors.

I never pushed too hard though to get a change like this into SQL*Plus, because just imagine how many test cases and automated builds expect to see some exact sequence of output that would suddenly be different. I expect it would require some sort of SET option within SQL*Plus, which in turn would involve new documentation, a long adoption cycle, scripts that used the new SET option would throw errors for older versions of SQL*Plus... all of a sudden it's not so clean.

Another possibility for this usability technique is the set of database error messages. How many messages say that some limit has been exceeded, or there's a problem with some object, where the erroneous thing that was encountered, or the correct thing that was expected, is never specified?

Again, test cases are difficult to change. Extensive error messages are also tricky for translation, because maybe the placeholder needs to be moved around because of different sentence structures in different languages. Maybe the value of the placeholder causes a ripple effect elsewhere in the message, as with the "le / la" distinction in French depending on whether a noun is considered masculine or feminine.

The lesson here is that these little touches are important to bake in from the beginning.

Security is another consideration that wasn't so obvious when the usability bibles were being written. If someone can induce an application to print a < character where one isn't expected, and the output is written to a web page, presto an attacker can construct a string to be echoed in an error message with a <script>, <iframe>, or other tag that can change the contents of the page as part of a cross-site scripting attack. So there's an extra level of scrutiny for every message string, to ensure that it doesn't have disallowed characters, that the substitution text isn't too long, and so on.

Saturday, June 7, 2008

The Goals of a Computer Science Education

I'm far enough along in my UC Berkeley Comp Sci refresher podcasts -- 61A (Structure and Interpretation of Computer Programs), 61B (Data Structures), and most of 162 (Operating Systems) -- that it feels like time to make some blog posts about computer science in general.

After all, one of the perennial topics for argum^H^H^H^H^H discussion is whether it's better to hire a Comp Sci major, a naturally talented coder, or someone with a lot of vocational training and experience. What anyway are the goals of a Comp Sci degree?

IBM has a multidimensional measure of software quality and/or customer satisfaction called CUPRIMD [that link is a PDF file]. Standing for Capability, Usability, Performance, Reliability, Installability, Maintainability, and Documentation. Never one to leave well enough alone, IBM later extended this acronym to CUPRIMDSO by adding measurements for Support and Overall. I contend that the goal of computer science is really all about the CPR in that acronym: capability, performance, and reliability.

For example, CS 61A focuses on capability (how do you write a program to accomplish X?) and reliability (what practices are needed to ensure the answer is right and cover all the possible inputs?). CS 61B focuses on capability (how do you sort, find the highest/lowest value, or transform a data structure?) and performance (is it faster to sort this way or that, what are the best and worst cases to do such-and-such a transformation on this type of data structure?). And CS 162 focuses on reliability (how do we keep a process from writing over another's memory, how do we avoid deadlock?) and performance (how do we keep address translation, resource tracking, et al from grinding the system to a halt?).

I know that all the while, students are exhorted about documentation ("write good comments", "learn Javadoc notation"), and later on, depending on the institution, there might be optional branches off into specialties for technical writing and usability. But those are a bit squishy for the core Comp Sci goals. The best technical writers put their deep knowledge aside to write things that are mostly correct and comprehensive, in words that an eight-grader can understand. The questions "is Ruby on Rails usable?" and "is this ATM touchscreen usable?" lead off in very different directions.

The goal of teaching someone who can make the most reliable, fastest, and fullest-featured program or system runs into some conundrums in the real world.

Today, if you want to enable your company to do something it couldn't do before, you might do a "View Source" on a web page to see how to do a certain effect in CSS, or adopt a new language or code library that solves your toughest problem in one line of code. So capability is not really such a distinguishing factor between people with higher education, natural skill, or on-the-job training.

That leaves us with performance and reliability. That's where the real paradoxes come in.

You might call in someone with a CS degree, who turns an O(N**3) solution into O(N  log N) and so saves untold hours of waiting. Or they might reduce memory requirements so that the underpowered machine devoted to an application suddenly runs like a champ. Or they might network together several servers to turn formerly idle time into productive work.

On the other hand, if you have something that works most of the time but occasionally gives a wrong answer or hangs, the Comp Sci solution may add quite a lot of complexity and overhead to make things 100% reliable. UCB's operating systems course really puts that into perspective, taking me down memory lane from the Commodore 64 days of gleefully smashing the stack, to Motorola 68000 and the Amiga 1000 where there was no paging or swapping and we dreamed of the 68020 with MMU, to the world of today with robust multitasking and memory management all accelerated by hardware. Not a day goes by that I don't say to myself, things were a lot faster in the '80s. :-)

I see echoes of these CS goals in the "tough tech interview"-style word problems that people post from time to time. (C'mon, admit it, you've looked at them.) Hey, this one is really recursive unwinding, that one's a merge sort, I'll bet the clock algorithm would work here, and so on. For the most part, all drawn from those early fundamental CS courses. As an employer, you want your recent grads to really have internalized those lessons, rather than deciding that this theory stuff isn't important in the real world. And you want your industry hires to not have completely forgotten the basics after years of dealing with COTS ("commercial off-the-shelf") software and office politics.

Tuesday, June 3, 2008

How PL/SQL Could Be More Like PHP

This is really a trick subject line -- PL/SQL already has the capability to be very much like PHP. PL/SQL Server Pages lets you write template HTML files including PL/SQL parameters, expressions, and blocks, and turn them into stored procedures that produce HTML output using HTP.P() procedure calls. The angle-bracket syntax around the expressions and blocks is very similar to that of PHP. And a PL/SQL coder has a headstart when it comes to optimizing DB performance and avoiding security problems.

PL/SQL Server Pages was the first web-related feature that I documented at Oracle. When I started using it myself, eating my own dogfood so to speak, that's when the potential functionality enhancements sprang to mind.

PHP is so popular because it's all based around text files, you can modify your source with any kind of HTML tool, and you don't have to do anything special to get the code to run, just put the files in the right places.

PL/SQL Server pages are also based around text files, you can also run them through HTML Tidy, and edit them in DreamWeaver or what have you. However, to use them requires the loadpsp command. With this command, you specify the user ID and password on the command line. That's not the ideal mode of loading code if you normally sit at the SQL*Plus command prompt, already logged in, and load new/changed stored procedures via '@script_name'.

With PHP, the way to load code into different parts of the application structure is to put the files in different subdirectories. The equivalent for PL/SQL Server Pages would be to use a template file to generate a packaged subprogram, an internal procedure, or a method for an object type. But, to my knowledge, loadpsp can only create a top-level procedure, which best practice says you should keep to a minimum.

Now, all of this could be worked around if loadpsp could generate the source code for the procedure and stop there, leaving the file behind instead of loading it into the database. You could set up a makefile to transform each .psp file into a .sql file. You could load the .sql files into the database from inside SQL*Plus without doing a separate login each time. You could set up a script using a SQL*Plus HOST command to go through both steps. Perhaps that would incite the SQL*Plus dev team to create a new directive similar to '@' that would run loadpsp and then '@' the resulting output file. If the output file could start with 'PROCEDURE ... IS', leaving off the initial 'CREATE OR REPLACE', you could use my favorite trick of:

package body foo is
@procedure1;
@procedure2;
...

to load PSP-generated pages as packaged procedures or object methods.

I never quite bought into the idea that with PHP, coders and designers would happily edit the same files, passing them back and forth and limiting their edits to either the HTML tags or the code blocks depending on their role. In my experience, every expert tends to have ingrained assumptions that don't work well with top-down planning. The designer tasked with making a more attractive-looking page will go off and write HTML that's structured very differently from the existing page, that PL/SQL blocks and widgets can't be neatly dropped into. The PL/SQL coder tasked with adding some dynamic elements to the page will chafe at working inside markup produced by some HTML authoring tool they disdain; sometimes, it's too painful even to read past the DOCTYPE tag. :-)

Thursday, May 29, 2008

How PL/SQL Could Be More Like Perl

An idea has been rattling around in my head for a while now. I know, I know, there's a lot of empty space so it makes a lot of noise. :-) Could PL/SQL become more popular and easier to use by taking a couple of points from the Perl playbook?

I read a blog post about how scripting languages are or aren't gaining on Java. I think supporting curly-brace syntax would be a nifty way to encourage PL/SQL adoption among both scripters and Java coders. But maybe that's just me.

My first impressions of Ada and then PL/SQL were somewhat negative: hey, this verbose BEGIN/END, IF...THEN...END IF syntax reminds me of Pascal. I first programmed in Pascal on the Commodore 64, and it was deadly slow to compile and run, owing to the whole UCSD P-Code business. Then at university, it was the first language for coursework, and even though it ran really fast on UNIX boxes, I always felt that behind the scenes, more deadly slowness was occurring.

These days, 95% of the languages I code in use curly braces for blocks:

if ... { ... }
function ... { ... }
for ... { ... }

Imagine if PL/SQL did the same. Suddenly, brace matching in vi/vim would work. (Hit % while cursor is on a brace character to find the matching brace at the other end of the block.) It would be unlikely to introduce logic errors; in my experience, mis-nesting of blocks is not a frequent source of errors in Perl or Java, and in deeply nested PL/SQL, even today a careless coder can just add END IF statements until the syntax error goes away. I sometimes write little comments after the END IF, END LOOP, etc. to remind myself of which block each one closes.

On the documentation end, the PL/SQL User's Guide and Reference, aka PL/SQL Language Reference in 11g, has never devoted reference topics specifically to "BEGIN Statement", "END Statement", "END IF Statement", etc. Instead there are generic topics like "Block Declaration", or a section on "LOOP Statements" that doesn't split out the "END..." part as a separate item. That's always made it hard for me to write auto-linking pretty printers for PL/SQL (i.e. no "BEGIN" topic to link from the "BEGIN" keyword), but it would make syntax enhancements slot right into the docs.

From a parsing standpoint, I think it's pretty straightforward. If you leave out a THEN or put an END IF where an END LOOP is expected, PL/SQL gives you a straightforward error message. It knows what block is open and what the expected closing symbol is. So I imagine that it's practical to parse syntax like:

procedure foo is
x number;
{
if 1 > 0
{
for bar in (select * from t1)
{
x := 3.1;
declare
x varchar2(64);
{
x := 'hello';
}
}
}
}


{ takes the place of THEN, LOOP, and BEGIN. If a { is seen where one of those tokens is expected, treat it like the equivalent block beginning keyword. When a } is seen, treat it like whatever block closing keyword is valid at that spot. When a { is seen with nothing special preceding it, treat it like a nested BEGIN.

When I think about it more deeply, I see potential problems around ELSE and WHEN clauses, where those keywords act as both the end of one block and the beginning of another, but anyway this whole idea is probably just a fantasy on my part.

One other Perlism that really struck home lately. Perl lets you get away with a comma after the last item within an array-style declaration. This seems like a minor point, but when writing programs that generate other programs (or anyway JSON-style data structures), it is sometimes difficult to give special treatment to the first or last data item. For example, with SQL or PL/SQL it's simple to write out 1000 INSERT statements, each ending with a semicolon. But let's say you have an XSLT transform with a FOR-EACH loop that produces an initialization statement for a data structure. The output goes

'item1',
'item2',
'item3',
...
'itemN',

and then you're left with that final trailing comma that you need to get rid of somehow. I'll often put the comma at the beginning of each line for just that reason, and to allow easier reordering of parameters and such. But then the first item is a special case. Think of how many times you have to introduce a Boolean variable to make the output slightly different for the first or last pass through a loop.

I used to think Perl's optional no-op trailing comma was just a flippant design choice, but now I think it was a very shrewd way to make dynamic programming easier. One that some other languages could emulate.

Tuesday, May 27, 2008

Doubleplusgood vi Shortcut for PL/SQL Coders

One thing that bugs me (no pun intended) about PL/SQL syntax is the lack of the ++ and similar operators. If you use long variable names, you're penalized by having to type them twice in a simple increment operation:

rows_processed := rows_processed + 1;

I use vim, a cross-platform vi clone, for editing PL/SQL code. vim has a lot of extensions that the old-school vi user might never realize. One such is Ctrl-P when in insert mode. It looks backward ("previous") for the last matching word, and fills it in. For example, you could write the line above by entering:

rows_processed := <Ctrl-P>

vim fills in the variable name again, then you continue entering "+ 1;".

If the variable name you want is not the very last word, you can enter one or more letters first and vim will fill in the last word that starts with that prefix. For example:

toggle_switch := initial_value - t<Ctrl-P>;

produces:

toggle_switch := initial_value - toggle_switch;

The matching isn't limited to the current line, so you could fill in variable, procedure, etc. names from much earlier in the file.

Vim has a whole set of such extensions. Some of them like 'ab' are familiar from the original vi days, but you can boost your productivity by brushing up on the newer techniques.

Wednesday, May 14, 2008

Query vs. CREATE TABLE AS SELECT

Half the time, I want to know some answer that can be determined with a query. The other half of the time, I want to start from that answer and use it as an intermediate step towards some larger solution. (That's not even considering the third half... :-)

So, I have a bunch of .sql files that do SELECT statements. Sometimes, I end up editing those .sql files into new ones that do CREATE TABLE AS SELECT... using the same query. I don't like the proliferation of .sql files that are slight variations of each other. That's the aspect of SQL coding that I would most like to see be object-oriented, the many SQL scripts that differ from each other only in terms of connection string, table name, or other minor detail.

Now, I know the WITH clause is there for pulling out a query to reuse its results as part of a bigger query. But sometimes the natural-feeling technique is to do CREATE TABLE AS SELECT colname... and then several other queries with clauses WHERE colname IN SELECT * FROM newly_created_table.

I had a sense there must be some shortcut to turning the original query into a CTAS construct. And there is! Remember how SQL*Plus will happily substitute the contents of a script anywhere you start a line with an '@' sign? Well, turns out you can use this technique to substitute a script as the last half of a SQL statement:

- Create a .sql file that has a single SELECT statement.
- Issue a CTAS statement like so:

create table t1 as
@script_with_select_statement

The trick is to make sure the SQL script has no blank lines, as they will cause SQL*Plus to stop processing the command prematurely. '--' comments are OK, so you can comment and format the script using -- lines where you would normally use blank lines.

CTAS is a bit of a blunt instrument, especially if the same situation comes up again and again. I may adapt this technique to use with CREATE GLOBAL TEMPORARY TABLE t1... and INSERT INTO t1 SELECT... constructs.