Wednesday, December 31, 2008

Oracle PeopleSoft hosted docs

Just in time for the new year! The Oracle PeopleSoft group now has its docs on Oracle.com in HTML format, hooked up to a Tahiti search:

Oracle PeopleSoft Enterprise Hosted PeopleBooks

Sunday, December 21, 2008

Coding Horror: Hardware is Cheap, Programmers are Expensive


You've probably heard both sides of this argument: throw hardware at performance problems, no no, improve the code. Jeff Atwood at Coding Horror comes down on the "more hardware" side in this post:

Coding Horror: Hardware is Cheap, Programmers are Expensive

Usually I agree with Jeff, but I don't agree with his thesis here. The commenters on this page present some convincing counterarguments and examples showing that tuning or modifying algorithms can be very cost-effective.

But sometimes I do like to throw hardware at a performance problem. Wait, haven't I just contradicted myself?

It's all in the context and the human elements. Let's consider the possibilities:

  1. You're in charge, you make the decision. Obviously, you'll consider all the pros and cons and make the right choice. Sometimes the right choice for return on investment is to beef up the hardware, sometimes it's to tune the code or change the algorithm.

  2. Someone else is making the decision, and they're just like you, or they always listen to your good advice. Again, sometimes the choice will go one way, sometimes another, but probably the right path will present itself.

  3. Someone else is making the decision, and they always prefer one approach or the other. Sometimes they'll make the right choice, sometimes the wrong one. Plenty of examples of both in the article comments.

  4. Someone else is making the decision, and they are open to either approach, but they don't have enough information to make an informed decision. For example, they may not know the price/performance of available hardware, or may not know the code base or the team's coding skills well enough to estimate the potential for improvement vs. the time involved.


If all situations were equally likely, we could say don't worry, a majority of the time it'll turn out OK. In the real world, scenarios 3 and 4 occur more often than you might like. You'll need to have strong arguments for "more hardware" and "more tuning" at the ready at all times, not just have one dictum for all situations.

Thursday, December 18, 2008

The Humble COUNT( ) Function



Here's another ode to a small but fundamental aspect of Oracle, following the same theme as The Humble IF Statement. This time, let's look at the COUNT( ) function. I think when you look at it the right way, it opens up the whole story about database performance.

What's the first thing you do when poking around an unfamiliar system? I'll bet it involves SELECT COUNT(*) queries in one way or another. SELECT COUNT(*) FROM USER_OBJECTS to see how much "stuff" is in the schema. SELECT COUNT(*) FROM data_table to see how much data is around. SELECT COUNT(*) FROM audit_table to see how much usage the application gets. You can run SELECT COUNT(*) on various other data dictionary views, and with some WHERE clauses on regular tables, to answer questions like "how many...?", "are there any...?", and "confirm there are no...".
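
For instance, a first pass over an unfamiliar schema might look something like this (user_objects is real; the other table and column names are made up, so substitute whatever the application actually uses):

SELECT COUNT(*) FROM user_objects;                        -- how much "stuff" is in the schema
SELECT COUNT(*) FROM data_table;                          -- how much data is around
SELECT COUNT(*) FROM audit_table;                         -- how much usage the application gets
SELECT COUNT(*) FROM audit_table WHERE action = 'LOGIN';  -- "how many...?" / "are there any...?"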

You can also do SELECT COUNT(column_name) FROM table_name. That brings back a count of the number of rows that have non-null values in that column. What's that for?

  1. It demonstrates that you at least know something about the table, the name of a column. That's a small stake in the ground when you're establishing your credibility with experienced DBAs and developers.

  2. It suggests that you'll probably want to use that column in a WHERE clause test, later in a bigger query. If you use that column in an =, >=, BETWEEN, LIKE, etc. test, that comparison can only match non-null values. COUNT(column_name) represents an upper limit on the rows that could come back from such a query.

  3. It demonstrates that you're thinking on a higher plane. Not "how many physical rows total in this table", rather "how many significant rows with data in them". If you're a DBA trying to reclaim space, COUNT(*) might make more sense. If you're a developer tuning query performance, COUNT(column_name) is probably the right one to use.

  4. The COUNT( ) query has a better chance of using an index if you put in a column name rather than *. COUNT(*) can take a long, long time on tables with millions of rows. (There's a small example of both right after this list.)
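
To make points 2 and 4 concrete, here's a small sketch; emp and commission_pct are just stand-ins for any table with a mostly-null column:

SELECT COUNT(*) FROM emp;                 -- every physical row
SELECT COUNT(commission_pct) FROM emp;    -- only rows where commission_pct is non-null
SELECT COUNT(*) FROM emp
 WHERE commission_pct >= 0.1;             -- can never exceed the previous count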


Let's think about that last point. A typical index doesn't represent null values. The default index is a tree data structure (a B-tree) that would be right at home in a Java programming course. You could have a table with 50 million rows, but if only 1 million rows had something filled in for column X, you could build an index on that column and the index would take a relatively small amount of space -- disk space to store the index, and memory to manipulate pointers to the real data. Any query that tested that column could immediately ignore the other 49 million rows as irrelevant -- and that's why the index doesn't try to record which rows have nulls.
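
A rough sketch of that 50-million-row scenario, with made-up names:

CREATE INDEX big_table_x_idx ON big_table (column_x);

-- Rows where column_x is null have no entry in the index, so this
-- count can be answered from the small index without touching the table.
SELECT COUNT(column_x) FROM big_table;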

That means if we had a kind of index that indexed nulls, or one that guaranteed there weren't any nulls in the table, COUNT(*) could run really fast. What do you know, we do have indexes like that. Bitmap indexes represent a string of zeros and ones for every different value in a column, including null, to say very quickly whether a row contains a particular value. So when you run COUNT(*) on a table with a bitmap index, it could just count the number of zeros and ones for some value from that column. And a NOT NULL constraint on a column, which also comes for free with a primary key, means COUNT(*) could just count the values in the index, knowing that all the rows have some value for that column, so the number of values in the index matches the number of rows in the table.
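
In SQL terms, either of these (again, made-up table and column names) gives Oracle a structure that accounts for every row:

-- A bitmap index keeps a bit per row for each value, nulls included.
CREATE BITMAP INDEX orders_region_bix ON orders (region);

-- A NOT NULL constraint (which a primary key brings along for free) means a
-- B-tree index on that column has an entry for every row in the table.
ALTER TABLE orders MODIFY (customer_id NOT NULL);
CREATE INDEX orders_customer_idx ON orders (customer_id);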

That leads us to a thought experiment that's also practical. In SQL*Plus, you can issue SET AUTOTRACE ON to show the explain plan for every SQL query you run. If you want a preview of query performance, you can turn on autotrace and wander around your database doing COUNT( ) queries, to see how many rows would be returned by your real queries without the overhead of actually bringing back all the data.
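
A minimal SQL*Plus session along those lines, with the same placeholder names as above, might look like:

SET AUTOTRACE ON
SELECT COUNT(*) FROM big_table;          -- check the plan: full table scan, or something index-y?
SELECT COUNT(column_x) FROM big_table;   -- compare: does it switch to an index?
SET AUTOTRACE OFF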

If you do some tests this way, you'll quickly see the results in the explain plans. Index range scans, fast full index scans, and other things with the word "index" in them are generally cause for happiness. Full table scans are generally cause to look closer at the query; for small tables they're not so bad, for big tables they're often a warning signal.

Comparing COUNT(*) and COUNT(column_name) this way, you'll see illustrations of what I've said above. COUNT(indexed_column) produces a happy explain plan. If the table has any bitmap indexes, a primary key, or a NOT NULL constraint, COUNT(*) will take advantage of that fact and also use an efficient explain plan.

That said, sometimes you have a table with, say, 35 million rows and a primary key, and COUNT(*) still takes a long time to come back. What's up? Well, an index with 35 million entries will take a lot of I/O to read and scan through. That's the basis of all the histograms, cardinality, and other arithmetic that revolves around database performance. (Like "don't use bitmap indexes if the number of different values exceeds 20% of the total number of rows", or "don't use bitmap indexes for small sets of values like Male/Female, where the rows with different values are all mixed together".) Sometimes the theoretical speedup from reading the fancy data structures and dereferencing pointers to get to the disk data is outweighed by the practical disk and memory cost of going through the index. Maybe the explain plan for a COUNT( ) query will be different from the same query bringing back real data, for that reason. And that's why there are different levels of expertise when it comes to database performance and query tuning.

Now, I'm just a guy who took the SQL Tuning course, read a lot of Oracle Concepts and Performance docs long before having any practical use for them, more than once drummed his fingers waiting for COUNT(*) to come back with an answer, and in a previous life wrote about close-to-the-metal performance considerations for IBM's POWER and PowerPC chips. This post doubtless doesn't capture all the nuances of query performance. This is just the way I think about that simple little COUNT( ) function.

Tuesday, December 16, 2008

Reformatting in vim


Things never really change. Even with XML, long lines are sometimes a big hassle: for example, a wide code example that runs off the side of the page, when your formatter doesn't offer a way to adjust the page margins for one element or to auto-shrink the font size.

I was spoiled back in the day by the ability of editors like XEDIT to apply formats to ranges of lines, so I hadn't really explored those capabilities too much in my favourite cross-platform vi clone, vim. When faced with the need, I would usually record a macro (q<letter>) to rearrange the contents of one line and move to the next line, play back the macro (@<letter>), and keep doing @@ to apply the same transform to a range of lines.

vim has language constructs to do conditional logic, but sometimes the old ways are best. Needing to wrap a bunch of wide code examples, without disturbing the other long lines in the file, I adapted this tip from the vim wiki. The essence of the tip is a command like:

:g/.\{80,\}/ .!par w70

Two little tweaks make it work better for me. OS X doesn't have a par command, so I change the last part of the command to fold -w70. And the g// command doesn't have to be global for the file; it can work on a range of lines. My habit when editing interactively is to set a mark a for the top of a range (ma) and a mark b for the bottom of a range (mb). So I set those marks at the top and bottom of a code example, then:

:'a,'b g/.\{75,\}/ .!fold -w75

(Numbers adjusted for my particular situation.)

Once this command has proved itself, the next step is to bind it to a function key (another practice that has gotten harder as editors have evolved; thanks CUA):

map <F5> :'a,'b g/.\{75,\}/ .!fold -w75^M

(With the ^M a real carriage return, generated by the key sequence <Ctrl-V><Enter>.)

At the point where I've identified all the contexts where this might be useful, e.g. inside DocBook <programlisting>...</programlisting> tags, I would add one more layer of abstraction to the macro: search for the start tag, set the first mark, search for the end tag, set the second mark, then run the command to apply the transformation between the two marks.
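
Here's a rough, untested sketch of that next layer, with the same caveat as before that each ^M is a real carriage return entered with <Ctrl-V><Enter>:

map <F6> /<programlisting>^Mjma/<\/programlisting>^Mkmb:'a,'b g/.\{75,\}/ .!fold -w75^M

The j and k just nudge the marks one line inside the element, so the tags themselves don't get run through fold.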

vim has kind of a weird regular expression syntax for doing OR-style searches. I don't know if it's new since vi, or if I just never used that particular capability in vi. To search for, let's say, "<programlisting>" or "<screen>", you'd enter:

/<programlisting\|<screen>

In this case, the \ doesn't mean "interpret the | character literally"; it means "interpret that character as a special search function". For more info, within vim enter:

:help pattern
:help magic

Tuesday, December 2, 2008

Don't Be Cruel


Ran across this old paper from Edsger Dijkstra, discussed on Slashdot:

On the Cruelty of Really Teaching Computer Science, as a PDF of scanned handwriting or the text transcript.

I'm not going to attack or defend the overall thesis, but here's one paragraph I will agree with:

My next linguistical suggestion is more rigorous. It is to fight the "if-this-guy-wants-to-talk-to-that-guy" syndrome: never refer to parts of programs or pieces of equipment in an anthropomorphic terminology, nor allow your students to do so. This linguistical improvement is much harder to implement than you might think, and your department might consider the introduction of fines for violations, say a quarter for undergraduates, two quarters for graduate students, and five dollars for faculty members: by the end of the first semester of the new regime, you will have collected enough money for two scholarships.

The problem is the false assumptions and analogies that get introduced by these lines of thinking. If a network is "this guy talking to that guy", your thinking will be constrained by what you know about human conversation. If there's a problem, someone can talk louder, slower, etc. and the analogy holds. But if the solution involves something that has no equivalent in the day-to-day world, how are you going to conceptualize it?

My pet peeve, which descends from this same idea, comes from the teaching of object orientation. A car is a type of vehicle; a truck is a type of vehicle; they both have wheels; maybe the number of wheels is different; maybe each has a different turning radius or procedure for backing up.

Great, now you understand inheritance, polymorphism, member functions, etc. In practice, however, you use object orientation to avoid zillions of "if" statements, special-case code, and large blocks of almost-but-not-quite duplicated code. These are concrete issues that you can see while writing programs, and they have concrete effects: making things run faster or slower, making the code bulkier or more compact, more or less readable and maintainable.

In my experience, someone who learned OO by simple analogies is likely to be locked into thinking "I have to write every program by making a tree of classes like with the cars and trucks", rather than "there are different ways to structure the code, and a good OO way will get rid of a lot of branching, magic numbers, and redundant code, without slowing things down too much or obfuscating what's really happening".
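
Here's a tiny, hypothetical Java sketch of the difference; the shape classes are invented for illustration, not taken from any real code base:

// The "zillions of ifs" version: every new kind of shape means editing this branch.
class AreaCalculator {
    static double area(String kind, double a, double b) {
        if ("circle".equals(kind))    return Math.PI * a * a;
        if ("rectangle".equals(kind)) return a * b;
        throw new IllegalArgumentException("unknown shape: " + kind);
    }
}

// The OO version: each class carries its own behaviour, and adding a new
// shape means adding a class, not touching existing code.
interface Shape { double area(); }

class Circle implements Shape {
    private final double radius;
    Circle(double radius) { this.radius = radius; }
    public double area() { return Math.PI * radius * radius; }
}

class Rectangle implements Shape {
    private final double width, height;
    Rectangle(double width, double height) { this.width = width; this.height = height; }
    public double area() { return width * height; }
}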

I remember seeing such analogies many times in the early days of Java. I don't know to what extent they made it into textbooks and university classroom settings, but I still see them here and there, for example this Q&A about how to implement an association relationship in Java.