Here's another ode to a small but fundamental aspect of Oracle, following the same theme as The Humble IF Statement. This time, let's look at the COUNT( ) function. I think when you look at it the right way, it opens up the whole story about database performance.
What's the first thing you do when poking around an unfamiliar system? I'll bet it involves SELECT COUNT(*) queries in one way or another. SELECT COUNT(*) FROM USER_OBJECTS to see how much "stuff" is in the schema. SELECT COUNT(*) FROM data_table to see how much data is around. SELECT COUNT(*) FROM audit_table to see how much usage the application gets. You can run SELECT COUNT(*) on various other data dictionary views, and with some WHERE clauses on regular tables, to answer questions like "how many...?", "are there any...?", and "confirm there are no...".
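For concreteness, here are the kinds of reconnaissance queries I mean; beyond the data dictionary views, the table and column names are just placeholders:

    SELECT COUNT(*) FROM user_objects;                             -- how much "stuff" is in the schema?
    SELECT COUNT(*) FROM user_objects WHERE status = 'INVALID';    -- confirm there are no broken objects
    SELECT COUNT(*) FROM orders WHERE order_date > SYSDATE - 7;    -- how much recent activity?
    SELECT COUNT(*) FROM audit_log WHERE username = 'APP_USER';    -- are there any rows for this user?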
You can also do SELECT COUNT(column_name) FROM table_name. That brings back a count of the number of rows that have non-null values in that column. What's that for?
- It demonstrates that you at least know something about the table, the name of a column. That's a small stake in the ground when you're establishing your credibility with experienced DBAs and developers.
- It suggests that you'll probably want to use that column in a WHERE clause test, later in a bigger query. If you use that column in an =, >=, BETWEEN, LIKE, etc. test, that comparison can only match non-null values. COUNT(column_name) represents an upper limit on the rows that could come back from such a query.
- It demonstrates that you're thinking on a higher plane. Not "how many physical rows total in this table", rather "how many significant rows with data in them". If you're a DBA trying to reclaim space, COUNT(*) might make more sense. If you're a developer tuning query performance, COUNT(column_name) is probably the right one to use.
- The COUNT( ) query has a better chance of using an index if you put in a column name rather than *. COUNT(*) can take a long, long time on tables with millions of rows. (There's a quick side-by-side sketch right after this list.)
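Here's that minimal side-by-side sketch, with a hypothetical CUSTOMERS table and EMAIL_ADDR column standing in for whatever you're really looking at:

    SELECT COUNT(*) FROM customers;            -- every physical row, nulls and all
    SELECT COUNT(email_addr) FROM customers;   -- only rows with a non-null email_addr

    -- If email_addr is indexed, that second count is also an upper bound on what
    -- a later predicate against the column can match:
    SELECT COUNT(*) FROM customers WHERE email_addr LIKE '%@example.com';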
Let's think about that last point. A typical index doesn't represent null values. The default index is a tree data structure (a B-tree) that would be right at home in a Java programming course. You could have a table with 50 million rows, but if only 1 million rows had something filled in for column X, you could build an index on that column and the index would take a relatively small amount of space -- disk space to store the index, and memory to manipulate pointers to the real data. Any query that tested that column could immediately ignore the other 49 million rows as irrelevant -- and that's why the index doesn't try to record which rows have nulls.
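A quick sketch of that situation, with made-up names; the point is that the index only holds entries for the rows where the sparse column is filled in:

    CREATE TABLE big_table (
      id         NUMBER PRIMARY KEY,
      sparse_col VARCHAR2(20)       -- null in, say, 49 of every 50 rows
    );

    CREATE INDEX big_table_sparse_ix ON big_table (sparse_col);

    -- The B-tree has entries only where sparse_col is non-null, so both of these
    -- can be answered from the small index without touching the 50-million-row table:
    SELECT COUNT(sparse_col) FROM big_table;
    SELECT COUNT(*) FROM big_table WHERE sparse_col = 'SOME_VALUE';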
That means if we had a kind of index that included nulls, or one that guaranteed there weren't any nulls in the table, COUNT(*) could run really fast. What do you know, we do have indexes like that. Bitmap indexes represent a string of zeros and ones for every different value in a column, including null, to say very quickly whether a row contains a particular value; since every row shows up in one of those bitmaps, COUNT(*) on a table with a bitmap index can be answered by adding up the bits that are set, instead of visiting the table. And a NOT NULL constraint on an indexed column -- which comes for free with a primary key -- means every row is guaranteed to have an entry in that index, so COUNT(*) can just count the index entries, knowing the number of entries matches the number of rows in the table.
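Here's a sketch of both flavors, again with invented names (and bitmap indexes need Enterprise Edition); either one gives the optimizer something it can count instead of the table:

    CREATE TABLE tickets (
      ticket_id NUMBER,
      status    VARCHAR2(10)
    );

    -- Option 1: a bitmap index -- nulls get a bitmap too, so every row is covered.
    CREATE BITMAP INDEX tickets_status_bix ON tickets (status);

    -- Option 2: a primary key, which brings the NOT NULL constraint and the index
    -- together in one shot.
    ALTER TABLE tickets ADD CONSTRAINT tickets_pk PRIMARY KEY (ticket_id);

    -- With either in place, this can be satisfied by counting index entries rather
    -- than scanning the whole table:
    SELECT COUNT(*) FROM tickets;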
That leads us to a thought experiment that's also practical. In SQL*Plus, you can issue SET AUTOTRACE ON to show the explain plan for every SQL query you run. If you want a preview of query performance, you can turn on autotrace and wander around your database doing COUNT( ) queries to see how many rows your real queries would return, without the overhead of actually bringing back all the data.
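Something like this, assuming your account has the plan table and privileges that autotrace wants (the PLUSTRACE role for the statistics part), and again with placeholder table names:

    SET AUTOTRACE ON                  -- plan plus statistics after each query
    -- or, to show the plan without printing the query results:
    SET AUTOTRACE TRACEONLY EXPLAIN

    SELECT COUNT(order_id) FROM orders WHERE order_date > SYSDATE - 30;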
If you do some tests this way, you'll quickly see the results in the explain plans. Index range scans, fast full index scans, and other things with the word "index" in them are generally cause for happiness. Full table scans are generally cause to look closer at the query; for small tables they're not so bad, for big tables they're often a warning signal.
Comparing COUNT(*) and COUNT(column_name) this way, you'll see illustrations of what I've said above. COUNT(indexed_column) produces a happy explain plan. If the table has a bitmap index, a primary key, or a NOT NULL constraint on an indexed column, COUNT(*) can take advantage of that fact and also use an efficient explain plan.
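Roughly what to expect with autotrace on -- the operation names below are real plan operations, but which one you actually get depends on your indexes and your optimizer statistics:

    SELECT COUNT(last_name) FROM employees;   -- with an index on last_name, look for
                                              -- an INDEX FAST FULL SCAN in the plan
    SELECT COUNT(*) FROM employees;           -- with a primary key or bitmap index,
                                              -- still index-driven; with neither,
                                              -- expect a TABLE ACCESS FULL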
That said, sometimes you have a table with, say, 35 million rows and a primary key, and COUNT(*) still takes a long time to come back. What's up? Well, an index with 35 million entries takes a lot of I/O to read and scan through. That's where all the histograms, cardinality estimates, and other arithmetic around database performance come in. (Like "don't use bitmap indexes if the number of different values exceeds 20% of the total number of rows", or "don't use bitmap indexes for small sets of values like Male/Female, where the rows with different values are all mixed together".) Sometimes the theoretical speedup from reading the fancy data structures and dereferencing pointers to get to the disk data is outweighed by the practical disk and memory cost of going through the index. The explain plan for a COUNT( ) query can even differ from the plan for the same query bringing back real data, for that reason. And that's why there are different levels of expertise when it comes to database performance and query tuning.
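One way to get a feel for how much reading that index really means -- and this is just a rough check, with made-up segment names -- is to look at the size of the index segment itself:

    SELECT segment_name, segment_type, bytes/1024/1024 AS mb
    FROM   user_segments
    WHERE  segment_name IN ('BIG_TABLE', 'BIG_TABLE_PK');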
Now, I'm just a guy who took the SQL Tuning course, read a lot of Oracle Concepts and Performance docs long before having any practical use for them, more than once drummed his fingers waiting for COUNT(*) to come back with an answer, and in a previous life wrote about close-to-the-metal performance considerations for IBM's POWER and PowerPC chips. This post doubtless doesn't capture all the nuances of query performance. This is just the way I think about that simple little COUNT( ) function.