Saturday, April 12, 2008

Deleting Data, Part 2

In a previous post, I listed a few different techniques for deleting Oracle data. Just to recap:

DELETE statement in SQL (or PL/SQL).
FORALL loop construct wrapped around a DELETE statement in PL/SQL.
TRUNCATE TABLE statement.
Temporary tables.

Different themes emerge. Some ways are more flexible (DELETE statement), some are faster for large amounts of data (FORALL, TRUNCATE), and some are more convenient (temporary tables).

As I thought about it some more, I realized there must be 50 ways to... delete your data. Here are some of the others. I'm sure there are plenty more that I haven't had direct experience with.

ALTER TABLE DROP PARTITION falls in between a normal DELETE and a TRUNCATE. You've specified some criteria to divide up the values in a large table, for example by year or by area code, and with one statement you can get rid of all the data in one of those buckets. Perhaps every Jan. 1, you remove the data for the 5-years-ago year. Dropping a partition simplifies the behind-the-scenes bookkeeping for the physical storage and the indexes, as opposed to issuing a DELETE in a regular table.

If you add a foreign key constraint to a column, you can include the ON DELETE CASCADE clause. The foreign key references another table, meaning you can't insert data unless the value in that column matches one of the key values in a different table. ON DELETE CASCADE means that when data is deleted from the other table, the matching key values are deleted from your table also, saving you from having to issue multiple sets of DELETE statements for interrelated tables.

You can add a foreign key constraint via the CREATE TABLE statement or the ALTER TABLE statement. Here's where it becomes difficult to dream up different "tasks" for different ways to achieve the same end. Putting a foreign key constraint in a CREATE TABLE statement suggests you are planning ahead and have constructed a detailed database design. Putting a foreign key constraint in an ALTER TABLE statement could mean that you are troubleshooting a problem once the database is in operation; perhaps it was overoptimistic to assume that every programmer would remember to DELETE from all the right tables. Or it could just mean that you were writing a setup script, found it easier to guess the right constraint syntax for ALTER TABLE than for CREATE TABLE, and so wrote a simple vanilla CREATE TABLE statement immediately followed by one or more ALTER TABLEs.

Personally, I use foreign key constraints a little less than I might like. In a data warehousing situation, such as with a search engine, I might be fiddling with or reloading data to get everything just right, and want to avoid unexpected side-effects in other tables. For example, if you reload the set of US state abbreviations into a 50-row table, you wouldn't want all the customer data for California to disappear because the 'CA' row was removed for a moment.

Another space-saving technique that applies to PL/SQL only is the SERIALLY_REUSABLE pragma. It lets the database free up memory space used by package variables, more frequently than happens by default. Then you don't need to feel so guilty about having your package code load the entire contents of a table into a PL/SQL collection data structure. But the memory gets freed up so often that the pragma seems only useful for specialized situations with long-running sessions.

Let's broaden the objective a little bit. Some of these techniques make data disappear without requiring action on the programmer's part, keeping bad old data out of query results and freeing up space on disk. Taking off the DBA hat and putting on the developer one, perhaps we wouldn't care quite as much about the disk space. What about techniques to make data disappear from queries, even if it doesn't really disappear from the database?

The CREATE VIEW statement can create a rolling window based on dates or some other value in the data stream. For example, you might use a condition WHERE date_col >= SYSDATE - 7 to only examine the last 7 days worth of data. Because Oracle dates include a time component, that would mean everything from 1:25 PM last Saturday up to 1:25 PM this Saturday, using the example of my own clock as I type this. If you wanted only full-day periods, you would use a condition such as WHERE date_col BETWEEN TRUNC(SYSDATE) - 7 AND TRUNC(SYSDATE). That would include everything from the start of last Saturday, up to the start of today, and would give you consistent results if you ran the same query again during the same day.

Other views might restrict queries to the last 1000 web page views or financial transactions. I think of such views broadly under the "deleting data" umbrella, because just based on the passage of time or activity within an application, older data stops appearing in query results without any inconvenience to the programmer.

Another technique along the same lines is the CREATE MATERIALIZED VIEW statement. This has 2 distinct uses: either you have a table normally accessed through a database link that is so slow you just want to have a local copy that you can query, or you have such an expensive/complicated query against a big table that you want to keep the query results in their own table so you can retrieve or subset them much faster. The "deleting" part comes because the materialized view can be automatically refreshed, either on a schedule or whenever the data in the underlying table changes. So as with other views, you can sit tight and old data will stop showing up in your queries.

I've started making some headway with materialized views, but am not yet a master of the refreshing techniques. That seems to require perusing both the SQL Reference and the Data Warehousing Guide. All those Data Warehousing examples with quarterly sales figures etc. don't resonate much with me when I'm learning about MVs, analytic functions, or partitioned tables.

I think those are all the related techniques that I use or consider using for such purposes. Of course, you can always manufacture other situations or add layers on top of these techniques. You could set up a scheduled job that deleted old data, you could have a trigger that acted like a foreign key constraint by deleting from a related table -- the possibilities are endless!

No comments: