Saturday, May 5, 2007

Perl vs. PL/SQL: Regular Expressions

One recent addition to Oracle's SQL dialect, and also PL/SQL, is Perl-like regular expressions.

It's very convenient to declare a CHECK constraint as a regular expression, using the REGEXP_LIKE operator. The data has to match a pattern, or it doesn't get into the database. The pattern could be all caps or all lowercase, some set of allowed characters, alphanumeric values with certain patterns, and so on.

It's also very convenient to set up a trigger that does a replace using regular expressions, using REGEXP_REPLACE instead of a combination of SUBSTR, TRIM, and REPLACE stretching out over dozens of lines. Back-references are particularly useful, say for recognizing a flexible pattern within a block of HTML and enclosing that pattern in tags for highlighting.

A couple of things are not quite ideal yet in the PL/SQL implementation. I have found that running REGEXP_REPLACE on certain moderately large strings can fail (and I've opened a bug for that). I wish there were more metacharacters supported; in the name of NLS, some \ sequences familiar from Perl are left out. And I am having an issue that I haven't fully diagnosed yet, doing backreferences in a pattern containing multiply nested parentheses.

Don't have code at hand to post samples; I'll try to come back to that later.

1 comment:

Byte64 said...

Hello John,
it would be interesting if you could try out some of these nasty patterns at Yocoya' Regular Expression Workbench for Oracle.
In my own experience there is an elusive ORA-21779 error that pops up from time to time and it seems related to some low memory condition but is very difficult to track down as it has an erratic frequency, which means that the same query sometimes works while sometimes it doesn't.
Another interesting aspect of Oracle's REGEXP is the so-called "greedy" implementation.
I'll write something specific in my blog later on, but basically it means that a pattern like "a.+a" will match the longest string between the two a's, not the shortest, as you can easily see yourself by running the following query:

select REGEXP_SUBSTR(
'ahahahahaaha',
'a.+a',
1,
1,
'c') test
from dual;

By the way, Oracle Developers must have some common taste for web sites if we all picked the same blogger template! ;-)

Cheers,
Flavio Casetta