Saturday, December 7, 2013

My history with Big Data

Before I joined Cloudera, I hadn't had much formal experience with Big Data. But I had crossed paths with one of its major use cases before, so I found it easy to pick up the mindset.

My previous big project involved a relational database hooked up to a web server. Naturally I wanted to be able to track visitor stats, detect denial-of-service attacks, and chart the most popular pages, search terms, referrers, etc. Since I had more control over the database side, I threw all the relevant fields from the HTTP header and query string into a database table, all divided nicely into columns. With SQL queries, problems were easy to detect, trends were easy to spot, questions were easy to answer. Life was good.

As the system grew, the logging table grew to a point where it hit a bottleneck due to defaults set up years before. A simple single-row INSERT statement into this logging table could stall for minutes at a time. There was a more scalable layout for the table (IIUC, this feature didn't exist when my application was first deployed), but the combination of constant traffic on the site plus relatively little spare space to reorganize things meant that the switchover had to be done in tiny increments, over a period of a couple of weeks.

I looked at the size of the table: a few tens of millions of rows, no more than 16GB total. I would shuffle more data than that around my hard drive after one day of shooting photos on vacation. Why couldn't I have a simple append-only data store without the relational and transactional overhead, but still with that nice, convenient SQL query interface?

As time went by, the volume of log data came up again and again as a sticking point -- even though it was still minuscule by the standards of any computer hobbyist. The mechanism for looking at this info changed from querying a relational database to analyzing the raw Apache logs through Java code. Now there was an ETL process for getting each day's worth of data. Transferring the files, then uncompressing and analyzing them, took long enough that it became a race to finish before it was time to start on the next day's data. There wasn't space to keep around all the raw data files, so only summary information was kept permanently. Now, doing any new kind of analysis involved writing a custom Java program, and the analysis could only apply to a limited time period.

Given the tradeoffs, I preferred the original way of doing things, with a SQL interface to the data covering all time, even if it meant more overhead on the recording side and periodic pleas for more storage space. I know Java, Java is on my resume, but I didn't think it made sense to write elaborate Java programs with classes and loops for what was a perfect use case for SQL 1-liners (or failing that, a scripting job for Python or Perl). What do you know, Hadoop is going through exactly the same maturation, with people concluding that Java-based MapReduce jobs aren't the most effective use of developer or computer time for queries that are in SQL's sweet spot.

This is why I am so keen on the dynamics of Hadoop and Big Data, especially as exemplified by Cloudera Impala. I know there is an army of people out there who know SQL, who have been held back all these years by having to plead for even modest amounts of storage space, and spent much of their time twiddling their thumbs waiting for some lengthy ETL process to finish. Give 'em some gigabytes, terabytes, or even petabytes of storage to work with; give 'em read access to the original data files without needless copying and conversion steps. And let the insights roll in.

Tuesday, May 14, 2013

Documentation for Cloudera Impala 1.0.1

The Cloudera docs have been undergoing some reorganization lately. That goes double for the Impala documentation, which moved from beta to GA status and has been restructured at the same time. The June 1.0.1 release brings some new reference information, so I've broken out the SQL language links too. For posterity (and a little Googlerank mojo), here's the current layout of the Impala docs in mid-June 2013:

Monday, May 6, 2013

New Job for John

Hello readers of my infrequent blog posts! I have started a new job, working on documentation for Cloudera, specifically for the Impala project, which is bringing fast interactive SQL to the Hadoop ecosystem. Read the Impala documentation. Download the Impala software. Get the QuickStart VM to play around with a pre-installed system. (You'll need to start Impala from within the Cloudera Manager interface after booting the VM.)

Monday, December 10, 2012

Error Handling: Or, How to Start an Argument Among Programmers

Lately I see a lot of discussions come up about error handling. For example: the Dr. Dobb's article "The Scourge of Error Handling", covering mainly C-descended languages; this blog post stemming from OTN discussions of Oracle exception handling in PL/SQL. I've been on both sides of the fence (cowboy coder and database programmer) so I'll offer my take. I'll write mostly from the perspective of PL/SQL stored procedures, because that's where I've coded the biggest variety of error handlers. The same opinions apply to my programming in Python, JavaScript, Java, or C++; that's just a lower volume of code.

I have a lot of sympathy for the Dr. Dobb's article's position that we've neglected error codes too much. I also don't mind PL/SQL's WHEN OTHERS clause as much as some people do.

It's true that exception handling is often used as a kludge that winds up simulating GOTOs and producing spaghetti code. In fact, I don't think I've ever seen a substantial body of code that used "admirable" exception-handling practices. People know some but not all of the things that could throw exceptions, so their try/catch blocks are insufficient. If you really want to protect and debug every operation that could throw an error, you need way more exception handling than anyone actually writes in practice. (In theory, you could get some sort of out of memory, memory corrupted, or disk full error at any random time.) If you wind up maintaining someone else's code, basically any new bug is going to highlight the need for more granular exception handling, or dealing with some never-before-seen condition, or better recovery code in the caller.

For example, I was the author of the Oracle PL/SQL User's Guide for a time, and wrote more than 100K lines of PL/SQL code. I had many first-time encounters with hard-to-debug exceptions in purely PL/SQL operations, for things like nonexistent collection items, index out of range, bad datatype conversions, and so on. Those were often in query-intensive pieces of code where there was no need for a rollback, just an error message for a user or a slightly less-than-optimal query result for a search engine result page. The structure of stored procedures dealing with collection types often required putting separate try/catch blocks around every reference to an index variable, or constructing some debugging metadata using extra variables designating how far the routine had gotten. That didn't feel right; once the language leads you down a path that seems clunky, coders are less likely to stick to best practices in all other areas.
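The two clunky patterns described above -- a separate handler around every collection reference, versus one coarse handler plus a "how far did we get" marker -- can be sketched roughly like this. The sketch is in Python rather than PL/SQL, and all names are hypothetical, but the shape of the tradeoff is the same:

```python
# Two styles of guarding collection lookups, as described above.
# Hypothetical names; Python stands in for PL/SQL collection access.

def lookup_per_access(items, keys):
    """Style 1: wrap every index reference in its own handler."""
    results = []
    for key in keys:
        try:
            results.append(items[key])
        except KeyError:
            # Narrow handler: a missing item just degrades the result,
            # e.g. a slightly less-than-optimal search result page.
            results.append(None)
    return results

def lookup_with_marker(items, keys):
    """Style 2: one coarse handler, with a marker recording progress."""
    results = []
    step = "start"
    try:
        for key in keys:
            step = "looking up " + repr(key)  # debugging metadata
            results.append(items[key])
    except KeyError:
        # The marker tells us how far the routine had gotten.
        raise RuntimeError("failed while " + step)
    return results
```

Neither style is satisfying: the first buries the logic in boilerplate, the second turns the handler into a bookkeeping exercise, which is the "clunky path" complaint above.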

I also dealt with built-in PL/SQL stored procedures where I didn't know in advance what exceptions could be raised, and so learned over time what were the most likely ones to come back based on my imperfect understanding of the internal workings of some package. But how could I ever predict in advance all the things that could go wrong when settings were changed, the database upgraded, or some hardware limit was reached that I couldn't anticipate in a test environment? The people writing such routines, who have greater expertise with the internal operations, can just RAISE any exceptions and put the burden of predicting and handling them on you, the ill-informed caller. That feels like the wrong way round.

For those reasons, I wouldn't demand that WHEN OTHERS always propagate an exception back to the caller. It could be protecting a purely optional section of code, for example one that logs error conditions. (Which is already likely to be called when things are messed up in unexpected ways.) Or that produces some optional UI adornment that can be skipped in the name of keeping things running in the face of moderate problems. Personally, I'm happy when PL/SQL stored procedures tackle some new problem domain where not every code path is mission critical or represents a DML operation inside a transaction.
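As a minimal sketch of that "purely optional section" idea (again in Python, with hypothetical names): a best-effort logging helper swallows its own failures, the way a WHEN OTHERS with no RAISE would, so a broken log never takes down the main code path:

```python
# Hypothetical example: error logging is optional, so failures inside
# the logger itself are deliberately swallowed -- the equivalent of
# WHEN OTHERS with no RAISE protecting a non-critical section.

def log_error(message, log_file="/tmp/app_errors.log"):
    """Best-effort logging; a failure here must never propagate."""
    try:
        with open(log_file, "a") as f:
            f.write(message + "\n")
    except OSError:
        pass  # an unwritable log is not worth crashing over

def handle_request(data):
    """Main code path; logging trouble only degrades, never aborts."""
    try:
        return data["page"]
    except KeyError:
        log_error("missing 'page' key")  # optional adornment, may do nothing
        return "default_page"
```

The point is that the swallowing handler protects only the optional step; the caller still gets a sensible result either way.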

As a developer, you also don't necessarily get to control how unhandled exceptions are displayed or logged all the way back at the web layer. For example, the PL/SQL web gateway by default shows a stack trace, which makes unhandled exceptions easy to debug in a development/test environment. But that might be turned into a generic 404-style error page on the production web server, meaning it's just as well to handle the exception entirely yourself (WHEN OTHERS with no RAISE) and put up your own error page or make subtle changes to your page of search engine results or whatever.
The best argument I've seen against WHEN OTHERS THEN <something besides raise> is that the system could be so hosed that another exception occurs doing something you thought was completely innocuous inside the handler, leading to an infinite loop of exceptions. I've never encountered that myself, so I can't say how that works in practice.

Getting away from PL/SQL for a second...

In JavaScript, you're almost always dealing with non-critical code paths, which makes it more practical to do less granular error handling. (If errors X, Y, or Z occur during this function, don't display the ad widget we were trying to retrieve.)

Python's nice robust standard library and introspection make it very pleasant to write exception handlers for routines you didn't write yourself. There are plenty of things you can do with scripting languages where, if you run out of memory or out of disk space, you can live with a sudden halt and an error report rather than extensive recovery actions. So in Python (and Perl, et al.) I don't mind having a top-level exception handler and not going overboard with try/catch blocks for every operation that could fail.
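That top-level-handler style might look like this (a sketch with hypothetical names): the worker function makes no attempt to guard individual operations, and one handler at the entry point reports the failure and halts:

```python
# One top-level handler instead of try/except around every operation.
import sys
import traceback

def process_files(paths):
    """No per-operation guards: any open() or read() may raise."""
    total = 0
    for path in paths:
        with open(path) as f:
            total += len(f.read())
    return total

def main(paths):
    try:
        print(process_files(paths))
    except Exception:
        # One place to report; the traceback gives enough to debug.
        traceback.print_exc()
        sys.exit(1)
```

For a script, the sudden halt plus the full traceback is often a perfectly acceptable "recovery action".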

Wednesday, August 29, 2012

Separate docs for MySQL Connectors

The MySQL documentation section has always had this Topic Guides page containing links to the docs for the various MySQL Connectors -- the official database drivers for various languages and programming technologies. That is the most convenient way to get the information for each Connector in PDF form, rather than downloading the entire Ref Man PDF. For HTML, it was more of a shortcut, because most HTML links led back into the Ref Man, where the Connectors material is embedded all in one chapter. (For example, "22.3. MySQL Connector/J".)

Now we have started building each of the Connectors sections as its own HTML doc, linked from the Topic Guides page. That allows for faster turnaround for updates -- sometimes the Ref Man takes a few days to be rebuilt and posted. Also, it gives each Connector its own TOC, making the topics more visible than when they're pushed down several levels in the Ref Man. (For example, "22.3.7. Connection Pooling with Connector/J" becomes "Chapter 7. Connection Pooling with Connector/J".)

The language-themed Connectors are:

Thursday, April 1, 2010

InnoDB Plugin Doc

The InnoDB Plugin manual is now available on the MySQL web site.

Wednesday, March 10, 2010

More on Single-Action UIs

In my post Deconstructing the iPod Shuffle UI, I talked a bit about the notion of a limited UI where you really only do one thing -- in that case, click the button on the headphones.

Every now and then, I rediscover the infrared remote that goes with my iMac, and realize that for many mundane tasks, the remote does everything I need. (Mira lets me assign different actions for the remote button for specific applications. Candelair works around a Snow Leopard issue with Mira.)

For example, with a new album in a photo library, you might want to Next-Next-Next through each photo, and if one is particularly bad, trash it; if one is particularly good, mark it with a high rank or a coloured label; otherwise, leave it as-is and proceed to the next one. The up/down/left/right/play buttons on the remote can accommodate those actions comfortably. All the photo software offers keyboard shortcuts for these actions, which can be assigned to the remote buttons. So this brain-intensive task can be accomplished while sitting back and taking in the view on a large screen.

One thing had been bothering me about the UI for the iPod Touch. If I'm listening to music with the screen turned off, say in the car or the office, and I'm interrupted (want to stop the music suddenly) or annoyed (want to immediately skip to the next song), normally this requires turning on the screen (Home button or power button), and unlocking the iPod by swiping before the music controls are available. I should have realized Apple would bring the "single-action UI" to the rescue. Instead of unlocking the whole unit, once it is powered on in a locked state, 2 more presses of the Home button bring up music controls that let you pause or skip without unlocking. It's almost iPod Shuffle-like. Now if only there were a way to assign star ratings without unlocking.

Many other day-to-day activities could benefit from this same kind of UI, in particular with the remote. You've got a lot of objects to consider or evaluate, and a limited number of actions for each one, including a default of "do nothing at all". Someone could play through a set of songs and give high / medium / low rating to each one, or trash it. A programmer could scroll through a list of bugs or functional specs, and for each one indicate "closed", "waiting for more information", "this does not apply to me", etc. That may be the way of the future, where carpal tunnel and hunched shoulders are just a distant memory.

Monday, February 15, 2010

Snow Leopard upgrade

I finally upgraded the main iMac to Snow Leopard. For the first time ever, an upgrade actually resulted in more free space, an extra 6 GB worth. The main features that I notice are fairly minor -- the ability to view stacks on the dock using a normal icon instead of the smashed-together icons of the apps in the folder; the ability to have the time announced every 15, 30, or 60 minutes by the computer voice.

I didn't notice any increase in speed, but maybe that's because Adblock in Safari went away until I installed a new version of that extension. Also I made the mistake of enabling Spotlight in hopes of getting "search entire message" working again in Mail.app. Although I think I've turned Spotlight off again, I'm always suspicious that it might still be active when I hear the drive whirring when I think the system should be idle.

My only real software incompatibility was with MediaWiki (the software that powers Wikipedia). It wouldn't run after the upgrade, because the newer PHP with Snow Leopard resulted in syntax errors in the MediaWiki PHP code. I upgraded MySQL because I was running an old PowerPC-based version via emulation. But I hadn't exported the wiki data, so I had to fall back to the older MySQL. When I updated the MediaWiki code to get past the PHP errors, I discovered that upgrading from such an old codebase meant running an upgrade script to change around the MySQL tables, and this script is also in PHP. And for some reason, the PHP script can't connect to the database. Since so many things needed to be updated at once, the cause could be related to PHP, or MySQL, or maybe that I had installed them by a different method before (the Marc Liyanage "entropy" packages). For example, after installing the latest MySQL through the official installer, mysqld now restarts when I kill it, which I don't think was the case previously.

Update 1

The Canon scanner software with my Pixma MX330 all-in-one printer refuses to start after the Snow Leopard upgrade. I can still scan using a dialog that opens up within Print & Fax preference pane for the printer, but only with barebones options. The recommended troubleshooting from Apple (reset the printing system, remove and re-add the printer) doesn't work. The latest software from Canon, supposed to be Snow Leopard-compatible, also doesn't help.

The problem with MediaWiki boiled down to a config setting for PHP that needed to be applied or reapplied after going to Snow Leopard, 3 instances in /etc/php.ini of the path to mysql.sock:

(Solution found via http://maestric.com/doc/mac/apache_php_mysql_snow_leopard)

pdo_mysql.default_socket = /tmp/mysql.sock
mysql.default_socket = /tmp/mysql.sock
mysqli.default_socket = /tmp/mysql.sock

This particular item was difficult to track down. Most of the advice that I found re: Snow Leopard upgrades was to make sure this setting was in place:
mysql.default_port = 3306

Once the socket path was set up, it was a simple matter to run update.php and upgrade the MediaWiki schema, dump and re-import the data from the old PPC installation of MySQL to the native Intel version, and start the wiki going again. Thanks to Chris Jones of Oracle and Johannes Schluter, ah, also now of Oracle, for the help. It was getting a bit embarrassing, when I wanted to look up something that was on the wiki, having to read through the mysqldump file!

Tuesday, January 12, 2010

Follow me on Twitter

I'm Max Webster there.

Ten Years Gone

I've been pretty quiet lately, because I'm in a transitional period. After 10 years on documentation for Oracle Database and other enterprise server products, I'm switching to the InnoDB group that already works with MySQL. New development environments, new customers, it's an exciting time!

A decade seems to be the right timeframe for me. It was 10 years at IBM before that. Check back in 2019, I'm sure there'll be something new then too.