Tuesday, May 14, 2013
Documentation for Cloudera Impala 1.0
The Cloudera docs have been undergoing some reorganization lately. That applies double to the Impala documentation, which moved from beta to GA status and has been restructured at the same time. For posterity (and a little Googlerank mojo), here's the current layout of the Impala docs in mid-May 2013:
- Cloudera Impala Release Notes
- Installing and Using Cloudera Impala
- Impala Concepts and Architecture
- Installing Cloudera Impala
- Configuring Impala
- Upgrading Impala
- Impala Security with Kerberos Authentication
- Starting Impala
- Learning Impala Tutorial
- Language Reference
- Using the Impala Shell
- Tuning Impala Performance
- Understanding File Formats
- Using Impala to Query HBase Tables
- Using Impala Logging
- Appendix A - Ports Used by Impala
- Appendix B - Troubleshooting Impala
- Cloudera Impala Frequently Asked Questions
Monday, May 6, 2013
New Job for John
Hello readers of my infrequent blog posts! I have started a new job, working on documentation for Cloudera, specifically for the Impala project, which is bringing fast interactive SQL to the Hadoop ecosystem. Read the Impala documentation. Download the Impala software. Get the QuickStart VM to play around with a pre-installed system. (You'll need to start Impala from within the Cloudera Manager interface after booting the VM.)
Monday, December 10, 2012
Error Handling: Or, How to Start an Argument Among Programmers
Lately I see a lot of discussions come up about error handling. For example: Dr. Dobb's "The Scourge of Error Handling" covering mainly C-descended languages; this blog post stemming from OTN discussions of Oracle exception handling in PL/SQL. I've been on both sides of the fence (cowboy coder and database programmer) so I'll offer my take. I'll write mostly from the perspective of PL/SQL stored procedures, because that's where I've coded the biggest variety of error handlers. The same opinions apply to my programming in Python, JavaScript, Java, or C++; that's just a lower volume of code.
I have a lot of sympathy for the Dr. Dobb's article's position that we've neglected error codes too much. I also don't mind PL/SQL's WHEN OTHERS clause as much as some people do.
It's true that exception handling is often used as a kludge that winds up simulating GOTOs and producing spaghetti code. In fact, I don't think I've ever seen a substantial body of code that used "admirable" exception-handling practices. People know some but not all of the things that could throw exceptions, so their try/catch blocks are insufficient. If you really want to protect and debug every operation that could throw an error, you need way more exception handling than anyone does in practice. (In theory, you could get some sort of out of memory, memory corrupted, or disk full error at any random time.) If you wind up maintaining someone else's code, basically any new bug is going to highlight the need for more granular exception handling, or dealing with some never-before-seen condition, or better recovery code in the caller.
For example, I was the author of the Oracle PL/SQL User's Guide for a time, and wrote more than 100K lines of PL/SQL code. I had many first-time encounters with hard-to-debug exceptions in purely PL/SQL operations, for things like nonexistent collection items, index out of range, bad datatype conversions, and so on. Those were often in query-intensive pieces of code where there was no need for a rollback, just an error message for a user or a slightly less-than-optimal query result for a search engine result page. The structure of stored procedures dealing with collection types often required putting separate try/catch blocks around every reference to an index variable, or constructing some debugging metadata using extra variables designating how far the routine had gotten. That didn't feel right; once the language leads you down a path that seems clunky, coders are less likely to stick to best practices in all other areas.
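The "how far had the routine gotten" bookkeeping can be sketched in Python rather than PL/SQL. This is only an illustration of the idea, not code from any real project: the function name `summarize_scores`, the `stage` marker, and the sample data shapes are all hypothetical.

```python
def summarize_scores(scores, wanted_ids):
    """Return (total, error_message); error_message is None on success."""
    stage = "start"  # hypothetical progress marker, updated before each risky step
    total = 0
    try:
        for item_id in wanted_ids:
            stage = f"looking up item {item_id!r}"
            raw = scores[item_id]  # may raise KeyError (nonexistent collection item)
            stage = f"converting value for item {item_id!r}"
            total += int(raw)  # may raise ValueError (bad datatype conversion)
    except (KeyError, ValueError) as exc:
        # One handler covers the whole loop; the marker records where it failed.
        return total, f"failed while {stage}: {type(exc).__name__}"
    return total, None
```

A single handler plus a marker variable avoids wrapping every lookup in its own block, at the cost of the extra bookkeeping that felt clunky in the first place.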
I also dealt with built-in PL/SQL stored procedures where I didn't know in advance what exceptions could be raised, and so learned over time what were the most likely ones to come back based on my imperfect understanding of the internal workings of some package. But how could I ever predict in advance all the things that could go wrong when settings were changed, the database upgraded, or some hardware limit was reached that I couldn't anticipate in a test environment? The people writing such routines, who have greater expertise with the internal operations, can just RAISE any exceptions and put the burden of predicting and handling them on you, the ill-informed caller. That feels like the wrong way round.
For those reasons, I wouldn't demand that WHEN OTHERS always propagate an exception back to the caller. It could be protecting a purely optional section of code, for example one that logs error conditions. (Which is already likely to be called when things are messed up in unexpected ways.) Or that produces some optional UI adornment that can be skipped in the name of keeping things running in the face of moderate problems. Personally, I'm happy when PL/SQL stored procedures tackle some new problem domain where not every code path is mission critical or represents a DML operation inside a transaction.
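As a rough Python analogue of WHEN OTHERS with no RAISE around a purely optional step, consider the sketch below. The helper names `run_optional` and `render_results`, and the logger name, are made up for illustration; the point is only that the catch-all is confined to code the page can live without.

```python
import logging

logger = logging.getLogger("search")  # hypothetical logger name


def run_optional(step, default=None):
    """Run a non-critical callable; on any ordinary error, log it and return default."""
    try:
        return step()
    except Exception:  # deliberate catch-all, like WHEN OTHERS with no RAISE
        logger.exception("optional step failed; continuing without it")
        return default


def render_results(results, make_adornment):
    """Build the page lines; the adornment is optional, the results are not."""
    lines = list(results)
    banner = run_optional(make_adornment)
    if banner is not None:
        lines.insert(0, banner)
    return lines
```

If `make_adornment` fails, the caller still gets the plain results. Anything raised while building the results themselves is not swallowed.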
As a developer, you also don't necessarily get to control how unhandled exceptions are displayed or logged all the way back at the web layer. For example, the PL/SQL web gateway by default shows a stack trace, which makes unhandled exceptions easy to debug in a development/test environment. But that might be turned into a generic 404-style error page on the production web server, meaning it's just as well to handle the exception entirely yourself (WHEN OTHERS with no RAISE) and put up your own error page or make subtle changes to your page of search engine results or whatever.
The best argument I've seen against WHEN OTHERS THEN <something besides raise> is that the system could be so hosed that another exception occurs doing something you thought was completely innocuous inside the handler, leading to an infinite loop of exceptions. I've never encountered that myself, so I can't say how that works in practice.
Getting away from PL/SQL for a second...
In JavaScript, you're almost always dealing with non-critical code paths, which makes it more practical to do less granular error handling. (If errors X, Y, or Z occur during this function, don't display the ad widget we were trying to retrieve.)
Python's nice robust standard library and introspection make it very pleasant to write exception handlers for routines you didn't write yourself. There are plenty of things you can do with scripting languages where, if you run out of memory or out of disk space, you can live with a sudden halt and an error report rather than extensive recovery actions. So in Python (and Perl, et al.) I don't mind having a top-level exception handler and not going overboard with try/catch blocks for every operation that could fail.
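A single top-level handler of that kind might look something like the following minimal sketch. The script body (`main`, which counts lines in a file named on the command line) is a hypothetical stand-in for whatever the script actually does; only the shape of `run` matters.

```python
import sys
import traceback


def main(argv):
    """Hypothetical script body: no local try/except, errors simply propagate."""
    path = argv[1]
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def run(argv):
    """Top-level handler: report any failure once and return an exit status."""
    try:
        print(main(argv))
        return 0
    except Exception as exc:
        traceback.print_exc(file=sys.stderr)  # full detail for whoever debugs it
        print(f"error: {type(exc).__name__}: {exc}", file=sys.stderr)
        return 1


# In a real script, the last line would be: sys.exit(run(sys.argv))
```

A missing argument, an unreadable file, or a full disk all end up in the same place: a halt, an error report, and a nonzero exit status.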
Wednesday, August 29, 2012
Separate docs for MySQL Connectors
The MySQL documentation section has always had this Topic Guides page containing links to the docs for the various MySQL Connectors -- the official database drivers for various languages and programming technologies. That is the most convenient way to get the information for each Connector in PDF form, rather than downloading the entire Ref Man PDF. For HTML, it was more of a shortcut, because most HTML links led back into the Ref Man, where the Connectors material is embedded all in one chapter. (For example, "22.3. MySQL Connector/J".)
Now we have started building each of the Connectors sections as its own HTML doc, linked from the Topic Guides page. That allows for faster turnaround for updates -- sometimes the Ref Man takes a few days to be rebuilt and posted. Also, it gives each Connector its own TOC, making the topics more visible than when they're pushed down several levels in the Ref Man. (For example, "22.3.7. Connection Pooling with Connector/J" becomes "Chapter 7. Connection Pooling with Connector/J".)
The language-themed Connectors are:
MySQL Connector/J for applications written in Java to the JDBC spec.
MySQL Connector/Net for applications written in C# and other .NET languages.
MySQL Connector/Python for applications written in Python. This one is new: an alternative to the MySQLdb module, currently in beta.
MySQL Connector/C++ for applications written in C++.
MySQL Connector/C for applications written in C.
MySQL Connector/ODBC for applications written to the ODBC spec.
Thursday, April 1, 2010
Wednesday, March 10, 2010
More on Single-Action UIs
In my post Deconstructing the iPod Shuffle UI, I talked a bit about the notion of a limited UI where you really only do one thing -- in that case, click the button on the headphones.
Every now and then, I rediscover the infrared remote that goes with my iMac, and realize that for many mundane tasks, the remote does everything I need. (Mira lets me assign different actions for the remote button for specific applications. Candelair works around a Snow Leopard issue with Mira.)
For example, with a new album in a photo library, you might want to Next-Next-Next through each photo, and if one is particularly bad, trash it; if one is particularly good, mark it with a high rank or a coloured label; otherwise, leave it as-is and proceed to the next one. The up/down/left/right/play buttons on the remote can accommodate those actions comfortably. All the photo software offers keyboard shortcuts for these actions, which can be assigned to the remote buttons. So this brain-intensive task can be accomplished while sitting back and taking in the view on a large screen.
One thing had been bothering me about the UI for the iPod Touch. If I'm listening to music with the screen turned off, say in the car or the office, and I'm interrupted (want to stop the music suddenly) or annoyed (want to immediately skip to the next song), normally this requires turning on the screen (Home button or power button), and unlocking the iPod by swiping before the music controls are available. I should have realized Apple would bring the "single-action UI" to the rescue. Instead of unlocking the whole unit, once it is powered on in a locked state, 2 more presses of the Home button bring up music controls that let you pause or skip without unlocking. It's almost iPod Shuffle-like. Now if only there were a way to assign star ratings without unlocking.
Many other day-to-day activities could benefit from this same kind of UI, in particular with the remote. You've got a lot of objects to consider or evaluate, and a limited number of actions for each one, including a default of "do nothing at all". Someone could play through a set of songs and give high / medium / low rating to each one, or trash it. A programmer could scroll through a list of bugs or functional specs, and for each one indicate "closed", "waiting for more information", "this does not apply to me", etc. That may be the way of the future, where carpal tunnel and hunched shoulders are just a distant memory.
Monday, February 15, 2010
Snow Leopard upgrade
I finally upgraded the main iMac to Snow Leopard. For the first time ever, an upgrade actually resulted in more free space, an extra 6 GB worth. The main features that I notice are fairly minor -- the ability to view stacks on the dock using a normal icon instead of the smashed-together icons of the apps in the folder; the ability to have the time announced every 15, 30, or 60 minutes by the computer voice.
I didn't notice any increase in speed, but maybe that's because Adblock in Safari went away until I installed a new version of that extension. Also I made the mistake of enabling Spotlight in hopes of getting "search entire message" working again in Mail.app. Although I think I've turned Spotlight off again, I'm always suspicious that it might still be active when I hear the drive whirring when I think the system should be idle.
My only real software incompatibility was with MediaWiki (the software that powers Wikipedia). It wouldn't run after the upgrade, because the newer PHP with Snow Leopard resulted in syntax errors in the MediaWiki PHP code. I upgraded MySQL because I was running an old PowerPC-based version via emulation. But I hadn't exported the wiki data, so I had to fall back to the older MySQL. When I updated the MediaWiki code to get past the PHP errors, I discovered that upgrading from such an old codebase meant running an upgrade script to change around the MySQL tables, and this script is also in PHP. And for some reason, the PHP script can't connect to the database. Since so many things needed to be updated at once, the cause could be related to PHP, or MySQL, or maybe that I had installed them by a different method before (the Marc Liyanage "entropy" packages). For example, after installing the latest MySQL through the official installer, mysqld now restarts when I kill it, which I don't think was the case previously.
Update 1
The Canon scanner software with my Pixma MX330 all-in-one printer refuses to start after the Snow Leopard upgrade. I can still scan using a dialog that opens up within Print & Fax preference pane for the printer, but only with barebones options. The recommended troubleshooting from Apple (reset the printing system, remove and re-add the printer) doesn't work. The latest software from Canon, supposed to be Snow Leopard-compatible, also doesn't help.
The problem with MediaWiki boiled down to a config setting for PHP that needed to be applied or reapplied after going to Snow Leopard, 3 instances in /etc/php.ini of the path to mysql.sock:
http://maestric.com/doc/mac/apache_php_mysql_snow_leopard
pdo_mysql.default_socket=/tmp/mysql.sock
mysql.default_socket = /tmp/mysql.sock
mysqli.default_socket = /tmp/mysql.sock
This particular item was difficult to track down. Most of the advice that I found re: Snow Leopard upgrades was to make sure this setting was in place:
mysql.default_port = 3306
Once the socket path was set up, it was a simple matter to run update.php and upgrade the MediaWiki schema, dump and re-import the data from the old PPC installation of MySQL to the native Intel version, and start the wiki going again. Thanks to Chris Jones of Oracle and Johannes Schluter, ah, also now of Oracle, for the help. It was getting a bit embarrassing, when I wanted to look up something that was on the wiki, having to read through the mysqldump file!
Tuesday, January 12, 2010
Ten Years Gone
I've been pretty quiet lately, because I'm in a transitional period. After 10 years on documentation for Oracle Database and other enterprise server products, I'm switching to the InnoDB group that already works with MySQL. New development environments, new customers, it's an exciting time!
A decade seems to be the right timeframe for me. It was 10 years at IBM before that. Check back in 2019, I'm sure there'll be something new then too.
Thursday, October 15, 2009
Deconstructing "Everything is UNIX"
From Linux magazine, an article by Jeremy Zawodny: Everything is UNIX.
For me, this is an example of the "Miller meme" from Repo Man. "Suppose you're thinkin' about a plate o' shrimp. Suddenly someone'll say, like, "plate," or "shrimp," or "plate o' shrimp" out of the blue, no explanation." You go through life thinking you'll find something better than UNIX. The man pages still have the same bad examples and incomplete option descriptions as in 1984. The window systems aren't up to snuff for desktop use. People are still finding performance bottlenecks due to system architecture, whichever architecture your favourite UNIX flavour uses.
And then you realize, there really is not much more. Everywhere you look, the same point is reinforced over and over. Windows Vista comes along, OS X is built on a UNIX base, and you get past your long-standing resentment about X11. You read the documentation for the latest and greatest web framework, and you think to yourself, this isn't any more usable than those thrown-together man pages. Regardless of the fact that no system incorporates every possible performance boost, the systems taking advantage of the latest and greatest hardware are often descendants of UNIX; and if you're looking for mainframe-style reliability, again you're probably tinkering around in the UNIX space.
I find it interesting that UC Berkeley interrupts their Scheme-based intro to programming languages with an interlude to consider UNIX shell scripting. If the problem can be solved by combining a set of well-understood and reliable steps, don't reinvent the wheel. That's a good lesson for people developing frameworks and class libraries.
In the sample Google interview questions floating around the web, one recurring theme I've noticed is how to solve a fairly large-scale problem involving string and file manipulation. To which the obvious answer, IMHO, is to use the basic UNIX commands up to and including simple Perl scripts. For something quick and general-purpose, you're not going to implement a better string-finding algorithm than grep or a better simple sort than sort. Or a better I/O pipeline than UNIX pipes, or better filesystem traversal than find, or a better use of a huge memory space than a simple Perl hash. If you're writing complicated custom code, you'd better be solving harder problems than those.
(Hat tip to Tim Bray for the article.)