Saturday, December 7, 2013

My history with Big Data

Before I joined Cloudera, I hadn't had much formal experience with Big Data. But I had crossed paths with one of its major use cases before, so I found it easy to pick up the mindset.

My previous big project involved a relational database hooked up to a web server. Naturally I wanted to be able to track visitor stats, detect denial-of-service attacks, and chart the most popular pages, search terms, referrers, etc. Since I had more control over the database side, I threw all the relevant fields from the HTTP header and query string into a database table, all divided nicely into columns. With SQL queries, problems were easy to detect, trends were easy to spot, questions were easy to answer. Life was good.
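
To make that concrete, here is a minimal sketch of the idea in Python, with an in-memory SQLite database standing in for whatever relational database you have handy. The table name `access_log`, its columns, the sample rows, and the helper functions `top_pages` and `busiest_clients` are all made up for illustration; they are not the schema or queries from the real system.

```python
import sqlite3

# Hypothetical logging table: one row per HTTP request, one column per field.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE access_log (
        ts          TEXT,
        ip          TEXT,
        path        TEXT,
        referrer    TEXT,
        search_term TEXT
    )
""")

# Invented sample traffic, just enough to make the queries interesting.
rows = [
    ("2013-12-01 10:00:00", "10.0.0.1", "/index.html", "http://example.com/", None),
    ("2013-12-01 10:00:01", "10.0.0.2", "/docs/intro.html", None, None),
    ("2013-12-01 10:00:02", "10.0.0.1", "/index.html", None, None),
    ("2013-12-01 10:00:03", "10.0.0.1", "/search", None, "impala"),
    ("2013-12-01 10:00:04", "10.0.0.1", "/index.html", None, None),
]
conn.executemany("INSERT INTO access_log VALUES (?, ?, ?, ?, ?)", rows)


def top_pages(conn, limit=3):
    """Most popular pages, busiest first (ties broken by path)."""
    return conn.execute(
        "SELECT path, COUNT(*) AS hits FROM access_log "
        "GROUP BY path ORDER BY hits DESC, path LIMIT ?",
        (limit,),
    ).fetchall()


def busiest_clients(conn, threshold):
    """Clients with at least `threshold` requests: a crude flood detector."""
    return conn.execute(
        "SELECT ip, COUNT(*) AS n FROM access_log "
        "GROUP BY ip HAVING COUNT(*) >= ? ORDER BY n DESC",
        (threshold,),
    ).fetchall()


if __name__ == "__main__":
    print(top_pages(conn, 2))
    print(busiest_clients(conn, 4))
```

Each question about the traffic boils down to a single `SELECT`, which is what made this arrangement so pleasant to work with.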

As the system grew, the logging table grew to a point where it hit a bottleneck due to defaults set up years before. A simple single-row INSERT statement into this logging table could stall for minutes at a time. There was a more scalable layout for the table (IIUC, this feature didn't exist when my application was first deployed), but the combination of constant traffic on the site plus relatively little spare space to reorganize things meant that the switchover had to be done in tiny increments, over a period of a couple of weeks.

I looked at the size of the table: a few tens of millions of rows, no more than 16 GB total. I would shuffle more data than that around my hard drive after one day of shooting photos on vacation. Why couldn't I have a simple append-only data store without the relational and transactional overhead, but still with the nice SQL query interface that was so convenient?

As time went by, the volume of log data came up again and again as a sticking point -- even though it was still minuscule by the standards of any computer hobbyist. The mechanism for looking at this info changed from querying a relational database to analyzing the raw Apache logs through Java code. Now there was an ETL process for getting each day's worth of data. Transferring the files, then uncompressing and analyzing them, took so long that it became a race to finish before it was time to start on the next day's data. There wasn't space to keep around all the raw data files, so only summary information was kept permanently. Doing any new kind of analysis now involved writing a custom Java program, and the analysis could only apply to a limited time period.

Given the tradeoffs, I preferred the original way of doing things, with a SQL interface to the data covering all time, even if it meant more overhead on the recording side and periodic pleas for more storage space. I know Java, Java is on my resume, but I didn't think it made sense to write elaborate Java programs with classes and loops for what was a perfect use case for SQL one-liners (or failing that, a scripting job for Python or Perl). What do you know, Hadoop is going through exactly the same maturation, with people concluding that Java-based MapReduce jobs aren't the most effective use of developer or computer time for queries that are in SQL's sweet spot.

This is why I am so keen on the dynamics of Hadoop and Big Data, especially as exemplified by Cloudera Impala. I know there is an army of people out there who know SQL, who have been held back all these years by having to plead for even modest amounts of storage space, and who have spent much of their time twiddling their thumbs waiting for some lengthy ETL process to finish. Give 'em some gigabytes, terabytes, or even petabytes of storage to work with; give 'em read access to the original data files without needless copying and conversion steps. And let the insights roll in.