Friday, July 6, 2007

Search Engine Optimization for Programming Languages

Why didn't language X take over the world? Why does language Y enjoy broad adoption despite clear shortcomings? While there are lots of factors, for example the degree of strictness vs. whimsy and how that matches up with the mindset of programmers, I am convinced that it largely comes down to how easy it is to search for sample code in a given language.

For example, once Java became popular, if you ran across a mention of some class or method name, you could always find lots of examples by searching for those names. Or if you wanted to see how to use a certain GUI method in combination with a certain container type, you could search for pages with both names. If any name was ambiguous, you could add some other keywords -- class, extends, etc. -- to make sure you really found a full code sample and not some general discussion.

So, Java's adoption was helped by a confluence of factors: a standard set of relatively verbose names, and a bunch of students posting their homework solutions online.

Perl, for me, didn't fulfill its promise, and I sometimes lament that it didn't take over the world like it could have. When your source code is full of idioms like $_ and <>, it's hard to search for code samples. (By search, I mean using Google et al., rather than using vi, emacs, or grep on your own source.) When I look at my Perl code, the words inside are mostly generic ones like for, if, print, and die. Searching for them, you would come up with lots of non-programming discussion, or source code from other languages. For example, what does "unset local $/" do? Google will find instances of the phrase "unset local", but not the full line of code.
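To make that concrete, here is a rough Python sketch of what a word-oriented index might keep from a line of code. The function name searchable_terms and the rule "keep only runs of letters, digits, and underscores that start with a letter" are my own simplifying assumptions, not a description of how Google or any real engine tokenizes text.

```python
import re

def searchable_terms(code_line):
    """Return the word-like tokens a simplified indexer would keep.

    Assumption: the indexer drops punctuation entirely and lowercases
    everything. Real search engines are more sophisticated than this.
    """
    return [t.lower() for t in re.findall(r"[A-Za-z][A-Za-z0-9_]*", code_line)]

# Illustrative lines, made up for this sketch.
perl_line = 'while (<>) { print $_ if /foo/; }'
java_line = 'JScrollPane pane = new JScrollPane(textArea);'

if __name__ == "__main__":
    print(searchable_terms(perl_line))
    print(searchable_terms(java_line))
```

Under those assumptions the Perl line boils down to generic words such as while, print, and if, with $_ and <> gone entirely, while the Java line still carries a distinctive name like JScrollPane that you can search on.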

So, Perl's adoption was hindered by factors related to searchability. The best Perl source on the web is posted as CGIs, where a search engine crawler can only spider the output, not the underlying source. It's hard to find sample code generally. And the way Perl modules are documented, when you search for popular module or method names, the top hits tend to be dozens of copies of the same Perldoc files. (And if the Perldoc were sufficient, you wouldn't be doing the search in the first place.)

I see SQL and PL/SQL as somewhere in between Java and Perl in terms of searchability. There are lots of unique names from PL/SQL packages and methods. Some of the words that might otherwise be too common are grouped into phrases like "connect by" and "order by". PL/SQL has some words like begin and end that aren't especially searchable by themselves, but if you combine them with others like declare, exception, etc. you can devise searches that home in on PL/SQL source code.
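The "combine several weak words" trick can be sketched in a few lines of Python. The marker set below is just the handful of words mentioned above, and the names PLSQL_MARKERS, plsql_marker_count, looks_like_plsql, and the threshold of 3 are all hypothetical choices for illustration, not a tested heuristic.

```python
import re

# Hypothetical marker set: words that are common on their own
# but rarely appear all together outside PL/SQL source.
PLSQL_MARKERS = {"declare", "begin", "exception", "end"}

def plsql_marker_count(text):
    """Count how many distinct marker words occur in the text."""
    words = set(re.findall(r"[a-z_]+", text.lower()))
    return len(PLSQL_MARKERS & words)

def looks_like_plsql(text, threshold=3):
    """Guess that text is PL/SQL if enough markers co-occur.

    The threshold is an arbitrary assumption for this sketch.
    """
    return plsql_marker_count(text) >= threshold

# Made-up samples for illustration.
sample_code = """DECLARE
  n NUMBER;
BEGIN
  n := 1;
EXCEPTION
  WHEN OTHERS THEN NULL;
END;"""
sample_prose = "In the end, we begin again."

if __name__ == "__main__":
    print(looks_like_plsql(sample_code))
    print(looks_like_plsql(sample_prose))
```

A single word like begin or end shows up in ordinary prose all the time; it's the co-occurrence that narrows things down, which is the same reason adding extra keywords to a search query works.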

Where SQL and PL/SQL get into trouble with searchability is in assigning specialized behavior to single, common words. PL/SQL has a function called TABLE(); good luck finding that by searching, given that TABLE is used in so many other contexts like CREATE TABLE or "Doing xyz to a Table". There also used to be a SQL function called THE(), now happily gone, but from time to time I still get asked, "How come I can't search for the THE function?"

When I get some free time, I'm going to look into the idea of auto-classifying source code via Bayesian analysis. Oracle Text has the CTX_CLS package, where you feed it training data consisting of already-categorized documents, and it deduces rules so that it can analyze unknown documents and match them up with those same categories. The Oracle doc library has tons of files with source code in a single known language, multiple languages in the same file, or source code from one language in a book that is primarily for a different language. I'd like to compare and contrast the CTX_CLS approach with a "homegrown" one similar to Peter Norvig's spelling corrector.
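As a starting point for the "homegrown" side of that comparison, here is a minimal naive Bayes sketch in Python. It is not how CTX_CLS works internally, and it is not taken from Norvig's code; the class name, the tokenizer, the add-one smoothing, and the three tiny training snippets are all assumptions made up for illustration. A real experiment would train on many documents per language.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    # Keep words, and also single punctuation characters, since sigils
    # like $ and ; can be strong hints about the language. Digits are dropped.
    return re.findall(r"[A-Za-z_]+|[^\sA-Za-z_0-9]", text.lower())

class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.token_counts = defaultdict(Counter)  # label -> token frequencies
        self.doc_counts = Counter()               # label -> number of documents
        self.vocab = set()

    def train(self, label, text):
        tokens = tokenize(text)
        self.token_counts[label].update(tokens)
        self.doc_counts[label] += 1
        self.vocab.update(tokens)

    def classify(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, float("-inf")
        for label in self.doc_counts:
            counts = self.token_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Work in log space to avoid underflow on long documents.
            score = math.log(self.doc_counts[label] / total_docs)
            for t in tokens:
                score += math.log((counts[t] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

def build_demo_classifier():
    # Toy training data, invented for this sketch.
    nb = NaiveBayes()
    nb.train("perl", 'while (<>) { chomp; print $_ if /foo/; } die "oops" unless $ok;')
    nb.train("plsql", "declare n number; begin select count(*) into n from emp; "
                      "exception when others then null; end;")
    nb.train("java", "public class Foo extends Bar { public static void "
                     'main(String[] args) { System.out.println("hi"); } }')
    return nb

if __name__ == "__main__":
    nb = build_demo_classifier()
    print(nb.classify("begin update emp set sal = 1; exception when others then null; end;"))
    print(nb.classify("print $_ while <>;"))
```

Treating punctuation as tokens is a deliberate choice here: the very symbols that make Perl hard to search for with a general-purpose engine are the ones that make it easy for a classifier to recognize.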


clofresh said...

I think one of the biggest hindrances of Oracle's searchability is that the official Oracle docs aren't indexed by Google. Instead you get these ghetto tutorial pages from the 80's.

John Russell said...

Aha, in fact the Oracle docs now are indexed by Google. What this means as far as searchability of PL/SQL-related subjects deserves its own blog posting, which I will put into the queue.