Wednesday, February 15, 2012

Too Dense?

With this post I want to ask a question: what does it mean for code to be "too dense"? The question has implications for everything from languages to APIs to coding style.

I've seen debaters defend Java's verbosity precisely because it isn't "too dense." They say the sparsity of the code makes it easy to understand what's going on. Similarly, it's common to bash programmers for playing "golf" when their code is dense. But if we're allergic to density, why do programmers seem to prefer tools that create dense code when there are fairly straightforward ways to write less dense code?

For an example I'm going to use regular expressions(1) since just about every programmer knows what they are, they're very dense, they exist in direct or library form for every general purpose programming language, and they are easy to replace with "normal" code.

Regexes are tight little strings that have very little in the way of redundancy. They're frequently accused of being "write only" - impossible to read and maintain once written. They are the poster children for "too dense" if anything is.

Given the modern-ish focus on refactoring and the understanding that code is read far more often than it is written, you'd think that if regexes were too dense, programmers would be eager to replace those dense strings with more standard code just to improve readability. After all, a regex encodes a simple state machine, or perhaps something a bit stronger if the common Perl-ish extensions are used, so replacing one is easy.
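To make that concrete, here is a minimal sketch in Python of what such a replacement might look like. The pattern (an optionally signed decimal number) and the names `DECIMAL_RE`, `matches_regex`, and `matches_state_machine` are made up for illustration; they don't come from any real code base. Both functions are meant to accept exactly the same strings.

```python
import re

# Dense form: an optionally signed decimal such as "42", "-3.14", or "+0.5".
# (Illustrative pattern; [0-9] is used instead of \d to stay ASCII-only.)
DECIMAL_RE = re.compile(r"[+-]?[0-9]+(\.[0-9]+)?")


def matches_regex(text: str) -> bool:
    """Return True if the whole string matches the regex."""
    return DECIMAL_RE.fullmatch(text) is not None


def matches_state_machine(text: str) -> bool:
    """Accept the same strings as DECIMAL_RE, using explicit states."""
    state = "start"
    for ch in text:
        is_digit = ch in "0123456789"
        if state == "start":
            if ch in "+-":
                state = "sign"
            elif is_digit:
                state = "int"
            else:
                return False
        elif state == "sign":
            # A sign must be followed by at least one digit.
            if is_digit:
                state = "int"
            else:
                return False
        elif state == "int":
            if is_digit:
                pass  # stay in the integer part
            elif ch == ".":
                state = "dot"
            else:
                return False
        elif state == "dot":
            # A decimal point must be followed by at least one digit.
            if is_digit:
                state = "frac"
            else:
                return False
        elif state == "frac":
            if not is_digit:
                return False
    # Only strings that end inside a run of digits are accepted.
    return state in ("int", "frac")
```

One line of pattern versus roughly thirty lines of branching. The expanded version isn't hard to write, but whether it's easier to read is exactly the question.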

Yet it doesn't happen, at least not much. Regexes remain a mainstay. New regexes are continually written and old ones aren't ripped out and rewritten as loops and if statements just to gain some more readability. They're expanded for performance reasons or when the logic needed exceeds the power of regexes, but they almost never get replaced with an explicit state machine just to improve maintainability.

Why is that? We can't blame a few bad programmers. Regexes are far too widely used for that simple cop-out.

What regexes and our use of them suggest is that we're not allergic to density of information per character but to something else. One culprit is simply unfamiliarity. Regexes are okay because we're familiar with them; other forms of density are bad because we're not familiar with them.

But maybe it's even stronger than that. Perhaps our familiarity with regexes makes us aware of a different kind of density/sparsity trade-off. A regex's information density may make it slower to read in terms of characters per minute, but we know that the expanded code would be slower to read in terms of concepts per minute.

In this post I picked on regular expressions because they're so widely known and used, but the bigger question is about the design of languages, APIs, and coding conventions. This article started with a question and will end with more. Are regular expressions outliers, unusual in creating value out of density? Is there some optimum relationship between frequency of use and density, where something becomes too dense if we don't use it often enough? If we create dense languages, APIs, or coding conventions, are we creating impenetrable barriers to entry for newbies? If we don't create dense notations, are we doing a disservice to those who will use the notation often? Is there any hope that a designer of a language, API, or coding convention can find a near-optimum density for his or her target audience that remains near optimal for a long time as usage patterns change?

Footnotes

  1. Insert "now you have two problems" joke here.