Thursday, May 27, 2010

Who Will Throw the Hammer This Time?

Steve Jobs 1984 Who will throw the hammer this time?

During the 1984 Super Bowl, Apple released an ad where gray, drab people are transfixed on a giant screen with a talking head who tells them how utopian their uniform world is. A female athlete jogs in and throws a hammer at the screen, the screen explodes, and a new voice announces the release of the Apple Macintosh.

The ad riffed off of the George Orwell dystopian novell Nineteen Eighty-Four. Orwell's 1984 features a powerful central state that controls society absolutely. One of its mechanisms for control lies in saying one thing while meaning the opposite. The Ministry of Peace is in charge of war, the Ministry of Truth is in charge of the falsification and censorship, etc.

In an exchange with Gawker's Ryan Tate, Steve Jobs attempts some verbal judo but comes out with some near doublespeak of his own. Tate says "Revolutions are about freedom" and Jobs replies "Yep...Freedom from porn."

Look, I'm not advocating for porn. It seems to be doing just fine without me. But I am irritated by Jobs deliberately muddling "freedom" as in "in a democracy you have the freedom to vote" with the opposite "in a dictatorship you have freedom from worrying who to vote for." Combined with licensing that squashes Flash, Scratch and RunRev and the Apple world starts looking gray and drab. Meanwhile, Jobs explains that all this "freedom from" stuff is really what we want in his magic little utopia.

So we come full circle and I ask who will throw the hammer this time?

Post Script

BoingBoing has a video with a premise similar to my picture.

Tuesday, May 25, 2010

Anatomy of an Annoyance

The warning "Type safety : A generic array of T is created for a varargs parameter" is so irritating it makes me want to beat Java with a snoracle, an implement that allows defunct tech companies to breath while underwater. This article will explain what the warning is, where it comes from, and why it is pretty boneheaded. The warning also shows just how hard it is to extend an existing language without creating pain for your users.

My sample use case: anytime you build a Map in Java it takes approximately 6 metric tons of boilerplate.

Map<Integer, String> numbers = 
   new HashMap<Integer, String>(3);
numbers.put(1, "one");
numbers.put(2, "two");
numbers.put(3, "three");
// etc

It would sure be nice if that could be done in one line of code as in most sane languages. So let's fix it! First, we need the ubiquitous Pair class[1].

public static class Pair<A, B> {
   private A first;
   private B second;

   public Pair(A first, B second) {
      this.first = first;
      this.second = second;
   }

   public A getFirst() {
      return first;
   }

   public B getSecond() {
      return second;
   }
  
   /* a real Pair class would also have equals and 
     hashCode, but for this code I don't need them */
}

And then, to make it more pleasant to use, we write a static function that can take advantage of Java's half-assed type inference.

public static <A, B> Pair<A, B> pair(A first, B second) {
   return new Pair<A, B>(first, second);
}

Finally, the pièce de résistance

public static <A, B> Map<A, B> map(
      Pair<A, B>... pairs) {
   Map<A, B> map = 
      new HashMap<A, B>(pairs.length);
   for(Pair<A, B> pair : pairs) {
      map.put(pair.getFirst(), pair.getSecond());
   }
   return map;
}

And so, with faces smiling as the sunoracle shines down upon us, we use it

Map<Integer, String> numbers = 
   map(pair(1, "one"), pair(2, "two"), pair(3, "three"));

Ah, sweet sweet beauty[2]. Except one nasty, tiny, shoot-me-now-and-bury-my-carcass-in-downtown-Santa-Clara annoyance. That single lovely line of code that so innocently tried to use all our wonderful machinery is tagged with the warning "Type safety : A generic array of T is created for a varargs parameter." And the only way to "fix" it is to nuke the whole area from orbit with '@SuppressWarnings("unchecked"),' possibly hiding real problems in the same method. Gah.

History, Part The First

To understand the problem we have to go way back in history, at least as far as Java 1.0. Maybe even to Oak or to the egg and sperm that fused to make James Gosling. Whatever. In designing Java there was a question: since String is a subtype of Object shouldn't String[] be a subtype of Object[]. After all sometimes that's useful: if I have an array of Strings and you want to write a function that iterates through an array of Objects, aren't we good if I send you my Strings?

Well, no, we're not good. We're very bad. If String[] is a subtype of Object[] then Bad Things Happen™


void badThing(Object[] objects) {
   objects[0] = new Integer(42);
}

void test() {
   String[] strings = {"a", "b", "c"};
   badThing(strings);

   // length of an integer?
   System.out.println(strings[0].length());
}

If the above were allowed then we'd be trying to find the length of an Integer and that, um, won't work. Reading is okay but writing is bad. So to solve the problem the original Java designers made a pact with Satan. Instead of statically preventing code like that they added a dynamic check: anytime you store into a non-primitive array Java does an instanceOf-like check to make sure it's legit. In my sample code, the assignment is checked in "objects[0] = new Integer(42)". Failure will give an ArrayStoreException. That means every array store has to do expensive extra work and, more importantly to this article, at time of construction an array must be associated with a dynamic type tag. Like all pacts with Satan the negative consequences of this deal weren't obvious until much later.

History, Part the Second

With Java 1.5, Java programmers were finally able to move to the forefront of 1970's era technology with parameteric polymorphism, er, excuse me, generics. But see, there was this huge collection library and 3rd party libraries that extended it and...how the hell are we going to make the future safe for the past?

The decision was to compile with "type erasure" - dynamic type tags no longer carry all the information in the static type. Now, there's nothing evil about type erasure. Some of my best friends are languages that compile with even more type erasure than Java. It's just that Java has "instanceof" and other similar things that don't play well with it. Relevant to this article we have the aforementioned pact with Satan where arrays need a dynamic type tag or Bad Things Happen™.

Concerned with Bad Things, the Java designers started spackling over some holes. The upshot: this code fails to compile with the static error "Cannot create a generic array of T"

public <T> T[] arrayOfT(T a1, T a2, T a3) {
   return new T[]{a1, a2, a3}; // bzzt static error!
}

Same deal with this, "Cannot create a generic array of Pair<A, B>".

public <A, B> Pair<A, B>[] twoPairs(
      A a1, B b1, A a2, B b2) {
   return new Pair<A, B>[]{pair(a1, b2), pair(a2, b2)};
}

But this is a very shallow spackling indeed. Keep reading.

History, Part the Third

Also with Java 1.5, the designers added varargs: the ability to pass an arbitray number of arguments of the same static type to a method. So convenient! But, uh, apparently the varargs guys and the generics guys didn't talk to each other until late in the game. See, the varargs folks decided to implement varargs as sugar over arrays rather than something that's actually type safe.

public void whatClassIsIt(Object... things) {
   System.out.println(things.getClass());
}

whatClassIsIt(pair(1, "one"), 
   pair(2, "two"), pair(3, "three"));

There's our old buddy erasure. The above code prints "class [LPair;" - array of Pairs. But note it's not dynamically known to be an array of Pair<Integer, String>.

Making Bad Things Happen™

Now we've got enough knowledge to screw things up nicely. We've got statically unsafe arrays that depend on runtime type tags to provide a reasonably early error. And we have an ability to create arrays that don't carry all the information needed to do the runtime check.

public static <A, B> Pair<A, B>[] pairs(
      Pair<A, B>... pairs) {
  return pairs;
}

Pair<Integer, String>[] pairs = 
  pairs(pair(1, "one"), pair(2, "two"), pair(3, "three"));
Object[] objects = pairs;
// no error in this assignment, statically or dynamically!
objects[0] = pair('c', 2.0); 

// arrgg ClassCastException
System.out.println(pairs[0].getSecond().length()); 

The code doesn't have an error at the point of assigment. The error does finally occur later when I attempt to use a value. That means that in real code the error could be very far from the real cause. The only diagnostic that prevents this Bad Thing™ from happening is the warning on the first line; our lovely "Type safety : A generic array of T is created for a varargs parameter".[3] And that warning has to occur in a place that doesn't know whether the use of the array is actually bad or not. Hence, our code's beauty (or at least what passes for beauty in Java) is shot.

Conclusion

Ah, if only. If only Java didn't have covariant arrays. Or, since it does, if only the designers had used something other than arrays to carry varargs. Woulda coulda shoulda. This article shows just how hard it can be to extend an existing language without shooting yourself and your users in the ass. Runtime checked array assignment + varargs in arrays + type erasure == pain.

Now you know why Scala's arrays are invariant.

Footnotes

  1. Q: What makes Java so manly? A: It forces every programmer to grow a Pair.
  2. Please don't rag me about how I wrote 17 lines of code to avoid writing 3 - presumably I'd reuse this infrastructure a lot.
  3. For an even nastier problem with the erasure scheme and arrays see this example of trying to write higher order code using generics and varargs.

Wednesday, May 5, 2010

Types à la Chart

I recently saw a debate on whether a certain language had a "weak" or "strong" type system. Naturally the debate ended in flames. As far as I could tell neither side knew what the other was talking about which isn't surprising as I've almost as many definitions of "weak" and "strong" typing as there are developers who use the words. Frankly the words have too many meanings to be useful.

If the first part of the problem is ambiguity then it's still only a start of the problems. Even if you get something crisp like "static" vs "dynamic" you've still managed to condense out a lot of useful information. Agda and Java live in incredibly different spaces even though they are both statically typed. Or how 'bout the world of difference between E and Ruby, two dynamically typed languages?

In this article I'm going to try to convince you that the space is far more complicated than the typical debates would allow. The design space is very highly dimensioned, perhaps infinitely so, but I'll boil it down to 2 important ones which none-the-less already show considerably more sophistication. Here's the set up: what can you know statically about an expression, like f(x), without dynamically executing/evaluating it and if all you know is static types? The two dimensions I use are 1) How "safe" is it - are there back channels that might violate abstractions? And 2) How rich and sophisticated can the static type be?

One big, fat caveat: this is almost certainly wrong in many details. I've listed a bunch of languages. There just aren't enough "21 days" in a lifetime to understand in detail exactly what each one can promise and how one type logic does or does not contain another. Naturally if you are an expert on one of these languages and I got something a bit wrong you will conclude I'm an idiot even if everything else is 100% right. Such is a the price of writing for a technical audience. In spite of any such errors I contend that even if some details are wrong the overall picture should be reasonably accurate and should convey my central thesis that this space is too complicated for binary distinctions.

Click to zoom

Chart of Static Type Properties

Power

The horizontal axis is "power" or "sophistication" of the static type system's underlying logic. For this analysis I focused only on the ability of the static type system to describe actual data structures and computations. While it's interesting that C++, GHC Haskell, and Scala all have Turing complete type systems, none of their type systems seem to have the same ability to usefully describe computation as Agda's. Of course, I could be wrong. Feel free to bump the asterisked languages to the right accordingly.

The horizontal axis can also be read backwards in terms of type system simplicity.

The far left requires some explanation. The languages in this column tend to be dynamically typed which at first seems like a challenge for thinking about them in static terms. But there's a nice theoretical framework where you think of them as having one static type. A valid Scheme expression has "the Scheme type" and an invalid one does not. Different people who mean to emphasize different things will call these languages "untyped," "unityped", or "dynamically typed." For my purpose "unityped" gives the right connotation. This column corresponds to the untyped lambda calculus.

Moving to the right there are simple first order type systems. Such type systems are a bit richer than the unityped ones, but don't have any form of type abstraction: they don't allow a user to define types that will be parameterized by other types. This column corresponds to the simply typed lambda calculus.

Further right are second order type systems. These allow type abstraction qualified with "forall", and correspond to System F. In between first and second order is HM (Hindley-Milner) which allows full type inference but at the price of not allowing some second order things like higher rank polymorphism. Moving to the right are higher kinded languages, corresponding to System Fn up through System Fω

At the far right are dependently typed languages: languages that allows types to be parameterized by values[1][2]

Safety

The vertical axis is various kinds of "safety." For safety analysis I'm completely ignoring foreign function interfaces and the like. FFIs are useful and pragmatic but also make otherwise pretty safe languages no more safe than Assembly.[3]

You can also read this axis going the opposite direction in terms of a kind of operational power - Assembly isn't safe, but it lets you make the machine dance.

At the bottom end are memory unsafe languages. An expression like f(x) might very well totally hose memory.

A bit up from that is memory safe languages. Such languages have a few built in abstractions (like you can't write past the end of an array) that f(x) can't possibly violate. A couple of steps up is abstraction safety. That means that one developer's abstractions can be protected from another developer in some fashion. For instance, private variables can only be accessed through functions defined in a scope with access to the variable. Many popular "dynamic languages" are not abstraction safe because their powerful dynamic meta-programming facilities allow abstractions to be smashed. In between I've listed a few JVM languages as being configurably safe: the Java standard includes some pretty powerful reflection capabilities that allow many developer written abstractions to be violated, but the JVM standard includes a security mechanism that allows that to be turned off. Move these points up and down accordingly.

At the next level are capability safe languages. In most languages, any function call can have all manner of harmful side effects including formatting your hard drive or launching missiles. But at the capability safe level the expression cannot possibly do such crazy things unless granted that power in the form of an "object" that represents the capability to do so. While you cannot statically know whether f(x) has received a missile launching capability, you can statically know that if f(x) lives in a program that has not been granted missile launching capability then it will not launch missiles.

The pure level is for languages with a strict, statically enforced wall between side effecting expressions and non-side effecting ones. Examining the static type is sufficient to know the difference. Finally, the "pure & total" level is for languages where the expression f(x) is either guaranteed to terminate normally (no infinite loops, no exceptions)[4] or where static types indicate possible non-totality.

Random Notes

In many languages safety properties are enforced dynamically so a knee jerk reaction is "why is this stuff in a discussion about static properties." But remember the point was what I can statically know about an arbitrary expression f(x) without evaluating it. In Ruby I know with absolute conviction that, barring implementation bugs and foreign function interfacing, any arbitrary f(x) absolutely does not write past the end of an array. In C I don't know that.

It might seem odd to put Java further to the right than Haskell '98 but it's true. Haskell'98 has a slightly extended Hindley-Milner type system while Java has a kind of warped System F<: (use nesting to create rank-n polymorphism). Of course, a chart that ranked type systems on "bang for the buck", versatility vs ease of use, would rank things quite differently.

C# gets stuffed into the unsafe row because of the "unsafe" keyword and associated pointer machinery. But if you want to think of these as a kind of inline foreign function mechanism feel free to bump C# skyward.

I've marked Haskell '98 as pure but GHC Haskell as not pure due to the unsafe* family of functions. If you disable unsafe* then jump GHC Haskell up accordingly.

Assembly is problematic. In a certain weird sense some Assemblies are "memory safe" because they do protect their language abstractions by virtue of having almost none.

Conclusion

"Weak" and "strong" mean so many things that they mean nothing. "Static" and "dynamic" mean something but don't really convey many differences. This chart is not the end-all-be-all of language categorization, but it is the beginning of the start of an inkling of a glimpse at how rich the space really is. Hopefully this will make static typing advocates understand that dynamic typing doesn't necessarily mean total insanity while dynamic typing advocates can begin to see the range of sophisticated expression afforded by many modern statically typed languages. I highly recommend "What To Know Before Debating Type Systems" for more on the subject.

Postscript

Inspired in part by this post and a follow up discussion, David MacIver is investigating the dimensions of language design that are actually useful in picking the right tool for the job.

Footnotes

  1. Type theory folk will note that I've pretty casually collapsed two legs of the "lambda cube" into one dimension of power. That's not very valid but that seems to work out in practice. It's hard to find a sophisticated dependently typed language (types indexed by values) that isn't also very rich when it comes to types indexed by types.
  2. C++'s ability to parameterize templates by a few forms of literal shouldn't be confused with dependent typing.
  3. Technically there's no such thing as "Assembly;" it's a family of vaguely related languages with semantics tied extremely closely to underlying machines.
  4. A total language can of course crash due to resource exhaustion or hardware failure. There's no way to prevent that.