
Monday, September 13, 2010

Moron Why C Is Not Assembly

In response to a previous article a poster in some forum called me an idiot and said "everybody knows that C is a portable assembly language with some syntax sugar." The idiot insult hurt deeply and I cried, but once the tears were dry I resolved to write a bit more on why C isn't assembly, or is at best an assembly for a strange, lobotomized machine. Also, I may have misspelled "more on" in this article's title.

To compare the two things I kinda need to define what I'm comparing. By C I mean an ANSI C standard, whichever version floats your boat. Most real C implementations will go some distance beyond the standard(s), of course, but I have to draw the line somewhere. Assembly is more problematic to define because there have been so many and some have been quite different. But I'll wave my hands a bit and mean standard assembly languages for mainstream processors. To level set: professionally my assembly experience is all x86; I had a bit of RISC stuff in college, plus I did some 6502 back in the day.

Assembly-Like Bits

There are some very assembly-like things about C, and it's worth mentioning them. The main one is that C gives you very, very good control over the layout of data structures in memory. In fact, on modern architectures, I'd bet that this control is the primary ingredient in the good showing that C (and C++) have in the language shootout. Other languages produce some damn fine executable code, but their semantics make it hard to avoid pointer indirection, poor locality of reference, or other performance losses.
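
For a taste of that control, here's a minimal sketch (the struct and the names are mine, purely illustrative). C promises the fields sit in declaration order; only the padding between them is up to the implementation.

#include <stddef.h>
#include <stdio.h>

/* a hypothetical wire-format header */
struct header {
   unsigned char tag;     /* always at offset 0 */
   unsigned int  length;  /* offset depends on padding */
};

int main(void) {
   printf("tag at %lu, length at %lu, %lu bytes total\n",
          (unsigned long)offsetof(struct header, tag),
          (unsigned long)offsetof(struct header, length),
          (unsigned long)sizeof(struct header));
   return 0;
}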

There are a few other assembly-ish things in C. But not many. Now to why C is just not assembly. I'll start with...

The Stack

As far as I'm concerned the number one thing that makes C not assembly is "The Stack." On entry into a function, C allocates the space needed for bookkeeping and local variables, and on function exit the memory gets freed. It's very convenient, but it's also very nearly totally opaque to the programmer. That's completely unlike any mainstream assembly where "the stack" (lower case) is just a pointer, probably held in a register. A typical assembly has no problem with you swapping the stack (e.g. to implement continuations, coroutines, whatever) while C defines no way to do much of anything with the stack. The most power that C gives you over the stack is setjmp/longjmp: setjmp captures a point on the stack and longjmp lets you unwind back to that point. setjmp/longjmp certainly have an assembly-like feel to them when compared to, say, exceptions, but they are incredibly underpowered when compared to the stack swizzling abilities of a typical assembly language.
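
To make that concrete, here's a minimal sketch (names mine) of just about all the stack control portable C will sell you. Note that longjmp can only unwind to a frame that's still live; jumping into a frame that has already returned is undefined, a restriction no real assembly would dream of imposing.

#include <setjmp.h>
#include <stdio.h>

static jmp_buf checkpoint;

static void deep_failure(void) {
   longjmp(checkpoint, 1);  /* unwind the stack back to the setjmp */
}

int main(void) {
   if (setjmp(checkpoint) == 0) {  /* returns 0 on the initial call */
      deep_failure();              /* never returns normally */
   } else {
      puts("unwound back to main");
   }
   return 0;
}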

Speaking of continuations, let's talk about...

Parallel and Concurrent Code

The C standard has almost nothing to say about parallel and concurrent code. About the only nod is the "volatile" keyword, which warns the compiler that a variable may be changed by something outside the program's control, so its reads and writes can't be optimized away. That's about it. Various libraries and compilers have something to say on the subject, but the language is pretty quiet. A real assembly, or at least a real assembly on any modern architecture, is going to have all kinds of stuff to say about parallelism and concurrency: compare and swap, memory fences, Single Instruction Multiple Data (SIMD) operations, etc.
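
To be fair to implementations, as opposed to the language: most compilers expose that hardware through extensions. Here's a sketch (the counter is just an example) using GCC's __sync builtins, which are compiler extensions, not ISO C - and that's exactly the point.

static int counter = 0;

void increment(void) {
   int oldv, newv;
   do {
      oldv = counter;
      newv = oldv + 1;
      /* atomically: store newv only if counter still holds oldv */
   } while (!__sync_bool_compare_and_swap(&counter, oldv, newv));
   __sync_synchronize();  /* a full memory fence, also not ISO C */
}

And that leads a bit into...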

Optimization

#include <stddef.h>

/* restrict promises the compiler that dest and src never overlap,
   which is what makes the vectorization described below legal */
void doubleArray(int * restrict dest, 
     int const * restrict src, size_t size) {
   for (size_t i = 0; i < size; i++) {
      dest[i] = src[i] * 2;
   }
}

Under gcc -std=c99 -S at -O1 I get pretty standard sequential code. Under -O3 I get highly parallel code using x86 SIMD operations. An assembler that changed my code that way would be thrown out as garbage but a C compiler that DIDN'T parallelize that code on hardware that could support it would seem underpowered. And given the highly sequential nature of the original code the transformation to parallel code goes way beyond mere de-sugaring.

Optimization leads me to ...

Undefined Behavior

The C standard says a bunch of stuff leads to undefined behavior. Real compilers take advantage of that to optimize away code, sometimes with surprising results.
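
Here's the classic example (a sketch; exactly what happens depends on your compiler and flags):

int will_increment_overflow(int x) {
   /* signed overflow is undefined, so the compiler may assume x + 1
      never wraps and quietly fold this whole test to "return 0" */
   return x + 1 < x;
}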

What assembly has undefined behavior on that scale? mov al, [3483h] (Intel syntax) is going to try to copy the byte at address 3483h into the al register, period. It might trap due to memory protection but the attempt will be made.

Conclusion

Alan Perlis famously said "a programming language is low level when its programs require attention to the irrelevant." What's relevant depends on the problem and the domain, so a corollary would be that a language that's at too high a level prevents paying attention to some of what is relevant for your needs.

C is a low level language as compared with most languages used for most day to day programming. If you spend all day hacking Ruby or Haskell then C might look like assembly. But C is actually at a substantially higher level than assembly. Assembly both permits and requires you to do things that C hides.

Please stop perpetuating the myth that C is sugar for assembly because real compilers make complicated, non-obvious transformations to get from C to machine code that assemblers don't. And please stop saying that C is a portable assembly because if it is then it's the assembly for a very peculiar machine that is both underspecified and a bit stupid when compared to real hardware.

Monday, April 12, 2010

C Is Not Assembly

In my last article I dug down into some of the joys of undefined pointer behavior in C. But I did it in the context of an architecture that most developers aren't too likely to ever see again, the Intel 8086. I wanted to show that this stuff matters even with more mainstream architectures because compilers are free to do a lot of non-obvious things. C is not assembly.

The United States Computer Emergency Readiness Team (US-CERT) "is charged with providing response support and defense against cyber attacks for the Federal Civil Executive Branch (.gov) and information sharing and collaboration with state and local government, industry and international partners."

With a U.S. Department of Homeland Security badge on their page you know they're serious. When you overflow your buffers the terrorists win. So you'd think they'd take C seriously.

I found a real WTF gem caused by programmers treating C like assembly: Vulnerability Note VU#162289.

"In the C language, given the following types:

        char *buf;
        int len;

some C compilers will assume that buf+len >= buf. As a result, code that performs wrapping checks similar to the following:

len = 1<<30;
[...]
if(buf+len < buf)  /* wrap check */
  [...overflow occurred...]

are optimized out by these compilers; no object code to perform the check will appear in the resulting executable program. In the case where the wrap test expression is optimized out, a subsequent manipulation of len could cause an overflow. As a result, applications that perform such checks may be vulnerable to buffer overflows."

The advisory is careful to admit that compilers are free to do just that. Here's why: greatly simplified, the C standard says a pointer must point at a valid object or just past the end. Any pointer arithmetic that might cause a pointer to step outside those bounds yields undefined behavior. Now, there's some code missing from their snippet, but I have to assume that the compiler knows that buf points at the beginning of something and not into the middle. So by definition either buf + len >= buf or the program is free to do anything up to and including launching shoulder mounted kitten missiles at Capitol Hill.

Still, there is a WTF to lay on the compiler writer here. If a programmer writes an "if" test then presumably he or she had some reason to believe that sometimes the test might be true. Before optimizing away the conditional the compiler really should have issued a warning.

In order for this code to have any hope of working a few assumptions must hold: sizeof(int) <= sizeof(char *), overflow must happen "silently", etc. But there's another major assumption here: the buffer pointed to by buf must be located at the end of its address space. With a check like this, if there are any objects located higher in the same address space then an overrun into them won't wrap, won't trip the check, and those objects are getting some kitten missiles. So another WTF is a developer making an assumption about how a compiler works in the face of undefined code.

Now, there are a few scenarios where all these assumptions might be justified. For instance, if you're targeting some special purpose embedded device then memory layout might be well understood. In such situations, the optimization performed by the compiler might be shocking indeed, even if technically permissible.

The problem is that the developer is thinking at the assembly code level but the C standard says the compiler doesn't have to "think" the same way. In assembly the distance between what you write and the object code generated is pretty small (barring some kind of complicated macro magic). In C the distance is quite a bit larger.

Repeat after me, C is not assembly.

The Solution?

In my mind the real WTF is the US-CERT's proposed solution.

"Cast objects of type char* to uintptr_t before comparison. The faulty wrapping check listed above would be written

#include <stdint.h>
[...]
if((uintptr_t)buf+len < (uintptr_t)buf)
   [...]

Alternatively, developers can use size_t on platforms that do not provide the uintptr_t type."

Yes, that's right, the proposed solution is to replace one set of undefined behavior with another. Well...WTF? Didn't we get here because the compiler was free to do anything in the face of undefined semantics?

C is not assembly. Not even when you do some casting. Pointers might be segmented in such a way that overflow behavior for pointers is totally different from that of integral types. Or sizeof(int) > sizeof(uintptr_t). Or the cast to an integral type could do anything at all like invert the bits. Or shoulder mounted kitten missiles. Whatever.

The Real Solution

The best solution is "don't do that." Give the buffer a size and compare len to that size. The assumption that the buffer is the last object in its address space is just playing with fire. But I'll assume they really did need to use memory at the end of the address space. Fine. What's the solution?

It's this: you can't say "end of address space" portably in C so tread carefully when you use undefined behavior. Minimize the amount of undefined behavior you depend on. Pull the non-portable stuff out into separate compilation units and document the assumptions. Maybe even write the non-portable stuff in assembly where you know what's going to happen, but at the very least write a unit test that ensures that the compiler behaves the way you think it does. And always, always remember that C is not assembly.

Thursday, April 8, 2010

Good Math, Bad Pointer Math

Heard about the new C bandwagon? According to April's Tiobe report C has eclipsed Java as the number 1 most popular language by a whopping 0.007%. So while all the rest of you Java suckers have just lost all professional relevance, I'm OK. I cut my professional teeth on C.

Some time ago on Twitter I gave in-jest advice to new C programmers: all your code and data are in one giant mutable array indexed by pointers, good luck. David MacIver (http://twitter.com/drmaciver) responded that the C standard didn't say that. Whereupon I swore a mighty blood oath of vengeance upon him for ruining my joke with facts.

Vengeance is hard so I'll get started later, maybe tomorrow or next week. Right now I'll write a bit about the facts behind pointers. One of the trickiest aspects of writing portable C programs is that a great deal of the specification says that the behavior of certain constructs is undefined. Old C hands sometimes point out some undefined behavior in a C newbie's program whereupon the newbie fires up gcc or msvc or whatever and shows that the code behaves exactly as it "should." Who's right? What does it mean to be undefined?

C was designed to 1) be portable across a wide range of platforms and 2) work at a very low level. Those two goals are pretty much in opposition - low level details can vary widely across platforms. Don't think MacIntel vs Wintel, think IBM System/360 vs DEC PDP-11 vs Motorola 68x vs x86. And so the C standard says that certain things are syntactically valid but the behavior can depend on compiler, options, hardware, and the day of the week. For example according to the standard dereferencing a null pointer is undefined. It may segfault; silently appear to succeed but actually return garbage; update your Facebook page with your masturbation frequency; or none of the above. It's all good.

Why should you care about portability if you aren't planning to port? Well, for one thing, undefined behavior can vary across versions of the same compiler or even different compiler options so planning not to port is not a realistic option. For another, even when you think you know what is going to happen on your target platform you might be wrong in some cases - some security critical cases, even. And finally, if you're good at hacking C there's a good chance you'll someday be called upon to write code for some device with a very different architecture. You might as well get into good habits now.

NULLity

While I was unable to find an example of Facebook updating I was able to find examples of two other distinct null pointer behaviors. The following program derefs a null:

#include <stdio.h>

int main() {
   puts(NULL);
}

On one common platform the program caused a segfault as you might expect. But on a second, mystery platform (to be revealed shortly) the program ran to "successful" completion with the output "║►╤". Yup, garbage, not crash.

Here's another one. The standard says that a null pointer can textually be written with a 0 in cases where the compiler can statically prove that you want a pointer, but it does not guarantee that a null pointer will in fact consist of a bit pattern of all zeros. So if you declare

const int n = 0;

Then the following will always be true

NULL == 0;
n == 0;

But the following might or might not be true

NULL == (void *)n;
(int)NULL == n;
(int)NULL == 0;

Unfortunately, it's a bit difficult to find a platform where the null pointer is not a pattern of 0 bits so this non-portable code is surprisingly portable, but see the comp.lang.c FAQ.
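
One practical consequence, straight out of that FAQ: zeroing memory is not the same as filling it with null pointers. A sketch:

#include <string.h>

char *p = 0;  /* a genuine null pointer, whatever its bit pattern */
char *q;

void zap(void) {
   /* all-zero bits: NOT guaranteed to read back as a null pointer */
   memset(&q, 0, sizeof q);
}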

Good Math, Bad Pointer Math

Here is an example of some code that does behave very, very differently.

#include <stdio.h>

int main() {
   const unsigned long base = 0xfffc;

   /* table header */
   printf("\t\tlong\t\tchar *\n");

   /* size of data types */
   printf("size\t\t%d\t\t%d\n", sizeof(base), sizeof((char *)base));

   /* the core table */
   for (unsigned long i = 0; i < 8; i++) {
      unsigned long n = base + i;
      char *p = (char *)base + i;
      printf("%08lx + %ld\t%08lx\t%08lx\n", base, i, n, p);
   }
}

C aficionados will be horrified that I'm using non-portable %08lx instead of portable %p to output a pointer, but using %p would reveal too much. On the first, common platform the result was

  long  char *
size  4  4
0000fffc + 0 0000fffc 0000fffc
0000fffc + 1 0000fffd 0000fffd
0000fffc + 2 0000fffe 0000fffe
0000fffc + 3 0000ffff 0000ffff
0000fffc + 4 00010000 00010000
0000fffc + 5 00010001 00010001
0000fffc + 6 00010002 00010002
0000fffc + 7 00010003 00010003

So both longs and pointers are 4 bytes (32 bits) and both long and pointer math are pretty ordinary.

Now, on the second, mystery platform with one set of compiler options I get this

  long  char *
size  4  4
0000fffc + 0 0000fffc 0000fffc
0000fffc + 1 0000fffd 0000fffd
0000fffc + 2 0000fffe 0000fffe
0000fffc + 3 0000ffff 0000ffff
0000fffc + 4 00010000 10000000
0000fffc + 5 00010001 10000001
0000fffc + 6 00010002 10000002
0000fffc + 7 00010003 10000003

Still 32 bits but pointer math is, um, strange. 3 nibbles (12 bits) have been skipped. It gets weirder still with a different set of compiler options on the mystery platform.

  long  char *
size  4  4
0000fffc + 0 0000fffc 0000fffc
0000fffc + 1 0000fffd 0000fffd
0000fffc + 2 0000fffe 0000fffe
0000fffc + 3 0000ffff 0000ffff
0000fffc + 4 00010000 00000000
0000fffc + 5 00010001 00000001
0000fffc + 6 00010002 00000002
0000fffc + 7 00010003 00000003

So now my 32 bit pointer is wrapping around at a mere 16 bits. And that's a clue to our mystery: we're kicking it old skool.

Mystery Revealed

For my first, common platform I'm using "gcc -std=c99" on 64 bit Ubuntu x86 but compiling using "-m32" to force 32 bit compilation. The second, mystery platform is the Watcom C compiler in C99 mode (wcc -za99) on the FreeDOS MS-DOS compatible operating system running in a VirtualBox virtual machine.

Using DOS keeps the machine in "real mode" - which basically means the chip (or virtual chip) behaves as an 8086[1]. Now, the 8086 is a peculiar beast. It's a 16 bit chip - registers are all 16 bits and most instructions work with 16 bit words - but it has a 20 bit memory address width. How do you squeeze 20 bits into a 16 bit word? Well, internally the 8086 uses a highly optimized form of data compression based on lzw that...

No, wait, that's not right. The answer is that the 8086 un-squeezes 20 bits into 32 bits - two 16 bit words instead of one. But here's where it gets weird. Instead of saying "we'll allow 32 bits of addressing but make it a hardware fault if anything above the low 20 bits is set" the 8086 designers said, well, this:

  XXXX0  the "segment" shifted left by 4 bits
+  XXXX  the "offset"
-------
  XXXXX  final 20 bit address

A segment register is shifted left 4 bits and then added to an offset register to get an actual address. So now it's clear why 0x0000fffc + 4 = 0x10000000 with one set of compiler flags: the segment has been incremented by 1 in the bits that count[2]. But why is the result 0x00000000 with a different set of compiler flags?

In order to get the 0x0000fffc + 4 = 0x10000000 behavior the machine code emitted by the compiler has to keep track of overflow - so every pointer add is not just a register add, but a register add plus an overflow check and perhaps a second register update. There may also be a move of two registers to/from memory instead of just one. That could be quite a performance hit on a lowly 4.77MHz machine (the original clock speed on the IBM PC).

So, as an optimization, most C compilers for 8086 real mode allow you to say that no structure is ever going to be larger than 2^16 bytes (64K) so that the emitted code doesn't always need to do all those gymnastics. The upshot: pointers silently overflow at 16 bits. 0x0000fffc + 4 = 0x00000000 and 0x1000fffc + 4 = 0x10000000.
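
To make the two behaviors concrete, here's an illustrative sketch (mine, not Watcom's actual code generation) of what pointer + int works out to in each model, treating a far pointer as a 16 bit segment and a 16 bit offset packed into 32 bits:

/* large model: 16 bit offset math only, the segment half is untouched */
unsigned long add_large(unsigned long ptr, unsigned int n) {
   return (ptr & 0xffff0000UL) | ((ptr + n) & 0xffffUL);
}

/* huge model: a carry out of the offset bumps the segment by 0x1000,
   since 0x10000 bytes is 0x1000 of the 16 byte segment "paragraphs" */
unsigned long add_huge(unsigned long ptr, unsigned int n) {
   unsigned long off = (ptr & 0xffffUL) + n;
   unsigned long seg = (ptr >> 16) + ((off >> 16) << 12);
   return (seg << 16) | (off & 0xffffUL);
}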

If you're playing along at home the first result was achieved using the -mh ("huge" memory model) compiler flag and the second one was with -ml ("large" memory model). Both were linked with "wlink name test file test.obj". For more about the gory details of real mode memory models see 16 bit memory models[3].

The Standard

Back to earth now. The C standard is fine with all of the above behavior. As a gross oversimplification the standard says that

  1. A pointer must be NULL or must point at something valid or must point just past the end of something valid. Pointer math that would result in a pointer that is outside of those bounds is undefined. It might work like any of my examples, or it might put a phone order in for pizza.
  2. A dereference of a NULL pointer, or of a pointer that does not point at something valid (even one pointing just one step past the end of something valid), is also undefined. You pray for a crash.

In short, the C standard doesn't pretend that your program is one giant array. It pretends each individual allocation creates an island unto itself (which may be part of a larger island or composed of smaller islands). The C standard is pretty quiet about what can happen between those islands. It's just that on modern platforms memory tends to behave like one giant flat mutable array so it's easy to assume that that's part of the C model.

One way to look at it is that the hypothetical C standard memory model treats pointers somewhat the way higher level languages treat references and array indexing, except that those higher level languages define bad behavior as either being statically impossible (no pointer math) or dynamically checked (null pointer exceptions and array index checks) where the C standard just says "undefined."

Obviously looping around some arbitrary pointer location is not going to be portable, but this stuff has less obvious consequences. Code like the following is not guaranteed to work

char *buf = malloc(BUF_SIZE);
char *end_of_buf = buf + BUF_SIZE;

...

/* check for buffer overrun */
if (buf + offset >= end_of_buf) {
  /* buffer overrun */
} else {
  /* everything is okay */
}

As the loop above showed, buf + offset might in fact be less than buf on an overrun. The correct answer is to compare offset with BUF_SIZE.
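
Here's the check rewritten to stay inside defined behavior (same hypothetical buf, offset, and BUF_SIZE as above; I'm assuming offset is a size_t, a signed offset would also need a < 0 check):

/* check for buffer overrun without pointer math that can wrap */
if (offset >= BUF_SIZE) {
  /* buffer overrun */
} else {
  /* everything is okay */
}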

In Defense of Undefined

When people explain C's choices it's common to say that it's about implementation efficiency. Indeed that's a huge part of it. But remember that C was specifically designed to allow very low level access; to allow much (though by no means all) of what assembly allows. So C's ability to write to arbitrary bits of memory allows useful things like talking to memory mapped devices. That kind of code is not portable, but there are important bits of some kinds of programs that aren't going to be portable no matter what you do. The trick is to keep those bits walled off and as small as possible.

So be careful with your pointers and I'll see all you former Java programmers in the unemployment line as I drive by in my new car with the custom plates "PTRMATH%saccount=34389983829 password=h3ll0k1ttySegmentation fault.

Footnotes

  1. Yes, I'm keeping a "virtual" machine in "real" mode. I'm just as confused as you are.
  2. Actually, one consequence of this funky scheme is that the same location in memory can be addressed multiple ways. 0x00120 + 0x0003 = 0x00123 and 0x00100 + 0x00023 = 0x00123. It's just that in the huge memory model the compiler's emitted code for pointer math keeps things canonical by using only the highest 4 bits of the segment.
  3. For even more weirdness associated with this segmentation stuff read a bit about the A20 Line, something that causes heartburn to bootloader writers to this day.

Friday, August 21, 2009

Getting to the Bottom of Nothing At All

Except when I'm being a total smart ass or self indulgent emo telling the Ruby community to f' off, most of what I write talks about practical stuff, often with a small dose of theory. But this time it's about theory with a small dose of practice.

Last time I talked about the way programmers in popular C derived languages C++, Java, and C# think of a void function as being one that returns nothing. I contrasted this with the functional view that such functions return something, they just return the unit value (), something that doesn't carry any information. To expand on this I want to talk about actually returning nothing.

In any Turing complete language it is possible to write a function that really truly never returns anything, not even a made-up singleton value like (). In C style languages the simplest such function might contain an infinite loop like

while(true);

For my purposes a better way to explore the idea is with infinite recursion. Pretend your favorite C style language does tail call optimization[1] so we can ignore stack overflow issues and examine the following

X foo(){ return foo(); }

The question is what should the X be? And the answer is that if you plug that line of code into C, C++, C#, or Java then X can be just about any type you want[2].

If that doesn't give you pause then consider this: your type system appears to be lying to you. The function is promising an X but has no hope of delivering. And it's quite happy to promise you any X. If you write "String foo(){ return foo(); }" then your type system considers the expression "foo()" to have type String so that you can write "String myFoo = foo();". But guess what, foo doesn't return a String and myFoo will never get a String. Is the type system lying?

The answer is no, of course not. It's just proving something a bit weaker than you might think at first. Instead of proving that "foo() will compute something that is a String" it's proving that "foo() won't compute something that isn't a String." If your "Law of the Excluded Middle" sense just tingled, then you're on the right page. You'd think there are only two cases to consider: either the function definitely computes a String or it definitely doesn't. But there's this third case where it doesn't compute anything, and since the type checker can't reliably detect non-termination (c.f. the Turing Halting Problem) it has to do something a bit odd. First, it assumes that the foo() declaration is correct and does return a String. Then it goes looking for a contradiction, such as returning an int. Inside the function the compiler sees that the return is the result of the expression foo(). But the compiler has already assumed that foo() returns a String. There's no contradiction between the declared type and the result so all is happy. The word "tautology" should come to mind right about now. By assuming X is true the type checker has proved that X is true. This weakening is a practical consequence of the Turing Halting Problem and there are at least two good ways to think about it. But first some more examples of the phenomenon.

Exceptions and Exits and ...

I very casually suggested that we ignore stack overflow issues in the infinite recursion above. But exceptions are an important part of this story because they are, in some interesting way, related to non-termination. Consider this function (or its equivalent in your favorite language)

X foo(){ throw new RuntimeException(); }

Once again, X can be any type. And once again foo() does not in fact compute an X.

Clearly these two definitions of foo are different things. Non-termination means we're stuck whereas an exception means we might be able to recover and try something else, or at least notify that there's a problem and shut down cleanly. Nonetheless, they both behave similarly. First, they hijack the normal flow of control in such a way that it never returns back to the caller. And second, they are both examples of a kind of weakening in the type system since any type can be substituted for X. Formally they are both examples of functions that diverge (functions that return normally are said to converge).

The Type

Here's a third example of a diverging function in Java. Translate as necessary for your language (in Java System.exit(n) stops the program immediately and returns n to the calling process).

X foo(){ System.exit(1); return null; }

Yet another case where foo() won't compute anything, it diverges. In fact, the return statement after the exit call is dead code. However, this example is slightly different from the other examples because we had to create the dead code to keep the compiler happy[3] and, in fact, for a different "X" like int the return value might have to be changed. That bit of dead code is closely related to the heart of this article.

Java can detect some kinds of dead code. If you write "X foo() { throw new RuntimeException(); return null; };" then Java recognizes that the return is unreachable and complains. In my System.exit(1) example Java didn't recognize that the call to exit() will never return so it required a following return statement. Obviously "throw" is a keyword and can get special attention that a mere library function can't but it would be useful to be able to let Java know that, like throw, exit() diverges.

One of the best ways to tell a compiler how a function behaves is by using types and, in type theory, there's a type that expresses just what we need. The type is called bottom (often written ⊥), and while there are different ways to look at the bottom type I'm going to go with a subtyping based view that should be accessible to C++, C#, and Java programmers.

If a language has subtyping and a function says it will return a type "X" then the function is allowed to return a "Y" instead as long as Y is a subtype of X. In my example of a function that just throws an exception the return type could be anything at all. So if we wanted System.exit(1) to indicate that it diverges the same way throw does, then its return type should be a subtype of all types. And indeed, that's exactly what bottom is.[4] bottom is a subtype of String, and int, and File, and List<Foo>, and every other type. Usefully, conventional type hierarchies are written with supertypes above subtypes, which makes a convenient mnemonic for "bottom" which needs to go below everything on such a hierarchy.

Now, if you're used to OO thinking then you expect a value with a certain subtype to in some sense be substitutable everywhere that a supertype is expected. But how can any one object behave like a String, an int, a File, etc? Remember that bottom indicates divergence: an expression with type bottom can never compute a value. So if exit()'s return type was bottom it would be totally safe to write "String foo() { return System.exit(1); }" while another bit of code could have "int bar() {return System.exit(1); }."

Making it Useful, A Divergence to Scala Divergence

Occasionally it might be useful to indicate that a function diverges. Examples are functions like System.exit(1), or functions that always throw an exception, perhaps after logging something or doing some useful calculation to create the exception. But interestingly out of all the statically typed languages that have any following outside of pure research only Scala has an explicit bottom type, which it calls Nothing. The reason Scala has a bottom type is tied to its ability to express variance in type parameters.

For some reason a lot of programmers run screaming into the night when you say "covariance" or "contravariance." It's silly. I won't get into all the details of variance, but I will say that in Scala the declaration "class List[+T] {...}" means that List[Subtype] is a subtype of List[Supertype]. No screaming necessary. And List[+T] brings me to one extremely practical use of bottom - what type should an empty List have?

Well, an empty List can have type List[String] or List[Foo] or List[int]. T can be whatever. And what's a subtype of whatever for all values of whatever? You guessed it: bottom (Nothing). Indeed, Scala has one constant called Nil whose type is List[Nothing]: a subtype of List[String], List[int], and List[whatever]. It all ties up in a bow when you consider that a List[T] has a method called head which returns the first element of the list as type T. But an empty list has no first value, it must throw an exception. And sure enough, head in List[Nothing] has type Nothing.

C# 4.0 is supposed to be getting definition site variance similar to Scala's but using the clever mnemonic keywords "in" and "out". I haven't heard anything yet on whether it will also add a bottom type, but it would make a lot of sense.

Java has usage site variance using wildcards. You can say "List<? extends Supertype> x" to indicate that x can have a List<Supertype> or a List<Subtype>. The bottom type would be useful in Java, too, although not as compelling since wildcards are so verbose people rarely use them even when they would make sense. Plus, Java folk tend to mutate everything and List[Nothing] in part makes sense in Scala because Scala Lists are immutable.

C++ does not have any simple way to express this kind of variance so the bottom type is even less compelling in C++ than it is in Java.

Back On Track

Haskell and languages in the ML family don't have an explicit bottom type. Their type systems don't have subtyping so adding bottom as a subtype would confuse things. Nonetheless, they do have a nice way to express bottom that can be teleported back to Java, C#, and C++ (but not C). Recall that bottom is a subtype of all types. Another way of saying that is that if a function returns type bottom then for all types A the function returns something compatible with A, so why not express that directly? In Haskell the "error" function takes a string and throws an exception.

Prelude> :type error
error :: [Char] -> a

In Haskell, a lower case identifier in type position is always a type parameter, and [Char] means "list of Char", aka String. So for all types a, if "error" doesn't diverge then it will take a String and return an a. That can pretty much be expressed directly in Java
public static <A> A error(String message) { throw new RuntimeException(message); }

or C++

#include <exception>
#include <string>

class message_exception : public std::exception {
  public:
    explicit message_exception(const std::string& message) : message_(message) {}
    virtual ~message_exception() throw() {}

    virtual const char * what() const throw() { return message_.c_str(); }

  private:
    const std::string message_;
};

template <typename A>
A error(const std::string& message) { throw message_exception(message); }

And for either language, usage would be something like

int divide(int x, int y) {
  if (y == 0) {
    return error<int>("divide by zero"); // drop the "<int>" in Java
  } else {
    return x / y;
  }
}

Haskell also has a function called "undefined" that simply throws an exception with the message "undefined." It's useful when you want to get started writing some code without fully specifying it.

Prelude> :type undefined
undefined :: a

The function isn't as interesting as the type, which promises that for any type a, "undefined" can compute an a or diverge. Since "undefined" can't possibly just produce a value of any arbitrary type, it has no choice but to diverge. The same idea can be added to Java

public static <A> A undefined() { return error("undefined"); }

or C++

template <typename A>
A undefined() { return error<A>("undefined"); }

In either language it might be used as

string IDontKnowHowToComputeThis(int input) { 
  return undefined<string>(); // again, make appropriate changes for Java
} 

Given the requirement to write the "return" keyword in C#, Java, and C++ I'm not sure how practical something like a generified error function really is as compared to having it return an exception and making the user write 'throw error("blah")', nor whether "undefined" is that much more useful than just throwing an UndefinedException, but this does illustrate the relationship between the bottom type and functions that, in Haskell terms, compute "forall a.a" or in C#/Java/C++ terms return the type parameter A without taking an A as an argument.

Also, as always, care should be taken when transliterating from one language to another. Java would allow a function like "undefined" to return a null instead of diverging. C++ would allow it to return anything at all and would only fail to compile if it were used in an incompatible way. That contrasts with languages like Haskell and ML in which the only way to implement "undefined :: a" is to make it diverge in some form or fashion[5].

The Bottom Value

I've spent some time talking about the bottom type as having no values. But it does have expressions like "undefined()" and that leads to a rather philosophical notion of how to think of bottom as having a value. Sorta. Skip this section if you don't like total gray beard philosophizing. If you're brave, then stick with me and imagine a subset of your favorite C derived language that does not allow side effects. No mutation, no IO, and no exceptions. In this strange world functions either just compute values or they diverge by not halting. In such a world the order of evaluation mostly isn't important as long as data dependencies are met - in f(a(), b()), you can compute a() before b() or b() before a(), whatever, they can't interfere with each other. Doesn't matter. The only time order of evaluation is forced on us is when there are data dependencies, so "a() + b()" must compute a() and b() (or b() and a()) before their sum can be computed. Nice.

Well, almost. Order of evaluation can matter for expressions that diverge. Let me give an example.

int infiniteLoop() {return infiniteLoop();}
int f(int x) {return 42;}
int result = f(infiniteLoop());

Because f ignores its argument, if we call "f(infiniteLoop());" there are two possible outcomes. If "infiniteLoop()" is eagerly evaluated before f is called then the program will diverge. On the other hand, if the expression "infiniteLoop()" is lazily remembered as being potentially needed later then f can successfully return 42 and the diverging expression can be forgotten just like it never happened (because it didn't).

We've gone to the pain of eliminating side effects from our language so it's a little irritating to have to keep thinking about evaluation order just to deal with divergence, so we perform a little mental trick; a little white lie. Imagine that functions like infiniteLoop above don't get stuck, they compute a value called ⊥, which is the only value of type bottom.

Now, since the bottom type is a subtype of all types and ⊥ is the bottom value, then it follows that every type must be extended with ⊥. Boolean = {⊥, true, false}, int = {⊥, 0, 1, -1, 2, -2, ...}, and unit = {⊥, ()}. That means we need some rules for ⊥ and how it interacts with everything else. In the vast majority of languages including Java, C#, and C++ but also impure functional languages like F# and OCaml doing just about anything with ⊥ can only compute ⊥. In other words, for all functions f, f(⊥) = ⊥. If you write f(infiniteLoop()) in those languages then the result is ⊥. This kind of rule is called "strictness".

In contrast, Haskell is often called a "lazy language," meaning that expressions aren't evaluated until they're needed. That's not quite technically correct. The Haskell spec just says that it is "non-strict." The spec doesn't care when expressions are evaluated so long as programs allow ⊥ to slide as far as possible. An expression like f(infiniteLoop()) must evaluate to 42. Haskell basically forces an expression involving ⊥ to evaluate to ⊥ only when the argument must be used[6]. The distinction between "lazy" and "non-strict" is subtle, but by being "non-strict" rather than "lazy" a Haskell compiler can use eager evaluation anytime it can prove that doing so doesn't change behavior in the face of ⊥. If a function always uses its first argument in a comparison, then Haskell is free to use eager evaluation on that argument. Since Haskell truly does forbid side effects (unlike our imagined neutered language above), the choice of evaluation strategy is up to the compiler and invisible except for performance consequences[7].

C++, Java, and C# have just a tiny bit of non-strictness. In these languages "true || ⊥" is true and "false && ⊥" is false. If these languages were totally strict then "true || ⊥" would be ⊥. Users of these languages call this behavior "short circuiting" and it's done for performance reasons rather than being a philosophical goal, but it's still a curious departure from their normal rules.
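
In C terms (a contrived sketch, names mine, with an infinite loop standing in for ⊥):

int bottom(void) { for (;;); }  /* diverges - never returns */

int demo(void) {
   /* || short circuits: bottom() is never called, so this is just 1,
      where a fully strict language would have to diverge here */
   return 1 || bottom();
}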

There you have it. The bottom value ⊥ is a clever mental hack to allow purely declarative functional code to be reasoned about without injecting sequencing into the logic. It allows people to talk about the difference between a purely declarative strict language and a purely declarative non-strict language without getting into details of evaluation order. But since we're talking about languages that aren't so purely declarative, we can take off our philosopher hats and return back to the world where side effects are unrestricted, bottom is a type with no values and divergence means that flow of control goes sideways.

Tuples and Aesthetics

Last time I talked about the unit type and how, if you interpret types as logical propositions, the unit type behaves as a "true" and tuple types act as logical and "∧." I also talked about an algebraic interpretation where the unit type acts like 1 and tuple types act like multiplication "×". So the type (A,B) can also be written as A∧B or A×B. The type (A,unit) is isomorphic to A, A×1 = A, and A∧True <=> A.

Bottom has similar interpretations as 0 or False. The type (A,bottom) is isomorphic to the bottom type because you can't compute any values of type (A,bottom). A×0 = 0, and A∧False <=> False. Nice how it all hangs together, eh?

Bottom behaves like False in another way. In logic if you assume that False is true you can prove anything. Similarly in type systems the bottom type allows the programmer to promise to do even the impossible. For instance, here's a Java function signature that promises to convert anything to anything.

public static <A,B> B convert(A arg) {...}

If you ignore dynamic casting and null (they're both weird in different ways) there's no meaningful way to implement that function except by diverging. More on this in an upcoming episode.

Conclusion

I somehow don't think anybody will be running into the office saying "I just read an article on the bottom type, so now I know how to solve our biggest engineering challenge." But the bottom type is still interesting. It's a kind of hole in your static type system that follows inevitably from the Turing Halting Problem[8]. It says that a function can't promise to compute a string, it can only promise to not compute something that isn't a string. It might compute nothing at all. And that in turn leads to the conclusion that in a Turing complete language static types don't classify values (as much as we pretend they do) they classify expressions.

Footnotes

  1. gcc does do tail call optimization under some circumstances.
  2. Oddly, in C, C#, and Java X can't be "void" because it's an error to return an expression of type void. C++ does allow void, to make writing templates a bit easier.
  3. Yes, I can make it a void function and not have a return. But again that would illustrate how exit() is different from a throw: throw doesn't require the return type to be void.
  4. Note I'm saying "subtype" here and not "subclass." int, List<String>, and List<File> are 3 different types. In C++, Java, and C# int doesn't have a class. In Java and C# List<String> and List<File> both come from the same class. Yet bottom must be a subtype of all 3 so it can't be a subclass.
  5. A plea to my readers: I'm too lazy to see if "forall a.a" is bottom in F# or C#, or if, like Java, null would be allowed. My bet is that null won't be allowed because only Java has the peculiar notion that type parameters must be bound to reference types.
  6. Very loosely speaking. Please see the Haskell Report for all the gory details about when Haskell implementations are allowed to diverge.
  7. Haskell has another clever hack where throwing an exception isn't an effect, but catching one is. Since order of evaluation is unspecified in Haskell, the same program with the same inputs could conceivably cause different exceptions. Exceptions are theoretically non-deterministic in Haskell.
  8. A language that isn't Turing complete can prove termination and does not need a bottom type, but such a language isn't powerful enough to have a program that can interpret the language.

Thursday, July 23, 2009

Void vs Unit

I suspect most people who come to one of the statically typed functional languages like SML, OCaml, Haskell, F#, or Scala will have only seen static typing in one of the popular C derived languages like C++, Java, or C#. Some differences in the type systems become clear fast. One particular spot is the relationship and difference between void and unit.

So imagine your favorite C derived language. Consider your language's equivalent of the following function definition [1]

void printHello(String name) { printf("hello %s", name); }

What does the word "void" mean? To any C, C++, Java, C# programmer it means that printHello doesn't return anything. Simple.

But a functional programmer[2] sees it quite differently. This article will explain the different point of view and why it's useful.

What's a Function?

It turns out to be very hard to write a single definition of "function" that covers all the ways the word has been used. But to a functional programmer at the very least a function is something that maps inputs to outputs. In the impure functional languages that function may also have an effect like printing "hello." I'll stick with the impure definition in this article.

Now, when a functional programmer talks about mapping inputs to outputs he or she doesn't mean taking a string as an input and outputting it on the console. Instead he or she is talking about arguments and returns of a function. If something is to have any pretense at being a function, or at least a function that returns normally, then it must return something.

printHello clearly does return normally so a functional programmer insists that it returns something. But a functional programmer also must recognize that it doesn't return anything "meaningful."

Thus rather than the concept "void," a functional programmer thinks of unit. In some languages the unit type is written as (), but for clarity I'm going to stick with "unit." Now, unit is a type, conceptually not much different from, say, "int". But unit, as its name somewhat implies, has exactly one value, conventionally represented as an empty pair of parentheses ().

In OO terms () is a "singleton". There's only one instance of it. It's also immutable. In a very real sense it carries no information. You can't write a program that behaves differently based on what a function of type unit returns - it always returns the same ().

A Performance Diversion

At this point a little guy in the back of your head might be asking "doesn't returning something meaningless just waste cycles?" Well, it's important to distinguish between abstraction and implementation. The unit value, (), is an abstraction that the implementation is free to optimize away. If you write in Scala

def printHello : Unit = {
  println("hello")
  ()
}

Then the compiler will generate the equivalent of the original void based code. If in Scala you write

print(printHello)

Then the compiler might generate something like

printHello
print( () )

Because () is a singleton it doesn't matter at the machine code/bytecode level that it's not "actually" returned. The compiler can figure it out with no performance hit at all.

Making It Useful

Okay, blah, blah, blah, functional think this, unit that. Whatever. What's the point? () is a value with no information, unit is a type with only that value...what's the practical use?

Well, it turns out that it's useful for writing generic[3] code. Java and C++, somewhat famously, do not have first class functions. But they can be faked. Here's the Java

interface Function<Arg, Return> {
   Return apply(Arg arg);
}
Function<String, Integer> stringLength = new Function<String, Integer>() {
  public Integer apply(String arg) {
    return arg.length();
  }
};

In C++ you can do it essentially the same way with a template and operator() overloading[4]

template <typename Arg, typename Result>
struct Function {
  virtual Result operator()(Arg arg) = 0;
};
struct StringLength : Function<string, int> {
  int operator()(string arg) {
     return arg.length();
  }
};
Function<string, int> *stringLength = new StringLength();

C# has a function type and lambdas so it's very easy to write.

Func<string, int> stringLength = s => s.Length;

So, if stringLength is an object of type Function<String, Integer> / Function<string, int> * / Func<string, int> then I can write the following

int result = stringLength.apply("hello"); // Java
int result = (*stringLength)("hello"); // C++
int result = stringLength("hello"); // C#

So here's the question: how would I convert the printHello function earlier into a Function<String, X> / Function<string, X> * / Func<string, X>? What type is X?

If you've been playing along so far you recognize that it "should be" void, but Java and C# don't allow it. It's invalid to write Function<String, void> in either language.

Let's pretend you could. After all, C++ will let you write a class PrintHello : Function<std::string, void>. But there's still a problem even in C++. Generic code might expect a result and none of these languages let you denote a value of type void. For instance (in Java) imagine this simple bit of reusable code.

public static <A,B> List<B> map(List<A> in, Function<A,B> f) {
  List<B> out = new ArrayList<B>(in.size());
  for (A a : in) {
     out.add(f.apply(a));
  }
  return out;
}

C++ has std::transform and C# has LINQ's Select, which are more or less the same idea.

Anyway, given something like the above, calling map/transform/Select with a list of strings and the printHello object should result in a List<void> of the same length as the original list. But what's in the list? The answer, in a functional language, is a bunch of references to (). Tada, the meaningless is useful for writing generic code.

C++ will allow a Function<std::string, void> but any attempt to call something like map with it will fail to compile. C++ doesn't have a () value so our generic code has become slightly less generic.

The Work Arounds

Java, C++, and C# programmers do solve this problem all the time. One common option is to not use Function<string, void> but some other concept like "Action<string>" to mean a function that takes a string, performs a side effect, and returns nothing useful. Then you essentially duplicate "map" into something called perhaps "foreach" that expects an Action<T> and returns void. That probably makes sense in this case since a list of references to a meaningless value is perhaps silly, but it also means a great deal of code duplication if this kind of higher order programming is common. For instance, you can't write just one compose function that creates a function from two other functions, you also have to write a compose that takes an action and a function to create an action.

Another answer is to make the printHello function object have type Function<String,Object>/Function<string,void*>/Func<string,object> and return a null. That's pretty disgusting.

Perhaps the best option is to create your own Unit enumeration with only a single value. That does leave Unit nullable in Java, but railing against nullability in Java is another blog post.

enum Unit { unit }; // C++, Java, or C#

As for good old C, well, the type system doesn't do parametric polymorphism so generic code in C is problematic no matter what you do. Still, with function pointers it's conceivable you might find a use for a unit value. Typically a C programmer would use 0.
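
For what it's worth, the enum trick transliterates to C just fine; here's a sketch (all names mine) with a function pointer standing in for the first class function:

#include <stdio.h>

typedef enum { unit } Unit;  /* one value, no information */

typedef Unit (*StringAction)(const char *);

Unit print_hello(const char *name) {
   printf("hello %s\n", name);
   return unit;  /* nothing meaningful, but definitely something */
}

int main(void) {
   StringAction f = print_hello;
   Unit u = f("world");  /* a result we can store, pass, and ignore */
   (void)u;
   return 0;
}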

A Quick Aside About Purity

For the most part I've casually used effects everywhere. Most functional languages like SML, OCaml, F#, and Scala are impure and let the programmer get away with that. Some functional languages, e.g. Haskell without unsafe* extensions, do not. They must track effects using the type system. But even with this caveat, the unit type is still useful to, for instance, distinguish a function that fetches a string from the keyboard, which might have type IO String, from one that prints to the console, which might have type String -> IO Unit.

Conclusion

Void and unit are essentially the same concept. They're used to indicate that a function is called only for its side effects. The difference is that unit actually has a value. That means that unit functions can be used generically wherever a generic first class function is required. The C derived languages miss out by treating void functions as functions that don't return anything instead of as functions that don't return anything meaningful.

Next time I'll talk about functions that really, truly return nothing.

Postscript on Aesthetics and Tuples

Mostly I've talked about the practical value of the unit type in this article, but I do love it when convenient practice and beautiful theory line up. So here's an attack from a different direction.

Many functional languages have tuples. A tuple is an ordered, fixed size collection of values of (possibly) different types[5]. Tuple types are usually written with parens. For instance, (Foo, Bar) is the type of pairs (2-tuples) with one Foo and one Bar. The same type is sometimes written as Foo × Bar, which makes sense too. You can see the type as a set that consists of a kind of Cartesian product of the Foo and Bar sets. For instance if Foo has only two possible values {f1, f2} and Bar has only two possible values {b1, b2} then the type Foo × Bar has 4 possible values: {(f1, b1), (f1, b2), (f2, b1), (f2, b2)}.

You can have tuples with higher numbers of members, like triples and 4-tuples and such.

What about a 1-tuple? Well, the type Foo can be seen as a 1-tuple. There would be no useful difference between the type Foo and the type (Foo).

The question to ask is, given that we have 3-tuples, 2-tuples, and 1-tuples does it make sense to have a 0 tuple? The answer, dear reader, is unit. The unit value is a tuple with no members. That's why Haskell calls both the type and the value "()" - it's like the type (A,B) without the A,B. And just as an empty list and empty set aren't quite "nothing" neither is the empty tuple.

And here's another interesting thing, type theorists will also call the unit type "1" (not to be confused with the value 1). To see why, look at the type (Foo, unit) which can also be written as Foo × 1. If Foo only has 2 possible values {f1, f2} then the entire set of possible values for the type Foo × 1 is {(f1, ()), (f2, ())}. It's the same 2 values extended with a little bit of "meaningless" information that can be stripped off or added back in as desired. Formally speaking it's isomorphic to the type Foo. To abuse notation slightly, Foo × 1 = Foo. So unit is not only the type that has only 1 value, it's the type that behaves like a 1 at the type level.

Footnotes

  1. In C of course it would be char *name, in C++ it might be char * or I might use the string and iostream libraries. Whatever. The differences aren't that relevant to this discussion.
  2. To save myself a lot of time I'm using phrases like "functional language" to mean the statically typed functional languages that have been influenced to a large extent by ML. Some of what I'm talking about does translate into some dynamically typed languages as well, although they often "pun" things so that there isn't a unit value that's distinct from nil, null, false or some other singleton value.
  3. I use the phrase "generic" here to mean what the functional community means by "parametrically polymorphic" or just "polymorphic." However, my target audience is more likely to associate polymorphism with OO dispatch, so "generic" it is.
  4. In C++ you can also do it without the pure virtual base class and just use template structural typing based on operator(), but shortly it'll be easier for me to describe a problem if I use a base class to give the concept a name. I could also allocate on the stack but that slightly confuses my point. I'm going this route to keep things symmetrical among the languages. Besides, this isn't a tutorial on why C++ isn't Java. On a related note it's a shame C++0x won't have concepts.
  5. This is quite different from a Python tuple which is really just a list.