Monday, September 13, 2010

Moron Why C Is Not Assembly

In response to a previous article a poster in some forum called me an idiot and said "everybody knows that C is a portable assembly language with some syntax sugar." The idiot insult hurt deeply and I cried, but once the tears were dry I resolved to write a bit more on why C isn't assembly, or is at best an assembly for a strange, lobotomized machine. Also, I may have misspelled "more on" in this article's title.

To compare the two things I kinda need to define what I'm comparing. By C I mean an ANSI C standard, whichever version floats your boat. Most real C implementations will go some distance beyond the standard(s), of course, but I have to draw the line somewhere. Assembly is more problematic to define because there have been so many and some have been quite different. But I'll wave my hands a bit and mean standard assembly languages for mainstream processors. To level set: my assembly experience is professionally all x86 and I had a bit of RISC stuff in college plus I did some 6502 back in the day.

Assembly-Like Bits

There are some very assembly like things about C and its worth mentioning them. The main one is that C gives you very, very good control over the layout of data structures in memory. In fact, on modern architectures, I'd bet that this fact is the primary ingredient in the good showing that C (and C++) have in the language shootout. Other languages produce some damn fine executable code, but their semantics make it hard to avoid pointer indirection, poor locality of reference, or other performance losses.

There are a few other assembly-ish thing in C. But not much. Now to why C is just not assembly. I'll start with...

The Stack

As far as I'm concerned the number one thing that makes C not assembly is "The Stack." On entry into a function C allocates space needed for bookkeeping and local variables and on function exit the memory gets freed. It's very convenient, but it's also very nearly totally opaque to the programmer. That's completely unlike any mainstream assembly where "the stack" (lower case) is just a pointer, probably held in a register. A typical assembly has no problem with you swapping the stack (e.g. to implement continuations, coroutines, whatever) while C defines no way to do much of anything with the stack. The most power that C gives you on the stack is setjmp/longjump - setjmp captures a point on the stack and longjmp lets you unwind back to that point. setjmp/longjmp certainly have an assembly-like feel to them when compared to, say, exceptions, but they are incredibly underpowered when compared to the stack swizzling abilities of a typical assembly language.

Speaking of continuations, let's talk about...

Parallel and Concurrent Code

The C standard has almost nothing to say about parallel and concurrent code except that the "volatile" keyword means that a variable might be changed concurrently by something else so don't optimize. That's about it. Various libraries and compilers have something to say on the subject, but the language is pretty quiet. A real assembly, or at least a real assembly on any modern architecture, is going to have all kinds of stuff to say about parallelism and concurrency: compare and swap, memory fences, Single Instruction Multiple Data (SIMD) operations, etc. And that leads a bit into...

Optimization

void doubleArray(int * restrict dest, 
     int const * restrict src, size_t size) {
   for (size_t i = 0; i < size; i++) {
      dest[i] = src[i] * 2;
   }
}

Under gcc -std=c99 -S at -O1 I get pretty standard sequential code. Under -O3 I get highly parallel code using x86 SIMD operations. An assembler that changed my code that way would be thrown out as garbage but a C compiler that DIDN'T parallelize that code on hardware that could support it would seem underpowered. And given the highly sequential nature of the original code the transformation to parallel code goes way beyond mere de-sugaring.

Optimization leads me to ...

Undefined Behavior

The C standard says a bunch of stuff leads to undefined behavior. Real compilers take advantage of that to optimize away code, sometimes with surprising results.

What assembly has undefined behavior on that scale? mov al, [3483h] ;(Intel syntax) is going to try to copy the byte at 3483h into the al register, period. It might trap due to memory protection but the attempt will be made.

Conclusion

Alan Perlis famously said "a programming language is low level when its programs require attention to the irrelevant." What's relevant depends on the problem and the domain, so a corollary would be that a language that's at too high a level prevents paying attention to some of what is relevant for your needs.

C is a low level language as compared with most languages used for most day to day programming. If you spend all day hacking Ruby or Haskell then C might look like assembly. But C is actually at a substantially higher level than assembly. Assembly both permits and requires you to do things that C hides.

Please stop perpetuating the myth that C is sugar for assembly because real compilers make complicated, non-obvious transformations to get from C to machine code that assemblers don't. And please stop saying that C is a portable assembly because if it is then it's the assembly for a very peculiar machine that is both underspecified and a bit stupid when compared to real hardware.