Monday, September 13, 2010

Moron Why C Is Not Assembly

In response to a previous article a poster in some forum called me an idiot and said "everybody knows that C is a portable assembly language with some syntax sugar." The idiot insult hurt deeply and I cried, but once the tears were dry I resolved to write a bit more on why C isn't assembly, or is at best an assembly for a strange, lobotomized machine. Also, I may have misspelled "more on" in this article's title.

To compare the two things I kinda need to define what I'm comparing. By C I mean an ANSI C standard, whichever version floats your boat. Most real C implementations will go some distance beyond the standard(s), of course, but I have to draw the line somewhere. Assembly is more problematic to define because there have been so many and some have been quite different. But I'll wave my hands a bit and mean standard assembly languages for mainstream processors. To level set: my assembly experience is professionally all x86 and I had a bit of RISC stuff in college plus I did some 6502 back in the day.

Assembly-Like Bits

There are some very assembly-like things about C and it's worth mentioning them. The main one is that C gives you very, very good control over the layout of data structures in memory. In fact, on modern architectures, I'd bet that this fact is the primary ingredient in the good showing that C (and C++) have in the language shootout. Other languages produce some damn fine executable code, but their semantics make it hard to avoid pointer indirection, poor locality of reference, or other performance losses.
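To make the layout point concrete, here is a minimal sketch. The struct particle record and the sum_ids function are hypothetical names invented for illustration. The arrays live inline in the struct, so an array of these records is one contiguous block with no pointers to chase. The standard promises declaration order and that the first member sits at offset 0; any padding between members is up to the implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record, not from the article. Members are laid out in
   declaration order; the compiler may add padding but never reorders. */
struct particle {
    float    position[3];  /* stored inline, no pointer indirection */
    float    velocity[3];
    uint32_t id;
};

/* Sum the ids of n particles stored contiguously in a single array. */
uint32_t sum_ids(const struct particle *p, size_t n) {
    uint32_t total = 0;
    for (size_t i = 0; i < n; i++) {
        total += p[i].id;
    }
    return total;
}
```

Walking that array touches memory strictly front to back, which is the kind of locality that is hard to guarantee in languages where every object is a separately allocated reference.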

There are a few other assembly-ish things in C. But not many. Now to why C is just not assembly. I'll start with...

The Stack

As far as I'm concerned the number one thing that makes C not assembly is "The Stack." On entry into a function C allocates space needed for bookkeeping and local variables and on function exit the memory gets freed. It's very convenient, but it's also very nearly totally opaque to the programmer. That's completely unlike any mainstream assembly where "the stack" (lower case) is just a pointer, probably held in a register. A typical assembly has no problem with you swapping the stack (e.g. to implement continuations, coroutines, whatever) while C defines no way to do much of anything with the stack. The most power that C gives you on the stack is setjmp/longjmp: setjmp captures a point on the stack and longjmp lets you unwind back to that point. setjmp/longjmp certainly have an assembly-like feel to them when compared to, say, exceptions, but they are incredibly underpowered when compared to the stack swizzling abilities of a typical assembly language.
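For anyone who hasn't used them, here is a minimal sketch of what setjmp/longjmp can do. The names recovery_point, deep_work, and run_with_unwind are made up for illustration. The jump goes only one way, up the stack to a frame that is still live; there is no way to switch to a different stack or to resume deep_work afterwards.

```c
#include <setjmp.h>

static jmp_buf recovery_point;  /* hypothetical name for the saved context */

/* Recurse a few frames down, then unwind all of them in one jump. */
static void deep_work(int depth) {
    if (depth == 0) {
        longjmp(recovery_point, 1);  /* does not return */
    }
    deep_work(depth - 1);
}

/* Returns 1 if control came back through longjmp, 0 otherwise. */
int run_with_unwind(void) {
    if (setjmp(recovery_point) == 0) {
        deep_work(3);   /* never returns normally */
        return 0;
    }
    return 1;           /* reached only via longjmp */
}
```

Calling run_with_unwind() returns 1: the three deep_work frames are simply abandoned. Compare that with assembly, where you could point the stack register at a completely different block of memory and carry on.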

Speaking of continuations, let's talk about...

Parallel and Concurrent Code

The C standard has almost nothing to say about parallel and concurrent code except that the "volatile" keyword means that a variable might be changed concurrently by something else so don't optimize. That's about it. Various libraries and compilers have something to say on the subject, but the language is pretty quiet. A real assembly, or at least a real assembly on any modern architecture, is going to have all kinds of stuff to say about parallelism and concurrency: compare and swap, memory fences, Single Instruction Multiple Data (SIMD) operations, etc. And that leads a bit into...
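As a rough sketch of how little that is, about the only portable pattern is a volatile flag set from a signal handler. The names got_signal, on_signal, and wait_for_flag are invented for illustration, and raise() is used so the handler runs within the same program instead of waiting on an outside event. Nothing here involves threads, atomics, or fences; the standard I'm talking about offers none.

```c
#include <signal.h>

/* volatile sig_atomic_t is the one type the standard lets a signal
   handler write and ordinary code read. Hypothetical names throughout. */
static volatile sig_atomic_t got_signal = 0;

static void on_signal(int signo) {
    (void)signo;
    got_signal = 1;
}

/* Install the handler, trigger it, and wait until the flag is seen. */
int wait_for_flag(void) {
    signal(SIGINT, on_signal);
    raise(SIGINT);            /* handler runs before raise returns */
    while (!got_signal) {
        /* without volatile, the compiler could hoist this load
           out of the loop and never re-read the flag */
    }
    return got_signal;
}
```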

Optimization

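#include <stddef.h>  /* size_t */

/* Write twice each element of src into dest; restrict promises the
   compiler that the two arrays do not overlap. */
void doubleArray(int * restrict dest,
     int const * restrict src, size_t size) {
   for (size_t i = 0; i < size; i++) {
      dest[i] = src[i] * 2;
   }
}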

Under gcc -std=c99 -S at -O1 I get pretty standard sequential code. Under -O3 I get highly parallel code using x86 SIMD operations. An assembler that changed my code that way would be thrown out as garbage but a C compiler that DIDN'T parallelize that code on hardware that could support it would seem underpowered. And given the highly sequential nature of the original code the transformation to parallel code goes way beyond mere de-sugaring.

Optimization leads me to ...

Undefined Behavior

The C standard says a bunch of stuff leads to undefined behavior. Real compilers take advantage of that to optimize away code, sometimes with surprising results.
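A small sketch of the kind of thing I mean, using made-up function names. Signed integer overflow is undefined, so an optimizer is allowed to assume x + 1 never wraps and may reduce the first function to "return true". Whether a given compiler actually does that depends on its version and flags. The second function asks the same question without ever overflowing.

```c
#include <limits.h>
#include <stdbool.h>

/* Looks like an overflow check, but x + 1 overflowing is undefined
   behavior, so a compiler may legally fold this to 'return true'. */
bool next_is_bigger_ub(int x) {
    return x + 1 > x;
}

/* Well-defined for every int: compares against the limit instead. */
bool next_is_bigger_safe(int x) {
    return x < INT_MAX;
}
```

On a machine, the add instruction just does whatever the hardware does on overflow. In C, the "check" in the first version can vanish entirely.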

What assembly has undefined behavior on that scale? mov al, [3483h] ;(Intel syntax) is going to try to copy the byte at 3483h into the al register, period. It might trap due to memory protection but the attempt will be made.

Conclusion

Alan Perlis famously said "a programming language is low level when its programs require attention to the irrelevant." What's relevant depends on the problem and the domain, so a corollary would be that a language that's at too high a level prevents paying attention to some of what is relevant for your needs.

C is a low level language as compared with most languages used for most day to day programming. If you spend all day hacking Ruby or Haskell then C might look like assembly. But C is actually at a substantially higher level than assembly. Assembly both permits and requires you to do things that C hides.

Please stop perpetuating the myth that C is sugar for assembly because real compilers make complicated, non-obvious transformations to get from C to machine code that assemblers don't. And please stop saying that C is a portable assembly because if it is then it's the assembly for a very peculiar machine that is both underspecified and a bit stupid when compared to real hardware.

50 comments:

FredZarguna said...

Actually, the oldest FORTRAN compilers were, in many ways, lower-level than 360 Assembler. We used to code all the character routines in Assembler because FORTRAN was so awful at string manipulation and formatting. For small relocatable code, we even wrote some inline stuff in machine language and punched it as INTEGERs into COMMON blocks, which we then called as if they were subroutines.

Am I dating myself here? (I did that in High School and didn't care for it...)

Anyway,
what about:

void foo()
{
asm mov ax,[3483H]
}

/* just kidding...
anybody who says C is assembler hasn't coded any assembler, and probably hasn't coded any C.
*/

Brutallyfrank said...

people slam C out of ignorance and because they can't program using pointers and don't understand memory allocation/deallocation.

Patrick said...

"The main one is that C gives you very, very good control over the layout of data structures in memory."

Not really. It merely gives you "good" control. Try making a struct containing a 29-bit big-endian integer, followed by eleven 37-bit fields with no padding.

mlvl.jr said...

The "i = ++i - i++;" horror is plain UB! (just to remind those who forgot)

Karl Bielefeldt said...

Interesting defense, but I think people who say C is portable assembly usually intend it more as hyperbole rather than a statement of fact.

Keith Thompson said...

The reason C is not assembly is this:  Assembly code specifies machine instructions.  C code specifies behavior.

Ted Williams said...

Generally speaking, each IBM (mainframe) COBOL statement compiles to one or two IBM 360 assembly language instructions. After all, the main COBOL directives are almost the same as the assembly mnemonics. The COBOL language came first (1960 vs 1963), so I suspect this is not an accident.

Alex832 said...

As a 6502 and x86 assembly programmer, I'm surprised at your "undefined behavior" comment.  Both chips had variants produced by multiple companies which ended up with different opcodes in certain places.

You could most definitely write 6502 assembly code whose behavior was undefined, in the sense that it ran differently on different 6502's. The only thing that makes C special is that they admitted it.  An honest technical specification for the 6502 would indeed list undocumented opcodes as undefined behavior.

Mike McNally said...

No, that is absolutely not true.  A "PERFORM" statement, for example, has no direct equivalent in 360 assembly language.  I've written both COBOL and 360 assembler and I can honestly say that no similarity between the two ever crossed my mind.

Francsico Reverbel said...

Forgot to say it: great post (as usual)!

James Iry said...

Classic mistake of putting code in the blog without testing first.

Paul said...

The Main definition of an "Assembly" language is a One-TO-One correspondence with the CPU machine language it is designed for.

FORTRAN is a high level language (FORmula TRANslation) and is nowhere near the level of the machine. C, C++, C#, BASIC, COBOL are all high level languages. They technically don't have that "One-To-One" correspondence to the machine code of the hardware they run on.

There are of course exceptions. COBOL on the Burroughs B3200-B4900 medium scale mainframes of the 60's through the 80's DID have a one to one correspondence. There were machine language instructions like "MOVE A TO B EDITED WITH C" where A, B and C were references to memory locations containing the input record (A), the output record (B) and the PICTURE formatting (C). The CPU practically "thought" in COBOL. There was generally one machine language instruction generated by the compiler for each COBOL Procedure Division statement. It even went so far as to address memory in BASE 10! COBOL on those machines ran like a bat out of hell. Very fast.

OlegOAndreev said...

So you missed the I in "FORTRAN I"?

bob pasker said...

if you had really "cut your teeth" on VAX/VMS, you would have used the MOVC instruction, not a byte-by-byte copy.

secondly, you would know that the "high-level assembler" for VMS was BLISS-32, not C.

Francisco Reverbel said...

A minor glitch:

mov ax, 3483h ;(Intel syntax)

copies the hexadecimal value 3483 (decimal 13443) to the 16-bit register ax. To copy the byte at the memory offset 3483h to a register you would write something like

mov al, [3483h]

Note the usage of an 8-bit register (al instead of ax) and the addressing mode (direct rather than immediate).

JG said...

Actually C is only resembling assembly if it is translating to a p-machine. That's probably where the phrase came from. In the early days (1970s, 1980s), p-machine hardware was more common.

If you translated C to assembly without optimization (e.g. on an HPUX machine running on a 9000 series 300, which was originally designed to run Pascal p-system and HP BASIC (which compiles source to run p-code - those "PROG" files)), you'd see the assembly had a 1:1 correspondence to the C statements for the most part. Leaving the p-system world has changed that.

John Fitzpatrick said...

C is certainly not "assembly language with syntactic sugar". Any programmer who knows the history of programming languages knows that FORTRAN I is assembly language (and not particularly portable) with syntactic sugar.

Peter Hamilton said...

I second the statement concerning fortran and assembly.

Breach said...

Very nice site, and post. It's great to see that people still manage to discuss c at all, outside of IRC ;). I implore you to keep posting about C/ASM and low level tidbits, or I shall eat your liver.

Thanks.

~breach.

Hugh S. Myers said...

The assembler it most resembles is that of the PDP Dec machines obviously before your time. The morons are those who think that the original statement was literal. It does map quite well though hence the suggestion...

barrkel said...

C is relatively easy to compile to straightforward and predictable assembly, but I agree, that doesn't give it all the characteristics of an assembly language.

Another thing I'd add is that C doesn't model the von Neumann architecture very well. It lets you create data structures just fine, but the capability to create code, dynamically, at runtime, surely the defining characteristic of a shared instruction-data architecture, is all but completely absent.

PeterL said...

C feels like PDP-11 assembler (e.g., ++i and i++ come from PDP-11 assembler). But "C is portable assembler" isn't a literally true statement -- it means that C is low-level enough to write an operating system in, when combined with a bit of real assembler for the non-portable parts.

None said...

Hey, great post.
But your layout does something shitty. If the text is too big in my browser, then your code example gets cut off on the right margin.
I spent 2 or 3 minutes looking for your closing paren on your example function.

Joe said...

I'd argue that undefined behaviour in C is analogous to the illegal opcodes offered by the 6502. The 6502 has 151 legal opcodes. It also has 105 illegal opcodes - 40% of the available commands - which could be used by a skilled hacker to perform all sorts of optimisation tricks.

Paolo said...

Joe, that's not true. Illegal opcodes had perfectly predictable outcomes. Undefined behavior cannot be relied upon.

Bernt said...

Good article, but it is a bit of a straw man that there is more hardware control in assembler -- anyone ever argue different? :-)

An Assembler with good macro facilities (or a simple precompiler) could do much of C's local variable stuff.

A macro for subroutine entry would reserve stack space and define offset constants, and a macro for returning would release the extra space. It could be neat to use if the processor architecture has sane addressing modes.

Disclaimer: I haven't touched even C for quite a few years. :-) And the only time I read about the x86, I was too disgusted to continue after a couple of hours.

Joe said...

I'm not sure. If you write a program that takes advantage of undefined behaviour of C and compile it a dozen times with the *same* compiler, will you get the same behaviour every time?

Now imagine you run your favourite C64 game on an emulator that doesn't support the illegal opcodes. Do you get the same behaviour?

Running a program on a 6502-compatible CPU that doesn't implement the illegal opcodes (which were only ever a "feature" of the manufacturing process, not designed into the chip) gives you the same effect as compiling a C program with a compatible compiler that doesn't implement undefined behaviours in the same way.

Dbrenn2 said...

Somebody actually thought C was Assembly???? (facepalm)

It compiles into code so tight that it gives most of the advantages of Assembly with all of the advantages of the High Level Language it is.

Joe said...

"Now imagine you run your favourite C64 game on an emulator that doesn't support the illegal opcodes. Do you get the same behaviour?"

To clarify, do you get the same behaviour as you did when you ran it on your old C64?

fxj said...

maybe FORTRAN V was. when did you check the language definition of FORTRAN the last time? It has everything a high level programming language needs. (classes, namespaces, pointers and concurrent programming). C looks like stone age wrt to FORTRAN2008.

Cchiesa1 said...

I cut my C teeth on VAX/VMS, where the (unoptimized) machine instruction stream corresponded exactly to the C statements.

On the other hand, C was extremely inefficient at certain operations that were vital to the software I was developing at the time. We wanted to move millions of single bytes from one buffer to another, and would ordinarily code a loop such as

char src[BIGNUMBER], dest[BIGNUMBER];
for (int i = 0; i < BIGNUMBER; i++)
    dest[i] = src[i];

In assembly this was about five instructions, specifically using the MOVB (move byte) instruction to do the data copy. The C version, compiled, even when optimized, produced many thousands of instructions, because C decrees that atomic datatypes must be promoted to the largest possible integer datatype before being manipulated (in this case assigned) and then truncated back... and THAT incorporated all kinds of special cases for handling signed versus unsigned quantities, etc. etc. etc. I always preferred assembly for precisely the reasons cited in this article: you had control over what was going to happen.

Steve Maurer said...

The reason why people equate C to assembly is not because they think C is an inadequate substitute for assembly. It's because they're trying to insult both.

Programming in assembly isn't something anyone should do, except at the lowest level. It takes a huge amount of effort to produce a small amount of code, which is locked to the specific hardware of a single machine. Trying to pretend that C is assembly is a way of saying that anyone who programs in C is similarly non-productive.

This isn't true, obviously. But I guess it's a way for people who aren't bright enough to understand pointers to pretend to themselves that they're actually smarter for not even trying.

Guest said...

Your logic would seem to disconnect C from C++ because of their differences, while ignoring their similarities.

All those things you list are twists (like undefined behavior), limitations, and additions, but they do not change the fact that C is in reality a convenient interface to assembly. And it is most evident from the keyword "asm" which gives you the ability to embed assembly directly in the program. And C can do all those things you mention it can't, *precisely* because of the fact that it can embed assembly code; like mess with the stack.

For the most part, most C constructs have a 1-to-1 relationship with assembly code and structures, as seen through a compiler. Yes, C hides stuff from the programmer; just as assembly hides stuff from the programmer (like different machine representations of the same instruction, endianness, bit representation, etc), but it is still analogous to machine code.

Your take on who's the "idiot" is misplaced. Write a simple compiler and see how "different" they really are; I assure you it will be fairly easy.

Rif said...

What do you mean with "++i and i++ come from PDP-11 assembler"? Certainly it did not use that syntax. PDP-11 like many other CPUs (Z80?) have array/buffer operations that could load/store registers using an array index register that could be pre-decremented or post-incremented.

Matt said...

I am always amused when someone calls C "assembly language". I work with C in embedded x86 systems (not a Linux or Windows derived code base) and often have the debugger show me the C source along with the generated assembly. I can say that C != ASM from first hand experience. One of the more difficult things our systems do is an operation called a thunk, or a change of processor operating modes. Because of the nature of this operation, the code to handle the transition, including setting back up things like The Stack, must all be written in assembly; C simply does not have the capability to do this natively.

None said...

I came across this through Reddit, and I'm almost certain this is a joke. C absolutely is assembly trundled up in syntactic sugar. The stack, optimization, etc. is the nasty medicine hidden by the sugar. That is its entire mission in life: to make assembly a bit tastier.

Neil Harding said...

I wrote a 68k to C converter to port Sonic 3D from the Sega Genesis to the Saturn (and PC as well). There are a lot of tricks you can do in assembly language that are hard to do in C. The main thing is the condition flags, since these are set by the processor after almost every instruction (not true on ARM, where it is optional, and not on operations on address registers in 68K). Checking for 0 is easy enough in C, but there is no easy way of checking for integer overflow (V flag in 68K). Also assembly language allows you to do things like addq #4,sp (which pops the return address off the stack), so you can return to the subroutine that called the method, a() { .. b(); codeWeDontWantToCall(); } b() { addq #4,sp; } would return to the method that called a and would skip codeWeDontWantToCall().

In addition you can mix data and methods much easier, I used to use data structures to control animation of sprites with code like dc.w MOVE_XY + CALL,-1,-1
dc.l methodToCall, with conditional flags and I could put part of the logic into the animation, which lead to easier code and smaller code as well.

Also using assembly language meant I could write a sound playback using an interrupt at 8khz, which used the user stack pointer as a temporary variable, which saved 24 cycles compared to memory variable, which was about 2% saving of the processor speed.

Assembly language was also more portable than C code (well, for porting Atari ST to Commodore Amiga it was, since they were the same processor, but totally different operation systems).

Neil Harding

PeterL said...

I'm pretty sure that Z80 didn't have {pre,post}-{increment,decrement}. Anyway, 8008/8080/Z80 were later than PDP-11.
C of course uses different syntax from PDP-11 assembler, but "*x++ = *y++" can be done in 1 instruction on PDP-11.

Ben said...

I would have thought that the definition of assembly is not how close it resembles the machine instruction set, but that you are defining the instruction set on a word to byte basis.
Anything that requires a compiler seems to be a higher level language (even if it is still low).
That's a non programmer perspective - from an older electronics engineer.

FredZarguna said...

/* but what about */
void foo()
{
asm mov ax,[3483H];
}

mowgli said...

The one and only reason:
"C" is not assembly because it is not based on machine language instructions (Opcodes).

nothing more, nothing less :)

Charles Sovereign said...

Also, that is there INLINE.

Alexander Voronin said...

C code may lead to unexpected results only if you use implicit types and write unclear code. Use "signed" and "unsigned" types, do not use "int" and do not write horrors like "i = ++i - i++;" and everything will be ok. And that "array" copy example must be replaced with a single memmove/memcpy stdlib call, which is definitely based on the assembly "rep movs[b,w,d]" instruction. C code could be VERY predictable, stable and clear, otherwise why would people write OS kernels in C? It's just a human factor.

Andre J Burke said...

When you get right down to it, an assembler and a compiler simply work differently, with different functions. A compiler builds a parse-tree and tries to figure out what the programmer is trying to get done. It then produces assembly code which can then be presented to an assembler for conversion to machine code. Assembler to machine code is a much simpler and more direct mechanism because it's basically a choice of which of the following machine code instructions is going to work best for the current assembler instruction. The journey from C to assembler is way more complex and can take many different twists and turns, especially when optimization is involved. C works via a compiler. It is an irrelevant comparison from the start.

Rughug said...

How could it, on any processor? A C program is compiled to run in a given environment, usually an operating system. Changing processor modes may change this environment to a point where a program cannot function. The OS itself may be able to do that, but it would probably require resetting some parts of the OS as well.

Changing modes from a running program can be like sawing off the branch on which you sit, so support for this functionality appears a bit pointless.

jefito said...

Well, duh. If you need to be literal about it, then C is not assembly language, by definition.

That being said, if you want to open up your brain a little bit, C can be viewed as a portable assembly language for the simple reason that it has fairly simple data and execution models which are at a pretty low-level on most CPUs, and because it's nearly ubiquitous. Most modern C compilers do a good job of optimizing for their targets, so C is reasonably efficient (and if you feel a need to have an argument about hand-rolled assembly vs. compiler-generated, head on over to the I Missed the Point Cafe and yak away. Don't save me a seat, though -- I won't be joining you).

The metaphorical point is that the 'C-machine' is a perfectly reasonable target for a higher level language's compiler. This is explicitly the reason that the first C++ implementations were actually C-language generators, which were in turn compiled by local C compilers. Stroustrup himself uses the term 'portable assembler' to describe C (you can read more about this in his "Design and Evolution of C++").

Sorry, but most of this article and a good deal of the responses missed this point.

bob pasker said...

Just because both C and the PDP-11 both have autoincrement, that doesn't mean that one was derived from the other (both sharks and dolphins have fins).

To make a logical argument, you actually have to show that K+R modeled C's autoincrement operator after the PDP-11s autoincrement addressing mode.

in fact, Dennis Ritchie wrote exactly the opposite:

"People often guess that they were created to use the auto-increment and auto-decrement address modes provided by the DEC PDP-11 on which C and Unix first became popular. This is historically impossible, since there was no PDP-11 when B was developed." [1]

[1] http://plan9.bell-labs.com/who/dmr/chist.html

Mrforty said...

One year later, just a quote that might be useful for anyone reading this :
from Linus Torvalds [http://torvalds-family.blogspot.com/2009/08/programming.html]
"Some people seem to think that C is a real programming language, but
they are sadly mistaken. It really is about writing almost-portable
assembly language"

Dimamakhnin said...

I successfully managed to translate C into assembly by flattening the HELL out of a working C function, explicitly dedicating registers (possibly redundant in C), and replacing all while and for loops, and so on, with if and jump statements... Managed to translate a square root algorithm and bubblesort :)

Valentin Tihomirov said...

Portable assembler is an oxymoron. A portable assembler is not an assembler, since an assembler is specific to a processor. Assembler is a representation of its instructions in human readable form, yet there is a one-to-one correspondence between machine instructions and assembler mnemonics. It is impossible to have a portable assembler. Porting means a translation (aka compilation), which is the alternative to assembling (1-to-1 mapping). Yet, you can consider an assembler as a special case of a compiler, the compiler that does the simplest form of translation, 1-to-1 mapping. So, a compiler ("C language") is a generalization of an assembler. This permits saying "asm IS-A compiler" but not "compiler IS asm". A Cat is an Animal, but an Animal is not a Cat.