Thursday, April 8, 2010

Good Math, Bad Pointer Math

Heard about the new C bandwagon? According to April's Tiobe report C has eclipsed Java as the number 1 most popular language by a whopping 0.007%. So while all the rest of you Java suckers have just lost all professional relevance, I'm OK. I cut my professional teeth on C.

Some time ago on Twitter I gave in-jest advice to new C programmers: all your code and data are in one giant mutable array indexed by pointers, good luck. David MacIver (http://twitter.com/drmaciver) responded that the C standard didn't say that. Whereupon I swore a mighty blood oath of vengeance upon him for ruining my joke with facts.

Vengeance is hard so I'll get started later, maybe tomorrow or next week. Right now I'll write a bit about the facts behind pointers. One of the trickiest aspects of writing portable C programs is that a great deal of the specification says that the behavior of certain constructs is undefined. Old C hands sometimes point out some undefined behavior in a C newbie's program whereupon the newbie fires up gcc or msvc or whatever and shows that the code behaves exactly as it "should." Who's right? What does it mean to be undefined?

C was designed to be both 1) portable across a wide range of platforms and 2) work at a very low level. Those two goals are pretty much in opposition - low level details can vary widely across platforms. Don't think MacIntel vs Wintel, think IBM System/360 vs DEC PDP-11 vs Motorola 68x vs x86. And so the C standard says that certain things are syntactically valid but the behavior can depend on compiler, options, hardware, and the day of the week. For example according to the standard dereferencing a null pointer is undefined. It may segfault; silently appear to succeed but actually return garbage; update your Facebook page with your masturbation frequency; or none of the above. It's all good.

Why should you care about portability if you aren't planning to port? Well, for one thing, undefined behavior can vary across versions of the same compiler or even different compiler options so planning not to port is not a realistic option. For another, even when you think you know what is going to happen on your target platform you might be wrong in some cases - some security critical cases, even. And finally, if you're good at hacking C there's a good chance you'll someday be called upon to write code for some device with a very different architecture. You might as well get into good habits now.

NULLity

While I was unable to find an example of Facebook updating I was able to find examples of two other distinct null pointer behaviors. The following program derefs a null:

#include <stdio.h>

int main() {
   puts(NULL);
}

On one common platform the program caused a segfault as you might expect. But on a second, mystery platform (to be revealed shortly) the program ran to "successful" completion with the output "║►╤". Yup, garbage, not crash.

Here's another one. The standard says that a null pointer can be written textually as a 0 in contexts where the compiler can statically tell that a pointer is meant, but it does not guarantee that a null pointer will in fact be represented by a bit pattern of all zeros. So if you declare

const int n = 0;

Then the following will always be true

NULL == 0;
n == 0;

But the following might or might not be true

NULL == (void *)n;
(int)NULL == n;
(int)NULL == 0;

Unfortunately, it's a bit difficult to find a platform where the null pointer is not a pattern of 0 bits so this non-portable code is surprisingly portable, but see the comp.lang.c FAQ.
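
If you want to check your own machine, here's a little program that inspects the null pointer's actual bytes (the memcmp trick is mine, not something the standard asks you to do; on nearly every platform you'll find all zero bits, which is exactly why the non-portable code above gets away with it):

#include <stdio.h>
#include <string.h>

int main() {
   char *p = NULL;                /* a genuine null pointer */
   unsigned char zeros[sizeof p];
   memset(zeros, 0, sizeof zeros);

   /* Compare the bytes of the null pointer against all-zero bytes. The
      standard allows them to differ even though on most machines they
      won't. */
   if (memcmp(&p, zeros, sizeof p) == 0)
      puts("null pointer is all zero bits here");
   else
      puts("null pointer is not all zero bits here");
}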

Good Math, Bad Pointer Math

Here is an example of some code that does behave very, very differently.

#include <stdio.h>

int main() {
   const unsigned long base = 0xfffc;

   /* table header */
   printf("\t\tlong\t\tchar *\n");

   /* size of data types */
   printf("size\t\t%d\t\t%d\n", sizeof(base), sizeof((char *)base));

   /* the core table */
   for (unsigned long i = 0; i < 8; i++) {
      unsigned long n = base + i;
      char *p = (char *)base + i;
      printf("%08lx + %ld\t%08lx\t%08lx\n", base, i, n, p);
   }
}

C aficionados will be horrified that I'm using non-portable %08lx instead of portable %p to output a pointer, but using %p would reveal too much. On the first, common platform the result was

  long  char *
size  4  4
0000fffc + 0 0000fffc 0000fffc
0000fffc + 1 0000fffd 0000fffd
0000fffc + 2 0000fffe 0000fffe
0000fffc + 3 0000ffff 0000ffff
0000fffc + 4 00010000 00010000
0000fffc + 5 00010001 00010001
0000fffc + 6 00010002 00010002
0000fffc + 7 00010003 00010003

So both longs and pointers are 4 bytes (32 bits) and both long and pointer math are pretty ordinary.

Now, on the second, mystery platform with one set of compiler options I get this

  long  char *
size  4  4
0000fffc + 0 0000fffc 0000fffc
0000fffc + 1 0000fffd 0000fffd
0000fffc + 2 0000fffe 0000fffe
0000fffc + 3 0000ffff 0000ffff
0000fffc + 4 00010000 10000000
0000fffc + 5 00010001 10000001
0000fffc + 6 00010002 10000002
0000fffc + 7 00010003 10000003

Still 32 bits but pointer math is, um, strange. 3 nibbles (12 bits) have been skipped. It gets perhaps weirder with a different set of compiler options on the mystery platform.

  long  char *
size  4  4
0000fffc + 0 0000fffc 0000fffc
0000fffc + 1 0000fffd 0000fffd
0000fffc + 2 0000fffe 0000fffe
0000fffc + 3 0000ffff 0000ffff
0000fffc + 4 00010000 00000000
0000fffc + 5 00010001 00000001
0000fffc + 6 00010002 00000002
0000fffc + 7 00010003 00000003

So now my 32 bit pointer is wrapping around at a mere 16 bits. And that's a clue to our mystery: we're kicking it old skool.

Mystery Revealed

For my first, common platform I'm using "gcc -std=c99" on 64 bit Ubuntu x86 but compiling using "-m32" to force 32 bit compilation. The second, mystery platform is the Watcom C compiler in C99 mode (wcc -za99) on the FreeDOS MS-DOS compatible operating system running in a VirtualBox virtual machine.

Using DOS keeps the machine in "real mode" - which basically means the chip (or virtual chip) behaves as an 8086[1]. Now, the 8086 is a peculiar beast. It's a 16 bit chip - registers are all 16 bits and most instructions work with 16 bit words - but it has a 20 bit memory address width. How do you squeeze 20 bits into a 16 bit word? Well, internally the 8086 uses a highly optimized form of data compression based on lzw that...

No, wait, that's not right. The answer is that the 8086 un-squeezes 20 bits into 32 bits - two 16 bit words instead of one. But here's where it gets weird. Instead of saying "we'll allow 32 bits of addressing but make it a hardware fault if anything above the low 20 bits is set" the 8086 designers said, well, this:

  XXXX0  the "segment" shifted left by 4 bits
+  XXXX  the "offset"
-------
  XXXXX  final 20 bit address

A segment register is shifted left 4 bits and then added to an offset register to get an actual address. So now it's clear why 0x0000fffc + 4 = 0x10000000 with one set of compiler flags: the segment has been incremented by 1 in the bits that count[2]. But why is the result 0x00000000 with a different set of compiler flags?

In order to get the 0x0000fffc + 4 = 0x10000000 behavior the machine code emitted by the compiler has to keep track of overflow - so every pointer add is not just a register add, but a register add plus an overflow check and perhaps a second register update. There may also be a move of two registers to/from memory instead of just one. That could be quite a performance hit on a lowly 4.77MHz machine (the original clock speed on the IBM PC).

So, as an optimization, most C compilers for 8086 real mode allow you to say that no structure is ever going to be larger than 2^16 bytes (64K) so that the emitted code doesn't always need to do all those gymnastics. The upshot: pointers silently overflow at 16 bits. 0x0000fffc + 4 = 0x00000000 and 0x1000fffc + 4 = 0x10000000.

If you're playing along at home the first result was achieved using the -mh ("huge" memory model) compiler flag and the second one was with -ml ("large" memory model). Both were linked with "wlink name test file test.obj". For more about the gory details of real mode memory models see 16 bit memory models[3].
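
If you don't have a DOS box handy, here's a toy simulation in portable C that reproduces both tables (the far_ptr struct and the huge_add/large_add functions are my own names for illustration; this models the effect of the two strategies, not the code Watcom literally emits, and real compilers may normalize differently):

#include <stdio.h>
#include <stdint.h>

/* A simulated real mode "far" pointer: 16 bit segment plus 16 bit offset.
   Its 20 bit linear address is segment * 16 + offset. */
struct far_ptr { uint16_t seg, off; };

/* Print it the way the tables above do: segment in the high 16 bits,
   offset in the low 16 bits. */
static unsigned long as_hex(struct far_ptr p) {
   return ((unsigned long)p.seg << 16) | p.off;
}

/* Huge model style add: when the 16 bit offset overflows, carry into the
   segment. One full offset wrap is 0x10000 bytes = 0x1000 segments. */
static struct far_ptr huge_add(struct far_ptr p, uint16_t n) {
   uint32_t off = (uint32_t)p.off + n;
   p.seg = (uint16_t)(p.seg + (off >> 16) * 0x1000);
   p.off = (uint16_t)off;
   return p;
}

/* Large model style add: only the offset participates, so it silently
   wraps at 16 bits and the segment never changes. */
static struct far_ptr large_add(struct far_ptr p, uint16_t n) {
   p.off = (uint16_t)(p.off + n);
   return p;
}

int main() {
   struct far_ptr base = { 0x0000, 0xfffc };
   for (uint16_t i = 0; i < 8; i++) {
      printf("%08lx + %u\thuge %08lx\tlarge %08lx\n",
             as_hex(base), (unsigned)i,
             as_hex(huge_add(base, i)), as_hex(large_add(base, i)));
   }
}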

The Standard

Back to earth now. The C standard is fine with all of the above behavior. As a gross oversimplification the standard says that

  1. A pointer must be NULL or must point at something valid or must point just past the end of something valid. Pointer math that would result in a pointer that is outside of those bounds is undefined. It might work like any of my examples, or it might put a phone order in for pizza.
  2. A dereference of a NULL pointer or of a pointer that does not point at something valid, even if the pointer is only pointing one step past the end of something valid, is also undefined. You pray for a crash.

In short, the C standard doesn't pretend that your program is one giant array. It pretends each individual allocation creates an island unto itself (and may be part of a larger island or composed of smaller islands). The C standard is pretty quiet about what can happen between those islands. It's just that on modern platforms memory tends to behave like one giant flat mutable array, so it's easy to assume that that's part of the C model.

One way to look at it is that the hypothetical C standard memory model treats pointers somewhat the way higher level languages treat references and array indexing, except that those higher level languages define bad behavior as either being statically impossible (no pointer math) or dynamically checked (null pointer exceptions and array index checks), whereas the C standard just says "undefined."
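
By the way, the "just past the end" allowance in rule 1 above is what makes the everyday C loop idiom legal. A minimal sketch:

#include <stdio.h>

int main() {
   int a[4] = { 1, 2, 3, 4 };
   int *end = a + 4;   /* one past the end: legal to form and compare,
                          not legal to dereference */

   for (int *p = a; p != end; p++)
      printf("%d\n", *p);

   /* *end would be undefined, and even computing end + 1 is undefined */
}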

Obviously looping around some arbitrary pointer location is not going to be portable, but this stuff has less obvious consequences. Code like the following is not guaranteed to work

char *buf = malloc(BUF_SIZE);
char *end_of_buf = buf + BUF_SIZE;

...

/* check for buffer overrun */
if (buf + offset >= end_of_buf) {
  /* buffer overrun */
} else {
  /* everything is okay */
}

As the loop above showed, buf + offset might in fact be less than buf on an overrun. The correct answer is to compare offset with BUF_SIZE.
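
For completeness, here's the same check written so that no out-of-bounds pointer is ever formed (just a sketch; BUF_SIZE and offset here are stand-ins for whatever your real code has):

#include <stdlib.h>

#define BUF_SIZE 512   /* stand-in value for illustration */

int main() {
   char *buf = malloc(BUF_SIZE);
   if (buf == NULL)
      return 1;

   size_t offset = 600;   /* stand-in for an offset computed elsewhere */

   /* Compare sizes instead of pointers: no pointer is ever computed
      outside the allocation, so nothing here is undefined. */
   if (offset >= BUF_SIZE) {
      /* buffer overrun */
   } else {
      buf[offset] = 'x';   /* everything is okay */
   }

   free(buf);
}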

In Defense of Undefined

When people explain C's choices it's common to say that it's about implementation efficiency. Indeed that's a huge part of it. But remember that C was specifically designed to allow very low level access; to allow much (though by no means all) of what assembly allows. So C's ability to write to arbitrary bits of memory allows useful things like talking to memory mapped devices. That kind of code is not portable, but there are important bits of some kinds of programs that aren't going to be portable no matter what you do. The trick is to keep those bits walled off and as small as possible.
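
The classic example is a memory mapped device register. Something like this sketch (the address and the register are entirely made up; real hardware documentation dictates the actual values, and on a hosted OS this is just another flavor of the undefined behavior described above):

#include <stdint.h>

/* Hypothetical memory mapped UART transmit register at a made-up address.
   volatile tells the compiler that every access matters and must not be
   optimized away or reordered. */
#define UART_TX ((volatile uint8_t *)0x40001000u)

void uart_putc(char c) {
   *UART_TX = (uint8_t)c;   /* write straight through to the device */
}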

So be careful with your pointers and I'll see all you former Java programmers in the unemployment line as I drive by in my new car with the custom plates "PTRMATH%saccount=34389983829 password=h3ll0k1ttySegmentation fault.

Footnotes

  1. Yes, I'm keeping a "virtual" machine in "real" mode. I'm just as confused as you are.
  2. Actually, one consequence of this funky scheme is that the same location in memory can be addressed multiple ways. 0x00120 + 0x0003 = 0x00123 and 0x00100 + 0x00023 = 0x00123. It's just that in the huge memory model the compiler's emitted code for pointer math keeps things canonical by using only the highest 4 bits of the segment.
  3. For even more weirdness associated with this segmentation stuff read a bit about the A20 Line, something that causes heartburn to bootloader writers to this day.

8 comments:

Toby said...

Very interesting post, a welcome refresher on the rather arcane magics of pointers.

This doesn't really fall in the scope of your post, but why do you think there is a resurgence in C usage (as indicated by Tiobe) - or a relative decline in Java? Is it the rise of mobile and embedded devices?

James Iry said...

Toby,

First, Tiobe needs to be taken with a dinosaur killer sized grain of salt. But second, the Tiobe folks say that C hasn't really gained much. They say there's been a slow drain on Java. The speculation is that other languages, especially other JVM languages, are creating a "brain drain" on Java and so there are fewer books, blog posts, etc. about Java that Tiobe measures (sort of).

Toby said...

Agree that Tiobe needs to be taken with an extra-large grain of salt. But the "brain drain" on Java definitely makes sense - there appears to be an amazing growth in JVM-based languages, and it's no surprise that would cut into Java usage. Good time to be working in one of those languages then (scala)!

More on topic, I wonder sometimes if there ever will be a new low-level language that eclipses C. Precisely the undefined behaviors you mentioned can cause big headaches, and it would seem a language that's stricter about pointer semantics could help prevent nastiness. Go seems to try and fit this bill to some extent. LLVM is another interesting player in the low-level languages field. But a serious low-level engineer will probably stick to the tool chain that has the most support, which invariably will end up being C.

Elazar Leibovich said...

I have no actual evidence, but I think it's not so much as other JVM languages as it is C#.

ikedaattractor said...

So, I actually have experience with a platform on which a NULL may not consist of all zeros: the PDP10. Addresses on the 10 are 18 or 30 bits and *word oriented*, meaning that an address references a word (36 bits) of memory and not a byte.

You may wonder how would one operate on a byte of data then? i.e. In C how is a char* represented? Enter the byte pointer.

A byte pointer is a 36-bit address with some special pattern in the left-most bits that tells the computer the address is meant to point at a byte of n bits, offset at location i from the start of the word referenced by the word address.

So, as an example: a standard memory location may look like this (address in octal, data in hex):

000001222333: 0x01020304f

But what if I want the char* which points at 0x01? In that case the computer must be given an 8-bit byte pointer that points at that word which looks something like this:

550001222333

Because of this strange representation, and some additional decisions related to how casts to and from void* should be treated, one representation for NULL that was used was:

550000000000

So, as you point out, there are a number of ways in which code which is common, but not strictly portable fails on this platform.

Also, note that 0 is a valid accumulator address on which to perform byte operations, so actually dereferencing a NULL pointer also does not cause a page fault or the equivalent instant failure. Instead, the computer continues to happily execute until some side effect of the NULL dereference may cause any number of exceptions. Or maybe not.

PDP-10s were widely loved by many because of their very expressive macro assembly languages. Even up until the late 80s there were organizations that did not want to get rid of them even when DEC had killed the product line, but I often wonder how much the difficulty in porting C programs to the platform led to its demise.

Of course I say that knowing that there are companies out there that actually still use PDP10 based processors in their products.

Anonymous said...

hey, IIRC according to C99 there is no direct relationship between the representation of a NULL pointer (which may or may not be all 0 bits) and the result of converting an integer with value zero to a pointer (conversion of literal 0 to a pointer is fully defined by the standard as yielding a NULL pointer -- it's even how the NULL macro is implemented)

A compiler can be strictly compliant, target an environment where NULL pointers are all 0 bits, and yet define conversion from a non-literal int of value 0 to a pointer as not yielding a NULL pointer. Even if I doubt such a compiler exists or ever will, that still means one can't rely on bit-based reinterpretation of ints cast to pointers to write strictly conformant C99 code.

peterfirefly said...

"IBM System/360 vs DEC PDP-11 vs Motorola 68x vs x86"

No, think S/360 vs PDP-10 vs Cray-1 vs real-mode x86 (large model) vs 8051 ;)

S/360 is nice (32-bit, byte-addressable, 1 byte=8 bits, sensible address spaces (excluding some of the later revisions to the architecture)).

PDP-10: 36-bit, not byte-addressable, 9 bit bytes.

Cray-1: not byte-addressable.

real mode x86, large model: 32-bit ptrs, byte-addressable, ptr aliases (so comparing ptrs without normalizing them first would go wrong), 64K segments (completely general ptr arithmetic was expensive).

8051: about 5 different address spaces. Ptr sizes of 1 or 2 for those or 3 for generic ptrs (1 byte to indicate the address space, all ptr stuff done through library calls for generic ptrs).

G_H said...

Very interesting read. But I must say that what had me most wide-eyed wasn't the pointer aberrations but your vanity license plate password. It's pretty much identical to one I once used when in a rush and couldn't be arsed to open KeePass.
Exactly how many people are there with Hello Kitty variants for passwords? Is this some kind of subconscious thing, a statistically highly probable choice when asked to quickly come up with random crap?