Monthly Archives: November 2013

A glimpse of undefined behavior in C

A few weeks ago a coworker of mine stopped by my desk with a coding question.  We’d recently been quizzing each other over our C knowledge, so I smiled and braced myself for the hell which was clearly to come.

He wrote a few lines of code up on the whiteboard and asked what the program output:

#include <stdio.h>

int main(){
    int i = 0;
    int a[] = {10,20,30};

    int r = 1 * a[i++] + 2 * a[i++] + 3 * a[i++];
    printf("%d\n", r); 
    return 0;
}

It seemed pretty straightforward. I explained the order of operator precedence – postfix binds before multiplication which binds before addition and also that multiplication and addition’s associativity was left to right, so I grabbed the marker and started writing out the arithmetic.

int r = 1 * a[i++] + 2 * a[i++] + 3 * a[i++];
//    =    a[0]    + 2 * a[1]  + 3 * a[2];
//    =     10     +     40    +    90;
//    = 140

After I smugly scribbled down the answer, my coworker replied a simple “nope”. I thought for a few minutes, and was stumped. I couldn’t quite remember the order of the postfix operators’ associativity. Furthermore, I knew that wouldn’t even change the order of evaluation here as associativity rules only apply between operators of the same precedence level, but thought I’d give it a shot and walk through the arithmetic as if the postfix operators were all evaluated right to left.

int r = 1 * a[i++] + 2 * a[i++] + 3 * a[i++];
//    =    a[2]    + 2 * a[1]  + 3 * a[0];
//    =     30     +     40    +    30;
//    = 100

Again, my coworker replied it was still wrong. At this point I threw up my hands and asked what the answer was. It turned out that this small sample was actually reduced from a larger code snippet he’d put together. To verify his question, he’d compiled and run the larger sample but was surprised to see it not behave as he’d thought. After reducing the unexpected behavior to this illuminating example, he’d compiled it with gcc 4.7.3 and it output a startling result: “60”.

At this point I was intrigued. I remembered that, in C, the order of function argument evaluation was undefined, so we thought maybe the postfix operators were just being evaluated in some random non-left-to-right order. Still convinced that postfix had a higher operator precedence than addition and multiplication, we quickly proved to ourselves that there was no order we could evaluate the i++’s such that the three elements of the array were added/multiplied together to get “60”.

Now I was hooked. My first thought was to look at the disassembly of the snippet and try and track down what was actually going on. I compiled the sample with debugging symbols and after a quick objdump we had the source-annotated x86_64 disassembly:

Disassembly of section .text:

0000000000000000 <main>:
#include <stdio.h>

int main(){
   0:	55                   	push   %rbp
   1:	48 89 e5             	mov    %rsp,%rbp
   4:	48 83 ec 20          	sub    $0x20,%rsp
    int i = 0;
   8:	c7 45 e8 00 00 00 00 	movl   $0x0,-0x18(%rbp)
    int a[] = {10,20,30};
   f:	c7 45 f0 0a 00 00 00 	movl   $0xa,-0x10(%rbp)
  16:	c7 45 f4 14 00 00 00 	movl   $0x14,-0xc(%rbp)
  1d:	c7 45 f8 1e 00 00 00 	movl   $0x1e,-0x8(%rbp)
    int r = 1 * a[i++] + 2 * a[i++] + 3 * a[i++];
  24:	8b 45 e8             	mov    -0x18(%rbp),%eax
  27:	48 98                	cltq   
  29:	8b 54 85 f0          	mov    -0x10(%rbp,%rax,4),%edx
  2d:	8b 45 e8             	mov    -0x18(%rbp),%eax
  30:	48 98                	cltq   
  32:	8b 44 85 f0          	mov    -0x10(%rbp,%rax,4),%eax
  36:	01 c0                	add    %eax,%eax
  38:	8d 0c 02             	lea    (%rdx,%rax,1),%ecx
  3b:	8b 45 e8             	mov    -0x18(%rbp),%eax
  3e:	48 98                	cltq   
  40:	8b 54 85 f0          	mov    -0x10(%rbp,%rax,4),%edx
  44:	89 d0                	mov    %edx,%eax
  46:	01 c0                	add    %eax,%eax
  48:	01 d0                	add    %edx,%eax
  4a:	01 c8                	add    %ecx,%eax
  4c:	89 45 ec             	mov    %eax,-0x14(%rbp)
  4f:	83 45 e8 01          	addl   $0x1,-0x18(%rbp)
  53:	83 45 e8 01          	addl   $0x1,-0x18(%rbp)
  57:	83 45 e8 01          	addl   $0x1,-0x18(%rbp)
    printf("%d\n", r);
  5b:	8b 45 ec             	mov    -0x14(%rbp),%eax
  5e:	89 c6                	mov    %eax,%esi
  60:	bf 00 00 00 00       	mov    $0x0,%edi
  65:	b8 00 00 00 00       	mov    $0x0,%eax
  6a:	e8 00 00 00 00       	callq  6f <main+0x6f>
    return 0;
  6f:	b8 00 00 00 00       	mov    $0x0,%eax
}
  74:	c9                   	leaveq 
  75:	c3                   	retq 

The first few and last few instructions just set up the stack frame, initial variable values, and do the printf call and return from main, so we really only cared about instructions 0x24 through 0x57. That’s where the interesting behavior was. Let’s walk through the instructions a few at a time:

  24:	8b 45 e8             	mov    -0x18(%rbp),%eax
  27:	48 98                	cltq   
  29:	8b 54 85 f0          	mov    -0x10(%rbp,%rax,4),%edx

The first three instructions were as we expected. First, it loads the value of i (0) into register eax, sign-extends it to 64 bits, then loads a[0] into register edx. The 1 * was obviously optimized out by the compiler, but everything looked normal here. The next few started out much the same:

  2d:	8b 45 e8             	mov    -0x18(%rbp),%eax
  30:	48 98                	cltq   
  32:	8b 44 85 f0          	mov    -0x10(%rbp,%rax,4),%eax
  36:	01 c0                	add    %eax,%eax
  38:	8d 0c 02             	lea    (%rdx,%rax,1),%ecx

The first mov loads the value of i (still 0) into eax, sign-extends it to 64-bits, then loads a[0] into eax. That’s intersting – we expected the i++ to have been performed before these three instructions again, but maybe the last two instructions will do some sort of assembly magic to get us to the expected result (2 * a[1]). The two instructions add eax to itself, effectively performing 2 * a[0], then add that result to the previous computation and stores it in ecx. At this point the instructions have evaluated a[0] + 2 * a[0]. Things are starting to look a little strange but again, maybe there’s some compiler magic going on.

  3b:	8b 45 e8             	mov    -0x18(%rbp),%eax
  3e:	48 98                	cltq   
  40:	8b 54 85 f0          	mov    -0x10(%rbp,%rax,4),%edx
  44:	89 d0                	mov    %edx,%eax

These next few instructions are starting to look pretty familiar. They load the value of i (still 0), sign extend it to 64-bits, load a[0] into edx, then copy edx into eax. Hmm, ok let’s take a look at a few more:

  46:	01 c0                	add    %eax,%eax
  48:	01 d0                	add    %edx,%eax
  4a:	01 c8                	add    %ecx,%eax
  4c:	89 45 ec             	mov    %eax,-0x14(%rbp)

Here, it adds a[0] to itself 3 times, adds it the previous computation, then stores it into variable ‘r’. Now that’s wild – our variable r now contains a[0] + 2 * a[0] + 3 * a[0]. Sure enough, that’s what the program output: “60”. But what happened to those postfix operations? They’re all there at the end:

  4f:	83 45 e8 01          	addl   $0x1,-0x18(%rbp)
  53:	83 45 e8 01          	addl   $0x1,-0x18(%rbp)
  57:	83 45 e8 01          	addl   $0x1,-0x18(%rbp)

It seemed our compiled version of the code just seemed completely wrong! Why had the postfix operations been thrown down at the end after the assignment had already happened? With my faith in reality dwindling, I decided to go straight to the source. No, not the compiler’s source – that’s just an implementation – I grabbed the C11 language specification.

The issue lay within the details of the postfix operations. In our case, we were thrice performing a postfix increment to our array index in a single expression. When evaluating a postfix operator it returns the initial value of the variable. The assignment of the new value back to the variable is known as a side effect. It turns out that side effects are only defined to have been committed between sequence points. See section 5.1.2.3 of the standard for the specifics of where sequence points are defined, but in our case our expression exhibits undefined behavior. It’s entirely up to the compiler when the side-effect of the assignment of the new value back to the variable is performed with respect to the other parts of the expression.

In the end, we both learned a bit more about the C language. It’s well-known best practice to avoid constructing complex expressions with prefix/postfix operators, and this is a great example of why that is the case.