
Less assembly does not mean faster code

Consider the trivial recursive implementation of the Fibonacci sequence:

int fibonacci(int n) {
  // assume n >= 0
  if(n <= 1) {
    return n;
  }
  return fibonacci(n-1) + fibonacci(n-2);
}

If one looks at the assembly generated by GCC with no optimizations, it looks similar to

fibonacci(int):
        push    rbp
        mov     rbp, rsp
        push    rbx
        sub     rsp, 24
        mov     DWORD PTR [rbp-20], edi
        cmp     DWORD PTR [rbp-20], 1
        jg      .L6
        mov     eax, DWORD PTR [rbp-20]
        jmp     .L7
.L6:
        mov     eax, DWORD PTR [rbp-20]
        sub     eax, 1
        mov     edi, eax
        call    fibonacci(int) # first recursive call
        mov     ebx, eax
        mov     eax, DWORD PTR [rbp-20]
        sub     eax, 2
        mov     edi, eax
        call    fibonacci(int) # second recursive call
        add     eax, ebx
.L7:
        mov     rbx, QWORD PTR [rbp-8]
        leave
        ret

The slightly optimized version (-O1 with GCC 11) is even clearer

fibonacci(int):
        mov     eax, edi
        cmp     edi, 1
        jle     .L10
        push    rbp
        push    rbx
        sub     rsp, 8
        mov     ebx, edi
        lea     edi, [rdi-1]
        call    fibonacci(int) # first recursive call
        mov     ebp, eax
        lea     edi, [rbx-2]
        call    fibonacci(int) # second recursive call
        add     eax, ebp
        add     rsp, 8
        pop     rbx
        pop     rbp
        ret

But when using -O2 and -O3, at least since GCC 10, the compiler generates nearly 200 assembly instructions.
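To see those instructions yourself, you can ask GCC to emit Intel-syntax assembly directly (a minimal invocation; saving the source as fibonacci.cpp is my assumption, and Compiler Explorer shows the same output interactively):

g++ -O2 -S -masm=intel fibonacci.cpp -o fibonacci.s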

It is easy to fall into the trap of thinking that so much code is going to be slower than the less optimized versions. As the code generated by GCC 9 was much more compact and similar to that generated by -O1, I thought it was a regression that, strangely, no one had noticed.

Compiling the code with GCC 10 and executing fibonacci(42) shows that with -O1 the code takes (on my machine) approximately 4 seconds, while with -O2 and -O3 it takes no more than 2 seconds. And with greater numbers, the difference only gets bigger.
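For reference, the measurement can be reproduced with a minimal timing harness like the following sketch; my exact choices (steady_clock, milliseconds) are assumptions, as the original measurement code is not shown:

#include <chrono>
#include <cstdio>

// the recursive version from above
int fibonacci(int n) {
  if(n <= 1) {
    return n;
  }
  return fibonacci(n - 1) + fibonacci(n - 2);
}

int main() {
  const auto start = std::chrono::steady_clock::now();
  const int result = fibonacci(42);
  const auto stop = std::chrono::steady_clock::now();
  const long long ms =
      std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
  std::printf("fibonacci(42) == %d, computed in %lld ms\n", result, ms);
}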

I also compiled the code with GCC 9 and -O3, and in that case, the binary generated by GCC 10 is faster, even if the difference is measurable only with bigger numbers.

I do not have the patience to look at the 200 lines of generated assembly code to convince myself that GCC partially unrolls the fibonacci function. I say partially because it is still possible to find a couple of recursive calls.
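To get an intuition of what this partial unrolling means, one can inline one level of the recursion by hand; the following is my illustration of the idea, not the exact transformation GCC performs:

// fibonacci(n) == fibonacci(n-1) + fibonacci(n-2)
//              == (fibonacci(n-2) + fibonacci(n-3)) + fibonacci(n-2)
int fibonacci_unrolled(int n) {
  // assume n >= 0
  if(n <= 1) {
    return n;
  }
  if(n <= 3) {
    return n - 1; // fibonacci(2) == 1, fibonacci(3) == 2
  }
  return 2 * fibonacci_unrolled(n - 2) + fibonacci_unrolled(n - 3);
}

Each call now does more work, but the total number of calls shrinks, at the cost of a bigger function body; repeating this for a few more levels helps explain the nearly 200 instructions.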

On the other hand, comparing assembly is a useful way of comparing two algorithms that solve the same problem. Counting the number of instructions is easy, but comparing and understanding all those instructions is not.

While it is obvious that the number of instructions alone is meaningless, this test case seems small enough that comparing instruction counts looks reasonable, and yet the outcome is that the count has no relevance.

I also do not like this conclusion for another reason. It feels wrong that such a simple piece of code generates that much data.

While for my test case the binaries all had the same size (except those generated with -O0, which were a couple of bytes bigger), in general we are trading space (larger binaries) for time (faster execution, at least for some functions).

Ideally, even if it is of course not always possible, we want to optimize for both as much as possible.

What does the assembly of a non-recursive fibonacci function look like?

int fibonacci(int n) {
  // assume n >= 0
  int n1 = 0;
  int n2 = 1;
  int out = n; // covers n == 0 and n == 1, where the loop does not run

  for(int i = 0; i < n - 1; ++i) {
    out = n1 + n2;
    n1 = n2;
    n2 = out;
  }
  return out;
}

Except with -O0, the generated assembly looks similar to

fibonacci(int):
        cmp     edi, 1
        jle     .L4
        sub     edi, 1
        mov     edx, 0
        mov     eax, 1
        mov     ecx, 0
.L3:
        mov     esi, eax
        add     eax, ecx
        add     edx, 1
        mov     ecx, esi
        cmp     edi, edx
        jne     .L3          # loop
        ret
.L4:
        mov     eax, 1
        ret

And from a performance point of view, even the un-optimized (-O0) build of this iterative version outperforms the highly optimized (-O3) build of the recursive one with GCC 10.

Thus less assembly does not mean faster code, even if the instruction count remains a useful metric.

This classic example with the fibonacci function also shows that it is still important to do some "manual" optimizations.

While some compilers might be better than others at optimizing specific pieces of code, being "portable" and capable of running on less performant devices means that we cannot take too many compiler optimizations for granted. Especially because a newer version of the compiler might not optimize a specific piece of code as much as the previous version did, and finding the culprit is normally anything but easy.

A smart enough compiler could compile the recursive version into the non-recursive one, but we are not there yet. The C++ standard does not even mandate tail-call optimization (other languages do), which would not help in this case, since neither recursive call is in tail position, but it does help in other cases where a function calls itself recursively.
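For completeness, an accumulator-based, tail-recursive variant could look like the following sketch; the helper name fibonacci_impl is my invention, and whether the tail call actually becomes a loop still depends on the compiler and flags:

int fibonacci_impl(int n, int n1, int n2) {
  if(n == 0) {
    return n1;
  }
  // the recursive call is the last operation: a tail call
  return fibonacci_impl(n - 1, n2, n1 + n2);
}

int fibonacci(int n) {
  // assume n >= 0
  return fibonacci_impl(n, 0, 1);
}

In my experience GCC turns such self tail calls into a plain loop at -O2, but since the standard does not require it, relying on that is exactly the kind of compiler optimization we cannot take for granted.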

Something similar can be observed when replacing constants that are initialized at runtime with constexpr ones. While generally the size of the binary should decrease, as some work done at runtime is simply no longer there, you might notice that for some variables the total size is bigger than before.

The underlying cause, most of the time, might be that the compiler changes its decision about where to inline code. As with the fibonacci function, that does not mean the resulting code is slower. On the contrary, it might be faster: the compiler traded space for time.
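As a minimal sketch of that trade-off, assuming the iterative fibonacci is marked constexpr:

constexpr int fibonacci(int n) {
  // assume n >= 0; the iterative body from above, now usable at compile time
  int n1 = 0;
  int n2 = 1;
  int out = n;
  for(int i = 0; i < n - 1; ++i) {
    out = n1 + n2;
    n1 = n2;
    n2 = out;
  }
  return out;
}

// computed at compile time, guaranteed: only the value 6765 is stored
constexpr int fib20 = fibonacci(20);

int main() {
  int n = 20;                      // not a constant expression
  const int fib_rt = fibonacci(n); // computed at run time (unless the optimizer folds it)
  return fib20 - fib_rt;           // 0
}

Whether the binary gets smaller or bigger then depends on what the compiler decides to inline around such variables.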


If you have questions or comments, found typos, find the notes unclear, or spot some errors, then just contact me.