
Less assembly does not mean faster code


Categories: assembly c c++
Keywords: assembly c c++ compiler explorer gcc performance

Consider this trivial recursive implementation of the fibonacci sequence:

int fibonacci(int n) {
    // assume n >= 0
    if (n <= 1) {
        return n;
    }
    return fibonacci(n-1) + fibonacci(n-2);
}

If one looks at the assembly generated by GCC with no optimizations, it looks similar to

fibonacci(int):
        push    rbp
        mov     rbp, rsp
        push    rbx
        sub     rsp, 24
        mov     DWORD PTR [rbp-20], edi
        cmp     DWORD PTR [rbp-20], 1
        jg      .L6
        mov     eax, DWORD PTR [rbp-20]
        jmp     .L7
.L6:
        mov     eax, DWORD PTR [rbp-20]
        sub     eax, 1
        mov     edi, eax
        call    fibonacci(int) # first recursive call
        mov     ebx, eax
        mov     eax, DWORD PTR [rbp-20]
        sub     eax, 2
        mov     edi, eax
        call    fibonacci(int) # second recursive call
        add     eax, ebx
.L7:
        mov     rbx, QWORD PTR [rbp-8]
        leave
        ret

the slightly optimized version (-O1 with GCC 11) is even clearer

fibonacci(int):
        mov     eax, edi
        cmp     edi, 1
        jle     .L10
        push    rbp
        push    rbx
        sub     rsp, 8
        mov     ebx, edi
        lea     edi, [rdi-1]
        call    fibonacci(int) # first recursive call
        mov     ebp, eax
        lea     edi, [rbx-2]
        call    fibonacci(int) # second recursive call
        add     eax, ebp
        add     rsp, 8
        pop     rbx
        pop     rbp
        ret

But when using -O2 and -O3, at least since GCC 10, the compiler generates nearly 200 assembly instructions.

It is easy to fall into the trap of thinking that so much code is going to be slower than the non-optimized versions. As the code generated by GCC 9 was much more compact and similar to that generated with -O1, I thought it was a regression that, strangely, no one had noticed.

Compiling the code with GCC 10 and executing fibonacci(42) shows that with -O1 the code takes approximately 4 seconds (on my machine), while with -O2 and -O3 it takes no more than 2 seconds. And with greater input values, the difference only gets bigger.
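
For context, such a measurement can be reproduced with a minimal timing harness along these lines (a sketch of mine, not the exact setup used for the numbers above; build it together with the fibonacci implementation under test):

#include <chrono>
#include <cstdio>

int fibonacci(int n); // the recursive version shown above, defined elsewhere

int main() {
    const auto start = std::chrono::steady_clock::now();
    const int result = fibonacci(42);
    const auto stop = std::chrono::steady_clock::now();
    const auto elapsed =
        std::chrono::duration_cast<std::chrono::milliseconds>(stop - start);
    std::printf("fibonacci(42) = %d in %lld ms\n",
                result,
                static_cast<long long>(elapsed.count()));
}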

I also compiled the code with GCC 9 and -O3; in that case too, the binary generated by GCC 10 is faster, even if the difference is measurable only with bigger input values.

I do not have the patience to look at the 200 lines of generated assembly code to convince myself that GCC partially unrolls the fibonacci function. I say partially because it is still possible to find a couple of recursive calls.
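
To build some intuition for what such a transformation can look like at the source level, here is a hand-written sketch of one level of the recursion expanded by hand, using the identity fib(n-1) = fib(n-2) + fib(n-3). It is only my approximation; the real GCC output inlines far more aggressively:

int fibonacci_unrolled(int n) {
    // assume n >= 0
    if (n <= 1) {
        return n;        // fib(0) == 0, fib(1) == 1
    }
    if (n <= 3) {
        return n - 1;    // fib(2) == 1, fib(3) == 2
    }
    // fib(n) = fib(n-1) + fib(n-2)
    //        = (fib(n-2) + fib(n-3)) + fib(n-2)
    return fibonacci_unrolled(n - 2) + fibonacci_unrolled(n - 3)
         + fibonacci_unrolled(n - 2);
}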

On the other hand, comparing assembly is a useful way of comparing two algorithms that solve the same problem. Counting the number of instructions is easy; comparing and understanding all those instructions is not.

While it is obvious that the number of instructions alone is meaningless, this test case seems small enough that comparing instruction counts looks reasonable, and yet it is evident that the count has no relevance.

I also do not like this conclusion for another reason. It feels wrong that such a simple piece of code generates that much data.

While for my test case the binaries all had the same size (except those generated with -O0, which were a couple of bytes bigger), the compiler is trading space (larger binaries) for time (faster execution, at least for some functions).

Ideally we want to optimize for both as much as possible, even if of course that is not always achievable.

What does the assembly of a non-recursive fibonacci look like?

int fibonacci(int n) {
    // assume n >= 0
    if (n <= 1) {
        return n;
    }

    int n1 = 0;
    int n2 = 1;
    int out = 0;

    for (int i = 0; i != n - 1; ++i) {
        out = n1 + n2;
        n1 = n2;
        n2 = out;
    }
    return out;
}

Except with -O0, the generated assembly looks like

fibonacci(int):
        cmp     edi, 1
        jle     .L4
        sub     edi, 1
        mov     edx, 0
        mov     eax, 1
        mov     ecx, 0
.L3:
        mov     esi, eax
        add     eax, ecx
        add     edx, 1
        mov     ecx, esi
        cmp     edi, edx
        jne     .L3          # loop
        ret
.L4:
        mov     eax, 1
        ret

And from a performance point of view, even the un-optimized (-O0) build of this iterative version outperforms the highly optimized (-O3) build of the recursive version compiled with GCC 10.

Thus less assembly does not mean faster code, but it is a useful metric.

This classic example with the fibonacci function also shows how important it is to do some "manual" optimizations.

While some compilers might be better than others at optimizing specific pieces of code, being "portable" and capable of running on less performant devices means that we cannot take too many compiler optimizations for granted. Especially because a newer version of the compiler might not optimize a specific piece of code as much as the previous version did, and finding the culprit is normally anything but easy.

A smart enough compiler could compile the recursive version into the non-recursive one, but we are not there yet. The C++ standard does not even mandate tail-call optimization (as some other languages do), which would not help in this case anyway, but does help in others where a function calls itself recursively.
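
For completeness, fibonacci can be written in a tail-recursive form by threading the two running values through the call, as in the sketch below (names and structure are mine). Even then, C++ only permits, and does not guarantee, that the compiler turns the tail call into a loop:

// The recursive call is the last thing fibonacci_impl does, so a compiler
// *may* replace it with a jump, but the C++ standard does not require it.
int fibonacci_impl(int n, int current, int next) {
    if (n == 0) {
        return current;
    }
    return fibonacci_impl(n - 1, next, current + next);
}

int fibonacci_tail(int n) {
    // assume n >= 0
    return fibonacci_impl(n, 0, 1);
}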

Something similar can be observed when replacing constants that are initialized at runtime with constexpr. While generally the size of the binary should decrease, as some work done at runtime is simply not there anymore, you might notice that for some variables the total size is bigger than before.
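
As a made-up illustration (none of these names come from a real code base), the kind of replacement meant here is roughly:

int square(int x) { return x * x; }    // ordinary, non-constexpr function

// Before: formally this is dynamic initialization, i.e. the value is
// computed at run time when the program starts (a compiler may still fold
// simple cases like this one).
const int table_size = square(32);

// After: the compiler is required to compute the value at compile time,
// so no start-up code is needed for this constant.
constexpr int square_cx(int x) { return x * x; }
constexpr int table_size_cx = square_cx(32);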

The underlying cause, most of the time, might be that the compiler changes its decision about where to inline code. As with the fibonacci function, that does not mean that the resulting code is slower. On the contrary, it is probably faster: the compiler traded space for time.


Do you want to share your opinion? Or is there an error, or some parts that are not clear enough?

You can contact me here.