Less assembly does not mean faster code
Consider the trivial recursive implementation of the Fibonacci sequence:
int fibonacci(int n) {
    // assume n >= 0
    if(n <= 1) {
        return n;
    }
    return fibonacci(n-1) + fibonacci(n-2);
}
If one looks at the assembly generated by GCC with no optimizations (-O0), it looks similar to
fibonacci(int):
        push rbp
        mov rbp, rsp
        push rbx
        sub rsp, 24
        mov DWORD PTR [rbp-20], edi
        cmp DWORD PTR [rbp-20], 1
        jg .L6
        mov eax, DWORD PTR [rbp-20]
        jmp .L7
.L6:
        mov eax, DWORD PTR [rbp-20]
        sub eax, 1
        mov edi, eax
        call fibonacci(int) # first recursive call
        mov ebx, eax
        mov eax, DWORD PTR [rbp-20]
        sub eax, 2
        mov edi, eax
        call fibonacci(int) # second recursive call
        add eax, ebx
.L7:
        mov rbx, QWORD PTR [rbp-8]
        leave
        ret
The slightly optimized version (-O1 with GCC 11) is even clearer:
fibonacci(int):
        mov eax, edi
        cmp edi, 1
        jle .L10
        push rbp
        push rbx
        sub rsp, 8
        mov ebx, edi
        lea edi, [rdi-1]
        call fibonacci(int) # first recursive call
        mov ebp, eax
        lea edi, [rbx-2]
        call fibonacci(int) # second recursive call
        add eax, ebp
        add rsp, 8
        pop rbx
        pop rbp
        ret
But when using -O2 and -O3, at least since GCC 10, the compiler generates nearly 200 assembly instructions.
It is easy to fall into the trap of thinking that so much code is going to be slower than the less optimized versions. As the code generated by GCC 9 was much more compact and similar to that generated by -O1, I thought it was a regression that, strangely, no one had noticed.
Compiling the code with GCC 10 and executing fibonacci(42) shows that with -O1 the code takes approximately 4 seconds (on my machine), while with -O2 and -O3 it takes no more than 2 seconds. And with greater numbers, the difference only gets bigger.
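Your numbers will of course differ; the following is a minimal sketch of how such a measurement could be done with std::chrono, assuming the recursive fibonacci from above is defined in the same file or linked in:

#include <chrono>
#include <cstdio>

int fibonacci(int n); // the recursive version from above

int main() {
    const auto start = std::chrono::steady_clock::now();
    const int result = fibonacci(42);
    const auto stop = std::chrono::steady_clock::now();
    const auto ms =
        std::chrono::duration_cast<std::chrono::milliseconds>(stop - start);
    // printing the result also prevents the compiler from discarding the call
    std::printf("fibonacci(42) = %d in %lld ms\n",
                result, static_cast<long long>(ms.count()));
}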
I also compiled the code with GCC 9 and -O3, and in that case the binary generated by GCC 10 is faster, even if the difference is measurable only with bigger numbers.
I do not have the patience to look at the 200 lines of generated assembly code to convince myself that GCC partially unrolls the fibonacci function. I say partially because it is still possible to find a couple of recursive calls.
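I did not verify how GCC actually rewrites the function, but unrolling one level of the recursion by hand could look roughly like this (an illustrative sketch only, with a made-up name, not what the compiler literally emits):

// fibonacci(n) = fibonacci(n-1) + fibonacci(n-2)
//             = (fibonacci(n-2) + fibonacci(n-3)) + fibonacci(n-2)
//             = 2 * fibonacci(n-2) + fibonacci(n-3)
int fibonacci_unrolled(int n) {
    // assume n >= 0
    if(n <= 1) {
        return n;
    }
    if(n == 2) {
        return 1;
    }
    // the recursion is still there, there are just fewer calls per level
    return 2 * fibonacci_unrolled(n - 2) + fibonacci_unrolled(n - 3);
}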
On the other hand, comparing assembly is a useful way of comparing two algorithms that solve the same problem. Counting the number of instructions is easy, but comparing and understanding all those instructions is not.
While it is obvious that the number of instructions alone is meaningless, this test case seems small enough that comparing instruction counts would be reasonable, and yet it turns out that the count has no relevance.
I also do not like this conclusion for another reason: it feels wrong that such a simple piece of code generates so much machine code.
While for my test case the binaries all had the same size (except those generated with -O0, which were a couple of bytes bigger), we are trading space (larger binaries) for time (faster execution, at least for some functions). Ideally we would optimize for both as much as possible, even if that is of course not always feasible.
What does the assembly of a non-recursive fibonacci function look like?
int fibonacci(int n) {
    int n1 = 0;
    int n2 = 1;
    int out = 0;
    for(int i = 0 ; i != n-1 ; ++i) {
        out = n1 + n2;
        n1 = n2;
        n2 = out;
    }
    return out;
}
Except with -O0, the generated assembly looks like
fibonacci(int):
        cmp edi, 1
        jle .L4
        sub edi, 1
        mov edx, 0
        mov eax, 1
        mov ecx, 0
.L3:
        mov esi, eax
        add eax, ecx
        add edx, 1
        mov ecx, esi
        cmp edi, edx
        jne .L3 # loop
        ret
.L4:
        mov eax, 1
        ret
And from a performance point of view, this unoptimized (-O0) iterative version outperforms the highly optimized (-O3) recursive version compiled with GCC 10.
Thus less assembly does not mean faster code, even if the amount of generated assembly remains a useful metric.
This classic example with the fibonacci function also shows that it is still important to do some "manual" optimizations.
While some compilers might be better than others at optimizing specific pieces of code, being "portable" and capable of running on less powerful devices means that we cannot take too many compiler optimizations for granted, especially because a newer version of the compiler might not optimize a specific piece of code as much as the previous version did, and finding the culprit is normally anything but easy.
A smart enough compiler could compile the recursive version into the non-recursive one, but we are not there yet. The C++ standard does not even mandate tail-call optimization (other languages do), which would not help in this case anyway, but would in other cases where a function calls itself recursively.
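As an illustration, a tail-recursive formulation, written here with a made-up helper that carries two accumulators, is the kind of code a compiler may, but is not required to, turn into a loop:

// the recursive call is the last thing the helper does,
// so a compiler *may* turn it into a loop (C++ does not guarantee it)
int fibonacci_helper(int n, int n1, int n2) {
    if(n == 0) {
        return n1;
    }
    return fibonacci_helper(n - 1, n2, n1 + n2);
}

int fibonacci_tail(int n) {
    // assume n >= 0
    return fibonacci_helper(n, 0, 1);
}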
Something similar can be observed when replacing constants that are initialized at runtime with constexpr. While generally the size of the binary should decrease, as some work done at runtime is simply no longer there, you might notice that for some variables the total size is bigger than before.
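A minimal sketch of such a replacement, with made-up names (and note that the loop inside a constexpr function requires C++14 or later):

int fibonacci(int n); // any of the fibonacci functions above

// before: the value is computed when the program starts
const int fib_30 = fibonacci(30);

// after: the value is computed by the compiler and baked into the binary
constexpr int fibonacci_ce(int n) {
    int n1 = 0;
    int n2 = 1;
    for(int i = 0; i < n; ++i) {
        const int tmp = n1 + n2;
        n1 = n2;
        n2 = tmp;
    }
    return n1;
}
constexpr int fib_30_ce = fibonacci_ce(30);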
The underlying cause, most of the time, might be that the compiler changed its decisions about where to inline code. As with the fibonacci function, that does not mean that the resulting code is slower. On the contrary, it might be faster, with the compiler trading space for time.
If you have questions or comments, found typos or other errors, or if the notes are not clear, just contact me.