
Micro-optimizations

Categories: c c++
Keywords: assembly c c++ gcc implementation detail maintenance performance

I recently had a discussion about what compilers are able to optimize, and what kind of optimization one should do manually.

The discussed function looked like

void foo(int* result, std::vector<int> data)
{
    for (auto x : data){
        if (bar(x)) *result += 1;
    }
}

and the proposed "optimized" version was

void foo(int* result, std::vector<int> data)
{
    int local = 0;
    for (auto x : data){
        local += bar(x);
    }
    *result += local;
}

And the ultimate question was whether the compiler (in this case GCC) would generate the same optimal code for both versions, or whether there would be differences (long story short: all compilers generated different code).

Local variable vs pointer

As the author applied two "optimizations", I decided to analyze one change at a time.

The first one is copying the content of the pointer locally, thus

void foo(int* result, std::vector<int> data)
{
    for (auto x : data){
        if (bar(x)){
            *result += 1;
        }
    }
}

versus

void foo(int* result, std::vector<int> data)
{
    int local = 0;
    for (auto x : data){
        if (bar(x)){
            local += 1;
        }
    }
    *result += local;
}

Exceptions

I immediately pointed out that the two functions are not equivalent, and thus a compiler could not possibly generate exactly the same code for both (unless it had a bug).

What if, for example, bar throws?

In the first version, result might already point to an updated value; in the second, it would be left untouched.

My colleague implicitly assumed that bar is noexcept; does adding this restriction make the two programs equivalent?

No, they are still not equivalent.

Signals

The first reason, or so I thought, is signals.

On my platform, int and sig_atomic_t are aliases, as in static_assert(std::is_same_v<int, sig_atomic_t>);.

Thus, if result points to, for example, a global variable, a signal might be raised while foo executes. A signal handler could then inspect the variable pointed to by result, and the observed value would differ between the two implementations.

But the standard says that signal handlers can only access objects of type volatile sig_atomic_t; thus (short of a const_cast) an int* cannot point to a variable accessed by a signal handler, unless the compiler/platform supports signal handlers accessing non-volatile objects (which would be a valid compiler extension).

Shared memory

There are other cases too, like a memory-mapped file or a shared memory region. Neither is part of the standard, but both are extensions supported by most operating systems. These extensions more or less require that the compiler cannot, in general, optimize away multiple writes to such a variable.

Multithreading

Multithreading is not an issue. We have a pointer to an int, not to an object that can safely be read and written from multiple threads. Doing so would be undefined behavior, so there is no need to consider it.

Again: unless the compiler/platform gives additional guarantees about multithreading.

Aliasing

Of course, there is also aliasing.

result could point to some data inside the vector<int>. The author explicitly passed the vector by copy to avoid this issue, but since std::vector allocates its buffer through an allocator, the compiler must assume that result points inside the vector unless it can see through the allocation.

A separate test with different types shows that aliasing does not seem to play a significant role.

What to do?

Now, supposing that one is not interested in signals and shared memory, should one write the "optimized" version? What harm can it do?

The first disadvantage is that it makes the code less obvious. Why are we writing apparently redundant code? Why create a local/temporary variable?

And more importantly: should we always do it? Even for types that might be costly to copy?

As in many other situations, the answer is: it depends. If there is no guideline, my main advice would be to measure. Does writing to a local variable and "publishing" all changes at once make the code fast enough that the change is worth it?

Note that for this function there is actually an even better solution: do not use an out-parameter at all.

int foo(const std::vector<int>& data)
{
    int local = 0;
    for (const auto& x : data){
        if (bar(x)){
            local += 1;
        }
    }
    return local;
}

We do not need to think about aliasing, global variables, signals, or exceptions. It is straightforward to see what this code does and it is way simpler to reason about it, as there is no pointer that could point to a global variable.

Obviously, the advice does not always hold for more complex types. For example, with a container, it can be beneficial to use a preallocated buffer instead of allocating one inside the function. On the other hand, with local variables, the compiler can generally do some interesting optimizations (again: measure!).

There is another point to be made. Initially, the author compared only the generated code of the functions.

If those functions are used only with local variables, like in the following example

#include <vector>

bool bar(int i) noexcept;
void foo1(int* result, const std::vector<int>& data)
{
    for (const auto& v : data){
        if (bar(v)){
            *result += 1;
        }
    }
}

void foo2(int* result, const std::vector<int>& data)
{
    int local = 0;
    for (const auto& v : data){
        if (bar(v)){
            local += 1;
        }
    }
    *result += local;
}

int baz1(const std::vector<int>& data){
    int j = 0;
    foo1(&j, data);
    return j;
}

int baz2(const std::vector<int>& data){
    int j = 0;
    foo2(&j, data);
    return j;
}

One can notice that the generated code for baz1 and baz2 is identical.

In this case, I avoided copying std::vector to generate less code and make the output easier to compare.

Thus, while it is interesting to analyze snippets of code in isolation, it also makes sense to see them in a broader context.

If the function we are talking about is an implementation detail or used only in specific contexts (like with local variables), the compiler might in fact generate exactly the same code.

Obviously in other contexts, like functions that are part of the API of a shared library, the compiler has fewer optimization opportunities.

To branch or not?

The second "optimization" was changing if (bar(x)) *result += 1; to local += bar(x);. These constructs are not necessarily equivalent: for example, what if int bar(int) noexcept returns 42? Supposing that the two snippets are equivalent, I actually prefer the first form because it is easier to read.

With the if, one knows immediately that the result can be incremented at most by one. With the second, where the value is always modified (possibly by adding 0), one needs to inspect bar to reach the same conclusion.

Also, as a minor advantage, the if makes it easier to add a breakpoint.

When reading the first "non-optimized" example, I initially thought that bar returned a boolean value; only later did I find out that my colleague assumed it returned an integral value. Booleans and integers are two types with different semantics, and I do not like mixing them.

I would thus prefer to write result += bar(v) ? 1 : 0, because even if it generates exactly the same code as result += bar(v), it has the same advantages as the if version, except for the ease of adding a breakpoint.

Note 📝
In this context, my opinion is that the if is more legible with bar. When implementing a mathematical formula, or if the function names make it clear that the only possible return values are 0 and 1, then it could be more legible not to write an if.

Guideline

It is not always easy to see if an optimization is worth it or not.

If an optimization comes at the expense of breaking a common guideline or pattern, then it should be required to show some data demonstrating that the change is worthwhile, or at least hinting that it can make a difference.

On the other hand, a lot of C++ guidelines are in fact micro-optimizations. One example is passing complex types by const-ref where it makes sense (and inexpensive-to-copy types by value), or, if possible, calling reserve on a vector before filling it. It should not be necessary to prove that those make your application faster.

If the change also simplifies the code (like not using a pointer at all), then, by all means, do it. The main argument should not be that it is a change that could increase performance, but the other benefits.


Do you want to share your opinion? Or is there an error, some parts that are not clear enough?

You can contact me anytime.