The C++ logo, by Jeremy Kratz, licensed under CC0 1.0 Universal

C++ performance guidelines


20 - 25 minutes read, 4980 words
Categories: c++
Keywords: amdahl's law c++ compiler explorer constants constexpr perf performance quick-bench valgrind view types

This is not about micro-optimizations and premature optimization, but about avoiding pessimization, and fixing low hanging-fruits for improving performance.

The underlying assumption is that the optimized/"not pessimized" code is at least just as readable, maintainable, and correct as the pessimized. In practice, I find the optimized code even to be easier to maintain, analyze, and verify its correctness.

Also, some suggestions are already diagnosed by compilers or static analyzers.

I do not claim that fixing those low-hanging fruits will make the final product faster, probably they won’t. Where it will make probably a difference is in other scenarios, like testing and debugging, because of Amdahl’s law:

If one part of your program accounts for 10% of its runtime, and you optimize that part to run twice as fast, the program as a whole will only speed up by 5%.

Why do I want to increase performance

There are many reasons. For me, there are mainly four practical advantages:

  • run the same task with fewer resources and/or less time

  • run the same task on a less powerful device

  • use more resources to solve the same task in less time

  • run more tasks with the same resources/amount of time, like runtime checks performed by sanitizers or Valgrind

The last point is especially important for development, as it permits to unlock new types of workflows

  • test debug build

  • fuzzing

  • sanitizers

  • expensive hand-written assertions

  • Valgrind

  • automated tests for every change

  • further analysis, as there will be less noise

While all those tasks should not depend on the performance of our program, in fact, they do. Suppose you have a test suite that takes a lot of time (lets say ten minutes) to execute. If using a sanitizer doubles the execution time, and if we want to test at least two different variations, the execution times are probably already prohibitive (more than half an hour). If we do not want to recompile our code multiple times (as also this task might take a lot of time) and prefer to use something like Valgrind, where it is more probable that the execution time increases (depending on what checks we have enabled and the logic of our program) by a factor of 10.

In both cases, it is difficult to continuously or automatically test every change in our code base, even if the return of investment is gigantic.

Even with a centralized server that builds and test every commit, we are not going to wait an hour to see if our changes did not introduce any problem, and if we get a negative result after an hour, we need to switch context to reevaluate our previous work. It also makes it harder to make many small and independent commits, as one big commit with many small unrelated changes is much faster to build and test.

I also want to test unoptimized builds. Optimizations can "remove" UB and make programs work by accident, also debugging is much better if all values are inspectable. Also, the correctness/functionality of the program should be orthogonal to the optimizations, so it’s a good idea to test from time to time without them, in order not to rely too much on them. Thus I also try to keep an eye open on debug builds, even if it is a very-low-priority task.

When talking about performance, the first advice is to always measure. There are many tools, like the integrated CPU profiler in Visual Studio, perf, quick-bench, Compiler Explorer, Cachegrind, Very Sleepy, …​

While the advice is correct, finding and analyzing bottlenecks is not always a trivial task, and can take a lot of time.

Its not a coincidence that we, maybe unconsciously, are already applying performance guidelines all the time, even if we do not know if they are beneficial for our code, like

  • lock mutexes only for the shortest amount of time (because serializing parallel work is the worst performance hit we can take)

  • pass expensive types by reference, or through a pointer (because copying data unnecessarily is a waste of resources, and generally the indirection has less overhead)

  • pass non-expensive types by value (because the indirection might cost more than copying the data)

  • Avoid unnecessary work, like calling the same function without side-effects multiple times with the same parameter (as calling it once and storing the result is more performant)

But there are a lot more low-hanging fruits that can be found in any big-sized project.

But if changing any line of code according to the suggestions I’m writing about makes it somehow harder to understand, harder to maintain, or less easy to use…​, then undo the changes without further thinking. If you are not measuring, then the time spent is probably not worth it.

Avoid optimization barriers

Optimization barriers are those constructs that the compiler cannot or does not optimize. It might also be that they prevent only certain types of optimizations.

The first category of optimizations barrier is made by those constructs that have visible side-effects, like

The second category is made of constructs that are outside of the C++ language. These are mainly externally compiled files, like object files, shared libraries, and so on.

The third category is made of constructs that the compiler is not good at optimizing

  • exceptions cold path and binary size.

  • inline assembly

A sufficiently smart compiler can overcome many limitations like assembly, exceptions, memory barriers and some system calls like memory allocation. A conforming compiler might even modify a bubble-sort to quick-sort, as long as the end result is the same. This is the reason why the correctness of the program should not rely too much on optimization. Imagine if you update the compiler and the performance is awful as your bubble-sort is not optimized anymore to a quick-sort, and you did not even know in the first place that it has been optimized so aggressively.

While most optimizations barrier/side effects are generally needed (somewhen the program needs to synchronize between threads or output something), it is generally best if those operations can be gathered together.

For example

{
  unique_lock;
  foo();
}

// do something else

{
  unique_lock;
  bar();
}

vs

// do something else

{
  unique_lock;
  foo();
  bar();
}

or

printf("Hello ");
printf("World!\n");

vs

printf("Hello World!\n");

For the same reason, try to use std::mutex instead of std::recursive_mutex. It helps to make your class both simple and more performant. A sufficiently smart compiler could optimize both classes to produce the same code, I bet patches are welcome.

recursive versus non-recursive mutex

Scale of performance

It’s useful for using it as a guideline when writing code

  • cpu cycle

  • read value from cache

  • access ram

  • function call

  • virtual function call

  • system call

  • access disk

According to this guideline, reading from memory is much faster than allocating (a system call).

Indeed changing

std::string a, b, c;
// fill a, b,c
auto result = a + b + c;

to something like

template <class... S>
std::string concat(const S&... strs) {
  std::string_view views[]{strs...};
  int totalsize = std::accumulate(std::begin(views), std::end(views), 0, [](int count, std::string_view v){return count + v.size();});
  std::string result(totalsize, '\0');
  int cur_pos = 0;
  for(const auto& v : views){
    std::copy(v.begin(), v.end(), result.begin() + cur_pos);
    cur_pos += v.size();
  }
  assert(cur_pos == totalsize);
  return result;
}

std::string a, b, c;
// fill a, b,c
auto result = concat(a, b, c);

is faster.

One could measure if replacing a + b + c with concat is worth it. Unless the application is concatenating a lot of string, the performance gain might not be noticeable. But as the time at disposal is limited, using consistently something like concat has no disadvantage, and can even be more beneficial as shown in the benchmark if some of the concatenated strings are not std::string, but std::string_view, literals, or other types. It also has the advantage that the performance gain might be noticeable on less powerful platforms, and especially when using Valgrind, sanitizers, or when fuzzing.

Constants

Defining constants seems to be something so easy…​ and yet it is a can of worms. There is the issue with visibility, the initialization order fiasco, and the fact that constants (even unused) that allocate memory (like std::string) won’t free it until the program is exiting.

Thus if possible, just make all constants compile-time constant, ie use constexpr. Avoid allocating owning types, as they need to copy unnecessarily the content. Writing a constexpr and efficient container, like map is not an easy task. But if the sole reason to exist for this class is to hold compile-time constants, we can take some shortcuts:

  • assume that the number of contained elements is small

  • provide only const operations

The easiest implementation is just using an array of pairs as the underlying data structure. Something like the following ct_map class should be a good starting point

#include <utility>
#include <algorithm>

template<class T, class U, std::size_t N>
class ct_map {
  template <std::size_t... PACK1>
  constexpr ct_map(const std::pair<T,U> (&arr)[N], std::index_sequence<PACK1...>)
    : data{ arr[PACK1]... }
  {
    std::sort(data, data+N);
  }
  std::pair<T,U> data[N];
public:
  constexpr ct_map(const std::pair<T,U> (&arr)[N])
    : ct_map(arr, std::make_index_sequence<N>{})
  {
  }
  constexpr const std::pair<T,U>* find(const T& k) const noexcept {
    // if unsorted
    //return std::find_if(data, data+N, [k](const std::pair<T,U>& v){ return v.first == k;});
    // if sorted
    return std::lower_bound(data, data + N, k, [](const std::pair<T,U>& v, const T& k){ return v.first < k;});
  }
  constexpr std::size_t size() const noexcept { return N;}
  // add begin, end, and other function one might need
};

template<class T, class U, std::size_t N>
constexpr ct_map<T,U,N> make_ct_map(const std::pair<T,U> (&arr)[N])
{
  return ct_map<T,U,N>(arr);
}

By taking the call to std::sort out, it can be backported to C++11 with minor changes if one backports std::index_sequence too.

Indeed, for a "small" set of values it outperforms the map containers from the standard library. Actually, it outperforms std::map even with optimisation disabled, but not std::unordered_map.

The allocation, which would happen only once at startup, is not taken into account, but creating an std::*map is not trivial. Creating a constant std::map/std::unordered_map are approximately 450 lines of assembly (100 for the data), thus 350 lines of logic for creating the data structure. Creating the constexpr ct_map is just the data.

Thus while counting assembly lines is not a good performance metric, in this case, it shows how much difference defining a simpler data structure can reduce code bloat. Unless we are starting the same program a lot of times (for example when fuzzing…​), no one might notice the difference. On the other hand, limiting the size of the executable could have nice side-effects, smaller updates can, for example, be downloaded and installed faster.

Those and others are multiple reasons why I would prefer ct_map over std::*map, even if from a strict performance view we do not have necessarily always a gain when storing constants.

Singleton

Singletons are the way to initialize global data at runtime in C++. The initialization is even thread-safe, but we might not always need it. Either we do not have multiple threads accessing the singleton, or we are already using something else for ensuring consistency with multiple threads.

In both cases, even if small, there is an overhead for accessing the data, which would not be there if we simply made it global.

For constants, true constants, use constexpr, but if not possible continue to use singletons, especially because they avoid the init-order fiasco.

View types

Passing owning types might require allocation even if passed by reference.

void foo(const std::string&);
foo("bar");

Types that do not own the content, like std::string_view, and std::span, are "small". They are conceptually a couple of pointers (or a pointer and a length), and thus should be passed by value, and not const-reference.

View types are also incredibly useful for defining constants. The class ct_map has the disadvantage that the size is part of the type. This makes it impossible to store, for example, multiple ct_map with different sizes in an array and limits the possibility to pass it as a parameter to functions.

On the other hand, a type like string_view, does not have such limitations, as the size is not part of the type, but can still be queried at compile-time.

Pass ownership by value

This only applies if move semantic is cheap. It generally is, for example for std::string, std::vector, …​, but there might be exceptions, especially for user-defined types, arrays, and thus classes like ct_map!

Passing ownership by value is also consistent with move-only types, like std::unique_ptr.

class foo(){
  std::string str;
  public:
  explicit foo(const std::string& str_) : str(str_){}
};

foo("bar"); // 2 allocations
std::string bar = "...";
foo(bar); // 1 copy

versus

class foo(){
  std::string str;
  public:
  explicit foo(std::string str_) : str(std::move(str_)){}
};
foo("bar"); // 1 allocations + 1 move

std::string bar = "...";
foo(bar); // 1 copy + 1 move

As the move is cheap (possibly swapping values on the stack, and not copying the content on the heap of std::string), passing by value has more advantages than passing by const-reference.

Pass parameters by value or reference

Pass "big"/"complex" types by reference. Pass "small"/"simple" types by value.

Big and small depend on the context, but if void foo(int, int, int, int, int) is an acceptable function signature, then so should be struct p{int a; int b; int c; int d; int e;}; void foo(p);, as it has the same performance implications.

Complex types are those that have a non-default copy-constructor, like containers from the standard library.

When working with multiple threads, passing big types by value might be more performant than sharing them, and synchronizing access.

Also passing by value creates optimization opportunities that do not depend on inlining, like tail call optimizations.

Return by value

Returning by value is the fastest way, with some care, to return a value from a function.

Until C++11 the language needed the definition of a copy-constructor which might have been expensive, but in practice, it did not need to be called, as RVO (return value optimization) was a thing at least already in 1996! I did not test with such old compilers if RVO could be forced though.

With C++11, thanks to move semantic, even if RVO cannot be applied or a class did not have a copy-constructor, the cases were returning by value was not an option shrunk drastically, as a move constructor is sufficient.

With C++17, which adds structured bindings and mandates RVO for some cases, I believe there are no more use-cases where one should prefer one or multiple out-parameter, it does not matter how big or complex the data-structure is.

Note: In-out parameters cover a different use-case, so they should not be dismissed.

Avoid return by const-value

It disables RVO, move semantic, and does not make sense.

From the caller’s perspective const int foo() and int foo() is the same, because the returned value is an int. It’s up to the caller to store it in const int, int, or even something else.

Avoid return by pointer

Allocating and returning an owning pointer has been a common workaround before C++11, when a copy-constructor was not available. Even if it was possible to emulate (with RVO) move semantic, I am not aware of any project doing it.

In this case, returning by pointer is a pessimization as it requires an unnecessary allocation, introduces an indirection and possibly aliasing and a possibly null-state, and also makes the code harder to maintain.

Allocation

As mentioned previously, iterating multiple times generally costs less than allocating memory.

Some containers of the standard library have a reserve method for preallocating enough memory. Since C++11, all containers from the standard library can be initialized directly, which ensures that memory will be claimed only once.

Conditional logging

void log(const std::string& message);
log("Hello"); // bad allocates even if log is disabled and string discarded

This logging interface has the disadvantage that it makes it hard to avoid unnecessary computations, as the caller should check each time if logging is enabled or not. Provide a "lazy interface" (either directly or wrap the functionality), there are many possibilities:

  • Instead of std::string, use std::string_view if \0-termination is not important

  • Use a format interface, so that the conversion happens only if necessary

  • Use a callback approach, the callback is called only if logging is active

A possible wrapper could look like that:

#include <string>


enum class level {info, warning, error};
level currentlevel(); // query current log level
void log(level, const std::string&);

// Usage:
// log(level::info, "Hello World!"); // bad, allocates string even if unused
// log(level::info, "Hello " + username); // bad, concatenation and allocation even if unused



// support both const char*, std::string and functions-like by default
constexpr const char* to_str(const char* s) noexcept {
  return s;
}
const std::string& to_str(const std::string& s) noexcept {
  return s;
}
// can be extended for other types too if it makes sense
struct third_party_string{ /*...*/ };
const char* to_str(third_party_string); // or const std::string/whatever

template<class T> constexpr auto call_or_convert_to_str_i(T&& obj, int) noexcept -> decltype(to_str(obj)) {
  return to_str(obj);
}
template<class T> constexpr auto call_or_convert_to_str_i(T&& obj, long) -> decltype(obj()) {
  return obj();
}

template<class T> std::string call_or_convert_to_str(T&& obj) {
  return call_or_convert_to_str_i(obj, 0);
}

template <class M>
void logw(level l, M&& msg){
    if(currentlevel() == l){
        log(l, call_or_convert_to_str(msg));
    }
}


void foo(){
    logw(level::info, "This creates a std::string iff currentlevel() == level::info");
    logw(level::info, [](){ return "do some complex computation iff currentlevel() == level::info";});
}

Avoid logging altogether

Especially in pure methods, input-output is sufficient for reproducing everything.

The definition of pure depends on context (and programming language), as in C++ it is possible to take the address of nearly everything, defining pure is not easy. I tend to find constexpr a good definition of pure, even if in practice it might be too restrictive.

Functions that use external resources (like files, stdin, stdout, stderr, …​) are generally non-pure, as those have a global scope. Memory is generally an external resource, but in case of absence of memory, logging will hardly work, and make the issue harder to diagnose.

Notice that the compiler (gcc and clang have the same interface) can also instrument code to easily add traces everywhere (no parameters, unless creating plugin).

Avoid deleting all copy operations (for the sake of performance)

Otherwise, the user needs to save all input parameters of the constructor for creating a copy, which is generally more costly.

Also for debugging being able to copy a value is useful, and for testing postconditions it might be necessary to copy a value before mutating. Last but not least, copying can be useful for providing a stronger exception guarantee.

Avoid entangling logic and IO together

IO is slow compared to, for example, parsing something from a buffer.

Data structures

  • Prefer contiguous containers.

  • If the size is known at compile-time and does not change, prefer array.

  • If the size is known at runtime/might change, prefer string/vector.

std::set is, in my experience, overused, especially for small containers or constants. Creating a vector (or array), sorting it, and removing duplicates is generally faster, and also searching elements in a sorted std::vector is generally equivalent.

create set versus vector

std::map preserves order, std::unordered_map is generally faster. An array of std::pair (like in ct_map) is also a viable alternative.

In general, avoid rewriting container classes, especially std::string. It has non-trivial optimizations like SSO, move semantic, amortized reallocation cost, …​ Thus prefer to use the standard library, but acknowledge that it does not cover efficiently all use-cases.

Big-O notation

Should not be the main reason for choosing one algorithm over another. In many cases, it won’t give the correct solution, as cache locality plays, on modern hardware, a gigantic role.

Unless the data is very big (measure), everything else is secondary, unless the operations are really slow.

It also does not map very well on reality/physics, for example accessing an address in RAM is not constant time.

This is why, for example, std::vector should be almost always preferred over std::list.

Avoid shared ownership

shared_ptr

Prefer value on the stack, if a value on the heap is required, prefer std::unique_ptr. Shared ownership is difficult to reason about, and as the internal counter of std::share_ptr is also thread-safe, the compiler is unable to do many optimizations.

If std::shared_ptr is necessary, remember to use std::make_shared to avoid one allocation (notice that std::make_shared_ptr has one call to new, while shared_ptr_with_new has two.

Algorithms

I stumbled upon Optimized C++ by Kurt Guntheroth, Optimize String Use: A Case Study and asked myself, if how I would have implemented the solution would have beaten all proposed equivalent solutions (ie, I ruled the solution that does not use std::string out).

The most easy-to-implement solution consists of a couple of lines of code (I did not use a lambda even if the book is from 2016):

bool pred(char c){
  return c < 0x20;
}
std::string remove_ctrl_erase_alg(std::string s) {
  s.erase(std::remove_if(s.begin(), s.end(), pred), s.end());
  return s;
}

It is surely simpler and easier to maintain than many proposed solutions, but is it performant?

Compared to remove_ctrl_block_args, remove_ctrl_erase_alg seems to be as fast. It would be faster if there were never elements to remove, but in general, there would be some.

remove_ctrl_erase_alg has two downsides: it copies the whole string, thus generally allocates more memory than necessary, and for deleting the characters it needs to shift them to the end, which might move a lot of them if the characters to be deleted are at the beginning.

std::string copy_if_alg(const std::string& s) {
  auto num_el = std::count_if(s.begin(), s.end(), pred);
  std::string ss(num_el, '\0');
  std::copy_if(s.begin(), s.end(), ss.begin(), pred);
  return ss;
}

copy_if_alg overcomes those limitations, even if it doubles the number of branches, and removes a trivial bulk copy-operation. It also needs to iterate the original string twice, but as allocating is orders of magnitude slower, it should not be expected to be an issue, and indeed, this algorithm outperforms remove_ctrl_block_args, and all others presented in Optimized C++ by Kurt Guntheroth.

There are other possible improvements, but without measuring those are not worth it, as they would make the code more complex/less maintainable and depend on the compiler’s ability to optimize code, on the input strings, on the runtime, and so on.

Exceptions

Use them for exceptional circumstances (unfortunately it’s hard to define what they are), as the exception path is not optimized much, but the non-exception path is generally more performant than testing for return codes.

There is no reason why the cold path could not be optimized more, it’s just that because of how exceptions are used (or disabled in some projects) that there seems to be not that much interest to improve the situation.

noexcept

noexcept, as optimization, makes little difference most of the time, even if it can help to remove dead code/remove hidden code paths. It also helps to reason about possible failures, and about exception safety.

But for some functions it is important to be noexcept for a performance perspective:

  • destructor

  • move operations

  • swap

Notice that compiler-generated is already optimal, so do not implement those functions just for adding noexcept.

prefix vs postfix operators

++it should be the preferred form when advancing.

In general, it does less work than the postfix operator (it++;). For trivial types (integer, pointer, …​) the compiler will generate the same code with minimal optimizations enabled, but for non-trivial types not necessarily. Using prefix over postfix for all types increases consistency, and avoids making it easy to overlook those types where the difference might be important.

When iterating over all elements, a for-each-loop (for(const auto& v : c){}) can also generate better code than a hand-made loop:

for( auto it = c.begin() ; it != c.end(); ++it){
  // ...
}

Create proper abstractions/invariants

For example, instead of using integers as port numbers, define a structure.

A class port; can have a constructor that establishes an invariant (probably a range of valid values). In this case, instead of verifying in multiple routines if the passed parameter looks like a valid port number, the check needs to be implemented only once in the constructor.

Generally this means avoiding doing the same test multiple times, which might be a performance boost. It also zeroes the possibility to use a non-validated port number and avoids having inconsistent tests.

Avoid the C++ stream facilities

The current design of the stream facilities (especially class inheritance), makes it much slower than its C counterpart. Also, the interface is not as user-friendly. Many programmers use std::endl instead of '\n' for appending a newline, which flushes unnecessarily the buffer, and interleaved calls to output stream are problematic. Also, internationalization is harder to achieve, and generally formatting output is a pain.

std::printf does not have all these drawbacks, but is very error-prone to use, and cannot be customized with other types.

Never facilities, like std::fmt and std::to_string, try not to repeat the same errors.

Also SG16 considers using as default output stream stdout and not std::cout for the new print facility.

Using std::puts for printing already formatted strings and avoiding stringstream for converting a number to strings, can not only increase performance but reduce compilation times and binary size too.

Harmful pieces of advice

Those are advice that without measuring or further reasons are generally harmful, not only because they tend to produce bloated binaries, but also because they make the code harder to maintain, to understand, and probably do not have real benefits.

  • Performant code is low-level code (like assembly, macros, C instead of C++, and/or manually managing owning pointers and other resources)

  • Cache everything, every value needs to be stored in a temporary variable

  • log everything (function, function parameters, intermediary values, time, …​)

  • validate every parameter at every function call

  • no (mutable) references, only pointers

  • Do not take advantage of specific language features, or impose cultural restrictions, for example

    • everything is subclass of BaseClass

    • all member functions should be virtual

    • all destructors should be virtual

    • only member functions

    • no templates

    • no exceptions

    • no assertions

    • no standard library

Compiler optimization we can/should rely on

I argued that we should not rely on compiler optimization for the correctness of our programs. There are of course exceptions.

Some facilities have been designed to be zero cost abstraction, it means ideally without any space and time overhead. In practice there might be a cost, but it will be negligible in most cases. One needs to carefully measure the space-time impact, compile-times and eventually look at the assembly before trading one of those higher-level facilities for a lower level one. And those consideration will probably be compiler-dependent, so it’s important to document in which environment the decision has been made.

Library

  • std::unique_ptr versus raw owning pointers

  • std::array versus array

  • std::tuple versus struct

  • std::span versus pointer plus length

  • std::string_view versus char*

  • std::vector/std::string versus manually managing a dynamic array of elements

Language

  • return value optimization, move/value semantic, copy elision

  • cost of calling a function

  • small parameters

  • struct with one member (no difference in padding and alignment)

  • for-each loop versus manual loop

  • lambda versus not using lambda

  • struct of types versus contained types (padding/alignment excluded)

Normally, with carefully designed higher-level constructs, the compiler can even do better optimization because those constructs have a semantic meaning. Also aliasing can be ruled out by proper encapsulation, and invariants make it easier to apply some optimizations. With lower-level constructs one needs to analyze a bigger part of the programs.

There might be circumstances where those facilities have a measurable overhead, those are exceptional, but avoiding them without measuring is a premature optimization.


Do you want to share your opinion? Or is there an error, same parts that are not clear enough?

You can contact me here.