The C++ logo, by Jeremy Kratz, licensed under CC0 1.0 Universal Public Domain Dedication

Recursive mutex

I’m not a thread expert; I barely know how a mutex, semaphore, condition variables, monitor, and other primitives work. In part, it is also tricky to define how they work because different languages and environments use different definitions or have different properties.

So take these notes with a grain of salt.

What I’ve noticed is that how to handle multiple threads is still an unknown practice for most people, and the most common approach is to add a recursive mutex as a member variable and lock it in every method.

I avoid as much as possible to share mutable data between threads, and my use-cases are simple enough that a mutexed_obj are well covered. It is generally better than adding a mutex to an existing class, because it makes it easy to avoid multiple types of errors, and documents much better than a comment what data is protected by the mutex.

One customization point of mutexed_obj is the type of mutex that can be used; the default one is std::mutex, and not std::recursive_mutex. The opinion I got by reading some material on the topic is that a mutex is, in general, better than a recursive mutex (and obviously, I’ve also read opinions that argue the opposite). I came to a similar conclusion, but have never written down what the advantages of not using a non-recursive mutex when you do not need it.

In some environments, there is not even a choice. In Java every object can be used as a recursive mutex, you have to get out of your way and use java.util.concurrent.locks for something else. In other environment, there is a choice, and authors have a strong opinion 🗄️ against the recursive mutex.

All examples ignore the existence of mutexed_obj, which would simplify some code.

What is a recursive mutex?

A recursive mutex, in some environments called a reentrant mutex, is a mutex that can be locked multiple times from the same thread without causing a deadlock:

std::mutex m1;
auto g11 = std::lock_guard(m1); // locks
auto g12 = std::lock_guard(m1); // tries to lock, causes dead-lock


std::recursive_mutex m2;
auto g21 = std::lock_guard(m1); // locks
auto g22 = std::lock_guard(m1); // does not lock

Being able to avoid some deadlocks seems like a useful additional property, but in most cases, they can be avoided through other means, and taking advantage of this property can lead to code that is less performant or that might deadlock for other reasons.

Deadlocks with multiple instances

Consider the following class Data:

#include <mutex>

class Data {
  std::recursive_mutex m_mutex;
  int m_data;

public:
  void add(int amount) {
    auto g = std::lock_guard(m_mutex);
    m_data += amount;
  }

  void transfer(Data& other, int amount) {
    auto g = std::lock_guard(m_mutex);
    if (m_data >= amount) {
      this->add(-amount);
      other.add( amount);
    }
  }
};

The class looks straightforward, the mutex is always locked before reading the data, and since it’s recursive, there seems to be no deadlock when transfer calls add.

But if you write down some possible execution paths …​

no deadlock
w1(10);
w2(10);

thread1: w1.transfer(w2, 7);
         lock w1.m_mutex
         verify w1.m_data(10) >= amount(7)
         lock w1.m_mutex
         w1.m_data(10) -= 7 -> w1.m_data == 3
         unlock w1.m_mutex
         lock w2.m_mutex
         w2.m_data(10) += 7 -> w2.m_data == 17
         unlock w2.m_mutex
         unlock w1.m_mutex
         return

thread2: w2.transfer(w1, 5);
         lock w2.m_mutex
         verify w2.m_data(17) >= amount(5)
         lock w2.m_mutex
         w2.m_data -= 5 -> w1.m_data == 12
         unlock w2.m_mutex
         lock w1.m_mutex
         w1.m_data(3) += 5 -> w1.m_data == 8
         unlock w1.m_mutex
         unlock w2.m_mutex
         return

w1 contains 8
w2 contains 12
no deadlock
w1(10);
w2(10);
thread1: w1.transfer(w2, 7);
thread2: w2.transfer(w1, 5);

thread1: lock w2.m_mutex
         verify w2.m_data(10) >= amount(5)
         lock w2.m_mutex
         w2.m_data -= 5 -> w1.m_data == 5
         unlock w2.m_mutex
         lock w1.m_mutex
         w1.m_data(10) += 5 -> w1.m_data == 15
         unlock w1.m_mutex
         unlock w2.m_mutex
         return

thread2: lock w1.m_mutex
         verify w1.m_data(15) >= amount(7)
         lock w1.m_mutex
         w1.m_data(15) -= 7 -> w1.m_data == 8
         unlock w1.m_mutex
         lock w2.m_mutex
         w2.m_data(5) += 7 -> w2.m_data == 12
         unlock w2.m_mutex
         unlock w1.m_mutex
         return

w1 contains 8
w2 contains 12
no deadlock
w1(10);
w2(10);
thread1: w1.transfer(w2, 7);
thread2: w2.transfer(w1, 5);

thread1: lock w1.m_mutex
         verify w1.m_data(10) >= amount(7)
         lock w1.m_mutex
         w1.m_data(10) -= 7 -> w1.m_data == 3
         unlock w1.m_mutex
         lock w2.m_mutex
         w2.m_data(10) += 7 -> w2.m_data == 17
         unlock w2.m_mutex

thread2: lock w2.m_mutex
         verify w2.m_data(17) >= amount(5)
         lock w2.m_mutex
         w2.m_data -= 5 -> w1.m_data == 12
         unlock w2.m_mutex
         lock w1.m_mutex -> hangs

thread1: unlock w1.m_mutex -> unblocks thread2
         return

thread2: w1.m_data(3) += 5 -> w1.m_data == 8
         unlock w1.m_mutex
         unlock w2.m_mutex
         return

thread1: return

w1 contains 8
w2 contains 12

So far, everything looks good. We also always get the same result; there seems to be no data race or race condition.

deadlock
w1(10);
w2(10);
thread1: w1.transfer(w2, 7);
thread2: w2.transfer(w1, 5);

thread1: lock w1.m_mutex
         verify w1.m_data(10) >= amount(7)
thread2: lock w2.m_mutex
         verify w2.m_data(17) >= amount(5)
         lock w2.m_mutex
         w2.m_data -= 5 -> w1.m_data == 12
         unlock w2.m_mutex
         lock w1.m_mutex -> hangs
thread1: lock w1.m_mutex
         w1.m_data(10) -= 7 -> w1.m_data == 3
         unlock w1.m_mutex -> thread 2 still blocked
         lock w2.m_mutex -> hangs

As this example shows, there is a deadlock. It might not be obvious when reading the code the first time, but since every Data class has its own mutex, there are two mutexes that are locked, and they might not always get locked in the same order.

Assuming that it makes sense that every instance has its own mutex, using a non-recursive mutex would have made the issue easier to spot, even when testing the code with a single thread. The equivalent code with a non-recursive mutex does not look that different:

#include <mutex>

class Data {
  std::mutex m_mutex;
  int m_data;

private:

  void add_unlocked(int amount){
    m_data += amount;
  }

public:
  void add(int amount) {
    auto g = std::lock_guard(m_mutex);
    add_unlocked(amount);
  }

  void transfer(Data& other, int amount) {
    auto g = std::scoped_lock(this->m_mutex, other.m_mutex);

    if (m_data >= amount) {
      this->add_unlocked(-amount);
      other.add_unlocked( amount);
    }
  }
};

This code avoids the deadlock, but it looks error-prone. With the recursive mutex, the guideline was to lock the mutex at the beginning of every method. Now the guideline is to do it only in public functions, and have no member functions call another public member functions internally, only private member functions. There is a chance of misplacing a lock_guard, either by forgetting to lock a mutex in a private member function, or by locking it in a public member function. If the mutex is locked in a private function, it would lead to a deadlock even in single-threaded code, so it should be an issue easy to replicate, find, and fix.

But if the mutex is not locked, there is a race condition, probably difficult to reproduce. An improved version could look like the following:

#include <mutex>

class Data {
  std::mutex m_mutex;
  int m_data;

private:

struct lock{
 lock(std::lock_guard<std::mutex>&){}
 template <typename... Args>
 lock(std::scoped_lock<Args...>&){}
};
void add_unlocked(int amount, lock){
  m_data += amount;
}

public:
  void add(int amount) {
    auto g = std::lock_guard(m_mutex);
    add_unlocked(amount, g);
  }

  void transfer(Data& other, int amount) {
    auto g = std::scoped_lock(this->m_mutex, other.m_mutex);
    if (m_data >= amount) {
      this->add_unlocked(-amount, g);
      other.add_unlocked( amount, g);
    }
  }
};

The inner lock structure acts as a type-erased wrapper over std::lock_guard and std::scoped_lock with no operations. The signature of add_unlocked might look strange, as it takes the type-erased lock and does not use it. This construct is a hint for the developer to not forget to lock the mutex. If one forgets to lock the mutex, it would end up with the following code:

  void add(int amount) {
    add_unlocked(amount, /* ??? */);
  }

When the developer double-checks the signature of add_unlocked, it hopefully realizes that it needs to lock the mutex before calling add_unlocked.

An even better alternative would be the following approach:

// hpp file
#include <mutex>

class Data {
  std::mutex m_mutex;
  int m_data;

  // no add_unlocked here!
public:
  void add(int amount);
  void transfer(Data& other, int amount);
};

// cpp file
namespace {
struct lock{
 lock(std::lock_guard<std::mutex>&){}
 template <typename... Args>
 lock(std::scoped_lock<Args...>&){}
};
void add_unlocked(int& m_data, int amount, lock){
  m_data += amount;
}
}

void Data::add(int amount) {
  auto g = std::lock_guard(m_mutex);
  add_unlocked(this->m_data, amount, g);
}

void Data::transfer(Data& other, int amount) {
  auto g = std::scoped_lock(this->m_mutex, other.m_mutex);
  if (m_data >= amount) {
    add_unlocked(this->m_data, -amount, g);
    add_unlocked(this->m_data,  amount, g);
  }
}

Since add_unlocked is not a member function of Data anymore, and does not take Data as a parameter, as it has no access to its private member variables, it cannot call by accident some public function Data and cause a deadlock.

So instead of a deadlock at runtime, one would have an error at compile-time.

Another advantage is that the implementation details are not unnecessarily exposed in the header file.

Deadlocks with a single instance

In the previous example, I had two instances that caused a deadlock, but a similar issue also exists where only a single instance is used.

Where it can lead to problems is if the mutex is not locked for the whole scope of a member function, and some code should be executed while the mutex is not locked:

class C{
  std::recursive_mutex m;

public:
  void a(){
    do_some_work();
    {
      auto g = std::lock_guard(m);
    }
    do_other_work();
  }

  void b(){
    auto g = std::lock_guard(m);
    a();
  }
}

Why is this problematic?

Because do_some_work and do_other_work might lock other mutexes and cause a deadlock, just like it happens when dealing with multiple instances. Another potential issue is that those functions access the same instance of C through another thread and cause a deadlock too.

And even if there is no deadlock, there might be other issues. For example, both functions can take a long time to execute, blocking anything that waits for the mutex to get unlocked.

Presumably, the code is using threads to get concurrency, but with this design, one is preventing concurrency.

And those issues only happen when b calls a.

Rewriting the code with std::mutex leads to the following class:

class C{
  std::mutex m;

  void a_impl(std::lock_guard<std::mutex>&){
    auto g = std::lock_guard(m);
  }
  void a(){
    do_some_work();
    {
      auto g = std::lock_guard(m);
      a_impl(g);
    }
    do_other_work();
  }

  void b(){
    auto g = std::lock_guard(m);
    a_impl(g);
  }
}

With the current design, if do_some_work or do_other_work lock the mutex from the same or another thread, it’s not an issue anymore.

One might argue that if do_some_work and do_other_work do not access C, so there is no real issue here, but why take the chance and verify if this assumption is correct every time some code changes?

Reentrancy

A similar issue happens in general with reentrancy; it is not specific to multi-threading or mutexes.

Consider for example a global vector of callbacks

struct {
private:
  using callback = void() noexcept;
  std::vector<callback*> data = {};
public:
  void add(callback* c){
    data.push_back(c);
  }

  void notify(){
    for(const auto& v : data){
      v();
    }
  }
} callbacks;

What happens if a callback calls callbacks.add? In C++, this is undefined behaviour, interestingly, also in Java 🗄️:

Note that Iterator.remove is the only safe way to modify a collection during iteration; the behavior is unspecified if the underlying collection is modified in any other way while the iteration is in progress.

The easiest way to fix this problem is to copy the container, and iterate over the local copy:

struct {
private:
  using callback = void() noexcept;
  std::vector<callback*> data = {};
public:
  void add(callback* c){
    data.push_back(c);
  }

  void notify(){
    auto copy = data;
    for(const auto& v : data){
      v();
    }
  }
} callbacks;

And since callbacks is global, it should protect data from concurrent modifications

struct {
private:
  using callback = void() noexcept;
  std::mutex m;
  std::vector<callback*> data = {};
public:
  void add(callback* c){
    auto g = std::lock_guard(m);
    data.push_back(c);
  }

  void notify(){
    auto copy = [&]{
      auto g = std::lock_guard(m);
      return data;
    }();
    for(const auto& v : copy){
      v();
    }
  }
} callbacks;

Note an important distinction; callbacks::m protects callbacks::data, but not what is contained in callbacks::data!

Thus it is not an issue that the callbacks are executed while the mutex is unlocked. Trying to do protect the pointed data, even with a std::recursive_mutex, might lead to deadlocks.

struct {
private:
  using callback = void() noexcept;
  std::recursive_mutex m;
  std::vector<callback*> data = {};
public:
  void add(callback* c){
    auto g = std::lock_guard(m);
    data.push_back(c);
  }

  void notify(){
    auto g = std::lock_guard(m);
    for(const auto& v : data){
      v();
    }
  }
} callbacks;

If v() triggers callbacks::add or callbacks::notify from another thread, it will deadlock.

Performance

I heard the same argument both ways.

Locking a mutex does not cost much time if no other thread has locked it; it does not cost much time.

Locking is a costly operation, so it should be avoided if possible.

So I wrote a dumb test and measured the overhead of locking and locking recursively.

struct Fibonacci1 {
  int calculate(int n) {
    if (n <= 1) {
      return n;
    }
    return calculate(n - 1) + calculate(n - 2);
  }
};

struct Fibonacci2 {
  std::mutex m;
  int calculate_impl(int n) {
    if (n <= 1) {
      return n;
    }
    return calculate_impl(n - 1) + calculate_impl(n - 2);
  }

  int calculate(int n) {
    auto g = std::lock_guard(m);
    return calculate_impl(n);
  }
};

struct Fibonacci3 {
  std::recursive_mutex m;
  int calculate(int n) {
    auto g = std::lock_guard(m);
    if (n <= 1) {
      return n;
    }
    return calculate(n - 1) + calculate(n - 2);
  }
};

Fibonacci1 does not use a mutex at all, Fibonacci2 locks an std::mutex only once, and Fibonacci3 locks a std::recursive_mutex on every call.

Depending on the input parameter, the performance differences between std::recursive_mutex and the other solutions vary a lot.

Performance difference with different mutexes
Performance difference with different mutexes

I assume this is one of the worst scenarios one can encounter. Under test load, the performance is acceptable, and in production, with different values, the system struggles.

Note 📝
I am aware that a non-recursive implementation is possible and much faster than all proposed variants, but that was not the point of the benchmark.

This is not an inherent issue of the fact that the implementation is not good; the implementation using std::recursive_mutex is orders of magnitude slower than the one using std::mutex, which performs more or less like the version without a mutex at all.

I am also aware that the benchmark is synthetic; the code is not doing that much, so it makes sense that the cost of locking and unlocking is greater than adding some numbers together.

But it still stands that using locking a std::recursive_mutex in a recursive function, although it sounds like a sensible approach, can have issues, even if not related to multitasking.

There is also something else to keep in mind. For some recursive algorithms, the compiler can do a so-called tail-call 🗄️ optimization. Locking a recursive mutex at every recursive call removes this optimization opportunities, because the last action is unlocking the mutex, not the recursive call anymore.

Global variables

A mutex makes sense when sharing mutable data between threads. Often, this data is stored in a global variable.

std::mutex has a constexpr constructor, thus it is possible to initialize a global mutex at compile-time

constinit auto m1 = std::mutex();

A std::recursive_mutex does not have a constexpr constructor, so it cannot be initialized with constinit. As usual, global variables (and constants) that are initialized at runtime are problematic, not only because of the indeterminate order of dynamic initialization, but also especially when used in libraries. Granted, this is a specific issue of C++, not other languages.

Conclusion

My takeaway is to prefer the simplest tool to accomplish the desired task.

Using std::mutex instead of std::recursive_mutex might require changing the implementation of some functions, but does not require changing the API of a class.

The implementation of the class might seem more complex, but it makes verifying, or at least testing the correctness of it, easier.

Since finding errors in multi-threaded code, or verifying its absence, I prefer to proceed with caution and make the code as robust and clear as possible.

Using a recursive mutex does not automatically mean that the code is wrong, but determining if the class uses it correctly might take quite some time. At that point, I prefer to spend the time to rewrite the class to use a non-recursive mutex, and apply a pattern that makes it easier to verify the correctness in the future, and avoid introducing errors by accident.

I know it is easier said than done, and the places where I normally do not replace a recursive mutex are those where it is not clear, at least to me, how the class is supposed to work.


If you have questions, comments, or found typos, the notes are not clear, or there are some errors; then just contact me.