The C++ logo, by Jeremy Kratz, licensed under CC0 1.0 Universal Public Domain Dedication

Find and fix code smells in C++ with regexes

Even if everyone knows that search-and-replace is a rudimentary tool for editing source code, it is still one that is very effective and with a very small learning curve.

In particular, it is available in nearly all editors, IDE, and command-line tools. It is also a technique that works with all programming languages that use text for representing source code (one could also use it when programming in Scratch, but I doubt it will be that useful…​).

Because of its simplicity, it is the first tool one learns to use when refactoring some code, like renaming variables or functions.

But search and replace, together with regexes, can be used, with an important caveat, for doing more complex refactorings, like detecting and correcting code smells.

Issues with this approach

Searching and replacing is not an appropriate tool for refactoring C++ source code. There are multiple technical issues, that make it impossible to write an error-free search and replace strategy.

The first is function overload. For example, suppose you have void foo(int) and void foo(const char*) but want to rename only the first overload to bar. Replacing foo with bar obviously will not work, as it will rename all overloads, and given the piece of code

int a = 0;
foo(a);

There is no way, without parsing the code, that a regex can decide if we need to replace foo with bar or not.

Advantages of this approach

Some IDE offers refactoring tools, like renaming a variable. Unfortunately, those are often bundled with the IDE, and cannot be used without it.

You can try to configure multiple IDEs on the same codebase, most of the time it’s a pain, and other times not even possible as not all IDEs are multiplatform, thus in the end you’ll have only a subset of tools at your disposal. Also, such tools might work only on code that is currently compiled to give a better result. As a side effect, conditionally compiled code, and code commented out (like examples in the documentation) will not be changed, which might or might be not a good thing.

Another disadvantage is that often tools bundled with an IDE cannot be executed automatically, so it is more difficult to explain how to obtain the desired result. With a command line tool, it should be enough to write the command one has used, and the transformation can easily be verified in a code review.

Last but not least, tools that need to parse C++ (or C) and also take macros into account are generally much slower unless all parsing information has been created before.

In more than one project I had "search and replace" doing a remarkable job on the whole codebase in less than a couple of seconds, and clang-tidy taking nearly an hour, for example, to replace NULL with nullptr. To be fair clang-tidy also replaced other constructs with nullptr, like 0 or {}. Even if there were a handful of changes, waiting a couple of hours was probably not worth it.

Even if search and replace approach cannot replace a more robust tool for refactoring, it is often Good Enough™.

In particular, it is incredibly fast, and together with a revision control system and a compiler, it is invaluable for many tasks.

Some considerations before beginning

Use a code formatter

Unify formatting. It makes it easier to write regexes. Suppose you want, for example, to change a type from foo<bar> to foo_bar. If the formatting is not consistent, you might miss places where someone wrote, for example, foo< bar >.

Also having a code formatter ensures that the replaced code can be formatted automatically, this helps to make the whole search-and-replace process easier.

Since you are using a formatter, you could even format temporarily your code to make it easier to write regexes for it, and then format it back. The most useful thing would be to have every statement on a single line. Otherwise, you will miss places that are written on multiple lines. Since formatting the whole codebase might take some time, this is normally not an option.

Writing regexes is already hard enough. Debugging regexes is even harder. Regexes for C++ adds another layer of difficulty. Writing multiline regexes for C++ is something that you do not want to do!

Warning ⚠️
In case you skipped this section, read again. A formatter will make your job easier!
Warning ⚠️
Do not apply the presented regexes in your code base. Read them, try to understand them, and only then apply them.

Which regex?

Examples are written for GNU sed 🗄️ and GNU Grep 🗄️, without the extended regular expressions flag (-E, -r, --regexp-extended) so some adjustments might be necessary.

Both programs should be available on all platforms, Windows and Android included.

Different programs might use different regex dialects

In particular, you should pay attention to

  • \b, which denotes a word boundary

  • \1, \2, …​ which is for inserting (or matching) the first, second, …​ capture

  • \( and \) for starting and finishing a capture

  • a whitespace character instead of \s for denoting space, as it is more readable. Again: format your code.

  • ( and ) means literally ( and ), in other dialects you need to escape those, like \( and \)

  • * to match zero or more instances

  • \{0,1\} to match zero or one instances

  • \{1,\} to match one or more instances (although in this case, writing one match and then zero or more might be more legible)

For example in vim, instead of \b, one needs to use \< and \>, for example :%s/\<NULL\>/nullptr instead of sed -e 's/\bNULL\b/nullptr/g'.

Note 📝
It is possible to execute sed from vim. To execute sed on the current buffer, use :%!sed -e <pattern>. Other editors or IDE might provide similar options.

Code smells with pointers

Code older than C++11 might have used std::auto_ptr and replaced most instances with std::unique_ptr. As the interface between the two functions is slightly different, and as there is the factory function std::make_unique since C++14, I noticed some code smells that were easy enough to "fix" with a regex.

Use nullptr instead of NULL

Simply using something like

sed -e 's/NULL/nullptr/g';

is not enough, as NULL could be part of another word or macro, like MY_NULL, or be part of a string literal.

Something more robust would be

sed -e 's/\bNULL\b/nullptr/g';
Before After Correct?

NULL

nullptr

nullptr

nullptr

MYNULLMACRO

MYNULLMACRO

foo(0)

foo(0)

foo(NULL)

foo(nullptr)

a=NULL

a=nullptr

" a=NULL "

" a=nullptr "

✗, wrong replacement(!)

This fix/enhancement should be one of the first refactorings to be done when refactoring smart pointers. It helps, for example, for a smooth transition from naked pointers and std::auto_ptr to std::unique_ptr and std::shared_ptr where those are useful.

For example std::unique_ptr<int>(NULL); does not compile, but std::unique_ptr<int>(nullptr); does.

Note that simply replacing NULL with nullptr might break some code, for example

char c = NULL;
// or
void foo(int);
foo(NULL);

but I would argue that this code is misleading, and should be changed (in a separate commit) to

char c = '\0';

foo(0);

Avoid std::auto_ptr

Since std::auto_ptr has been deprecated, and there are better replacements, it should be removed and replaced with something else.

sed -e 's/std::auto_ptr/std::unique_ptr/g';

should be sufficient for most cases.

One could argue that

sed -e 's/\bstd::auto_ptr\b/std::unique_ptr/g';

might be better 🗄️, as it avoids matching something like mystd::auto_ptr_int, but the chances are pretty low.

Other adjustments might be necessary, normally it involves applying std::move where auto_ptr made a copy.

Before After Correct?

std::auto_ptr<int>

std::unique_ptr<int>

std::auto_ptr <int>

std::unique_ptr <int>

mystd::auto_ptr<int>

mystd::auto_ptr<int>

std::auto_ptr_a<int>

std::auto_ptr_a<int>

Compare with nullptr

A std::auto_ptr a; can be compared with nullptr (or NULL) through the member function get: a.get() == nullptr.

For std::unique_ptr this is not necessary, as operator== (and operator!=) has an appropriate overload.

Using the member function get is not harmful, but it seems to confuse some developers, some even end up writing a != nullptr && a.get() != nullptr. (One could blame the developers, I think it’s just easier to make the code consistent, remove the function call, and eventually explain the change during a code review).

For simplicity and consistency, removing all the unnecessary calls to get helps to avoid this type of confusion.

sed -e 's/\.get( *) *== * nullptr/==nullptr/g';

This pattern has false positives, as .get() might be a member function of something else returning something comparable to nullptr, and removing it would make the code invalid.

It also does not match nullptr == a.get(). It is more difficult to write an appropriate substitution regex for it, as a simple approach would break expressions like nullptr == foo(a.get()).

A probably good enough pattern is

sed -e 's/nullptr == \([^(]*\)\.get()/nullptr == \1/g';

even if it misses things like nullptr == foo().get().

The pattern should then be generalized also for handling operator!=, the final version looks like 🗄️:

sed -e 's/\.get( *) *\([!=]\)= *nullptr/ \1= nullptr/g' -e 's/nullptr \([!=]\)= \([^(]*\)\.get()/nullptr \1= \2/g';
Before After Correct?

a.get() == nullptr

a == nullptr

a.get() == NULL

a.get() == NULL

✓, as a == nullptr would not compiler. Replace NULL with nullptr before changing smart pointers!

a.get() != nullptr

a != nullptr

nullptr == a.get()

nullptr == a

get() == nullptr

get() == nullptr

a->get() == nullptr

a->get() == nullptr

a.get() = nullptr

a.get() = nullptr

a.get() == 1 && b == nullptr

a.get() == 1 && b == nullptr

nullptr == b && 1 == a.get()

nullptr == b && 1 == a

✗, wrong replacement(!)

foo().get() == nullptr

foo() == nullptr

nullptr == foo().get()

nullptr == foo().get()

✗, not replaced

a(b.get()) == nullptr

a(b.get()) == nullptr

nullptr == a(b.get())

nullptr == a(b.get())

Return std::unique_ptr holding nullptr

Instead of writing return std::unique_ptr<int>(); or return std::unique_ptr<int>(nullptr);, it is much simpler to write (and read) return nullptr.

The pattern 🗄️ could also be applied not only to return statements but to avoid too many changes at once I preferred to handle return statements in separate commits.

sed -e 's/return std::unique_ptr<.*>( *nullptr *) *;/return nullptr;/g' -e 's/return std::unique_ptr<.*>( *) *;/return nullptr;/g';
Before After Correct?

return std::unique_ptr<int>();

return nullptr;

return std::unique_ptr<int>(nullptr);

return nullptr;

return std::unique_ptr<int>(a);

return std::unique_ptr<int>(a);

return std::unique_ptr<T<2>>();

return nullptr

Another factor that led me to decide to write the regex this way, is that by replacing std::unique_ptr<int>() with nullptr, it would break code like auto a = std::unique_ptr<int>();.

Note that I’m also matching the ; at the end because I saw too many places that used a construct like std::unique_ptr().get() (true story 🤦).

Those can be adjusted with 🗄️:

sed -e 's/return std::unique_ptr<.*>( *nullptr *)\.get() *;/return nullptr;/g' -e 's/return std::unique_ptr<.*>( *)\.get() *;/return nullptr;/g';

without the need to edit the code by hand.

Before After Correct?

return std::unique_ptr<int>().get();

return nullptr;

return std::unique_ptr<int>(nullptr).get();

return nullptr;

return std::unique_ptr<int>(a).get();

return std::unique_ptr<int>(a).get();

return std::unique_ptr<T<2>>().get();

return nullptr

Note that this transformation might still break or change valid code, for example, foo and bar do not have the same function signature

auto foo(){
    return nullptr;
}

auto bar(){
    return std::unique_ptr<int>();
}

thus it is not always possible to simply replace std::unique_ptr<int>() with nullptr, even in return statements. For example, the first lambda would compile, but the second not:

auto l1 = [](bool b){
    if(b == 1){
        return std::make_unique<int>(42);
    }
    return std::unique_ptr<int>();
};

auto l2 = [](bool b){
    if(b == 1){
        return std::make_unique<int>(42);
    }
    return nullptr;
};

Return std::unique_ptr pointing somewhere

What about not nullptr? For example, when using new, like

return std::unique_ptr<int>( new int(42) );
// or
return std::unique_ptr<int>( new int );

In those cases, using make_unique 🗄️ should be the preferred approach.

sed -e 's/return std::unique_ptr<.*>( new \([a-zA-Z :_<>,0-9]*\)(\(.*\) *) *) *;/return std::make_unique< \1 >(\2);/g' -e 's/return std::unique_ptr<.*>( new \([a-zA-Z :_<>,0-9]*\) *);/return std::make_unique< \1 >();/g';
Before After Correct?

return std::unique_ptr<int>( new int(42) );

return std::make_unique<int>(42);

return std::unique_ptr<int>( new int );

return std::make_unique<int >();

✓, mostly

return std::unique_ptr<A<B>>( new A<B>(42) );

return std::make_unique<A<B>>(42);

return std::unique_ptr<A>( new B() );

return std::make_unique<B>();

might be problematic in some cases

return foo(std::unique_ptr<int>( new int(42) ));

return foo(std::unique_ptr<int>( new int(42) ));

✓, out of scope, not returning an unique_ptr

return std::unique_ptr<int>( new int(42) ).get();

return std::unique_ptr<int>( new int(42) ).get();

✗, not replaced, but returns an invalid pointer, it should not be relevant, better fix it directly instead of transforming it

return std::unique_ptr<int>( );

return std::unique_ptr<int>( );

return std::unique_ptr<int[]>( new int[2]{42, 43} );

return std::unique_ptr<int[]>( new int[2]{42, 43} );

✗, not replaced

Note that for new int, using make_unique won’t create a pointer pointing to an uninitialized integer, but a pointer to a zero-initialized integer. So there might be some differences at runtime, for a single int it is negligible, but for more complex structures it could be relevant, and for those cases, std::make_unique_for_overwrite should be preferred.

Thus a possible alternate sed command 🗄️ is

sed -e 's/return std::unique_ptr<.*>( new \([a-zA-Z :_<>,0-9]*\)(\(.*\) *) *) *;/return std::make_unique< \1 >(\2);/g' -e 's/return std::unique_ptr<.*>( new \([a-zA-Z :_<>,0-9]*\) *) *;/return std::make_unique_for_overwrite< \1 >();/g';
Before After Correct?

return std::unique_ptr<int>( new int(42) );

return std::make_unique<int>(42);

return std::unique_ptr<int>( new int );

return std::make_unique_for_overwrite<int >();

return std::unique_ptr<A<B>>( new A<B>(42) );

return std::make_unique<A<B>>(42);

return std::unique_ptr<A>( new B() );

return std::make_unique<B>();

might be problematic in some cases

return foo(std::unique_ptr<int>( new int(42) ));

return foo(std::unique_ptr<int>( new int(42) ));

✓, out of scope

return std::unique_ptr<int>( new int(42) ).get();

return std::unique_ptr<int>( new int(42) ).get();

✗, not replaced, returns an invalid pointer, it should not be relevant, better fix it directly instead of transforming it

return std::unique_ptr<int>( );

return std::unique_ptr<int>( );

return std::unique_ptr<int[]>( new int[2]{42, 43} );

return std::unique_ptr<int[]>( new int[2]{42, 43} );

✗, not replaced

The interesting part is also \([a-zA-Z :_<>,0-9]*\); a capture group for capturing the type, and paste it in \1. It does not detect all cases, for example, it won’t match A(), where A is defined as #define A() int, or decltype(1+1) but it is good enough.

It is also not strict enough for other situations, for example, int0 can be a type, but 0int not, so [a-zA-Z:_][a-zA-Z :_<>,0-9]* might be more appropriate, but I doubt it has any practical advantages. It would still fail to recognize that a:b or a< are not valid types.

I’ve also left arrays out. The pattern will not, for example, match std::unique_ptr<int[]>. The main reason is that in most cases, a std::vector is a better replacement, and simpler to use, and also because in the codebases I had to maintain yet, I’ve nearly never encountered such a type, so I did not have yet the need to think how to transform such code.

Note that this regex might change the type of a variable.

struct A0{ virtual ~A0() = default;};
struct A1 : A0{};


auto a = std::unique_ptr<A0>(new A1);

In this case, the sed command would change the type of a from std::unique_ptr<A0> to std::unique_ptr<A1>.

For supporting such cases, one should prefer a more complex expression 🗄️:

sed -e 's/return *std::unique_ptr<\(.*\)>( *new \([a-zA-Z :_<>,0-9]*\) *[{(]\(.*\)[)}] *) *;/return std::unique_ptr< \1 >( std::make_unique< \2 >(\3));/g' -e 's/return *std::unique_ptr<\(.*\)>( *new \([a-zA-Z :_<>,0-9]*\) *) *;/return std::unique_ptr< \1 >( std::make_unique< \2 >());/g';
Before After Correct?

return std::unique_ptr<int>( new int(42) );

return std::unique_ptr< int >( std::make_unique< int >(42));

return std::unique_ptr<int>( new int );

return std::unique_ptr< int >( std::make_unique< int >());

✓, mostly

return std::unique_ptr<A<B>>( new A<B>(42) );

return std::unique_ptr< A<B> >( std::make_unique< A<B> >(42));

return std::unique_ptr<A>( new B() );

return std::unique_ptr< A >( std::make_unique< B >());

return foo(std::unique_ptr<int>( new int(42) ));

return foo(std::unique_ptr<int>( new int(42) ));

✓, out of scope

return std::unique_ptr<int>( new int(42) ).get();

return std::unique_ptr<int>( new int(42) ).get();

✗, not replaced, returns an invalid pointer

return std::unique_ptr<int>( );

return std::unique_ptr<int>( );

return std::unique_ptr<int[]>( new int[2]{42, 43} );

return std::unique_ptr<int[]>( new int[2]{42, 43} );

✗, not replaced

Similarly, it is possible to create a variant that supports std::make_unique_for_overwrite.

There are now fewer chances to break something, but the code is much more verbose. It might be possible to simplify it with another regex (or create a more complex regex to avoid creating such verbose code), but personally, the previous, simpler, regex has always been good enough™.

Store result of new in unique_ptr

Replace snippets like std::unique_ptr<type> value( new type() ); with std::unique_ptr<type> value = std::make_unique<type>();.

Note, while normally I would prefer to use auto instead of typing type twice, the regex does not distinguish between member variables and others.

Since it is not possible to use auto for member variables, the regex will not replace the type with auto

sed -e 's/std::unique_ptr<\([a-zA-Z :_<>,0-9]*\)> \([a-zA-Z :_<>,0-9]\{1,\}\)( new \1(\(.*\)) );/std::unique_ptr<\1> \2 = std::make_unique<\1>(\3);/g'

Note how I’ve used [a-zA-Z :_<>,0-9]\{1,\} to ensure that there is a variable, name, oterwise the regex would detect and replace incorrectly std::unique_ptr<type>( new type() )

Before After Correct?

std::unique_ptr<type> value( new type );

std::unique_ptr<type> value( new type );

✗, not replaced

std::unique_ptr<type> value( new type() );

std::unique_ptr<type> value = std::make_unique<type>();

std::unique_ptr<type> value( new type(2) );

std::unique_ptr<type> value = std::make_unique<type>(2);

std::unique_ptr<type>( new type() );

std::unique_ptr<type>( new type() );

✓, not replaced, no variable name

std::unique_ptr<type>( new type(2) );

std::unique_ptr<type>( new type(2) );

✓, not replaced, no variable name

std::unique_ptr<type1> value( new type2() );

std::unique_ptr<type1> value( new type2() );

✓, not replaced, different types

std::unique_ptr<type1> value( new type2(2) );

std::unique_ptr<type1> value( new type2(2) );

✓, not replaced, different types

std::unique_ptr<type1>( new type2() );

std::unique_ptr<type1>( new type2() );

✓, not replaced, no name and different types

std::unique_ptr<type1>( new type2(2) );

std::unique_ptr<type1>( new type2(2) );

✓, not replaced, no name and different types

It is possible to support all listed examples by adding multiple regexes, but I’m going to leave cases like std::unique_ptr<type> value( new type ); out for brevity, as it would double the regexes, or make the proposed more complex.

For supporting the case of no variable name, something like the following command should do the job:

sed -e 's/std::unique_ptr<\([a-zA-Z :_<>,0-9]*\)>( new \1(\(.*\)) );/std::make_unique<\1>(\2);/g'

For supporting the example with different types, after the matching types have been handled, can be supported with:

sed -e 's/std::unique_ptr<\([a-zA-Z :_<>,0-9]*\)> \([a-zA-Z :_<>,0-9]\{1,\}\)( new \([a-zA-Z :_<>,0-9]*\)(\(.*\)) );/std::unique_ptr<\1> \2 = std::make_unique<\3>(\4);/g'

The last use-case is supported with, again, after supporting matching types:

sed -e 's/std::unique_ptr<\([a-zA-Z :_<>,0-9]*\)>( new \([a-zA-Z :_<>,0-9]*\)(\(.*\)) );/std::unique_ptr<\1>(std::make_unique<\2>(\3));/g'

Probably it is best to put them together, to ensure that matching types are replaced before working on different types:

sed \
  -e 's/std::unique_ptr<\([a-zA-Z :_<>,0-9]*\)> \([a-zA-Z :_<>,0-9]\{1,\}\)( new \1(\(.*\)) );/std::unique_ptr<\1> \2 = std::make_unique<\1>(\3);/g' \
  -e 's/std::unique_ptr<\([a-zA-Z :_<>,0-9]*\)>( new \1(\(.*\)) );/std::make_unique<\1>(\2);/g' \
  -e 's/std::unique_ptr<\([a-zA-Z :_<>,0-9]*\)> \([a-zA-Z :_<>,0-9]\{1,\}\)( new \([a-zA-Z :_<>,0-9]*\)(\(.*\)) );/std::unique_ptr<\1> \2 = std::make_unique<\3>(\4);/g' \
  -e 's/std::unique_ptr<\([a-zA-Z :_<>,0-9]*\)>( new \([a-zA-Z :_<>,0-9]*\)(\(.*\)) );/std::unique_ptr<\1>(std::make_unique<\2>(\3));/g'
Before After Correct?

std::unique_ptr<type> value( new type );

std::unique_ptr<type> value( new type );

✗, not replaced

std::unique_ptr<type> value( new type() );

std::unique_ptr<type> value = std::make_unique<type>();

std::unique_ptr<type> value( new type(2) );

std::unique_ptr<type> value = std::make_unique<type>(2);

std::unique_ptr<type>( new type() );

std::make_unique<type>();

std::unique_ptr<type>( new type(2) );

std::make_unique<type>(2);

std::unique_ptr<type1> value( new type2() );

std::unique_ptr<type1> value = std::make_unique<type2>();

std::unique_ptr<type1> value( new type2(2) );

std::unique_ptr<type1> value = std::make_unique<type2>(2);

std::unique_ptr<type1>( new type2() );

std::unique_ptr<type1>(std::make_unique<type2>());

std::unique_ptr<type1>( new type2(2) );

std::unique_ptr<type1>(std::make_unique<type2>(2));

reset an unique_ptr with new or nullptr

This section assumes that you are mostly using smart pointers with the default deleter.

Otherwise changing a.reset( new int(42) ); to a = std::make_unique<int>(42) ; is a breaking change.

If you are using a custom deleter, i.e something like auto up = std::unique_ptr<a>(ptr, deleter);, consider making the deleter part of the type

struct s{
    void operator(a* ptr){
        deleter(ptr);
    }
};
auto up = std::unique_ptr<a, s>(ptr);

This makes it impossible to use by accident the wrong deleter.

For most cases, something like 🗄️:

sed -e 's/\.reset( *new *\([a-zA-Z :_<>,0-9]*\)(\(.*\));/ = std::make_unique< \1 >( \2;/g' -e 's/\.reset( *new *\([a-zA-Z :_<>,0-9]*\));/ = std::make_unique< \1 >();/g' -e 's/\.reset( *nullptr *);/ = nullptr;/g' -e 's/\.reset( *);/ = nullptr;/g';

should be sufficient.

Before After Correct?

a.reset( new int(42) );

a = std::make_unique< int >( 42) ;

a.reset( new T<A>(42,-1) );

a = std::make_unique< T<A> >( 42,-1) ;

a.reset( new T );

a = std::make_unique< T >();

✓, mostly

a.reset( nullptr );

a = nullptr;

a.reset();

a = nullptr;

Using make_unique, like in all other places, has the advantage that it makes it more obvious where possible leaks are.

For cases like a.reset(b.release()) use

sed -e 's/\.reset(\(.*\)\.release());/ = std::move(\1);/g';

It also makes sense to combine it with

sed -e 's/std::move(std::make_unique\(.*\));/std::make_unique\1;/g';

If not a.reset(std::make_unique<int>(42).release()); would be left as a = std::move(std::make_unique<int>(42));.

It could be possible to avoid this issue altogether by using

sed -e 's/.reset(std::make_unique<\(.*\)> *(\(.*\))\.release());/= std::make_unique<\1>(\2);/g';

so that this particular code smell is fixed as a separate commit.

Other instances of unique_ptr

The regexes applied for the return statement can be reused in other situations by simply removing return. Similarly, the regexes used for .reset that ends with ; can be reused in other places. Depending on how many matches and particular code constructs, it might make sense to make further changes to restrict the matches and reduce the number of false positives.

Return a naked owning pointer

Change return new int (42) to return std::make_unique<int>(42).

Further changes are necessary, like changing the function signature and where the function is used. Applying the regex 🗄️, and compiling afterwards the code, will give an idea of what and how many changes are necessary at least, and eventually discover some bugs.

sed -e 's/return *new *\([a-zA-Z :_<>,0-9]*\) *( *\(.*\) *) *;/return std::make_unique<\1>(\2);/g';
Before After Correct?

return new int(42);

return std::make_unique<int>(42);

return new int();

return std::make_unique<int>();

return new int;

return new int;

✗, not replaced

return new int[]{1,2};

return new int[]{1,2};

✗, not replaced

return new A<B>(C(42));

return std::make_unique<A<B>>(C(42));

Other instances of new

new is a very common word, so it is useful to try to avoid matching commented-out code and comments, those all lines beginning with / or * (for comments that span over multiple lines).

Assignment

Use something like following regex 🗄️ to replace new with make_unique:

sed -e 's/\(^[^\*/]*\)= *new *\([a-zA-Z :_<>,0-9]*\) *(\(.*\));/\1 = std::make_unique<\2>(\3);/g' -e 's/\(^[^\*/]*\)= *new *\([a-zA-Z :_<>,0-9]*\);/\1 = std::make_unique<\2>();/g'
Before After Correct?

a = new int(42);

a = std::make_unique<int>(42);

//a = new int(42);

//a = new int(42);

/*
 * a = new int(42);
 */
/*
 * a = new int(42);
 */

Function call

If new is used when calling a function, like foo(new int), then something like 🗄️:

sed -e 's/\(^[^\*/]*\)( *new *\([a-zA-Z :_<>,0-9]*\) *(\(.*\));/\1(std::make_unique<\2>(\3);/g' -e 's/\(^[^\*/]*\), *new *\([a-zA-Z :_<>,0-9]*\) *(\(.*\));/\1,std::make_unique<\2>(\3);/g'

should do the job (eventually call it more than once)

Note that I have currently no idea how to write a readable regex to support f(new a, c) because "a,c" could denote a type with the current regex rules. At that point, hopefully, new does not appear as many times, and those places can be fixed by hand.

Before After Correct?

f(new int(42));

f(std::make_unique<int>(42));

f(new int);

f(new int);

✗, not replaced

//f(new int(42));

//f(new int(42));

f(new int(42), new int(42));

f(std::make_unique<int>(42),std::make_unique<int>(42));

f(new int(42), new int(42), new int(42));

f(std::make_unique<int>(42), new int(42),std::make_unique<int>(42));

✗, not replaced, execute regex more than once

f(std::make_unique<int>(42), new int(42),std::make_unique<int>(42));

f(std::make_unique<int>(42),std::make_unique<int>(42),std::make_unique<int>(42));

f(new int(42), bar(new int(42)));

f(new int(42), bar(std::make_unique<int>(42)));

✗, not replaced, execute regex more than once

f(new int(42), bar(std::make_unique<int>(42)));

f(std::make_unique<int>(42), bar(std::make_unique<int>(42)));

f(new int, c);

f(new int, c);

✗, not replaced

f(new int(42), c);

f(std::make_unique<int>(42), c);

delete

Ideally, a codebase never uses the delete for freeing memory, as resource management is done through constructors and destructors.

Generally delete cannot be transformed into something useful, it could be deleted. Similarly to new it is a very common word, thus it is a good idea to skip comments to reduce the number of matches 🗄️

grep -e '^[^/=\*]*\bdelete [^\[]*[a-zA-Z0-9:_]';
Code Match?

delete a;

yes, ✓

delete[] a;

no, ✓

mydelete a;

no, ✓

delete(a);

yes, ✓

delete (a);

yes, ✓

foo() = delete

no, ✓

 // delete a
 // value from the database

no, ✓

delete a value from the database

yes, ✗

/*
 * delete a
 * value from the database
*/

no, ✓

/*
 delete a
 value from the database
*/

yes, ✗

I’m skipping lines that have a = before delete, as those are probably something that looks like a deleted function.

For delete[], use 🗄️

grep -e '^[^/=\*]*\bdelete *\[';
Code Match?

delete a;

no, ✓

delete[](a);

yes, ✓

delete[] a;

yes, ✓

delete [] a;

yes, ✓

mydelete[] a;

no, ✓

foo() = delete;

no, ✓

// delete a value from the database

no, ✓

I prefer handling delete and delete[] differently because often a better replacement has a very different look (mostly std::vector versus a std::unique_ptr).

Return a value directly instead of a pointer

Especially before C++11 and move semantic, it was common, especially for classes that could not be copied, to return a pointer pointing to the heap.

This adds an unnecessary indirection and a heap allocation. Considering RVO, it should never be necessary to use such techniques anymore, so it would be nice to be able to remove these unnecessary indirections.

Unfortunately changing the function signature is not enough, the user of the functions needs to be changed too.

Consider the following piece of code

struct foo{
    int bar();
};

std::unique_ptr<foo> fun1();
void fun2(const foo& f);

int main(){
    auto res = fun1()->bar();
    fun2(*fun1());
}

After changing the signature of fun1 to foo fun1();, the changes one needs to do are removing dereferences (*) and replacing -> with ..

This can be accomplished with something similar 🗄️ to

sed -e 's/\*\(.*\)\bfun1()/\1fun1()/g' -e 's/\bfun1()->/fun1()./g';

Note that the regex is simple enough to work also for member functions 🗄️:

struct foo{
    int bar();
    std::unique_ptr<foo> fun1();
};

void fun2(const foo& f);

int main(){
    foo f;
    auto res = f.fun1()->bar();
    fun2(*f.fun1());
}
Before After Correct?

auto res = fun1()->bar();

auto res = fun1().bar();

fun2(*fun1());

fun2(fun1());

auto res = f.fun1()->bar();

auto res = f.fun1().bar();

fun2(*f.fun1());

fun2(f.fun1());

Cleanup functions

Common names (or part of names) for cleanup and acquisition functions are

Since those operations are often best handled with factory functions (like std::make_unique instead of new), constructors, and destructors (like lock_guard instead of lock and unlock), it makes sense to search for those and other similar patterns.

Other actions that are often best handled with constructors and destructors, or somehow encapsulated, are

  • start

  • stop

  • revert

  • undo

  • join (like std::thread::join)

I happen to search for those or similar patterns from time to time (comments excluded), because most of the time (think about new and delete, or locking a mutex) those constructs are error-prone to use and are better encapsulated. Thus I expect to have only a few matches.

std::chrono vs integers

Before std::chrono, most functions would take an integral for denoting time. If lucky, a comment would explain if the given parameter denoted seconds, milliseconds, or something else.

std::chrono is a nice example of how a wrapper can be easier to use and read.

When changing a specific function signature, there might be a lot of code that needs to be adapted. Function call where a literal value is being used can be changed with a regex 🗄️:

sed -e "s/set_timeout( *\([0-9-][0-9']*\) *)/set_timeout( \1ms \)/g"
Before After Correct?

set_timeout(0)

set_timeout( 0ms )

set_timeout(1'000)

set_timeout( 1'000ms )

set_timeout(-2)

set_timeout( -2ms )

set_timeout(a)

set_timeout(a)

catch by const-ref

It is easy to forget to write const when catching an exception.

With

sed -e 's/catch( *\([a-zA-Z :_<>,0-9]*\)/catch(const \1/g' -e 's/catch(const const /catch(const /g'

it is possible to add a const after catch(. If the capture should not be made constant, the compiler will complain.

The sed command avoids adding const when it is already there, but it does not detect patterns where the const is written on the right catch(std::exception const&).

A regex like

sed -e 's/catch *( *\([a-zA-Z :_<>,0-9]*\)\bconst\b\(&\)* /catch(const \1\2 /g'

is also problematic, as const could be part of the captured type, for example: catch(myexception<const& int> const&).

To make the pattern less error-prone, one could use

sed -e 's/catch *( *\([a-zA-Z :_<>,0-9]*\)\(.*\)\bconst\b\(&\)* /catch(const \1\2\3 /g'

to find only the last const instead of the first one. This means that this pattern works correctly for catch(myexception<const& int> const&), but still fails for catch(myexception<const& int>).

As variable names cannot contain any of the following characters <>,, it is possible to make the regex even more foolproof, but I do not think that it would make much sense. First, the compiler is going to complain if a regex broke a catch block, second, I’ve never encountered a templated type in a catch clause.

What I’ve curently used 🗄️ is

sed -e 's/\bcatch *( *\([a-zA-Z :_<>,0-9]*\)\(.*\)\bconst\b\(&\)* /catch(const \1\2\3 /g' -e 's/\bcatch *( *\([a-zA-Z :_<>,0-9]*\)/catch(const \1/g' -e 's/\bcatch(const const /catch(const /g';
Before After Correct?

catch(std::exception const a)

catch(const std::exception a)

catch(std::exception const& a)

catch(const std::exception & a)

catch(std::exception<const int> const& a)

catch(const std::exception<const int> & a)

catch(std::exception< const& int> const & a)

catch(const std::exception< const& int> & a)

catch(std::exception< const& int> a)

catch(const std::exception< & int> a)

✗, wrong replacement(!)

catch(std::exception a)

catch(const std::exception a)

catch(std::exception& a)

catch(const std::exception& a)

catch(const std::exception a)

catch(const std::exception a)

catch(const std::exception& a)

catch(const std::exception& a)

Multiline regexes

Yes, it should be forbidden, but hey, for simple use cases, just like single-line regexes, it can be good enough.

sed can be used for multiline regexes simply by using -z as an additional parameter. -z means that sed uses \0 and not a newline as a delimiter when applying a regex. This means that a single regex is generally applied to the whole textual file.

When using -z, one needs to detect newlines too. This makes writing the regex more complex, as the code after a new line is normally formatted with multiple tabs or spaces. Thus you might need to replace a single whitespace with something like \s*.

As a very dumb example, suppose we want to transform return std::unique_ptr<int>( new int(42) ); to return std::make_unique<int>(42);

The previously presented regex would not work, as it does not find the ending ); and for simplicity, let’s just concentrate on this piece of the regex: 's/return std::unique_ptr<.*>( new \([a-zA-Z :_<>,0-9]*\)(\(.*\) *) *) *;/return std::make_unique< \1 >(\2);/g'

The first approach would be to replace whitespaces with [ \t\n] or \s:

sed -z -e 's/return\s*std::unique_ptr<.*>(\s*new\s*\([a-zA-Z :_<>,0-9]*\)(\(.*\)\s*)\s*)\s*;/return std::make_unique< \1 >(\2);/g'

but it is not sufficient, as it would transform

#include <memory>

auto bar(){ return std::unique_ptr<int>(
  new int(42) ); }
void foo(){ (int*)nullptr >( new int(44) ); }

to

#include <memory>

auto bar(){ return std::make_unique< int >(44); }

Having statements on single lines makes writing regex easier because they are automatically scoped at most to a single statement.⁠[1]

Since the regex matches multiple statements, it is necessary to let it match at most one statement by hand.

The easiest way to do it is to replace .* with [^;]*, thus replacing "any character", with "any character that is not a semicolon". Contrary to other languages, the ; at the end of a statement is never optional.⁠[2]

With

sed -z -e 's/return\s*std::unique_ptr<[^;]*>(\s*new\s*\([a-zA-Z :_<>,0-9]*\)(\([^;]\)\s*)\s*)\s*;/return std::make_unique< \1 >(\2);/g'

the code is transformed to

#include <memory>

auto bar(){ return std::unique_ptr<int>( new int(42) ); }
void foo(){ (int*)nullptr >( new int(44) ); }

which is the desired output.

Note 📝
Both transformations are valid code snippets that would compile. I did this by accident, and it shows that multiline regexes are generally too broad-scoped by default and should be used with care.

TLDR; there are two rules of thumb to apply to a regex when using it on multiple lines

  • replace whitespace with \s*; a sequence of multiple whitespaces (whitespace, tab, line feed, …​)

  • replace .* with [^;]*; a sequence of any character that does not contain ;

Replace base uncopyable class with deleted assignment operators

The Uncopyable class from Scott Meyers Book Effective C++ 🗄️:

class Uncopyable{
	protected:
		Uncopyable(){}
		~Uncopyable(){}
	private:
		Uncopyable(const Uncopyable&);
		Uncopyable& operator=(const Uncopyable&);
};

it can help to suppress copy constructors.

I have seen in the wild implementations with virtual destructors, which missed the following advice (emphasis mine)

The implementation and use of Uncopyable include some subtleties, such as […​] and that Uncopyable’s destructor need not be virtual […​].

Even with the virtual destructor, the class seems to work, but it is a pessimization.

The boost implementation 🗄️ shows that multiple implementations have another possible drawback not described in "Effective C++": unintended ADL.

Another issue is that a subclass might implement a copy constructor. This code (with Uncopyable "updated" to C++11) compiles without errors

class Uncopyable{
		Uncopyable(const Uncopyable&) = delete;
		Uncopyable& operator=(const Uncopyable&) = delete;
	protected:
		Uncopyable() = default;
		~Uncopyable() = default;
};

struct s : private Uncopyable{
    int j= 42;
    s() = default;
    s(const s& other) : j(other.j){
        // do eventually something else
    }
};

void foo(){
    s v1;
    s v2 = v1; // copy
}

Also sometimes users of the class miss following advice (emphasis mine)

The […​] use of Uncopyable include some subtleties, such as the fact that inheritance from Uncopyable needn’t be public

Since C++11, it should be preferred to delete the copy constructor (and assignment) operator directly, or follow the "Rule of 0" instead of relying on class hierarchies. Thus the advice of using Uncopyable did (fortunately) not age well. It is generally less error-prone and generally helps little with legibility.

With the following multiline regular expression it is possible to match subclasses of Uncopyable and add deleted copy constructors and assignments:

sed -z -e 's/\s*\b\(class\|struct\)\s*\([a-zA-Z0-9]*\)\s*:\s*\(private\|public\|protected\)\?\s*Uncopyable\s*{\n/\n\1 \2\n{\n\2 (const \2 \&)=delete;\n\2\& operator=(const \2 \&)=delete;/g'

This regex changes

class a : private Uncopyable
{

};

class b : private Uncopyable
{

};

class c : Uncopyable
{

};

class d : public Uncopyable
{

};

struct e : public Uncopyable
{

};

to

class a
{
a (const a &)=delete;
a& operator=(const a &)=delete;
};
class b
{
b (const b &)=delete;
b& operator=(const b &)=delete;
};
class c
{
c (const c &)=delete;
c& operator=(const c &)=delete;
};
class d
{
d (const d &)=delete;
d& operator=(const d &)=delete;
};
struct e
{
e (const e &)=delete;
e& operator=(const e &)=delete;
};

Note that this version does not change, by design, the class in case of multiple inheritance.

Replace empty constructor and destructor with default

Instead of writing

myclass::myclass(){
}

myclass::~myclass(){
}

one should prefer the shorter form

myclass::myclass() = default;

myclass::~myclass() = default;

Replacing the destructor is "easy", as destructors have a ~ and do not take any parameter.

sed -z -e 's/~\([^;()]*\)(\s*)\s*{\s*}/~\1() = default;/g'

Note that this regex will not match ~a() override {}, so let’s add an optional capture for override

sed -z -e 's/~\([^;()]*\)(\s*)\s*\(override\s*\)*{\s*}/~\1() \2= default;/g'

On the other hand, replacing a constructor is more difficult, as there is no "direct" way to recognize if a given function is a constructor.

Is a(){} a constructor that we would like to set = default;? Yes if it appears directly inside a class or struct named a, otherwise not.

On the other hand, recognizing a non-inline constructor is easier, as the function name coincides with a class name. Thus a::a(){} is a constructor we are interested in, while a::b(){} is not.

sed is able to use a capture group in the expression we want to match, I’m not sure if this feature is widely supported, but it makes it trivial to match outline constructors:

sed -z -e 's/\([^;()]*\)::\1(\s*)\s*{\s*}/\1::\1() = default;/g'

Note that those regexes fail to find functions like a::~a(void){} and a::a(void){}.

It is left as an exercise to the reader to adapt the regexes, even if in my opinion you should stop writing unnecessary void parameters.

Inline constructors are still to be updated, assuming that there will not be many empty nullary functions, searching for (\s*)\s*{\s*}, might be sufficient for finding the remaining constructors without too many positives.

Before (inline for simplicity) After Correct?

struct a{~a(){}};

struct a{~a() = default;};

struct a{~a(){f()}};

struct a{~a(){f()}};

struct a : b{~a() override{}};

struct a : b{~a() override= default;};

a::~a(){}

a::~a() = default;

a::~a(void){}

a::~a(void){}

✗, extend regex, or do not use void for destructors

a::~a(){f()}

a::~a(){f()}

struct a{a operator~(){}};

struct a{a operator~() = default;}

✗, wrong replacement. The original code had undefined behavior, the new one does not build (operator~() cannot be set =default;)

struct a{a(){}};

struct a{a(){}};

struct a{a(){f()}};

struct a{a(){f()}};

struct a{a(int){}};

struct a{a(int){}};

a::a(){}

a::a() = default;

a::a(void){}

a::a(void){}

✗, extend regex, or do not use void for nullary functions

a::a(){f()}

a::a(){f()}

auto instead of iterator types

When iterating over a container, it is not uncommon to find code similar too

// iteratorytype is either the actual type, or a typedef to make the code more readable
for(iteratortype it = c.begin(); it != c.end(); it++){
	// ...
}

While there are most of the time more expressive ways to write such loop (with a foreach loop, or with an appropriate algorithm) a little upgrade that can be done with a regex is using auto instead of writing the iterator type explicitly:

for( auto it = c.begin(); it != c.end(); ++it){
	// ...
}

The code is generally more compact and less error-prone to maintain.

Note that generally speaking, transforming the first form into the second is not safe.

Maybe c.begin() does not return an iteratortype, and the return type of c.begin() is convertible to iteratortype.

For example iteratortype might be a const_iterator.

Thus blindly replacing one form with the other might have unintended side effects. Even more, if the iterator is compared with something else, after all, begin and end do not necessarily have anything to do with iterators, they could return integers, enum or some other strange type.

For this reason, the presented regex is as strict as possible.

The code matches only if

  • it is inside a for statement

  • both .begin() (or `.cbegin()) and .end() (or .cend()) appear

  • the iterator is compared with !=, and not, for example <

  • the return value of begin() is incremented with ++

-z -e 's/for\s*(\s*[a-zA-Z :_<>,0-9\*]*\s\s*\([a-zA-Z:_<>,0-9]*\)\s*=\s*\([a-zA-Z:_<>,0-9\.]*\)\.\(c\)\{0,1\}begin();\s*\1\s*!=\s*\2\.end();\s*\(++\1\|\1++\)\s*)/for(auto \1 = \2.\3begin(); \1 != \2.\3end(); ++\1)/g'
Before (inline for simplicity) After Correct?

for(vector< int >::iterator it = c.begin(); it != c.end(); ++it)

for( auto it = c.begin(); it != c.end(); ++it)

for(iterator it = c.cbegin(); it != c.end(); ++it)

for( auto it = c.cbegin(); it != c.cend(); ++it)

for(iterator it = c.begin(); it != c.end(); it++)

for( auto it = c.begin(); it != c.end(); ++it)

for(iterator it = a.begin(); it < a.end(); ++it)

for(iterator it = a.begin(); it < a.end(); ++it)

✗, extend regex, but I assume tha using < between iterators is rare enough that the change is not worth it

for(iterator it = a.begin(); it != j; ++it)

for(iterator it = a.begin(); it != j; it++)

✓, as there might be implicit conversions

for(iterator it = std::begin(a); it != std::end(a); it++)

for(iterator it = std::begin(a); it != std::end(a); it++)

✗, extend regex, but I assume that this pattern is rare enough that the change is not worth it

for( itertype it = begin(a); it != end(a); it++)

for( itertype it = begin(a); it != end(a); it++)

✗, extend regex, but I assume that this pattern is rare enough that the change is not worth it

for(type i = begin; i != end; ++i)

for(type i = begin; i != end; ++i)

✓, as there might be implicit conversions

An alternate approach would be to add a static_assert, and use the compiler to verify if some unwanted conversions have been introduced:

sed -z -e 's/for\s*(\s*\([a-zA-Z :_<>,0-9\*]*\)\s\s*\([a-zA-Z:_<>,0-9]*\)\s*=\s*\([a-zA-Z:_<>,0-9\.]*\)\.\(c\)\{0,1\}begin();\s*\2\s*!=\s*\3\.end()\s*;\s*\(++\2\|\2++\)\s*)/static_assert(std::is_same<\1, decltype(\3.\4begin())>::value, "");\nfor(auto \2 = \3.\4begin(); \2 != \3.\4end(); ++\2)/g'

This would transform

for(std::vector< int >::iterator it = c.begin(); it != c.end(); it++){
	// ...
}

to

static_assert(std::is_same<std::vector< int >::iterator, decltype(a.begin())>::value, "");
for(auto it = a.begin(); it != a.end(); ++it){
	// ...
}

This means that for the following example

for(std::vector< int >::const_iterator it = c.begin(); it != c.end(); it++){
	// ...
}

the code will not compile after applying the regex instead of slightly changing meaning, and one can inspect manually what changes are necessary to retain the desired behavior (in this case, use cbegin instead of begin).

Conclusion

Most regexes written here have been used only a limited number of times and with some adaptations.

For example, in the beginning, I used something similar to

sed -e 's/std::unique_ptr<.*>( new \([a-zA-Z]*\)(\(.*\));/std::make_unique< \1 >(\2;/g';

After noticing that the difference was so big that it would have made reviewing too difficult. I looked if was possible to split the changes into smaller pieces. So I narrowed the scope down with the following regex

sed -e 's/return std::unique_ptr<.*>( new \([a-zA-Z]*\)(\(.*\));/return std::make_unique< \1 >(\2;/g';

After fixing all return statements, I used the more generic pattern to see where it made sense to continue to change the code.

Similarly, when searching for common cleanup and acquisition functions, in the first run, I use the most generic term, like lock. After looking at how many matches there are, I might decide to grep for a less common pattern, like \.[a-z]*lock(, to match member functions like lock and unlock.

Most examples are about pointers, because

  • it occurred to me it might be a good idea to write those one-shot regexes down when I was changing some code dealing with pointers

  • the deprecation of std::auto_ptr and addition of std::unique_ptr and nullptr in C++11, and the addition of std::make_unique in C++14 lead to a lot of low-hanging fruits easy to fix quickly with regexes.

It also occurred to me that it is possible to reuse those regexes and integrate them with Cppcheck. As Cppcheck normalizes the source code, regexes can be simplified a lot. Instead of writing, for example, = *new, writing = new is sufficient. Also, many presented regexes fail if someone uses using namespace std; and unique_ptr without std::, and does not handle a ton of cases with whitespace or tabs. With Cppcheck this is a non-issue and makes this much easier to maintain. The main disadvantage is that with Cppcheck it is not possible to transform the code automatically.


1. I am supposing that you do not write multiple statements on a single line. Does any code formatter write multiple statements on one line? You are using a code formatter, right?
2. Rant: I really dislike languages that make the ; optional under some circumstances. First, it makes the code inconsistent. Second, always having a ; at the end of a statement makes parsing the code in my head easier/less error-prone

Do you want to share your opinion? Or is there an error, some parts that are not clear enough?

You can contact me anytime.