The C++ logo, by Jeremy Kratz, licensed under CC0 1.0 Universal

Find and fix code smells in C++ with regexes


14 - 18 minutes to read, 3512 words
Categories: c c++ editing regex scripting
Keywords: =delete c c++ clang-tidy cppcheck editing leak regex scripting

Even if everyone knows that search-and-replace is a rudimentary tool for editing source code, it is still one that is very effective and with a very small learning curve.

In particular, it is available in nearly all editors, IDE, and command line tools. It is also a technique that works with all programming languages that use text for representing source code (one could also use it when programming in Scratch, but I doubt it will be that useful…​).

Because of its simplicity, it is the first tool one learns to use when refactoring some code, like renaming variables or functions.

But search and replace, together with regexes, can be used, with an important caveat, for doing more complex refactorings, like detecting and correcting code smells.

Issues with this approach

The caveat is no matter how hard you try, searching and replacing is not an appropriate tool for refactoring C++ source code. There are multiple technical issues, that make it impossible to write an error-free search and replace strategy.

The first is function overload. For example, suppose you have void foo(int) and void foo(const char*) but want to rename only the first overload to bar. Replacing foo with bar obviously will not work, as it will rename all overloads, and given the piece of code

int a = 0;
foo(a);

There is no way, without parsing the code, that a regex can decide if we need to replace foo with bar or not.

Another example is that generally, it is not possible to distinguish a function from a variable declaration

std::string s();

is it a function named s that returns a std::string, or a variable s of type std::string?

Even if thus this approach cannot replace a more robust tool for refactoring, it is often Good Enough™.

In particular, it is incredibly fast, and together with a revision control system and a compiler, it is invaluable for many tasks.

Advantages of this approach

Some IDE offers more powerful refactoring tools, like renaming a variable. Unfortunately, those are often bundled with the IDE, and cannot be used without it.

Unless you have configured multiple IDE on the same codebase, which is sometimes a pain, and other times not even possible as not all IDE are multiplatform you’ll have only a subset of tools at your disposal. Also, such tools might work only on code that is currently compiled to give a better result. As a side effect, conditionally compiled code, and code commented out (like examples in the documentation) will not be changed, which might or might be not a good thing.

Another disadvantage is that often tools bundled with an IDE cannot be executed automatically, so it is more difficult to explain how to obtain the desired result. With a command line tool, it should be enough to write the command one has used, and the transformation can easily be verified in a code review.

Last but not least, tools that need to parse C++ (or C) and also take macros into account are generally much slower unless all parsing information has been created before.

In more than one project I had search and replace doing a remarkable job on the whole codebase in less than a couple of seconds, and clang-tidy taking nearly an hour for replacing NULL with nullptr. To be fair clang-tidy also replaced other constructs with nullptr, like 0 or {}. Even if there were a handful of changes, waiting an hour was probably not worth it.

Some considerations before beginning

Unify formatting. It makes it easier to write regexes. Suppose you want, for example, to change a type from foo<bar> to foo_bar. If the formatting is not consistent, you might miss places where someone wrote, for example, foo< bar >.

Also having a formatter ensures that the replaced code can be formatted automatically, this helps to make the whole search-and-replace process automatic.

It is easier if every statement is on a single line, no matter how long. Otherwise, you might miss places that are written on multiple lines.

Code smells with pointers

Code older than C++11 might have used std::auto_ptr, and replaced most instances with std::unique_ptr. As the interface between the two functions is slightly different, and as there is the factory function std::make_unique since C++14, I noticed some code smells that were easy enough to "fix" with a regex.

Examples are written for GNU sed and GNU Grep, so some adjustments might be necessary for your IDE.

Those programs should be available for most, if not all platforms, Windows and Android included.

In particular, I’m using \b for denoting a word boundary.

Some IDE and editors might provide a separate option for defining if a regex should apply to whole words or not.

For exaple in vim, instead of \b, one needs to use \<, for example :%s/\<NULL\>/nullptr instead of sed -e 's/\bNULL\b/nullptr/g'.

Note 📝
One can execute sed inside vim. To execute sed on the current buffer, use :%!sed -e <pattern>. Other editors or IDE might provide similar options.

Use NULL instead of nullptr

Simply using something like

sed -e 's/NULL/nullptr/g';

is not enough, as NULL could be part of another word or macro, like MY_NULL, or be part of a string literal.

Something more robust would be

sed -e 's/\bNULL\b/nullptr/g';

This fix/enhancement should be one of the first refactorings to be done when refactoring smart pointers. It helps, for example, for a smooth transition from naked pointers and std::auto_ptr to std::unique_ptr and std::shared_ptr where those are useful.

For example std::unique_ptr<int>(NULL); does not compile, but std::unique_ptr<int>(nullptr); does.

Note that simply replacing NULL with nullptr might break some code, for example

char c = NULL;
// or
void foo(int);
foo(NULL);

but I would argue that this code is misleading, and should be changed (in a separate commit) to

char c = '\0';

foo(0);

Avoid std::auto_ptr

Since std::auto_ptr has been deprecated, and there are better replacements, it should be removed and replaced with something else.

sed -e 's/std::auto_ptr/std::unique_ptr/g';

should be sufficient for most cases.

One could argue that

sed -e 's/\bstd::auto_ptr\b/std::unique_ptr/g';

might be better, as it avoids matching something like mystd::auto_ptr_int, but the chances are pretty low.

Other adjustments might be necessary, normally it involves applying std::move where auto_ptr made a copy.

Compare with nullptr

A std::auto_ptr a; can be compared with nullptr (or NULL) throught the member function get: a.get() == nullptr.

For std::unique_ptr this is not necessary, as operator== (and operator!=) has an appropriate overload.

Using the member function get is not harmful, but it seems to confuse some developers, some even ends to write a != nullptr && a.get() != nullptr. (One could blame the developers, I think it’s just easier to make the code consistent, remove the function call, and eventually explain the change during a code review).

For simplicity and consistency, removing all the unnecessary calls to get helps to avoid this type of confusion.

sed -e 's/\.get( *) *== * nullptr/==nullptr/g';

This pattern has false positives, as .get() might be a member function of something else returning something comparable to nullptr, and removing it would make the code invalid.

It also does not match nullptr == a.get(). It is more difficult to write an appropriate substitution regex for it, as a simple approach would break expressions like nullptr == foo(a.get()).

A probably good enough pattern is

sed -e 's/nullptr == \([^(]*\)\.get()/nullptr == \1/g';

even if it misses things like nullptr == foo().get().

The pattern should then be generalized also for handling operator!=, the final version looks like

sed -e 's/\.get( *) *\([!=]\)= *nullptr/ \1= nullptr/g' -e 's/nullptr \([!=]\)= \([^(]*\)\.get()/nullptr \1= \2/g';

Return std::unique_ptr holding nullptr

Instead of writing return std::unique_ptr<int>(); or return std::unique_ptr<int>(nullptr);, it is much simpler to write (and read) return nullptr.

The pattern could also be applied not only to return statements but to avoid too many changes at once I preferred to handle return statements in separate commits.

sed -e 's/return std::unique_ptr<.*>( *nullptr *) *;/return nullptr;/g' -e 's/return std::unique_ptr<.*>( *) *;/return nullptr;/g';

Another factor that led me decide to write the regex this way, is that by replacing std::unique_ptr<int>() with nullptr, it would break code like auto a = std::unique_ptr<int>();.

Note that I’m also matching the ; at the end, because I saw too many places that used a construct like std::unique_ptr().get().

Those can be adjusted with

sed -e 's/return std::unique_ptr<.*>( *nullptr *)\.get() *;/return nullptr;/g' -e 's/return std::unique_ptr<.*>( *)\.get() *;/return nullptr;/g';

without the need to edit the code by hand.

Note that this transformation might still break or change valid code, for example, foo and bar do not have the same function signature

auto foo(){
    return nullptr;
}

auto bar(){
    return std::unique_ptr<int>();
}

thus it is not always possible to simply replace std::unique_ptr<int>() with nullptr, even in return statements. For example, the first lambda would compile, the second not:

auto l1 = [](bool b){
    if(b == 1){
        return std::make_unique<int>(42);
    }
    return std::unique_ptr<int>();
};

auto l2 = [](bool b){
    if(b == 1){
        return std::make_unique<int>(42);
    }
    return nullptr;
};

Return std::unique_ptr pointing somewhere

What about not nullptr? For example, when using new, like

return std::unique_ptr<int>( new int(42) );
// or
return std::unique_ptr<int>( new int );

In those cases, using make_unique should be the preferred approach.

sed -e 's/return std::unique_ptr<.*>( new \([a-zA-Z :_<>,0-9]*\)(\(.*\) *) *) *;/return std::make_unique< \1 >(\2);/g' -e 's/return std::unique_ptr<.*>( new \([a-zA-Z :_<>,0-9]*\) *);/return std::make_unique< \1 >();/g';

Note that for new int, using make_unique won’t create a pointer pointing to an uninitialized integer, but a pointer to a zero-initialized integer. So there might be some differences at runtime, for a single int it is negligible, but for more complex structures it could be relevant, and for those cases, std::make_unique_for_overwrite should be preferred.

Thus a possible alternate sed command is

sed -e 's/return std::unique_ptr<.*>( new \([a-zA-Z :_<>,0-9]*\)(\(.*\) *) *) *;/return std::make_unique< \1 >(\2);/g' -e 's/return std::unique_ptr<.*>( new \([a-zA-Z :_<>,0-9]*\) *) *;/return std::make_unique_for_overwrite< \1 >();/g';

The interesting part is also \([a-zA-Z :_<>,0-9]*\); a capture group for capturing the type, and paste it in \1. It does not detect all cases, for example, it won’t match A(), where A is defined as #define A() int, or decltype(1+1) but it is good enough.

It is also not strict enough for other situations, for example, int0 can be a type, but 0int not, so [a-zA-Z:_][a-zA-Z :_<>,0-9]* might be more appropriate, but I doubt it has any advantages. It would still fail to recognize that a:b or a< are not valid types.

I’ve also left arrays out. The pattern will not, for example, match std::unique_ptr<int[]>. The main reason is that in most cases, a std::vector is a better replacement, and simpler to use, and also because in the codebases I had to maintain yet, I’ve nearly never encountered such a type, so I did not have yet the need to think how to transform such code.

Note that this regex might change the type of a variable.

struct A0{ virtual ~A0() = default;};
struct A1 : A0{};


auto a = std::unique_ptr<A0>(new A1);

In this case, the sed command would change a from std::unique_ptr<A0> to std::unique_ptr<A1>.

For supporting such cases, one should prefer a slightly more complex expression:

sed -e 's/return *std::unique_ptr<\(.*\)>( *new \([a-zA-Z :_<>,0-9]*\) *[{(]\(.*\)[)}] *) *;/return std::unique_ptr< \1 >( std::make_unique< \2 >(\3));/g' -e 's/return *std::unique_ptr<\(.*\)>( *new \([a-zA-Z :_<>,0-9]*\) *) *;/return std::unique_ptr< \1 >( std::make_unique< \2 >());/g';

Or the variant that uses std::make_unique_for_overwrite.

reset an unique_ptr with new or nullptr

Use

sed -e 's/\.reset( *new *\([a-zA-Z :_<>,0-9]*\)(\(.*\));/ = std::make_unique< \1 >( \2;/g' -e 's/\.reset( *new *\([a-zA-Z :_<>,0-9]*\));/ = std::make_unique< \1 >();/g' -e 's/\.reset( *nullptr *);/ = nullptr/g';

for cases like a.reset(new int(42));.

Using make_unique, like in all other places, has the advantage that it makes it more obvious where possible leaks are.

It can make a difference in the case of a non-default dynamic deleter. I’ve never used a dynamic deleter with std::unique_ptr, but always made them as part of the type.

Thus instead of

auto up = std::unique_ptr<a>(ptr, deleter);

I prefer to use

struct s{
    void operator(a* ptr){
        deleter(ptr);
    }
}
auto up = std::unique_ptr<a, s>(ptr);

This way, it is not possible by accident to use the wrong deleter.

For cases like a.reset(b.release()) use

sed -e 's/\.reset(\(.*\)\.release());/ = std::move(\1);/g';

It also makes sense to combine it with

sed -e 's/std::move(std::make_unique\(.*\));/std::make_unique\1;/g';

If not a.reset(std::make_unique<int>(42).release()); would be left as a = std::move(std::make_unique<int>(42));.

Even better would it to precede it by

sed -e 's/.reset(std::make_unique<\(.*\)> *(\(.*\))\.release());/= std::make_unique<\1>(\2);/g';

so that this particular code smell is fixed as a separate commit.

Other instances of unique_ptr

The regexes applied for the return statement can be reused in other situations by simply removing return. Similarly, the regexes used for .reset can be reused. Depending on how many matches and particular code constructs, it might make sense to make further changes to restrict the matches and reduce the number of false positives.

Return a naked owning pointer

Change return new int (42) to return std::make_unique<int>(42).

Further changes are necessary, like changing the function signature and where the function is used. Applying the regex, and compiling afterwards the code, will give an idea of what and how many changes are necessary at least, and eventually discover some bugs.

sed -e 's/return *new *\([a-zA-Z :_<>,0-9]*\) *( *\(.*\) *) *;/return std::make_unique<\1>(\2)/g';

All things that are previously written for supporting new int need to be applied here too.

Other instances of new

new is a very common word, so it is useful to try to avoid matching commented-out code and comments, those all lines beginning with / or * (for multiline comments).

Assignment

sed -e 's/\(^[^\*/]*\)= *new *\([a-zA-Z :_<>,0-9]*\) *(\(.*\));/\1 = std::make_unique<\2>(\3);/g' -e 's/\(^[^\*/]*\)= *new *\([a-zA-Z :_<>,0-9]*\);/\1 = std::make_unique<\2>();/g'

Function call

If new is used when calling function, like foo(new int), then something like.

sed -e 's/\(^[^\*/]*\)( *new  *\([a-zA-Z :_<>,0-9]*\) *(\(.*\));/\1(std::make_unique<\2>(\3);/g'  -e 's/\(^[^\*/]*\), *new  *\([a-zA-Z :_<>,0-9]*\) *(\(.*\));/\1,std::make_unique<\2>(\3);/g'

Should do the job.

Note that I have currently no idea how to write a readable regex to support f(new a, c) because "a,c" could denote a type with the current regex rules. At that point, hopefully, new does not appear as many times, and those places can be fixed by hand.

delete

Ideally, a codebase never uses the delete for freeing memory, as resource management is done through constructors and destructors.

Generally delete cannot be transformed into something useful, it could be deleted. Similarly to new it is a very common word, thus it is a good idea to skip comments to reduce the number of matches

grep -e '^[^/=\*]*delete [^\[]*[a-zA-Z0-9:_]';

I’m also skipping lines that have a = before delete, as those are probably something that looks like a deleted function.

For delete[], use

grep -e '^[^/=\*]*delete *\[';

I prefer handling delete and delete[] differently because often a better replacement has a very different look.

Cleanup functions

Common names ( or part of names) for cleanup and acquisition functions are

Since those operations are often best handled with factory functions (like std::make_unique instead of new), constructors, and destructors (like lock_guard instead of lock and unlock), it makes sense to search for those and other similar patterns.

Other actions that are often best handled with constructors and destructors, or somehow encapsulated, are

  • start

  • stop

  • revert

  • undo

  • join (like std::thread::join)

I happen to search for those or similar patterns from time to time (comments excluded), because most of the time (think about new and delete, or locking a mutex) those constructs are error-prone to use and are better encapsulated. Thus I expect to have few matches only a few matches.

std::chrono vs integers

Before std::chrono, most function would take an integral for denoting time. If lucky, a comment would explain if the given parameter denoted secons, milliseconds, or something else.

std::chrono shows how a wrapper can be much more easy to use.

When changing a specific function signature, there might be a lot of code that needs to be adapted, and those function call using a constant can be changed with a regex:

sed -e 's/set_timeout( *\([0-9][0-9]*\) *)/set_timeout( \1ms \)/g'

Conclusion

Most regexes written here have been used only a limited number of times and with some adaptations.

For example, in the beginning, I used something similar to

sed -e 's/std::unique_ptr<.*>( new \([a-zA-Z]*\)(\(.*\));/std::make_unique< \1 >(\2;/g';

After noticing that the difference was so big that it would have made it reviewing too difficult, I searched if there same a more narrow expression I could use. So I narrowed it down to

sed -e 's/return std::unique_ptr<.*>( new \([a-zA-Z]*\)(\(.*\));/return std::make_unique< \1 >(\2;/g';

After fixing all return statements, I used the more generic pattern to see where it made sense to continue to change the code.

Similarly, when searching for common cleanup and acquisition functions, in the first run, I use the most generic term, like lock. After giving a look at how many matches there are, I might decide to grep for a less common pattern, like \.[a-z]*lock(, to match member functions like lock and unlock.

Most examples are about pointers, because

  • it occurred to me it might be a good idea to write those regexes down when I was changing some code dealing with pointers

  • the deprecation of std::auto_ptr and addition of std::unique_ptr and nullptr in C++11, and the addition of std::make_unique in C++14 lead to a lot of low-hanging fruits easy to fix with regexes.

It also occurred to me that it is possible to reuse those regexes and integrate with Cppcheck. As Cppcheck normalizes the source code, regexes can be simplified a lot. Instead of writing, for example, = *new `, writing `= new ` is sufficient. Also, all presented regexes fail if someone uses `using namespace std; and unique_ptr without std::, and do not handle a ton of cases with whitespace or tabs. With CppCheck this is a non-issue and makes this much easier to maintain. The main disadvantage is that with Cppcheck it is not possible to transform the code automatically.


Do you want to share your opinion? Or is there an error, some parts that are not clear enough?

You can contact me here.