Find and fix code smells in C++ with regexes
Even if everyone knows that search-and-replace is a rudimentary tool for editing source code, it is still one that is very effective and with a very small learning curve.
In particular, it is available in nearly all editors, IDE, and command-line tools. It is also a technique that works with all programming languages that use text for representing source code (one could also use it when programming in Scratch, but I doubt it will be that useful…).
Because of its simplicity, it is the first tool one learns to use when refactoring some code, like renaming variables or functions.
But search and replace, together with regexes, can be used, with an important caveat, for doing more complex refactorings, like detecting and correcting code smells.
Issues with this approach
The caveat is no matter how hard you try, searching and replacing is not an appropriate tool for refactoring C++ source code. There are multiple technical issues, that make it impossible to write an error-free search and replace strategy.
The first is function overload. For example, suppose you have void foo(int)
and void foo(const char*)
but want to rename only the first overload to bar
. Replacing foo
with bar
obviously will not work, as it will rename all overloads, and given the piece of code
int a = 0;
foo(a);
There is no way, without parsing the code, that a regex can decide if we need to replace foo
with bar
or not.
Even if thus this approach cannot replace a more robust tool for refactoring, it is often Good Enough™.
In particular, it is incredibly fast, and together with a revision control system and a compiler, it is invaluable for many tasks.
Advantages of this approach
Some IDE offers more powerful refactoring tools, like renaming a variable. Unfortunately, those are often bundled with the IDE, and cannot be used without it.
Unless you have configured multiple IDE on the same codebase, which is sometimes a pain, and other times not even possible as not all IDE are multiplatform you’ll have only a subset of tools at your disposal. Also, such tools might work only on code that is currently compiled to give a better result. As a side effect, conditionally compiled code, and code commented out (like examples in the documentation) will not be changed, which might or might be not a good thing.
Another disadvantage is that often tools bundled with an IDE cannot be executed automatically, so it is more difficult to explain how to obtain the desired result. With a command line tool, it should be enough to write the command one has used, and the transformation can easily be verified in a code review.
Last but not least, tools that need to parse C++ (or C) and also take macros into account are generally much slower unless all parsing information has been created before.
In more than one project I had search and replace doing a remarkable job on the whole codebase in less than a couple of seconds, and clang-tidy taking nearly an hour for replacing NULL
with nullptr
. To be fair clang-tidy also replaced other constructs with nullptr
, like 0
or {}
. Even if there were a handful of changes, waiting an hour was probably not worth it.
Some considerations before beginning
Unify formatting. It makes it easier to write regexes. Suppose you want, for example, to change a type from foo<bar>
to foo_bar
. If the formatting is not consistent, you might miss places where someone wrote, for example, foo< bar >
.
Also having a formatter ensures that the replaced code can be formatted automatically, this helps to make the whole search-and-replace process automatic.
It is easier if every statement is on a single line, no matter how long. Otherwise, you might miss places that are written on multiple lines.
Code smells with pointers
Code older than C++11 might have used std::auto_ptr
, and replaced most instances with std::unique_ptr
. As the interface between the two functions is slightly different, and as there is the factory function std::make_unique
since C++14, I noticed some code smells that were easy enough to "fix" with a regex.
Those programs should be available for most, if not all platforms, Windows and Android included.
In particular, I’m using \b
for denoting a word boundary.
Some IDE and editors might provide a separate option for defining if a regex should apply to whole words or not.
For exaple in vim
, instead of \b
, one needs to use \<
, for example :%s/\<NULL\>/nullptr
instead of sed -e 's/\bNULL\b/nullptr/g'
.
Note 📝 | One can execute sed inside vim . To execute sed on the current buffer, use :%!sed -e <pattern> . Other editors or IDE might provide similar options. |
Use NULL
instead of nullptr
Simply using something like
sed -e 's/NULL/nullptr/g';
is not enough, as NULL
could be part of another word or macro, like MY_NULL
, or be part of a string literal.
Something more robust would be
sed -e 's/\bNULL\b/nullptr/g';
This fix/enhancement should be one of the first refactorings to be done when refactoring smart pointers. It helps, for example, for a smooth transition from naked pointers and std::auto_ptr
to std::unique_ptr
and std::shared_ptr
where those are useful.
For example std::unique_ptr<int>(NULL);
does not compile, but std::unique_ptr<int>(nullptr);
does.
Note that simply replacing NULL
with nullptr
might break some code, for example
char c = NULL;
// or
void foo(int);
foo(NULL);
but I would argue that this code is misleading, and should be changed (in a separate commit) to
char c = '\0';
foo(0);
Avoid std::auto_ptr
Since std::auto_ptr
has been deprecated, and there are better replacements, it should be removed and replaced with something else.
sed -e 's/std::auto_ptr/std::unique_ptr/g';
should be sufficient for most cases.
One could argue that
sed -e 's/\bstd::auto_ptr\b/std::unique_ptr/g';
might be better, as it avoids matching something like mystd::auto_ptr_int
, but the chances are pretty low.
Other adjustments might be necessary, normally it involves applying std::move
where auto_ptr
made a copy.
Compare with nullptr
A std::auto_ptr a;
can be compared with nullptr
(or NULL
) throught the member function get
: a.get() == nullptr
.
For std::unique_ptr
this is not necessary, as operator==
(and operator!=
) has an appropriate overload.
Using the member function get
is not harmful, but it seems to confuse some developers, some even ends to write a != nullptr && a.get() != nullptr
. (One could blame the developers, I think it’s just easier to make the code consistent, remove the function call, and eventually explain the change during a code review).
For simplicity and consistency, removing all the unnecessary calls to get
helps to avoid this type of confusion.
sed -e 's/\.get( *) *== * nullptr/==nullptr/g';
This pattern has false positives, as .get()
might be a member function of something else returning something comparable to nullptr
, and removing it would make the code invalid.
It also does not match nullptr == a.get()
. It is more difficult to write an appropriate substitution regex for it, as a simple approach would break expressions like nullptr == foo(a.get())
.
A probably good enough pattern is
sed -e 's/nullptr == \([^(]*\)\.get()/nullptr == \1/g';
even if it misses things like nullptr == foo().get()
.
The pattern should then be generalized also for handling operator!=
, the final version looks like
sed -e 's/\.get( *) *\([!=]\)= *nullptr/ \1= nullptr/g' -e 's/nullptr \([!=]\)= \([^(]*\)\.get()/nullptr \1= \2/g';
Return std::unique_ptr
holding nullptr
Instead of writing return std::unique_ptr<int>();
or return std::unique_ptr<int>(nullptr);
, it is much simpler to write (and read) return nullptr
.
The pattern could also be applied not only to return statements but to avoid too many changes at once I preferred to handle return statements in separate commits.
sed -e 's/return std::unique_ptr<.*>( *nullptr *) *;/return nullptr;/g' -e 's/return std::unique_ptr<.*>( *) *;/return nullptr;/g';
Another factor that led me decide to write the regex this way, is that by replacing std::unique_ptr<int>()
with nullptr
, it would break code like auto a = std::unique_ptr<int>();
.
Note that I’m also matching the ;
at the end, because I saw too many places that used a construct like std::unique_ptr().get()
.
Those can be adjusted with
sed -e 's/return std::unique_ptr<.*>( *nullptr *)\.get() *;/return nullptr;/g' -e 's/return std::unique_ptr<.*>( *)\.get() *;/return nullptr;/g';
without the need to edit the code by hand.
Note that this transformation might still break or change valid code, for example, foo
and bar
do not have the same function signature
auto foo(){
return nullptr;
}
auto bar(){
return std::unique_ptr<int>();
}
thus it is not always possible to simply replace std::unique_ptr<int>()
with nullptr
, even in return statements. For example, the first lambda would compile, but the second not:
auto l1 = [](bool b){
if(b == 1){
return std::make_unique<int>(42);
}
return std::unique_ptr<int>();
};
auto l2 = [](bool b){
if(b == 1){
return std::make_unique<int>(42);
}
return nullptr;
};
Return std::unique_ptr
pointing somewhere
What about not nullptr
? For example, when using new
, like
return std::unique_ptr<int>( new int(42) );
// or
return std::unique_ptr<int>( new int );
In those cases, using make_unique
should be the preferred approach.
sed -e 's/return std::unique_ptr<.*>( new \([a-zA-Z :_<>,0-9]*\)(\(.*\) *) *) *;/return std::make_unique< \1 >(\2);/g' -e 's/return std::unique_ptr<.*>( new \([a-zA-Z :_<>,0-9]*\) *);/return std::make_unique< \1 >();/g';
Note that for new int
, using make_unique
won’t create a pointer pointing to an uninitialized integer, but a pointer to a zero-initialized integer. So there might be some differences at runtime, for a single int
it is negligible, but for more complex structures it could be relevant, and for those cases, std::make_unique_for_overwrite
should be preferred.
Thus a possible alternate sed command is
sed -e 's/return std::unique_ptr<.*>( new \([a-zA-Z :_<>,0-9]*\)(\(.*\) *) *) *;/return std::make_unique< \1 >(\2);/g' -e 's/return std::unique_ptr<.*>( new \([a-zA-Z :_<>,0-9]*\) *) *;/return std::make_unique_for_overwrite< \1 >();/g';
The interesting part is also \([a-zA-Z :_<>,0-9]*\)
; a capture group for capturing the type, and paste it in \1
. It does not detect all cases, for example, it won’t match A()
, where A
is defined as #define A() int
, or decltype(1+1)
but it is good enough.
It is also not strict enough for other situations, for example, int0
can be a type, but 0int
not, so [a-zA-Z:_][a-zA-Z :_<>,0-9]*
might be more appropriate, but I doubt it has any advantages. It would still fail to recognize that a:b
or a<
are not valid types.
I’ve also left arrays out. The pattern will not, for example, match std::unique_ptr<int[]>
. The main reason is that in most cases, a std::vector
is a better replacement, and simpler to use, and also because in the codebases I had to maintain yet, I’ve nearly never encountered such a type, so I did not have yet the need to think how to transform such code.
Note that this regex might change the type of a variable.
struct A0{ virtual ~A0() = default;};
struct A1 : A0{};
auto a = std::unique_ptr<A0>(new A1);
In this case, the sed
command would change a
from std::unique_ptr<A0>
to std::unique_ptr<A1>
.
For supporting such cases, one should prefer a slightly more complex expression:
sed -e 's/return *std::unique_ptr<\(.*\)>( *new \([a-zA-Z :_<>,0-9]*\) *[{(]\(.*\)[)}] *) *;/return std::unique_ptr< \1 >( std::make_unique< \2 >(\3));/g' -e 's/return *std::unique_ptr<\(.*\)>( *new \([a-zA-Z :_<>,0-9]*\) *) *;/return std::unique_ptr< \1 >( std::make_unique< \2 >());/g';
Or the variant that uses std::make_unique_for_overwrite
.
reset an unique_ptr
with new
or nullptr
sed -e 's/\.reset( *new *\([a-zA-Z :_<>,0-9]*\)(\(.*\));/ = std::make_unique< \1 >( \2;/g' -e 's/\.reset( *new *\([a-zA-Z :_<>,0-9]*\));/ = std::make_unique< \1 >();/g' -e 's/\.reset( *nullptr *);/ = nullptr/g';
for cases like a.reset(new int(42));
.
Using make_unique
, like in all other places, has the advantage that it makes it more obvious where possible leaks are.
It can make a difference in the case of a non-default dynamic deleter. I’ve never used a dynamic deleter with std::unique_ptr
, but always made them as part of the type.
Thus instead of
auto up = std::unique_ptr<a>(ptr, deleter);
I prefer to use
struct s{
void operator(a* ptr){
deleter(ptr);
}
};
auto up = std::unique_ptr<a, s>(ptr);
This way, it is not possible by accident to use the wrong deleter.
For cases like a.reset(b.release())
use
sed -e 's/\.reset(\(.*\)\.release());/ = std::move(\1);/g';
It also makes sense to combine it with
sed -e 's/std::move(std::make_unique\(.*\));/std::make_unique\1;/g';
If not a.reset(std::make_unique<int>(42).release());
would be left as a = std::move(std::make_unique<int>(42));
.
Even better would it to precede it by
sed -e 's/.reset(std::make_unique<\(.*\)> *(\(.*\))\.release());/= std::make_unique<\1>(\2);/g';
so that this particular code smell is fixed as a separate commit.
Other instances of unique_ptr
The regexes applied for the return statement can be reused in other situations by simply removing return
. Similarly, the regexes used for .reset
can be reused. Depending on how many matches and particular code constructs, it might make sense to make further changes to restrict the matches and reduce the number of false positives.
Return a naked owning pointer
Change return new int (42)
to return std::make_unique<int>(42)
.
Further changes are necessary, like changing the function signature and where the function is used. Applying the regex, and compiling afterwards the code, will give an idea of what and how many changes are necessary at least, and eventually discover some bugs.
sed -e 's/return *new *\([a-zA-Z :_<>,0-9]*\) *( *\(.*\) *) *;/return std::make_unique<\1>(\2)/g';
All things that are previously written for supporting new int
need to be applied here too.
Other instances of new
new
is a very common word, so it is useful to try to avoid matching commented-out code and comments, those all lines beginning with /
or *
(for multiline comments).
Assignment
Use something like following regex to replace new
with make_unique
:
sed -e 's/\(^[^\*/]*\)= *new *\([a-zA-Z :_<>,0-9]*\) *(\(.*\));/\1 = std::make_unique<\2>(\3);/g' -e 's/\(^[^\*/]*\)= *new *\([a-zA-Z :_<>,0-9]*\);/\1 = std::make_unique<\2>();/g'
Function call
If new
is used when calling a function, like foo(new int)
, then something like.
sed -e 's/\(^[^\*/]*\)( *new *\([a-zA-Z :_<>,0-9]*\) *(\(.*\));/\1(std::make_unique<\2>(\3);/g' -e 's/\(^[^\*/]*\), *new *\([a-zA-Z :_<>,0-9]*\) *(\(.*\));/\1,std::make_unique<\2>(\3);/g'
Should do the job.
Note that I have currently no idea how to write a readable regex to support f(new a, c)
because "a,c" could denote a type with the current regex rules. At that point, hopefully, new
does not appear as many times, and those places can be fixed by hand.
delete
Ideally, a codebase never uses the delete
for freeing memory, as resource management is done through constructors and destructors.
Generally delete
cannot be transformed into something useful, it could be deleted. Similarly to new
it is a very common word, thus it is a good idea to skip comments to reduce the number of matches
grep -e '^[^/=\*]*delete [^\[]*[a-zA-Z0-9:_]';
I’m also skipping lines that have a =
before delete
, as those are probably something that looks like a deleted function.
For delete[]
, use
grep -e '^[^/=\*]*delete *\[';
I prefer handling delete
and delete[]
differently because often a better replacement has a very different look.
Return a value directly instead of a pointer
Especially before C++11 and move semantic, it was common, especially for classes that could not be copied, to return a pointer pointing to the heap.
This adds an unnecessary indirection and a heap allocation. Considering RVO, it should never be necessary to use such techniques anymore, so it would be nice to be able to remove these unnecessary indirections.
Unfortunately changing the function signature is not enough, the user of the functions needs to be changed too.
Consider the following piece of code
class foo{
int bar();
};
std::unique_ptr<foo> fun1();
void fun2(foo& f);
int main(){
auto res = fun1()->bar();
fun2(*fun1());
}
After changing the signature of fun1
, most necessary changes one needs to do are removing dereferences (*
) and replacing ->
with .
.
This can be accomplished with something similar to
sed -e 's/\*\(.*\)\bfun1()/\1fun1()/g' -e 's/\bfun1()->/fun1()./g';
Note that the regex is simple enough to work also for member functions:
class foo{
int bar();
std::unique_ptr<foo> fun1();
};
std::unique_ptr<bar> fun1();
void fun2(foo& f);
int main(){
foo f;
auto res = f.fun1()->bar();
fun2(*f.fun1());
}
Cleanup functions
Common names ( or part of names) for cleanup and acquisition functions are
-
lock (like
std::mutex::lock
,std::mutex::unlock
,LockFile
,UnLockFile
) -
acquire (like
CryptAcquireContext
) -
release (like
std::unique_ptr::release
, CryptReleaseContext) -
free (like
free
) -
detach (like
std::thread::detach
orpthread_detach
) -
new (like
new
,new[]
) -
delete (like
delete
, delete[]` ) -
alloc (like
malloc
,realloc
) -
open (like
fopen
) -
close (like
fclose
,CloseHandle
,FindClose
) -
create (like RegCreateKey)
-
clean
Since those operations are often best handled with factory functions (like std::make_unique
instead of new
), constructors, and destructors (like lock_guard
instead of lock and unlock), it makes sense to search for those and other similar patterns.
Other actions that are often best handled with constructors and destructors, or somehow encapsulated, are
-
start
-
stop
-
revert
-
undo
-
join (like
std::thread::join
)
I happen to search for those or similar patterns from time to time (comments excluded), because most of the time (think about new
and delete
, or locking a mutex) those constructs are error-prone to use and are better encapsulated. Thus I expect to have a few matches only a few matches.
std::chrono
vs integers
Before std::chrono
, most functions would take an integral for denoting time. If lucky, a comment would explain if the given parameter denoted seconds, milliseconds, or something else.
std::chrono
shows how a wrapper can be much easier to use.
When changing a specific function signature, there might be a lot of code that needs to be adapted, and those function call using a constant can be changed with a regex:
sed -e 's/set_timeout( *\([0-9][0-9]*\) *)/set_timeout( \1ms \)/g'
Conclusion
Most regexes written here have been used only a limited number of times and with some adaptations.
For example, in the beginning, I used something similar to
sed -e 's/std::unique_ptr<.*>( new \([a-zA-Z]*\)(\(.*\));/std::make_unique< \1 >(\2;/g';
After noticing that the difference was so big that it would have made it reviewing too difficult, I searched if there same a more narrow expression I could use. So I narrowed it down to
sed -e 's/return std::unique_ptr<.*>( new \([a-zA-Z]*\)(\(.*\));/return std::make_unique< \1 >(\2;/g';
After fixing all return statements, I used the more generic pattern to see where it made sense to continue to change the code.
Similarly, when searching for common cleanup and acquisition functions, in the first run, I use the most generic term, like lock
. After looking at how many matches there are, I might decide to grep
for a less common pattern, like \.[a-z]*lock(
, to match member functions like lock
and unlock
.
Most examples are about pointers, because
-
it occurred to me it might be a good idea to write those regexes down when I was changing some code dealing with pointers
-
the deprecation of
std::auto_ptr
and addition ofstd::unique_ptr
andnullptr
in C++11, and the addition ofstd::make_unique
in C++14 lead to a lot of low-hanging fruits easy to fix with regexes.
It also occurred to me that it is possible to reuse those regexes and integrate with Cppcheck. As Cppcheck normalizes the source code, regexes can be simplified a lot. Instead of writing, for example, = *new `, writing `= new ` is sufficient. Also, all presented regexes fail if someone uses `using namespace std;
and unique_ptr
without std::
, and do not handle a ton of cases with whitespace or tabs. With CppCheck this is a non-issue and makes this much easier to maintain. The main disadvantage is that with Cppcheck it is not possible to transform the code automatically.
Do you want to share your opinion? Or is there an error, some parts that are not clear enough?
You can contact me anytime.