Find and fix code smells in C++ with regexes
- Issues with this approach
- Advantages of this approach
- Some considerations before beginning
- Code smells with pointers
- Use
nullptr
instead ofNULL
- Avoid
std::auto_ptr
- Compare with nullptr
- Return
std::unique_ptr
holding nullptr - Return
std::unique_ptr
pointing somewhere - Store result of
new
inunique_ptr
- reset an
unique_ptr
withnew
or nullptr - Other instances of unique_ptr
- Return a naked owning pointer
- Other instances of new
- delete
- Use
- Return a value directly instead of a pointer
- Cleanup functions
std::chrono
vs integers- catch by const-ref
- Multiline regexes
- Conclusion
Even if everyone knows that search-and-replace is a rudimentary tool for editing source code, it is still one that is very effective and with a very small learning curve.
In particular, it is available in nearly all editors, IDE, and command-line tools. It is also a technique that works with all programming languages that use text for representing source code (one could also use it when programming in Scratch, but I doubt it will be that useful…).
Because of its simplicity, it is the first tool one learns to use when refactoring some code, like renaming variables or functions.
But search and replace, together with regexes, can be used, with an important caveat, for doing more complex refactorings, like detecting and correcting code smells.
Issues with this approach
Searching and replacing is not an appropriate tool for refactoring C++ source code. There are multiple technical issues, that make it impossible to write an error-free search and replace strategy.
The first is function overload. For example, suppose you have void foo(int)
and void foo(const char*)
but want to rename only the first overload to bar
. Replacing foo
with bar
obviously will not work, as it will rename all overloads, and given the piece of code
int a = 0;
foo(a);
There is no way, without parsing the code, that a regex can decide if we need to replace foo
with bar
or not.
Advantages of this approach
Some IDE offers refactoring tools, like renaming a variable. Unfortunately, those are often bundled with the IDE, and cannot be used without it.
You can try to configure multiple IDEs on the same codebase, most of the time it’s a pain, and other times not even possible as not all IDEs are multiplatform, thus in the end you’ll have only a subset of tools at your disposal. Also, such tools might work only on code that is currently compiled to give a better result. As a side effect, conditionally compiled code, and code commented out (like examples in the documentation) will not be changed, which might or might be not a good thing.
Another disadvantage is that often tools bundled with an IDE cannot be executed automatically, so it is more difficult to explain how to obtain the desired result. With a command line tool, it should be enough to write the command one has used, and the transformation can easily be verified in a code review.
Last but not least, tools that need to parse C++ (or C) and also take macros into account are generally much slower unless all parsing information has been created before.
In more than one project I had "search and replace" doing a remarkable job on the whole codebase in less than a couple of seconds, and clang-tidy taking nearly an hour, for example, to replace NULL
with nullptr
. To be fair clang-tidy also replaced other constructs with nullptr
, like 0
or {}
. Even if there were a handful of changes, waiting a couple of hours was probably not worth it.
Even if search and replace approach cannot replace a more robust tool for refactoring, it is often Good Enough™.
In particular, it is incredibly fast, and together with a revision control system and a compiler, it is invaluable for many tasks.
Some considerations before beginning
Use a code formatter
Unify formatting. It makes it easier to write regexes. Suppose you want, for example, to change a type from foo<bar>
to foo_bar
. If the formatting is not consistent, you might miss places where someone wrote, for example, foo< bar >
.
Also having a code formatter ensures that the replaced code can be formatted automatically, this helps to make the whole search-and-replace process easier.
Since you are using a formatter, you could even format temporarily your code to make it easier to write regexes for it, and then format it back. The most useful thing would be to have every statement on a single line. Otherwise, you will miss places that are written on multiple lines. Since formatting the whole codebase might take some time, this is normally not an option.
Writing regexes is already hard enough. Debugging regexes is even harder. Regexes for C++ adds another layer of difficulty. Writing multiline regexes for C++ is something that you do not want to do!
Warning ⚠️ | In case you skipped this section, read again. A formatter will make your job easier! |
Warning ⚠️ | Do not apply the presented regexes in your code base. Read them, try to understand them, and only then apply them. |
Which regex?
Examples are written for GNU sed 🗄️ and GNU Grep 🗄️, without the extended regular expressions flag (-E
, -r
, --regexp-extended
) so some adjustments might be necessary.
Both programs should be available on all platforms, Windows and Android included.
Different programs might use different regex dialects
In particular, you should pay attention to
-
\b
, which denotes a word boundary -
\1
,\2
, … which is for inserting (or matching) the first, second, … capture -
\(
and\)
for starting and finishing a capture -
a whitespace character instead of
\s
for denoting space, as it is more readable. Again: format your code. -
(
and)
means literally(
and)
, in other dialects you need to escape those, like\(
and\)
-
*
to match zero or more instances -
\{0,1\}
to match zero or one instances -
\{1,\}
to match one or more instances (although in this case, writing one match and then zero or more might be more legible)
For example in vim
, instead of \b
, one needs to use \<
and \>
, for example :%s/\<NULL\>/nullptr
instead of sed -e 's/\bNULL\b/nullptr/g'
.
Note 📝 | It is possible to execute sed from vim . To execute sed on the current buffer, use :%!sed -e <pattern> . Other editors or IDE might provide similar options. |
Code smells with pointers
Code older than C++11 might have used std::auto_ptr
and replaced most instances with std::unique_ptr
. As the interface between the two functions is slightly different, and as there is the factory function std::make_unique
since C++14, I noticed some code smells that were easy enough to "fix" with a regex.
Use nullptr
instead of NULL
Simply using something like
sed -e 's/NULL/nullptr/g';
is not enough, as NULL
could be part of another word or macro, like MY_NULL
, or be part of a string literal.
Something more robust would be
sed -e 's/\bNULL\b/nullptr/g';
Before | After | Correct? |
---|---|---|
|
| ✓ |
|
| ✓ |
|
| ✓ |
|
| ✓ |
|
| ✓ |
|
| ✓ |
|
| ✗, wrong replacement(!) |
This fix/enhancement should be one of the first refactorings to be done when refactoring smart pointers. It helps, for example, for a smooth transition from naked pointers and std::auto_ptr
to std::unique_ptr
and std::shared_ptr
where those are useful.
For example std::unique_ptr<int>(NULL);
does not compile, but std::unique_ptr<int>(nullptr);
does.
Note that simply replacing NULL
with nullptr
might break some code, for example
char c = NULL;
// or
void foo(int);
foo(NULL);
but I would argue that this code is misleading, and should be changed (in a separate commit) to
char c = '\0';
foo(0);
Avoid std::auto_ptr
Since std::auto_ptr
has been deprecated, and there are better replacements, it should be removed and replaced with something else.
sed -e 's/std::auto_ptr/std::unique_ptr/g';
should be sufficient for most cases.
One could argue that
sed -e 's/\bstd::auto_ptr\b/std::unique_ptr/g';
might be better 🗄️, as it avoids matching something like mystd::auto_ptr_int
, but the chances are pretty low.
Other adjustments might be necessary, normally it involves applying std::move
where auto_ptr
made a copy.
Before | After | Correct? |
---|---|---|
|
| ✓ |
|
| ✓ |
|
| ✓ |
|
| ✓ |
Compare with nullptr
A std::auto_ptr a;
can be compared with nullptr
(or NULL
) through the member function get
: a.get() == nullptr
.
For std::unique_ptr
this is not necessary, as operator==
(and operator!=
) has an appropriate overload.
Using the member function get
is not harmful, but it seems to confuse some developers, some even end up writing a != nullptr && a.get() != nullptr
. (One could blame the developers, I think it’s just easier to make the code consistent, remove the function call, and eventually explain the change during a code review).
For simplicity and consistency, removing all the unnecessary calls to get
helps to avoid this type of confusion.
sed -e 's/\.get( *) *== * nullptr/==nullptr/g';
This pattern has false positives, as .get()
might be a member function of something else returning something comparable to nullptr
, and removing it would make the code invalid.
It also does not match nullptr == a.get()
. It is more difficult to write an appropriate substitution regex for it, as a simple approach would break expressions like nullptr == foo(a.get())
.
A probably good enough pattern is
sed -e 's/nullptr == \([^(]*\)\.get()/nullptr == \1/g';
even if it misses things like nullptr == foo().get()
.
The pattern should then be generalized also for handling operator!=
, the final version looks like 🗄️:
sed -e 's/\.get( *) *\([!=]\)= *nullptr/ \1= nullptr/g' -e 's/nullptr \([!=]\)= \([^(]*\)\.get()/nullptr \1= \2/g';
Before | After | Correct? |
---|---|---|
|
| ✓ |
|
| ✓, as |
|
| ✓ |
|
| ✓ |
|
| ✓ |
|
| ✓ |
|
| ✓ |
|
| ✓ |
|
| ✗, wrong replacement(!) |
|
| ✓ |
|
| ✗, not replaced |
|
| ✓ |
|
| ✓ |
Return std::unique_ptr
holding nullptr
Instead of writing return std::unique_ptr<int>();
or return std::unique_ptr<int>(nullptr);
, it is much simpler to write (and read) return nullptr
.
The pattern 🗄️ could also be applied not only to return statements but to avoid too many changes at once I preferred to handle return statements in separate commits.
sed -e 's/return std::unique_ptr<.*>( *nullptr *) *;/return nullptr;/g' -e 's/return std::unique_ptr<.*>( *) *;/return nullptr;/g';
Before | After | Correct? |
---|---|---|
|
| ✓ |
|
| ✓ |
|
| ✓ |
|
| ✓ |
Another factor that led me to decide to write the regex this way, is that by replacing std::unique_ptr<int>()
with nullptr
, it would break code like auto a = std::unique_ptr<int>();
.
Note that I’m also matching the ;
at the end because I saw too many places that used a construct like std::unique_ptr().get()
(true story 🤦).
Those can be adjusted with 🗄️:
sed -e 's/return std::unique_ptr<.*>( *nullptr *)\.get() *;/return nullptr;/g' -e 's/return std::unique_ptr<.*>( *)\.get() *;/return nullptr;/g';
without the need to edit the code by hand.
Before | After | Correct? |
---|---|---|
|
| ✓ |
|
| ✓ |
|
| ✓ |
|
| ✓ |
Note that this transformation might still break or change valid code, for example, foo
and bar
do not have the same function signature
auto foo(){
return nullptr;
}
auto bar(){
return std::unique_ptr<int>();
}
thus it is not always possible to simply replace std::unique_ptr<int>()
with nullptr
, even in return statements. For example, the first lambda would compile, but the second not:
auto l1 = [](bool b){
if(b == 1){
return std::make_unique<int>(42);
}
return std::unique_ptr<int>();
};
auto l2 = [](bool b){
if(b == 1){
return std::make_unique<int>(42);
}
return nullptr;
};
Return std::unique_ptr
pointing somewhere
What about not nullptr
? For example, when using new
, like
return std::unique_ptr<int>( new int(42) );
// or
return std::unique_ptr<int>( new int );
In those cases, using make_unique
🗄️ should be the preferred approach.
sed -e 's/return std::unique_ptr<.*>( new \([a-zA-Z :_<>,0-9]*\)(\(.*\) *) *) *;/return std::make_unique< \1 >(\2);/g' -e 's/return std::unique_ptr<.*>( new \([a-zA-Z :_<>,0-9]*\) *);/return std::make_unique< \1 >();/g';
Before | After | Correct? |
---|---|---|
|
| ✓ |
|
| ✓, mostly |
|
| ✓ |
|
| might be problematic in some cases |
|
| ✓, out of scope, not returning an |
|
| ✗, not replaced, but returns an invalid pointer, it should not be relevant, better fix it directly instead of transforming it |
|
| ✓ |
|
| ✗, not replaced |
Note that for new int
, using make_unique
won’t create a pointer pointing to an uninitialized integer, but a pointer to a zero-initialized integer. So there might be some differences at runtime, for a single int
it is negligible, but for more complex structures it could be relevant, and for those cases, std::make_unique_for_overwrite
should be preferred.
Thus a possible alternate sed command 🗄️ is
sed -e 's/return std::unique_ptr<.*>( new \([a-zA-Z :_<>,0-9]*\)(\(.*\) *) *) *;/return std::make_unique< \1 >(\2);/g' -e 's/return std::unique_ptr<.*>( new \([a-zA-Z :_<>,0-9]*\) *) *;/return std::make_unique_for_overwrite< \1 >();/g';
Before | After | Correct? |
---|---|---|
|
| ✓ |
|
| ✓ |
|
| ✓ |
|
| might be problematic in some cases |
|
| ✓, out of scope |
|
| ✗, not replaced, returns an invalid pointer, it should not be relevant, better fix it directly instead of transforming it |
|
| ✓ |
|
| ✗, not replaced |
The interesting part is also \([a-zA-Z :_<>,0-9]*\)
; a capture group for capturing the type, and paste it in \1
. It does not detect all cases, for example, it won’t match A()
, where A
is defined as #define A() int
, or decltype(1+1)
but it is good enough.
It is also not strict enough for other situations, for example, int0
can be a type, but 0int
not, so [a-zA-Z:_][a-zA-Z :_<>,0-9]*
might be more appropriate, but I doubt it has any practical advantages. It would still fail to recognize that a:b
or a<
are not valid types.
I’ve also left arrays out. The pattern will not, for example, match std::unique_ptr<int[]>
. The main reason is that in most cases, a std::vector
is a better replacement, and simpler to use, and also because in the codebases I had to maintain yet, I’ve nearly never encountered such a type, so I did not have yet the need to think how to transform such code.
Note that this regex might change the type of a variable.
struct A0{ virtual ~A0() = default;};
struct A1 : A0{};
auto a = std::unique_ptr<A0>(new A1);
In this case, the sed
command would change the type of a
from std::unique_ptr<A0>
to std::unique_ptr<A1>
.
For supporting such cases, one should prefer a more complex expression 🗄️:
sed -e 's/return *std::unique_ptr<\(.*\)>( *new \([a-zA-Z :_<>,0-9]*\) *[{(]\(.*\)[)}] *) *;/return std::unique_ptr< \1 >( std::make_unique< \2 >(\3));/g' -e 's/return *std::unique_ptr<\(.*\)>( *new \([a-zA-Z :_<>,0-9]*\) *) *;/return std::unique_ptr< \1 >( std::make_unique< \2 >());/g';
Before | After | Correct? |
---|---|---|
|
| ✓ |
|
| ✓, mostly |
|
| ✓ |
|
| ✓ |
|
| ✓, out of scope |
|
| ✗, not replaced, returns an invalid pointer |
|
| ✓ |
|
| ✗, not replaced |
Similarly, it is possible to create a variant that supports std::make_unique_for_overwrite
.
There are now fewer chances to break something, but the code is much more verbose. It might be possible to simplify it with another regex (or create a more complex regex to avoid creating such verbose code), but personally, the previous, simpler, regex has always been good enough™.
Store result of new
in unique_ptr
Replace snippets like std::unique_ptr<type> value( new type() );
with std::unique_ptr<type> value = std::make_unique<type>();
.
Note, while normally I would prefer to use auto
instead of typing type
twice, the regex does not distinguish between member variables and others.
Since it is not possible to use auto
for member variables, the regex will not replace the type with auto
sed -e 's/std::unique_ptr<\([a-zA-Z :_<>,0-9]*\)> \([a-zA-Z :_<>,0-9]\{1,\}\)( new \1(\(.*\)) );/std::unique_ptr<\1> \2 = std::make_unique<\1>(\3);/g'
Note how I’ve used [a-zA-Z :_<>,0-9]\{1,\}
to ensure that there is a variable, name, oterwise the regex would detect and replace incorrectly std::unique_ptr<type>( new type() )
Before | After | Correct? |
---|---|---|
|
| ✗, not replaced |
|
| ✓ |
|
| ✓ |
|
| ✓, not replaced, no variable name |
|
| ✓, not replaced, no variable name |
|
| ✓, not replaced, different types |
|
| ✓, not replaced, different types |
|
| ✓, not replaced, no name and different types |
|
| ✓, not replaced, no name and different types |
It is possible to support all listed examples by adding multiple regexes, but I’m going to leave cases like std::unique_ptr<type> value( new type );
out for brevity, as it would double the regexes, or make the proposed more complex.
For supporting the case of no variable name, something like the following command should do the job:
sed -e 's/std::unique_ptr<\([a-zA-Z :_<>,0-9]*\)>( new \1(\(.*\)) );/std::make_unique<\1>(\2);/g'
For supporting the example with different types, after the matching types have been handled, can be supported with:
sed -e 's/std::unique_ptr<\([a-zA-Z :_<>,0-9]*\)> \([a-zA-Z :_<>,0-9]\{1,\}\)( new \([a-zA-Z :_<>,0-9]*\)(\(.*\)) );/std::unique_ptr<\1> \2 = std::make_unique<\3>(\4);/g'
The last use-case is supported with, again, after supporting matching types:
sed -e 's/std::unique_ptr<\([a-zA-Z :_<>,0-9]*\)>( new \([a-zA-Z :_<>,0-9]*\)(\(.*\)) );/std::unique_ptr<\1>(std::make_unique<\2>(\3));/g'
Probably it is best to put them together, to ensure that matching types are replaced before working on different types:
sed \
-e 's/std::unique_ptr<\([a-zA-Z :_<>,0-9]*\)> \([a-zA-Z :_<>,0-9]\{1,\}\)( new \1(\(.*\)) );/std::unique_ptr<\1> \2 = std::make_unique<\1>(\3);/g' \
-e 's/std::unique_ptr<\([a-zA-Z :_<>,0-9]*\)>( new \1(\(.*\)) );/std::make_unique<\1>(\2);/g' \
-e 's/std::unique_ptr<\([a-zA-Z :_<>,0-9]*\)> \([a-zA-Z :_<>,0-9]\{1,\}\)( new \([a-zA-Z :_<>,0-9]*\)(\(.*\)) );/std::unique_ptr<\1> \2 = std::make_unique<\3>(\4);/g' \
-e 's/std::unique_ptr<\([a-zA-Z :_<>,0-9]*\)>( new \([a-zA-Z :_<>,0-9]*\)(\(.*\)) );/std::unique_ptr<\1>(std::make_unique<\2>(\3));/g'
Before | After | Correct? |
---|---|---|
|
| ✗, not replaced |
|
| ✓ |
|
| ✓ |
|
| ✓ |
|
| ✓ |
|
| ✓ |
|
| ✓ |
|
| ✓ |
|
| ✓ |
reset an unique_ptr
with new
or nullptr
This section assumes that you are mostly using smart pointers with the default deleter.
Otherwise changing a.reset( new int(42) );
to a = std::make_unique<int>(42) ;
is a breaking change.
If you are using a custom deleter, i.e something like auto up = std::unique_ptr<a>(ptr, deleter);
, consider making the deleter part of the type
struct s{
void operator(a* ptr){
deleter(ptr);
}
};
auto up = std::unique_ptr<a, s>(ptr);
This makes it impossible to use by accident the wrong deleter.
For most cases, something like 🗄️:
sed -e 's/\.reset( *new *\([a-zA-Z :_<>,0-9]*\)(\(.*\));/ = std::make_unique< \1 >( \2;/g' -e 's/\.reset( *new *\([a-zA-Z :_<>,0-9]*\));/ = std::make_unique< \1 >();/g' -e 's/\.reset( *nullptr *);/ = nullptr;/g' -e 's/\.reset( *);/ = nullptr;/g';
should be sufficient.
Before | After | Correct? |
---|---|---|
|
| ✓ |
|
| ✓ |
|
| ✓, mostly |
|
| ✓ |
|
| ✓ |
Using make_unique
, like in all other places, has the advantage that it makes it more obvious where possible leaks are.
For cases like a.reset(b.release())
use
sed -e 's/\.reset(\(.*\)\.release());/ = std::move(\1);/g';
It also makes sense to combine it with
sed -e 's/std::move(std::make_unique\(.*\));/std::make_unique\1;/g';
If not a.reset(std::make_unique<int>(42).release());
would be left as a = std::move(std::make_unique<int>(42));
.
It could be possible to avoid this issue altogether by using
sed -e 's/.reset(std::make_unique<\(.*\)> *(\(.*\))\.release());/= std::make_unique<\1>(\2);/g';
so that this particular code smell is fixed as a separate commit.
Other instances of unique_ptr
The regexes applied for the return statement can be reused in other situations by simply removing return
. Similarly, the regexes used for .reset
that ends with ;
can be reused in other places. Depending on how many matches and particular code constructs, it might make sense to make further changes to restrict the matches and reduce the number of false positives.
Return a naked owning pointer
Change return new int (42)
to return std::make_unique<int>(42)
.
Further changes are necessary, like changing the function signature and where the function is used. Applying the regex 🗄️, and compiling afterwards the code, will give an idea of what and how many changes are necessary at least, and eventually discover some bugs.
sed -e 's/return *new *\([a-zA-Z :_<>,0-9]*\) *( *\(.*\) *) *;/return std::make_unique<\1>(\2);/g';
Before | After | Correct? |
---|---|---|
|
| ✓ |
|
| ✓ |
|
| ✗, not replaced |
|
| ✗, not replaced |
|
| ✓ |
Other instances of new
new
is a very common word, so it is useful to try to avoid matching commented-out code and comments, those all lines beginning with /
or *
(for comments that span over multiple lines).
Assignment
Use something like following regex 🗄️ to replace new
with make_unique
:
sed -e 's/\(^[^\*/]*\)= *new *\([a-zA-Z :_<>,0-9]*\) *(\(.*\));/\1 = std::make_unique<\2>(\3);/g' -e 's/\(^[^\*/]*\)= *new *\([a-zA-Z :_<>,0-9]*\);/\1 = std::make_unique<\2>();/g'
Before | After | Correct? |
---|---|---|
|
| ✓ |
|
| ✓ |
/* * a = new int(42); */ | /* * a = new int(42); */ | ✓ |
Function call
If new
is used when calling a function, like foo(new int)
, then something like 🗄️:
sed -e 's/\(^[^\*/]*\)( *new *\([a-zA-Z :_<>,0-9]*\) *(\(.*\));/\1(std::make_unique<\2>(\3);/g' -e 's/\(^[^\*/]*\), *new *\([a-zA-Z :_<>,0-9]*\) *(\(.*\));/\1,std::make_unique<\2>(\3);/g'
should do the job (eventually call it more than once)
Note that I have currently no idea how to write a readable regex to support f(new a, c)
because "a,c" could denote a type with the current regex rules. At that point, hopefully, new
does not appear as many times, and those places can be fixed by hand.
Before | After | Correct? |
---|---|---|
|
| ✓ |
|
| ✗, not replaced |
|
| ✓ |
|
| ✓ |
|
| ✗, not replaced, execute regex more than once |
|
| ✓ |
|
| ✗, not replaced, execute regex more than once |
|
| ✓ |
|
| ✗, not replaced |
|
| ✓ |
delete
Ideally, a codebase never uses the delete
for freeing memory, as resource management is done through constructors and destructors.
Generally delete
cannot be transformed into something useful, it could be deleted. Similarly to new
it is a very common word, thus it is a good idea to skip comments to reduce the number of matches 🗄️
grep -e '^[^/=\*]*\bdelete [^\[]*[a-zA-Z0-9:_]';
Code | Match? |
---|---|
| yes, ✓ |
| no, ✓ |
| no, ✓ |
| yes, ✓ |
| yes, ✓ |
| no, ✓ |
// delete a // value from the database | no, ✓ |
| yes, ✗ |
/* * delete a * value from the database */ | no, ✓ |
/* delete a value from the database */ | yes, ✗ |
I’m skipping lines that have a =
before delete
, as those are probably something that looks like a deleted function.
grep -e '^[^/=\*]*\bdelete *\[';
Code | Match? |
---|---|
| no, ✓ |
| yes, ✓ |
| yes, ✓ |
| yes, ✓ |
| no, ✓ |
| no, ✓ |
| no, ✓ |
I prefer handling delete
and delete[]
differently because often a better replacement has a very different look (mostly std::vector
versus a std::unique_ptr
).
Return a value directly instead of a pointer
Especially before C++11 and move semantic, it was common, especially for classes that could not be copied, to return a pointer pointing to the heap.
This adds an unnecessary indirection and a heap allocation. Considering RVO, it should never be necessary to use such techniques anymore, so it would be nice to be able to remove these unnecessary indirections.
Unfortunately changing the function signature is not enough, the user of the functions needs to be changed too.
Consider the following piece of code
struct foo{
int bar();
};
std::unique_ptr<foo> fun1();
void fun2(const foo& f);
int main(){
auto res = fun1()->bar();
fun2(*fun1());
}
After changing the signature of fun1
to foo fun1();
, the changes one needs to do are removing dereferences (*
) and replacing ->
with .
.
This can be accomplished with something similar 🗄️ to
sed -e 's/\*\(.*\)\bfun1()/\1fun1()/g' -e 's/\bfun1()->/fun1()./g';
Note that the regex is simple enough to work also for member functions 🗄️:
struct foo{
int bar();
std::unique_ptr<foo> fun1();
};
void fun2(const foo& f);
int main(){
foo f;
auto res = f.fun1()->bar();
fun2(*f.fun1());
}
Before | After | Correct? |
---|---|---|
|
| ✓ |
|
| ✓ |
|
| ✓ |
|
| ✓ |
Cleanup functions
Common names (or part of names) for cleanup and acquisition functions are
-
lock (like
std::mutex::lock
,std::mutex::unlock
,LockFile
🗄️,UnLockFile
🗄️) -
acquire (like
CryptAcquireContext
🗄️) -
release (like
std::unique_ptr::release
,CryptReleaseContext
🗄️) -
free (like
free
) -
detach (like
std::thread::detach
orpthread_detach
) -
new (like
new
,new[]
) -
delete (like
delete
,delete[]
) -
alloc (like
malloc
,realloc
,calloc
) -
open (like
fopen
,dlopen
) -
close (like
fclose
,CloseHandle
🗄️,FindClose
🗄️) -
create (like
RegCreateKey
🗄️) -
clean
-
load (like
LoadLibrary
🗄️)
Since those operations are often best handled with factory functions (like std::make_unique
instead of new
), constructors, and destructors (like lock_guard
instead of lock
and unlock
), it makes sense to search for those and other similar patterns.
Other actions that are often best handled with constructors and destructors, or somehow encapsulated, are
-
start
-
stop
-
revert
-
undo
-
join (like
std::thread::join
)
I happen to search for those or similar patterns from time to time (comments excluded), because most of the time (think about new
and delete
, or locking a mutex) those constructs are error-prone to use and are better encapsulated. Thus I expect to have only a few matches.
std::chrono
vs integers
Before std::chrono
, most functions would take an integral for denoting time. If lucky, a comment would explain if the given parameter denoted seconds, milliseconds, or something else.
std::chrono
is a nice example of how a wrapper can be easier to use and read.
When changing a specific function signature, there might be a lot of code that needs to be adapted. Function call where a literal value is being used can be changed with a regex 🗄️:
sed -e "s/set_timeout( *\([0-9-][0-9']*\) *)/set_timeout( \1ms \)/g"
Before | After | Correct? |
---|---|---|
|
| ✓ |
|
| ✓ |
|
| ✓ |
|
| ✓ |
catch by const-ref
It is easy to forget to write const
when catching an exception.
With
sed -e 's/catch( *\([a-zA-Z :_<>,0-9]*\)/catch(const \1/g' -e 's/catch(const const /catch(const /g'
it is possible to add a const
after catch(
. If the capture should not be made constant, the compiler will complain.
The sed
command avoids adding const
when it is already there, but it does not detect patterns where the const
is written on the right catch(std::exception const&)
.
A regex like
sed -e 's/catch *( *\([a-zA-Z :_<>,0-9]*\)\bconst\b\(&\)* /catch(const \1\2 /g'
is also problematic, as const
could be part of the captured type, for example: catch(myexception<const& int> const&)
.
To make the pattern less error-prone, one could use
sed -e 's/catch *( *\([a-zA-Z :_<>,0-9]*\)\(.*\)\bconst\b\(&\)* /catch(const \1\2\3 /g'
to find only the last const
instead of the first one. This means that this pattern works correctly for catch(myexception<const& int> const&)
, but still fails for catch(myexception<const& int>)
.
As variable names cannot contain any of the following characters <>,
, it is possible to make the regex even more foolproof, but I do not think that it would make much sense. First, the compiler is going to complain if a regex broke a catch block, second, I’ve never encountered a templated type in a catch clause.
What I’ve curently used 🗄️ is
sed -e 's/\bcatch *( *\([a-zA-Z :_<>,0-9]*\)\(.*\)\bconst\b\(&\)* /catch(const \1\2\3 /g' -e 's/\bcatch *( *\([a-zA-Z :_<>,0-9]*\)/catch(const \1/g' -e 's/\bcatch(const const /catch(const /g';
Before | After | Correct? |
---|---|---|
|
| ✓ |
|
| ✓ |
|
| ✓ |
|
| ✓ |
|
| ✗, wrong replacement(!) |
|
| ✓ |
|
| ✓ |
|
| ✓ |
|
| ✓ |
Multiline regexes
Yes, it should be forbidden, but hey, for simple use cases, just like single-line regexes, it can be good enough.
sed
can be used for multiline regexes simply by using -z
as an additional parameter. -z
means that sed
uses \0
and not a newline as a delimiter when applying a regex. This means that a single regex is generally applied to the whole textual file.
When using -z
, one needs to detect newlines too. This makes writing the regex more complex, as the code after a new line is normally formatted with multiple tabs or spaces. Thus you might need to replace a single whitespace with something like \s*
.
As a very dumb example, suppose we want to transform return std::unique_ptr<int>( new int(42) );
to return std::make_unique<int>(42);
The previously presented regex would not work, as it does not find the ending );
and for simplicity, let’s just concentrate on this piece of the regex: 's/return std::unique_ptr<.*>( new \([a-zA-Z :_<>,0-9]*\)(\(.*\) *) *) *;/return std::make_unique< \1 >(\2);/g'
The first approach would be to replace whitespaces with [ \t\n]
or \s
:
sed -z -e 's/return\s*std::unique_ptr<.*>(\s*new\s*\([a-zA-Z :_<>,0-9]*\)(\(.*\)\s*)\s*)\s*;/return std::make_unique< \1 >(\2);/g'
but it is not sufficient, as it would transform
#include <memory>
auto bar(){ return std::unique_ptr<int>(
new int(42) ); }
void foo(){ (int*)nullptr >( new int(44) ); }
to
#include <memory>
auto bar(){ return std::make_unique< int >(44); }
Having statements on single lines makes writing regex easier because they are automatically scoped at most to a single statement.[1]
Since the regex matches multiple statements, it is necessary to let it match at most one statement by hand.
The easiest way to do it is to replace .*
with [^;]*
, thus replacing "any character", with "any character that is not a semicolon". Contrary to other languages, the ;
at the end of a statement is never optional.[2]
With
sed -z -e 's/return\s*std::unique_ptr<[^;]*>(\s*new\s*\([a-zA-Z :_<>,0-9]*\)(\([^;]\)\s*)\s*)\s*;/return std::make_unique< \1 >(\2);/g'
the code is transformed to
#include <memory>
auto bar(){ return std::unique_ptr<int>( new int(42) ); }
void foo(){ (int*)nullptr >( new int(44) ); }
which is the desired output.
Note 📝 | Both transformations are valid code snippets that would compile. I did this by accident, and it shows that multiline regexes are generally too broad-scoped by default and should be used with care. |
TLDR; there are two rules of thumb to apply to a regex when using it on multiple lines
-
replace whitespace with
\s*
; a sequence of multiple whitespaces (whitespace, tab, line feed, …) -
replace
.*
with[^;]*
; a sequence of any character that does not contain;
Replace base uncopyable class with deleted assignment operators
The Uncopyable
class from Scott Meyers Book Effective C++ 🗄️:
class Uncopyable{
protected:
Uncopyable(){}
~Uncopyable(){}
private:
Uncopyable(const Uncopyable&);
Uncopyable& operator=(const Uncopyable&);
};
it can help to suppress copy constructors.
I have seen in the wild implementations with virtual destructors, which missed the following advice (emphasis mine)
The implementation and use of Uncopyable include some subtleties, such as […] and that Uncopyable’s destructor need not be virtual […].
Even with the virtual destructor, the class seems to work, but it is a pessimization.
The boost implementation 🗄️ shows that multiple implementations have another possible drawback not described in "Effective C++": unintended ADL.
Another issue is that a subclass might implement a copy constructor. This code (with Uncopyable
"updated" to C++11) compiles without errors
class Uncopyable{
Uncopyable(const Uncopyable&) = delete;
Uncopyable& operator=(const Uncopyable&) = delete;
protected:
Uncopyable() = default;
~Uncopyable() = default;
};
struct s : private Uncopyable{
int j= 42;
s() = default;
s(const s& other) : j(other.j){
// do eventually something else
}
};
void foo(){
s v1;
s v2 = v1; // copy
}
Also sometimes users of the class miss following advice (emphasis mine)
The […] use of Uncopyable include some subtleties, such as the fact that inheritance from Uncopyable needn’t be public
Since C++11, it should be preferred to delete
the copy constructor (and assignment) operator directly, or follow the "Rule of 0" instead of relying on class hierarchies. Thus the advice of using Uncopyable
did (fortunately) not age well. It is generally less error-prone and generally helps little with legibility.
With the following multiline regular expression it is possible to match subclasses of Uncopyable
and add deleted copy constructors and assignments:
sed -z -e 's/\s*\b\(class\|struct\)\s*\([a-zA-Z0-9]*\)\s*:\s*\(private\|public\|protected\)\?\s*Uncopyable\s*{\n/\n\1 \2\n{\n\2 (const \2 \&)=delete;\n\2\& operator=(const \2 \&)=delete;/g'
This regex changes
class a : private Uncopyable
{
};
class b : private Uncopyable
{
};
class c : Uncopyable
{
};
class d : public Uncopyable
{
};
struct e : public Uncopyable
{
};
to
class a
{
a (const a &)=delete;
a& operator=(const a &)=delete;
};
class b
{
b (const b &)=delete;
b& operator=(const b &)=delete;
};
class c
{
c (const c &)=delete;
c& operator=(const c &)=delete;
};
class d
{
d (const d &)=delete;
d& operator=(const d &)=delete;
};
struct e
{
e (const e &)=delete;
e& operator=(const e &)=delete;
};
Note that this version does not change, by design, the class in case of multiple inheritance.
Replace empty constructor and destructor with default
Instead of writing
myclass::myclass(){
}
myclass::~myclass(){
}
one should prefer the shorter form
myclass::myclass() = default;
myclass::~myclass() = default;
Replacing the destructor is "easy", as destructors have a ~
and do not take any parameter.
sed -z -e 's/~\([^;()]*\)(\s*)\s*{\s*}/~\1() = default;/g'
Note that this regex will not match ~a() override {}
, so let’s add an optional capture for override
sed -z -e 's/~\([^;()]*\)(\s*)\s*\(override\s*\)*{\s*}/~\1() \2= default;/g'
On the other hand, replacing a constructor is more difficult, as there is no "direct" way to recognize if a given function is a constructor.
Is a(){}
a constructor that we would like to set = default;
? Yes if it appears directly inside a class
or struct
named a
, otherwise not.
On the other hand, recognizing a non-inline constructor is easier, as the function name coincides with a class name. Thus a::a(){}
is a constructor we are interested in, while a::b(){}
is not.
sed is able to use a capture group in the expression we want to match, I’m not sure if this feature is widely supported, but it makes it trivial to match outline constructors:
sed -z -e 's/\([^;()]*\)::\1(\s*)\s*{\s*}/\1::\1() = default;/g'
Note that those regexes fail to find functions like a::~a(void){}
and a::a(void){}
.
It is left as an exercise to the reader to adapt the regexes, even if in my opinion you should stop writing unnecessary void
parameters.
Inline constructors are still to be updated, assuming that there will not be many empty nullary functions, searching for (\s*)\s*{\s*}
, might be sufficient for finding the remaining constructors without too many positives.
Before (inline for simplicity) | After | Correct? |
---|---|---|
|
| ✓ |
|
| ✓ |
|
| ✓ |
|
| ✓ |
|
| ✗, extend regex, or do not use |
|
| ✓ |
|
| ✗, wrong replacement. The original code had undefined behavior, the new one does not build ( |
|
| ✗ |
|
| ✓ |
|
| ✓ |
|
| ✓ |
|
| ✗, extend regex, or do not use |
|
| ✓ |
auto
instead of iterator types
When iterating over a container, it is not uncommon to find code similar too
// iteratorytype is either the actual type, or a typedef to make the code more readable for(iteratortype it = c.begin(); it != c.end(); it++){ // ... }
While there are most of the time more expressive ways to write such loop (with a foreach loop, or with an appropriate algorithm) a little upgrade that can be done with a regex is using auto
instead of writing the iterator type explicitly:
for( auto it = c.begin(); it != c.end(); ++it){
// ...
}
The code is generally more compact and less error-prone to maintain.
Note that generally speaking, transforming the first form into the second is not safe.
Maybe c.begin()
does not return an iteratortype
, and the return type of c.begin()
is convertible to iteratortype
.
For example iteratortype
might be a const_iterator
.
Thus blindly replacing one form with the other might have unintended side effects. Even more, if the iterator is compared with something else, after all, begin
and end
do not necessarily have anything to do with iterators, they could return integers, enum
or some other strange type.
For this reason, the presented regex is as strict as possible.
The code matches only if
-
it is inside a
for
statement -
both
.begin() (or `.cbegin()
) and.end()
(or.cend()
) appear -
the iterator is compared with
!=
, and not, for example<
-
the return value of
begin()
is incremented with++
-z -e 's/for\s*(\s*[a-zA-Z :_<>,0-9\*]*\s\s*\([a-zA-Z:_<>,0-9]*\)\s*=\s*\([a-zA-Z:_<>,0-9\.]*\)\.\(c\)\{0,1\}begin();\s*\1\s*!=\s*\2\.end();\s*\(++\1\|\1++\)\s*)/for(auto \1 = \2.\3begin(); \1 != \2.\3end(); ++\1)/g'
Before (inline for simplicity) | After | Correct? |
---|---|---|
|
| ✓ |
|
| ✓ |
|
| ✓ |
|
| ✗, extend regex, but I assume tha using |
|
| ✓, as there might be implicit conversions |
|
| ✗, extend regex, but I assume that this pattern is rare enough that the change is not worth it |
|
| ✗, extend regex, but I assume that this pattern is rare enough that the change is not worth it |
|
| ✓, as there might be implicit conversions |
An alternate approach would be to add a static_assert
, and use the compiler to verify if some unwanted conversions have been introduced:
sed -z -e 's/for\s*(\s*\([a-zA-Z :_<>,0-9\*]*\)\s\s*\([a-zA-Z:_<>,0-9]*\)\s*=\s*\([a-zA-Z:_<>,0-9\.]*\)\.\(c\)\{0,1\}begin();\s*\2\s*!=\s*\3\.end()\s*;\s*\(++\2\|\2++\)\s*)/static_assert(std::is_same<\1, decltype(\3.\4begin())>::value, "");\nfor(auto \2 = \3.\4begin(); \2 != \3.\4end(); ++\2)/g'
This would transform
for(std::vector< int >::iterator it = c.begin(); it != c.end(); it++){
// ...
}
to
static_assert(std::is_same<std::vector< int >::iterator, decltype(a.begin())>::value, "");
for(auto it = a.begin(); it != a.end(); ++it){
// ...
}
This means that for the following example
for(std::vector< int >::const_iterator it = c.begin(); it != c.end(); it++){
// ...
}
the code will not compile after applying the regex instead of slightly changing meaning, and one can inspect manually what changes are necessary to retain the desired behavior (in this case, use cbegin
instead of begin
).
Conclusion
Most regexes written here have been used only a limited number of times and with some adaptations.
For example, in the beginning, I used something similar to
sed -e 's/std::unique_ptr<.*>( new \([a-zA-Z]*\)(\(.*\));/std::make_unique< \1 >(\2;/g';
After noticing that the difference was so big that it would have made reviewing too difficult. I looked if was possible to split the changes into smaller pieces. So I narrowed the scope down with the following regex
sed -e 's/return std::unique_ptr<.*>( new \([a-zA-Z]*\)(\(.*\));/return std::make_unique< \1 >(\2;/g';
After fixing all return statements, I used the more generic pattern to see where it made sense to continue to change the code.
Similarly, when searching for common cleanup and acquisition functions, in the first run, I use the most generic term, like lock
. After looking at how many matches there are, I might decide to grep
for a less common pattern, like \.[a-z]*lock(
, to match member functions like lock
and unlock
.
Most examples are about pointers, because
-
it occurred to me it might be a good idea to write those one-shot regexes down when I was changing some code dealing with pointers
-
the deprecation of
std::auto_ptr
and addition ofstd::unique_ptr
andnullptr
in C++11, and the addition ofstd::make_unique
in C++14 lead to a lot of low-hanging fruits easy to fix quickly with regexes.
It also occurred to me that it is possible to reuse those regexes and integrate them with Cppcheck. As Cppcheck normalizes the source code, regexes can be simplified a lot. Instead of writing, for example, = *new
, writing = new
is sufficient. Also, many presented regexes fail if someone uses using namespace std;
and unique_ptr
without std::
, and does not handle a ton of cases with whitespace or tabs. With Cppcheck this is a non-issue and makes this much easier to maintain. The main disadvantage is that with Cppcheck it is not possible to transform the code automatically.
;
optional under some circumstances. First, it makes the code inconsistent. Second, always having a ;
at the end of a statement makes parsing the code in my head easier/less error-prone
Do you want to share your opinion? Or is there an error, some parts that are not clear enough?
You can contact me anytime.