Stop using regex for validation

A regular expression (also known as a regex) is a sequence of characters that defines a search pattern. Regular expressions are widely supported in the search dialogs of IDEs, text editors, and other tools. They are also supported out of the box in many programming languages, such as Java, C++, JavaScript, PHP, Python, Perl, and many others.

Regular expressions are notoriously hard to read, but, nonetheless, people ask all the time how to validate passwords, emails, and even XML and certificates with them!

There are many reasons not to use regular expressions for validating your input. Readability is only one of them, and not even the most important.

Time complexity is not known

Suppose you want to verify if a string begins with "http://". Which of the following regexes would be more appropriate: "^http://", "http://", "http://.*", "http:\/\/", or some other combination?

It’s not always easy to see whether a specific regex has drawbacks compared to another, in particular because most regex engines do not give any guarantees about performance. On the other hand, simple algorithms like find, contains, begins_with, ends_with, compare, split, and so on normally come with strong guarantees about their time complexity. If those algorithms are not available in the standard library of your language of choice, they are easy to implement.
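For instance, a prefix check can be written in a few lines with an obvious cost model. Here is a minimal sketch in C++ (the function name begins_with is our own; since C++20 the standard library offers std::string::starts_with directly):

```cpp
#include <string>

// A sketch of begins_with: at most min(prefix.size(), s.size()) character
// comparisons, no backtracking, no hidden allocations.
bool begins_with(const std::string& s, const std::string& prefix) {
    if (prefix.size() > s.size()) return false;
    return s.compare(0, prefix.size(), prefix) == 0;
}
```

Compare this with the regex variants above: here the worst case is immediately visible from the code.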

To make things worse, there are different regex dialects and some implementations are locale-aware, whereas others are not. The constraints are not always easy to understand.

What if we want to verify that a string is composed of at least 4 digits and at most 9? Probably "^[0-9]{4,9}$". What are the runtime costs?

On the other hand, something like (3 < length < 10) && for_each(is_digit) performs at most 9 character accesses in most programming languages. No hidden complexity.
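A sketch of that check in C++, assuming we only want to accept the ASCII digits 0 to 9 (the function name is made up for illustration):

```cpp
#include <string>

// "Between 4 and 9 ASCII digits": at most 9 character reads, no regex engine.
// We compare against '0' and '9' directly to stay locale-independent.
bool is_4_to_9_digits(const std::string& s) {
    if (s.size() < 4 || s.size() > 9) return false;
    for (char c : s) {
        if (c < '0' || c > '9') return false;
    }
    return true;
}
```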

Memory usage is unknown

I’ve already mentioned that time complexity is difficult to predict, but what about memory usage?

Let’s reuse the last example: what’s the memory usage for "^[0-9]{4,9}$"?

And what’s the memory usage of (3 < length < 10) && for_each(is_digit) in most programming languages? Verifying whether a character is a digit usually has no additional cost in terms of memory consumption. We have no idea about the regex, whereas the manual verification has no space costs.

Algorithms are easier to write and read

Consider something like ((?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[\W]).{6,20}). What does this regex do?

Apparently, it tests whether a string contains at least one digit, one lowercase letter, one uppercase letter, and one "special" character, and is at least 6 characters long, but not longer than 20.

How difficult would it be to test for those conditions manually?

struct string_matcher{
	int digit = 0;
	int lowercase = 0;
	int uppercase = 0;
	int special = 0;
};

bool is_valid(string_matcher sm){
	return sm.digit >= 1 && sm.lowercase >= 1
		&& sm.uppercase >= 1 && sm.special >= 1;
}

if(s.length < 6 || s.length > 20){
	// fail
}
string_matcher sm;
for(pos = 0; pos < s.length; ++pos){
	if(is_digit(s[pos]))          ++sm.digit;
	else if(is_lowercase(s[pos])) ++sm.lowercase;
	else if(is_uppercase(s[pos])) ++sm.uppercase;
	else if(is_special(s[pos]))   ++sm.special;
}
if(!is_valid(sm)){
	// fail
}

Yes, those lines of code are tied to the language we are currently programming in. But the logic is easy to understand, since we encapsulated each condition in a variable name, and it is therefore easier to spot errors than in the regex. And since it is so trivial, it is also easy to implement in other languages.

It also makes explicit that there might be some problems with encoding and locale. For example, would you consider ö and Ö normal or special characters (those characters are common in Germanic and Scandinavian languages)? And what about ò and â (which are common in Romance languages)? Does our regex take that into account? How does it behave? It depends on the regex engine, and on how we wrote our regex. In the regex above those characters might be ruled out, since we wrote [a-z] and [A-Z] instead of [[:lower:]] and [[:upper:]]; but we might also have ruled out characters like tuvwxy if the locale is set to Estonian. The behavior really depends on factors that we do not always control, or that are not well documented.

What is a lowercase/uppercase letter? Some languages have no notion of it (some do not even have a notion of a letter), and in others it gets more complicated. In Turkish, for example, the uppercase of i is not I.

All those issues are hidden by the regex engine. Does [a-z][A-Z] do the right thing, even in Turkish, where we have İ and ı to consider? And what about other languages? I tried to read the regex documentation of different programming languages, but finding the answer is not easy. Of course, we can test those edge cases explicitly, but what would the tests show us? An intended (but undocumented) behavior? A behavior that depends on our environment? An unintended behavior that might change in the future?

Do we always remember the difference between \d and [0-9]? There are more than 10 digit characters, even if most of the time we think in base 10 and use the Arabic numerals. Using the isdigit function of the C++ standard library does not solve the problem either, unless it is documented very clearly whether it is locale-dependent or not. On the other hand, if someone writes or copies (shudder) a regex that contains a \d, we might never know whether they intended [0-9], whether they thought \d and [0-9] were equivalent, whether they were equivalent on the previous system, or something else. And we did not even talk about [[:digit:]]! If you want to match digits, for example when searching for an IP address, and want maximum portability, the most foolproof method is to write [0123456789]. More information about the differences between those four expressions can be found at unix.stackexchange, which also shows the differences between programming languages. This only shows how difficult it is to write a correct regex, even in its simplest form.

Still not convinced? Look at this Stack Overflow question: how to validate an IPv4 address. There are many answers, and comments where people propose alternative, similar-looking regexes that always seem to fail for someone else. I guess almost no one could simply look at them and say "this one is wrong" and "this one is right". And no one noticed that many of the proposed solutions use \d instead of [0123456789] (or simply [0-9]); probably no one tried any of the solutions with a non-English locale, since grep is locale-aware.

So, if you plan to search (not validate!) for an IPv4 address in some known files, grep and a regular expression are a superb quick-and-dirty solution. But if you are writing a program (in any programming language!) and want to validate an input that should represent an IPv4 address, do not even think of using a regex. How difficult can it be to iterate over the characters, verify that each is a digit or a dot, and verify the structure? After all, you only need to check for four numbers between 0 and 255, separated by dots.
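To illustrate, here is a hypothetical hand-written IPv4 validator in C++. It accepts only the ASCII digits 0 to 9, so it is locale-independent; rejecting leading zeros is a policy choice of this sketch, and easy to see and change in the code:

```cpp
#include <string>

// Validate "a.b.c.d" with each part a decimal number in 0..255.
// Single pass, one character read per position, no regex engine.
bool is_ipv4(const std::string& s) {
    int octets = 0;
    std::size_t i = 0;
    while (i < s.size()) {
        std::size_t start = i;
        int value = 0;
        while (i < s.size() && s[i] >= '0' && s[i] <= '9') {
            value = value * 10 + (s[i] - '0');
            if (value > 255) return false;   // octet out of range
            ++i;
        }
        std::size_t len = i - start;
        if (len == 0 || len > 3) return false;       // empty or too long
        if (len > 1 && s[start] == '0') return false; // reject leading zeros
        ++octets;
        if (i < s.size()) {
            if (s[i] != '.') return false;   // only dots between octets
            ++i;
            if (i == s.size()) return false; // trailing dot
        }
    }
    return octets == 4;
}
```

Every rule ("four numbers", "0 to 255", "no leading zeros") is a named, visible line; in a regex each of them turns into another hard-to-review alternation.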

A custom algorithm might be easier to adapt

What if we need to change the validation? For example, we decide that requiring a lowercase or an uppercase letter is sufficient.

How easy is it to update the regex and keep it readable? Changing the manual validation is, in those cases, very easy. Handling possible merge conflicts during development is also easy, whereas with a regex it is generally more complex.
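For instance, assuming the string_matcher counters from the earlier example, relaxing the rule to "one lowercase or one uppercase letter suffices" is a one-line change that is trivial to review in a diff:

```cpp
struct string_matcher {
    int lowercase = 0;
    int uppercase = 0;
    int special = 0;
};

// The only change from the stricter rule is the || on the first line.
bool is_valid(const string_matcher& sm) {
    return (sm.lowercase >= 1 || sm.uppercase >= 1) && sm.special >= 1;
}
```

Try making the equivalent change inside the lookahead regex above and then reviewing it in a pull request.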

Better error messages for the end user

Suppose that you are using a regex to verify that passwords are complex enough. If they are not, you need to tell the end-user that they need to provide a stronger password.

Wouldn’t it be nice to tell them why the current password is not good enough? And maybe what they need to change in order to make a good and strong password?

A regex simply tells us whether it found a match, not which characters it could not find. Of course, it can be enhanced to capture content, but that makes the regex even more complex.

With a custom algorithm (see the string_matcher example) we can tell the user exactly what’s missing, for example, an uppercase letter and a special character, since we collected all the relevant information while parsing the string.
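A sketch of how that could look, reusing the string_matcher counters (the helper name and the message texts are made up for illustration):

```cpp
#include <string>

struct string_matcher {
    int lowercase = 0;
    int uppercase = 0;
    int special = 0;
};

// Turn the counters collected while scanning the password into a concrete
// message for the user. An empty result means the password passed.
std::string missing_requirements(const string_matcher& sm) {
    std::string msg;
    if (sm.lowercase < 1) msg += "add a lowercase letter; ";
    if (sm.uppercase < 1) msg += "add an uppercase letter; ";
    if (sm.special < 1)   msg += "add a special character; ";
    return msg;
}
```

A regex-based validator would have to be split into several captures, or several regexes, to produce the same feedback.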

Testing a regex

Since a complex regex is more difficult to read, it also gets more difficult to test thoroughly, including its possible edge cases.

A parsing algorithm might not be easy to test either, but most of the time, if the algorithm is complex, it is possible to split it and unit-test its components.

There are some tools, like this one, that generate text validating against our regex; they might help us understand what a regex is doing by providing data that it matches. Other tools, like regex101, try to explain step by step what the regex is doing.

This is of course not possible with our custom algorithm, but we have other tools that might be a lot more effective, for example fuzz testing. Fuzzing the regex means exercising the whole regex engine, which also means it will need a lot more time than running the fuzzer on a custom algorithm.

Escaping

Some characters inside a regex need to be escaped.

In most programming languages other characters need to be escaped too, leading to a situation where a regex is difficult to read because some characters need to be escaped in both languages, and therefore multiple times, leading to the leaning toothpick syndrome. Forgetting the escapes makes the regex easier to read, but incorrect, and without some tests you’ll probably have a hard time understanding what is wrong.

On the other hand, with a standard or custom algorithm there is no need for escaping, since we are not embedding a generic DSL (the regex) inside our programming language.
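As a small illustration of the double escaping: to match one literal backslash with std::regex, the pattern needs two characters, \\, and each of them must itself be escaped in the C++ string literal, giving four backslashes in the source. A plain find has no such problem:

```cpp
#include <regex>
#include <string>

// Four source backslashes -> two regex characters -> one matched character.
bool contains_backslash_regex(const std::string& s) {
    static const std::regex re("\\\\");
    return std::regex_search(s, re);
}

// The same check without a regex: one escaped character, no pattern language.
bool contains_backslash_find(const std::string& s) {
    return s.find('\\') != std::string::npos;
}
```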

When to use a regex

Of course, there are still valid use cases for regexes. Every rule that blindly states that something should never be done is probably flawed.

A regex is a more generic and powerful tool than our custom-made algorithm. Its main disadvantage is precisely that it is so generic: it gives few guarantees about space and time costs, and it will not be readable for all possible use cases.

Probably the most useful places for a regex are configuration options and configuration files, searches, and simple quick-and-dirty throw-away one-time validations.

When allowing a regex in configuration files, the user has a powerful tool for defining rules, for example, in neomutt I have the following line:

color body brightyellow black (((ht|f)tps?)|mailto):(//)?[^\ "\t]*|www\.[-a-z0-9.]+[^\ .,;\t>">]

neomutt gives the user the possibility to match some strings and colorize them.

This regex should match strings that look like a URL or mailto address, and colorize them in yellow. It is broken beyond repair for any input validation, and I have no idea what input it matches exactly without reading it many times and doing some testing.

It is not intended for validating URLs, it has false positives and negatives, and everyone will hate you for using it as a validation method.

But it gets the job done: it catches most URLs I get by email. And if it misses some URLs, that is not so bad, since the colorizing only helps me identify parts of the email; I’m still reading its content.

Would using a custom algorithm be a better alternative here? Sure, neomutt could provide a matchurl option, but it is a very specific use case that probably does not interest the majority of users. What if we do not want to colorize the mailto links? Or what if someone wants to colorize https URLs differently?

Giving all these options in a configuration file just to highlight a piece of text would be overwhelming. We would end up creating a DSL to support most use cases, which we would then need to parse and understand. It would probably not even have better performance or readability than a regular expression, and it would surely be less documented.

Nevertheless, it would be great if it were possible to highlight all URLs without writing a regex.

As with other tools, too much flexibility has a cost. And as always, we need to choose the most appropriate tool for the job. Applying a tool blindly rarely does a good job.

Searches inside an editor and searching and replacing text are examples where regexes are commonly used. But also in those cases, they might not always be the best tools. During refactoring tasks, for example, when renaming a variable, a naive regex could find too many references to replace. We need, for example, to pay attention that we are not going to rename variables in other scopes. Nonetheless, a regex will probably do a good job, and we can still fix the edge cases manually.

Conclusion

Do not use regexes for input validation. A custom algorithm will be more maintainable and readable in the long term. It will also be easier to test and to port between projects, and you can analyze its complexity, both in space and time.

Regexes are a great tool for end-users and for simple daily tasks. Some of those tasks could be searching or replacing text. Regexes can also be very useful for configuration files, or if we do not need a 100% correct solution, as an alternative to another DSL, or as an option for the end user to extend some programs.


Do you want to share your opinion? Or is there an error, some parts that are not clear enough?

You can contact me anytime.