Stop using regex for validation

A regular expression (also known as regex) is a sequence of characters that defines a search pattern. Regexes are widely supported in the search dialogs of IDEs, text editors, and other tools. They are also supported out of the box in many programming languages, like Java, C++, JavaScript, PHP, Python, Perl and many others.

Regular expressions are known to be hard to read; nonetheless, people ask all the time how to use them to validate passwords, emails, and even XML and certificates!

There are many reasons why you do not want to use regexes for validating your input. Readability is only one of them. A few examples follow:

Time complexity is not known

Suppose you want to verify that a string begins with "http://". Which of the following regexes would be more appropriate? "^http://", "http://", "http://.*", "http:\/\/" or some other combination?

It’s not always easy to see whether a specific regex has drawbacks compared to another, in particular because most regex engines do not give any guarantees about performance. On the other hand, simple algorithms like find, contains, begins_with, ends_with, compare, split et al. normally have strong guarantees about their complexity. If those algorithms are not available in the standard library of the language of your choice, they should be easy to implement.
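As an illustration of how little such an algorithm needs, here is a minimal sketch of a prefix check in C++17. The name begins_with is taken from the list above; the signature is my own assumption, not a standard library function.

```cpp
#include <string_view>

// Hypothetical helper: true if `s` begins with `prefix`.
// It compares at most prefix.size() characters and allocates nothing.
bool begins_with(std::string_view s, std::string_view prefix) {
    return s.size() >= prefix.size() &&
           s.compare(0, prefix.size(), prefix) == 0;
}
```

Calling begins_with(url, "http://") looks at no more than seven characters, whatever the length of url.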

To make things worse, there are different regex dialects and some implementations are locale-aware, whereas others are not. The constraints are not always easy to understand.

What if we want to verify that a string is composed of at least 4 and at most 9 digits? Probably "^[0-9]{4,9}$". What are the runtime costs?

On the other hand, something like (3 < length < 10) && for_each(is_digit) needs at most 9 character accesses in most programming languages. No hidden complexity.
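The pseudocode above could look like this in C++17. This is only a sketch: the function names are made up for the example, the chained comparison is written as two comparisons (C++ has no 3 < length < 10), and the digit test is deliberately ASCII-only.

```cpp
#include <algorithm>
#include <string_view>

// ASCII-only digit test, deliberately not locale-dependent.
bool is_ascii_digit(char c) {
    return c >= '0' && c <= '9';
}

// True if `s` consists of 4 to 9 ASCII digits and nothing else.
// The length is checked first, so at most 9 characters are inspected.
bool is_4_to_9_digits(std::string_view s) {
    return s.size() >= 4 && s.size() <= 9 &&
           std::all_of(s.begin(), s.end(), is_ascii_digit);
}
```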

Memory usage is unknown

I’ve already mentioned that time complexity is difficult to predict, but what about memory usage?

Let’s reuse the last example: what’s the memory usage for "^[0-9]{4,9}$"?

And what’s the memory usage of (3 < length < 10) && for_each(is_digit) in most programming languages? Verifying whether a character is a digit doesn’t usually have any additional cost. We have no idea for the regex, whereas the manual verification has no space costs.

Algorithms are easier to write and read

Consider something like ((?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[\W]).{6,20}). What does this regex do?

Apparently it tests whether a string contains at least one digit, one lowercase letter, one uppercase letter and one special character, and is at least 6 characters long, but not longer than 20.

How difficult would it be to test for those conditions manually?

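#include <string>

// Counts of each character class found so far.
struct string_matcher{
	int digit = 0;
	int lowercase = 0;
	int uppercase = 0;
	int special = 0;
};

bool is_valid(const string_matcher& sm){
	return sm.digit >= 1 && sm.lowercase >= 1 && sm.uppercase >= 1 && sm.special >= 1;
}

// ASCII-only classification, like [0-9], [a-z] and [A-Z].
// Anything else counts as special (unlike \W, this includes '_').
bool validate(const std::string& s){
	if(s.length() < 6 || s.length() > 20){
		return false; // fail: wrong length
	}
	string_matcher sm;
	for(std::size_t pos = 0; pos < s.length() && !is_valid(sm); ++pos){
		const char c = s[pos];
		if(c >= '0' && c <= '9') ++sm.digit;
		else if(c >= 'a' && c <= 'z') ++sm.lowercase;
		else if(c >= 'A' && c <= 'Z') ++sm.uppercase;
		else ++sm.special;
	}
	return is_valid(sm); // false: a required character class is missing
}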

Yes, those lines of code are tied to the programming language we are currently using. But they are easy to understand, since we encapsulated the conditions in named variables, and therefore it is easier to spot errors than in the regex. And since the algorithm is so trivial, it is also easy to implement in other languages.

It also makes explicit that there might be some problems with encoding and locale. For example, would you consider ö and Ö as normal or special characters (those characters are common in Germanic and Scandinavian languages), and what about ò and Ò (which are common in Romance languages)? Does our regex take that into account? How does it behave? It depends on the regex engine, and on how we wrote our regex. In the regex above those characters might be ruled out since we wrote [a-z] and [A-Z] instead of a POSIX character class like [[:alpha:]], but we might also have ruled out characters like tuvwxy if the locale is set to Estonian. The behavior really depends on factors that we can’t always control, or that are not so well documented.

What is a lowercase/uppercase letter? Some languages do not have a notion of it (some do not even have a notion of letter), and in others it gets more complicated. For example, in Turkish the uppercase of i is not I.

All those issues are hidden by the regex engine. Does [a-z][A-Z] do the right thing? Even in Turkish, where we have İ and ı to consider? And what about other languages? I tried to read the regex documentation of different programming languages, but finding the answer is not easy. Of course we can test for those edge cases explicitly, but what would the tests show us? An intended (but undocumented) behavior? A behavior that depends on our environment, or that might change in the future with another version?

Do we also always remember the difference between \d and [0-9]? There are more than 10 digit characters, even if most of the time we think in base 10 and use the Arabic numerals. Of course using the is_digit method of our standard library does not solve the problem, unless it is documented very clearly whether it is locale-dependent or not. On the other hand, if someone writes or copies (shudder) a regex that contains a \d, we might never know whether they intended [0-9], whether they thought that \d and [0-9] were equivalent, whether on the previous system they were, or something else. And we did not even talk about [[:digit:]]! If you really want to match digits, for example for searching for an IP address, and want maximum portability, you should write [0123456789]. More information about the differences between those four expressions can be found on unix.stackexchange, where differences between programming languages are also shown. This only shows how difficult it is to write a correct regex, even in its simplest form.

Did I not convince you? Look at this Stack Overflow question: how to validate an IPv4 address. Look at how many answers there are, and at how many people proposed similar-looking regexes in the comments that seem to fail for someone else. I guess nearly no one will be able to simply look at them and say that this one is wrong and this one is right. And no one noticed that many proposed solutions also use \d instead of [0123456789] (or simply [0-9]); probably no one tried any of the solutions with a non-English locale, even though grep is locale-aware.

So, if you plan to search (not validate!) for an IPv4 address in some known files, grep and some regular expression are a superb quick-and-dirty solution. But if you are writing a program (in any programming language!) and want to validate an input that should represent an IPv4 address, do not even think of using a regex. How difficult would it be to iterate over the characters, verify that each one is a digit or a dot, and verify the structure? After all, you need to check for four numbers between 0 and 255 separated by dots!
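To show that this is not much work, here is a minimal sketch in C++17 of such a loop. The function name and the exact rules are my assumptions: only dotted-decimal notation, ASCII digits, one to three digits per number, and leading zeros accepted. Your definition of a valid address might differ.

```cpp
#include <string_view>

// True if `s` is four decimal numbers in 0..255 separated by single dots.
// Assumptions: ASCII digits only, 1 to 3 digits per number, and leading
// zeros are accepted ("010" is read as 10).
bool is_ipv4(std::string_view s) {
    int parts = 0;   // dots seen so far, i.e. numbers already completed
    int digits = 0;  // digits seen in the current number
    int value = 0;   // value of the current number
    for (char c : s) {
        if (c >= '0' && c <= '9') {
            value = value * 10 + (c - '0');
            if (++digits > 3 || value > 255) {
                return false; // number too long or out of range
            }
        } else if (c == '.') {
            if (digits == 0 || ++parts > 3) {
                return false; // empty number or too many dots
            }
            digits = 0;
            value = 0;
        } else {
            return false; // neither a digit nor a dot
        }
    }
    return parts == 3 && digits > 0;
}
```

Every character is visited once, no memory is allocated, and each rejection has an obvious reason that could be reported to the caller.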

A custom algorithm might be easier to adapt

What if we need to change the validation, for example because we decide that either a lowercase or an uppercase letter is sufficient? How easy is it to update the regex and keep it readable? Changing the manual validation is very easy, and handling possible merge conflicts is doable, whereas with a regex both are generally more complex.

Better error messages for the end user

Suppose that you are using a regex to verify that passwords are complex enough. If they are not, you need to tell the end-user that they need to provide a stronger password.

Wouldn’t it be nice to tell them why the current password is not good enough? And maybe what they need to change in order to make a good and strong password?

A regex simply tells us whether it found a match or not, not which characters it could not find. Of course it can be enhanced to capture the content, but that will make the regex more complex.

With our custom algorithm (see the string_matcher example) we can tell the user exactly what’s missing, for example an uppercase letter and a special character, since we collected all the relevant information while parsing the string.
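A minimal sketch of how such feedback could be collected, in C++. The function name and the message texts are invented for the example; the rules follow the password regex discussed earlier (which also requires a digit), with an ASCII-only classification where every other character counts as special.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: lists every requirement that `s` fails to meet.
// An empty result means the password is acceptable.
// Assumption: ASCII-only classification; any other character is "special".
std::vector<std::string> password_problems(const std::string& s) {
    std::vector<std::string> problems;
    if (s.length() < 6) problems.push_back("at least 6 characters");
    if (s.length() > 20) problems.push_back("at most 20 characters");

    bool digit = false, lower = false, upper = false, special = false;
    for (char c : s) {
        if (c >= '0' && c <= '9') digit = true;
        else if (c >= 'a' && c <= 'z') lower = true;
        else if (c >= 'A' && c <= 'Z') upper = true;
        else special = true;
    }
    if (!digit) problems.push_back("a digit");
    if (!lower) problems.push_back("a lowercase letter");
    if (!upper) problems.push_back("an uppercase letter");
    if (!special) problems.push_back("a special character");
    return problems;
}
```

The caller can join the returned strings into a message such as "your password still needs: an uppercase letter, a special character".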

Testing a regex

Since complex regexes are more difficult to read, it also gets more difficult to test them thoroughly and to cover possible edge cases.

A parsing algorithm might not be easy to test either, but most of the time, if the algorithm is complex, it is possible to split it and unit-test its components.

There are some tools that generate text matching a given regex. They might help to understand what a regex is doing by providing data that passes through it. Other tools, like regex101, try to explain step by step what the regex is doing.

This is of course not possible with our custom algorithm, but we have other tools that might be a lot more effective, for example fuzz testing. Fuzzing the regex means testing the whole regex engine, which also means that it will need a lot more time than running the fuzzer on our loop.

Escaping

Some characters inside a regex need to be escaped. In most programming languages other characters need to be escaped too, so some regexes become difficult to read because some characters have to be escaped in both languages, and therefore multiple times. This is known as the leaning toothpick syndrome. Forgetting the escapes makes the regex easier to read, but incorrect, and without some tests you’ll probably never notice the issue.

On the other hand, with a standard or custom algorithm, there is no need for regex escaping, since we do not have a generic DSL (the regex) inside our programming language.
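A small illustration in C++, checking whether a path begins with the literal text C:\temp\. Both function names are made up for this sketch. With std::regex every backslash is escaped twice, once for the regex and once for the string literal; with a plain comparison only the escaping of the programming language itself remains.

```cpp
#include <regex>
#include <string>

// The literal text C:\temp\ as a regex inside a C++ string literal:
// each backslash is escaped once for the regex and once more for C++.
bool in_temp_regex(const std::string& path) {
    static const std::regex re("^C:\\\\temp\\\\");
    return std::regex_search(path, re);
}

// The same check as a plain prefix comparison:
// only the C++ string-literal escaping is left.
bool in_temp_plain(const std::string& path) {
    const std::string prefix = "C:\\temp\\";
    return path.compare(0, prefix.size(), prefix) == 0;
}
```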

When to use regexes

Of course there are still valid use cases for regexes. Every rule that blindly states that something should never be done is probably flawed.

A regular expression is a more generic and powerful tool than our custom-made algorithm. The main disadvantage of a regex is that it is so generic: it therefore cannot give any guarantees about memory and time costs, and it cannot be readable for all possible use cases.

Probably the most useful places for regexes are configuration options and configuration files, searches, and simple quick-and-dirty throw-away validations.

When regexes are allowed in configuration files, the user has a powerful tool for defining rules. For example, in neomutt I have the following line:

color body brightyellow black (((ht|f)tps?)|mailto):(//)?[^\ "\t]*|www\.[-a-z0-9.]+[^\ .,;\t>">]

neomutt gives the user the possibility to match some strings and colorize them. This regex should match strings that seem to be a URL or a mailto link, and colorize them in yellow. It is broken beyond repair for any input validation, and I have no idea what input it matches exactly without reading it many times and doing some testing. It is not intended for validating URLs, it has false positives and false negatives, and everyone will hate you for using it as a validation method. But it gets the job done, since it catches most simple URLs I get per email. And if it does not catch some URLs, it is not that bad, since the colorizing only helps me identify parts of the email; I’m still reading its content.

Would a custom algorithm be a better alternative here? Sure, neomutt could provide an option matchurl, but it’s a very specific use case that probably does not interest most users. What if we do not want to colorize the mailto links? Or want https URLs in a different color? Providing all these options in a configuration file just to highlight a piece of text would be overwhelming. We would end up creating a DSL (domain-specific language) for supporting most use cases, which we would need to parse and understand. It probably wouldn’t have better performance or readability than a regular expression, and it would surely be less documented. But of course it would be great if it were possible to highlight all URLs without writing a regex.

As with other tools, too much flexibility has a cost. And as always, we need to choose the right tool for the job. Applying a tool blindly rarely does a good job.

Searching inside an editor and searching and replacing text are examples where regexes are commonly used. But even in those cases, they might not always be the best tool. During refactoring tasks, for example when renaming a variable, a naive regex could find too many references to replace. We need, for example, to pay attention that we are not going to rename variables in other scopes. Nonetheless, a regex will probably do a good job, and we can still fix the edge cases manually.

Conclusion

Do not use regexes for input validation. A custom algorithm will be more maintainable and readable in the long term. It will also be easier to test and to port between different projects, and you can analyze its complexity, both in space and time.

Regexes are a great tool for end users and for simple daily tasks, such as searching or replacing text. Regexes can also be very useful in configuration files, when we do not need a 100% correct solution, as an alternative to another DSL, or as an option for the end user to extend some programs.