Regular expressions (often shortened to “regexes”) are powerful tools that can be very useful, but also make it easy to shoot yourself in the foot. Most readers will probably be familiar with the following quote:
Some people, when confronted with a problem, think “I know, I’ll use regular expressions”. Now they have two problems.
While regular expressions aren’t pure evil, a quick search on GitHub shows more than 227,000 issues that are related to the use of regular expressions. This provides at least some evidence that it’s hard to use regular expressions correctly.
The goal of this study is to understand what type of regex-related issues developers address in pull requests and what fixes for such issues typically look like.
The researchers found 356 regex-related bugs in 350 pull requests across 195 different GitHub repositories in the Apache, Mozilla, Google, and Facebook organisations.
All bugs had one of three possible root causes:
An issue in the regular expression itself (61.2%), e.g. incorrect behaviour, compilation errors, or code smells. Most bugs in this category are caused by regular expressions that reject valid strings or accept invalid strings. In some cases simpler solutions exist and a regular expression is not needed. Suboptimally composed regular expressions may also lead to performance issues.
Incorrect use of a regex API (9.3%), e.g. methods like RegExp.exec, which are stateful and may show unexpected behaviour when executed twice.
A bug in other code that was not caused by regular expressions (29.5%). This often means that regular expressions are part of the solution, not the problem.
The table below lists the three root causes, along with all their subcategories.
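| Root cause | Manifestation | Category (and sub-category) | Count (%) in (sub)category | Count (%) in manifestation | Count (%) in root cause |
|---|---|---|---|---|---|
| Regex | Incorrect behaviour | Rejecting valid strings | 102 (61.8%) | 165 (75.7%) | 218 (61.2%) |
| Regex | Incorrect behaviour | Accepting invalid strings | 36 (21.8%) | | |
| Regex | Incorrect behaviour | Rejecting valid and accepting invalid | 17 (10.3%) | | |
| Regex | Incorrect behaviour | Incorrect extraction | 9 (5.5%) | | |
| Regex | Compile error | | | 8 (3.6%) | |
| Regex | Bad smells | Design smells: Unnecessary regex | 11 (24.4%) | 45 (20.6%) | |
| Regex | Bad smells | Code smells: Performance issues | 10 (22.2%) | | |
| Regex | Bad smells | Code smells: Regex representation | 10 (22.2%) | | |
| Regex | Bad smells | Code smells: Unused/duplicated regex | 8 (17.8%) | | |
| Regex API | Incorrect computation | | | 6 (22.2%) | 33 (9.3%) |
| Regex API | Bad smells | Design smells: Alternative regex API | 2 (7.4%) | 27 (81.8%) | |
| Regex API | Bad smells | Code smells: Unnecessary computation | 9 (33.3%) | | |
| Regex API | Bad smells | Code smells: Exception handling | 8 (29.6%) | | |
| Regex API | Bad smells | Code smells: Deprecated APIs | 5 (18.5%) | | |
| Other code | New feature | Data processing | 22 (37.3%) | 59 (56.2%) | 105 (29.5%) |
| Other code | New feature | Regex-like implementation | 19 (32.2%) | | |
| Other code | New feature | Regex configuration entry | 18 (30.5%) | | |
| Other code | Bad smells | | | 19 (18.1%) | |
| Other code | Other failures | | | 27 (25.7%) | |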
Regex-related pull requests are significantly different from normal pull requests: they take longer to get merged and involve more lines of code, possibly because functionality with regular expressions is harder to test.
There are four types of regex-related changes in such pull requests: regex additions, regex edits, regex removals, and modifications of regex APIs (method calls). Regex edits are by far the most common type of change across all root causes and manifestations.
Edits are especially common when the regular expression itself is the root cause of a bug, unless the problem is that the usage of a regular expression is a design smell; in such cases it is removed. Issues due to incorrect API usage are resolved by changing how regex APIs are used, while problems caused by other code are often solved by adding regular expressions.
Many of the bug fixes tend to look the same. The researchers identified 10 bug fix patterns, which are listed here in descending order of occurrence.
Correctly escaping regex literals
Regular expressions are typically defined using strings. This may cause issues when certain meta-characters, like ., are not properly escaped.
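To illustrate the escaping problem, here is a minimal Python sketch. The file name and variable names are hypothetical and not taken from the study; the point is only that an unescaped dot matches any character.

```python
import re

# Hypothetical file-name check: the unescaped "." matches any character.
unescaped = re.compile("report.txt")
# re.escape() escapes the meta-characters, giving the pattern report\.txt
escaped = re.compile(re.escape("report.txt"))

unescaped_hit = unescaped.fullmatch("reportXtxt") is not None  # accepted by mistake
escaped_hit = escaped.fullmatch("reportXtxt") is not None      # rejected
escaped_ok = escaped.fullmatch("report.txt") is not None       # still accepted
```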
Extend or shrink the character class
This pattern fixes regular expressions that match too few or too many characters.
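A small Python sketch of what extending a character class might look like. The identifier format below is an assumption made up for illustration.

```python
import re

# Hypothetical identifier check: the original class forgets digits and underscores.
too_narrow = re.compile(r"[A-Za-z]+")
# The fix extends the character class so that it covers the missing characters.
extended = re.compile(r"[A-Za-z0-9_]+")

name = "user_42"
narrow_ok = too_narrow.fullmatch(name) is not None    # valid name is rejected
extended_ok = extended.fullmatch(name) is not None    # valid name is accepted
```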
Replace regex with string methods
Some regular expressions can be replaced with simpler string-based methods.
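For instance, a suffix check does not need a regex at all. The following Python sketch uses a made-up path; for this input both variants give the same answer, but the string method is easier to read and has no meta-characters to escape.

```python
import re

path = "src/main.py"  # hypothetical input

# Regex version: works, but the dot must be escaped and the anchor remembered.
regex_result = re.search(r"\.py$", path) is not None

# String-method version: same result for this input, with less room for mistakes.
string_result = path.endswith(".py")
```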
Replace regex with existing parser
Some types of strings have very specific syntaxes. Examples include email addresses, IP addresses and URLs. For such strings there are often dedicated parsers that are much easier to use and more likely to work correctly.
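As a sketch of this pattern in Python, the standard library's ipaddress module can stand in for a hand-written IPv4 regex. The naive pattern and the helper name is_ipv4 are assumptions for illustration only.

```python
import ipaddress
import re

# A naive regex that looks plausible but accepts out-of-range octets.
naive_ipv4 = re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}")

def is_ipv4(text: str) -> bool:
    """Delegate validation to the dedicated parser instead of a regex."""
    try:
        ipaddress.IPv4Address(text)
    except ValueError:
        return False
    return True

naive_accepts_bad = naive_ipv4.fullmatch("999.1.1.1") is not None
parser_accepts_bad = is_ipv4("999.1.1.1")
parser_accepts_good = is_ipv4("192.168.0.1")
```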
Add or remove a regex alternation
This pattern is used to fix regular expressions that check for too few or too many different string-like values.
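A minimal Python sketch of adding an alternation, assuming a hypothetical URL-scheme check that initially forgets one scheme:

```python
import re

# Hypothetical scheme check: the original alternation misses "ftp".
before = re.compile(r"^(?:http|https)://")
# The fix adds one more alternative to the group.
after = re.compile(r"^(?:http|https|ftp)://")

url = "ftp://example.org/file"
before_ok = before.match(url) is not None   # valid URL is rejected
after_ok = after.match(url) is not None     # valid URL is accepted
```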
Add or remove a regex in the regex list
This pattern is kind of similar to the previous one, except here the solution involves adding or removing entire regular expressions.
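In code, this often shows up as a list of patterns that is checked one by one. The ignore list and the helper is_ignored in this Python sketch are invented for illustration.

```python
import re

# Hypothetical ignore list: each entry is a separate regular expression.
ignore_patterns = [re.compile(r"\.pyc$"), re.compile(r"^build/")]

def is_ignored(path, patterns):
    """Return True if any pattern in the list matches the path."""
    return any(p.search(path) for p in patterns)

before = is_ignored("notes.txt~", ignore_patterns)

# The fix adds a whole new regex to the list rather than editing an existing one.
fixed_patterns = ignore_patterns + [re.compile(r"~$")]
after = is_ignored("notes.txt~", fixed_patterns)
```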
Correct the type of regex representation
In Python, regular expressions defined in string literals can be prefixed with a marker such as r (for raw strings), which changes how they are parsed.
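The effect of a raw-string prefix in Python can be sketched as follows. The sample text is made up; the behaviour of \b in plain and raw string literals is standard Python.

```python
import re

text = "a word boundary"

# In a plain string literal "\b" is a backspace character, not a word boundary,
# so the pattern never matches ordinary text.
plain = re.search("\bword\b", text)

# With the r prefix the backslashes reach the regex engine unchanged.
raw = re.search(r"\bword\b", text)
```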
Checking null values for regex execution
Regular expressions don’t work on null values, so you’ll either have to make sure that no null values are ever passed to regex methods or be prepared to handle exceptions whenever they occur.
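In Python the equivalent of a null value is None, and passing it to a regex method raises a TypeError. A minimal sketch of a guard, with a hypothetical helper major_version:

```python
import re
from typing import Optional

VERSION = re.compile(r"(\d+)\.(\d+)")

def major_version(text: Optional[str]) -> Optional[int]:
    """Return the major version number, or None if it cannot be determined."""
    if text is None:          # guard: VERSION.search(None) would raise TypeError
        return None
    match = VERSION.search(text)
    # search() itself returns None when nothing matches, so check that as well.
    return int(match.group(1)) if match else None
```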
Regex static compilation
By compiling regular expressions statically, they become shareable and don’t have to be compiled more than once.
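A Python approximation of static compilation is a module-level constant. The names WORD and count_words are assumptions for illustration. Python's re module also caches recently used patterns internally, so the gain may be smaller there than in languages without such a cache.

```python
import re

# Compiled once at import time and shared by every call below.
WORD = re.compile(r"\w+")

def count_words(line: str) -> int:
    """Count word-like tokens using the shared, precompiled pattern."""
    return len(WORD.findall(line))

total = sum(count_words(line) for line in ["static compilation helps", "a lot"])
```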
Conditional checking before regex execution
Sometimes you only need a regular expression in specific situations. If so, you can conditionally check input strings before executing a more expensive regex method.
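A sketch of such a pre-check in Python, assuming a deliberately simplified, hypothetical email-like pattern. The cheap substring test skips the regex entirely for lines that cannot match.

```python
import re

# Simplified pattern for illustration only; not a full email validator.
EMAIL_LIKE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

def find_emails(line: str) -> list:
    """Run the regex only when the line can possibly contain a match."""
    if "@" not in line:       # cheap check before the more expensive regex
        return []
    return EMAIL_LIKE.findall(line)
```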
Most regex-related issues are caused by the regex itself or by incorrect usage of regex APIs, or are solved by introducing regexes
Regex-related pull requests take longer to get merged and involve more lines of code
Most regex-related bugs can be fixed using one of 10 common bug fix patterns