What do regular expression bugs look like?
Regular expressions (often shortened to “regexes”) are powerful tools that can be very useful, but also make it easy to shoot yourself in the foot. Most readers will probably be familiar with the following quote:
Some people, when confronted with a problem, think “I know, I’ll use regular expressions”. Now they have two problems.
While regular expressions aren’t pure evil, a quick search on GitHub shows more than 227,000 issues that are related to the use of regular expressions. This provides at least some evidence that it’s hard to use regular expressions correctly.
The goal of this study is to understand what type of regex-related issues developers address in pull requests and what fixes for such issues typically look like.
Because mining pull requests from every single repository in existence is infeasible, the researchers focus on active repositories from four large GitHub organisations: Apache, Mozilla, Google, and Facebook. They also limit their scope to repositories that have Java, JavaScript, or Python as their primary language, as these are the three most popular languages on GitHub.
The researchers found 356 regex-related bugs in 350 pull requests across 195 different GitHub repositories in the Apache, Mozilla, Google, and Facebook organisations.
All bugs had one of three possible root causes:
- An issue in the regular expression itself (61.2%), e.g. incorrect behaviour, compilation errors, or code smells. Most bugs in this category are caused by regular expressions that reject valid strings or accept invalid strings. In some cases simpler solutions exist and a regular expression is not needed. Suboptimally composed regular expressions may also lead to performance issues.
- A bug due to suboptimal or incorrect usage of a regex API (9.3%), e.g. deprecated APIs, wrong flags, lack of input/output validation, or needless evaluation of regular expressions. The weirdest examples in this category are JavaScript’s `RegExp.test` and `RegExp.exec`, which are stateful when the regex uses the global or sticky flag and may show unexpected behaviour when executed twice.
- A bug in other code that was not caused by regular expressions (29.5%). This often means that regular expressions are part of the solution, not the problem.
The table below lists the three root causes, along with all their subcategories.
| Root cause | Manifestation | Category | Sub-category | Count (%) in (sub)category | Count (%) in manifestation | Count (%) in root cause |
|---|---|---|---|---|---|---|
| Regex | Incorrect behaviour | Rejecting valid strings | | 102 (61.8%) | 165 (75.7%) | 218 (61.2%) |
| | | Accepting invalid strings | | 36 (21.8%) | | |
| | | Rejecting valid and accepting invalid | | 17 (10.3%) | | |
| | | Incorrect extraction | | 9 (5.5%) | | |
| | | Unknown | | 1 (0.6%) | | |
| | Compile error | | | | 8 (3.6%) | |
| | Bad smells | Design smells | Unnecessary regex | 11 (24.4%) | 45 (20.6%) | |
| | | | Other | 6 (13.3%) | | |
| | | Code smells | Performance issues | 10 (22.2%) | | |
| | | | Regex representation | 10 (22.2%) | | |
| | | | Unused/duplicated regex | 8 (17.8%) | | |
| Regex API | Incorrect computation | | | | 6 (22.2%) | 33 (9.3%) |
| | Bad smells | Design smells | Alternative regex API | 2 (7.4%) | 27 (81.8%) | |
| | | Code smells | Unnecessary computation | 9 (33.3%) | | |
| | | | Exception handling | 8 (29.6%) | | |
| | | | Deprecated APIs | 5 (18.5%) | | |
| | | | Performance/security | 3 (11.1%) | | |
| Other code | New feature | Data processing | | 22 (37.3%) | 59 (56.2%) | 105 (29.5%) |
| | | Regex-like implementation | | 19 (32.2%) | | |
| | | Regex configuration entry | | 18 (30.5%) | | |
| | Bad smells | | | | 19 (18.1%) | |
| | Other failures | | | | 27 (25.7%) | |
Regex-related pull requests are significantly different from normal pull requests: they take a longer time to get merged and involve more lines of code, possibly because functionality with regular expressions is harder to test.
There are four types of regex-related changes in such pull requests: regex additions, regex edits, regex removals, and modifications of regex APIs (method calls). Regex edits are by far the most common type of change across all root causes and manifestations.
Edits are especially common when the regular expression itself is the root cause of a bug, unless the problem is that the usage of a regular expression is a design smell; in such cases it is removed. Issues due to incorrect API usage are resolved by changing how regex APIs are used, while problems caused by other code are often solved by adding regular expressions.
Many of the bug fixes tend to look the same. The researchers identified 10 bug fix patterns, which are listed here in descending order of occurrence.
Correctly escaping regex literals
Regular expressions are typically defined using strings. This may cause issues when certain meta-characters, like `\` and `.`, are not properly escaped.
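As an illustration of this pattern (not taken from the study), here is a minimal Python sketch. The version string, the `host` value, and all variable names are hypothetical.

```python
import re

# Before: the unescaped "." is a wildcard, so unintended strings are accepted.
version_before = re.compile("^1.5$")

# After: the dot is escaped, and the raw string keeps the backslash intact.
version_after = re.compile(r"^1\.5$")

# For literals that are only known at runtime, re.escape does the escaping.
host = "example.com"  # hypothetical value
host_pattern = re.compile("^" + re.escape(host) + "$")

print(bool(version_before.match("1x5")))        # True: the bug
print(bool(version_after.match("1x5")))         # False
print(bool(host_pattern.match("exampleXcom")))  # False
```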
Extend or shrink the character class
This pattern fixes regular expressions that match too few or too many characters.
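A hypothetical sketch of both directions in Python: one class that is too narrow and gets extended, and one that is too wide and gets shrunk. The patterns and names are illustrative assumptions, not examples from the study.

```python
import re

# Before (too narrow): digits and underscores are rejected.
name_before = re.compile(r"[a-z]+")
# After (extended class): the class now covers the missing characters.
name_after = re.compile(r"[a-z0-9_]+")

# Before (too wide): the range A-z also spans the ASCII characters [ \ ] ^ _ `
letters_before = re.compile(r"[A-z]+")
# After (shrunk class): two explicit ranges cover letters only.
letters_after = re.compile(r"[A-Za-z]+")

print(bool(name_before.fullmatch("user_42")))   # False
print(bool(name_after.fullmatch("user_42")))    # True
print(bool(letters_before.fullmatch("a^b")))    # True
print(bool(letters_after.fullmatch("a^b")))     # False
```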
Replace regex with string methods
Some regular expressions can be replaced with simpler string-based methods.
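A minimal sketch of what such a replacement might look like in Python. The function names are made up for illustration; `str.startswith` and `str.replace` are standard string methods.

```python
import re

def is_https_before(url):
    # A regex is used for what is really a fixed-prefix check.
    return re.match(r"^https://", url) is not None

def is_https_after(url):
    # The string method states the intent directly.
    return url.startswith("https://")

def to_semicolons_before(text):
    # A regex substitution with a pattern that contains no regex features.
    return re.sub(",", ";", text)

def to_semicolons_after(text):
    return text.replace(",", ";")
```

Both versions behave the same on the inputs shown in the tests; the string methods simply avoid the regex engine and the escaping pitfalls that come with it.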
Replace regex with existing parser
Some types of strings have very specific syntaxes. Examples include email addresses, IP addresses and URLs. For such strings there are often dedicated parsers that are much easier to use and more likely to work correctly.
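To make this concrete, the sketch below contrasts a hand-written IPv4 regex with Python's standard `ipaddress` module. The regex and the function names are hypothetical and only serve as an example of the kind of code this pattern replaces.

```python
import ipaddress
import re

# Before: a hand-written pattern that only checks the shape of the string.
IPV4_REGEX = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")

def is_ip_before(text):
    return IPV4_REGEX.match(text) is not None

def is_ip_after(text):
    # After: delegate to the standard library parser, which also
    # validates value ranges and understands IPv6.
    try:
        ipaddress.ip_address(text)
        return True
    except ValueError:
        return False

print(is_ip_before("999.1.1.1"))  # True: accepted although 999 is out of range
print(is_ip_after("999.1.1.1"))   # False
```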
Add or remove a regex alternation
This pattern is used to fix regular expressions that check for too few or too many different string-like values.
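A small hypothetical example in Python, assuming a check on image file extensions in which one valid alternative was forgotten:

```python
import re

# Before: "jpeg" is missing from the alternation, so valid files are rejected.
ext_before = re.compile(r"\.(jpg|png)$")

# After: the missing alternative is added.
ext_after = re.compile(r"\.(jpg|jpeg|png)$")

print(bool(ext_before.search("photo.jpeg")))  # False
print(bool(ext_after.search("photo.jpeg")))   # True
```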
Add or remove a regex to the regex list
This pattern is kind of similar to the previous one, except here the solution involves adding or removing entire regular expressions.
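One way this could look in Python, assuming a hypothetical list of ignore patterns that is checked against file paths:

```python
import re

# Before: the list of ignore patterns misses the node_modules directory.
IGNORE_BEFORE = [
    re.compile(r"\.pyc$"),
    re.compile(r"^\.git/"),
]

# After: a whole new regex is appended instead of editing an existing one.
IGNORE_AFTER = IGNORE_BEFORE + [re.compile(r"^node_modules/")]

def is_ignored(path, patterns):
    # A path is ignored as soon as any pattern in the list matches.
    return any(pattern.search(path) for pattern in patterns)

print(is_ignored("node_modules/left-pad/index.js", IGNORE_BEFORE))  # False
print(is_ignored("node_modules/left-pad/index.js", IGNORE_AFTER))   # True
```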
Correct the type of regex representation
In Python, regular expressions defined in string literals can be prefixed with `r`, which changes how they are parsed.
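The difference is easiest to see with `\b`: in a normal Python string literal it is the backspace character, whereas in a raw string the backslash reaches the regex engine and means "word boundary". The pattern and variable names below are illustrative assumptions.

```python
import re

# Before: "\b" is interpreted by Python as a backspace character (0x08),
# so the regex looks for backspace + "cat" + backspace.
word_before = re.compile("\bcat\b")

# After: the raw string passes the backslashes through to the regex engine,
# where \b is a word boundary.
word_after = re.compile(r"\bcat\b")

print(bool(word_before.search("a cat here")))  # False
print(bool(word_after.search("a cat here")))   # True
```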
Checking null values for regex execution
Regular expressions don’t work on null values, so you’ll either have to make sure that no null values are ever passed to regex methods or be prepared to handle exceptions whenever they occur.
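In Python the equivalent of a null value is `None`, and passing it to a regex method raises a `TypeError`. The sketch below shows a guard for that case; the `YEAR` pattern and the function names are hypothetical, and the extra check for a missing match is an addition of this sketch.

```python
import re

YEAR = re.compile(r"\b\d{4}\b")

def extract_year_before(text):
    # Raises TypeError when text is None,
    # and AttributeError when there is no match.
    return YEAR.search(text).group(0)

def extract_year_after(text):
    # Guard against a missing input before running the regex.
    if text is None:
        return None
    match = YEAR.search(text)
    # Also guard against a missing match before reading the group.
    return match.group(0) if match else None

print(extract_year_after("Released in 2019"))  # 2019
print(extract_year_after(None))                # None
```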
Regex static compilation
Regular expressions that are compiled statically become shareable and don’t have to be compiled more than once.
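In Python, the closest analogue is probably a module-level compiled pattern. The sketch below uses hypothetical function names. Note that Python's `re` module keeps an internal cache of recently compiled patterns, so the gain may be smaller here than in a language like Java.

```python
import re

def count_words_before(lines):
    total = 0
    for line in lines:
        # The pattern is requested again on every iteration.
        word = re.compile(r"\w+")
        total += len(word.findall(line))
    return total

# After: compiled once at import time and shared by all callers.
WORD = re.compile(r"\w+")

def count_words_after(lines):
    return sum(len(WORD.findall(line)) for line in lines)

print(count_words_before(["a b", "c"]))  # 3
print(count_words_after(["a b", "c"]))   # 3
```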
Conditional checking before regex execution
Sometimes you only need a regular expression in specific situations. If so, you can conditionally check input strings before executing a more expensive regex method.
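A minimal sketch of such a guard in Python, assuming a hypothetical function that extracts `@mentions` from text. The cheap substring test skips the regex entirely for inputs that cannot match.

```python
import re

MENTION = re.compile(r"@([A-Za-z0-9_]+)")

def find_mentions_before(text):
    # The regex runs on every input, even when it cannot match.
    return MENTION.findall(text)

def find_mentions_after(text):
    # Cheap check first: without an "@" there can be no mention.
    if "@" not in text:
        return []
    return MENTION.findall(text)

print(find_mentions_after("hi @alice and @bob"))  # ['alice', 'bob']
print(find_mentions_after("no mentions"))         # []
```

Whether the guard pays off depends on how expensive the regex is and how often the cheap check fails, so it is worth measuring before applying it.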
- Most regex-related issues are caused by the regex itself or by incorrect usage of regex APIs, or are solved by introducing regexes
- Regex-related pull requests take longer to get merged and involve more lines of code
- Most regex-related bugs can be fixed using one of 10 common bug fix patterns