Chuniversiteit.nl
The Toilet Paper

What do regular expression bugs look like?

This study gives us a better understanding of the practical problems faced by developers when using regular expressions.

Two border agents examine a little girl’s backpack that has some Arabic text on it.
Regular expressions are kind of like Arabic, in the sense that they make some people very uncomfortable for the wrong reasons.

Regular expressions (often shortened to “regexes”) are powerful tools that can be very useful, but also make it easy to shoot yourself in the foot. Most readers will probably be familiar with the following quote:

Some people, when confronted with a problem, think “I know, I’ll use regular expressions”. Now they have two problems.

While regular expressions aren’t pure evil, a quick search on GitHub shows more than 227,000 issues that are related to the use of regular expressions. This provides at least some evidence that it’s hard to use regular expressions correctly.

The goal of this study is to understand what type of regex-related issues developers address in pull requests and what fixes for such issues typically look like.

Because mining pull requests from every single repository in existence is infeasible, the researchers focus on active repositories from four large GitHub organisations: Apache, Mozilla, Google, and Facebook. They also limit their scope to repositories that have Java, JavaScript, or Python as their primary language, as these are the three most popular languages on GitHub.

The researchers found 356 regex-related bugs in 350 pull requests across 195 different GitHub repositories in the Apache, Mozilla, Google, and Facebook organisations.

All bugs had one of three possible root causes:

  • An issue in the regular expression itself (61.2%), e.g. incorrect behaviour, compilation errors, or code smells. Most bugs in this category are caused by regular expressions that reject valid strings or accept invalid strings. In some cases simpler solutions exist and a regular expression is not needed. Suboptimally composed regular expressions may also lead to performance issues.

  • A bug due to suboptimal or incorrect usage of a regex API (9.3%), e.g. deprecated APIs, wrong flags, lack of input/output validation, or needless evaluation of regular expressions. The weirdest examples in this category are JavaScript’s RegExp.test and RegExp.exec, which are stateful when the regex uses the global or sticky flag (they update lastIndex) and may therefore return different results when executed twice on the same input.

  • A bug in other code that was not caused by regular expressions (29.5%). This often means that regular expressions are part of the solution, not the problem.

The table below lists the three root causes, along with all their subcategories.

Root cause  Manifestation          Category (and sub-category)            Count (%) in     Count (%) in     Count (%) in
                                                                          (sub)category    manifestation    root cause
Regex       Incorrect behaviour    Rejecting valid strings                102 (61.8%)      165 (75.7%)      218 (61.2%)
                                   Accepting invalid strings              36 (21.8%)
                                   Rejecting valid and accepting invalid  17 (10.3%)
                                   Incorrect extraction                   9 (5.5%)
                                   Unknown                                1 (0.6%)
            Compile error                                                                  8 (3.6%)
            Bad smells             Design smells: Unnecessary regex       11 (24.4%)       45 (20.6%)
                                   Design smells: Other                   6 (13.3%)
                                   Code smells: Performance issues        10 (22.2%)
                                   Code smells: Regex representation      10 (22.2%)
                                   Code smells: Unused/duplicated regex   8 (17.8%)
Regex API   Incorrect computation                                                          6 (22.2%)        33 (9.3%)
            Bad smells             Design smells: Alternative regex API   2 (7.4%)         27 (81.8%)
                                   Code smells: Unnecessary computation   9 (33.3%)
                                   Code smells: Exception handling        8 (29.6%)
                                   Code smells: Deprecated APIs           5 (18.5%)
                                   Code smells: Performance/security      3 (11.1%)
Other code  New feature            Data processing                        22 (37.3%)       59 (56.2%)       105 (29.5%)
                                   Regex-like implementation              19 (32.2%)
                                   Regex configuration entry              18 (30.5%)
            Bad smells                                                                     19 (18.1%)
            Other failures                                                                 27 (25.7%)

Regex-related pull requests are significantly different from other pull requests: they take longer to get merged and involve more lines of code, possibly because functionality with regular expressions is harder to test.

There are four types of regex-related changes in such pull requests: regex additions, regex edits, regex removals, and modifications of regex APIs (method calls). Regex edits are by far the most common type of change across all root causes and manifestations.

Edits are especially common when the regular expression itself is the root cause of a bug, unless the problem is that the usage of a regular expression is a design smell; in such cases it is removed. Issues due to incorrect API usage are resolved by changing how regex APIs are used, while problems caused by other code are often solved by adding regular expressions.

Common bug fix patterns

Many of the bug fixes tend to look the same. The researchers identified 10 bug fix patterns, which are listed here in descending order of occurrence.

Correctly escaping regex literals

Regular expressions are typically defined using strings. This may cause issues when certain meta-characters, like \ and ., are not properly escaped.

Before
After
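As an illustration (not the example from the study), here is a minimal Python sketch of this pattern. The patterns, function names and inputs are made up. The unescaped dot silently matches any character, and re.escape is one way to escape text that is meant to be matched literally.

```python
import re

# Hypothetical check: does a file name end in ".txt"?
TXT_BEFORE = re.compile(r".*.txt$")   # unescaped "." matches ANY character
TXT_AFTER = re.compile(r".*\.txt$")   # "\." matches a literal dot only


def is_txt_before(name: str) -> bool:
    return TXT_BEFORE.match(name) is not None


def is_txt_after(name: str) -> bool:
    return TXT_AFTER.match(name) is not None


def contains_literal_before(text: str, literal: str) -> bool:
    # The literal is used as a pattern as-is, so "." acts as a wildcard.
    return re.search(literal, text) is not None


def contains_literal_after(text: str, literal: str) -> bool:
    # re.escape escapes all regex meta-characters in the literal.
    return re.search(re.escape(literal), text) is not None
```

With these definitions, is_txt_before("notes_txt") accepts a name that has no extension at all, while is_txt_after rejects it.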

Extend or shrink the character class

This pattern fixes regular expressions that match too few or too many characters.

Before
After
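A hedged sketch of what such a fix might look like in Python, assuming a hypothetical rule that branch names may contain lowercase letters, digits, underscores and, after the fix, hyphens:

```python
import re

# Hypothetical validation rule for branch names.
NAME_BEFORE = re.compile(r"[a-z0-9_]+")    # too narrow: rejects hyphens
NAME_AFTER = re.compile(r"[a-z0-9_-]+")    # character class extended with "-"


def is_valid_before(name: str) -> bool:
    return NAME_BEFORE.fullmatch(name) is not None


def is_valid_after(name: str) -> bool:
    return NAME_AFTER.fullmatch(name) is not None
```

Putting the hyphen last in the class avoids it being read as a range operator.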

Replace regex with string methods

Some regular expressions can be replaced with simpler string-based methods.

Before
After
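For instance, a prefix check does not need a regex at all. The following Python sketch uses made-up function names and is only meant to show the idea:

```python
import re


def is_https_before(url: str) -> bool:
    # A regex is overkill for a fixed prefix.
    return re.match(r"^https://", url) is not None


def is_https_after(url: str) -> bool:
    # str.startswith expresses the same check without a regex.
    return url.startswith("https://")
```

Both functions give the same answer, but the second one has no meta-characters to get wrong.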

Replace regex with existing parser

Some types of strings have very specific syntaxes. Examples include email addresses, IP addresses and URLs. For such strings there are often dedicated parsers that are much easier to use and more likely to work correctly.

Before
After
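A small Python sketch of this idea for IPv4 addresses, using the standard library's ipaddress module. The loose regex and the function names are hypothetical and chosen to show a typical mistake:

```python
import ipaddress
import re

# Hypothetical "good enough" pattern: four groups of one to three digits.
LOOSE_IPV4 = re.compile(r"\d{1,3}(\.\d{1,3}){3}")


def is_ipv4_before(text: str) -> bool:
    # Accepts out-of-range octets such as 999.
    return LOOSE_IPV4.fullmatch(text) is not None


def is_ipv4_after(text: str) -> bool:
    # Let the dedicated parser decide; it raises ValueError on bad input.
    try:
        ipaddress.IPv4Address(text)
    except ValueError:
        return False
    return True
```

The regex version happily accepts 999.1.1.1, which the parser rejects.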

Add or remove a regex alternation

This pattern is used to fix regular expressions that check for too few or too many alternative string values.

Before
After
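As a rough illustration, assume a hypothetical check for image file extensions that forgot about .jpeg. The fix adds one more alternative to the group:

```python
import re

# Hypothetical extension check for image files.
IMAGE_BEFORE = re.compile(r"\.(jpg|png)$")        # misses ".jpeg"
IMAGE_AFTER = re.compile(r"\.(jpg|jpeg|png)$")    # alternation extended


def is_image_before(filename: str) -> bool:
    return IMAGE_BEFORE.search(filename) is not None


def is_image_after(filename: str) -> bool:
    return IMAGE_AFTER.search(filename) is not None
```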

Add or remove a regex to the regex list

This pattern is kind of similar to the previous one, except here the solution involves adding or removing entire regular expressions.

Before
After
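A possible shape of this fix in Python, assuming a hypothetical list of ignore patterns for file paths. Instead of editing an existing regex, a whole new regex is appended to the list:

```python
import re
from typing import List, Pattern

# Hypothetical ignore list for a file scanner.
IGNORE_BEFORE: List[Pattern[str]] = [
    re.compile(r"\.pyc$"),
    re.compile(r"^\.git/"),
]

# The fix adds a new entry rather than changing the existing ones.
IGNORE_AFTER: List[Pattern[str]] = IGNORE_BEFORE + [
    re.compile(r"(^|/)node_modules/"),
]


def is_ignored(path: str, patterns: List[Pattern[str]]) -> bool:
    # A path is ignored as soon as any pattern in the list matches.
    return any(p.search(path) for p in patterns)
```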

Correct the type of regex representation

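In Python, string literals that contain regular expressions can be prefixed with r to turn them into raw strings. In a raw string, backslashes are passed to the regex engine as-is rather than being interpreted as escape sequences by the string parser.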

Before
After
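The classic case is the word boundary \b, sketched below with made-up names. In a normal Python string literal "\b" is a backspace character, so the regex engine never sees a word boundary:

```python
import re

# In a normal string literal, "\b" is the backspace character (0x08).
CAT_BEFORE = re.compile("\bcat\b")
# In a raw string literal, the regex engine receives \b: a word boundary.
CAT_AFTER = re.compile(r"\bcat\b")


def mentions_cat_before(text: str) -> bool:
    return CAT_BEFORE.search(text) is not None


def mentions_cat_after(text: str) -> bool:
    return CAT_AFTER.search(text) is not None
```

The first version only matches "cat" when it is surrounded by literal backspace characters, which is almost certainly not what was intended.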

Checking null values for regex execution

Regular expressions don’t work on null values, so you’ll either have to make sure that no null values are ever passed to regex methods or be prepared to handle exceptions whenever they occur.

Before
After
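In Python the null value is None, and passing it to a regex method raises a TypeError. A minimal sketch with hypothetical names follows. Checking the match result for None as well goes slightly beyond the pattern described above, but it guards against the closely related case of input that does not match:

```python
import re
from typing import Optional

# Hypothetical pattern for "major.minor" version strings.
VERSION = re.compile(r"(\d+)\.(\d+)")


def major_before(version: Optional[str]) -> int:
    # Raises TypeError when version is None.
    return int(VERSION.match(version).group(1))


def major_after(version: Optional[str]) -> Optional[int]:
    if version is None:          # guard against null input
        return None
    match = VERSION.match(version)
    if match is None:            # guard against non-matching input
        return None
    return int(match.group(1))
```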

Regex static compilation

By compiling regular expressions statically, they become shareable and don’t have to be compiled more than once.

Before
After
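In Python, the closest equivalent is to compile the pattern once at module level instead of inside the function that uses it. The sketch below uses made-up names. Note that Python's re module keeps an internal cache of recently compiled patterns, so the gain is typically smaller in Python than in a language such as Java, where every Pattern.compile call does the work again.

```python
import re


def count_words_before(line: str) -> int:
    # The pattern is (re)compiled or looked up in re's cache on every call.
    word = re.compile(r"\w+")
    return len(word.findall(line))


# Compiled once when the module is loaded and shared by all callers.
WORD = re.compile(r"\w+")


def count_words_after(line: str) -> int:
    return len(WORD.findall(line))
```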

Conditional checking before regex execution

Sometimes you only need a regular expression in specific situations. If so, you can conditionally check input strings before executing a more expensive regex method.

Before
After
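One way this could look in Python, assuming a hypothetical helper that extracts issue references such as #123 from text. A cheap substring check skips the regex entirely for the many inputs that cannot match:

```python
import re
from typing import List

# Hypothetical pattern for issue references like "#123".
ISSUE_REF = re.compile(r"#(\d+)")


def issue_numbers_before(text: str) -> List[int]:
    # Always runs the regex, even when the text has no "#" at all.
    return [int(number) for number in ISSUE_REF.findall(text)]


def issue_numbers_after(text: str) -> List[int]:
    # Cheap pre-check: without a "#" the regex can never match.
    if "#" not in text:
        return []
    return [int(number) for number in ISSUE_REF.findall(text)]
```

Whether the pre-check pays off depends on how expensive the regex is and how often the input fails the check, so it is worth measuring before applying it everywhere.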

Summary

  1. Regex-related issues are caused by the regex itself or by incorrect usage of regex APIs, or are issues in other code that are solved by introducing regexes

  2. Regex-related pull requests take longer to get merged and involve more lines of code

  3. Most regex-related bugs can be fixed using one of 10 common bug fix patterns
