What it would take to use mutation testing in industry – A study at Facebook

Published: 15 Aug 2021
Written by: Chun Fei Lung

Mutation testing is useful, but not many developers use it. What steps can we take to increase adoption?


Mutation testing is a way to determine the quality of your test suite. It works by generating a large number of changed versions of the code, which are called mutants. Examples of changes include deletions of method calls, disabling if conditions, and replacing magic constants.

If the test suite is good enough, it should be able to “kill” these mutants by having at least one previously succeeding test fail.
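To make this concrete, here is a minimal sketch in Python. The function `is_adult`, its mutant, and the two tiny test suites are made up for illustration and are not taken from the paper.

```python
def is_adult(age):
    """Original code under test."""
    return age >= 18


def is_adult_mutant(age):
    """Mutant: the comparison operator `>=` was replaced by `>`."""
    return age > 18


def weak_suite(fn):
    """Passes for both versions, so the mutant survives."""
    return fn(30) is True and fn(5) is False


def strong_suite(fn):
    """Also checks the boundary value, which the mutant gets wrong."""
    return weak_suite(fn) and fn(18) is True


def kills(suite, original, mutant):
    """A suite kills a mutant if it passes on the original but fails on the mutant."""
    return suite(original) and not suite(mutant)


print(kills(weak_suite, is_adult, is_adult_mutant))    # False
print(kills(strong_suite, is_adult, is_adult_mutant))  # True
```

The weak suite never probes the boundary at 18, so the changed operator goes unnoticed. A surviving mutant like this one points at a gap in the tests.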

About the article

Title	What it would take to use mutation testing in industry – A study at Facebook
Year	2021
Author(s)	Moritz Beller (Facebook), Chu-Pan Wong (Carnegie Mellon University), Johannes Bader (Jane Street Capital), Andrew Scott (Facebook), Mateusz Machalica (Facebook), Satish Chandra (Facebook), Erik Meijer (Facebook)
Venue	International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP)

Why it matters

The result of mutation testing is a so-called mutation score (side note: The ratio of mutants that a test suite manages to kill). Many researchers and developers argue that mutation scores are superior to traditional code coverage, as they are actually based on a program’s behaviour.
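The difference with code coverage can be sketched in a few lines of Python. Everything here (`apply_discount`, its mutant, and the two suites) is a hypothetical example: one suite executes every line without asserting anything, while the other actually checks the results.

```python
def apply_discount(price, is_member):
    if is_member:
        return price * 9 // 10
    return price


def apply_discount_mutant(price, is_member):
    # Mutant: the `if` condition is disabled, so members pay full price.
    return price


def suite_runs_only(fn):
    # Executes both branches (full line coverage) but asserts nothing.
    fn(100, True)
    fn(100, False)


def suite_checks_result(fn):
    assert fn(100, True) == 90
    assert fn(100, False) == 100


def mutation_score(suite, original, mutants):
    """Ratio of mutants on which a suite that passes on the original fails."""
    suite(original)  # the suite must pass on the unmodified code
    killed = 0
    for mutant in mutants:
        try:
            suite(mutant)
        except AssertionError:
            killed += 1
    return killed / len(mutants)


mutants = [apply_discount_mutant]
print(mutation_score(suite_runs_only, apply_discount, mutants))      # 0.0
print(mutation_score(suite_checks_result, apply_discount, mutants))  # 1.0
```

Both suites would report the same line coverage for `apply_discount`, but only the second one notices that the behaviour has changed.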

But mutation testing is not a silver bullet:

Mutants can be generated in many different ways, so the number of mutants that have to be built and tested quickly becomes too large for anything but the smallest code bases.
It is also not clear to developers what they can do to improve the mutation score, and whether an improved score actually has any practical benefits (other than better-looking metrics).

Is there anything that we can do about this?

How the study was conducted

The authors of the paper built a tool that they call Mutation Monkey. It comes with two pipelines, a training and an application pipeline.

Mutation testing is often very costly – not only because generating all the different mutants takes a lot of time and processing power, but also because many of the generated mutants are easily killed (or not even syntactically valid) and thus useless.

The training pipeline solves this problem by semi-automatically learning bug-inducing patterns from three sources:

Defects4J, a collection of bugs extracted from popular OSS Java projects;
An internal database of fixes for crashes that happened in the production version of the Facebook app. By “reversing” these fixes it becomes possible to reintroduce crashes;
Commits with modifications that made an originally failing test pass.

This process is only partially automated, because experts are still needed to decide which and how many patterns to implement, and for the creation of patch-like templates that implement the patterns.
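The idea of “reversing” a fix can be sketched as follows. This is not how Mutation Monkey is implemented; the `Fix` class, the `reverse_fix` helper, and the `display_name` snippet are hypothetical, and a real template would match code structure rather than exact text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fix:
    before: str  # the code that crashed
    after: str   # the code after the fix


def reverse_fix(source, fix):
    """Reintroduce the bug by swapping the fixed snippet back to the crashing one.

    Returns None when the fixed snippet does not occur in the source.
    """
    if fix.after not in source:
        return None
    return source.replace(fix.after, fix.before, 1)


# Hypothetical fix that added a missing None check.
fix = Fix(
    before="return user.name",
    after="return user.name if user is not None else ''",
)

source = (
    "def display_name(user):\n"
    "    return user.name if user is not None else ''\n"
)

mutant = reverse_fix(source, fix)
print(mutant)
```

The resulting mutant dereferences `user` without a guard again, which is exactly the kind of crash the original fix was meant to prevent.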

The application pipeline applies the mutation templates to the production version of the code. To reduce the number of mutants that have to be generated (side note: Remember, building and testing is expensive!), the pipeline tries to avoid “unprofitable” spots, like logging calls, and runs a light-weight syntax checker to catch syntactically invalid mutants.
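A rough sketch of these two filters is shown below. Mutation Monkey targets production code at Facebook; Python's own `ast` module is used here only as a stand-in syntax checker, and the logging heuristic in `is_unprofitable` is an assumption made for the example.

```python
import ast


def is_unprofitable(line):
    """Assumed heuristic: skip lines that only make a logging call."""
    return line.strip().startswith(("logging.", "logger.", "log."))


def candidate_lines(source):
    """Indices of non-empty lines that are worth mutating."""
    return [
        index
        for index, line in enumerate(source.splitlines())
        if line.strip() and not is_unprofitable(line)
    ]


def is_syntactically_valid(source):
    """Cheap check that avoids building and testing a broken mutant."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


SOURCE = """\
def area(w, h):
    logger.info("computing area")
    return w * h
"""

mutants = [
    SOURCE.replace("w * h", "w + h"),  # valid mutant
    SOURCE.replace("w * h", "w * "),   # broken mutant, dropped by the checker
]

print(candidate_lines(SOURCE))                             # [0, 2]
print(sum(is_syntactically_valid(m) for m in mutants))     # 1
```

Both checks only look at the text of the code, so they cost far less than a build and test run.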

The remaining mutants are submitted to the code review system outside of peak (office) hours (side note: This makes scaling easier and is cheaper). Mutants that pass the test suite, and thus survive, are then presented to developers. The pipeline also tells developers which tests visited the mutated block of code. This information should make it easier for developers to decide what they want to do.
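The reporting step could look roughly like the sketch below. The `MutantResult` record, the coverage mapping, and all file and test names are hypothetical; the paper does not prescribe this data model.

```python
from dataclasses import dataclass, field


@dataclass
class MutantResult:
    mutant_id: str
    file: str
    line: int
    failed_tests: list = field(default_factory=list)


def surviving_mutant_report(results, coverage):
    """List surviving mutants together with the tests that visited the mutated line.

    `coverage` maps (file, line) to the names of tests that executed that line.
    """
    report = []
    for result in results:
        if result.failed_tests:
            continue  # at least one test failed, so the mutant was killed
        visiting = sorted(coverage.get((result.file, result.line), []))
        report.append((result.mutant_id, visiting))
    return report


results = [
    MutantResult("m1", "Cart.java", 10, ["test_total"]),
    MutantResult("m2", "Cart.java", 22),
    MutantResult("m3", "User.java", 5),
]
coverage = {("Cart.java", 22): {"test_discount", "test_checkout"}}

report = surviving_mutant_report(results, coverage)
print(report)
# [('m2', ['test_checkout', 'test_discount']), ('m3', [])]
```

A survivor such as `m2` is visited by tests that do not check enough, while `m3` is not visited by any test at all. These two cases call for different follow-up actions.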

What discoveries were made

Kill rates were fairly similar across the various mutation patterns. However, some patterns could be applied successfully a lot more often than others. For instance, the NULL_DEREFERENCE pattern was applied almost 2,000 times, while REMOVED_SYNCHRONIZED mutations only occurred 143 times within the same period of time.

Interestingly, REMOVED_SYNCHRONIZED is also the only pattern with a much higher kill rate, which suggests that developers are aware that synchronisation-related bugs are hard to debug and thus spend more time writing tests for them.

The researchers also conducted interviews with 29 developers to learn more about the effectiveness of Mutation Monkey’s approach.

Most – if not all – developers had not heard of mutation testing (side note: The concept has been around since the 70s, but it’s not exactly popular.) prior to the experiment, and needed more information than what was provided by Mutation Monkey.

However, after an explanation from the researchers, about 85% believed that Mutation Monkey is a useful tool that could help them write (better) tests. Virtually everyone was also positive about the test coverage information that was included with the test reports.

Still, fewer than half of the developers confirmed that they would write a test for the gap that Mutation Monkey had found. When asked why not, developers often gave the following reasons:

they want Mutation Monkey to come up with a test;
the mutated code was of minor importance;
the mutated code was about to be deprecated;
the code was still new and likely to undergo iteration before stabilising; and
the mutated code is in a badly tested part of the code base (side note: ?????).

In other words, this new approach seems to be better than existing approaches, but still yields too many false positives.

Summary

Mutation testing can help you with your test efforts, but also gets very expensive very easily.
The costs of mutation testing can be lowered by focussing on mutants that are likely to introduce bugs.
Mutation testing can be done more effectively when developers are informed about what it is and which tests should be improved.