Proactively detect and fix flaky tests in your test suite
“Flaky” tests are non-deterministic tests that may pass or fail, seemingly for no good reason. A lot of research is focussed on so-called order-dependent tests, whose outcome depends on the order in which tests are run. Such tests may fail unexpectedly when they are run manually in a different order or when regression testing techniques, like test parallelisation, are used.
The paper introduces some important terminology before it proceeds with a description of the actual study:
A test is called a victim if it fails when run after another test, called a polluter, in the same test suite, but passes when run before that other test. The victim fails because the tests share some state, which is ruined by the polluter.
Test assertions that depend on shared state are called brittle assertions. Each victim has at least one brittle assertion, but not all tests with a brittle assertion are victims.
A test is called a latent-victim if it has a brittle assertion, but may or may not currently be a victim (e.g. because no other test writes to its shared state).
In the same vein, a latent-polluter is a test that modifies the shared state, but may or may not have a victim in the test suite (e.g. because no other test reads from its shared state).
To reduce the risk that flaky tests fail at inopportune times, experts advocate proactively detecting potentially flaky tests so that they can be fixed. Regrettably, existing approaches have a prohibitively high false-positive rate, which renders them all but useless in practice.
In this study, the authors look at non-idempotent-outcome (NIO) tests. A test is an NIO test if its outcome (pass or fail) changes across repeated runs, due to changes to the state shared among those runs. What makes NIO tests special is that they are simultaneously latent-victims and latent-polluters: this makes it unlikely that reports are false positives, which in turn should make such reports useful to developers.
Minimal examples for each type of flaky test are shown in pseudocode below.
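In the spirit of the paper's pseudocode, here is a minimal Java sketch of the three categories; all names (`FlakyExamples`, `cache`, the `test*` methods) are illustrative, not taken from the paper:

```java
import java.util.ArrayList;
import java.util.List;

public class FlakyExamples {
    // State shared across tests -- the root cause of order dependence.
    static List<String> cache = new ArrayList<>();

    // Polluter: modifies shared state, but asserts nothing on it.
    static void testPolluter() {
        cache.add("stale-entry");
    }

    // Victim: its brittle assertion only holds if the polluter has not run yet.
    static boolean testVictim() {
        return cache.isEmpty(); // brittle assertion on shared state
    }

    // NIO test: pollutes the very state it asserts on, so it passes
    // on its first run and fails on its second.
    static int invocations = 0;
    static boolean testNio() {
        invocations++;           // pollutes shared state...
        return invocations == 1; // ...and asserts on it (brittle)
    }

    public static void main(String[] args) {
        System.out.println("victim before polluter: " + testVictim()); // true
        testPolluter();
        System.out.println("victim after polluter:  " + testVictim()); // false
        System.out.println("NIO first run:  " + testNio());            // true
        System.out.println("NIO second run: " + testNio());            // false
    }
}
```

Note that `testNio` is both a latent-polluter (it writes `invocations`) and a latent-victim (it reads `invocations` in an assertion), which is exactly the dual nature the paper exploits.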
Detection of NIO tests is based on a simple idea: each test is run twice in the same test execution environment to check whether the test passes in the first run, but fails in the second. This can be done by rerunning…
- single test methods;
- all test methods from a single class; or
- all test methods from the entire test suite.
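The double-run idea can be sketched in a few lines of Java. This is not the paper's tool; the `BooleanSupplier`-based test representation and all names are assumptions made for illustration:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BooleanSupplier;

public class NioDetector {
    // Runs each test twice in the same execution environment and returns
    // the names of tests whose outcome flips from pass to fail.
    static List<String> detect(Map<String, BooleanSupplier> tests) {
        List<String> nio = new ArrayList<>();
        for (Map.Entry<String, BooleanSupplier> e : tests.entrySet()) {
            boolean first = e.getValue().getAsBoolean();  // first run
            boolean second = e.getValue().getAsBoolean(); // immediate rerun
            if (first && !second) {                       // pass, then fail => NIO
                nio.add(e.getKey());
            }
        }
        return nio;
    }

    static int counter = 0;

    public static void main(String[] args) {
        Map<String, BooleanSupplier> tests = new LinkedHashMap<>();
        tests.put("testStable", () -> 1 + 1 == 2);
        tests.put("testNio", () -> ++counter == 1); // only passes on first run
        System.out.println(detect(tests)); // prints [testNio]
    }
}
```

The three modes differ only in how many tests share one such execution environment: one per method, one per class, or one for the whole suite.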
Each mode comes with accuracy and performance trade-offs. For example, the first mode might overlook an NIO test that is the victim of another test method. Moreover, even though the same number of tests is run in each mode, the number of language runtime startups differs greatly. For a language like Java, the first mode requires a separate JVM startup for every test method, while the third mode can be completed in a single JVM run.
The researchers used the three detection modes on 127 test suites and found 223 NIO tests in 43 of those test suites. It turns out that the first mode is capable of finding all NIO tests, although this comes at a hefty price: the overhead of the first (single-method) mode is more than eight times(!) that of the third (entire-suite) mode, which itself has a very acceptable false negative rate of only 5.8%.
Based on these findings, it seems advisable to use the entire-suite mode periodically (e.g. as part of a nightly pipeline) and the single-method mode only for newly added or modified tests.
To learn more about how developers respond to proposed fixes for detected NIO tests, the researchers fixed all the tests that they could and opened pull requests for those fixes.
Most pull requests were accepted. Only a small number (9 out of 268) were rejected. Based on their experiences, the authors recommend that pull requests for NIO test fixes should provide:
- steps to reproduce test failures; and
- explanations of why fixing NIO is beneficial,
as this increases the likelihood that they will be accepted by project maintainers.
Find NIO tests by periodically rerunning the test suite or by rerunning single methods when tests are added or modified
When fixing NIO tests, show how test failures can be reproduced and why it is beneficial to fix such tests