Proactively detect and fix flaky tests in your test suite
“Flaky” tests are non-deterministic tests that may pass or fail, seemingly for no good reason. A lot of research is focussed on so-called order-dependent tests, whose outcome depends on the order in which tests are run. Such tests may fail unexpectedly when they are run manually in a different order or when regression testing techniques, like test parallelisation, are used.
The paper introduces some important terminology before it proceeds with a description of the actual study:
A test is called a victim if it fails when run after another test, called a polluter, in the same test suite, but passes when run before that other test. The victim fails because the tests share some state, which is ruined by the polluter.
Test assertions that depend on shared state are called brittle assertions. Each victim has at least one brittle assertion, but not all tests with a brittle assertion are victims.
A test is called a latent-victim if it has a brittle assertion, but may or may not currently be a victim (e.g. because no other test writes to its shared state).
In the same vein, a latent-polluter is a test that modifies the shared state, but may or may not have a victim in the test suite (e.g. because no other test reads from its shared state).
To reduce the risk that flaky tests fail at inopportune times, experts advocate proactively detecting potentially flaky tests so that they can be fixed. Regrettably, existing approaches have a prohibitively high false-positive rate, which renders them all but useless in practice.
In this study, the authors look at non-idempotent-outcome (NIO) tests. A test is an NIO test if its outcome (pass or fail) changes across repeated runs, due to changes to the state shared among those runs. What makes NIO tests special is that they are simultaneously latent-victims and latent-polluters: this makes it unlikely that reports are false positives, which in turn should make such reports useful to developers.
Minimal examples for each type of flaky test are shown in pseudocode below.
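In the spirit of the paper's pseudocode, here is a minimal Java sketch of the three categories; all names (`FlakyExamples`, `cache`, the `test*` methods) are illustrative, not taken from the paper:

```java
import java.util.ArrayList;
import java.util.List;

public class FlakyExamples {
    // State shared across tests -- the root cause of order dependence.
    static List<String> cache = new ArrayList<>();

    // Polluter: modifies shared state, but asserts nothing on it.
    static void testPolluter() {
        cache.add("stale-entry");
    }

    // Victim: its brittle assertion only holds if the polluter has not run yet.
    static boolean testVictim() {
        return cache.isEmpty(); // brittle assertion on shared state
    }

    // NIO test: pollutes the very state it asserts on, so it passes
    // on its first run and fails on its second.
    static int invocations = 0;
    static boolean testNio() {
        invocations++;           // pollutes shared state...
        return invocations == 1; // ...and asserts on it (brittle)
    }

    public static void main(String[] args) {
        System.out.println("victim before polluter: " + testVictim()); // true
        testPolluter();
        System.out.println("victim after polluter:  " + testVictim()); // false
        System.out.println("NIO first run:  " + testNio());            // true
        System.out.println("NIO second run: " + testNio());            // false
    }
}
```

Note that `testNio` is both a latent-polluter (it writes `invocations`) and a latent-victim (it reads `invocations` in an assertion), which is exactly the dual nature the paper exploits.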
Detection of NIO tests is based on a simple idea: each test is run twice in the same test execution environment to check whether the test passes in the first run, but fails in the second. This can be done by rerunning…
- single test methods;
- all test methods from a single class; or
- all test methods from the entire test suite.
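The double-run idea can be sketched in a few lines of Java. This is not the paper's tool; the `BooleanSupplier`-based test representation and all names are assumptions made for illustration:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BooleanSupplier;

public class NioDetector {
    // Runs each test twice in the same execution environment and returns
    // the names of tests whose outcome flips from pass to fail.
    static List<String> detect(Map<String, BooleanSupplier> tests) {
        List<String> nio = new ArrayList<>();
        for (Map.Entry<String, BooleanSupplier> e : tests.entrySet()) {
            boolean first = e.getValue().getAsBoolean();  // first run
            boolean second = e.getValue().getAsBoolean(); // immediate rerun
            if (first && !second) {                       // pass, then fail => NIO
                nio.add(e.getKey());
            }
        }
        return nio;
    }

    static int counter = 0;

    public static void main(String[] args) {
        Map<String, BooleanSupplier> tests = new LinkedHashMap<>();
        tests.put("testStable", () -> 1 + 1 == 2);
        tests.put("testNio", () -> ++counter == 1); // only passes on first run
        System.out.println(detect(tests)); // prints [testNio]
    }
}
```

The three modes differ only in how many tests share one such execution environment: one per method, one per class, or one for the whole suite.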
Each mode comes with accuracy and performance trade-offs. For example, the first mode might overlook an NIO test that is the victim of another test method. Moreover, even though the same number of tests is run in each mode, the number of language runtime startups differs greatly. For a language like Java, the first mode requires a separate JVM startup for every test method, while the third mode can be completed in a single JVM run.
The researchers used the three detection modes on 127 test suites and found 223 NIO tests in 43 of those test suites. It turns out that the first mode is capable of finding all NIO tests, although this comes at a hefty price: the overhead of the first (single-method) mode is more than eight times(!) that of the third (entire-suite) mode, which itself has a very acceptable false negative rate of only 5.8%.
Based on these findings, it seems advisable to use the entire-suite mode periodically (e.g. as part of a nightly pipeline) and the single-method mode only for newly added or modified tests.
To learn more about how developers respond to proposed fixes for detected NIO tests, the researchers fixed all the tests that they could and opened pull requests for those fixes.
Most pull requests were accepted. Only a small number (9 out of 268) were rejected. Based on their experiences, the authors recommend that pull requests for NIO test fixes should provide:
- steps to reproduce test failures; and
- explanations of why fixing NIO is beneficial,
as this increases the likelihood that they will be accepted by project maintainers.
Find NIO tests by periodically rerunning the test suite or by rerunning single methods when tests are added or modified
When fixing NIO tests, show how test failures can be reproduced and why it is beneficial to fix such tests