Chuniversiteit.nl
The Toilet Paper

How software developers deal with flaky tests

Flaky tests can cause CI builds to fail unexpectedly, and should be fixed as quickly as possible. This study shows why.

A printer prints out a sheet of paper with a poop emoji on it
Sometimes it works, sometimes it doesn’t.

Test cases that pass and fail without any changes to the code under test are called “flaky”. Flaky tests disrupt continuous integration, harm productivity, and lead to a loss of confidence in testing, which is why it is important to understand what causes flaky tests and what one can do about them.
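
The definition above can be made concrete with a small sketch in Python. It is not taken from the paper: the `classify` helper and the alternating test are hypothetical names invented for illustration. The idea is simply to run the same test several times against unchanged code and call it flaky if it both passes and fails.

```python
from typing import Callable


def classify(test: Callable[[], None], runs: int = 20) -> str:
    """Run `test` repeatedly against unchanged code and classify the outcomes.

    Only AssertionError counts as a test failure here; any other exception
    propagates to the caller.
    """
    if runs < 1:
        raise ValueError("runs must be at least 1")
    outcomes = set()
    for _ in range(runs):
        try:
            test()
            outcomes.add("pass")
        except AssertionError:
            outcomes.add("fail")
    if outcomes == {"pass"}:
        return "stable-pass"
    if outcomes == {"fail"}:
        return "stable-fail"
    return "flaky"  # both passes and failures were observed


def make_alternating_test() -> Callable[[], None]:
    """Hypothetical test that depends on hidden state: it passes on odd calls."""
    calls = {"n": 0}

    def test() -> None:
        calls["n"] += 1
        assert calls["n"] % 2 == 1

    return test


if __name__ == "__main__":
    print(classify(lambda: None))             # a test that always passes
    print(classify(make_alternating_test()))  # a test with mixed outcomes
```

A real test that fails only rarely may need many more runs before it shows both outcomes, so a "stable-pass" result from a sketch like this is not proof that a test is deterministic.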

This week’s paper describes a study in which researchers surveyed 170 software developers and analysed 38 threads on Stack Overflow to understand how developers define flaky tests, how they react to them, and what they experience as their causes and impacts.

Defining flaky tests

Most survey participants agreed with the definition of a flaky test given at the start of this summary. However, some found that definition too narrow and proposed that it should also cover the following themes (in order of prevalence):

  • The definition should extend beyond code. For instance, a test may also fail due to the environment in which it runs.

  • A flaky test can indicate that the code under test is flawed. In this case, the term “flaky” is used inappropriately to blame test cases when the flakiness is caused by non-deterministic code.

  • The definition should extend beyond test outcomes, e.g. to inconsistent test coverage and execution time.

  • Flakiness is an inevitable aspect of testing, so one should learn to live with it.

  • The usefulness of a test depends on its ability to catch bugs. If a test always passes, it should also be called flaky.

Cause and effect

Flaky tests impact the development process in several ways. Most importantly, they hinder continuous integration. This also means a loss in productivity and reduced efficiency of testing. Moreover, flaky tests can lead to anger and frustration, which shows the psychological cost of flaky tests.

A particularly interesting finding is that developers who experience flaky tests more often may be more likely to ignore potentially genuine test failures.

When asked about the causes of flaky tests, survey participants indicated that improper setup and teardown is the most frequent cause, followed by network issues.

Time and date appear to be a fairly uncommon cause of flaky tests. Nevertheless, those who did mention time and date appeared to feel very strongly about it. This suggests that if a project relies on time and date, they are likely to be a significant cause of flakiness.
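
One common way to reduce time-related flakiness is to pass the clock in as a parameter, so that a test can pin the date instead of depending on the day it happens to run. The sketch below is an illustration in Python and is not taken from the paper; `is_weekend` and its `today` parameter are hypothetical.

```python
from datetime import date
from typing import Callable


def is_weekend(today: Callable[[], date] = date.today) -> bool:
    """Return True on Saturday or Sunday.

    `today` is an injectable clock: production code uses the default
    (date.today), while tests pass a function that returns a fixed date.
    """
    return today().weekday() >= 5  # Monday is 0, Saturday is 5, Sunday is 6


def test_is_weekend_with_fixed_clock() -> None:
    # With a fixed clock the outcome no longer depends on when the test runs.
    assert is_weekend(lambda: date(2024, 1, 6))      # a Saturday
    assert not is_weekend(lambda: date(2024, 1, 8))  # a Monday


if __name__ == "__main__":
    test_is_weekend_with_fixed_clock()
```

A test that called `is_weekend()` without a fixed clock would pass on some days of the week and fail on others, without any change to the code under test.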

Other possible causes of flaky tests include:

  • An issue in an external artifact like a service or library that is outside the scope and control of the software under test.

  • Environmental differences between local development machines and remote build machines.

  • Host system issues, e.g. changes in hardware that cause tests to yield different results.

  • Test data issues due to test data that have “deteriorated” or “changed”.

  • Resource exhaustion: high system loads may cause low-level timeouts in a test suite. The heightened load may be caused by the test itself or by some other process.

  • Differences between operating systems or different versions of the same operating system.

  • Complications arising from the use of virtual machines or containers.

  • UI testing can often be flaky, for example when the UI is not in the expected state when results are checked or an action is performed. This may happen when a test does not wait for the UI to be in a correct state.

  • Conversion issues, e.g. when data moves between a database or a filesystem and the software.

  • Timeouts when a test suite takes too long to complete.

  • Logic errors in the test code or the code under test due to an oversight or misunderstanding by the author may lead to unexpected results.

  • When a test depends on shared state, one test may fail due to an action that is (not) performed by another test. This is somewhat similar to incorrect setup and teardown.

  • Finally, improper mocking can also be a common cause of flakiness.
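
The two most closely related causes above, improper setup and teardown and shared state, can be illustrated with Python's standard `unittest` module. This is a minimal sketch, not code from the paper; the `REGISTRY` dictionary and the test class are hypothetical. Resetting the shared state before and after every test means that neither test depends on whether the other one ran first.

```python
import unittest

# Hypothetical module-level state that every test in this file can modify.
REGISTRY = {}


class RegistryTest(unittest.TestCase):
    def setUp(self):
        # Start every test from a known, empty state.
        REGISTRY.clear()

    def tearDown(self):
        # Leave nothing behind for whichever test runs next.
        REGISTRY.clear()

    def test_add(self):
        REGISTRY["a"] = 1
        self.assertEqual(len(REGISTRY), 1)

    def test_starts_empty(self):
        # Without setUp and tearDown this would fail whenever test_add ran first.
        self.assertEqual(REGISTRY, {})


if __name__ == "__main__":
    unittest.main()
```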

Dealing with flaky tests

When a build fails due to a flaky test, the most common remedy is to simply re-run the build. The second most common action is to attempt to repair the flaky test. Other actions are much less common.

Developers who often experience flaky tests are more likely to take no action. On the other hand, those who experience flaky tests less frequently are more likely to repair them. This suggests that flaky tests should be repaired as quickly as possible; otherwise, you may end up with a team that tends to ignore test results.

The survey revealed the following themes related to actions taken when a test is flaky:

  • Flaky tests may evoke an emotive response in the form of anger or some other emotion.

  • A developer may .

  • One can try to reorder tests, either to enforce a specific order or to make tests fail faster.

  • Sometimes it helps to repair a resource, like a global state.

  • Other times the solution is to rewrite the code under test, if the flaky test actually highlights an issue in the code.
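
Reordering can also be used as a diagnostic: running tests in a shuffled but reproducible order tends to expose hidden dependencies between them. The sketch below uses Python's standard `unittest` module; `shuffled_suite` and `OrderDemo` are hypothetical names, and the approach is an illustration rather than something the paper prescribes.

```python
import random
import unittest


def shuffled_suite(case, seed):
    """Build a suite that runs the tests of `case` in a seeded random order.

    `case` is a unittest.TestCase subclass. Reusing the same seed reproduces
    the same order, which helps when an order-dependent failure must be replayed.
    """
    names = list(unittest.defaultTestLoader.getTestCaseNames(case))
    random.Random(seed).shuffle(names)
    return unittest.TestSuite(case(name) for name in names)


class OrderDemo(unittest.TestCase):
    """Hypothetical test case whose tests do not depend on each other."""

    def test_a(self):
        self.assertTrue(True)

    def test_b(self):
        self.assertTrue(True)

    def test_c(self):
        self.assertTrue(True)


if __name__ == "__main__":
    suite = shuffled_suite(OrderDemo, seed=7)
    unittest.TextTestRunner(verbosity=2).run(suite)
```

If a suite passes in its default order but fails under some seeds, that is a hint that at least one test relies on state left behind by another.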

Analysis of Stack Overflow threads suggests that developers also try to fix logic errors, add explicit waits to tests (especially UI tests), add mocks, switch to new libraries or new versions of libraries, fix setup and teardown, and remove dependencies on shared state.
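
An explicit wait can be as simple as polling a condition until it holds or a timeout expires, instead of checking once and hoping the UI is ready. The helper below is a generic Python sketch; `wait_until` and its parameters are hypothetical and not tied to any particular UI testing library.

```python
import time
from typing import Callable


def wait_until(condition: Callable[[], bool],
               timeout: float = 5.0,
               interval: float = 0.05) -> bool:
    """Poll `condition` until it returns True or `timeout` seconds have passed.

    Returns True as soon as the condition holds and False on timeout.
    The condition is always checked at least once.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


if __name__ == "__main__":
    # Stand-in for a UI element that only becomes ready after a short delay.
    ready_at = time.monotonic() + 0.2
    assert wait_until(lambda: time.monotonic() >= ready_at, timeout=2.0)
```

In a UI test, the condition would typically check that an element is present or visible before the test interacts with it or asserts on it. Unlike a fixed sleep, the wait ends as soon as the condition is met.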

Summary

  1. Test cases that pass and fail without any changes to the code under test are called flaky

  2. Tests can be flaky for many different reasons, but mostly due to improper setup and teardown, and network issues

  3. Developers who often experience flaky tests are more likely to take no action in response to test failures